Datasets

CDC Dataset

class aix360.datasets.CDCDataset(custom_preprocessing=<function default_preprocessing>, dirpath=None)

The CDC (Centers for Disease Control and Prevention) questionnaire datasets [5] are surveys conducted by the organization involving thousands of civilians about various facets of daily life. There are 44 questionnaires that collect data about income, occupation, health, early childhood, and many other behavioral and lifestyle aspects of people living in the US. These questionnaires are thus a rich source of information indicative of the quality of life of many civilians. More information about each questionnaire and the types of answers is available in the reference below.

References

[5]NHANES 2013-2014 Questionnaire Data

CelebA Dataset

class aix360.datasets.CelebADataset(dirpath=None)

Images are based on the CelebA dataset [6] [7]. Specifically, we use a GAN developed by Karras et al. [8] to generate new images similar to CelebA. Using generated images lets us also store the latent variables used to generate them, which are required for producing pertinent negatives in CEM-MAF [9].

References

[6]Liu, Luo, Wang, Tang. Large-scale CelebFaces Attributes (CelebA) Dataset.
[7]Liu, Luo, Wang, Tang. Deep Learning Face Attributes in the Wild. ICCV. 2015.
[8]Karras, Aila, Laine, Lehtinen. Progressive Growing of GANs for Improved Quality, Stability, and Variation. ICLR. 2018.
[9]Luss, Chen, Dhurandhar, Sattigeri, Shanmugam, Tu. Generating Contrastive Explanations with Monotonic Attribute Functions. 2019.

CIFAR Dataset

class aix360.datasets.CIFARDataset(dirpath=None)

The CIFAR-10 dataset [10] consists of 60000 32x32 color images. The target variable is one of 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images. The classes are: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck. We further divide the training set into train1 (30000 samples) and train2 (20000 samples). For ProfWt, the complex model is trained on train1 while the simple model is trained on train2.
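The train1/train2 partition described above can be sketched in plain Python. This is only an illustration of the index boundary (the first 30000 training samples versus the remaining 20000); placeholder indices stand in for the real images, and the code is not the aix360 implementation.

```python
# Sketch of the train1/train2 split used for ProfWt (assumed contiguous split).
# `training_set` is a placeholder for the 50000 CIFAR-10 training samples.
training_set = list(range(50000))

train1 = training_set[:30000]  # 30000 samples: train the complex model
train2 = training_set[30000:]  # 20000 samples: train the simple model

print(len(train1), len(train2))  # 30000 20000
```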

References

[10]Krizhevsky, Hinton. Learning multiple layers of features from tiny images. Technical Report, University of Toronto 1 (4), 7. 2009

Fashion MNIST Dataset

class aix360.datasets.FMnistDataset(batch_size=256, subset_size=50000, test_batch_size=256, dirpath=None)

Fashion-MNIST [11] is a large-scale image dataset of various fashion items (T-shirt/top, Trouser, Pullover, Dress, Coat, Sandal, Shirt, Sneaker, Bag, Ankle boot). The images are grayscale and 28x28 in size, with each image belonging to one of the above-mentioned 10 categories. The training set contains 60000 examples and the test set contains 10000 examples.

References

[11]Xiao, Rasul, Vollgraf. Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms. 2017.

HELOC Dataset

class aix360.datasets.HELOCDataset(custom_preprocessing=<function default_preprocessing>, dirpath=None)

HELOC Dataset.

The FICO HELOC dataset [12] contains anonymized information about home equity line of credit (HELOC) applications made by real homeowners. A HELOC is a line of credit typically offered by a US bank as a percentage of home equity (the difference between the current market value of a home and the outstanding balance of all liens, e.g. mortgages). The customers in this dataset have requested a credit line in the range of USD 5,000 - 150,000.

The target variable in this dataset is a binary variable called RiskPerformance. The value “Bad” indicates that an applicant was 90 days past due or worse at least once over a period of 24 months from when the credit account was opened. The value “Good” indicates that they have made their payments without ever being more than 90 days overdue.

This dataset can be used to train a machine learning model to predict whether the homeowner qualifies for a line of credit. The HELOC dataset and more information about it, including download instructions, are available in the reference below.
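For modeling, the RiskPerformance target described above must be encoded numerically. A minimal sketch of that encoding (the "Bad"/"Good" labels come from the dataset description; mapping "Bad" to 1 is a common convention, not something aix360 mandates):

```python
# Hedged sketch: encode the binary HELOC target RiskPerformance.
# "Bad" = 90+ days past due at least once in 24 months; "Good" = never.
def encode_risk_performance(label: str) -> int:
    """Map the HELOC target label to a binary value (1 = Bad, 0 = Good)."""
    mapping = {"Bad": 1, "Good": 0}
    return mapping[label]

labels = ["Good", "Bad", "Good"]
y = [encode_risk_performance(lab) for lab in labels]
print(y)  # [0, 1, 0]
```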

References

[12]Explainable Machine Learning Challenge - FICO Community.

MEPS Dataset

class aix360.datasets.MEPSDataset(custom_preprocessing=<function default_preprocessing>, dirpath=None)

The Medical Expenditure Panel Survey (MEPS) [13] data consists of large scale surveys of families and individuals, medical providers, and employers, and collects data on health services used, costs & frequency of services, demographics, health status and conditions, etc., of the respondents.

This specific dataset contains MEPS survey data for calendar year 2015 obtained in rounds 3, 4, and 5 of Panel 19, and rounds 1, 2, and 3 of Panel 20. See aix360/data/meps_data/README.md for more details on the dataset and instructions on downloading/processing the data.

References

[13]Medical Expenditure Panel Survey data

TED Dataset

class aix360.datasets.TEDDataset(dirpath=None)

The goal of this synthetic dataset is to predict employee attrition at a fictitious company. The dataset is generated by a Python script, GenerateData.py, in the aix360/data/ted_data/ directory.

Like most datasets, each instance consists of a feature vector and a label Y, which represents whether the employee associated with the feature vector will leave the company. Unlike most datasets, however, each instance also has an explanation (E). This is motivated by the TED framework, which requires explanations in its training data; the explanations can also be used by other explainability algorithms as a metric for explainability.

See also

  • AIES’19 paper by Hind et al. [14] for more information on the TED framework.
  • The tutorial notebook TED_Cartesian_test.ipynb for information about how to use this dataset and the TED framework.
  • GenerateData.py for more information on how the dataset is generated or to create a tailored version of the dataset.

References

[14]Michael Hind, Dennis Wei, Murray Campbell, Noel C. F. Codella, Amit Dhurandhar, Aleksandra Mojsilovic, Karthikeyan Natesan Ramamurthy, Kush R. Varshney, “TED: Teaching AI to Explain its Decisions,” AAAI /ACM Conference on Artificial Intelligence, Ethics, and Society (AIES-19), 2019.
load_file(fileName='Retention.csv')

Open the dataset file and populate X, Y, and E.

Parameters:fileName (String) –

filename of the dataset, a structured (CSV) file where

  • The first N-2 columns are the features (X).
  • The next-to-last column is the label (Y) in {0, 1}.
  • The last column gives the explanation (E) in {0, 1, …, MaxE}. We assume the explanation space is dense, i.e., if there are MaxE+1 unique explanations, they are given IDs from 0 to MaxE.
  • The first row contains header information for each column and should be “Y” for the label column and “E” for the explanation column.
  • Each subsequent row is an instance.
Returns:
  • X – list of feature vectors
  • Y – list of labels
  • E – list of explanations
Return type:tuple
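The column layout documented above can be illustrated with a small stdlib-only parser. This is a sketch of the CSV convention only, not the aix360 implementation (load_file is the real entry point), and the sample header names f1/f2 are made up for illustration:

```python
import csv
import io

def parse_ted_csv(text):
    """Split a TED-style CSV into (X, Y, E) per the documented layout:
    features in the first N-2 columns, label Y next-to-last, explanation E last."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)                 # last two header cells are "Y" and "E"
    assert header[-2:] == ["Y", "E"]
    X, Y, E = [], [], []
    for row in reader:
        X.append(row[:-2])                # feature vector
        Y.append(int(row[-2]))            # label in {0, 1}
        E.append(int(row[-1]))            # explanation ID in {0, ..., MaxE}
    return X, Y, E

sample = "f1,f2,Y,E\n3,5,1,0\n2,7,0,2\n"
X, Y, E = parse_ted_csv(sample)
print(Y, E)  # [1, 0] [0, 2]
```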

eSNLI Dataset

class aix360.datasets.eSNLIDataset

The e-SNLI dataset [15] contains pairs of sentences, each accompanied by human-rationale annotations indicating which words in each pair are most important for the match.

The sentence pairs are from the Stanford Natural Language Inference dataset, with labels that indicate whether the sentence pair is an entailment, a contradiction, or neutral.

References

[15]Camburu, Oana-Maria, Tim Rocktäschel, Thomas Lukasiewicz, and Phil Blunsom, “E-SNLI: Natural Language Inference with Natural Language Explanations.”, 2018
get_example(example_id: str) → Dict

Return an e-SNLI example.

The example_id indexes the “docs.jsonl” file of the downloaded dataset.

Parameters:example_id (str) – the example index.
Returns:e-SNLI example in dictionary form.
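Since get_example indexes the "docs.jsonl" file, the lookup amounts to finding the matching record in a JSON-Lines stream. A minimal stdlib sketch of that kind of lookup (the "id", "premise", and "hypothesis" field names here are assumptions for illustration, not the aix360 schema):

```python
import io
import json

def lookup_example(jsonl_text: str, example_id: str) -> dict:
    """Scan a JSON-Lines stream and return the record whose 'id' matches.
    ('id' is an assumed key name; the real docs.jsonl schema may differ.)"""
    for line in io.StringIO(jsonl_text):
        record = json.loads(line)
        if record.get("id") == example_id:
            return record
    raise KeyError(example_id)

docs = (
    '{"id": "ex1", "premise": "A dog runs.", "hypothesis": "An animal moves."}\n'
    '{"id": "ex2", "premise": "It rains.", "hypothesis": "It is sunny."}\n'
)
print(lookup_example(docs, "ex2")["hypothesis"])  # It is sunny.
```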