How Data Mining Discriminates
By definition, data mining is always a form of rational discrimination: it sorts people according to statistically relevant attributes rather than judging them individually. The adverse effects it produces are therefore rarely traceable to human bias, conscious or unconscious. Five mechanisms, described below, can nonetheless give rise to discriminatory outcomes.
Defining the “Target Variable” and “Class Labels”
Target variables define the outcomes of interest that data miners look for, while class labels divide all possible values of the target variable into mutually exclusive categories. In hiring decisions, for example, managers want to know who the best candidate is, but different definitions of “best” point to different people: an applicant with higher sales may have lower assessment grades. A poorly chosen target variable can therefore discriminate against otherwise worthy applicants.
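As a minimal sketch (with entirely invented candidate data), the snippet below shows how two equally plausible definitions of the target variable, “best” as highest sales versus “best” as highest assessment grade, select different candidates:

```python
# Hypothetical candidate records; every name and number is invented.
candidates = [
    {"name": "Avery", "sales": 120, "assessment": 3.1},
    {"name": "Blake", "sales": 95,  "assessment": 4.7},
    {"name": "Casey", "sales": 110, "assessment": 3.9},
]

# Target variable 1: "best" means highest sales.
best_by_sales = max(candidates, key=lambda c: c["sales"])

# Target variable 2: "best" means highest assessment grade.
best_by_assessment = max(candidates, key=lambda c: c["assessment"])

print(best_by_sales["name"])       # Avery
print(best_by_assessment["name"])  # Blake
```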
Training Data
The examples from which a data mining model learns are called training data. Biased training data can yield discriminatory models in two common ways. If the model treats prejudiced examples as valid, it reproduces the prejudice they encode. If the sample misrepresents the population, the model can discriminate against groups that are under- or overrepresented in it.
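The following sketch, built on invented hiring records, illustrates the first failure mode: a naive model that takes prejudiced decisions at face value simply learns to repeat them.

```python
from collections import defaultdict

# Invented historical decisions: scores are identical across groups,
# but group "B" applicants were systematically rejected.
training_data = [
    ("A", 80, "hire"), ("A", 80, "hire"), ("A", 75, "hire"),
    ("B", 80, "reject"), ("B", 80, "reject"), ("B", 75, "reject"),
]

# A deliberately naive "model": learn the majority decision per group.
votes = defaultdict(list)
for group, score, decision in training_data:
    votes[group].append(decision)
model = {g: max(set(ds), key=ds.count) for g, ds in votes.items()}

# Treating prejudiced examples as valid reproduces the prejudice:
print(model)  # {'A': 'hire', 'B': 'reject'} despite identical scores
```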
Labeling Examples
Training examples must be labeled, often manually, by users or data miners. Any errors made during labeling are reproduced in the model and can introduce unintentional bias. Depending on the dataset, mislabeling may cause the model to inherit prior discrimination or to reflect prejudice that is still ongoing.
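A small simulation (all groups and rates are invented) makes this concrete: if a labeler’s errors fall disproportionately on one group, the dataset any downstream model trains on already encodes that bias.

```python
import random

random.seed(0)

# Invented ground truth: every applicant in both groups is qualified.
examples = [("A", True)] * 50 + [("B", True)] * 50

# Hypothetical labeling process: the labeler mislabels about 30% of
# group B as unqualified while labeling group A accurately.
def label(group, truly_qualified):
    if group == "B" and random.random() < 0.30:
        return False  # labeling error concentrated on one group
    return truly_qualified

labeled = [(g, label(g, q)) for g, q in examples]

# The training set encodes the labeler's bias before any model runs:
for g in ("A", "B"):
    rate = sum(1 for grp, lab in labeled if grp == g and lab) / 50
    print(g, f"labeled qualified: {rate:.0%}")
# Group A stays at 100%; group B drops to roughly 70%.
```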
Data Collection
Data collected from sources that fail to represent different groups in adequate proportion is prone to bias. Models trained on such data can fail to serve the needs of entire socially protected classes, and both over- and underrepresentation can produce disproportionately adverse outcomes for their members.
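As an illustration under invented assumptions, the sketch below fits a simple screening cutoff to a sample in which one group is badly undersampled; the cutoff is dominated by the majority group and excludes the minority even where its scores indicate adequate performance.

```python
import statistics

# Invented population: group A scores cluster around 70, group B around
# 60, but anyone scoring above 55 performs the job equally well.
group_a = [68, 70, 71, 69, 72, 70, 73, 69, 71, 70]  # well sampled
group_b = [59, 61]                                   # badly undersampled
sample = group_a + group_b

# A naive screening rule: admit anyone within one standard deviation
# below the sample mean. The cutoff is dominated by group A's scores.
cutoff = statistics.mean(sample) - statistics.stdev(sample)
print(f"cutoff = {cutoff:.1f}")  # about 64.3

# Every sampled member of group B falls below the cutoff, even though
# these scores indicate adequate performance (> 55).
print("group B admitted:", [s for s in group_b if s >= cutoff])  # []
```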
Feature Selection
Feature selection is the process of choosing which attributes a model will take into account. It can harm socially protected classes when the selected features are too coarse or incomplete to capture the factors that better account for the pertinent statistical variation. Even when data miners and managers are aware of the problem, they may still use such features simply because they are readily available.
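A toy comparison (applicants and scores invented) shows the effect: ranking on a cheap, coarse feature such as degree tier reverses the ordering produced by a more granular feature that better tracks actual performance.

```python
# Invented applicants: "degree_tier" is a cheap, readily available
# feature; "portfolio_score" better accounts for job performance.
applicants = [
    {"name": "P1", "degree_tier": 1, "portfolio_score": 9},
    {"name": "P2", "degree_tier": 3, "portfolio_score": 4},
    {"name": "P3", "degree_tier": 3, "portfolio_score": 8},
]

# Coarse feature: rank by degree tier alone.
by_degree = sorted(applicants, key=lambda a: a["degree_tier"], reverse=True)

# Granular feature: rank by the factor that explains performance.
by_portfolio = sorted(applicants, key=lambda a: a["portfolio_score"], reverse=True)

print([a["name"] for a in by_degree])     # ['P2', 'P3', 'P1']
print([a["name"] for a in by_portfolio])  # ['P1', 'P3', 'P2']
```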
Proxies
When a criterion that is genuinely relevant to a rational decision also serves as a proxy for membership in a protected class, a model that relies on it is prone to discriminate. In such cases the discrimination does not stem from decision-makers’ own beliefs; they unintentionally reproduce injustice that already exists in society.
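The sketch below (a hypothetical, highly segregated city with invented loan records) shows how a facially neutral rule keyed to ZIP code can split outcomes along group lines even though group membership is never a feature:

```python
# Invented records: the model never sees "group", but ZIP code is an
# almost perfect proxy for it in this hypothetical city.
records = [
    {"zip": "11111", "group": "A", "repaid": True},
    {"zip": "11111", "group": "A", "repaid": True},
    {"zip": "22222", "group": "B", "repaid": True},
    {"zip": "22222", "group": "B", "repaid": False},
]

# A facially neutral rule learned from the data: approve applicants
# from a ZIP code only if every past loan there was repaid.
repaid_by_zip = {}
for r in records:
    repaid_by_zip.setdefault(r["zip"], []).append(r["repaid"])
approve = {z: all(v) for z, v in repaid_by_zip.items()}

# Because ZIP encodes group, the rule's effect splits by group:
for r in records:
    print(r["group"], "approved" if approve[r["zip"]] else "denied")
# Every A approved, every B denied, without "group" ever being used.
```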
Masking
Decision-makers may also discriminate intentionally and mask their prejudice by exploiting the mechanisms described above. An employer could, for instance, hire data miners and supply them with a biased data sample so that the resulting model lends unjust beliefs an appearance of objectivity. Data mining is a costly procedure, however, and employers are unlikely to spend the money merely to disguise their intentions.