Use representative training data
Many AI algorithms are trained on benchmark datasets that are not representative of the real world. As a result, systems that perform well in the lab can fail in practice.
For example, researchers Joy Buolamwini and Timnit Gebru found that commercial gender classification systems (algorithms that guess whether the person in a photograph is male or female) sold by IBM, Microsoft, and Face++ had error rates as much as 34.4 percentage points higher for darker-skinned females than for lighter-skinned males. An important cause of this gap was that the datasets used to train these algorithms contained far more lighter-skinned faces than darker-skinned faces. While some of the systems were subsequently updated to reduce the disparity, the exercise highlighted how much data quality and representativeness matter when building AI systems.
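One practical lesson is to evaluate accuracy separately for each demographic subgroup rather than reporting a single aggregate number, which is how gaps like this come to light in the first place. The sketch below shows what such a disaggregated evaluation might look like in Python; the column names (skin_type, gender, true_gender, predicted_gender) and the toy data are hypothetical, and this is only an illustration of the idea, not the methodology of the Gender Shades study itself.

```python
import pandas as pd

def error_rates_by_group(df, group_cols, label_col="true_gender", pred_col="predicted_gender"):
    """Report the classification error rate for each demographic subgroup.

    df         : DataFrame with one row per evaluated image
    group_cols : columns defining subgroups (e.g. ["skin_type", "gender"])
    label_col  : ground-truth label column
    pred_col   : model prediction column
    """
    df = df.assign(error=(df[label_col] != df[pred_col]).astype(float))
    summary = (df.groupby(group_cols)["error"]
                 .agg(error_rate="mean", n="count")
                 .reset_index())
    return summary.sort_values("error_rate", ascending=False)

# Toy illustration with made-up data: the model is wrong on one of the two
# darker-skinned female faces and right on everything else.
data = pd.DataFrame({
    "skin_type":        ["darker", "darker", "lighter", "lighter", "darker", "lighter"],
    "gender":           ["female", "female", "male",    "male",    "male",   "female"],
    "true_gender":      ["female", "female", "male",    "male",    "male",   "female"],
    "predicted_gender": ["male",   "female", "male",    "male",    "male",   "female"],
})
print(error_rates_by_group(data, ["skin_type", "gender"]))
```

Breaking the evaluation out this way makes a subgroup-specific failure visible that a single overall accuracy figure would hide.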
Another example is the use of algorithms in criminal justice. Consider the Correctional Offender Management Profiling for Alternative Sanctions (COMPAS) system, which is used to assess a defendant's risk of re-offending. In the United States, COMPAS risk scores are widely used by judges during sentencing. A ProPublica investigation led by journalist Julia Angwin found that “blacks are almost twice as likely as whites to be labeled a higher risk but not actually re-offend.”
In other words, the algorithm made more mistakes for black people, leading to unfair treatment. What’s worse, the algorithm “makes the opposite mistake among whites: They are much more likely than blacks to be labeled lower-risk but go on to commit other crimes.” In short, its errors tended to hurt black defendants while favoring white defendants. Unfortunately, subsequent research showed that this problem is, in principle, impossible to solve: when two groups re-offend at different underlying rates, no risk score can be equally well calibrated for both groups while also making balanced errors across them, except in degenerate cases.
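To see why, it helps to spell out the arithmetic. If two groups re-offend at different underlying rates, then a score that is equally likely to be correct when it flags someone in either group (equal positive predictive value, a simple binary stand-in for calibration) and that misses actual re-offenders at the same rate in both groups (equal false negative rate) is forced, by the definitions of the confusion-matrix quantities alone, to have different false positive rates for the two groups. The numbers below are invented purely to illustrate this tension; they are not taken from COMPAS or from the cited studies.

```python
def implied_fpr(prevalence, fnr, ppv):
    """False positive rate forced by the confusion-matrix identity
    PPV = p*(1-FNR) / (p*(1-FNR) + (1-p)*FPR), solved for FPR."""
    return prevalence * (1 - fnr) * (1 - ppv) / ((1 - prevalence) * ppv)

# Invented numbers: two groups with different underlying re-offence rates,
# scored by a classifier that is equally often correct when it flags someone
# (same PPV) and misses re-offenders at the same rate (same FNR) in both groups.
for group, base_rate in [("group A", 0.5), ("group B", 0.3)]:
    fpr = implied_fpr(prevalence=base_rate, fnr=0.3, ppv=0.7)
    print(f"{group}: base rate {base_rate:.0%} -> false positive rate {fpr:.1%}")
```

With these invented numbers, the group with the higher base rate ends up with a false positive rate of 30 percent versus roughly 13 percent for the other group, even though the classifier treats flagged and missed individuals identically in both. Equalizing the false positive rates would instead break one of the other two properties, which is the heart of the impossibility result.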
On a final note, using representative data is not always desirable. For example, if you train language models, the technology behind chatbots and machine translation tools, on text from the real world, they will inevitably inherit whatever stereotypes that text contains (e.g. nurses being stereotypically female and doctors stereotypically male). This raises the question: who gets to decide which associations to keep and which to correct?
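To make the mechanism concrete, consider word embeddings, in which each word becomes a vector and words used in similar contexts end up pointing in similar directions. The sketch below uses tiny made-up vectors purely to show how “nurse” can end up closer to “she” than to “he” once the training text contains that association; real systems learn such vectors from enormous corpora, which is how Caliskan and colleagues measured human-like biases in them.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Made-up 3-dimensional "embeddings". In a real model these vectors are learned
# from co-occurrence statistics in large text corpora, so words used in similar
# (including stereotyped) contexts end up close together.
emb = {
    "he":     np.array([ 1.0, 0.1, 0.0]),
    "she":    np.array([-1.0, 0.1, 0.0]),
    "doctor": np.array([ 0.7, 0.9, 0.2]),
    "nurse":  np.array([-0.7, 0.9, 0.2]),
}

for word in ("doctor", "nurse"):
    print(word,
          "vs he:",  round(cosine(emb[word], emb["he"]), 2),
          "vs she:", round(cosine(emb[word], emb["she"]), 2))
```

In this toy setup “doctor” leans toward “he” and “nurse” toward “she”, which is exactly the kind of learned association a downstream chatbot or translation system can then reproduce.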
References
Buolamwini, J. & Gebru, T. Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification. in Proceedings of the 1st Conference on Fairness, Accountability and Transparency (eds. Friedler, S. A. & Wilson, C.) vol. 81 77–91 (PMLR, 2018).
Angwin, J., Larson, J., Mattu, S. & Kirchner, L. Machine Bias. ProPublica (2016).
Totty, M. How to Make Artificial Intelligence Less Biased. Wall Street Journal (2020).
Kleinberg, J., Mullainathan, S. & Raghavan, M. Inherent trade-offs in the fair determination of risk scores. arXiv preprint arXiv:1609.05807 (2016).
Angwin, J. & Larson, J. Bias in Criminal Risk Scores Is Mathematically Inevitable, Researchers Say. ProPublica (2016).
Caliskan, A., Bryson, J. J. & Narayanan, A. Semantics derived automatically from language corpora contain human-like biases. Science 356, 183–186 (2017).