We are in the era of Big Data in human genomics: a vast treasure-trove of information on human genetic variation either is, or soon will be, available. This ranges from older projects such as HapMap and the 1000 Genomes Project to the in-progress 100,000 Genomes Project in the UK. Two technologies have made this possible: massively parallel “next-generation” sequencing, in which each individual’s DNA is fragmented and amplified into billions of pieces; and powerful computational algorithms that use these fragments (or “reads”) to identify all the “variants” – any positions that differ from the “reference genome” – in each individual.

With existing tools this has become a relatively straightforward task. Identification of single nucleotide polymorphisms or variants (SNVs) – single-base differences between an individual and the reference genome – especially medically relevant ones, is becoming routine. A project I recently worked on with a client involved examining the accuracy of SNV identification in individuals from less well-sampled populations. How well do the algorithms work in these cases?
It is often assumed that decisions made by algorithms are more “neutral” and “fair” than those made by humans. However, all algorithms ultimately rely on some notion of ground truth to sort the wheat from the chaff. For example, identifying variants in next-generation sequencing data relies on test sets: previously identified and validated SNVs, typically drawn from databases such as Online Mendelian Inheritance in Man (OMIM). These test sets are themselves assembled from information originally derived from statistically dominant (mainly European) populations. Another source of “ground truth” is the reference genome itself, whose construction also drew heavily on individuals of European ancestry. Most algorithms that identify variants use a statistical procedure based on some version of Bayes’s theorem, combining prior information from the training set with the input sequencing data. This procedure does the best it can, but there is always the potential to identify variants that don’t exist (false positives) and to miss others that do (false negatives).
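To make that concrete, here is a deliberately simplified sketch of how such a Bayesian caller might weigh the evidence at a single site. The per-genotype error model and the Hardy-Weinberg prior below are illustrative simplifications of my own, not the exact model used by any particular caller, and the read counts and allele frequencies are hypothetical.

```python
from math import comb

def genotype_posteriors(n_ref, n_alt, alt_freq, error=0.01):
    """Posterior probabilities of the diploid genotypes ref/ref (RR),
    ref/alt (RA) and alt/alt (AA), given counts of reads supporting
    the reference and alternate base at one site."""
    n = n_ref + n_alt
    # Probability that a single read shows the alternate base, per genotype
    # (a crude error model: 'error' is the per-read sequencing error rate).
    p_alt = {"RR": error, "RA": 0.5, "AA": 1 - error}
    # Hardy-Weinberg prior from an assumed population alt-allele frequency.
    q = alt_freq
    prior = {"RR": (1 - q) ** 2, "RA": 2 * q * (1 - q), "AA": q ** 2}
    # Bayes's theorem: posterior is proportional to prior times the
    # binomial likelihood of the observed reads.
    unnorm = {
        g: prior[g] * comb(n, n_alt) * p ** n_alt * (1 - p) ** n_ref
        for g, p in p_alt.items()
    }
    total = sum(unnorm.values())
    return {g: v / total for g, v in unnorm.items()}

# The same reads, two different priors. With 3 of 9 reads supporting the
# alternate base and a prior allele frequency of 10%, the heterozygous
# call is nearly certain (~0.998):
print(genotype_posteriors(n_ref=6, n_alt=3, alt_freq=0.10))
# With a prior allele frequency of 0.01% (say, taken from a panel in which
# this population is barely represented), ref/ref is now favoured (~0.71),
# and a real variant would likely be discarded as sequencing error:
# a false negative.
print(genotype_posteriors(n_ref=6, n_alt=3, alt_freq=0.0001))
```

The point of the toy calculation is not the particular numbers but the shape of the problem: the same sequencing reads can yield a confident variant call or a miss, depending on whose data informed the prior.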
This much is well known and to be expected, but in this era of Big Data there is also a tendency for people to hope that by lumping all the data together into one massive bolus, we can just let the computer figure it all out. Being a population geneticist, I am attuned to the fact that human populations have distinct signatures of variation in their genomes, and while combining data from many individuals can improve the accuracy of variant identification, it has to be done carefully or it may miss true but rare variants. This got me thinking: how much do these issues arise in other, non-genetic, “Big Data” analyses? These kinds of questions have repercussions far beyond human genomics, given that many important and consequential decisions – about employment, health insurance, credit or education – are increasingly made by algorithms.
Moritz Hardt has an excellent post, How Big Data is Unfair, unpacking how machine learning algorithms – specifically “classifier” systems – trained on statistically dominant populations can sometimes produce erroneous classifications. One of the arguments made for the neutrality of Big Data analysis is that by including ever more data points you end up improving your classifier, becoming better at spotting the wheat and eliminating the chaff. Hardt points out that this can be more difficult than it looks, because in the real world the minority population may follow a different model than the dominant population. Simple linear classifiers may miss these differences:
Here’s a toy example. There might be a simple linear function that classifies the majority group correctly and there might be a (different) simple linear function that classifies the minority group correctly, but learning a (non-linear) combination of two linear classifiers is in general a computationally much harder problem. There are excellent algorithms available for learning linear classifiers (e.g., SVM), but no efficient algorithm is known for learning an arbitrary combination of two linear classifiers.
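Hardt’s toy example is easy to reproduce. The sketch below is my own construction with made-up data (using scikit-learn’s LinearSVC): the majority group’s label depends on one feature, the minority group’s on a different one, and a single linear classifier fit to the pooled data serves the majority well while doing little better than chance on the minority.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

# Majority group (90% of the data): the label depends on feature 0.
X_maj = rng.normal(size=(900, 2))
y_maj = (X_maj[:, 0] > 0).astype(int)

# Minority group (10% of the data): the label depends on feature 1 instead.
X_min = rng.normal(size=(100, 2))
y_min = (X_min[:, 1] > 0).astype(int)

# A single linear classifier fit to the pooled data is pulled towards the
# majority group's decision boundary.
X = np.vstack([X_maj, X_min])
y = np.concatenate([y_maj, y_min])
clf = LinearSVC(max_iter=10_000).fit(X, y)

print("majority accuracy:", clf.score(X_maj, y_maj))  # typically well above 0.9
print("minority accuracy:", clf.score(X_min, y_min))  # typically little better than chance
```

Fitting a separate linear classifier per group, or an explicit combination of the two, would fix this toy case; the hard part, as the quote notes, is that learning such combinations efficiently in general is not a solved problem.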
So getting to fairer decision-making in Big Data may end up being computationally expensive, and it puts the onus on those designing the algorithms to be cognizant of these issues. Getting it “right” therefore depends on the financial and political incentives to do so. As Hardt also points out:
Since some of the most interesting applications of AI tend to be at the limit of what’s currently computationally and humanly feasible, the additional resources necessary for achieving fairness may be limited.
If the costs of getting it wrong (i.e. misclassifying an individual) are borne by the statistical minority population, there could be much less incentive to make certain kinds of Big Data analyses fairer. For many kinds of analysis – say, recommendation engines for restaurants or music – the issues at stake are not life-or-death, but as Frank Pasquale, author of The Black Box Society: The Secret Algorithms That Control Money and Information, notes in an article in Aeon:
when algorithms start affecting critical opportunities for employment, career advancement, health, credit and education, they deserve more scrutiny.
It is encouraging that there are moves towards rigorously taking into account the financial, economic and social contexts in which algorithms are deployed, through events such as the Fairness, Accountability and Transparency in Machine Learning (FAT ML) workshops. It is incumbent upon all of us, especially those of us in the life sciences who increasingly deal in personalized genetic information, to be aware of the potential for inherent biases during algorithm design and deployment, to train users to be similarly aware, and to make the algorithms themselves open and available for public inspection.