Warning: Big Data Can Be Biased, and Even “Troublingly Wrong”—Just Like Data Years Ago

As valuable as Big Data can be, it still has the potential to mislead, as I have noted in earlier writing. Recognizing potential problems can help users apply Big Data more effectively, since awareness of a problem is a first step toward addressing it and eventually obtaining better outcomes. So it’s encouraging to see the media move beyond extolling Big Data’s benefits and instead point out problems that might arise.

A recent Wall Street Journal article, “Social Bias Creeps Into New Web Technology” by Elizabeth Dwoskin (August 21, 2015), did just that. As the article points out, the results of algorithms can be biased, and they “can go embarrassingly, sometimes troublingly wrong.” The article describes how algorithms that seemed to identify white people’s faces satisfactorily incorrectly identified the faces of black men as apes or gorillas. The article attributes this kind of error to the way computers do machine learning, which entails training on an initial data set and updating as additional real-world data is encountered. The article adds that “machine learning software adopts and often amplifies biases in either data set.”
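To make that amplification concrete, here is a toy simulation of my own, not the article’s system: the training data starts skewed toward one group, the stream of online images used for updates is also skewed, and an image is more likely to be labeled and kept if the model already recognizes its group. Every number below is invented for illustration.

```python
# A toy simulation (my own illustration, not the article's method) of how an
# update loop can amplify an initial skew in the training data.
count_a, count_b = 800.0, 200.0   # initial training set: 80% group A, 20% group B

for step in range(1, 6):
    share_a = count_a / (count_a + count_b)
    # Each round, 1,000 new images arrive, themselves skewed 70/30, and an
    # image is retained with probability equal to its group's current share
    # of the training data (a stand-in for "the model recognizes it").
    count_a += 700 * share_a
    count_b += 300 * (1 - share_a)
    print(f"update {step}: group A share = {count_a / (count_a + count_b):.1%}")
```

Run it and group A’s share drifts upward each round: the skew in the data compounds rather than washing out.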

According to the article, the algorithms performed better with white faces because there were more of them in the initial training data, and there are more of them in the real-world updates (the images found online). The article points out that “data scientists say software bias can be minimized by what amounts to building affirmative action into a complex statistical model,” which in this example would entail “introducing more diverse faces.” That means including more black faces in the data.
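As a sketch of how that remedy can work, the following assumes synthetic two-dimensional data and scikit-learn; the groups, centers, and sample sizes are all hypothetical, and this is only my illustration of the rebalancing idea, not the systems the article describes.

```python
# A minimal sketch, assuming synthetic data and scikit-learn: training with
# few examples of one group tends to lower that group's accuracy; adding
# more of its examples and retraining narrows the gap.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_group(center, n):
    """Draw n two-dimensional feature vectors around a group-specific center."""
    return rng.normal(loc=center, scale=1.0, size=(n, 2))

# Imbalanced training set: 1,000 examples of group A, only 30 of group B.
X_train = np.vstack([make_group([0.0, 0.0], 1000), make_group([1.5, 1.5], 30)])
y_train = np.array([0] * 1000 + [1] * 30)

# Balanced test set, so accuracy can be compared group by group.
X_test = np.vstack([make_group([0.0, 0.0], 500), make_group([1.5, 1.5], 500)])
y_test = np.array([0] * 500 + [1] * 500)

pred = LogisticRegression().fit(X_train, y_train).predict(X_test)
for g in (0, 1):
    acc = (pred[y_test == g] == g).mean()
    print(f"imbalanced training, group {g} accuracy: {acc:.1%}")

# The article's remedy, in miniature: add more examples of the
# under-represented group, then retrain.
X_bal = np.vstack([X_train, make_group([1.5, 1.5], 970)])
y_bal = np.concatenate([y_train, np.ones(970, dtype=int)])
pred = LogisticRegression().fit(X_bal, y_bal).predict(X_test)
for g in (0, 1):
    acc = (pred[y_test == g] == g).mean()
    print(f"rebalanced training, group {g} accuracy: {acc:.1%}")
```

Notice that a human had to decide which group was under-represented and supply more of its data; the model does not make that decision on its own.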

I see important lessons here. As this example illustrates, Big Data has some of the same problems today that data analysis had back in the 1970s, when my corporate job resembled the work of what we now call data scientists. I know from experience that, just as in the 1970s and before, merely feeding data into the computer and expecting the algorithm to spit out the right answer automatically does not guarantee that it will.

Back in the 1970s, I developed statistical models for predictive purposes. Back then, I saw the importance of looking for places where the algorithms might not perform well. Perhaps the relationship between previously correlated variables was not the same as before. Or perhaps a model might be slow to identify turning points, just as the facial-recognition algorithms had trouble identifying black faces.

I find that, whether for today’s analytics or yesteryear’s, a good grasp of the predictive problem helps. In fact, human thinking by someone who understood the problem is what led to improvements like adding more diverse faces to the data, so that the algorithm can eventually be trained to identify black faces better. The computer algorithm does not come up with that improvement by itself. That’s why, as I’ve said before, the technical skills of a data scientist are not enough. Someone still must understand what the data means. That was true years ago, it is still true today, and it will continue to be true as the early stages of applying Big Data unfold.

Furthermore, just as with data from the past, narrow segments can pose challenges for today’s Big Data. When data is collected about a broad population, information about narrower segments within that population may not be adequate. Today, as in years past, this can happen with data collected via a traditional market research survey. And it happens with Big Data today, when algorithms are built from a broad population (in the above example, all faces) but predictions are desired for a narrower segment (in the above example, black faces).

Whether it’s survey methodology developed years ago or Big Data’s algorithms today, the way to get enough information about narrower segments is to collect more data specifically about those segments. With a traditional survey, this can mean continuing to screen for and add respondents from the narrower segment until a large enough sample of that segment is obtained. In the faces example, if the misidentification is due to too few black faces in the training data, it means adding more of the narrower segment (black faces) until there are enough for the algorithm to work well.
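Here is a small sketch of that screening logic, assuming a hypothetical respondent stream in which roughly 10% of respondents belong to the narrow segment; the incidence rate and the target of 200 completes are invented numbers.

```python
# A minimal sketch of quota screening for a narrow segment, assuming a
# hypothetical respondent stream; the 10% incidence and the target of 200
# are illustrative, not from the article.
import random

random.seed(1)

def next_respondent():
    """Stand-in for recruiting one respondent; ~10% are in the narrow segment."""
    return {"in_segment": random.random() < 0.10}

TARGET = 200            # required completes from the narrow segment
segment_sample = []
screened = 0

# Keep screening until the narrow segment reaches its target sample size.
while len(segment_sample) < TARGET:
    respondent = next_respondent()
    screened += 1
    if respondent["in_segment"]:
        segment_sample.append(respondent)

print(f"screened {screened} respondents to obtain {TARGET} in the segment")
```

The same logic applies to training data: keep adding examples from the narrow segment until its sample is large enough on its own, not merely proportional to its share of the broad population.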

In conclusion, Big Data has often been said to be free of many of the challenges associated with traditional data analysis. Yet, as the faces example in the Wall Street Journal article illustrates, problems long associated with data analysis still plague today’s algorithms. So, as more and more data becomes available and companies increasingly tap data resources, understanding what the data means and avoiding bias in algorithms is crucial.

That’s why my work today concentrates heavily on understanding what business success patterns mean. That emphasis is reflected in the material on this website and in what I cover when I speak to groups. Today, just as in years past, it is essential to understand what the data means.
