With a barrage of media coverage heralding big data’s ability to bring previously undetectable findings to light, it’s easy to forget that data can be fallible. It is, after all, just information, and the wrong context, integration issues, or faulty algorithms can all lead that information to suggest misleading conclusions.
An excellent example of this unreliability is documented by Jer Thorp, Innovator-in-Residence at the Library of Congress, in his exploration of a New York City sentiment analysis project. The project analyzed 6,000 public tweets, flagged those determined to be sad by placing them on a map of the city, and ascribed more sadness to locations with a high number of such tweets. The map depicted the saddest spot to be a high school on the Upper East Side. When further analysis revealed that the data collection period coincided with the end of spring break, the researchers touted their conclusion: students taking to Twitter to lament their return to school.
The findings, covered by Nature, The New York Times and other media outlets, were ultimately disproved three weeks later. The researchers, as Thorp writes, “had made a big mistake,” or really, many mistakes. The geocoding code was faulty but, perhaps more critically, they “had missed a deeper error in their ‘sad high school’ hypothesis that left the premise completely indefensible.”
Twitter was not permitted within the building.
This illustrates the fundamental challenge facing big data analytics. Organizations need trusted, accurate, and integrated information in order to avoid mistakes in interpretation, but that in and of itself is not enough. Companies must connect data across all channels and moments and apply intelligence at the edge; otherwise, big data projects become bright, shiny toys devoid of any ROI.
While a fascinating read, Thorp’s “saddest high school” article is far from the only story illustrating big data’s challenges. In fact, the issue was apparent even before big data was a thing: Simpson’s Paradox was first described in the 1950s. But while it may be a well-known problem to statisticians, the solution is often less readily apparent, primarily because there are so many ways in which organizations can be duped by data.
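Simpson’s Paradox is easy to reproduce with a few lines of arithmetic. The sketch below uses the well-known kidney-stone treatment figures often cited to illustrate it (illustrative numbers, not data from Thorp’s article): a treatment can win within every subgroup yet lose in the aggregate, simply because the subgroups are unevenly sized.

```python
# (successes, trials) for treatments A and B, split by case difficulty.
# Illustrative counts from the classic kidney-stone example.
small = {"A": (81, 87),   "B": (234, 270)}
large = {"A": (192, 263), "B": (55, 80)}

def rate(successes, trials):
    return successes / trials

# Within each subgroup, A outperforms B...
assert rate(*small["A"]) > rate(*small["B"])   # ~93% vs ~87%
assert rate(*large["A"]) > rate(*large["B"])   # ~73% vs ~69%

# ...but pooled together, B appears to outperform A, because A was
# assigned the harder (large) cases far more often.
a_total = tuple(x + y for x, y in zip(small["A"], large["A"]))  # (273, 350)
b_total = tuple(x + y for x, y in zip(small["B"], large["B"]))  # (289, 350)
assert rate(*b_total) > rate(*a_total)         # ~83% vs ~78%

print("Every subgroup favors A, yet the aggregate favors B.")
```

An analyst who only sees the pooled totals would draw exactly the wrong conclusion, which is why segmentation and context matter as much as the raw numbers.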
This KDnuggets guest post by Simon Whittcock outlines four of the chief data fallacies, with an accompanying infographic on other common mistakes. It’s important that anyone working frequently with data be aware of these pitfalls and understand the safeguards that must be in place to help avoid them.
Ultimately, both data and the humans who work with it are fallible. Mistakes will occasionally happen and incorrect conclusions will be drawn, but having the right tools and technologies in place can greatly help companies avoid some of big data’s chief stumbling blocks.