The traditional approach is to formulate a hypothesis and then look for data to validate that hypothesis. Yet many data scientists consider it a virtue to come to a trove of data with unbiased eyes and simply find patterns already present in it.
The idea is not to explain why the correlations exist but simply to document relationships between variables that might turn out to be causal. While this method can prove useful, it is also a path strewn with pitfalls.
In this article from Wired, Gary Smith explains why unbiased data mining is severely flawed:
The Feynman trap—ransacking data for patterns without any preconceived idea of what one is looking for—is the Achilles heel of studies based on data mining. Finding something unusual or surprising after it has already occurred is neither unusual nor surprising. Patterns are sure to be found, and are likely to be misleading, absurd, or worse.
In his best-selling 2001 book Good to Great, Jim Collins compared 11 companies that had outperformed the overall stock market over the previous 40 years to 11 companies that hadn’t. He identified five distinguishing traits that the successful companies had in common. “We did not begin this project with a theory to test or prove,” Collins boasted. “We sought to build a theory from the ground up, derived directly from the evidence.”
He stepped into the Feynman trap. When we look back in time at any group of companies, the best or the worst, we can always find some common characteristics, so finding them proves nothing at all. Following the publication of Good to Great, the performance of Collins’ magnificent 11 stocks has been distinctly mediocre: Five stocks have done better than the overall stock market, while six have done worse.
Good research begins with a clear idea of what one is looking for and expects to find. Data mining just looks for patterns and inevitably finds some.
The problem has become endemic nowadays because powerful computers are so good at plundering Big Data. Data miners have found correlations between Twitter words or Google search queries and criminal activity, heart attacks, stock prices, election outcomes, Bitcoin prices, and soccer matches. You might think I am making these examples up. I am not. There are even stronger correlations with purely random numbers. It is Big Data Hubris to think that data-mined correlations must be meaningful.
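Smith's point about random numbers is easy to check for yourself. The sketch below is my own illustration, not something from the Wired article: it generates one series of pure noise (the "target"), then searches through many other noise series and keeps whichever one correlates with it most strongly. The function names, the 20-observation sample size, and the seed are all arbitrary assumptions chosen for the demonstration.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def best_spurious_correlation(n_obs=20, n_candidates=1000, seed=0):
    """Return the strongest correlation found between one random series
    and n_candidates other random series. Every series is independent
    noise, so any correlation found is spurious by construction."""
    rng = random.Random(seed)  # fixed seed (an assumption) for repeatability
    target = [rng.gauss(0, 1) for _ in range(n_obs)]
    best = 0.0
    for _ in range(n_candidates):
        candidate = [rng.gauss(0, 1) for _ in range(n_obs)]
        r = pearson(target, candidate)
        if abs(r) > abs(best):
            best = r
    return best


if __name__ == "__main__":
    # The more candidates we ransack, the stronger the "best" pattern looks.
    for k in (1, 10, 100, 1000):
        r = best_spurious_correlation(n_candidates=k)
        print(f"{k:>5} candidates searched: strongest r = {r:+.3f}")
```

Run as a script, it prints the strongest correlation found as the number of candidate series grows. Nothing here is related to anything else, yet the best match keeps looking more impressive the more series are searched, which is exactly the trap the quoted passage describes.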