The Big Problem with Big Data
Without a doubt, Big Data holds a lot of promise. But Nate Silver reminds us that the mere availability of data will not change anything, even if it comes in large servings.
The Signal and the Noise: Why Most Predictions Fail but Some Don't
Author: Nate Silver
Publisher: Penguin
Pages: 534
These days it’s hard not to hear someone or other talk about Big Data, especially if you are a journalist covering IT in Bangalore. The term comes up in press conferences, in seminars, in PowerPoint presentations and sometimes even in casual conversations. The pronouncements on Big Data are often delivered with the passion of an evangelist, and arise from an awareness that astronomical amounts of data get generated every day.
A McKinsey report, released last year, glowingly quoted an IDC analysis saying that in 2009, 800 exabytes of data were created - “enough to fill a stack of DVDs reaching to the moon and back.” Nearly all sectors in the US economy, it said, had at least an average of 200 terabytes of stored data per company with more than 1,000 employees.
The consulting firm studied five domains - healthcare, public sector administration, personal location data, retail and manufacturing - and in each of these it found Big Data can generate significant financial value.
It’s no wonder, then, that everyone’s excited. The other day, I heard an executive from a mid-sized IT firm speak about what they found out after an exercise in this field. When they overlaid pharma product sales data on weather data, they saw that the sales of band-aids went up on rainy days. “I don’t know what causes it, but there is a correlation. And that’s the key. Imagine the value this insight gives to the clients, to their supply chain,” he said.
The idea that data is all you need to navigate the rough waters - and who cares about the mechanics of causation - has been around for some time. In 2008, Chris Anderson, editor of Wired and author of The Long Tail and Free, wrote gushingly about how Google uses data and said, “The new availability of huge amounts of data, along with the statistical tools to crunch these numbers, offers a whole new way of understanding the world. Correlation supersedes causation, and science can advance even without coherent models, unified theories, or really any mechanistic explanation at all.” His piece was called The End of Theory.
The geek who destroyed punditry
Similar sentiments were expressed recently, after Nate Silver, a geeky blogger at The New York Times, accurately predicted the US presidential election results with nothing more than his data sets, his statistical models and his computer. The political pundits who mocked his data-driven model during the elections had to eat their words later on. When the results were out, the media was ready with its tributes: Nate Silver-Led Statistics Men Crush Pundits in Election | Bloomberg; Has Nate Silver destroyed punditry? | Christian Science Monitor; The Statisticians on the Bus | How a nerd named Nate Silver changed political reporting forever | Newsweek
Silver’s success was not a fluke. Before he got into electoral predictions, Silver made a name for himself in baseball and poker. He designed a system called Pecota to predict the performance of major league baseball players, and sold it to Baseball Prospectus. For some time, he made a living from online poker - his income ran into six figures then, by one account. If Congress hadn’t banned online poker, he would have continued to play it, he joked at a Google event recently. As it happened, he turned his attention to US presidential elections. He started a blog called FiveThirtyEight (the name refers to the number of electoral college votes) to share the findings of his analysis. In 2008, he was right about 49 of the 50 states. For the 2012 elections, he moved his blog to The New York Times, where it became one of the biggest draws for the newspaper’s website and turned Silver into a superstar.
So, when I picked up Nate Silver’s book, I expected to read a strong case for a data-driven approach to everything, a set of arguments to demonstrate the supremacy of data. But it turned out to be different.
Now, Big Data is not the main theme of his book, and Silver touches on it only now and then. The book is about using data to make predictions, and that, in some ways, is at the core of Big Data. After all, people are primarily interested in the future - what will be the price of the produce that's growing in your field now, how much of a particular product will we sell in a particular market, what kind of products should we develop, or even what the traffic will be on the route to the airport three hours from now. And the promise of Big Data is that it will give the answers by studying reams and reams of data.
From this perspective, there are three big takeaways from the book. One, in the field of predictions, failures rule and successes are rare. Two, more data will not solve this problem, and data alone is not sufficient. And three, there’s a way to improve your chances of getting your predictions right.
Yogi Berra was right
That Nate Silver achieved a kind of super-stardom just by predicting election results right tells us something about how rarely such a thing happens. He has gathered several badges: a TED Talk, Talks @ Google, interviews with Jon Stewart and Stephen Colbert. As I write this, the book is 14th on Amazon’s bestsellers list, and the most wished-for book in the Money & Markets category. His is not the sole success. The book itself discusses a number of examples. Weather forecasts, for example, have become better and better, and are fairly reliable in the US today. There’s a fascinating chapter on how IBM’s machine won against Kasparov.
But tales of failure abound. Silver talks about the Tohoku earthquake that led to the Fukushima disaster, and as if to remind us that it will take longer than a year to get better at these things, another earthquake hit Japan last week without warning. Predictions have also failed for terrorist attacks, in economics, and even in a good deal of academic research. Even where prediction has seen some success, that success is somewhat limited. Silver attributes his own success to well-chosen battles, and has spoken about the limitations of his model. (If you listen to his interviews, some of which are available on YouTube, you can’t help but notice how disarmingly honest he is.) His model didn’t work in parliamentary elections in the UK, and you only have to look at the historical data on the website of India’s Election Commission, and compare the number of pre-election polls here with the range and variety that Silver had access to in the US, to see why it won’t work in India either. Making predictions, as Yogi Berra said, is difficult, especially if it’s about the future. That has hardly changed. Doing his research, Silver says, “I came to realize that prediction in the era of Big Data was not going very well.”
Context matters
Now, one might argue that more data could solve these problems - and in fact, that’s one of the reasons why people are very excited about Big Data.
Silver draws our attention to the forecasting firm ECRI, which in September 2011 predicted that the world was headed for a “double dip” recession (if it wasn’t already in one), and threw at its customers several leading indices that suggested so. He writes:
“Theirs was a story about data - as though data itself caused recessions - and not a story about the economy. ECRI actually seems quite proud of this approach. ‘Just as you do not need to know exactly how a car engine works in order to drive safely,’ it advised its clients in a 2004 book, ‘You do not need to understand all the intricacies of the economy to accurately read those gauges.’ This kind of statement is becoming more common in the age of Big Data. Who needs theory when you have so much information? But this is categorically the wrong attitude to take toward forecasting, especially in a field like economics where the data is so noisy. Statistical inferences are much stronger when backed up by theory or at least some deeper thinking about their root causes.”
Besides, the problem with Big Data could be exactly what its name suggests: big data. To separate signal from noise, to deal with false positives, and to test hypotheses - all these will be difficult, because more data will also produce more noise. “For instance, the U.S. government now publishes data on about 45,000 economic statistics. If you want to test for relationships between all combinations of two pairs of these statistics - is there a causal relationship between the bank prime loan rate and the unemployment rate in Alabama? - that gives you literally one billion hypotheses to test.”
As for ECRI’s double dip recession, it never happened. But there’s a way to improve your chances of getting your predictions right: be a fox rather than a hedgehog. And even more importantly, join the Bayesian club.
Be a fox
In 1953, philosopher Isaiah Berlin published an essay called The Hedgehog and the Fox, in which he spoke about two types of men - foxes, who knew a lot of things, and hedgehogs, who knew one big thing. It was definitely not one of his most serious works, but it turned out to be very popular, and the hedgehog and the fox became an enduring metaphor for certain types of thinkers and writers. Here’s a seven-minute video that explains the difference between a fox and a hedgehog.
[youtube]http://www.youtube.com/watch?v=WIbbFfz8nEQ[/youtube]
Silver says foxes are better at predicting too.
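Silver's “one billion hypotheses” figure quoted above is easy to check, and the arithmetic hints at why more data also means more noise. Here is a minimal Python sketch. The function names are my own, and the 5% significance threshold, along with the assumption that no pair of statistics is truly related, are illustrative assumptions rather than figures from the book.

```python
from math import comb


def pairwise_hypotheses(n_series: int) -> int:
    """Count the distinct unordered pairs that can be formed from n_series statistics."""
    return comb(n_series, 2)


def expected_false_positives(n_tests: int, alpha: float) -> float:
    """Expected number of spurious 'significant' results.

    Assumes (hypothetically) that none of the tested relationships is real,
    so each test has probability alpha of flagging a relationship by chance.
    """
    return n_tests * alpha


# About 45,000 economic statistics, as in the passage Silver is quoted on.
n_pairs = pairwise_hypotheses(45_000)
print(n_pairs)  # 1012477500, i.e. roughly one billion pairs

# Illustrative threshold only: alpha = 0.05 is an assumption, not Silver's number.
print(expected_false_positives(n_pairs, 0.05))
```

Under those assumptions, tens of millions of pairs would look related purely by chance, which is one way to read the warning about false positives.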
Join the Bayesian club
Bayes’s Theorem is a recurring theme in Silver’s book. The theorem was proposed by Thomas Bayes, an 18th century mathematician. Probability buffs speak about Bayes in the same way physics buffs speak about Richard Feynman. Here's a video that might help jog your memory.
[youtube]http://www.youtube.com/watch?v=E2pOJwSwWDk[/youtube]
Math apart, Silver highlights three habits of mind that the Bayesian approach encourages.
- Think Probabilistically
Consider these two sentences.
- No investor can beat the stock market
- It is hard to tell how many investors beat the stock market over the long run, because the data is very noisy, but we know that most cannot relative to their level of risk, since trading produces no net excess return but entails transaction costs, so unless you have inside information, you are probably better off investing in an index fund.
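For readers who want to see the arithmetic behind Bayes’s Theorem mentioned above, here is a minimal Python sketch of how a belief is revised as evidence comes in. The function name and all the numbers are hypothetical and chosen only for illustration; none of them come from Silver’s book.

```python
def posterior(prior: float, p_evidence_if_true: float, p_evidence_if_false: float) -> float:
    """Bayes's Theorem for a yes/no hypothesis.

    prior: belief in the hypothesis before seeing the evidence
    p_evidence_if_true: chance of seeing the evidence if the hypothesis is true
    p_evidence_if_false: chance of seeing the evidence if the hypothesis is false
    """
    numerator = prior * p_evidence_if_true
    denominator = numerator + (1.0 - prior) * p_evidence_if_false
    return numerator / denominator


# Hypothetical numbers: start undecided (50%), then observe three pieces of
# evidence, each twice as likely if the hypothesis is true (0.8) than if it is
# false (0.4). The belief rises with each observation but never reaches 100%.
belief = 0.5
for _ in range(3):
    belief = posterior(belief, 0.8, 0.4)
    print(round(belief, 3))
```

The point of the sketch is the habit rather than the numbers: each new observation nudges the estimate, and the answer stays a probability, much like the second of the two sentences above.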
First Published: Dec 11, 2012, 22:08
