by Muhammad Aurangzeb Ahmad
Much has already been written about the failure of data science to predict the outcome of the 2016 US election, but it is always good to revisit cautionary tales. The overwhelming majority of those who work in election prediction, including big names like the New York Times' Upshot, Nate Silver's FiveThirtyEight and the Princeton Election Consortium, put Clinton's chance of winning at more than 70 percent. This is of course not what happened, and Donald Trump is the president-elect. And so, as the results came in on the night of November 8th, people started asking if there was something wrong with data science itself. The Republican strategist Mike Murphy went as far as to state, “Tonight, data died.”
My brush with election analytics came in late 2015, when I was looking for a new job and talked to people on both the Republican and the Democratic data science teams about prospective roles, before deciding to pursue a different career path. However, this experience forced me to think about the role of data-driven decision-making in campaigning and politics. While data is certainly not dead, Mike Murphy's observation does lay bare the fact that those interpreting the data are all too human. The overwhelming majority of the modelers and pollsters had implicit biases regarding the likelihood of a Trump victory. One does not even have to torture the data to make it confess; one can simply ask the data the wrong questions to make it say what one wants to hear.
We should treat the outcome of, and the modeling approaches to, the 2016 US presidential election as learning experiences for data science, and acknowledge that data science is a very human enterprise. In addition to understanding what led to the selective choice of data and why the models did not perform as well as they should have, it would help to unpack some of the assumptions that go into creating these models in the first place. The first things that come to mind are systematic error and sampling bias, which were among the factors that resulted in incorrect predictions, a lesson that pollsters should have learned after the 1948 Dewey vs. Truman fiasco. That said, there was indeed some discussion about the unreliability of the polling data in the run-up to the election, although the dissenting voices rarely made it into the mainstream. Obtaining representative samples of the population can be extremely hard.
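To see why systematic error is more dangerous than ordinary sampling noise, here is a minimal simulation sketch. Everything in it is an assumption made for illustration: the true vote share, the size of the shared bias, and the function names are hypothetical and not drawn from any real poll. It suggests that averaging many polls washes out random noise but leaves a bias shared by all the polls fully intact.

```python
import random
import statistics


def simulate_poll(true_share, n_respondents, bias, rng):
    """Simulate one poll. `bias` shifts the chance of sampling a supporter,
    standing in for a skew in who gets reached (a hypothetical mechanism)."""
    p = true_share + bias
    hits = sum(1 for _ in range(n_respondents) if rng.random() < p)
    return hits / n_respondents


def average_of_polls(true_share, n_polls, n_respondents, bias, seed=0):
    """Average the results of many independent polls that share the same bias."""
    rng = random.Random(seed)
    return statistics.mean(
        simulate_poll(true_share, n_respondents, bias, rng)
        for _ in range(n_polls)
    )


TRUE_SHARE = 0.49   # hypothetical true support for a candidate
SHARED_BIAS = 0.03  # hypothetical bias common to every poll

unbiased = average_of_polls(TRUE_SHARE, n_polls=50, n_respondents=1000, bias=0.0)
biased = average_of_polls(TRUE_SHARE, n_polls=50, n_respondents=1000, bias=SHARED_BIAS)

print(f"true share:              {TRUE_SHARE:.3f}")
print(f"average, no shared bias: {unbiased:.3f}")
print(f"average, shared bias:    {biased:.3f}")
```

With fifty polls of a thousand respondents each, the unbiased average should land very close to the true share, while the biased average should sit on the wrong side of 50 percent no matter how many polls are added.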
It is notoriously difficult to predict which registered voters are actually going to vote. Fewer registered Democrats went to the polls to vote for Hillary Clinton than had voted for the Democratic nominee in the last few elections. It is already well known that Clinton would have won comfortably had that not been the case. The opposite is also true: many people on the alt-right who normally do not engage with the electoral process voted for Donald Trump. Many factors determine whether one obtains a representative sample of the population. Investor's Business Daily (IBD) correctly predicted the outcome of the election, and in their own words they were able to do so because most other pollsters collected most of their data by calling smartphones, whereas the polls that IBD conducted used samples that were representative even in the types of phones that were used. It may be that IBD simply got lucky, because even their approach, as far as we know, does not take voter apathy into account.
The real story about data science and the elections may be that even in the age of Big Data we have precious little data with which to make robust predictions about the electorate, even though we may pretend otherwise. Just because a simple model predicted that Trump would win the presidency doesn’t mean the model is correct; there are just too few data points to make predictions with reasonable confidence. Many in the data science community observed that the Republicans were far behind the Democrats in building a strong data science operation and might lose the election for this reason. Of course they were dead wrong about this.
Cambridge Analytica is the British analytics company that led the data science efforts of the Trump campaign. After the fact, it is being touted by many outlets as the engine behind Trump’s success, while others have decried most of this as post-victory myth-making. One of Cambridge Analytica’s claims to fame is that it uses psychographic data to make predictions about voting choices. Many outsiders observe that even a sample size of a few million is not enough to generalize over a population of more than 320 million. The PR folks at Cambridge Analytica have played up the media's fascination with the idea of a data science team winning the election. What is left out of these accounts, however, is that before election day Cambridge Analytica put its candidate's chance of winning at 20 percent, which it upgraded to 30 percent as voting began. This does not exactly sound like predicting a win in advance, or like actionable insight for strategizing. Thus, many journalists have stated that the claims of data science winning the election are vast exaggerations and that there is no secret sauce in the company's data science approach.
If we are to take a critical eye to Cambridge Analytica, then it is only fair that we apply the same critical eye retroactively to the previous elections and the success of the Nate Silvers of the world. It may well be that the success of the predictions in previous elections was a fluke, but there are important lessons that one can learn from flukes. One of the most insightful comments came from Pradeep Mutalik: “that aggregating poll results accurately and assigning a probability estimate to the win are completely different problems.” The former is relatively straightforward, while the latter involves a host of assumptions that are not always clear and is often more art than science.
Lastly, there is the issue of how forecasts are read by the populace and the media, neither of which are rocket scientists when it comes to interpreting the probability of winning or losing an election. Those with some know-how of probability would be surprised to learn how many people think that a 60 percent probability of winning implies almost certainly winning. Pradeep Mutalik of Yale has rightly pointed out that probabilistic forecasts should be done away with or, if we are to use them, they should come with margin-of-error disclaimers. Perhaps our predictive technology is not as good as we think. It is as good or as bad as ad targeting, which is another way of saying not that good. One cannot really blame the data when the data that we select already has the conclusions that we want built into it. Alternatively, we should stop worrying so much about predicting the weather. Perhaps the outcomes don’t matter as much as we like to think; certainly Nassim Nicholas Taleb thinks so.
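Mutalik's distinction between aggregating polls and assigning a win probability can be made concrete with a small sketch. It assumes, purely for illustration, that the true margin is normally distributed around the polling average; the three-point lead and the candidate error sizes are hypothetical numbers, and this is not a reconstruction of any forecaster's actual model.

```python
import math


def win_probability(lead_pts, error_sd_pts):
    """Probability that the true margin is positive, assuming (hypothetically)
    that it is normally distributed around the polling-average lead with
    standard deviation `error_sd_pts` (both in percentage points)."""
    z = lead_pts / error_sd_pts
    # Standard normal CDF expressed through the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


LEAD = 3.0  # hypothetical polling-average lead, in percentage points

for sd in (1.0, 2.0, 4.0, 6.0):
    prob = win_probability(LEAD, sd)
    print(f"assumed total error SD of {sd:.0f} pts -> win probability {prob:.1%}")
```

The same three-point average yields anything from near certainty to roughly a 69 percent chance, depending only on how much polling error the modeler chooses to assume. The aggregation step is identical in every row; the headline probability is driven by an assumption the reader rarely sees.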