by Muhammad Aurangzeb Ahmad
Much has already been written about the failure of data science in predicting the outcome of the 2016 US election, but it is always good to revisit cautionary tales. The overwhelming majority of those who work in election prediction, including big names like the New York Times' Upshot, Nate Silver's FiveThirtyEight, and the Princeton Election Consortium, put Clinton's chance of winning at more than 70 percent. That is of course not what happened, and Donald Trump is the president-elect. And so, on the night of November 9th, people started asking if there was something wrong with data science itself. The Republican strategist Mike Murphy went as far as to state, "Tonight, data died." My brush with election analytics came in late 2015 when, while looking for a new job, I talked to folks on both the Republican and the Democratic data science teams about prospective roles but decided to pursue a different career path. The experience nonetheless forced me to think about the role of data-driven decision making in campaigning and politics. While data is certainly not dead, Mike Murphy's observation does lay bare the fact that those interpreting the data are all too human. The overwhelming majority of the modelers and pollsters had implicit biases regarding the likelihood of a Trump victory. One does not even have to torture the data to make it confess; one can simply ask the data the wrong questions to make it answer what one wants to hear.
We should treat the outcome and the modeling approaches of the 2016 US presidential election as learning experiences for data science, and acknowledge that data science is a very human enterprise. Understanding what led to selectively choosing the data, and why the models did not perform as well as they should have, would help us unpack some of the assumptions that go into creating these models in the first place. The first thing that comes to mind is systematic error and sampling bias, one of the factors that resulted in incorrect predictions and a lesson pollsters should have learned after the Dewey vs. Truman fiasco. That said, there were indeed some discussions about the unreliability of pollster data in the run-up to the election, although the dissenting voices rarely made it into the mainstream media. Obtaining representative samples of the population can be extremely hard.
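To make the sampling-bias point concrete, here is a minimal simulation of differential non-response: if supporters of one candidate are slightly less likely to answer pollsters, a poll of a perfectly tied electorate will systematically overstate the other candidate. The response rates and population figures below are invented for illustration, not estimates from any real 2016 poll.

```python
import random

random.seed(42)

# Hypothetical electorate: exactly 50% support candidate A, 50% candidate B.
TRUE_SUPPORT_A = 0.50

# Assumed (illustrative) response rates: B supporters answer pollsters
# slightly less often than A supporters -- classic non-response bias.
RESPONSE_RATE = {"A": 0.60, "B": 0.50}

def run_poll(sample_size):
    """Keep contacting random voters until sample_size people respond,
    honoring the differing response rates; return A's estimated share."""
    responses = []
    while len(responses) < sample_size:
        voter = "A" if random.random() < TRUE_SUPPORT_A else "B"
        if random.random() < RESPONSE_RATE[voter]:
            responses.append(voter)
    return responses.count("A") / sample_size

# Average over many simulated polls to see the systematic (not random) error.
estimates = [run_poll(1_000) for _ in range(200)]
avg = sum(estimates) / len(estimates)
print(f"true support for A:           {TRUE_SUPPORT_A:.3f}")
print(f"average polled support for A: {avg:.3f}")
```

With these rates the expected polled share for A is 0.30 / 0.55 ≈ 0.545, a roughly 4.5-point systematic error that no amount of additional polling averages away, because every poll drawn this way shares the same bias.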