Donald Trump is the new president of the United States. What was once improbable is now reality.
As it happened with Brexit, most polls got it completely wrong. The graph below shows the poll history for the 2016 US presidential election.
This is what the betting odds looked like:
On a similar note, this is what the polls for Brexit looked like:
So, the question is: how could everyone get it so wrong?
Many articles will pop up in the next few days arguing that the sampling in most polls was flawed, or that statistical forecasting is not an exact science. In fact, just a few hours after the election results came in, an article on this topic was published in The Guardian, and there have been many similar pieces since. A blog post by YouGov research director Anthony Wells, nicely summarized by Business Insider, lists a few reasons behind the Brexit polling failure, such as failing to take turnout into account and the over-representation of university graduates. Other outlets, such as GQ, go for a softer analysis, offering reasons such as America perhaps not being ready for a female president.
These points are indeed correct. However, as data scientists and statisticians we need to ask ourselves: “how could we actually have predicted this outcome?” It is easy to run a post-hoc analysis and find reasons why things went wrong, but the real question is how we can do better next time. These two consecutive failures in forecasting two of the most important political events of our time demand a revision of our methodology and way of thinking.
In short, problems in forecasting political events can be broken down into three categories:
1) Problems in sampling: A non-representative sample, people not telling the truth (e.g. because they didn’t want to admit they were Trump supporters), etc.
2) Problems in the models being used: Maybe the models used for predicting the election are simply not powerful enough. On Kaggle, we’ve seen model ensembles become the new standard; it is nearly impossible to win most competitions using something simple like logistic regression.
3) Problems in the variables being used: For example, Nate Silver’s methodology used polls alongside economic indicators. This is a reasonable approach, but there is nothing to suggest these are the best variables for forecasting elections; judging by the model’s failure, this choice of variables may not have been a good one.
Nevertheless, a few people actually predicted the Trump presidency. One was Professor Allan Lichtman, who used a simple model of 13 yes/no questions. Another successful prediction came from the company Genic.ai, which used social media for its forecasts. Also, in the final few days, 90% of bets on Paddy Power were for Trump. Comparing what worked and what didn’t will let us examine each of the three problems mentioned above independently.
So, let’s break this down:
Problems in sampling
Yes, these problems do exist, as mentioned above. They are easy to define, but not always easy to detect, and it is not always clear how to solve them. For example, if online polls can’t be trusted, what is the alternative? We could avoid online polls altogether, but that would sacrifice information. Another fix is to check whether there is some kind of stable bias (e.g. a proportion of voters who consistently won’t tell the truth) and apply a post-hoc correction.
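As an illustration of such a post-hoc correction, here is a minimal sketch in Python. The bias figure is entirely hypothetical; in practice it would have to be estimated from past elections where both the polls and the true outcome are known.

```python
# Sketch of a post-hoc bias correction for poll numbers.
# The +3-point bias estimate is a hypothetical figure, not a
# measured value; estimating it reliably is the hard part.

def correct_poll(poll_share: float, estimated_bias: float) -> float:
    """Shift a candidate's polled vote share (in percentage points)
    by a previously estimated systematic bias."""
    return poll_share + estimated_bias

# Hypothetical national poll: candidate A at 45%, but past elections
# suggest polls understate A's support by ~3 points.
adjusted = correct_poll(45.0, estimated_bias=3.0)
print(adjusted)  # 48.0
```

The correction is trivial once the bias is known; the whole difficulty lies in whether such a bias really is stable from one election to the next.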
Problems in the models being used
This could be an important factor, but there are no clear answers. Most machine learning competitions nowadays are won with model ensembles, random forests and XGBoost. These models are far more powerful than most statistical models, such as the generalized linear model. However, referendums and elections pose a particular problem: they are semi-unique events. Semi-unique events are events like sports finals or the next financial meltdown. They have taken place in the past, so in theory we can gather a dataset to analyse, but the underlying variables have changed so much between repetitions that the dataset carries less information than it would in other domains (e.g. predicting whether an individual has a disease based on symptoms). Complex machine learning models require a large number of training examples, and there is no dataset of 1000 US presidential elections or Brexit referendums to train on. This is why models that encode domain knowledge might actually have an edge, which brings us to the next point.
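A tiny synthetic experiment illustrates the point. With only ten “elections” and purely noisy features, a high-capacity memorising model (here, 1-nearest-neighbour) fits the training data perfectly, yet its accuracy on held-out events, the number we actually care about, averages no better than a coin flip. The data below is randomly generated, not real election data.

```python
import random

# Toy illustration of why high-capacity models struggle on
# semi-unique events: 10 synthetic "elections", 5 noisy features,
# random outcomes. All data is invented.
random.seed(0)
n = 10
X = [[random.random() for _ in range(5)] for _ in range(n)]
y = [random.randint(0, 1) for _ in range(n)]

def dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def predict_1nn(x, X_train, y_train):
    """Predict with the label of the nearest training point."""
    i = min(range(len(X_train)), key=lambda j: dist(x, X_train[j]))
    return y_train[i]

# Training accuracy: each point's nearest neighbour is itself,
# so the fit is perfect regardless of how random the labels are.
train_acc = sum(predict_1nn(X[i], X, y) == y[i] for i in range(n)) / n

# Leave-one-out accuracy: hold out each "election" in turn. With
# random labels this averages only ~50% over repeated simulations.
loo_acc = sum(
    predict_1nn(X[i], X[:i] + X[i+1:], y[:i] + y[i+1:]) == y[i]
    for i in range(n)
) / n

print(train_acc, loo_acc)
```

The perfect training score is pure memorisation; with only a handful of semi-unique events there is simply not enough signal for a flexible model to generalise.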
Problems in the variables being used
Nate Silver’s methodology is an example of standard statistical analysis: take some variables you believe correlate with the outcome, choose a particular statistical model, and make a prediction with some error margins around it. Two of the models that correctly predicted the US presidential election did something different from standard statistical analysis. Professor Lichtman’s model simply uses 13 questions (which he calls keys). This is an example of a very simple model, based solely on qualitative analysis and domain knowledge. Genic.ai’s model gathers data from Twitter, Facebook and Google, a different take based on extracting public opinion from social media.
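Lichtman’s published decision rule is simple enough to express in a few lines: each key is a true/false statement about the incumbent party’s situation, and when six or more keys are false, the incumbent party is predicted to lose. The answers below are hypothetical placeholders, not his actual 2016 calls.

```python
# Sketch of Allan Lichtman's "13 keys" decision rule. The rule
# (incumbent party loses when six or more keys are false) follows
# his published method; the example answers are purely illustrative.

def predict_incumbent_party(keys):
    """Given 13 true/false keys, return 'wins' or 'loses'
    for the incumbent party."""
    assert len(keys) == 13, "the model uses exactly 13 keys"
    false_keys = keys.count(False)
    return "loses" if false_keys >= 6 else "wins"

# Hypothetical assessment: 7 keys have turned against the incumbent party.
example_keys = [True] * 6 + [False] * 7
print(predict_incumbent_party(example_keys))  # loses
```

Note how different this is from a poll-driven model: every input is a piece of qualitative domain knowledge rather than a survey number.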
The conclusion is simple, albeit not a happy one for most data scientists and statisticians:
Elections and referendums are semi-unique events taking place within a social domain. This limits the usefulness of polling, because its effectiveness can depend on social and psychological factors whose recent shifts may have gone unnoticed. It also limits the usefulness of other variables (e.g. economic indicators), because their effect on voting behaviour can change between events. As a result, a large share of the variance can remain unaccounted for, either due to noisy input (as in the case of polls) or due to insufficient information to estimate a variable’s effect (as in the case of economic indicators). In that light, domain knowledge of the problem and novel methodologies (e.g. social media analysis) that allow the extraction of opinions in indirect ways become more important.
In simple words, our ability to forecast elections and referendums is limited as long as we keep treating them as traditional statistical problems. This is very similar to the problem Nassim Nicholas Taleb described in his book The Black Swan, which dealt with financial crises. Trump was, in essence, a Black Swan that few people predicted, because forecasters assumed there is some kind of regularity in history that would allow models that worked well in the past to keep working in the future. It is also related to the unknown unknowns, such as discovering that polls are broken only after they have failed to predict an election.
Some people would advocate going back to qualitative analysis. However, reverting to qualitative analysis, which is subjective to a large extent, and throwing away decades of advancement in computing and statistics in the process, would be a waste of great tools that have worked wonderfully in many other cases.
Instead, what can be done is:
1) Create models that take domain knowledge better into account, like Lichtman’s model or Bayesian models. The sports betting industry hires staff to watch games full-time and extract useful information (such as a player’s behaviour) that can’t be captured in simple stats. In theory some of this could be automated (e.g. using deep neural networks), but I expect this kind of technology will take some time to develop.
2) Create models that extract the unknown unknowns from unconventional sources, which might be a better indication of people’s preferences than simply asking for their opinion. Social media seems to be one such source, but I can imagine others, such as lifestyle patterns, or analysis of the political climate through other means (e.g. by crawling online newspapers). I have already done similar work in the context of using Twitter data to predict sports outcomes.
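As a toy sketch of the social-media idea, the snippet below counts which candidate each post mentions. The posts are invented, and a real system would need sentiment analysis, bot filtering and demographic weighting on top of this.

```python
import re
from collections import Counter

# Toy sketch of extracting a support signal from social-media text.
# The posts and matching rule are invented for illustration only.
posts = [
    "Just saw the rally, voting Trump for sure",
    "Can't wait to vote for Hillary next week",
    "Trump all the way #MAGA",
    "Undecided but leaning Trump",
]

mentions = Counter()
for post in posts:
    for candidate in ("trump", "hillary"):
        # Word-boundary match so "trumpet" would not count as "trump".
        if re.search(rf"\b{candidate}\b", post.lower()):
            mentions[candidate] += 1

print(mentions.most_common())  # [('trump', 3), ('hillary', 1)]
```

Raw mention counts are obviously a crude proxy for voting intention, but they come from behaviour rather than from what people say to a pollster, which is exactly the indirect signal argued for above.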
Of course, these ideas can always be combined, e.g. a Bayesian model that includes expert knowledge, along with social media analysis.
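A minimal sketch of such a combination, with all numbers invented: expert judgement enters as a Beta prior, and a hypothetical social-media signal plays the role of the data in a conjugate Beta-Binomial update.

```python
# Minimal sketch of combining expert knowledge with a social-media
# signal in one Bayesian model. Every number here is invented.

# Expert judgement as a Beta prior: candidate A's two-party share is
# believed to be ~52%, held with confidence worth 50 "virtual voters".
prior_alpha = 0.52 * 50   # 26 prior "successes"
prior_beta = 0.48 * 50    # 24 prior "failures"

# Hypothetical social-media signal: of 1000 posts expressing a clear
# preference, 480 backed candidate A.
signal_for_a, signal_total = 480, 1000

# Conjugate Beta-Binomial update: Beta(a, b) -> Beta(a + k, b + n - k).
post_alpha = prior_alpha + signal_for_a
post_beta = prior_beta + (signal_total - signal_for_a)
posterior_mean = post_alpha / (post_alpha + post_beta)

print(round(posterior_mean, 3))  # 0.482: pulled from the prior (0.52) toward the data (0.48)
```

The point of the construction is that neither source dominates: a strong expert prior tempers a noisy signal, while enough data eventually overrides a mistaken expert.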
Does that mean we’ve solved the problem? No. Semi-unique events, by their nature, may require new techniques to be developed over time. So, while the aforementioned models may have worked this time, the future might require new types of analysis. The only thing that’s certain is that domain knowledge and awareness of the unknown unknowns are two issues that will have to be taken into account.