Some time ago I researched whether Twitter can be used to predict football outcomes. The article can be found on arXiv: Using Twitter to predict football outcomes.
The research used data from a three-month period and demonstrated that it is indeed possible to use Twitter data to predict Premier League outcomes. When the Twitter data were combined with team data, the performance of the model beat the odds benchmark.
I have been asked various questions about this article. Here I outline the points that come up most often. In the meantime, if you want to learn how to predict sports outcomes, make sure to check out my courses.
1. Why the Premier League?
There are three reasons the Premier League was chosen over other leagues. First, there are lots of games taking place in a season. Second, the Premier League, along with La Liga and the Bundesliga, is one of the most popular leagues in the world, so there is lots of discussion about it on Twitter. Finally, the vast majority of tweets about the Premier League are in English, so it was easier to analyse them.
2. What kind of information exists on Twitter that could be used for predicting outcomes?
The initial idea is that the following kinds of information might exist on Twitter:
a) Fans' predictions about the score
b) Overall sentiment about how the team is doing
c) Information that can’t be retrieved directly from statistics, such as injuries.
Whether this information is actually there is still unclear, since we did not have the time in the original study to look into these questions. I would be willing to research it further, so if anyone is interested, drop me an e-mail.
3. What would be the benefits and the drawbacks of using Twitter for predictions?
There are two clear benefits. First, there might be information (as described previously) that is not accessible otherwise (e.g. sentiment). Second, there might be information that is accessible elsewhere (e.g. injured players) but time-consuming to collect.
Regarding drawbacks, the two main ones are the increase in the number of variables and the increase in the time required for processing, since analysing text and sentiment requires lots of feature engineering and experimentation.
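To give a feel for the first drawback, here is a minimal sketch of how quickly text adds variables. It is not the method used in the study: the tweets, the team variable names and the simple bag-of-words encoding are all assumptions made up for illustration.

```python
import re
from collections import Counter

# Hypothetical example tweets, invented for illustration only.
tweets = [
    "Arsenal will win 2-0 today, feeling confident",
    "Worried about Chelsea, key striker injured again",
    "Chelsea to win 1-0, Arsenal defence looks weak",
]

# Hypothetical team-level variables, e.g. league position and recent form.
team_features = ["home_position", "away_position", "home_form", "away_form"]


def tokenize(text):
    """Lower-case a tweet and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_vocabulary(texts):
    """Return the sorted list of distinct words across all texts."""
    return sorted({word for text in texts for word in tokenize(text)})


def bag_of_words(text, vocabulary):
    """Count how often each vocabulary word occurs in one text."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]


vocabulary = build_vocabulary(tweets)
features = [bag_of_words(tweet, vocabulary) for tweet in tweets]

# Every distinct word becomes one column, so the text part of the
# feature set grows with the vocabulary rather than with the number of teams.
print(len(team_features), "team variables")
print(len(vocabulary), "text variables from only", len(tweets), "tweets")
```

Even with three short tweets, the text columns outnumber the team variables several times over. With thousands of real tweets the vocabulary is far larger, which is why feature selection and experimentation take up so much of the time.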