With the increasing popularity of data science, there are now countless online tutorials on machine learning, statistics, Python and R. Most of these tutorials follow the same pattern: learn some basic commands, go through a simple use case, apply some algorithms and discuss the results. But is this what the job of a data scientist looks like on a daily basis?
Among other things, a data scientist has to:
- Discuss the business problem with the stakeholders and convert it into a data science problem.
- Understand the different metrics and how they relate to business outcomes.
- Understand the algorithms and the trade-offs among them in terms of explanatory power, predictive power and implementation.
- Present and communicate results.
So, data science is a lot more than just loading data and playing around with algorithms in scikit-learn and R. It is easy to become very good at one of these skills while missing the rest.
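To make the point about metrics and business outcomes concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the two models, their confusion-matrix counts and the costs per error are invented, and `Confusion`, `accuracy` and `business_cost` are hypothetical helper names. It only shows that the model with the better accuracy is not necessarily the one that costs the business less.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    """Confusion-matrix counts for a binary classifier."""
    tp: int  # true positives
    fp: int  # false positives
    fn: int  # false negatives
    tn: int  # true negatives


def accuracy(c: Confusion) -> float:
    """Share of all cases that were classified correctly."""
    return (c.tp + c.tn) / (c.tp + c.fp + c.fn + c.tn)


def business_cost(c: Confusion, cost_fp: float, cost_fn: float) -> float:
    """Total cost of the errors, given a cost per false positive and per false negative."""
    return c.fp * cost_fp + c.fn * cost_fn


# Invented counts for two models evaluated on the same 1,000 cases.
model_a = Confusion(tp=40, fp=10, fn=60, tn=890)   # cautious: few false alarms
model_b = Confusion(tp=80, fp=100, fn=20, tn=800)  # aggressive: few misses

# Invented costs: a missed case is assumed to be 20 times as expensive as a false alarm.
COST_FP = 50
COST_FN = 1000

for name, model in [("A", model_a), ("B", model_b)]:
    print(name, accuracy(model), business_cost(model, COST_FP, COST_FN))
```

With these made-up numbers, model A has the higher accuracy but model B has the lower total cost. Which of the two is "better" depends on a conversation with the stakeholders, not on the algorithm.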
Many learners turn to competitions instead. Competitions are a great way to learn how to use the tools properly, code pipelines and experiment with different algorithms, but they do not pose the same challenges you face in real life. In a machine learning competition, the metric of the problem is given to you. You won’t have to present outcomes. You can create a solution of arbitrary complexity, as long as it drives you up the leaderboard, which leads to monster ensembles of many models mixed together.
This is why I took a different approach to teaching data science in my course. Not only do I teach the most popular tools (R, Python and Weka) and the basic principles behind machine learning and statistics, but I do so through three real-world use cases that come from my experience working in the field of sports.
For example, in one of the lectures (“injury prediction based on exposure records”), I teach you how to get a dataset and transform it so that you can answer a particular question, while taking into account the uncertainty and data issues associated with the problem. In another lecture (“predicting the recovery time”), I teach about some of the issues you might meet when presenting sensitive results to a non-technical audience.
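As a rough idea of what "transforming a dataset while accounting for uncertainty and data issues" can look like, here is a minimal Python sketch. It is not the method used in the lecture: the records are invented, the function names are hypothetical, and the choice of "injuries per 1,000 exposure hours" with a percentile bootstrap interval is my assumption for the sake of the example.

```python
import random

# Invented exposure records: (player_id, hours_exposed, injuries).
records = [
    ("p1", 120.0, 1),
    ("p2", 95.5, 0),
    ("p3", 140.0, 2),
    ("p4", 60.0, 0),
    ("p5", 110.0, 1),
    ("p6", None, 0),   # data issue: exposure was never recorded
    ("p7", 130.0, 0),
    ("p8", 0.0, 0),    # data issue: zero exposure carries no information
]


def clean(rows):
    """Drop rows with missing or non-positive exposure."""
    return [(pid, h, n) for pid, h, n in rows if h is not None and h > 0]


def rate_per_1000h(rows):
    """Injuries per 1,000 hours of exposure, pooled over all rows."""
    hours = sum(h for _, h, _ in rows)
    injuries = sum(n for _, _, n in rows)
    return 1000.0 * injuries / hours


def bootstrap_interval(rows, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for the pooled rate, resampling players."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    estimates = sorted(
        rate_per_1000h(rng.choices(rows, k=len(rows)))
        for _ in range(n_resamples)
    )
    lo_idx = int((1 - level) / 2 * n_resamples)
    hi_idx = int((1 + level) / 2 * n_resamples) - 1
    return estimates[lo_idx], estimates[hi_idx]


rows = clean(records)
rate = rate_per_1000h(rows)
low, high = bootstrap_interval(rows)
print(f"{rate:.2f} injuries per 1,000 h (95% interval: {low:.2f} to {high:.2f})")
```

With so few players the interval comes out wide, and that is the point: reporting the rate without the interval would suggest more certainty than the data supports.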
Also, the course is not just for people who want to become data scientists. While R and Python are the most popular languages among data scientists, Weka is an amazing tool for doing machine learning through a graphical user interface. You don’t have to know how to code to analyse data. This is clearly very useful for people outside tech who want to use data science without having to go deep into it.