[Note: build the model yourself here using our fully interactive notebook. No prior coding experience required.]
If you’re like me, you need to play with something and “do it yourself” to really understand it. Here we’ll explain how machine learning really works, by example.
You’ll build your own machine learning model to predict the likelihood of passengers on the Titanic surviving. The model will learn patterns by itself, just by looking at data.
Understanding the steps for doing machine learning
Follow along to:
- Load the data and explore it with visualizations;
- Prepare the data for the machine learning algorithm;
- Train the model – let the algorithm learn from the data;
- Evaluate the model – see how well it performs on data it has not seen before;
- Analyze the model – see how much data it needs to perform well.
To build the machine learning model yourself, open the companion notebook. You’ll run real machine learning code without needing any set-up – it just works.
Understanding the tooling for machine learning
There are lots of options when it comes to machine learning tooling. In this guide, we use some of the most popular and powerful machine learning libraries, namely:
- Python: a high-level programming language known for its readability, and the most popular machine learning language worldwide.
- Pandas: a Python library that brings spreadsheet-like functionality to the language.
- Seaborn: a Python library for plotting charts and other graphics.
- Scikit-learn: a machine learning library for Python, offering simple tools for predictive data analysis.
- DRLearn: our own DataRevenue Learn module, built for this dataset.
These are good tools to start with, since they’re used by both beginners and huge companies (like J.P. Morgan).
Exploring our dataset
We’ll use the famous “Titanic” dataset – a slightly morbid but fascinating dataset containing details of the passengers on the Titanic. We have a bunch of data for each passenger, including:
- name,
- gender,
- age,
- ticket class.
Our data takes a standard form of rows and columns, where each row represents a passenger and each column an attribute of that passenger. Here’s a sample:
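If you want to peek at the raw table yourself, here's a minimal sketch. It uses the copy of the Titanic dataset that ships with seaborn; the notebook's own copy may have slightly different column names.

```python
# A sketch for loading a copy of the Titanic dataset. Seaborn bundles one;
# the companion notebook ships its own version with similar columns.
import seaborn as sns

df = sns.load_dataset("titanic")
print(df.head())  # the first five passengers, one row each
```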
Visualizing our dataset
Machine learning models are smart, but they can only be as smart as the data we feed them. Therefore an important first step is gaining a high-level understanding of our dataset.
When it comes to analyzing the data, a good starting point is testing a hypothesis. People with first-class tickets were probably more likely to survive, so let’s see if the data supports that.
You can see and run the code to produce this visualization in the companion notebook.
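If you're curious what that code looks like, here's a minimal sketch using seaborn and the `df` loaded above (not the notebook's exact code). Since `survived` is a 0/1 column, `sns.barplot` averaging it per class gives exactly the survival rate.

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Bar height = mean of the 0/1 "survived" column, i.e. the survival rate.
sns.barplot(data=df, x="class", y="survived")
plt.ylabel("survival rate")
plt.show()
```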
Over 60% of the people in first class survived, while less than 30% of those in third class did.
You might also have heard the phrase "women and children first." Let's take a look at how gender and survival rate interact.
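Producing this chart is the same one-liner as before, just grouping by a different column – again a sketch, not the notebook's exact code:

```python
# Same plot as above, grouped by gender instead of ticket class.
sns.barplot(data=df, x="sex", y="survived")
plt.ylabel("survival rate")
plt.show()
```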
Again, we see that our hypothesis was right. Over 70% of women survived, while only around 20% of men did.
Just like that, we’ve created two basic visualizations of our dataset. We could do a lot more here (and for production machine learning projects, we certainly would). For example, multivariate analysis would show what happens when we look at more than a single variable at a time.
Preparing our data
Before we feed our data into a machine learning algorithm to train our model, we need to make it more meaningful to our algorithm. We can do this by ignoring some columns and reformatting others.
Ignoring unhelpful columns
We have no reason to expect any correlation between a passenger’s ticket number and their chance of survival, so we can explicitly ignore that column. We delete it before feeding the data into the model.
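In pandas, ignoring columns is a one-liner. A sketch, continuing with the seaborn copy of the data – that copy has no ticket-number column, so here we simply keep the columns we expect to help (with the Kaggle CSV you would drop its "Ticket" column the same way):

```python
# Keep only the columns we expect to help the model.
df = df[["survived", "pclass", "sex", "age", "sibsp", "parch", "fare"]]
```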
Reformatting our data
Some features are useful, but not in their raw form. For example, the labels "male" and "female" are meaningful to a human but not to a machine, which prefers numbers. Therefore we can encode these markers as "0" and "1" respectively.
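A sketch of that encoding with pandas. We also fill in missing ages with the median age – an extra step of ours, needed because the model we sketch below can't handle blanks:

```python
# Encode the gender labels as numbers: "male" -> 0, "female" -> 1.
df["sex"] = df["sex"].map({"male": 0, "female": 1})

# The model sketched below can't handle blanks, so fill unknown ages
# with the median age (a simple choice for illustration).
df["age"] = df["age"].fillna(df["age"].median())
```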
Once we're done preparing our dataset, the format is more machine friendly. We’ve provided a sample below: we’ve eliminated many useless columns, and the columns that are left all use numbers.
Splitting our dataset in two
Now we need to train our model and then test it. Just as schoolchildren are given example questions as homework but face unseen questions under exam conditions, we’ll train the machine learning algorithm on some of the data and then see how well it performs on the remainder.
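Scikit-learn has a helper for exactly this. A sketch, where `test_size=0.5` matches the half-and-half split described here:

```python
from sklearn.model_selection import train_test_split

# Separate the answers ("survived") from the passenger details.
X = df.drop(columns=["survived"])
y = df["survived"]

# Hold back half the passengers as unseen "exam" questions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=42
)
```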
Let’s train our model!
And now for the fun part! We’ll feed the training data into our model and ask it to find patterns. In this step, we give the model both the data and the desired answers (whether or not the passenger survived).
The model learns patterns from this data.
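We don't know which algorithm the DRLearn module wraps, but a random forest is a common choice for this dataset; here's a sketch with scikit-learn as a stand-in:

```python
from sklearn.ensemble import RandomForestClassifier

# fit() is the training step: the model gets both the passenger details
# (X_train) and the desired answers (y_train) and looks for patterns.
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
```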
Testing our model
Now we can test our model by giving it only the details of the passengers in the other half of our dataset, without the answers. The algorithm doesn’t know whether these passengers survived or not, but it will try to guess based on what it learned from the training set.
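Continuing the sketch, we ask the model to guess and then grade its guesses against the real outcomes:

```python
from sklearn.metrics import accuracy_score

predictions = model.predict(X_test)         # guesses, made without the answers
print(accuracy_score(y_test, predictions))  # fraction of guesses that were right
```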
Analyzing our model
To better understand how our model works, we can:
- Look at which features it relied on the most to make predictions;
- See how its accuracy changes if we use less data.
The first helps us understand our data better, and the second helps us understand whether it’s worth trying to source a larger dataset.
Understanding what our model finds important
A machine learning model learns that not all data is equally informative: by weighting particular details differently, it can make better predictions. The weights below show that gender is by far the most important factor in predicting survival.
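With the random-forest stand-in sketched above, these weights are exposed as `feature_importances_`:

```python
import pandas as pd

# One importance score per column; the scores sum to 1.
importances = pd.Series(model.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False))
```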
We can also look at which aspects of the data the algorithm paid attention to when predicting the survival of a specific passenger. Below we see a passenger who the algorithm thought was very likely to survive. It paid special attention to the fact that:
- The passenger was not in third class;
- The passenger was female.
It lowered the chance of survival slightly because the passenger was also not in first class, resulting in a final survival prediction of 93%.
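The per-feature breakdown above comes from the notebook's own tooling; plain scikit-learn can at least report the predicted survival probability for a single passenger:

```python
# Predicted probability of survival for one passenger from the test set.
passenger = X_test.iloc[[0]]            # double brackets keep it a one-row table
proba = model.predict_proba(passenger)  # [[P(did not survive), P(survived)]]
print(proba[0][1])
```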
Understanding how data quantity affects our model
Let’s train the model multiple times, seeing how much it improves with more data. Here we plot both the training score and the test score. The latter is much more interesting, as it tells us how well the model performs on unseen data.
The training score can be thought of as an “open-book” test: the model has already seen the answers, so it’s much easier for it to score well on data from the training phase. That’s why the training score sits above the test score.
Here we see that the more data the model has, the better it performs. The gains are steepest at the start; after that, adding more data yields only small improvements.
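Scikit-learn's `learning_curve` helper automates this retraining on growing slices of the data; a sketch, reusing the `X` and `y` from earlier:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

# Retrain on growing fractions of the training data (10% up to 100%),
# cross-validating each time.
sizes, train_scores, test_scores = learning_curve(
    RandomForestClassifier(random_state=42),
    X, y, cv=5, train_sizes=np.linspace(0.1, 1.0, 5),
)
print(sizes)                      # number of training passengers per run
print(train_scores.mean(axis=1))  # "open-book" training scores
print(test_scores.mean(axis=1))   # scores on unseen data
```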
Machine learning models don’t have to be “black box” algorithms. Model analysis helps us understand how they work, and how to improve them.
Conclusion
That’s it – you've built your own machine learning model. You’ll now be able to:
- Understand the day-to-day work data science teams do;
- Communicate better with your data science or machine learning team;
- Know what kinds of problems machine learning is best at solving;
- Realize that machine learning is not so intimidating after all.
The complex part of machine learning is getting into all the nitty-gritty details of building and scaling a customized solution. And that’s exactly what we specialize in. So if you need help with the next steps, let us know.