It’s common for machine learning teams to get great results on a test set, deploy the model to a real-world setting, spend two weeks watching to make sure it still performs well, then move on to another project.
However, because machine learning models often interact with real-world events, not just static datasets, their accuracy can degrade over time. And without automated monitoring, it’s very possible that no one will notice for quite some time.
At some point, probably far later than ideal, you might pick this up. Your team will go back, retrain the model, push out an update to production, monitor manually for a couple of weeks, then move on again.
While this optimistic mindset of “assume things are working as expected” can work in cases where the results aren’t that important, it can be disastrous in higher-impact scenarios.
Here’s how you can automatically detect model decay, update a model, and deploy it to production using MLOps.
How to think about automated monitoring
The first step to automated monitoring is detecting issues: proactively identifying model decay rather than reacting only once problems surface downstream. Even if retraining and redeployment are still manual at this stage, catching decay early prevents a lot of damage.
In theory, this is straightforward. You set up a monitoring system and it pushes you an alert when your system’s accuracy falls below a certain threshold.
In reality, it’s more difficult. You might have a manually annotated “gold” dataset for training, but in production, you don’t know if your model is getting things wrong: if you did, you wouldn’t need the model in the first place.
Three strategies for monitoring models
To calculate your predictions’ accuracy, you can do one of the following:
- Monitor a metric that correlates with accuracy (see the sketch after this list). If a spam classifier generally predicts 30% of messages as spam but one day predicts 75%, for example, there’s probably not been a huge rise in actual spam; it’s more likely the model is broken.
- Sample with a human in the loop. For example, you might manually check 5% of all predictions and monitor the accuracy of that sample.
- Monitor retrospectively. If you make predictions about the future, like whether a patient will recover in a month, then you’ll be able to verify your accuracy later on.
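Here’s a minimal sketch of the first strategy, monitoring a proxy metric: compare the share of messages classified as spam against a historical baseline and raise an alert when it drifts too far. The baseline rate and deviation threshold are illustrative values, not tuned recommendations.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DriftAlert:
    metric: str
    value: float
    baseline: float


def check_spam_rate(predictions: list,
                    baseline_rate: float = 0.30,
                    max_deviation: float = 0.15) -> Optional[DriftAlert]:
    """Return an alert if today's spam rate deviates too far from the baseline."""
    if not predictions:
        return None
    spam_rate = sum(predictions) / len(predictions)
    if abs(spam_rate - baseline_rate) > max_deviation:
        return DriftAlert(metric="spam_rate", value=spam_rate, baseline=baseline_rate)
    return None


# Example: 75% of today's messages flagged as spam triggers an alert.
alert = check_spam_rate([1] * 75 + [0] * 25)
if alert:
    print(f"Possible model decay: {alert.metric}={alert.value:.2f} "
          f"vs baseline {alert.baseline:.2f}")
```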
Once you have a metric that acts as a reasonable quality check, you’ll need a tool to monitor it. This can either be a generic monitoring tool like DataDog or a specialized machine-learning monitoring tool. A specialized tool makes it easier to get more useful metrics for machine learning scenarios, like an error distribution or Mean Absolute Error. But if your use case is simpler and you’re already set up with another monitoring tool, it’s probably faster to start with that.
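If you go the generic route, shipping your proxy metrics as custom metrics is usually enough to build alerts on. Below is a minimal sketch assuming the official `datadog` Python client and a locally running DogStatsD agent; the metric names and values are illustrative.

```python
from datadog import initialize, statsd

# Assumes a DogStatsD agent listening locally on the default port.
initialize(statsd_host="127.0.0.1", statsd_port=8125)


def report_daily_metrics(spam_rate: float, sampled_accuracy: float) -> None:
    # Gauges can back threshold-based monitors in the DataDog UI,
    # e.g. alert when spam_classifier.sampled_accuracy drops below 0.9.
    statsd.gauge("spam_classifier.spam_rate", spam_rate)
    statsd.gauge("spam_classifier.sampled_accuracy", sampled_accuracy)


report_daily_metrics(spam_rate=0.31, sampled_accuracy=0.94)
```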
Automatically retraining and deploying your models
Detecting decay is the most important step, but it’s not that valuable unless you can do something about it. Manually retraining a model and deploying it is one option, but then your team needs to be constantly “on call”, waiting for the alarms to trigger.
Manual retraining involves a lot of effort and carries more risk of human error, especially if you need to do this frequently.
If you have a full MLOps architecture set up, you should have everything in place to automatically retrain your models and deploy them – either on a fixed schedule or when you detect model decay.
In software engineering, a CI/CD or “continuous integration/continuous deployment” framework is common. This means you’ve set things up so that new code can be automatically integrated into existing code, tested, and deployed.
In MLOps, you also want to retrain your models, so you need CT/CI/CD, or “continuous training/continuous integration/continuous deployment”.
Setting up a CT/CI/CD pipeline does take upfront investment, but in our experience most teams stick with manual processes for longer than they should.
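Here’s a minimal sketch of the “continuous training” stage: retrain on fresh data, evaluate against a held-out set, and only promote the new model if it beats the accuracy currently observed in production. The synthetic dataset and the `deploy()` placeholder are hypothetical stand-ins; a real pipeline would pull the latest labelled data and push to your serving layer.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def deploy(model) -> None:
    # Hypothetical placeholder: push the model to your serving layer.
    print(f"Deploying {model.__class__.__name__} to production")


def retrain_and_maybe_deploy(production_accuracy: float) -> None:
    # Stand-in for "pull the latest labelled data" in a real pipeline.
    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.2, random_state=0
    )

    candidate = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    candidate_accuracy = accuracy_score(y_val, candidate.predict(X_val))

    # Promotion gate: only replace the production model if the candidate improves on it.
    if candidate_accuracy > production_accuracy:
        deploy(candidate)
    else:
        print("Candidate did not beat production accuracy; keeping the current model.")


retrain_and_maybe_deploy(production_accuracy=0.85)
```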
Tools you can use against model decay
There are plenty of MLOps tooling options, but here’s what we use in automatically retraining our models to avoid model decay:
- Prefect as a workflow and dataflow tool. This lets us define a graph of tasks to be run and then automatically execute them, monitoring and re-running if necessary. You can read more about what we love about Prefect. Prefect not only handles the automatic retraining of models, but also the scheduled monitoring to help us analyse and detect decay (see the sketch after this list).
- Feast as a feature store. With Feast, you can track changes to data and share features between models. Learn more about feature stores in our article, Choosing a Feature Store.
- Seldon for serving models. With Seldon, you can easily turn any model into an API and serve it at scale, without writing serving code from scratch.
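To show how these pieces fit together, here’s a minimal sketch of wiring monitoring and retraining into a single Prefect flow, assuming Prefect 2.x’s `@task`/`@flow` decorators. The `measure_proxy_metric`, `retrain_model`, and `deploy_model` tasks are hypothetical stand-ins for the steps described above; in practice you would run the flow on a schedule via a Prefect deployment.

```python
from prefect import flow, task


@task
def measure_proxy_metric() -> float:
    # Hypothetical: compute the day's spam rate (or sampled accuracy) from logs.
    return 0.31


@task
def retrain_model():
    # Hypothetical: pull fresh features from the feature store and refit the model.
    ...


@task
def deploy_model():
    # Hypothetical: push the retrained model to the serving layer.
    ...


@flow
def monitor_and_retrain(baseline: float = 0.30, max_deviation: float = 0.15):
    metric = measure_proxy_metric()
    # Only kick off retraining and redeployment when the proxy metric drifts.
    if abs(metric - baseline) > max_deviation:
        retrain_model()
        deploy_model()


if __name__ == "__main__":
    monitor_and_retrain()
```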
Do you need help setting up MLOps for your research team?
What often makes teams hesitate to invest in a proper architecture is the number of unknowns: they don’t know how long it will take or how to start. We’ve done it before and would enjoy helping you skip the grunt work, so contact us if you want to chat.