Machine learning teams often face the same set of challenges, and MLOps is the set of practices and processes most teams should follow to address them.
To make MLOps more concrete, we’ll look at what problems it solves for research teams. You can also use our free open-source architecture to set up everything described in this article yourself.
Why does your team need MLOps?
Many machine learning teams focus on building novel algorithms or on training high-performing models. While these are fundamental components of any machine learning solution, they’re relatively small compared to the surrounding processes a real-world system needs, such as data engineering and monitoring.
Research teams tend to focus more on the machine learning code than the surrounding infrastructure. With an increasing need to collaborate, share results, and re-use work done on large datasets, research teams quickly reach the limit of this approach. MLOps solves these problems and allows research teams to achieve their goals despite the complexity that comes from dealing with large datasets, code, and machine learning models.
What challenges might you face without an MLOps architecture?
Challenge #1: You can’t rebuild existing models
Your team has a trained model that works well but can’t be rebuilt. Data and software versions have changed, or the team member who set up the training pipeline on their local machine has left. Your team can share the trained model file directly but you can’t improve or update it, because some of the steps to build it are now lost.
Challenge #2: You can’t effectively audit or monitor your models
Your team’s model performed excellently at the evaluation stage, but there is no ongoing auditing or monitoring to ensure its predictions still make sense. Your team has to take it on blind faith that the model is operating as expected.
Challenge #3: You can’t easily share models
Each member of your team has built their own pipeline to manage data and train models. It’s difficult to collaborate or share results because you can’t easily work on each other’s models, or re-use each other’s intermediate data.
Challenge #4: Your proof-of-concept looked promising but training at scale crashes your hardware
You were able to train your model on a small dataset on your laptop, but it’s difficult to scale, either by pushing it out to a more powerful machine or by taking advantage of compute clusters.
This means translating the proof-of-concept to a production solution takes significant effort.
Challenge #5: You find errors too late
Because your team builds infrastructure and glue code from scratch for each project, bugs and other issues occur frequently and are often only caught after results have been released. The same problems reoccur for each project because you can’t easily re-use code.
How can MLOps architecture address these challenges?
Implementing MLOps practices and using a standardized architecture can help solve all of these challenges. Most machine learning teams should have an architecture that includes the following:
An experimentation hub
An experimentation hub lets your researchers develop notebooks, experiment with new models and architectures, and validate hypotheses. It helps team members share code and ensures they can reproduce each other’s models.
You only need to set things up once because you’re all working in the same environment. The shared hub prevents headaches over hardware-specific issues, like team members using different operating systems or having underpowered development machines.
We use JupyterHub.
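As an illustration, here is a minimal sketch of what a shared hub configuration could look like. It assumes a Docker-based JupyterHub setup; the spawner, image name, resource limits, and usernames are hypothetical and not part of any specific deployment.

```python
# jupyterhub_config.py — a minimal sketch of a shared hub configuration.
# The spawner, image, limits, and usernames below are illustrative assumptions.

c = get_config()  # provided by JupyterHub when it loads this file

# Spawn each user's notebook server from the same Docker image so everyone
# works in an identical environment (requires the dockerspawner package).
c.JupyterHub.spawner_class = "dockerspawner.DockerSpawner"
c.DockerSpawner.image = "example-org/research-notebook:latest"  # hypothetical image

# Give every researcher the same resources, regardless of their own laptop.
c.Spawner.mem_limit = "8G"
c.Spawner.cpu_limit = 2

# Control who can log in to the hub.
c.Authenticator.allowed_users = {"alice", "bob"}  # example usernames
c.Authenticator.admin_users = {"alice"}
```

Because the environment is defined once in configuration like this, new team members get the same setup on day one instead of rebuilding it locally.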
A model registry and experiment tracker
A model registry stores each model your team produces with a name and one or more versions. An experiment tracker is similar but works at a higher level: defining names, versions, and metadata for your team’s experiments.
By storing and versioning every single model and experiment your team produces, you can always reproduce the results of a specific experiment. You won’t be blocked because you no longer know which model produced a particular set of results or which parameters you used for a historical experiment.
This will also help you comply with any necessary legal requirements to justify your results.
We use MLflow.
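To make this concrete, here is a minimal sketch of logging a training run with MLflow’s tracking API. The experiment name, parameters, dataset, and registered model name are illustrative assumptions, and registering a named model version assumes a tracking server with a model registry backend.

```python
# A minimal sketch of tracking one training run and registering the model.
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

mlflow.set_experiment("baseline-classifier")  # hypothetical experiment name

with mlflow.start_run():
    params = {"n_estimators": 100, "max_depth": 5}
    model = RandomForestClassifier(**params).fit(X, y)

    mlflow.log_params(params)  # record the hyperparameters for this run
    mlflow.log_metric("train_accuracy", accuracy_score(y, model.predict(X)))

    # Store the model artifact and register it as a new named version.
    mlflow.sklearn.log_model(
        model,
        "model",
        registered_model_name="baseline-classifier",  # hypothetical registry name
    )
```

Every run logged this way can later be looked up by experiment and version, which is what makes historical results reproducible and auditable.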
A model serving tool
A model serving tool automatically deploys your model to staging or production environments and makes a unified API accessible to your team and your end users.
A unified API and infrastructure across all of your models make it easier to use and monitor them. Automatically deploying your models to staging and production environments means the latest versions can be put to use immediately.
We use Seldon.
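For a sense of what the unified API looks like from a consumer’s point of view, here is a hedged sketch of calling a model served by Seldon over its standard v1 "data/ndarray" prediction protocol. The host, namespace, deployment name, and feature values are all illustrative assumptions.

```python
# A minimal sketch of querying a Seldon-served model over its prediction API.
import requests

SELDON_URL = (
    "http://seldon.example.com/seldon/"  # hypothetical ingress host
    "research/iris-classifier/"          # hypothetical namespace and deployment name
    "api/v1.0/predictions"
)

# One row of input features in Seldon's v1 request format.
payload = {"data": {"ndarray": [[5.1, 3.5, 1.4, 0.2]]}}

response = requests.post(SELDON_URL, json=payload, timeout=10)
response.raise_for_status()

# The predictions come back in the same data/ndarray structure.
print(response.json()["data"]["ndarray"])
```

Because every model sits behind the same request and response format, dashboards, monitoring, and downstream services don’t need to change when a new model or version is deployed.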
A dataflow tool
A dataflow tool keeps track of every step in your pipelines, and can monitor and rerun steps as required. This prevents situations where a model can’t be rebuilt, because you can always rerun exactly the same steps you used to build it.
It also saves your team’s time and prevents human error in repetitive work.
We use Prefect.
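As a rough illustration, here is a minimal sketch of a training pipeline expressed with Prefect’s decorator-based API (Prefect 2.x). The step names and their contents are placeholders rather than a real pipeline.

```python
# A minimal sketch of a training pipeline as a Prefect flow.
from prefect import flow, task


@task(retries=2)  # failed steps can be retried automatically
def load_data():
    ...  # e.g. pull the raw training data from your data warehouse


@task
def build_features(raw_data):
    ...  # e.g. clean and transform the raw data into features


@task
def train_model(features):
    ...  # e.g. fit the model and log it to the experiment tracker


@flow(name="training-pipeline")  # hypothetical flow name
def training_pipeline():
    raw = load_data()
    features = build_features(raw)
    train_model(features)


if __name__ == "__main__":
    training_pipeline()  # every step is tracked and can be rerun individually
```

Once the pipeline is written this way, rebuilding a model is a matter of rerunning the flow rather than reconstructing someone’s undocumented local steps.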
A feature store
Sometimes your team needs the exact features used to train a specific model but the underlying data has since changed. Or maybe you’ve spent time building specific features and now want to re-use them for a new model. A feature store keeps versioned copies of every feature so your team can re-use them and collaborate more easily.
We use Feast.
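To give a flavour of how this works, here is a hedged sketch of retrieving point-in-time correct training features with Feast’s Python SDK. The repository path, entity key, feature view, and feature names are illustrative assumptions.

```python
# A minimal sketch of reading historical features back from a Feast store.
from datetime import datetime

import pandas as pd
from feast import FeatureStore

store = FeatureStore(repo_path=".")  # points at a local feature repository

# The entities and event timestamps we want features for.
entity_df = pd.DataFrame(
    {
        "customer_id": [1001, 1002],                  # hypothetical entity key
        "event_timestamp": [datetime(2023, 1, 1)] * 2,
    }
)

# Retrieve feature values as they were at each timestamp, so training data
# can be reproduced even after the underlying data has changed.
training_df = store.get_historical_features(
    entity_df=entity_df,
    features=[
        "customer_stats:total_purchases",    # hypothetical feature view:feature
        "customer_stats:days_since_signup",
    ],
).to_df()

print(training_df.head())
```

Because features are defined and versioned centrally, two team members asking for "customer_stats" get exactly the same values, which is what makes re-use and collaboration straightforward.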
Is MLOps ‘nice-to-have’ or is it essential?
While some research teams still operate without MLOps tools or best practices, we believe MLOps has become an essential ingredient for nearly all teams. Unless your team is very small or working only on trivial problems, your machine learning research also needs to be reproducible, accountable, collaborative, and continuous. Without MLOps, meeting these goals is very challenging.
Do you need help setting up MLOps in your research team?
We love finding the right MLOps architecture for machine learning research teams. Contact us to find out more.