A common way to serve models goes something like this:
- Your data scientists build a model and test it. It works well and you ship it to production.
- Your engineering team has plenty of API experience – they take their toolkit of Flask, Django, FastAPI, or another web framework and write some standard API endpoints (see the sketch after this list).
- This works ... kind of. The team realizes the model file is much larger than the assets they work with in most APIs, and it takes several seconds to load from disk. Under heavier load, the hardware is strained because each request loads the entire model file into memory.
- After a while, your data scientists release a new model. They’re confident it beats the old one, but they can’t be sure until it’s been tested on production data. Your engineers write a quick A/B testing script to slowly roll out the new model while monitoring results.
- Scaling is still a problem, so your engineers hack in a memory-mapping solution.
- After several months, you have a large codebase surrounding your model file – it started as a simple 100-line server script, but as more and more features were added, things got out of hand.
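In practice, that first version often looks something like the minimal Flask sketch below. The model path and payload format are illustrative assumptions, not any particular team’s code:

```python
# app.py - a hand-rolled model server of the kind described above.
# Assumes a scikit-learn model saved with joblib; names are placeholders.
from flask import Flask, jsonify, request
import joblib

app = Flask(__name__)

# Loading the model once at startup already takes several seconds;
# reloading it per request is the "strained hardware" failure mode above.
model = joblib.load("model.joblib")

@app.route("/predict", methods=["POST"])
def predict():
    features = request.get_json()["features"]
    prediction = model.predict([features])
    return jsonify({"prediction": prediction.tolist()})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```

Every team that goes down this road ends up bolting A/B testing, monitoring, and scaling logic onto a script like this one.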
None of these “extra” components exist because of your unique needs. All machine learning teams have nearly identical requirements, yet many rebuild the same code from scratch for each project.
But you can avoid this with a model serving tool: a framework that handles all of these common requirements for you, while staying easy to adapt to your own use cases. One great model serving tool is Seldon.
How does Seldon solve these problems for us?
As part of our Open MLOps architecture, we use Seldon-core, an open source model serving tool that runs on Kubernetes.
This lets us turn our model files into REST APIs, almost completely automatically.
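For example, with Seldon’s Python wrapper you write a small class and Seldon generates the REST (and gRPC) endpoints around it. A minimal sketch, assuming a scikit-learn model stored with joblib at a path of our choosing:

```python
# Model.py - a minimal Seldon Python model wrapper.
# The model path and joblib format are assumptions for illustration.
import joblib

class Model:
    def __init__(self):
        # Loaded once when the serving container starts, not per request.
        self._model = joblib.load("/mnt/models/model.joblib")

    def predict(self, X, features_names=None):
        # X is the parsed request payload (e.g. a numpy array).
        return self._model.predict(X)
```

Compare this to the Flask sketch earlier: the routing, serialization, and scaling concerns are handled by Seldon rather than by our own code.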
If we want to roll out a new model gradually, as part of a test, we can do that with a simple setting, instead of writing logic to handle this from scratch.
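For illustration, here is roughly what such a gradual rollout looks like: a SeldonDeployment with two predictors and traffic weights, applied with the official Kubernetes Python client. The deployment name, namespace, and model URIs are placeholders:

```python
# canary.py - sketch of a gradual rollout: 90% of traffic to the current
# model, 10% to the candidate. All names and URIs are placeholders.
from kubernetes import client, config

deployment = {
    "apiVersion": "machinelearning.seldon.io/v1",
    "kind": "SeldonDeployment",
    "metadata": {"name": "churn-model", "namespace": "seldon"},
    "spec": {
        "predictors": [
            {
                "name": "main",
                "traffic": 90,
                "graph": {
                    "name": "classifier",
                    "implementation": "SKLEARN_SERVER",
                    "modelUri": "s3://models/churn/v1",
                },
            },
            {
                "name": "canary",
                "traffic": 10,
                "graph": {
                    "name": "classifier",
                    "implementation": "SKLEARN_SERVER",
                    "modelUri": "s3://models/churn/v2",
                },
            },
        ]
    },
}

config.load_kube_config()
client.CustomObjectsApi().create_namespaced_custom_object(
    group="machinelearning.seldon.io",
    version="v1",
    namespace="seldon",
    plural="seldondeployments",
    body=deployment,
)
```

Shifting more traffic to the new model is then a matter of changing the two `traffic` numbers, not rewriting routing code.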
Most importantly, Seldon lets us fully automate our machine learning pipeline. It finds the latest version of a trained model using our model registry, turns this into an API, and serves it to end users without us manually changing any code or configuration.
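As a hedged sketch of that lookup step, assuming an MLflow-style model registry and a hypothetical model name:

```python
# promote.py - sketch of the automated lookup: find the newest production
# model in the registry and point the serving layer at its artifacts.
# Assumes an MLflow-style registry; the model name is hypothetical.
from mlflow.tracking import MlflowClient

registry = MlflowClient()
latest = registry.get_latest_versions("churn-model", stages=["Production"])[0]
print(f"Deploying version {latest.version} from {latest.source}")

# A pipeline step would now patch the SeldonDeployment's modelUri
# (as in the canary sketch above) to latest.source - no manual edits.
```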
Seldon handles rolling out, serving, monitoring, and optimizing our models. This means it helps us build machine learning solutions that are fully automated and reproducible.
Seldon not only helps us optimize our existing manual processes; it also lets us do things we couldn’t do before, like monitoring our models for bias, making them more interpretable, and getting easy access to relevant accuracy and performance metrics.
How does Seldon integrate with the rest of our architecture?
Model serving is only one component in our Open MLOps architecture. It integrates with our model registry, our feature store, and our experimentation hub.
This means our team can build experiments in a notebook environment, track every feature and model they make in a feature store and model registry, and then serve the models to production via our model serving tool.
End users can then access the models as an API.
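Calling a deployed model then looks like any other REST request. A sketch using Seldon’s prediction endpoint, with the host, namespace, deployment name, and feature values as placeholders:

```python
# client.py - querying a deployed model over Seldon's REST protocol.
# Host, namespace, deployment name, and features are placeholders.
import requests

response = requests.post(
    "http://<cluster-host>/seldon/seldon/churn-model/api/v1.0/predictions",
    json={"data": {"ndarray": [[5.1, 3.5, 1.4, 0.2]]}},
)
print(response.json())
```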
Do you need help with your machine learning architecture?
We’ve tried most of the tools out there and chosen the ones that work best for us. If you’re setting up a machine learning team and want to discuss your architecture for building and deploying models, we’d love to help.
Set up a free call with our CEO.