Machine learning is becoming ubiquitous. More and more services rely on it, and while embedded systems increasingly can run models themselves, there are good reasons to execute models in the cloud: models can be huge, like the multi-billion-parameter GPT models; they may contain sensitive data; or their output may need to be sanitized before it reaches the client. This post covers the fundamentals of a scalable ML architecture.

As a demo, I have implemented the core of a scalable system that exposes an online API for executing a model. The architecture is relatively straightforward: an Elixir frontend, which is expected to carry out responsibilities like authentication; RabbitMQ as the queue system; and a Python service that executes the models.

The Walkthrough

First, the code is available on GitHub. There are three components to pay attention to: the Elixir GraphQL API server, the Python model runner, and the infrastructure defined in the respective Dockerfiles and the docker-compose.yml.

The application server is a standard Elixir Phoenix setup using Absinthe to handle the GraphQL interface and AMQP to talk to the RabbitMQ message broker. One thing to keep in mind for a service that is part of a more extensive setup is that we need to develop and test it in isolation. Hence, the module that implements RPC over RabbitMQ sits behind a mockable interface and has an IO implementation that merely echoes what it receives. This yields a solution where the only code that cannot be tested at the service level is the production implementation of the RPC module, which is sufficient.
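The Elixir implementation lives in the repo; purely as an illustration of the pattern, here is a minimal Python sketch of an RPC module hidden behind an interface, with an echo implementation for tests. All names here (Rpc, EchoRpc, RabbitMqRpc) are hypothetical and not taken from the repo.

```python
# Hypothetical Python sketch of the "mockable RPC module" pattern described above.
# The real project implements this in Elixir; names here are illustrative only.
from abc import ABC, abstractmethod


class Rpc(ABC):
    """Interface the rest of the service depends on."""

    @abstractmethod
    def call(self, payload: str) -> str: ...


class EchoRpc(Rpc):
    """Test/dev implementation: simply echoes the request back."""

    def call(self, payload: str) -> str:
        return payload


class RabbitMqRpc(Rpc):
    """Production implementation: publishes the request to RabbitMQ and blocks
    until the reply with the matching correlation id arrives. Omitted here;
    this is the only part that cannot be covered by service-level tests."""

    def call(self, payload: str) -> str:
        raise NotImplementedError


def translate(rpc: Rpc, text: str) -> str:
    # Service-level code only sees the Rpc interface, so it can be exercised
    # against EchoRpc without a running broker.
    return rpc.call(text)
```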

The model itself is written in Python using PyTorch. The example code has been moved over and augmented with the glue that makes it respond to new requests on the queue.
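The actual glue is in the repo; as a sketch of what it typically looks like, here is a runner loop using the pika client, assuming a queue named translation_requests and a placeholder translate() standing in for the real PyTorch call.

```python
# Hypothetical sketch of the model-runner glue, using the pika RabbitMQ client.
# The queue name and translate() are placeholders, not the repo's actual code.
import json

import pika


def translate(text: str) -> str:
    # Stand-in for the PyTorch model invocation.
    return text[::-1]


def on_request(channel, method, properties, body):
    payload = json.loads(body)
    result = translate(payload["text"])
    # Reply on the queue named in reply_to, echoing the correlation_id
    # so the API server can match the response to the original request.
    channel.basic_publish(
        exchange="",
        routing_key=properties.reply_to,
        properties=pika.BasicProperties(correlation_id=properties.correlation_id),
        body=json.dumps({"result": result}),
    )
    channel.basic_ack(delivery_tag=method.delivery_tag)


def main():
    connection = pika.BlockingConnection(pika.ConnectionParameters(host="rabbitmq"))
    channel = connection.channel()
    channel.queue_declare(queue="translation_requests")
    channel.basic_qos(prefetch_count=1)  # one request at a time per runner
    channel.basic_consume(queue="translation_requests", on_message_callback=on_request)
    channel.start_consuming()


if __name__ == "__main__":
    main()
```

Because each runner is just another consumer on the same queue, adding capacity amounts to starting more containers of this process.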

Lastly, everything is orchestrated with docker-compose. This includes a Dockerfile for each project and a docker-compose.yml file to bring everything up.

Concerns

When developing complex solutions, there is a plethora of concerns to be mindful of. These are some of them.

API design: The API has a single query, translate. This choice is mainly for convenience and can certainly be challenged. We would probably use mutations for ML model invocations, as they have side effects enough to warrant it: some models may return different outputs per invocation, and in all cases an invocation incurs a cost, which is why we probably want to discourage clients from re-invoking several times as a retry strategy.
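To make the distinction concrete, here is a hypothetical client call showing both forms. The endpoint URL, argument name, and result field are assumptions, not the repo's exact schema; only the translate field name comes from the demo.

```python
# Hypothetical client calls illustrating the query-vs-mutation choice.
# Endpoint, argument, and result field names are assumptions.
import requests

API_URL = "http://localhost:4000/api/graphql"  # assumed endpoint

# The demo exposes translation as a query:
query = 'query { translate(text: "hello world") { result } }'

# A mutation would better signal the side effects (and cost) of invoking the model:
mutation = 'mutation { translate(text: "hello world") { result } }'

response = requests.post(API_URL, json={"query": mutation})
print(response.json())
```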

Indirection: Using a queue system introduces indirection. Messages are sent, and we wait for the reply. These systems have no call stack spanning the services at runtime, and it is even less possible to follow a call in the code editor. In other words, indirection makes code harder to read. Use indirection sparingly and only when there is a good reason.

Scaling: The core decisions in this architecture were made with scaling in mind. We can freely add as many model runners to the system as we see fit. A natural next step would be to add auto-scaling to the model runners in order to respond quickly at peak times and save money when demand is low.

Models: In this example, the model weights have simply been added to the Docker image. If the model is updated, it is an orchestration concern to make sure that the running containers are updated. This should be as straightforward as starting new containers and signalling the old ones to shut down.

However, if there is enough churn on the models, say an individual model per user, we probably need another strategy. Depending on the use case, we might run all the models in parallel behind multiple topics in RabbitMQ, or we can download the models as files on demand and cache them.
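If models are fetched on demand, the runner needs a small in-process cache so it does not reload weights on every request. A minimal sketch, assuming one weights file per user under a shared /models volume; the storage layout and build_model() stand-in are assumptions, not the repo's code.

```python
# Hypothetical sketch of per-user models loaded on demand and cached in memory.
# Storage layout and build_model() are assumptions for illustration.
from functools import lru_cache
from pathlib import Path

import torch

MODEL_DIR = Path("/models")  # e.g. a volume populated from object storage


def build_model() -> torch.nn.Module:
    # Stand-in for whatever architecture the stored weights belong to.
    return torch.nn.Linear(16, 16)


@lru_cache(maxsize=16)  # evicts the least recently used models when full
def get_model(user_id: str) -> torch.nn.Module:
    state = torch.load(MODEL_DIR / f"{user_id}.pt", map_location="cpu")
    model = build_model()
    model.load_state_dict(state)
    model.eval()
    return model
```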

Final Thoughts

This implementation is representative and by no means production-ready; a lot of tweaking remains. However, it shows the basic structure for services where the resource-intensive parts are decoupled from the lightweight ones.

When developing these types of infrastructures, expect them to change once they are exposed to real-life usage patterns. Furthermore, this simple example touches on many different technologies. In a scale-up scenario, this system would probably be managed by at least three people: a data scientist/engineer developing the model runner, an application engineer developing the API service, and somebody with infrastructure as their primary concern.