Let's set up a scenario: we have a company that builds models for machine translation. The data scientists work in their own Python notebooks, and the data engineers take these notebooks and implement them as production-ready models that can do inference. This process looks something like the following: the data engineer understands the architecture of the production system. She rewrites the notebook into an architecturally coherent unit that is packaged as an image. This image is then deployed to the production system.

As an experiment, we are going to do just that. As the example notebook, we use the PyTorch tutorial on sequence-to-sequence learning. This notebook needs to be made production-ready for the scalable ml service project, serving as the inference engine for French-to-English translations.

The commit introducing this change is available here.

Hard Constraints

Step one is to make it possible to run the model on the infrastructure with the lowest lift. This requires that we export the notebook to a stand-alone Python file, write wrapper code that provides a minimal interface to the transport layer, and package everything so that we can deploy it to the infrastructure.

We prioritize getting the system to a working state. This means that we defer concerns about architectural soundness of the code, brittle parsing, readability, and so on. We simply need to get something working in a production-ready state. (Hey! Sometimes this is even enough, which is why we should never anticipate work.)

The transport layer consists of a RabbitMQ message broker. There is a queue for RPC calls which we hook into; in particular, we listen on the rpc_queue queue. In the current architecture, the messages sent over the queue contain only the string to be translated, and the translation is sent back.
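The RPC wiring can be sketched roughly as follows, assuming a pika-based consumer. The names here (handle_request, the "rabbitmq" host, the stand-in translate callable) are illustrative assumptions, not the exact code from the commit.

```python
# Sketch of the RPC consumer wiring, assuming the pika client library.
# `translate` stands in for the notebook-derived translation function.

def handle_request(body: bytes, translate) -> bytes:
    """Decode the incoming message, translate it, and encode the reply."""
    sentence = body.decode("utf-8")
    return translate(sentence).encode("utf-8")

if __name__ == "__main__":
    import pika  # the assumed RabbitMQ client

    connection = pika.BlockingConnection(pika.ConnectionParameters("rabbitmq"))
    channel = connection.channel()
    channel.queue_declare(queue="rpc_queue")

    def on_request(ch, method, props, body):
        # Reply on the queue named in the request's reply_to property,
        # echoing the correlation_id so the caller can match the response.
        reply = handle_request(body, lambda s: s)  # plug in translateSentence here
        ch.basic_publish(
            exchange="",
            routing_key=props.reply_to,
            properties=pika.BasicProperties(correlation_id=props.correlation_id),
            body=reply,
        )
        ch.basic_ack(delivery_tag=method.delivery_tag)

    channel.basic_consume(queue="rpc_queue", on_message_callback=on_request)
    channel.start_consuming()
```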

To respond to messages, we simply use the evaluate function the original author had already put in place and make sure it never fails.

def translateSentence(input_sentence):
    try:
        output_words, attentions = evaluate(encoder1, attn_decoder1, input_sentence)
        return ' '.join(output_words)
    except Exception:
        return "Could not translate. Try 'je vais dormir .'"

Lastly, we package everything using a Dockerfile in order to run it as a container on the infrastructure. This yields an up-and-running state, and we can translate.
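A minimal Dockerfile along these lines might look as follows. The file and entry-point names are assumptions for illustration, not the ones from the commit.

```dockerfile
FROM python:3.8-slim

WORKDIR /app

# Install inference dependencies (e.g. torch, pika); requirements.txt is assumed
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Bake the wrapper code and model weights into the image
COPY . .

CMD ["python", "translator.py"]
```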

Clean Up

Playing around with the minimal setup clearly yields a number of problems.

  • The input je vais dormir . translates into i am going to sleep . <EOS>: we definitely do not want these special symbols in the output string.
  • Translation of Je vais dormir. simply fails. The tokenization is wonky and needs to be updated.
  • The code is kind of messy. We carry over source data files in order to build dictionaries etc. on the fly. We want to clean that up a bit as well.
  • We need to distribute the training data in the image even though we are not training anything.

To carry out these refactorings, we do the following:

  1. Remove unused code. The notebook is filled with code to train and evaluate. At this point we have decided that the model is something we want to put into production, and the training and evaluation are done.
  2. Isolate the model architecture from the sentence parsing and translation code, and from the transport layer code. We do this mostly to ensure good architectural practice; it will also increase readability.
  3. Implement parsing for input strings. This is simply done by using the same normalize function as was used on the training data.
  4. Serialize the Lang objects: in order not to re-parse the raw training data to build an internal representation every time an image boots, we save the dumped representations and deserialize them whenever the image boots.

We now have an image that translates strings such as Je vais dormir., which feels much more natural.


Should we test this code? Or, a better question: how should we test this code?

As for unit testing, there are arguments both for and against. The arguments for are mostly dogmatic in nature: we should test because it is best practice. The argument against is that this image should probably be seen as an asset: it is mostly model weights, packaged with just enough code to run inference and push the result back to the client. The production-ready image can be sent back to the data scientist for her to do a final verification that the code indeed preserves semantics.

However, we do want to test this image, in particular as a black box: that we can spin it up, that it returns sensible results when invoked, and that it panics properly when subjected to erroneous input, i.e. that it shuts down so that the infrastructure can spin up new instances.

Final Remarks

This is by no means a final implementation of a translation model. The output is still not syntactically up to speed: it lacks proper casing and spacing between symbols. Furthermore, it would be advantageous to implement a system that guesses misspelled words. E.g., when writing Je vais formir, the system should be able to correct the f to a d in formir. These changes are not going to be implemented in this post.

Furthermore, this image is self-contained and non-persistent. In industry, inference images often pull their model weights from a data store before spinning up. In my opinion, the better solution is to use the image registry for this. There are a number of reasons:

  1. We get versioning for free.
  2. Model weights and model architecture are tightly coupled. This reduces the risk of runtime errors caused by mismatching weights and architecture code.
  3. There are fewer components in the system.