
Structuring your Python machine learning projects: An opinionated review

Jul 15, 2021

Building Python machine learning projects that are both maintainable and easy to deploy is hard. Managing data pipelines, training multiple models, deploying to production, and versioning can all become a pain in the neck, and everything can get out of control very quickly if not handled properly. That’s why it is important to learn a sound way of structuring machine learning projects.

Thankfully, today there are many cloud services to choose from for training models and deploying them into production, such as AWS SageMaker, Google’s AI and Machine Learning products, or Azure’s Machine Learning. However, all of these services encourage the use of Jupyter Notebooks and isolated Python scripts, which leads to projects that are hard to maintain, full of repeated boilerplate code, and lacking good practices for teamwork and collaboration.

In this article, we share an opinionated overview of how to structure the code and processes of an ML project while following good old-fashioned software engineering practices. In particular, we are going to focus on using AWS SageMaker for training and production inference purposes. If you happen to not use AWS SageMaker, we still believe you’ll find the content of this article handy.

Needless to say, we like our ML projects to look like any other software engineering (SE) project: modular, easy to maintain, and easy to deploy. By the end of this article, we hope you will have picked up some tips and tricks for approaching your upcoming ML projects.

Structure of a machine learning project

Why do we treat these projects differently than regular SE projects? We usually write software to manipulate data in a typical software-making environment, whether building a back-end or a mobile application.

Another important fact about these projects is that the resulting output is almost always predictable. For example, if a user fills in a form and clicks a submit button, everyone expects the form data to be stored somewhere and the user to see some feedback for the action they've taken.

In other words, we are making software whose goal is to manipulate data. In ML projects, however, the data is a substantial part of what we're building: if you change your data, you change your final program.

This last point leads teams in ML projects to run data-science reports and ML experiments to understand how to get the most out of their data, often losing focus on other essential parts of the project. It's crucial to prevent that from happening.

To mitigate some of the issues we mentioned above, we build our projects as a composite of four things:

data manipulation libraries: loading datasets or samples of datasets, tools for building new datasets, etc.
training and deployment libraries: tools for training and deploying models to reduce extensive boilerplate across multiple models
CLI and CI/CD setup: we want our models running outside an experimental environment! And we want them to be easy to deploy/update
models definitions and configurations: the juicy part of the project, and typically what most people pay attention to

To put it clearly, we like our projects to be structured as follows:

The exciting part of this approach is that, if modularization is done correctly, you don't need to keep all of your model definitions in a single repository. Instead, you can assign different models to different teams to maximize productivity and get a better collaboration environment.

So when it comes to defining each model, what do we actually need to do to make them easy to deploy and maintain? We’ll show you our approach when we do this for AWS SageMaker.

Encapsulating models' definitions

Having all your model definitions scattered throughout your code in Python scripts or Jupyter Notebooks is a terrible practice. As the project grows, managing that code becomes a nightmare, and the code becomes prone to bugs and hard to review.

For us, it's important to emphasize that having a consistent way of modularizing your ML models and effectively reusing code for inference and training processes is a key part of making your whole project successful in the long term.

Deploying models to SageMaker is a very structured process, since many conventions and predefined structures must be followed. We will focus specifically on the PyTorch framework for SageMaker because it's our favorite one and what we use daily, although similar considerations apply to TensorFlow, Scikit-Learn, and others.

To give a brief introduction for those unfamiliar with the service, the two typical uses of SageMaker are deploying a model endpoint for inference and training a model at scale. To do so with a PyTorch model, we need to take care of at least three things:

Writing the model files and their inference or training entry points
Bundling sources and weights and uploading the model to AWS S3 as a model.tar.gz file
Deploying the endpoint through either the SageMaker Python SDK, a CloudFormation template, or using the AWS web client GUI

Note that this structure might be a little different for different frameworks, but essentially they all share pretty much the same format for deployments.

In more detail, for the scenario of deploying an inference endpoint in SageMaker, the basic structure required to use the PyTorch framework is the following:

In the case of PyTorch, the structure is pretty simple: all we need is a weights file at the top level of the compressed file and an additional inference script to describe how to load the model and to process incoming requests. Optionally, you can add a requirements.txt file declaring other third-party libs. Apart from that, you can upload your own source library contained in the code folder. By doing that, you can make use of the great tools you've created for building pipelines, loading models, and so on.
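To make the inference script more concrete, here is a minimal sketch of what it could look like. The handler names (model_fn, input_fn, predict_fn, output_fn) follow the convention documented for SageMaker's PyTorch serving container. Everything else is an assumption made for illustration: the TorchScript weights file named model.pt, the "inputs" and "predictions" JSON keys, and JSON as the only supported content type.

```python
import json
import os


def model_fn(model_dir):
    """Load the model from the directory where model.tar.gz was unpacked."""
    # PyTorch is imported lazily so the JSON helpers below can be imported
    # and unit-tested on machines without it.
    import torch

    # Assumption: the weights were exported as a TorchScript file, model.pt.
    model = torch.jit.load(os.path.join(model_dir, "model.pt"), map_location="cpu")
    model.eval()
    return model


def input_fn(request_body, request_content_type):
    """Deserialize the request payload into plain Python lists."""
    if request_content_type != "application/json":
        raise ValueError(f"Unsupported content type: {request_content_type}")
    # Assumption: clients send {"inputs": [[...], ...]}.
    return json.loads(request_body)["inputs"]


def predict_fn(input_data, model):
    """Run a forward pass and return the result as nested lists."""
    import torch

    with torch.no_grad():
        batch = torch.tensor(input_data, dtype=torch.float32)
        return model(batch).tolist()


def output_fn(prediction, accept):
    """Serialize the prediction for the response."""
    if accept != "application/json":
        raise ValueError(f"Unsupported accept type: {accept}")
    return json.dumps({"predictions": prediction})
```

Keeping the serialization helpers free of framework code is a design choice, not a requirement: it lets you test the request and response format locally before anything is uploaded.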

So the question is how to manage our code so that it's easy to prototype, train, and deploy while staying compliant with SageMaker's way of bundling models. For that, we have our own opinionated approach. It might change from project to project, depending on the level of complexity you're dealing with. However, we think the following structure can work for most mid-sized and large ML projects that involve training and deploying multiple models.

Without further ado, here's the models structure we are talking about:

A setup like this makes it very easy to write a quick bundling function using tarfile to create the model.tar.gz with the required structure, as all the model source files are in predictable locations. On top of that, we don't add any complex code or extensive generalization structure for potential future modularization. Starting with a small codebase and growing its complexity as needed is often better than starting with a vast generalization.
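As a rough sketch of such a bundling function, the snippet below puts the weights file at the top level of the archive and every source file under code/. The function name bundle_model, its parameters, and the choice to skip __pycache__ folders are our assumptions for this example, not a fixed recipe.

```python
import tarfile
from pathlib import Path


def bundle_model(source_dir, weights_path, output_path="model.tar.gz"):
    """Create an archive with the weights at the top level and sources under code/."""
    source_dir = Path(source_dir)
    weights_path = Path(weights_path)
    with tarfile.open(output_path, "w:gz") as archive:
        # The weights file goes at the top level of the archive.
        archive.add(weights_path, arcname=weights_path.name)
        # Every source file (entry point, requirements.txt, helpers) goes under code/.
        for source in sorted(source_dir.rglob("*")):
            if source.is_file() and "__pycache__" not in source.parts:
                relative = source.relative_to(source_dir)
                archive.add(source, arcname=(Path("code") / relative).as_posix())
    return Path(output_path)
```

Calling bundle_model("my_model/inference", "weights/model.pth") with those hypothetical paths would produce a model.tar.gz in the current directory, ready to be uploaded to S3.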

Let's go deeper into what this structure tries to communicate. We want each individual model of our project to be fully contained in a module so that it is easier to deploy. We structure the inner modules, such as inference or train, in the same way for every model, so that the process of deploying or training is basically the same regardless of which model we are talking about. On top of that, each inference or train module contains very little boilerplate, since it relies on our core library tools for datasets and processing utilities. At the end of the day, deploying a model basically means putting the corresponding folder into a model.tar.gz file that satisfies all of SageMaker's requirements for training or inference.

Final checks

Importance of configuration files and trackable configurations

Every model should have a configuration file associated with it. This is very important because settings changed directly in the code are hard to track when you run many experiments. Luckily, there are plenty of tools to choose from to solve this problem. You can use a simple library such as PyYAML to read your configurations from YAML files. However, we often like to use OmegaConf, a library that can build structured configurations from Python's standard-library dataclasses. You can also check out Hydra from Facebook Research, which we find to be the best option for bigger projects and which uses OmegaConf under the hood.
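To illustrate the idea without tying it to a particular tool, here is a small sketch that uses only the standard library: a dataclass declares the settings and their defaults, and a JSON file overrides them. JSON stands in for YAML here so that the example has no dependencies. The TrainConfig fields and the helper names are hypothetical.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class TrainConfig:
    # Hypothetical settings, chosen only for illustration.
    model_name: str = "classifier"
    learning_rate: float = 1e-3
    batch_size: int = 32
    epochs: int = 10


def load_config(text):
    """Build a TrainConfig from JSON text; unknown keys raise a TypeError."""
    return TrainConfig(**json.loads(text))


def dump_config(config):
    """Serialize a config so it can be committed or stored next to the results."""
    return json.dumps(asdict(config), indent=2, sort_keys=True)


config = load_config('{"learning_rate": 0.01, "epochs": 3}')
print(dump_config(config))
```

Because every setting lives in one typed object, a misspelled key fails loudly instead of silently falling back to a default, and the dumped file records exactly what a given run used.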

Your goal is to put your model live

Some people forget that ML projects are not only about running experiments and trying new things. At the end of the day, you want your project to add value and to make that value visible to your audience. In particular, we must not forget that ML projects are SE projects in the end, and that we can use all the tooling available for team collaboration, code reviews, tests, and continuous integration and deployment that works so well for any other kind of project.

There's no perfect recipe for this point, as each team or company has its own SE pipelines and common practices, but it's really important not to treat ML projects differently within your company for no apparent reason. This means that tests, CI/CD, and more still apply to ML projects!

For our scenario of running everything in SageMaker, we like to focus a lot on getting all our infrastructure coded in CloudFormation stacks. All you need from your library's standpoint is to bundle your models correctly, and SageMaker will do the rest. Note that every framework has its own bundling requirements, but overall they are pretty similar. In summary, having your infrastructure coded and having the right tools to support it makes the deployment process much easier and reproducible across environments if necessary.

Tracking, tracking, tracking

Finally, we cannot ignore the fact that ML projects almost always include periods of research and experimentation. Whether it is to understand the underlying data or to try a fancy new tool that might increase your model's performance, experiments in ML projects happen. Having your model modularized and configurable from config files is an excellent step towards tracking your experiments.

Tracking your experiments might sound like a waste of time on your first iterations of the project. However, very quickly, it can become a headache to remember the best settings for the best results you've got or the performance of a model when you changed a particular configuration. For those reasons and more, tracking every single detail of your experiments is a great practice.
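The core of experiment tracking is small enough to sketch by hand. The snippet below is only a stand-in for a real tracking tool: it appends each run's configuration and metrics to a JSON Lines file and looks up the best run afterwards. The function names, the record layout, and the file format are all assumptions made for this example.

```python
import json
import time
from pathlib import Path


def log_run(log_path, config, metrics):
    """Append one experiment record (config plus metrics) to a JSON Lines file."""
    record = {"timestamp": time.time(), "config": config, "metrics": metrics}
    with Path(log_path).open("a", encoding="utf-8") as log_file:
        log_file.write(json.dumps(record) + "\n")
    return record


def best_run(log_path, metric, higher_is_better=True):
    """Return the logged run with the best value for the given metric."""
    with Path(log_path).open(encoding="utf-8") as log_file:
        runs = [json.loads(line) for line in log_file if line.strip()]
    pick = max if higher_is_better else min
    return pick(runs, key=lambda run: run["metrics"][metric])
```

Even a log this simple answers the question of which settings produced the best result, which is exactly what becomes hard to remember after a few dozen runs.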

There are plenty of tools available for this purpose. To name a few, you can integrate with Weights and Biases or CometML for an all-in-one platform that quickly tracks your experiments. You can also customize metric reports, write summaries of experiments, and more.

Lately, we've been experimenting with a new tool from Replicate.ai called Keepsake. Keepsake simplifies the process of running experiments and fetching the code and weights of your best model. It is easy to add to your project, so you might want to give it a try.

Conclusions

ML projects can get messy sometimes, so we need to put in a little extra effort and make a conscious choice to keep the code easy for anyone new to pick up and understand.

Along the same lines, it's essential to start small and gradually increase the project's complexity only if necessary. If you plan far ahead and generalize a lot of your code or add extra tools just in case you need them someday, you're likely to change plans midway, and it becomes harder for the project to move fast. Notably, we like using SageMaker because it makes it easier to keep a clear structure of what you need for training and inference. More specifically, once you have done it for one model, it becomes much easier to replicate the same process for the rest of your models.

Also, we're not saying you should not run experiments. That's a crucial part! However, always make a point of moving the central parts of your experimental scripts or Jupyter Notebooks into your library and into your models' folder structure, so that you continuously improve your model while keeping a set of tools that support training and deploying models with a few commands. Think of these scripts and Notebooks as single-use tools that can be discarded once you have moved everything you need into your project.

And finally, focus on your project's goals and put your models live as soon as possible! If you have everything set up, having a new model deployment could be as easy as having the correct configuration files and merging a pull request into a branch that automatically deploys your model into SageMaker, and starts up all the infrastructure you need.

Thanks for reading this article! We hope it has triggered some questions on how you think ML projects should be handled or helped you get started with a new ML project. As always, we welcome you to read the rest of our blog if you are interested.
