Machine Learning lifecycle challenges
Machine learning is still difficult, but well-defined APIs that abstract away the complexities of algorithms have made it less complicated. Managing the machine learning lifecycle, however, is not easily streamlined and remains difficult and complex for many organisations. Today, we discuss MLflow on Databricks and illustrate the comprehensive framework that machine learning lifecycle management requires at every stage.
Notebook for this blog post can be downloaded here: Jupyter Notebook
Exploring MLflow on Databricks
Let’s have a peek at the complexity of the problem.
Many data teams working in a data-driven company update machine learning models in production on a regular basis. Creating a good workflow that keeps track of what goes in and out of each stage on a model's way to production, and that makes the whole process reproducible, is a complex engineering task.
If your organisation needs to be truly productive, do machine learning at scale and maintain good governance, you need a comprehensive framework to manage this lifecycle at every stage.
Proprietary platforms like Uber’s Michelangelo have been developed to address these problems internally. They have been successful in standardising the process of data preparation, model training, and deployment via APIs built for data scientists. Today we are going to talk about an open-source platform that can address these concerns, and present a case study on its usage.
MLflow on Databricks as a managed service
MLflow is an open-source, modular framework that can be deployed locally or on any cloud platform of choice. Designed to work with all popular ML frameworks and developed by a growing number of contributors, it provides many useful features for ML lifecycle management. You can integrate MLflow into your existing ML code with little effort. For the scope of this case study, we will work with managed MLflow on Databricks.
Case study: New York taxi fare prediction challenge
Recently, we published a blog post on how to do data wrangling and machine learning on a large dataset using the Databricks platform. We built a benchmark model and made some predictions. Now let’s see how we can use some of the features of MLflow to refine the model over many hyperparameter tuning iterations and package it for distribution in a way that allows the results to be recreated. We will also register the best model in the Model Registry and deploy it to different environments.
The context for our MLflow on Databricks project
The selected dataset is from a Kaggle competition. The dataset has over 55 million taxi trips and is over 5GB in size.
In that previous blog post, we checkpointed the code into 3 sections. For this blog post, we will start after Checkpoint-2 where we have completed data wrangling. The TensorFlow-Keras and MLflow code presented in this blog post will depend upon the data inputs produced before this checkpoint. Please see the attached notebook at the beginning of this blog post for the full code including the Apache Spark data-wrangling section.
To give more context to this case study we have to revisit our previous blog post. There we discussed how to set up a Databricks Standard Edition cluster with ML support and AWS S3 integration. When you set up your cluster with the 6.6 ML Runtime, all the MLflow modules are available through both the APIs and the web interfaces.
Let’s start with MLflow Tracking. MLflow modules can be used independently of one another, which makes adoption much easier. We can pick the module that best suits our initial work and then combine it with other modules to add more lifecycle features as our understanding grows.
Here, we will show you how to start with the tracking module and progress further towards packaging, model selection, and deployment to production step-by-step.
Illustration of MLflow module packaged with MLflow projects recreating one or more models
Each run with MLflow Tracking can record the following information.
Code version: If it was executed from an MLflow project, the git commit version.
Start & End: Start and end times for the run.
Source: The name of the file executed to launch the run.
Parameters: Key-value input parameters, including any hyperparameters.
Metrics: Keys and numeric values of metrics produced during the run. MLflow will record and let you visualize the history of the metric within the run and between runs.
Artifacts and Models: The trained model for each run or other artifacts such as prediction results.
Below is an extract of the Python class NYorkTaxiFairPrediction.py. For the full code, please see the attached notebook or the GitHub link.
def mlflow_run: This is where the magic happens. This method trains the model, computes the metrics, and logs all metrics, parameters, and artifacts for the current run using the MLflow Tracking API. See inline comments below for MLflow Tracking specific code.
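The extract itself is not reproduced in this text, so what follows is only a minimal sketch of the kind of logging such a method might do, not the code from the notebook. The function name track_run, the rmse helper, the metric names and the predictions.json artifact are all assumptions made for illustration; the mlflow calls (start_run, log_params, log_metric, log_artifact) belong to the MLflow Tracking API.

```python
import json
import math
import os
import tempfile


def rmse(y_true, y_pred):
    """Root mean squared error between two equally long sequences."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def track_run(params, loss_per_epoch, y_true, y_pred):
    """Log one (already trained) run with the MLflow Tracking API.

    All argument names are hypothetical; training itself is left out.
    """
    import mlflow  # imported lazily so rmse() works without MLflow installed

    with mlflow.start_run() as run:
        # Parameters: key-value inputs, including hyperparameters.
        mlflow.log_params(params)
        # Metrics: passing a step records the history within the run.
        for epoch, loss in enumerate(loss_per_epoch):
            mlflow.log_metric("loss", loss, step=epoch)
        mlflow.log_metric("rmse", rmse(y_true, y_pred))
        # Artifacts: any file, here the predictions written as JSON.
        with tempfile.TemporaryDirectory() as tmp:
            path = os.path.join(tmp, "predictions.json")
            with open(path, "w") as fh:
                json.dump({"y_pred": list(y_pred)}, fh)
            mlflow.log_artifact(path)
        return run.info.run_id
```

In a Databricks notebook a call such as track_run({"epochs": 10}, losses, y_test, predictions) would show up as one entry in the run list described further down.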
Training with a range of hyperparameters
This is how we have defined the input data and other hyperparameters. We will run multiple iterations from the notebook in a loop.
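As a rough illustration of such a loop, the sketch below expands a small search space into every parameter combination. The hyperparameter names and values are hypothetical and are not the ones used in the notebook; in practice each combination would be passed to the tracked training method rather than printed.

```python
from itertools import product

# Hypothetical search space; the real values live in the notebook.
search_space = {
    "epochs": [5, 10],
    "batch_size": [128, 256],
    "learning_rate": [0.001, 0.01],
}


def param_grid(space):
    """Yield one dict per combination of the values in `space`."""
    keys = list(space)
    for values in product(*(space[key] for key in keys)):
        yield dict(zip(keys, values))


grid = list(param_grid(search_space))

for params in grid:
    # Each combination would start one tracked training run.
    print(params)
```

A full grid grows multiplicatively with every added hyperparameter, so it is worth keeping the lists short when each run is expensive.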
Now let us have a look at the MLflow tracking UI to check how our runs have been recorded. Click on Runs on the top right of the Databricks notebook. You will see a summary of the iterations. Further, you can expand upon each iteration to check the input parameters and metrics produced by the run itself.
Towards the bottom of the list of runs, we can see the link to the Experiments UI. Let’s dive a step further: here we can select and compare multiple runs in great detail. We can also plot graphs of input parameters against metrics.
Packaging and distributing the project with MLflow Projects
MLflow Projects is mainly a convention for packaging and distributing machine learning projects. It also provides an API to re-run them so that the experiments and their results can be recreated by different stakeholders.
Let us now see how we can package the ML code we have discussed so far. Code for this project can be downloaded from https://github.com/priyanlc/NewYorkTaxiPredictMLflow.git.
- Name: Just a human-readable name for the project.
- Environment: Here we have chosen Conda as the environment. The conda.yaml will have the package configuration. Docker is also an option.
- Entry points: The parameters and the entry point for the ML code need to be given here.
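To make these three fields concrete, here is a small sketch that writes an MLproject file with a name, a Conda environment and one entry point. It is not the file from the repository: the project name, the parameter names, their defaults and the main.py command line are assumptions chosen for illustration.

```python
import os
import textwrap

# Hypothetical MLproject contents: name, environment and one entry point.
MLPROJECT = textwrap.dedent(
    """\
    name: nyc-taxi-fare-sketch

    conda_env: conda.yaml

    entry_points:
      main:
        parameters:
          epochs: {type: int, default: 10}
          batch_size: {type: int, default: 128}
        command: "python main.py --epochs {epochs} --batch-size {batch_size}"
    """
)


def write_mlproject(project_dir):
    """Write the sketch above to <project_dir>/MLproject and return the path."""
    path = os.path.join(project_dir, "MLproject")
    with open(path, "w") as fh:
        fh.write(MLPROJECT)
    return path
```

The placeholders in curly braces inside the command are filled in by MLflow from the declared parameters when the entry point is run.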
You can find the conda.yaml file here.
This file describes the package configuration and all the dependencies.
This is the entry point for the ML code. The MLflow Projects API will use this entry point to run the code and re-create the experiment.
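An entry point of this kind is usually a script that reads its parameters from the command line. The sketch below shows one possible shape using argparse; the parameter names and defaults are hypothetical, and the training call is replaced by a print statement.

```python
import argparse


def parse_args(argv=None):
    """Parse the hyperparameters passed on the command line."""
    parser = argparse.ArgumentParser(description="Train the taxi fare model")
    parser.add_argument("--epochs", type=int, default=10)
    parser.add_argument("--batch-size", type=int, default=128)
    parser.add_argument("--learning-rate", type=float, default=0.001)
    return parser.parse_args(argv)


def main(argv=None):
    params = vars(parse_args(argv))
    # A real entry point would build the model and start a tracked
    # training run here; this sketch only reports the parameters.
    print(f"Would train with {params}")
    return params


if __name__ == "__main__":
    main()
```

Keeping the parsing in its own function makes it easy to call the same code from a notebook with an explicit argument list.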
This is how we can load the project in the notebook and use the MLflow Projects API to run the experiment with a given parameter combination.
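A minimal sketch of such a call, assuming the project exposes an entry point named main, could look like this. The wrapper function run_project and the parameter names in the example call are hypothetical; mlflow.projects.run is the MLflow Projects API and the URL is the repository mentioned above.

```python
# Repository URL taken from this post.
PROJECT_URI = "https://github.com/priyanlc/NewYorkTaxiPredictMLflow.git"


def run_project(parameters, uri=PROJECT_URI, entry_point="main"):
    """Run one entry point of an MLflow project and return the run ID."""
    import mlflow.projects  # imported lazily; requires MLflow to be installed

    submitted = mlflow.projects.run(
        uri=uri,
        entry_point=entry_point,
        parameters=parameters,
    )
    return submitted.run_id


# Example call; the parameter names must match those declared in the
# project's MLproject file and are only assumed here:
# run_id = run_project({"epochs": 10, "batch_size": 128})
```

Because the project is fetched from its Git URI, anyone with access to the repository can repeat the experiment with the same or different parameters.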
Register the best Keras model with the MLflow Model Registry
Now that the model to predict taxi fares has been trained and tracked with MLflow, the next step is to register it with the MLflow Model Registry. You can register and manage models using the Experiments UI.
From the Experiments UI, pick the best model and register it with the Model Registry.
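Registration can also be done from code with mlflow.register_model. The sketch below assumes the model was logged under the artifact path "model" and that you already know the ID of the best run; both helper functions and the example names are hypothetical.

```python
def model_uri_for_run(run_id, artifact_path="model"):
    """Build the runs:/ URI that points at a model logged by a run."""
    return f"runs:/{run_id}/{artifact_path}"


def register_best_model(run_id, name, artifact_path="model"):
    """Register the model logged by `run_id` and return its new version."""
    import mlflow  # imported lazily; requires MLflow to be installed

    model_version = mlflow.register_model(
        model_uri=model_uri_for_run(run_id, artifact_path),
        name=name,
    )
    return model_version.version


# Example call with placeholder values:
# register_best_model("<best-run-id>", "nyc-taxi-fare")
```

Registering the same name again creates a new version of the model rather than overwriting the previous one.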
Perform a model stage transition
The Model Registry facilitates the transition of models between stages: None, Staging, Production, and Archived. Progression of a model from one stage to another can be subject to an approval process. Each of the stages can have a different function. For example, Staging can be used for model testing, while Production is for models that have completed the testing or review processes and have been deployed to applications. When a model needs to be retired from Production it can be moved to Archived.
Let’s select a version of the model and transition it to Production.
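The same transition can be scripted through the registry client. The sketch below takes the client as an argument and checks the stage name before calling transition_model_version_stage, which is a method of MlflowClient; the wrapper function and the model name in the example are hypothetical. Recent MLflow releases mark stages as deprecated in favour of model version aliases, so treat this as tied to the MLflow version used in this post.

```python
# Stage names as listed in this post.
VALID_STAGES = ("None", "Staging", "Production", "Archived")


def transition(client, name, version, stage):
    """Move one registered model version to the given stage."""
    if stage not in VALID_STAGES:
        raise ValueError(f"Unknown stage {stage!r}, expected one of {VALID_STAGES}")
    return client.transition_model_version_stage(
        name=name,
        version=str(version),
        stage=stage,
    )


# Example call with a hypothetical model name:
# from mlflow.tracking import MlflowClient
# transition(MlflowClient(), "nyc-taxi-fare", version=1, stage="Production")
```

Validating the stage name locally gives a clearer error than waiting for the registry to reject the request.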
Finally, we can load the model from Production. The Keras model is loaded as a Python function, so we can pass in some test data and make predictions. MLflow’s python_function flavour provides a consistent API across machine learning frameworks, ensuring that the same application inference code continues to work even after the introduction of a new model version.
MLflow, together with the Databricks Unified Analytics Platform, creates a uniquely powerful environment for creating and managing the end-to-end ML lifecycle efficiently. In this blog post, we discussed some of the features related to machine learning experiments:
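As a sketch of that last step, the function below loads whichever version currently sits in a stage through mlflow.pyfunc.load_model and calls predict on it. The helper names and the model name in the example are hypothetical, and the features argument is assumed to be a pandas DataFrame with the columns the model was trained on.

```python
def stage_uri(name, stage="Production"):
    """Build the models:/ URI for the version currently in `stage`."""
    return f"models:/{name}/{stage}"


def predict_with_production_model(name, features):
    """Load the Production model as a generic Python function and predict."""
    import mlflow.pyfunc  # imported lazily; requires MLflow to be installed

    model = mlflow.pyfunc.load_model(stage_uri(name))
    return model.predict(features)


# Example call with a hypothetical model name and test DataFrame:
# fares = predict_with_production_model("nyc-taxi-fare", test_df)
```

Because the URI refers to a stage rather than a version number, the calling code does not change when a new version is promoted.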
- Tracking inputs and metrics with MLflow Tracking.
- Packaging, distributing and reproducing results with MLflow Projects.
- ML Model governance and lifecycle management with MLflow Model Registry.
At Infinite Lambda we architect end-to-end integrated solutions with the Databricks Platform on the cloud. We will also help with all aspects of technical implementation.
Have a chat with us at firstname.lastname@example.org to find out more!
- Hermann, J., & Balso, M. D. “Meet Michelangelo: Uber’s Machine Learning Platform.” https://eng.uber.com/michelangelo-machine-learning-platform/
- Chandrapala, P. “How to Use Databricks Notebooks and AWS: Infinite Lambd.” https://infinitelambda.com/post/predictive-analytics-on-large-datasets-with-databricks-notebooks-on-aws
- “New York City Taxi Fare Prediction.” https://www.kaggle.com/c/new-york-city-taxi-fare-prediction
- “MLflow Tracking.” https://www.mlflow.org/docs/latest/tracking.html
- Chandrapala, P. (2020). “NewYorkTaxiPredictMLflow.” https://github.com/priyanlc/NewYorkTaxiPredictMLflow
- “MLflow Projects.” https://www.mlflow.org/docs/latest/projects.html
- “MLflow Model Registry.” https://www.mlflow.org/docs/latest/model-registry.html
This blog post was inspired by this workshop on MLflow by Jules Damji: https://www.youtube.com/watch?v=x3cxvsUFVZA&t=578s