Deploying and monitoring machine learning projects is a complex undertaking. In addition to the consistent documentation of model parameters and the associated evaluation metrics, the main challenge is to transfer the desired model into a productive environment. If several people are involved in the development, additional synchronization problems arise concerning the models’ development environments and version statuses. For this reason, tools for the efficient management of model results through to extensive training and inference pipelines are required. In this article, we present the typical challenges along the machine learning workflow and describe a possible solution platform with MLflow. In addition, we present three different scenarios that can be used to professionalize machine learning workflows:
- Entry-level Variant: Model parameters and performance metrics are logged via a R/Python API and clearly presented in a GUI. In addition, the trained models are stored as artifacts and can be made available via APIs.
- Advanced Model Management: In addition to tracking parameters and metrics, certain models are logged and versioned. This enables consistent monitoring and simplifies the deployment of selected model versions.
- Collaborative Workflow Management: Encapsulating Machine Learning projects as packages or Git repositories and the accompanying local reproducibility of development environments enable smooth development of Machine Learning projects with multiple stakeholders.
Depending on the maturity of your machine learning project, these three scenarios can serve as inspiration for a potential machine learning workflow. We have elaborated each scenario in detail for better understanding and provide recommendations regarding the APIs and deployment environments to use.
Challenges Along the Machine Learning Workflow
Training machine learning models is becoming easier and easier. Meanwhile, a variety of open-source tools enable efficient data preparation as well as increasingly simple model training and deployment.
The added value for companies comes primarily from the systematic interaction of model training, in the form of model identification, hyperparameter tuning and fitting on the training data, and deployment, i.e., making the model available for inference tasks. This interaction is often not established as a continuous process, especially in the early phases of machine learning initiative development. However, a model can only generate added value in the long term if a stable production process is implemented from model training, through its validation, to testing and deployment. If this process is implemented correctly, complex dependencies and costly maintenance work in the long term can arise during the operational start-up of the model . The following risks are particularly noteworthy in this regard.
1. Ensuring Synchronicity
Often, in an exploratory context, data preparation and modeling workflows are developed locally. Different configurations of development environments or even the use of different technologies make it difficult to reproduce results, especially between developers or teams. In addition, there are potential dangers concerning the compatibility of the workflow if several scripts must be executed in a logical sequence. Without an appropriate version control logic, the synchronization effort afterward can only be guaranteed with great effort.
2. Documentation Effort
To evaluate the performance of the model, model metrics are often calculated following training. These depend on various factors, such as the parameterization of the model or the influencing factors used. This meta-information about the model is often not stored centrally. However, for systematic further development and improvement of a model, it is mandatory to have an overview of the parameterization and performance of all past training runs.
3. Heterogeneity of Model Formats
In addition to managing model parameters and results, there is the challenge of subsequently transferring the model to the production environment. If different models from multiple packages are used for training, deployment can quickly become cumbersome and error-prone due to different packages and versions.
4. Recovery of Prior Results
In a typical machine learning project, the situation often arises that a model is developed over a long period of time. For example, new features may be used, or entirely new architectures may be evaluated. These experiments do not necessarily lead to better results. If experiments are not versioned cleanly, there is a risk that old results can no longer be reproduced.
Various tools have been developed in recent years to solve these and other challenges in the handling and management of machine learning workflows, such as TensorFlow TFX, cortex, Marvin, or MLFlow. The latter, in particular, is currently one of the most widely used solutions.
MLflow is an open-source project with the goal to combine the best of existing ML platforms to make the integration to existing ML libraries, algorithms, and deployment tools as straightforward as possible . In the following, we will introduce the main MLflow modules and discuss how machine learning workflows can be mapped via MLflow.
MLflow consists of four components: MLflow Tracking, MLflow Models, MLflow Projects, and MLflow Registry. Depending on the requirements of the experimental and deployment scenario, all services can be used together, or individual components can be isolated.
With MLflow Tracking, all hyperparameters, metrics (model performance), and artifacts, such as charts, can be logged. MLflow Tracking provides the ability to collect presets, parameters, and results for collective monitoring for each training or scoring run of a model. The logged results can be visualized in a GUI or alternatively accessed via a REST API.
The MLflow Models module acts as an interface between technologies and enables simplified deployment. Depending on its type, a model is stored as a binary, e.g., a pure Python function, or as a Keras or H2O model. One speaks here of the so-called model flavors. Furthermore, MLflow Models provides support for model deployment on various machine learning cloud services, e.g., for AzureML and Amazon Sagemaker.
MLflow Projects are used to encapsulate individual ML projects in a package or Git repository. The basic configurations of the respective environment are defined via a YAML file. This can be used, for example, to control how exactly the conda environment is parameterized, which is created when MLflow is executed. MLflow Projects allows experiments that have been developed locally to be executed on other computers in the same environment. This is an advantage, for example, when developing in smaller teams.
MLflow Registry provides a centralized model management. Selected MLflow models can be registered and versioned in it. A staging workflow enables a controlled transfer of models into the productive environment. The entire process can be controlled via a GUI or a REST API.
Examples of Machine Learning Pipelines Using MLflow
In the following, three different ML workflow scenarios are presented using the above MLflow modules. These increase in complexity from scenario to scenario. In all scenarios, a dataset is loaded into a development environment using a Python script, processed, and a machine learning model is trained. The last step in all scenarios is a deployment of the ML model in an exemplary production environment.
1. Scenario – Entry-Level Variant
Scenario 1 uses the MLflow Tracking and MLflow Models modules. Using the Python API, the model parameters and metrics of the individual runs can be stored on the MLflow Tracking Server Backend Store, and the corresponding MLflow Model File can be stored as an artifact on the MLflow Tracking Server Artifact Store. Each run is assigned to an experiment. For example, an experiment could be called ‘fraud_classification’, and a run would be a specific ML model with a certain hyperparameter configuration and the corresponding metrics. Each run is stored with a unique RunID.
In the screenshot above, the MLflow Tracking UI is shown as an example after executing a model training. The server is hosted locally in this example. Of course, it is also possible to host the server remotely. For example in a Docker container within a virtual machine. In addition to the parameters and model metrics, the time of the model training, as well as the user and the name of the underlying script, are also logged. Clicking on a specific run also displays additional information, such as the RunID and the model training duration.
If you have logged other artifacts in addition to the metrics, such as the model, the MLflow Model Artifact is also displayed in the Run view. In the example, a model from the sklearn.svm package was used. The MLmodel file contains metadata with information about how the model should be loaded. In addition to this, a conda.yaml is created that contains all the package dependencies of the environment at training time. The model itself is located as a serialized version under model.pkl and contains the model parameters optimized on the training data.
The deployment of the trained model can now be done in several ways. For example, suppose one wants to deploy the model with the best accuracy metric. In that case, the MLflow tracking server can be accessed via the Python API mlflow.list_run_infos to identify the RunID of the desired model. Now, the path to the desired artifact can be assembled, and the model loaded via, for example, the Python package pickle. This workflow can now be triggered via a Dockerfile, allowing flexible deployment to the infrastructure of your choice. MLflow offers additional separate APIs for deployment on Microsoft Azure and AWS. For example, if the model is to be deployed on AzureML, an Azure ML container image can be created using the Python API mlflow.azureml.build_image, which can be deployed as a web service to Azure Container Instances or Azure Kubernetes Service. In addition to the MLflow Tracking Server, it is also possible to use other storage systems for the artifact, such as Amazon S3, Azure Blob Storage, Google Cloud Storage, SFTP Server, NFS, and HDFS.
2. Scenario – Advanced Model Management
Scenario 2 includes, in addition to the modules used in scenario 1, MLflow Model Registry as a model management component. Here, it is possible to register and process the models logged there from specific runs. These steps can be controlled via the API or GUI. A basic requirement to use the Model Registry is deploying the MLflow Tracking Server Backend Store as Database Backend Store. To register a model via the GUI, select a specific run and scroll to the artifact overview.
Clicking on Register Model opens a new window in which a model can be registered. If you want to register a new version of an already existing model, select the desired model from the dropdown field. Otherwise, a new model can be created at any time. After clicking the Register button, the previously registered model appears in the Models tab with corresponding versioning.
Each model includes an overview page that shows all past versions. This is useful, for example, to track which models were in production when.
If you now select a model version, you will get to an overview where, for example, a model description can be added. The Source Run link also takes you to the run from which the model was registered. Here you will also find the associated artifact, which can be used later for deployment.
In addition, individual model versions can be categorized into defined phases in the Stage area. This feature can be used, for example, to determine which model is currently being used in production or is to be transferred there. For deployment, in contrast to scenario 1, versioning and staging status can be used to identify and deploy the appropriate model. For this, the Python API MlflowClient().search_model_versions can be used, for example, to filter the desired model and its associated RunID. Similar to scenario 1, deployment can then be completed to, for example, AWS Sagemaker or AzureML via the respective Python APIs.
3. Scenario – Collaborative Workflow Management
In addition to the modules used in scenario 2, scenario 3 also includes the MLflow Projects module. As already explained, MLflow Projects are particularly well suited for collaborative work. Any Git repository or local environment can act as a project and be controlled by an MLproject file. Here, package dependencies can be recorded in a conda.yaml, and the MLproject file can be accessed when starting the project. Then the corresponding conda environment is created with all dependencies before training and logging the model. This avoids the need for manual alignment of the development environments of all developers involved and also guarantees standardized and comparable results of all runs. Especially the latter is necessary for the deployment context since it cannot be guaranteed that different package versions produce the same model artifacts. Instead of a conda environment, a Docker environment can also be defined using a Dockerfile. This offers the advantage that package dependencies independent of Python can also be defined. Likewise, MLflow Projects allow the use of different commit hashes or branch names to use other project states, provided a Git repository is used.
An interesting use case is the modularized development of machine learning training pipelines . For example, data preparation can be decoupled from model training and developed in parallel, while another team uses a different branch name to train the model. In this case, only a different branch name must be used as a parameter when starting the project in the MLflow Projects file. The final data preparation can then be pushed to the same branch name used for model training and would thus already be fully implemented in the training pipeline. The deployment can also be controlled as a sub-module within the project pipeline through a Python script via the ML Project File and can be carried out analogous to scenario 1 or 2 on a platform of your choice.
Conclusion and Outlook
MLflow offers a flexible way to make the machine learning workflow robust against the typical challenges in the daily life of a data scientist, such as synchronization problems due to different development environments or missing model management. Depending on the maturity level of the existing machine learning workflow, various services from the MLflow portfolio can be used to achieve a higher level of professionalization.
In the article, three machine learning workflows, ascending in complexity, were presented as examples. From simple logging of results in an interactive UI to more complex, modular modeling pipelines, MLflow services can support it. Logically, there are also synergies outside the MLflow ecosystem with other tools, such as Docker/Kubernetes for model scaling or even Jenkins for CI/CD pipeline control. If there is further interest in MLOps challenges and best practices, I refer you to the webinar on MLOps by our CEO Sebastian Heinz, which we provide free of charge.
- Hidden Technical Debt in Machine Learning Systems (2014) von D. Sculley et al., (https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf)