MLOps Principles

As machine learning and AI propagate in software products and services, we need to establish best practices and tools to test, deploy, manage, and monitor ML models in real-world production. In short, with MLOps we strive to avoid “technical debt” in machine learning applications.

SIG MLOps defines “an optimal MLOps experience [as] one where Machine Learning assets are treated consistently with all other software assets within a CI/CD environment. Machine Learning models can be deployed alongside the services that wrap them and the services that consume them as part of a unified release process.” By codifying these practices, we hope to accelerate the adoption of ML/AI in software systems and fast delivery of intelligent software. In the following, we describe a set of important concepts in MLOps such as Iterative-Incremental Development, Automation, Continuous Deployment, Versioning, Testing, Reproducibility, and Monitoring.

Iterative-Incremental Process in MLOps

Agile ML Workflow

The complete MLOps process includes three broad phases: “Designing the ML-powered application”, “ML Experimentation and Development”, and “ML Operations”.

The first phase is devoted to business understanding, data understanding, and designing the ML-powered software. In this stage, we identify our potential users, design the machine learning solution to solve their problem, and assess the further development of the project. Most problems fall into one of two categories: increasing the productivity of the user or increasing the interactivity of our application.

Initially, we define ML use-cases and prioritize them. The best practice for ML projects is to work on one ML use case at a time. Furthermore, the design phase aims to inspect the available data that will be needed to train our model and to specify the functional and non-functional requirements of our ML model. We should use these requirements to design the architecture of the ML-application, establish the serving strategy, and create a test suite for the future ML model.

The follow-up phase, “ML Experimentation and Development”, is devoted to verifying the applicability of ML to our problem by implementing a proof of concept for the ML model. Here, we iterate over different steps, such as identifying or refining a suitable ML algorithm for our problem, data engineering, and model engineering. The primary goal in this phase is to deliver a stable, high-quality ML model that we will run in production.

The main focus of the “ML Operations” phase is to deliver the previously developed ML model in production by using established DevOps practices such as testing, versioning, continuous delivery, and monitoring.

All three phases are interconnected and influence each other. For example, decisions made during the design stage propagate into the experimentation phase and ultimately influence the deployment options during the final operations phase.


Automation

The level of automation of the Data, ML Model, and Code pipelines determines the maturity of the ML process. As maturity increases, so does the velocity of training new models. The objective of an MLOps team is to automate the deployment of ML models into the core software system or as a service component. This means automating every step of the ML workflow so that it runs without manual intervention. Triggers for automated model training and deployment can be calendar events, messaging, monitoring events, as well as changes to data, model training code, and application code.

Automated testing helps discover problems quickly and at an early stage. This enables fast fixing of errors and learning from mistakes.

To adopt MLOps, we see three levels of automation, starting from the initial level with manual model training and deployment, up to running both ML and CI/CD pipelines automatically.

  1. Manual process. This is a typical data science process, which is performed at the beginning of implementing ML. This level has an experimental and iterative nature. Every step in each pipeline, such as data preparation and validation or model training and testing, is executed manually. The common way to proceed is to use Rapid Application Development (RAD) tools, such as Jupyter Notebooks.
  2. ML pipeline automation. At the next level, model training is executed automatically. Here we introduce continuous training of the model: whenever new data is available, model retraining is triggered. This level of automation also includes data and model validation steps.
  3. CI/CD pipeline automation. In the final stage, we introduce a CI/CD system to perform fast and reliable ML model deployments in production. The core difference from the previous step is that we now automatically build, test, and deploy the Data, ML Model, and the ML training pipeline components.
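The move from level 1 to level 2 can be pictured as a small gate around the training job. The following is a minimal sketch, not a prescribed implementation: `validate_data`, `continuous_training_step`, the `train` and `evaluate` callables, and the `min_metric` threshold are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

Row = dict  # one example as a plain dict; the schema is an assumption


@dataclass
class TrainingResult:
    model: object
    metric: float


def validate_data(rows: Sequence[Row], required_keys: Sequence[str]) -> bool:
    """Data validation: reject empty batches and rows with missing values."""
    if not rows:
        return False
    return all(row.get(key) is not None for row in rows for key in required_keys)


def continuous_training_step(
    new_rows: Sequence[Row],
    holdout_rows: Sequence[Row],
    required_keys: Sequence[str],
    train: Callable[[Sequence[Row]], object],
    evaluate: Callable[[object, Sequence[Row]], float],
    min_metric: float,
) -> Optional[TrainingResult]:
    """Retrain on newly arrived data; return a result only if both checks pass."""
    # Step 1: data validation before any training happens.
    if not validate_data(new_rows, required_keys):
        return None
    # Step 2: automated retraining, triggered by the arrival of new data.
    model = train(new_rows)
    # Step 3: model validation on held-out data before the model is accepted.
    metric = evaluate(model, holdout_rows)
    if metric < min_metric:
        return None
    return TrainingResult(model=model, metric=metric)
```

In a level 3 setup, a function like this would itself be built, tested, and deployed by the CI/CD system rather than run by hand.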

The following picture shows the automated ML pipeline with CI/CD routines:

Figure: Automated ML Pipeline (adapted from “MLOps: Continuous delivery and automation pipelines in machine learning”)

The MLOps stages that reflect the process of ML pipeline automation are explained in the following table:

| MLOps Stage | Output of the Stage Execution |
| --- | --- |
| Development & Experimentation (ML algorithms, new ML models) | Source code for pipelines: data extraction, validation, preparation, model training, model evaluation, model testing. |
| Pipeline Continuous Integration (build source code and run tests) | Pipeline components to be deployed: packages and executables. |
| Pipeline Continuous Delivery (deploy pipelines to the target environment) | Deployed pipeline with new implementation of the model. |
| Automated Triggering (pipeline is automatically executed in production, using a schedule or a trigger) | Trained model that is stored in the model registry. |
| Model Continuous Delivery (model serving for prediction) | Deployed model prediction service (e.g. model exposed as REST API). |
| Monitoring (collecting data about the model performance on live data) | Trigger to execute the pipeline or to start a new experiment cycle. |

After analyzing the MLOps Stages, we might notice that the MLOps setup requires several components to be installed or prepared. The following table lists those components:

| MLOps Setup Components | Description |
| --- | --- |
| Source Control | Versioning the Code, Data, and ML Model artifacts. |
| Test & Build Services | Using CI tools for (1) quality assurance for all ML artifacts, and (2) building packages and executables for pipelines. |
| Deployment Services | Using CD tools for deploying pipelines to the target environment. |
| Model Registry | A registry for storing already trained ML models. |
| Feature Store | Preprocessing input data as features to be consumed in the model training pipeline and during the model serving. |
| ML Metadata Store | Tracking metadata of model training, for example model name, parameters, training data, test data, and metric results. |
| ML Pipeline Orchestrator | Automating the steps of the ML experiments. |
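To make the ML Metadata Store component more concrete, the sketch below records the fields named in the table (model name, parameters, training data, test data, and metric results) for each training run. It is an in-memory illustration only; `TrainingRunRecord` and `InMemoryMetadataStore` are hypothetical names, and a real setup would typically use a dedicated tracking service or database.

```python
import json
from dataclasses import asdict, dataclass
from typing import Dict, List


@dataclass
class TrainingRunRecord:
    model_name: str
    parameters: Dict[str, float]
    training_data: str  # identifier of the data set, e.g. a path or version tag
    test_data: str
    metrics: Dict[str, float]


class InMemoryMetadataStore:
    """Toy metadata store: keeps one record per training run."""

    def __init__(self) -> None:
        self._runs: List[TrainingRunRecord] = []

    def log_run(self, record: TrainingRunRecord) -> int:
        """Store a record and return its run id (its position in the list)."""
        self._runs.append(record)
        return len(self._runs) - 1

    def runs_for(self, model_name: str) -> List[TrainingRunRecord]:
        """Return all recorded runs for the given model name."""
        return [r for r in self._runs if r.model_name == model_name]

    def to_json(self) -> str:
        """Serialize all runs, e.g. to persist them next to the model artifacts."""
        return json.dumps([asdict(r) for r in self._runs], sort_keys=True)
```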

Further reading: “MLOps: Continuous delivery and automation pipelines in machine learning”

Continuous X

To understand model deployment, we first specify the “ML assets” as the ML model, its parameters and hyperparameters, training scripts, and training and testing data. We are interested in the identity, components, versioning, and dependencies of these ML artifacts. The target destination for an ML artifact may be a (micro-) service or some infrastructure components. A deployment service provides orchestration, logging, monitoring, and notification to ensure that the ML models, code, and data artifacts are stable.

MLOps is an ML engineering culture that includes the following practices:


Versioning

The goal of versioning is to treat ML training scripts, ML models, and data sets for model training as first-class citizens in DevOps processes by tracking ML models and data sets with version control systems. Common reasons why ML models and data change (according to SIG MLOps) are the following:

Analogously to the best practices for developing reliable software systems, every ML model specification (ML training code that creates an ML model) should go through a code review phase. Furthermore, every ML model specification should be versioned in a VCS to make the training of ML models auditable and reproducible.
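One simple way to make a trained model traceable to its inputs is to derive an identifier from the training code, the data, and the hyperparameters. The sketch below is an assumption-laden illustration, not a substitute for a version control system: `model_version_id` is a hypothetical helper, and the inputs are passed as raw bytes for brevity.

```python
import hashlib
import json


def fingerprint_bytes(data: bytes) -> str:
    """Content hash of a single artifact (training script, data set, ...)."""
    return hashlib.sha256(data).hexdigest()


def model_version_id(training_code: bytes, dataset: bytes, hyperparameters: dict) -> str:
    """Short id that changes whenever code, data, or hyperparameters change."""
    combined = hashlib.sha256()
    parts = (
        fingerprint_bytes(training_code),
        fingerprint_bytes(dataset),
        # sort_keys makes the id independent of dict insertion order
        json.dumps(hyperparameters, sort_keys=True),
    )
    for part in parts:
        combined.update(part.encode("utf-8"))
    return combined.hexdigest()[:12]
```

Storing such an id alongside the model artifact makes it possible to check later whether a given model was produced from a particular combination of code, data, and configuration.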

Further reading: How do we manage ML models? Model Management Frameworks

Experiments Tracking

Machine Learning development is a highly iterative and research-centric process. In contrast to the traditional software development process, in ML development multiple model training experiments can be executed in parallel before deciding which model will be promoted to production.

Experimentation during ML development might proceed as follows. One way to track multiple experiments is to use different Git branches, each dedicated to a separate experiment. The output of each branch is a trained model. The trained ML models are then compared with each other on the selected metric, and the most appropriate model is chosen. Such low-friction branching is fully supported by DVC, an open-source version control system for machine learning projects that extends Git. Another popular tool for tracking ML experiments is the Weights and Biases (wandb) library, which automatically tracks the hyperparameters and metrics of the experiments.
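The comparison step can be sketched without any tracking tool. In the example below, each experiment carries the branch it was run on plus its hyperparameters and metrics; `Experiment` and `select_best` are hypothetical names for illustration and do not correspond to the DVC or wandb APIs.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Experiment:
    branch: str  # e.g. the Git branch dedicated to this experiment
    hyperparameters: Dict[str, float]
    metrics: Dict[str, float]


def select_best(
    experiments: Iterable[Experiment], metric: str, higher_is_better: bool = True
) -> Experiment:
    """Pick the experiment with the best value of the selected metric."""
    candidates = [e for e in experiments if metric in e.metrics]
    if not candidates:
        raise ValueError(f"no experiment reports metric {metric!r}")

    def value(e: Experiment) -> float:
        return e.metrics[metric]

    return max(candidates, key=value) if higher_is_better else min(candidates, key=value)
```

For loss-like metrics, pass `higher_is_better=False` so that the smallest value wins.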


Testing in ML Systems

Figure source: “The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction” by E.Breck et al. 2017

The complete development pipeline includes three essential components: the data pipeline, the ML model pipeline, and the application pipeline. In accordance with this separation, we distinguish three scopes for testing in ML systems: tests for features and data, tests for model development, and tests for ML infrastructure.

Features and Data Tests

Tests for Reliable Model Development

We need to provide specific testing support for detecting ML-specific errors.

ML infrastructure test


Monitoring

Once the ML model has been deployed, it needs to be monitored to ensure that it performs as expected. The following checklist for model monitoring activities in production is adapted from “The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction” by E.Breck et al. 2017:

The picture below shows that model monitoring can be implemented by tracking the precision, recall, and F1-score of the model predictions over time. A decrease in precision, recall, and F1-score triggers model retraining, which leads to model recovery.

Figure: ML Model Decay
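A decay check of this kind can be sketched as follows for a binary classifier. The threshold `max_drop`, the averaging `window`, and the function names are assumptions chosen for illustration; suitable values depend on the application.

```python
from typing import Sequence, Tuple


def precision_recall_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float, float]:
    """Precision, recall, and F1 for binary labels (1 = positive class)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def needs_retraining(
    f1_history: Sequence[float], baseline_f1: float, max_drop: float = 0.05, window: int = 3
) -> bool:
    """True when the mean F1 of the last `window` periods is more than `max_drop` below the baseline."""
    if len(f1_history) < window:
        return False  # not enough observations to judge
    recent = f1_history[-window:]
    return baseline_f1 - sum(recent) / window > max_drop
```

Averaging over a window rather than reacting to a single period is one way to avoid retraining on noise; it comes at the cost of a slower reaction to genuine decay.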

“ML Test Score” System

The “ML Test Score” measures the overall readiness of the ML system for production. The final ML Test Score is computed as follows:

After computing the ML Test Score, we can reason about the readiness of the ML system for production. The following table provides the interpretation ranges:

| Points | Description |
| --- | --- |
| 0 | More of a research project than a productionized system. |
| (0,1] | Not totally untested, but it is worth considering the possibility of serious holes in reliability. |
| (1,2] | There has been a first pass at basic productionization, but additional investment may be needed. |
| (2,3] | Reasonably tested, but it is possible that more of those tests and procedures may be automated. |
| (3,5] | Strong level of automated testing and monitoring. |
| >5 | Exceptional level of automated testing and monitoring. |

Source: “The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction” by E.Breck et al. 2017
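As a rough illustration, the scoring can be sketched in a few lines. The rule assumed here is the one described in the rubric by Breck et al. as we understand it: each test earns half a point if it is executed manually with documented results and a full point if it is automated, points are summed per section, and the final score is the minimum over the sections. The names `POINTS`, `ml_test_score`, and `interpretation_range` are hypothetical, and the rule should be checked against the paper before relying on it.

```python
from typing import Dict, List

# Assumed scoring rule: 0.5 for a manually executed test, 1.0 for an automated one.
POINTS = {"none": 0.0, "manual": 0.5, "automated": 1.0}


def ml_test_score(sections: Dict[str, List[str]]) -> float:
    """Sum the points per section and return the minimum across sections."""
    if not sections:
        raise ValueError("at least one section is required")
    return min(sum(POINTS[status] for status in statuses) for statuses in sections.values())


def interpretation_range(score: float) -> str:
    """Map a score to the 'Points' range used in the interpretation table."""
    if score <= 0:
        return "0"
    if score <= 1:
        return "(0,1]"
    if score <= 2:
        return "(1,2]"
    if score <= 3:
        return "(2,3]"
    if score <= 5:
        return "(3,5]"
    return ">5"
```

Taking the minimum rather than the sum means that a system cannot compensate for a neglected area, such as monitoring, with extra tests elsewhere.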


Reproducibility

Reproducibility in a machine learning workflow means that every phase, whether data processing, ML model training, or ML model deployment, should produce identical results given the same input.

| Phase | Challenges | How to Ensure Reproducibility |
| --- | --- | --- |
| Collecting Data | Generation of the training data can't be reproduced (e.g. due to constant database changes or random data loading). | 1) Always back up your data. 2) Save a snapshot of the data set (e.g. on cloud storage). 3) Data sources should be designed with timestamps so that a view of the data at any point can be retrieved. 4) Data versioning. |
| Feature Engineering | Scenarios: 1) Missing values are imputed with random or mean values. 2) Removing labels based on the percentage of observation. 3) Non-deterministic feature extraction methods. | 1) Feature generation code should be taken under version control. 2) Require reproducibility of the previous step "Collecting Data". |
| Model Training / Model Build | Non-determinism | 1) Ensure the order of features is always the same. 2) Document and automate feature transformation, such as normalization. 3) Document and automate hyperparameter selection. 4) For ensemble learning: document and automate the combination of ML models. |
| Model Deployment | 1) Training the ML model has been performed with a software version that is different from the production environment. 2) The input data required by the ML model is missing in the production environment. | 1) Software versions and dependencies should match the production environment. 2) Use a container (Docker) and document its specification, such as image version. 3) Ideally, the same programming language is used for training and deployment. |
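Two of the measures above, a fixed feature order and the removal of uncontrolled randomness, can be illustrated with the standard library alone. This is a minimal sketch under the assumption that examples are plain dicts and lists; `to_feature_vector` and `shuffled_split` are hypothetical helper names.

```python
import random
from typing import Dict, List, Sequence, Tuple


def to_feature_vector(features: Dict[str, float], feature_order: Sequence[str]) -> List[float]:
    """Build the vector in an explicit, documented order instead of relying on dict order."""
    return [features[name] for name in feature_order]


def shuffled_split(rows: Sequence, test_fraction: float, seed: int) -> Tuple[list, list]:
    """Deterministic train/test split: the same seed always yields the same split."""
    rng = random.Random(seed)  # local generator, leaves the global random state untouched
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    n_test = int(len(rows) * test_fraction)
    test = [rows[i] for i in indices[:n_test]]
    train = [rows[i] for i in indices[n_test:]]
    return train, test
```

Recording the seed and the feature order together with the model makes it possible to rerun the same split and the same input layout later. Numerical libraries and GPU kernels may introduce further sources of non-determinism that this sketch does not cover.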

Loosely Coupled Architecture (Modularity)

According to Gene Kim et al., in their book “Accelerate”, “high performance [in software delivery] is possible with all kinds of systems, provided that systems—and the teams that build and maintain them — are loosely coupled. This key architectural property enables teams to easily test and deploy individual components or services even as the organization and the number of systems it operates grow—that is, it allows organizations to increase their productivity as they scale.”

Additionally, the authors recommend to “use a loosely coupled architecture. This affects the extent to which a team can test and deploy their applications on demand, without requiring orchestration with other services. Having a loosely coupled architecture allows your teams to work independently, without relying on other teams for support and services, which in turn enables them to work quickly and deliver value to the organization.”

Regarding ML-based software systems, it can be more difficult to achieve loose coupling between machine learning components than for traditional software components. ML systems have weak component boundaries in several ways. For example, the outputs of ML models can be used as the inputs to another ML model and such interleaved dependencies might affect one another during training and testing.

Basic modularity can be achieved by structuring the machine learning project. To set up a standard project structure, we recommend using dedicated templates such as

ML-based Software Delivery Metrics (4 metrics from “Accelerate”)

In the most recent study on the state of DevOps, the authors emphasized four key metrics that capture the effectiveness of software development and delivery in elite and high-performing organisations: Deployment Frequency, Lead Time for Changes, Mean Time To Restore, and Change Failure Rate. These metrics have been found useful to measure and improve one's ML-based software delivery. In the following table, we give the definition of each of the metrics and make the connection to MLOps.

| Metric | DevOps | MLOps |
| --- | --- | --- |
| Deployment Frequency | How often does your organization deploy code to production or release it to end-users? | ML Model Deployment Frequency depends on 1) model retraining requirements (ranging from less frequent to online training), for which two aspects are crucial: 1.1) the model decay metric and 1.2) new data availability; 2) the level of automation of the deployment process, which might range between *manual deployment* and *fully automated CI/CD pipeline*. |
| Lead Time for Changes | How long does it take to go from code committed to code successfully running in production? | ML Model Lead Time for Changes depends on 1) the duration of the explorative phase in Data Science needed to finalize the ML model for deployment/serving; 2) the duration of the ML model training; 3) the number and duration of manual steps during the deployment process. |
| Mean Time To Restore (MTTR) | How long does it generally take to restore service when a service incident or a defect that impacts users occurs (e.g., unplanned outage or service impairment)? | ML Model MTTR depends on the number and duration of manually performed model debugging and model deployment steps. If the ML model must be retrained, MTTR also depends on the duration of the ML model training. Alternatively, MTTR refers to the duration of the rollback of the ML model to the previous version. |
| Change Failure Rate | What percentage of changes to production or released to users result in degraded service (e.g., lead to service impairment or service outage) and subsequently require remediation (e.g., require a hotfix, rollback, fix forward, patch)? | ML Model Change Failure Rate can be expressed as the difference between the performance metrics of the currently deployed ML model and those of the previous model, such as Precision, Recall, F-1, accuracy, AUC, ROC, false positives, etc. ML Model Change Failure Rate is also related to A/B testing. |

To improve the effectiveness of the ML development and delivery process one should measure the above four key metrics. A practical way to achieve such effectiveness is to implement the CI/CD pipeline first and adopt test-driven development for Data, ML Model, and Software Code pipelines.
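Measuring the four metrics requires little more than a log of deployments. The sketch below computes them from such a log under simplifying assumptions: the `Deployment` record and its fields are hypothetical, time to restore is measured from the failed deployment to the restore timestamp, and all deployments are treated alike regardless of whether they change code, data, or the model.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, Optional, Sequence


@dataclass
class Deployment:
    committed_at: datetime
    deployed_at: datetime
    failed: bool = False  # the change degraded service and needed remediation
    restored_at: Optional[datetime] = None  # when service was restored after a failure


def delivery_metrics(deployments: Sequence[Deployment], period_days: float) -> Dict[str, Optional[float]]:
    """Compute the four key metrics over an observation period."""
    n = len(deployments)
    if n == 0 or period_days <= 0:
        raise ValueError("need at least one deployment and a positive period")
    lead_seconds = [(d.deployed_at - d.committed_at).total_seconds() for d in deployments]
    failures = [d for d in deployments if d.failed]
    restore_seconds = [
        (d.restored_at - d.deployed_at).total_seconds()
        for d in failures
        if d.restored_at is not None
    ]
    return {
        "deployment_frequency_per_day": n / period_days,
        "mean_lead_time_hours": sum(lead_seconds) / n / 3600,
        "mean_time_to_restore_hours": (
            sum(restore_seconds) / len(restore_seconds) / 3600 if restore_seconds else None
        ),
        "change_failure_rate": len(failures) / n,
    }
```

For ML models, `failed` could be set when the newly deployed model underperforms its predecessor on the monitored metrics, which connects this log to the model monitoring described earlier.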

Summary of MLOps Principles and Best Practices

The complete ML development pipeline includes three levels where changes can occur: Data, ML Model, and Code. This means that in machine learning-based systems, the trigger for a build might be the combination of a code change, data change or model change. The following table summarizes the MLOps principles for building ML-based software:

| MLOps Principles | Data | ML Model | Code |
| --- | --- | --- | --- |
| Versioning | 1) Data preparation pipelines 2) Features store 3) Datasets 4) Metadata | 1) ML model training pipeline 2) ML model (object) 3) Hyperparameters 4) Experiment tracking | 1) Application code 2) Configurations |
| Testing | 1) Data Validation (error detection) 2) Feature creation unit testing | 1) Model specification is unit tested 2) ML model training pipeline is integration tested 3) ML model is validated before being operationalized 4) ML model staleness test (in production) 5) Testing ML model relevance and correctness 6) Testing non-functional requirements (security, fairness, interpretability) | 1) Unit testing 2) Integration testing for the end-to-end pipeline |
| Automation | 1) Data transformation 2) Feature creation and manipulation | 1) Data engineering pipeline 2) ML model training pipeline 3) Hyperparameter/Parameter selection | 1) ML model deployment with CI/CD 2) Application build |
| Reproducibility | 1) Backup data 2) Data versioning 3) Extract metadata 4) Versioning of feature engineering | 1) Hyperparameter tuning is identical between dev and prod 2) The order of features is the same 3) Ensemble learning: the combination of ML models is the same 4) The model pseudo-code is documented | 1) Versions of all dependencies in dev and prod are identical 2) Same technical stack for dev and production environments 3) Reproducing results by providing container images or virtual machines |
| Deployment | 1) Feature store is used in dev and prod environments | 1) Containerization of the ML stack 3) On-premise, cloud, or edge | 1) On-premise, cloud, or edge |
| Monitoring | 1) Data distribution changes (training vs. serving data) 2) Training vs. serving features | 1) ML model decay 2) Numerical stability 3) Computational performance of the ML model | 1) Predictive quality of the application on serving data |
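Monitoring data distribution changes between training and serving data can start with a very simple statistic. The sketch below flags numeric features whose serving mean has moved by more than a chosen number of training standard deviations. This is only one of many possible drift measures, and `mean_shift_in_std`, `drifted_features`, and the default `threshold` are assumptions made for illustration.

```python
import statistics
from typing import Dict, List, Sequence


def mean_shift_in_std(train_values: Sequence[float], serving_values: Sequence[float]) -> float:
    """Absolute shift of the serving mean, in units of the training standard deviation."""
    mu = statistics.fmean(train_values)
    sigma = statistics.pstdev(train_values)
    serving_mu = statistics.fmean(serving_values)
    if sigma == 0:
        # Constant training feature: any change of the mean counts as an unbounded shift.
        return 0.0 if serving_mu == mu else float("inf")
    return abs(serving_mu - mu) / sigma


def drifted_features(
    train: Dict[str, Sequence[float]],
    serving: Dict[str, Sequence[float]],
    threshold: float = 0.5,
) -> List[str]:
    """Names of features present in both sets whose mean shift exceeds the threshold."""
    return sorted(
        name
        for name in train
        if name in serving and mean_shift_in_std(train[name], serving[name]) > threshold
    )
```

A mean-based check misses changes in variance or shape of the distribution, so in practice it would usually be complemented by other statistics.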

Along with the MLOps principles, following a set of best practices should help reduce the “technical debt” of the ML project:

| MLOps Best Practices | Data | ML Model | Code |
| --- | --- | --- | --- |
| Documentation | 1) Data sources 2) Decisions, how/where to get data 3) Labelling methods | 1) Model selection criteria 2) Design of experiments 3) Model pseudo-code | 1) Deployment process 2) How to run locally |
| Project Structure | 1) Data folder for raw and processed data 2) A folder for data engineering pipeline 3) Test folder for data engineering methods | 1) A folder that contains the trained model 2) A folder for notebooks 3) A folder for feature engineering 4) A folder for ML model engineering | 1) A folder for bash/shell scripts 2) A folder for tests 3) A folder for deployment files (e.g. Docker files) |
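The project structure row can be turned into a small scaffolding script. The folder names below are hypothetical choices that mirror the table; they are not a standard, and an existing project template may use different names.

```python
from pathlib import Path
from typing import List

# Hypothetical folder names, one per entry in the "Project Structure" row.
PROJECT_DIRS = [
    "data/raw",                # raw data
    "data/processed",          # processed data
    "data_engineering",        # data engineering pipeline
    "tests/data_engineering",  # tests for data engineering methods
    "models",                  # trained models
    "notebooks",               # notebooks
    "feature_engineering",     # feature engineering
    "model_engineering",       # ML model engineering
    "scripts",                 # bash/shell scripts
    "tests",                   # tests
    "deployment",              # deployment files, e.g. Docker files
]


def scaffold(root: Path) -> List[Path]:
    """Create the folder layout under `root`; existing folders are left untouched."""
    created = []
    for relative in PROJECT_DIRS:
        path = root / relative
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

Calling `scaffold(Path("my-ml-project"))` creates the layout in a new or existing directory.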