Understanding MLOps Processes and its Best Practices


Introduction to ML Project Life Cycle

The ML life cycle consists of several steps, and MLOps is all about advocating for automation and monitoring at each of them. Machine learning project development is an iterative process: we keep iterating over each of these steps (except scoping) during the life cycle of a model to improve the efficiency of the process.

  • For instance, we improve the data when new data comes in, or we engineer new features out of the existing data.
  • We iterate through the modeling process according to the model's performance in production.
    Accordingly, the deployed model gets replaced with the best model developed during iteration.
  • This process continues with each iteration, but one should follow some best practices while iterating. We will talk about these here.

What are the Best Practices of MLOps Processes?

Each MLOps process is described below with its best practices:

  1. Scoping
  2. Data processing Stage
  3. Modelling Stage

What are the Best Practices for Scoping?

Scoping means defining the project goals in terms of machine learning development goals. For instance, the business team might ask us to develop a conversational AI agent for our website that answers users' FAQs. Developing an FAQ-answering agent is the business goal. Once this is clear, we need to define our ML goal: developing a question-answering algorithm based on the existing FAQs.

Best Practices to follow while scoping

  • Understanding the Business Problem

This is a crucial step, even though it seems simple: without a proper understanding of the business problem, the entire development effort may go in vain. The development team therefore needs to be on the same page as the business team (or the team handing out the problem). Understand the problem clearly and get your understanding verified by the stakeholders. Note: Do not proceed with the development plan until the problem is clear.

  • Brainstorming within the team

Once the problem is defined, the team should brainstorm and collect all the solution ideas. The goal here is to think outside the box and explore every idea suggested by the team members.

  • Research About the problem

At this stage, we have clarity on the problem and ideas from the team. Now do thorough research on the problem at your end. The research should be solution-oriented, keeping in mind that we need to come up with a roadmap and an approach doc for the solution (both are elaborated in the next sections).

  • Define the Development plan concretely, aka “Roadmap.”

Once the problem is defined, one needs to come up with a roadmap, i.e., a visual representation of the flow for developing the solution to the problem. The roadmap should contain the following things:

  1. Proposed processes and steps to deliver the solution.
  2. Estimated time for each process, i.e., Timeline.
  3. Special remarks that you think should accompany each process. For example, some dependencies may need to be fulfilled, such as a data dependency on the data engineering team before the EDA process in the data preparation steps.
  4. Once the roadmap is developed, get it verified by the concerned person (in your case, it might be the Subcoach, Coach, etc.) and collect their inputs.
  5. The template can be found here.

  • Prepare Approach Doc

  1. Once the roadmap is clear, one needs to prepare an approach doc. This document contains information about the approach you will use to solve the business problem you are given. For example, if the business problem involves classification, the approach doc should state the initial algorithm(s) you are going to select for the implementation, along with the implementation flow.
  2. The purpose of the approach doc is to give the stakeholders visibility into our approach, so that we can gain their confidence in the development process we are going to follow.
  3. An example template of the approach doc can be found here. Once the approach doc is prepared, get it verified and collect inputs from the stakeholders.

What are the Best Practices for Data Processing?

Here, we will discuss the best practices while processing the data before the modeling stage.

Types of Data problem

The data types for any machine learning problem can be divided into the below categories.

The above figure shows the datasets we can see while developing the ML solution for a business problem. Let’s see the best practices while handling both types.

Best Practices for Defining the dataset for Structured data

Here we will see the best practices for defining the dataset.

  • Information on each column: If the dataset is in tabular format and descriptions of its columns are not provided, maximum effort should be put into getting information on each column to remove ambiguity from the dataset. If the data is unstructured, metadata (information on each field of the dataset) should be requested from the team providing the dataset. It is solely the responsibility of the MLOps team to get this information if it is not present.
  • A clear distinction between features and labels: The first important step in data processing is defining the dataset, i.e., for an ML problem we should know which features (X) need to be considered and what the label (Y) should be. This is a prerequisite: if it is not clear, do not proceed to the other steps. For unstructured data, the labels must also be defined. For example, in an image classification problem, the images become the features, and the labels should be given.
  • Consistency in labeling format for unstructured data: Unstructured data (text, image, audio) often needs to be labeled manually, either by us or by labelers (anyone who is assigned the task of labeling the dataset). If more than one labeler works on the dataset, we must ensure a consistent labeling strategy. For instance, consider labeling images of smartphones as defective or not. In case 1, one labeler has labeled the image as given in figure 1, while for a similar case another labeler has labeled it as given in figure 2. This inconsistency in labeling must be avoided by providing clear instructions to the labelers.

Best Practices while preprocessing the dataset

Remember this: “Always keep track of the dataset,” also known as data versioning. Let’s dive into its best practices.

  1. Use data versioning tools: To record which dataset version each experiment used, we should use data versioning tools like DVC.
  2. Text files for data versioning: If for some reason data versioning tools can’t be used, use text files or Google Sheets to maintain records of the datasets used in the experiments. Maintaining these versioning records is the responsibility of the developer, who needs to be able to reproduce them when asked.
  3. Tracking and reproducible experiments: The main purpose of data versioning is that, when required, one can easily reproduce the experiments conducted with any version of the dataset. This is not possible if the dataset is never versioned.
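
As an illustration of the text-file fallback in point 2, here is a minimal sketch using only the Python standard library. The function names, the CSV columns, and the idea of fingerprinting the data file with a SHA-256 hash are our own assumptions for this example; this is not how DVC works internally.

```python
import csv
import hashlib
from datetime import datetime, timezone
from pathlib import Path


def file_sha256(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record_dataset_version(data_path, log_path, experiment, note=""):
    """Append one row describing the dataset used by an experiment.

    The column names are hypothetical; adapt them to your own records.
    """
    log_path = Path(log_path)
    is_new = not log_path.exists()
    row = {
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
        "experiment": experiment,
        "data_file": str(data_path),
        "sha256": file_sha256(data_path),
        "note": note,
    }
    with open(log_path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(row))
        if is_new:
            writer.writeheader()  # write the header only once
        writer.writerow(row)
    return row
```

Two experiments that log the same hash used byte-identical data; a changed hash signals that the dataset was modified between runs.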

Consistency in Data pipelines

  1. Make data pipelines consistent across development, testing, and production: It is tempting for ML developers to kick-start the development process without paying attention to data pipelines. As a result, the data preprocessing script used for training the model in the development stage often can’t be used in production, or even during scoring. Always aim for consistent data pipelines, which means you can use one pipeline everywhere for data processing.
  2. Fault tolerance of the production pipeline: Give these pipelines the ability to handle any exceptions that may occur while the model is deployed in production. For instance, one needs to handle the scenario where one or more values are missing from the inference data (data in production).
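
The two points above can be sketched as a single preprocessing function that is imported by both the training script and the production scoring code. This is only an illustrative sketch: the feature names and default values are hypothetical, and a real pipeline would compute its fallback values from the training data.

```python
import math

# Hypothetical feature names and fallback values (assumed to have been
# computed once from the training data, e.g. per-column medians).
DEFAULTS = {"screen_size": 6.1, "battery_mah": 4000.0}


def preprocess(record, defaults=DEFAULTS):
    """Turn one raw record (a dict) into a list of numeric features.

    Missing, empty, non-numeric, or NaN values fall back to the default
    instead of raising, so a bad production record does not crash scoring.
    """
    features = []
    for name, fallback in defaults.items():
        value = record.get(name)
        try:
            value = float(value)
            if math.isnan(value):
                value = fallback
        except (TypeError, ValueError):
            value = fallback
        features.append(value)
    return features
```

Because training and inference call the same `preprocess`, a record with a missing battery value, such as `{"screen_size": "6.5"}`, is handled identically in both places.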

Other Miscellaneous points to keep in mind for the data processing stage

  • Balanced train/dev/test: The train/dev/test sets should each represent the full dataset. Let us understand this with an example: consider a dataset with 100 examples of smartphones, of which 30 are positive (defective) and the rest are negative:

Row 2 shows how a split can be non-representative of the actual dataset, since every set must contain 30% of its samples from the positive class. Row 3 in the table shows the correct way.
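
A representative split like the one described above can be produced by splitting each class separately, often called a stratified split. The sketch below is a minimal standard-library version; the 60/20/20 split fractions and the function name are assumptions made for this example.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.6, 0.2, 0.2), seed=0):
    """Split example indices into train/dev/test, preserving class ratios.

    Returns three lists of indices. Each class is shuffled and divided
    separately, so every split keeps roughly the same label proportions.
    """
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    train, dev, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(len(indices) * fractions[0])
        n_dev = round(len(indices) * fractions[1])
        train += indices[:n_train]
        dev += indices[n_train:n_train + n_dev]
        test += indices[n_train + n_dev:]
    return train, dev, test


labels = [1] * 30 + [0] * 70  # 30 defective, 70 non-defective phones
train, dev, test = stratified_split(labels)
for name, split in (("train", train), ("dev", dev), ("test", test)):
    positives = sum(labels[i] for i in split)
    print(name, len(split), positives)
```

With these fractions, the 100 examples are divided into 60/20/20, and each set contains 30% positive examples (18, 6, and 6 respectively).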

  • Prevent data leakage: Data leakage (or leaking) occurs when your training data contains information about the target, but similar data is not available when the model is used for prediction. This results in excellent performance on the training set (and potentially even on the validation data), but poor performance in production. To put it another way, leakage makes a model appear correct until you start making decisions with it. See more here.

What are the best practices for data Modelling?

The best practices of data modelling are described below:

Define Baseline and Benchmark the model

Once you reach the modelling stage, you need to set up a baseline against which to compare the performance of your model across experiments.

  1. Human-Level Performance (HLP) as a baseline: For unstructured data like images, humans can be used to set the baseline accuracy for the model (if the data is small enough and you have labelers). For example, for the computer vision problem of detecting defects in smartphone images, human accuracy at detecting defects on smartphone screens can be measured and then compared against the model.
  2. Quick implementation: The other widely followed option is a quick implementation with a basic algorithm, which is then treated as the baseline. Either way, having a baseline is necessary.
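
As one possible "quick implementation" baseline, the sketch below always predicts the most common training label. The choice of a majority-class predictor is our own assumption for illustration; any basic algorithm can play this role.

```python
from collections import Counter


def majority_baseline(train_labels):
    """Return a predictor that always outputs the most common training label."""
    most_common_label = Counter(train_labels).most_common(1)[0][0]
    return lambda example: most_common_label


def accuracy(predict, examples, labels):
    """Fraction of examples whose prediction matches the label."""
    correct = sum(predict(x) == y for x, y in zip(examples, labels))
    return correct / len(labels)
```

Any model developed later should beat this number on the same dev set; if it does not, the problem is more likely in the data or the pipeline than in the choice of algorithm.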

Model Versioning and Tracking

  1. Use model versioning tools: To record which model version each experiment used, we should use model versioning tools like MLflow.
  2. Text files for model versioning: If for some reason model versioning tools can’t be used, use text files or Google Sheets to maintain records of the models used in the experiments.

Error Analysis once Model is trained

Once the model is trained, error analysis is the process of getting visibility into where the model did not perform well. For example, a classification model might not be performing well on a particular class. Error analysis allows us to improve the model's performance and to audit it at every iteration. The process can be understood with the below diagram.

Let’s see Best practices for the error analysis process.

  • Accuracy is not always the best metric, so check the confusion matrix: Always consider several evaluation metrics while evaluating the model's performance. The confusion matrix and classification report provide metrics such as precision, recall, and F1 score; consider these as well.
  • Brainstorm how things can go wrong with the model and test it:
  1. Performance on different subsets of the dataset (for example, via cross-validation).
  2. Performance on rare classes.
  3. Fairness and bias of the model (check out the fairness section).
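
To see why accuracy alone can mislead, especially on a rare class, here is a small sketch that computes the confusion-matrix counts and the derived metrics for a binary problem. The function name and the toy data (5 defective phones out of 100) are assumptions made for this example.

```python
def binary_report(y_true, y_pred, positive=1):
    """Confusion-matrix counts plus accuracy, precision, recall, and F1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "tp": tp, "fp": fp, "fn": fn, "tn": tn,
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# A model that never predicts "defective" on a 5%-positive dataset.
y_true = [1] * 5 + [0] * 95
y_pred = [0] * 100
report = binary_report(y_true, y_pred)
print(report["accuracy"], report["recall"])
```

Here the model reaches 95% accuracy while its recall on the defective class is zero, which is exactly the kind of failure the confusion matrix exposes.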

Use Data-centric Approach not Model-centric Approach

It is tempting for ML solution developers to use cutting-edge algorithms to solve the problem at hand. Still, it is better to have a simple, more explainable model trained on good data than a complex model trained on bad data.

Best practices for improving the dataset, i.e., following a data-centric approach:

  • Data augmentation for unstructured data: For unstructured data such as images and audio, augmentation is an excellent way to obtain more data, but keep these things in mind while performing augmentation:
  1. Create more examples of the kind on which the algorithm showed poor performance in error analysis.
  2. If possible, check whether the baseline model performs well on this dataset.
  • Feature engineering for structured data: It might not be possible to create new samples for structured data such as online user data, since we cannot simply add new users. For structured datasets, creating new features can be a great option to explore.

Developing Fair and Unbiased ML algorithms

MLOps focuses on building fair and unbiased ML algorithms, so that every end user of the models we serve in production has equal opportunities. This means users are not discriminated against based on race, sex, religion, socioeconomic status, or other categories. For example, a credit card approval application using an ML model at the backend may reject a person based on their race if bias was not eliminated from the data. To avoid such unfair outcomes, follow the best practices regarding bias and fairness given below:

Analyze the data for biases: One should properly analyze the data to make sure there is no representational bias in the dataset. Representational bias means one group of people is left out of the dataset for some reason, for example if the dataset used to train the models excludes darker skin tones. We have mentioned only representational bias here; other biases can be present in the ML workflow, and we need to reduce all of them. See the figure below and follow this link for more information.

Following the above procedure, the model is ready to go to production. For deployment best practices, see the ModelOps best practices section.
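
The data analysis for bias described above can start with something as simple as checking how each group is represented in the dataset and how often it receives the positive outcome. The sketch below is a rough first check only, not a complete fairness audit; the function name and the toy groups "A" and "B" are hypothetical.

```python
from collections import Counter, defaultdict


def group_report(groups, outcomes, positive=1):
    """Per-group share of the dataset and rate of positive outcomes."""
    counts = Counter(groups)
    positives = defaultdict(int)
    for group, outcome in zip(groups, outcomes):
        positives[group] += int(outcome == positive)
    total = len(groups)
    return {
        group: {
            "share": counts[group] / total,
            "positive_rate": positives[group] / counts[group],
        }
        for group in counts
    }


# Toy data: group "B" is under-represented and never approved.
groups = ["A"] * 8 + ["B"] * 2
outcomes = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
for group, stats in group_report(groups, outcomes).items():
    print(group, stats)
```

A group with a very small share, or with a positive rate far from the others, is a signal to go back and examine how the data was collected and labeled.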
