Introduction to MLOps
Artificial Intelligence now reaches almost every field, from healthcare to business to engineering, and machine learning is everywhere. With the latest tools, techniques, and algorithms readily available, developing a machine learning model to solve a problem is no longer the main challenge; the real challenge lies in maintaining these models at massive scale. This blog gives an insight into productionizing machine learning models with MLOps solutions.
The machine learning development process involves the collaboration of four major disciplines – Data Science, Data Engineering, Software Engineering, and traditional DevOps. Each of these disciplines has its own level of operation and its own requirements, with different constraints and velocity.
Current Challenges of Productionizing Machine Learning Models
The data fed to training and the steps performed at the evaluation stage in the data scientist's sandbox can differ dramatically from real-world scenarios. Depending on the use case, data changes over time, and this lack of regularity causes ML models to perform poorly.
Simple to complex pipelines
Training a simple model, putting it into inference, and generating predictions is a simple way of getting business insights. This is usually done by training manually offline and then using the trained model to generate inferences. For most business problems, however, this is not sufficient. Real-world cases need regularity, and over time models need retraining on new data. A retraining pipeline that frequently pulls the latest data from the data lake must be added to the system.
A retraining pipeline produces many models, and human approval is needed to decide which model to choose for production. In other cases, where ensemble models are used to improve accuracy, multiple training pipelines are used, and federated pipelines are even more challenging to maintain.
There are scaling issues at different levels of development. Even if the data pipeline is built to scale, issues arise when feeding the data to ML models, because the models are built in a data scientist's sandbox. They were not developed with scalability in mind; rather, they were developed to achieve good accuracy with the right algorithm. Different ML frameworks are used, and each has its own scaling issues and opportunities. On the hardware side, training a complex neural network requires a powerful GPU, while simple ML models can be processed on a cluster of CPUs.
Production ML Risk
The risk of ML models underperforming is continuous, so models need continuous monitoring and evaluation to check whether they are performing within expected bounds. On live data, metrics such as accuracy, precision, and recall cannot be used, because live data does not have labels. To ensure the health of ML models, different methods such as data deviation detection, drift detection, canary pipelines, and production A/B tests should be used.
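As an illustration of label-free drift detection, the sketch below computes a Population Stability Index (PSI) between a reference sample of one feature and a live sample. PSI is only one possible technique, chosen here for illustration; the function name, the bin count, and the small floor used for empty bins are assumptions and do not come from any particular MLOps product.

```python
import math
from typing import Sequence


def population_stability_index(
    reference: Sequence[float],
    live: Sequence[float],
    bins: int = 10,
) -> float:
    """Compare two samples of one feature; larger values mean more drift."""
    lo, hi = min(reference), max(reference)
    # Equal-width bins over the reference range; fall back to 1.0 for a constant feature.
    width = (hi - lo) / bins or 1.0

    def bin_shares(values: Sequence[float]) -> list[float]:
        counts = [0] * bins
        for value in values:
            index = int((value - lo) / width)
            # Clamp so live values outside the reference range land in the edge bins.
            counts[min(max(index, 0), bins - 1)] += 1
        # A small floor (an assumption) keeps the logarithm defined for empty bins.
        return [max(count / len(values), 1e-6) for count in counts]

    ref_shares = bin_shares(reference)
    live_shares = bin_shares(live)
    return sum(
        (live_p - ref_p) * math.log(live_p / ref_p)
        for ref_p, live_p in zip(ref_shares, live_shares)
    )
```

A monitoring job could compute this value per feature on a schedule and raise an alert when it exceeds a threshold chosen for the use case.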
Process and Collaboration
Production-grade ML systems require multiple skill sets: data scientists, data engineers, business analysts, and operations. Different teams focus on different outcomes. A data scientist focuses on improving accuracy and detecting data deviations, a business analyst wants to improve KPIs, and the operations team wants to see uptime and resource usage.
Unlike the data scientist's sandbox, the production environment has many objects, such as models, algorithms, and pipelines, that are difficult to handle, and versioning them is another issue. Object storage is needed to store the ML models; a source control repository is not the best option.
Continual Analytical Environment and Automation Required for ML Experiments to Production
- Periodic retraining of models and algorithms
- Ease of re-deployment and configuration modifications in the system
- Real-time scoring of out-of-sample observations, with high performance and efficiency
- Monitor model performance over time
- Scaling to massive data operations
- Adaptive ETL and expertise to manage new data feeds and transactional systems as data sources for AI and machine learning tools
- Security and authorized access levels for different areas of the analytical systems
- Reliable backup and recovery processes/tools
These requirements fall into territory traditionally inhabited by DevOps. Data scientists should ideally learn to handle some of them themselves, or at least be informed consultants to classical DevOps experts.
What is MLOps?
MLOps is the communication and collaboration between data scientists and operations teams. It combines the capabilities of data scientists with services designed to automate ML pipelines and extract more valuable insights from production systems. It provides reproducibility, visibility, managed access control, and the computing resources to test, train, and deploy AI algorithms to data engineers, business analysts, and operations teams.
The content above makes clear why MLOps is needed and what led to the rise of this hybrid approach in the modern era of Artificial Intelligence. Moving from the ‘what’ to the ‘why’, let us shed some light on the reasons that led to the use of MLOps in the first place.
Orchestration of multiple pipelines
- Developing a machine learning model is not a single-code-file task. Instead, it involves a combination of different pipelines, each with its own role to perform.
- Pipelines for the primary processes, such as pre-processing, feature engineering, model training, and model inference, are all part of the big picture of developing a machine learning model.
- MLOps plays an essential role in orchestrating these multiple pipelines simply, so that the model is updated automatically.
Manage Full Life Cycle of MLOps
- The life cycle of a machine learning model consists of different sub-parts, each of which should be treated as an individual software entity.
- These sub-parts have their own management and maintenance needs, which are often handled by DevOps, but they are challenging to manage using traditional DevOps methods.
- MLOps is a newly emerged practice that combines people, process, and technology to optimize and deploy machine learning models swiftly and safely.
Scale ML Applications
- As noted earlier, developing models is not the main concern; the real problem lies in managing the models at scale.
- Managing thousands of models at once is a cumbersome and challenging task that tests the performance of the models at scale.
- With MLOps, it becomes natural to scale to managing thousands of model pipelines in production.
Maintain ML Health
- Maintaining ML health after the models are deployed is the most critical part of the post-deployment process. It is vital so that ML models can be operated and managed reliably.
- MLOps provides up-to-date ML health methods by enabling automated detection of different kinds of drift (model drift, data drift).
- It makes it possible to use the latest cutting-edge algorithms to detect drift, so that it can be addressed well before it starts to affect ML health.
Continuous Integration and Deployment
- Continuous integration and deployment is one of the main purposes that led to the use of DevOps in software product development.
- But due to the scale at which ML models operate, it is difficult to use the same continuous integration and deployment methods that are used for other software products.
- MLOps provides dedicated tools and techniques that are specialized for continuous integration and deployment of ML models.
- Under model governance, MLOps can provide rich model performance data by monitoring model attributes at massive scale.
- It can also provide the ability to take snapshots of the pipelines for analyzing critical moments.
- Also, the logging facilities and audit trails under MLOps can be used for reporting and continuity of compliance.
How is MLOps different from DevOps?
- Data/model versioning != code versioning
- Model reuse is an entirely different case from software reuse, because models need tuning based on the scenario and the data.
- Reusing a model requires fine-tuning, for example through transfer learning, which in turn requires a training pipeline.
- Retraining must be available on demand, because models decay over time.
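The last point, retraining on demand as models decay, can be sketched as a small monitor that compares a rolling average of recent evaluation scores with a baseline. This is a minimal illustration under stated assumptions: the class name `DecayMonitor`, the tolerance, and the window size are hypothetical, and it assumes that evaluation scores (for example from delayed labels) become available periodically.

```python
from collections import deque


class DecayMonitor:
    """Signal retraining when recent scores fall below a baseline by a tolerance."""

    def __init__(self, baseline: float, tolerance: float = 0.05, window: int = 3):
        self.baseline = baseline
        self.tolerance = tolerance
        self.recent = deque(maxlen=window)  # keeps only the latest `window` scores

    def record(self, score: float) -> bool:
        """Record one evaluation score; return True when retraining should start."""
        self.recent.append(score)
        if len(self.recent) < self.recent.maxlen:
            return False  # not enough evidence yet
        average = sum(self.recent) / len(self.recent)
        return average < self.baseline - self.tolerance
```

Averaging over a window rather than reacting to a single score is a design choice that avoids triggering a retraining run on one noisy evaluation.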
How to Adopt MLOps?
Overview of Machine Learning Workflows
Reproducibility in ML models
Fault tolerance and iterative refinement of ML models are essential, and both depend on reproducibility. Repeatability is required to eliminate sources of variation such as –
- Inconsistent hyperparameters
- Changes to model architecture
- Random initialization of layer weights
- Shuffling of datasets
- Noisy hidden layers
- Changes in ML frameworks
- CPU multi-threading
- Non-deterministic GPU floating-point calculations
As distributed training is carried out on clusters of GPUs with advanced models and live data streams, this capability becomes very important.
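Two of the sources of variation listed above, random weight initialization and dataset shuffling, can be controlled by seeding. The sketch below uses only Python's standard `random` module and hypothetical helper names; real ML frameworks have their own seeding calls and GPU determinism settings, which are not shown here.

```python
import random


def initial_weights(n: int, seed: int) -> list[float]:
    """Draw n initial weights from a dedicated, seeded generator."""
    rng = random.Random(seed)  # independent of the global random state
    return [rng.gauss(0.0, 0.1) for _ in range(n)]


def shuffled_indices(n: int, seed: int) -> list[int]:
    """Return a reproducible shuffle of the indices 0..n-1."""
    rng = random.Random(seed)
    order = list(range(n))
    rng.shuffle(order)
    return order
```

Recording the seed alongside the hyperparameters of each run makes it possible to repeat the run later with the same initialization and data order.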
Start packaging ML models to support reproducibility. Several tools are available, or a custom tool can be developed according to the use case. The tool should be able to package the model and make it ready to deploy on a platform suited to the use case. The best way to package a model is with Docker, in a containerized environment.
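As a rough idea of what a custom packaging tool might do, the sketch below bundles a serialized model with a manifest that records the Python version, pinned requirements, and a checksum. It is not a replacement for a container image; the file names `model.bin` and `manifest.json` and both function names are assumptions made for illustration.

```python
import hashlib
import io
import json
import sys
import tarfile


def package_model(model_bytes: bytes, requirements: list[str]) -> bytes:
    """Bundle a serialized model and a manifest into a gzip-compressed tar archive."""
    manifest = {
        "python": f"{sys.version_info.major}.{sys.version_info.minor}",
        "requirements": sorted(requirements),
        "model_sha256": hashlib.sha256(model_bytes).hexdigest(),
    }
    files = (
        ("model.bin", model_bytes),
        ("manifest.json", json.dumps(manifest, indent=2).encode()),
    )
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        for name, payload in files:
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


def read_manifest(package: bytes) -> dict:
    """Read the manifest back, e.g. to verify the environment before deployment."""
    with tarfile.open(fileobj=io.BytesIO(package), mode="r:gz") as archive:
        return json.load(archive.extractfile("manifest.json"))
```

A deployment step could compare the manifest with the target environment and refuse to load a model whose checksum or requirements do not match.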
There should be a feedback loop: the model monitoring system should analyze the model's behavior and generate feedback on its performance. It should be in place to check whether a model is performing poorly because of data drift.
Controlling production updates is really difficult in ML pipelines, because it is not only the source code in the pipeline that changes. When a new retrained model is selected, by human approval or by some advanced auto-selection method, the change should be rolled out under proper control so that it causes no instability or downtime in ML applications.
ML pipelines are code, and the DevOps toolchain pipeline plays an essential role in MLOps. Source code repository automation facilitated by Jenkins, together with orchestrators such as Airflow, is a classic example. But ML pipelines bring additional challenges that the typical traditional toolchain cannot address. For example, an ML pipeline can have multiple pipelines running in parallel, and there are interdependencies such as model approval and drift detection. These additional dependencies need to be integrated with the DevOps toolchain pipeline.
Model management is the core of MLOps. MLOps is needed to manage complex pipelines in which a large number of models, objects, and training pipelines are generated.
Being able to reverse changes is necessary in production for stability and fault tolerance. Beyond source code versioning, versioning ML models is an additional step compared with a traditional pipeline.
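Reversible model changes can be illustrated with a small in-memory registry that remembers which versions were promoted to production and can step back to the previous one. This is a sketch under assumptions: the `ModelRegistry` class and its methods are hypothetical, and a real system would persist this state and store the model artifacts in object storage.

```python
from typing import Optional


class ModelRegistry:
    """Track registered model versions and the history of production promotions."""

    def __init__(self) -> None:
        self._versions: list[str] = []  # every registered version, oldest first
        self._history: list[str] = []   # versions promoted to production, in order

    def register(self, version: str) -> None:
        self._versions.append(version)

    def promote(self, version: str) -> None:
        if version not in self._versions:
            raise KeyError(f"unknown model version: {version}")
        self._history.append(version)

    def rollback(self) -> str:
        """Revert production to the previously promoted version and return it."""
        if len(self._history) < 2:
            raise RuntimeError("no earlier production version to roll back to")
        self._history.pop()
        return self._history[-1]

    @property
    def production(self) -> Optional[str]:
        return self._history[-1] if self._history else None
```

Keeping the promotion history separate from the list of registered versions means a rollback never deletes a model; it only changes which version is served.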
Complex model pipelines lead to many model runs, as in the case of ensemble models. Running this many experiments to select the best champion or challenger model requires model tracking. Tools such as MLflow are available for tracking models, or one can build a custom pipeline according to the use case.
Monitoring in ML systems is not just checking the uptime of different services and the resources and compute they consume. In production, ML adds things to monitor that are directly related to the success of business outcomes.
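For the custom route, champion and challenger selection over tracked runs might look like the sketch below. It does not use the MLflow API; the `Run` record, the metric name, and the `min_gain` threshold are assumptions made for illustration, and it assumes that a higher metric value is better.

```python
from dataclasses import dataclass, field


@dataclass
class Run:
    """One tracked experiment: its name, hyperparameters, and resulting metrics."""
    name: str
    params: dict
    metrics: dict = field(default_factory=dict)


def select_champion(
    challengers: list[Run], champion: Run, metric: str, min_gain: float
) -> Run:
    """Return the best challenger if it beats the champion by min_gain, else the champion."""
    best = max(challengers, key=lambda run: run.metrics[metric])
    if best.metrics[metric] - champion.metrics[metric] >= min_gain:
        return best
    return champion
```

Requiring a minimum gain before replacing the champion is one way to avoid swapping production models over differences that are within noise; the result could still be passed to a human approval step.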
ML data drift
A change over time in the relationship between input and output data is called data drift. Data drift analysis should be part of ML monitoring. Observing data drift is essential for deciding whether retraining requires any changes to the configuration of ML models. That is why inference monitoring is needed. Let’s discuss inference monitoring in brief.
Continuously monitor inference and observe whether it stays within the expected bounds. This monitoring reveals mismatches between input and output data, which indicate data drift, and it also reports the performance of the ML model, which makes performance analysis and comparison of models possible.
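A minimal sketch of such a check, assuming the model emits a numeric score per request, keeps a sliding window of recent scores and compares their mean with expected bounds. The class name, the window size, and the status strings are hypothetical.

```python
from collections import deque


class InferenceMonitor:
    """Check whether the mean of recent prediction scores stays within bounds."""

    def __init__(self, lower: float, upper: float, window: int = 100):
        self.lower = lower
        self.upper = upper
        self.scores = deque(maxlen=window)  # sliding window of recent scores

    def observe(self, score: float) -> None:
        self.scores.append(score)

    def status(self) -> str:
        if not self.scores:
            return "no-data"
        mean = sum(self.scores) / len(self.scores)
        return "ok" if self.lower <= mean <= self.upper else "alert"
```

The expected bounds would typically be derived from the score distribution observed during validation.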
Best Practices for Productionizing ML Models
Implement continuous integration (CI), continuous delivery (CD), and a retraining pipeline for a machine learning model using different DevOps and machine learning services.
Version source code
Package Models for Reproducibility
Provision infra as code
- Compute targets
For new code
- Unit tests
- Data quality checks
- Train model
- Machine Learning Service
- Machine Learning Compute Tools
- Machine Learning Pipelines
- Storage Components
- Containers Components (to run and store different Image components)
- Production Components (such as Kubernetes services)
- Application Components (includes monitoring services)
Machine Learning Operations
- Define the process of observability
- Failures should be easily detectable when they occur; set alerts
- Check whether there is a feedback loop for when the model is wrong
- Check how it scales – does it have automatic load balancing?
Build pipeline – Builds the code and runs a combination of tests, which includes –
- Unit test
- Data test
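A data test in the build pipeline could be as simple as the sketch below, which checks rows against an expected schema and a missing-value budget. The function name, the row format (a list of dictionaries), and the issue messages are assumptions made for illustration.

```python
def data_quality_issues(
    rows: list[dict], schema: dict[str, type], max_missing: float = 0.0
) -> list[str]:
    """Return a list of human-readable issues; an empty list means the data passed."""
    if not rows:
        return ["dataset is empty"]
    issues = []
    for column, expected_type in schema.items():
        values = [row.get(column) for row in rows]
        missing = sum(value is None for value in values)
        if missing / len(rows) > max_missing:
            issues.append(f"{column}: {missing} missing values")
        # Type check only applies to values that are present.
        if any(v is not None and not isinstance(v, expected_type) for v in values):
            issues.append(f"{column}: unexpected type")
    return issues
```

The build can then fail whenever the returned list is not empty, in the same way it fails on a broken unit test.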
Retraining pipeline – This pipeline retrains the model on a schedule or when new data becomes available. It covers the following steps –
- Train the model
- Evaluate model
- Register model
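The three steps above can be sketched end to end with a toy model. Everything here is illustrative: the "model" is a least-squares line through the origin, the registry is a plain list, and the `max_error` gate is an assumed acceptance criterion rather than a prescribed one.

```python
from statistics import mean
from typing import Callable

Rows = list[tuple[float, float]]  # (feature, target) pairs


def train(rows: Rows) -> Callable[[float], float]:
    """Fit y = slope * x by least squares and return the prediction function."""
    slope = sum(x * y for x, y in rows) / sum(x * x for x, _ in rows)
    return lambda x: slope * x


def evaluate(model: Callable[[float], float], rows: Rows) -> float:
    """Mean absolute error on held-out rows (lower is better)."""
    return mean(abs(model(x) - y) for x, y in rows)


def retrain(train_rows: Rows, test_rows: Rows, registry: list, max_error: float) -> bool:
    """Train, evaluate, and register the model only if it passes evaluation."""
    model = train(train_rows)
    error = evaluate(model, test_rows)
    if error > max_error:
        return False  # a model that fails evaluation is never registered
    registry.append({"model": model, "error": error})
    return True
```

A scheduler, or a trigger that fires when new data arrives, would call `retrain` and hand any newly registered model to the release pipeline.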
Release pipeline – Operationalizes the scoring image and promotes it safely across different environments. This pipeline has two main components –
- QA environment
- Production environment
It includes the following elements –
- Model artifact trigger
- Building the scoring image
- Deploying on Container Instances
- Test web service
Benefits of Enabling MLOps
- It scales ML initiatives broadly by swiftly and smoothly bringing the true potential of machine learning into existing business processes and systems across the enterprise.
- It helps leverage and maximize investments in existing machine learning and data science tools and technology, and creates a consolidated system of record across different teams and projects.
- It helps data science and IT/Ops teams maintain their partnership and work together to deliver ML-powered applications that provide real value.
- It helps minimize risk to the organization by standardizing and putting in place robust governance checks and balances, and by enabling best practices for machine learning models at the production level.
A Holistic Strategy
To learn more about how machine learning is changing various industries, we recommend taking the following steps –
- Learn more about "DevOps for Machine Learning, TensorFlow and PyTorch"
- Read more about Training and Serving "Machine Learning Platform"