Kubernetes is growing day by day, and nearly every organization is keen to adopt containerization. It has come so far that it can quickly deploy stateful applications such as databases, but on its own it is not much help in managing them. Deploying a stateful application is easy; as time passes, though, the application may need updates such as resizing, reconfiguration, backup, or healing. There are plenty of reasons for such updates, for example a security issue.
This article covers Kubernetes Operators in detail. Suppose an application has a master and slave architecture, or a database runs as a multi-node cluster, and we want to add or remove instances. That may require preparation and post-provisioning steps that are done manually, which puts a burden on the DevOps team. To solve this problem and automate the work, here comes the Operator: an automated software manager that handles the installation and lifecycle of Kubernetes (k8s) applications. The Operator pattern was introduced by CoreOS back in 2016.
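To make the idea concrete, here is a minimal sketch of the kind of operational knowledge an operator automates: given a desired and an observed number of instances, it plans the preparation and post-provisioning steps that would otherwise be run by hand. The `ClusterState` class, the `plan_actions` function, and the step names are all hypothetical and exist only for illustration; a real operator would watch the Kubernetes API and act on the cluster instead of returning strings.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ClusterState:
    """Observed state of a hypothetical multi-node database cluster."""
    replicas: int


def plan_actions(desired_replicas: int, observed: ClusterState) -> List[str]:
    """Return the ordered steps needed to move from observed to desired state."""
    actions: List[str] = []
    if desired_replicas > observed.replicas:
        # Scale up: each new node needs preparation and post-provisioning steps.
        for i in range(observed.replicas, desired_replicas):
            actions.append(f"prepare-storage node-{i}")
            actions.append(f"start node-{i}")
            actions.append(f"join-cluster node-{i}")
    elif desired_replicas < observed.replicas:
        # Scale down: remove the highest-numbered nodes first.
        for i in range(observed.replicas - 1, desired_replicas - 1, -1):
            actions.append(f"drain node-{i}")
            actions.append(f"stop node-{i}")
    # When desired == observed there is nothing to do.
    return actions


if __name__ == "__main__":
    for step in plan_actions(desired_replicas=3, observed=ClusterState(replicas=1)):
        print(step)
```

Running the planner repeatedly against the latest observed state is what lets an operator keep an application at its desired state without manual intervention.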
"An Operator is a means of packaging, deploying, and maintaining a Kubernetes application." (CoreOS) Anyone can build an operator for their application, provided they have expertise in both the application and Kubernetes. Many open-source operators already exist for complex applications such as Prometheus and Kafka. Operators are purpose-built to run a Kubernetes application, with operational knowledge baked in.
In this blog, we are going to see how to use operators to manage and automate the training process for machine learning (ML) tasks. Kubernetes has a lot going for it in ML too, and TensorFlow has become one of the dominant frameworks in the field because it gives developers the ability to train and deploy their models on whatever platform or language they want. Several operators available today ease the training process; two of them, tf-operator and mpi-operator, are used to train models. Let's look at these two operators in detail.
ML needs solid infrastructure to handle all the operations required to train a model. The exact requirements depend on the model being trained, but teams also have to take care of resource management and monitoring, because in the end they must make sure that their ML models are scalable and portable, which is really painful.
So we have seen a lot about Operators and Kubernetes, but why move training to Kubernetes? As we learned above, containerization makes things easy to deploy, manage, and monitor, and that is exactly what ML developers need nowadays, along with automation through DevOps.
What is the TensorFlow operator?
It is a k8s operator that is part of Kubeflow. This operator makes it easy to run TensorFlow jobs on Kubernetes, whether distributed or non-distributed. A TFJob is a Kubernetes custom resource used to run TensorFlow training jobs on Kubernetes. Kubeflow maintains these operators; we can say that Kubeflow is a collection of components that make it easy to run machine learning code in various forms within Kubernetes. The tf-operator is what monitors and keeps track of each TFJob. For example, we can deploy Kubeflow, which brings up the tf-operator deployment, then define our TFJobs and run as many training jobs as we want on the Kubernetes cluster.
To deploy the tf-operator, check out the Kubeflow GitHub repository for details: https://github.com/google/kubeflow.git. After deploying the tf-operator, we have to define and deploy the TFJobs. Let's look at an example. Generally, a TF cluster consists of workers and parameter servers: workers run copies of the training, while parameter servers maintain the model parameters. You can find more in the Kubeflow material on distributed TF, and a sample TFJob is available as well. Once your TFJob YAML is ready, it is time to deploy it on the Kubernetes cluster and start training the model.
The TFJob custom resource defines a TFJob resource for K8s. A TFJob YAML file consists of TfReplicas, and each TfReplica establishes a set of TF processes performing one role in the job: master, worker, or ps. Each TfReplica contains a Kubernetes pod template, and the template specifies the process that runs in each replica. A TFJob can handle distributed as well as non-distributed training.
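As an illustration of the structure just described, the sketch below builds a TFJob manifest with worker and parameter-server replicas and prints it as JSON, which kubectl accepts in place of YAML. It assumes the `kubeflow.org/v1` schema (`tfReplicaSpecs` with `Worker` and `PS` entries and a container named `tensorflow`); the version of tf-operator you deploy may use different field names, so check its documentation. The job name, image, and helper functions are hypothetical.

```python
import json


def replica_spec(replicas: int, image: str) -> dict:
    """One replica group: a replica count plus a Kubernetes pod template."""
    return {
        "replicas": replicas,
        "restartPolicy": "OnFailure",
        "template": {
            "spec": {
                # Assumption: the v1 schema expects the container to be
                # named "tensorflow".
                "containers": [{"name": "tensorflow", "image": image}]
            }
        },
    }


def build_tfjob(name: str, image: str, workers: int, ps: int) -> dict:
    """Assemble a TFJob manifest (assumed kubeflow.org/v1 layout)."""
    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "TFJob",
        "metadata": {"name": name},
        "spec": {
            "tfReplicaSpecs": {
                "PS": replica_spec(ps, image),
                "Worker": replica_spec(workers, image),
            }
        },
    }


if __name__ == "__main__":
    # Hypothetical job name and image, for illustration only.
    job = build_tfjob("demo-tfjob", "example.com/my-training:latest", workers=2, ps=1)
    print(json.dumps(job, indent=2))
```

Saving the printed output to a file and passing it to kubectl create -f would submit the job, after which the tf-operator creates the pods for each replica group.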
So far, we have looked at training TensorFlow models with tf-operator, which relies on centralized parameter servers for coordination between workers. An alternative is a decentralized approach in which workers communicate with each other directly, without parameter servers, via the MPI allreduce primitive. Let's see how we can use mpi-operator to train our models on Kubernetes.
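To show what allreduce achieves, here is a single-process simulation of ring allreduce, one common way to implement the primitive. It is a sketch for intuition only and is not how mpi-operator or any MPI library is implemented; the function names are hypothetical. Every worker starts with its own gradient vector and, after passing chunks around the ring, every worker holds the element-wise sum, with no parameter server involved.

```python
from typing import List, Tuple


def chunk_bounds(length: int, n: int) -> List[Tuple[int, int]]:
    """Split range(length) into n contiguous chunks (some may be empty)."""
    base, extra = divmod(length, n)
    bounds, start = [], 0
    for k in range(n):
        end = start + base + (1 if k < extra else 0)
        bounds.append((start, end))
        start = end
    return bounds


def ring_allreduce(grads: List[List[int]]) -> List[List[int]]:
    """Return per-worker vectors after a simulated ring allreduce (sum)."""
    n = len(grads)
    data = [list(g) for g in grads]  # each worker's local buffer
    bounds = chunk_bounds(len(data[0]), n)

    # Phase 1, reduce-scatter: in each step worker i sends one chunk to
    # worker i+1, which adds it to its own copy of that chunk.
    for step in range(n - 1):
        messages = []
        for i in range(n):
            k = (i - step) % n
            lo, hi = bounds[k]
            messages.append((k, data[i][lo:hi]))  # snapshot before any update
        for i, (k, payload) in enumerate(messages):
            dst = (i + 1) % n
            lo, _ = bounds[k]
            for j, value in enumerate(payload):
                data[dst][lo + j] += value

    # Phase 2, allgather: worker i now owns the fully reduced chunk i+1 and
    # the finished chunks travel around the ring, overwriting stale copies.
    for step in range(n - 1):
        messages = []
        for i in range(n):
            k = (i + 1 - step) % n
            lo, hi = bounds[k]
            messages.append((k, data[i][lo:hi]))
        for i, (k, payload) in enumerate(messages):
            dst = (i + 1) % n
            lo, hi = bounds[k]
            data[dst][lo:hi] = payload
    return data


if __name__ == "__main__":
    workers = [[1, 2, 3, 4], [10, 20, 30, 40], [100, 200, 300, 400]]
    for rank, vec in enumerate(ring_allreduce(workers)):
        print(rank, vec)
```

Each worker only ever talks to its neighbour in the ring, which is why this style of training does not need a central parameter server.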
Guide to mpi-operator
Like tf-operator, mpi-operator is a Kubernetes operator; it is used to run allreduce-style distributed training. Before deploying Kubeflow and the mpi-operator on your Kubernetes cluster, check the project documentation for more details. To deploy the mpi-operator, git clone its repository and, from the repository root, create the resources in the deploy folder with this command:
kubectl create -f deploy/
The command deploys the mpi-operator with the default values. Now check that the CRD was created with the command below.
kubectl get crd
Now that we have successfully deployed the mpi-operator, it is time to define an MPIJob for our machine learning training model. The simplest MPIJob is a short YAML file.
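As a rough sketch of what such a definition can look like, the code below builds a small MPIJob manifest and prints it as JSON, which kubectl accepts in place of YAML. It assumes the `kubeflow.org/v1` schema with `slotsPerWorker` and `mpiReplicaSpecs` (`Launcher` and `Worker`); older releases of mpi-operator use a different layout, so verify against the version you deployed. The job name, image, container name, and mpirun command line are hypothetical.

```python
import json
from typing import List, Optional


def pod_template(image: str, command: Optional[List[str]] = None) -> dict:
    """A pod template with a single container, optionally with a command."""
    container = {"name": "mpi", "image": image}  # hypothetical container name
    if command is not None:
        container["command"] = command
    return {"spec": {"containers": [container]}}


def build_mpijob(name: str, image: str, workers: int) -> dict:
    """Assemble an MPIJob manifest (assumed kubeflow.org/v1 layout)."""
    # The launcher starts mpirun; the workers run the training processes.
    launch = ["mpirun", "-np", str(workers), "python", "train.py"]
    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "MPIJob",
        "metadata": {"name": name},
        "spec": {
            "slotsPerWorker": 1,
            "mpiReplicaSpecs": {
                "Launcher": {"replicas": 1, "template": pod_template(image, launch)},
                "Worker": {"replicas": workers, "template": pod_template(image)},
            },
        },
    }


if __name__ == "__main__":
    # Hypothetical job name and image, for illustration only.
    job = build_mpijob("demo-mpijob", "example.com/my-mpi-training:latest", workers=2)
    print(json.dumps(job, indent=2))
```

The launcher and worker split mirrors the allreduce style of training: there is no parameter-server replica group, only a launcher and a set of peers.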
To verify the status of the MPIJobs, run the following kubectl command:
kubectl get -o yaml mpijobs
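If you would rather read the status programmatically, a small helper can summarize it. The sketch below assumes the same command is run with -o json instead of -o yaml, and that each job reports a `status.conditions` list whose entries carry a `type` and a `status` of "True" or "False"; that layout is an assumption and may differ between mpi-operator versions. The sample output embedded here is made up for illustration and was not captured from a real cluster.

```python
import json
from typing import Dict

# Hypothetical, hand-written sample of `kubectl get -o json mpijobs` output.
SAMPLE_OUTPUT = """
{
  "kind": "List",
  "items": [
    {
      "metadata": {"name": "demo-mpijob"},
      "status": {
        "conditions": [
          {"type": "Created", "status": "True"},
          {"type": "Running", "status": "False"},
          {"type": "Succeeded", "status": "True"}
        ]
      }
    }
  ]
}
"""


def current_state(job: dict) -> str:
    """Return the most recent condition type whose status is "True"."""
    conditions = job.get("status", {}).get("conditions", [])
    active = [c["type"] for c in conditions if c.get("status") == "True"]
    return active[-1] if active else "Unknown"


def summarize(kubectl_json: str) -> Dict[str, str]:
    """Map each job name in the listing to its current state."""
    listing = json.loads(kubectl_json)
    return {
        item["metadata"]["name"]: current_state(item)
        for item in listing.get("items", [])
    }


if __name__ == "__main__":
    for name, state in summarize(SAMPLE_OUTPUT).items():
        print(f"{name}: {state}")
```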
Concluding Kubernetes Operators
The rise of Kubernetes and containerization has made deploying applications and services more accessible: we can scale up as many services as we want and monitor them all without difficulty. And thanks to operators, it has become easier to manage and automate even stateful applications.
A Holistic Strategy
Operators are a method of packaging, deploying, and managing a Kubernetes application, and anyone can use them to train machine learning models on Kubernetes, which makes it possible to automate training and run it smoothly. We can automate this training with operators such as tf-operator and mpi-operator. Kubernetes is now ready for stateful applications too, and data scientists should use k8s to manage and automate the training process by defining TFJobs and MPIJobs. tf-operator and mpi-operator are just two of many operators that an ML developer can use to train models on k8s, whether in distributed or non-distributed mode. To learn more about Kubernetes, we advise taking the following steps.