What are Kubernetes Operators?
Kubernetes adoption is growing day by day, and nearly every organization is keen to adopt container technology. Kubernetes has come far enough that stateful applications such as databases can be deployed on it quickly, but on its own it does little to manage them once they are running. Deploying a stateful application is easy, but as time passes the application will need operational changes such as resizing, reconfiguration, backup, or healing. There are many reasons for such updates, security issues among them.
This article covers Kubernetes Operators in detail. Suppose an application has a master-slave architecture, or is a database running as a multi-node cluster, and we want to add or remove instances. That may require preparation and post-provisioning steps that are done manually, which puts a burden on the DevOps team. Operators solve this problem through automation: an Operator is an automated software manager for Kubernetes applications that handles installation and lifecycle management. The Operator pattern was introduced by CoreOS back in 2016.
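The idea of an automated software manager can be made concrete with a small sketch. Operators are generally built around a control loop that compares the desired state of an application with the state observed in the cluster and works out what to do next. The code below is only an illustration of that idea in plain Python: the `ClusterState` type, the `reconcile` function, and the action strings are hypothetical names invented for this example and are not part of any Kubernetes API.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ClusterState:
    """Hypothetical, simplified view of a stateful application."""
    replicas: int
    version: str


def reconcile(desired: ClusterState, observed: ClusterState) -> List[str]:
    """Return the actions needed to move the observed state to the desired one."""
    actions: List[str] = []
    # Operational knowledge baked in: always back up before an upgrade.
    if observed.version != desired.version:
        actions.append(f"backup before upgrade from {observed.version}")
        actions.append(f"upgrade to {desired.version}")
    # Scale up or down until the replica count matches.
    if observed.replicas < desired.replicas:
        actions.append(f"add {desired.replicas - observed.replicas} replica(s)")
    elif observed.replicas > desired.replicas:
        actions.append(f"remove {observed.replicas - desired.replicas} replica(s)")
    return actions


if __name__ == "__main__":
    desired = ClusterState(replicas=5, version="2.0")
    observed = ClusterState(replicas=3, version="1.9")
    for action in reconcile(desired, observed):
        print(action)
```

A real Operator would run this comparison repeatedly against the Kubernetes API and carry out each action itself, instead of returning strings.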
"An Operator is a means of packaging, deploying, and maintaining a Kubernetes application." - CoreOS

Anyone can build an Operator for their application, provided they have expertise in both the application and Kubernetes. Many open-source Operators now exist for complex applications such as Prometheus and Kafka. Operators are purpose-built to run a Kubernetes application, with operational knowledge baked in.
In this blog, we will see how to use Operators to manage and automate the training process for machine learning tasks. Kubernetes has a lot going for it in ML too. TensorFlow is among the most widely used frameworks in the ML field, and it gives developers the ability to train and deploy their models on whatever platform or in whatever language they want. Several Operators now exist that ease the training process; two of them, tf-operator and mpi-operator, are used to train models. Let's look at these two Operators in detail.
Kubernetes Operator's Goals for Machine Learning Tasks
ML needs solid infrastructure to handle all the operations required to train a model. The exact requirements depend on the models being trained, but in every case teams also have to take care of resource management and monitoring, because in the end the ML models must be scalable and portable, and achieving that by hand is painful.
We have seen a lot about Operators and Kubernetes, but why move training to Kubernetes? As seen above, containerization makes applications easy to deploy, manage, and monitor. That is what ML developers need today, together with automation through DevOps practices.
What is the TensorFlow Operator?
It is a Kubernetes Operator that is part of Kubeflow. It makes it easy to run TensorFlow jobs, distributed or non-distributed, on Kubernetes. TFJob is the Kubernetes custom resource used to run TensorFlow training jobs, and tf-operator is the component that monitors and keeps track of each TFJob. Kubeflow maintains these Operators; it can be described as a collection of components that make it easy to run machine learning code in various forms within Kubernetes. For example, we can deploy Kubeflow, which brings up the tf-operator deployment, then define our TFJobs and run as many training jobs as we want on the Kubernetes cluster.
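Because a TFJob is just a custom resource, its definition can also be generated by a script when many training jobs are needed. The sketch below builds a TFJob manifest as a Python dictionary and prints it as JSON, which kubectl accepts in place of YAML. The field names follow the kubeflow.org/v1beta1 TFJob example used in this article; the helper names `replica_spec` and `build_tfjob`, the job name, and the image name are placeholders invented for this example, and other Kubeflow versions may expect a different apiVersion.

```python
import json
from typing import Dict


def replica_spec(replicas: int, image: str) -> Dict:
    """Build one entry of tfReplicaSpecs (hypothetical helper)."""
    return {
        "replicas": replicas,
        "restartPolicy": "OnFailure",
        "template": {
            "spec": {
                "containers": [
                    {
                        "name": "tensorflow",
                        "image": image,
                        "command": ["python", "-m", "trainer.task"],
                    }
                ]
            }
        },
    }


def build_tfjob(name: str, image: str, workers: int, ps: int = 1) -> Dict:
    """Build a TFJob manifest with parameter servers and workers."""
    return {
        "apiVersion": "kubeflow.org/v1beta1",  # assumption: matches the article's example
        "kind": "TFJob",
        "metadata": {"name": name, "namespace": "kubeflow"},
        "spec": {
            "tfReplicaSpecs": {
                "PS": replica_spec(ps, image),
                "Worker": replica_spec(workers, image),
            }
        },
    }


if __name__ == "__main__":
    manifest = build_tfjob("tfjob-demo", "your_imagename", workers=3)
    print(json.dumps(manifest, indent=2))
```

The printed JSON could be saved to a file and submitted with kubectl in the same way as a hand-written YAML file.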
To deploy tf-operator, check out the Kubeflow GitHub repository for details: "https://github.com/google/kubeflow.git". After deploying tf-operator, we have to define and deploy a TFJob. Let's look at an example to understand it in detail. Generally, a TF cluster consists of workers and parameter servers. Workers run copies of the training, while parameter servers maintain the model parameters. You can find more under Distributed TF in Kubeflow, where a sample TFJob is also available. Once your TFJob YAML is ready, it is time to deploy it on the Kubernetes cluster and start training the model.
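To make the worker and parameter-server roles concrete: distributed TensorFlow conventionally describes the cluster to each process through a `TF_CONFIG` environment variable, and tf-operator sets this variable in the pods it creates. The sketch below shows how a training script might read its own role from that variable. The function name `read_role` and the host names in the example are hypothetical, and the fallback to a single worker when the variable is missing is an assumption made for this illustration.

```python
import json
import os
from typing import Dict, List, Mapping, Tuple


def read_role(environ: Mapping[str, str] = os.environ) -> Tuple[str, int, Dict[str, List[str]]]:
    """Return (task type, task index, cluster layout) from TF_CONFIG."""
    raw = environ.get("TF_CONFIG")
    if not raw:
        # Assumption: with no TF_CONFIG we treat the process as a lone worker.
        return "worker", 0, {}
    config = json.loads(raw)
    task = config.get("task", {})
    return task.get("type", "worker"), int(task.get("index", 0)), config.get("cluster", {})


if __name__ == "__main__":
    # Hypothetical layout: one parameter server and two workers.
    example = {
        "cluster": {
            "ps": ["tfjob-ps-0:2222"],
            "worker": ["tfjob-worker-0:2222", "tfjob-worker-1:2222"],
        },
        "task": {"type": "worker", "index": 1},
    }
    role, index, cluster = read_role({"TF_CONFIG": json.dumps(example)})
    print(f"role={role} index={index} workers={len(cluster.get('worker', []))}")
```

A parameter-server process would see the same cluster layout but a task type of `ps`, which is how one container image can serve every role in the job.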
The TFJob custom resource defines a TFJob resource for Kubernetes. The TFJob YAML file below consists of replica specs (tfReplicaSpecs), each of which describes a set of TF processes performing one role in the job: master, worker, or ps. Each replica spec contains a Kubernetes pod template, and that template specifies the process that runs in each replica. A TFJob can handle distributed as well as non-distributed training.
apiVersion: kubeflow.org/v1beta1
kind: TFJob
metadata:
  generateName: tfjob
  namespace: kubeflow
spec:
  tfReplicaSpecs:
    PS:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: tensorflow
            image: your_imagename
            command:
            - python
            - -m
            - trainer.task
            - --batch_size=32
            - --training_steps=1000
    Worker:
      replicas: 3
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: tensorflow
            image: your_imagename
            command:
            - python
            - -m
            - trainer.task
            - --batch_size=32
            - --training_steps=1000
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: tensorflow
            image: your_imagename
            command:
            - python
            - -m
            - trainer.task
            - --batch_size=32
            - --training_steps=1000

So far we have covered training TensorFlow models with tf-operator, which relies on centralized parameter servers for coordination between workers. An alternative is a decentralized approach in which workers communicate with each other directly, without parameter servers, via the MPI allreduce primitive. Let's see how we can use mpi-operator to train our models on Kubernetes.
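To show what allreduce means for training, the sketch below simulates one common variant, ring allreduce, in a single Python process. Every worker starts with its own gradient vector and ends with the element-wise sum of all of them, exchanging data only with its neighbour in the ring. This is an illustration of the idea and not MPI code: the function name `ring_allreduce` and the way the vector is split into chunks are choices made for this example, and a real job would call an MPI library instead.

```python
from typing import List


def ring_allreduce(grads: List[List[float]]) -> List[List[float]]:
    """Simulate sum-allreduce over a ring; every worker ends with the same sum."""
    n = len(grads)
    length = len(grads[0])
    data = [list(g) for g in grads]  # one local buffer per worker
    # Chunk c covers indices bounds[c] .. bounds[c + 1] - 1.
    bounds = [c * length // n for c in range(n + 1)]

    def chunk(c: int) -> range:
        c %= n
        return range(bounds[c], bounds[c + 1])

    # Phase 1 (scatter-reduce): after n - 1 steps, worker i holds the
    # fully summed chunk (i + 1) % n.
    for step in range(n - 1):
        # Snapshot outgoing chunks so all sends in a step happen "at once".
        sends = [[data[i][k] for k in chunk(i - step)] for i in range(n)]
        for i in range(n):
            dst = (i + 1) % n
            for k, value in zip(chunk(i - step), sends[i]):
                data[dst][k] += value

    # Phase 2 (allgather): pass the finished chunks around the ring.
    for step in range(n - 1):
        sends = [[data[i][k] for k in chunk(i + 1 - step)] for i in range(n)]
        for i in range(n):
            dst = (i + 1) % n
            for k, value in zip(chunk(i + 1 - step), sends[i]):
                data[dst][k] = value

    return data


if __name__ == "__main__":
    workers = [[1, 2, 3, 4], [10, 20, 30, 40], [100, 200, 300, 400]]
    for result in ring_allreduce(workers):
        print(result)  # each worker prints [111, 222, 333, 444]
```

No process here plays the role of a parameter server; each worker only sends to and receives from a neighbour, which is the decentralized pattern described above.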
Guide to mpi-operator
Like tf-operator, mpi-operator is a Kubernetes Operator; it is used to run allreduce-style distributed training. Before deploying Kubeflow and mpi-operator on your Kubernetes cluster, check the project documentation for more details. To deploy mpi-operator, git clone the repo, then run this command against its deploy folder:
kubectl create -f deploy/
The command deploys mpi-operator with the default values. Now check the CRD with the command below.
kubectl get crd
Now that mpi-operator is deployed, it is time to define an MPIJob for our machine learning training. The YAML file below is a very simple MPIJob:
apiVersion: kubeflow.org/v1alpha1
kind: MPIJob
metadata:
  name: tensorflow-benchmarks-16
spec:
  GPUs: 16
  template:
    spec:
      containers:
      - image: mpioperator/tensorflow-benchmarks:latest
        name: tensorflow-benchmarks
To verify the status of the MPIJob, run this kubectl command:
kubectl get -o yaml mpijobs