AIOps Monitoring for Kubernetes and Serverless
- Kubernetes and OpenShift container platforms are complicated to set up, monitor, and maintain. Core Kubernetes components include the API server, the controller manager, the kubelet, and the scheduler. Kubernetes speeds up development by removing the need to build infrastructure for each application. Easier deployment and issue resolution, together with rapid test cycles, help deliver high-quality code at lower operational cost.
- Cluster node monitoring tracks resource utilization across the nodes that accept pod scheduling: pod counts per node, a visualization of which pods are deployed to each node, and evicted pods.
- Pod eviction monitoring helps to check cluster health, manage garbage collection, and review pod issues such as scale-down and excessive load. It also supports event management and root cause analysis.
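As a rough illustration of the node and eviction tracking described above, the sketch below counts pods per node and lists evicted pods. It assumes pod records shaped like Kubernetes Pod API objects (plain dictionaries here); the function name `summarize_pods` and the sample data are hypothetical, and a real collector would fetch the objects from the API server.

```python
from collections import Counter


def summarize_pods(pods):
    """Return (pod count per node, names of evicted pods)."""
    per_node = Counter()
    evicted = []
    for pod in pods:
        status = pod.get("status", {})
        # An evicted pod is reported with phase "Failed" and reason "Evicted".
        if status.get("phase") == "Failed" and status.get("reason") == "Evicted":
            evicted.append(pod["metadata"]["name"])
            continue
        node = pod.get("spec", {}).get("nodeName")
        if node:  # pods that are not yet scheduled have no nodeName
            per_node[node] += 1
    return dict(per_node), evicted


# Hypothetical sample data for illustration only.
sample_pods = [
    {"metadata": {"name": "web-1"}, "spec": {"nodeName": "node-a"},
     "status": {"phase": "Running"}},
    {"metadata": {"name": "web-2"}, "spec": {"nodeName": "node-a"},
     "status": {"phase": "Running"}},
    {"metadata": {"name": "db-1"}, "spec": {"nodeName": "node-b"},
     "status": {"phase": "Running"}},
    {"metadata": {"name": "cache-1"}, "spec": {"nodeName": "node-b"},
     "status": {"phase": "Failed", "reason": "Evicted"}},
]

counts, evicted_pods = summarize_pods(sample_pods)
print(counts, evicted_pods)
```

Evicted pods are excluded from the per-node count here so that the count reflects pods still occupying the node; a different dashboard might prefer to show both.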
Challenges for Building a Monitoring and Alerting Platform
- A real-time solution to monitor load on the Kubernetes cluster, from a high-level overview down to detailed views.
- A real-time alerting platform that generates alerts as soon as data is ingested into the platform.
- A centralized dashboard to define rules on the metrics received by the platform. The alerting platform applies these rules dynamically and integrates with Slack, email, mobile devices, and the web dashboard.
- A log aggregation feature to view all correlated logs for a particular timestamp in a single place.
- An anomaly detection engine to view real-time fluctuations in monitoring data and detect anomalies in cluster health and performance.
- A predictive analysis engine to predict when cluster usage will rise or fall, so that the cluster can be scaled up or down before its nodes crash.
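The predictive requirement in the last item can be sketched with a simple least-squares trend line. This is only one possible approach, not necessarily the model used by the platform; the function names `fit_line` and `minutes_until`, the one-minute sampling interval, and the sample values are assumptions made for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def minutes_until(threshold, xs, ys):
    """Estimate minutes from the last sample until usage reaches threshold.

    Returns None when the trend is flat or falling.
    """
    slope, intercept = fit_line(xs, ys)
    if slope <= 0:
        return None
    crossing = (threshold - intercept) / slope
    return max(0.0, crossing - xs[-1])


# Hypothetical CPU usage (%) sampled once per minute.
minutes = [0, 1, 2, 3]
cpu_percent = [40, 45, 50, 55]
print(minutes_until(80, minutes, cpu_percent))
```

A scaling decision could then be triggered when the estimate drops below the time needed to bring a new node online.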
AIOps Monitoring Service Offerings
Build a platform with a data collection layer that gathers metrics and statistics from clusters, pipelines, and the applications running on them, using REST APIs, agent-based collection over SNMP, and similar methods.
Steps followed while developing this platform:
STEP 1: Defining the metrics
Define metrics that enable monitoring at all levels, covering both ongoing activities and the overall health of the cluster.
STEP 2: Enable Monitoring
An agent-based metrics collector gathers metrics from the cluster and emits them to an HTTP streaming service. An exporter built on that service exports the metrics to Prometheus, a time series database that stores the real-time values of all metrics. Charts and graphs integrated with Prometheus then show the overall health of all clusters on a dashboard.
STEP 3: Enable alerting in critical situations
Define rules on metric values to get real-time notifications and prevent errors or issues in the cluster. When a rule matches, alerts are generated and sent to microservices.
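A threshold rule of this kind can be sketched as follows. The `Rule` class, its fields, and the `firing` function are hypothetical names chosen for illustration; in Prometheus itself such rules are normally written as alerting rules in its own configuration rather than in application code. The `for_samples` field mimics the idea of requiring a condition to hold for some time before alerting.

```python
import operator
from dataclasses import dataclass

# Supported comparison operators for a rule.
OPS = {">": operator.gt, "<": operator.lt, ">=": operator.ge, "<=": operator.le}


@dataclass
class Rule:
    name: str
    metric: str
    op: str
    threshold: float
    for_samples: int = 1  # consecutive breaching samples required to fire


def firing(rule, samples):
    """True when the last `for_samples` values all breach the threshold."""
    if len(samples) < rule.for_samples:
        return False
    compare = OPS[rule.op]
    recent = samples[-rule.for_samples:]
    return all(compare(value, rule.threshold) for value in recent)


# Hypothetical rule: CPU above 90% for three consecutive samples.
high_cpu = Rule("HighCPU", "cpu_percent", ">", 90, for_samples=3)
print(firing(high_cpu, [50, 95, 96, 97]))
```

Requiring several consecutive breaches reduces noise from short spikes, at the cost of a slightly later alert.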
STEP 4: Create an alerting platform to get alerts from Prometheus
Create a microservice that consumes alert messages from Prometheus and routes them according to the cluster's host address. Mobile apps use this API: a user registers a cluster and is notified when any critical event occurs.
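The core of such a microservice might look like the sketch below, which groups firing alerts by host. It assumes the alerts arrive as a JSON webhook payload in the shape sent by Prometheus Alertmanager, and that the cluster host address is carried in the `instance` label; both are assumptions about the deployment, and `route_alerts` and the sample payload are hypothetical.

```python
import json
from collections import defaultdict


def route_alerts(payload):
    """Group firing alerts from a webhook payload by their `instance` label."""
    body = json.loads(payload)
    by_host = defaultdict(list)
    for alert in body.get("alerts", []):
        if alert.get("status") != "firing":
            continue  # skip resolved alerts
        labels = alert.get("labels", {})
        host = labels.get("instance", "unknown")
        by_host[host].append({
            "alert": labels.get("alertname", ""),
            "summary": alert.get("annotations", {}).get("summary", ""),
        })
    return dict(by_host)


# Hypothetical payload for illustration only.
SAMPLE = json.dumps({
    "status": "firing",
    "alerts": [
        {"status": "firing",
         "labels": {"alertname": "HighCPU", "instance": "10.0.0.5:9100"},
         "annotations": {"summary": "CPU above 90%"}},
        {"status": "resolved",
         "labels": {"alertname": "LowDisk", "instance": "10.0.0.6:9100"},
         "annotations": {"summary": "Disk space recovered"}},
    ],
})

print(route_alerts(SAMPLE))
```

Each host key can then be matched against the clusters a user has registered to decide which mobile devices to notify.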
STEP 5: Create an Analytics platform to analyze fluctuations in cluster metrics & log aggregation
Build a platform that runs queries on real-time metrics, calculates fluctuations using machine learning algorithms, and updates the database at regular intervals. Visualize the data on dashboards to see the fluctuation points at any particular timestamp. Aggregate all logs in one place to view the information from every log file at a specific timestamp, which saves a lot of time.
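As a simple stand-in for the fluctuation analysis, the sketch below flags points that deviate strongly from a rolling window of recent values using a z-score. This is a basic statistical technique, not necessarily the machine learning algorithm used by the platform; the function name `find_anomalies`, the window size, the threshold, and the sample series are all assumptions.

```python
from statistics import mean, stdev


def find_anomalies(values, window=5, z_threshold=3.0):
    """Return (index, value) pairs that deviate strongly from recent history."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            continue  # flat history: z-score is undefined, so skip the point
        z = (values[i] - mu) / sigma
        if abs(z) > z_threshold:
            anomalies.append((i, values[i]))
    return anomalies


# Hypothetical CPU usage series (%) with one spike.
cpu_series = [50, 51, 49, 50, 52, 51, 95, 50, 51]
print(find_anomalies(cpu_series))
```

The flagged indexes correspond to timestamps, so the same positions can be used to pull the aggregated logs for that moment.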