Operate. Manage steady-state fleets, orchestrate workloads, and handle capacity.
Operate is the long middle of the infrastructure lifecycle — everything between "it's deployed" and "something's on fire." The tools indexed below help platform teams automate or augment the work in this stage.
Operate tools
Botkube (Kubeshop)
AI-based collaborative Kubernetes troubleshooting platform operating within Slack, Teams, Discord, and Mattermost. AI Insights provide troubleshooting and operations guidance for cluster issues. Real-time alert enrichment adds K8s context to notifications. Enables ChatOps remediation — execute kubectl commands and runbooks directly from chat. Converts passive telemetry notifications into human-in-the-loop operational actions inside the collaboration platform.
Cloud asset management platform with Thinkerbell AI agents for autonomous infrastructure operations. Continuously scans multi-cloud and Kubernetes estates for configuration drift, then triggers AI-assisted auto-remediation. One-click remediation, natural language infrastructure queries, and Cloud Resilience Posture Management with automated cross-region failover (RTO under 1 hour). Automated IaC generation brings unmanaged ClickOps resources under Terraform/Pulumi governance.
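Drift detection of the kind described above boils down to diffing declared (IaC) state against live state. A minimal sketch, with hypothetical resource fields and values rather than this platform's actual API:

```python
def detect_drift(desired: dict, actual: dict, prefix: str = "") -> list[str]:
    """Return dotted paths where the live state diverges from the declared state."""
    drifts = []
    for key, want in desired.items():
        path = f"{prefix}{key}"
        have = actual.get(key)
        if isinstance(want, dict) and isinstance(have, dict):
            # Recurse into nested blocks (e.g. tags) to pinpoint the exact field.
            drifts.extend(detect_drift(want, have, prefix=path + "."))
        elif have != want:
            drifts.append(path)
    return drifts

declared = {"instance_type": "m5.large", "tags": {"env": "prod", "owner": "platform"}}
live = {"instance_type": "m5.xlarge", "tags": {"env": "prod"}}
print(detect_drift(declared, live))  # ['instance_type', 'tags.owner']
```

A real scanner would resolve the declared state from Terraform/Pulumi state files and the live state from cloud APIs; the diff itself is the easy part.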
AI-powered Kubernetes cluster analyzer and remediation tool. Built-in analyzers scan pods, services, deployments, ingresses, and events for misconfigurations and failures, providing plain-English explanations via multiple AI backends (OpenAI, Azure, Bedrock, local models). Operator mode enables continuous in-cluster monitoring. Experimental auto-remediation patches supported resources. MCP server exposes cluster operations as tools for AI assistants.
Autonomous AI SRE platform for cloud-native infrastructure. Klaudia AI Agents perform autonomous investigation of Kubernetes issues by correlating deployment changes, config drift, alerts, and telemetry to identify root cause. Automated remediation playbooks execute operational actions — restart, scale, cordon, drain — with governance guardrails. Continuous drift detection and dynamic pod rightsizing bridge observability data to operational action.
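Remediation with governance guardrails generally means policy checks run before any cluster mutation. A minimal sketch, assuming a hypothetical action allow-list and namespace policy (not this platform's actual interface):

```python
from dataclasses import dataclass, field

ALLOWED_ACTIONS = {"restart", "scale", "cordon", "drain"}

@dataclass
class Guardrails:
    max_scale: int = 10
    protected_namespaces: set = field(default_factory=lambda: {"kube-system"})

def execute(action: str, namespace: str, guardrails: Guardrails, replicas: int = 0) -> str:
    # Every guardrail is evaluated before the cluster is touched.
    if action not in ALLOWED_ACTIONS:
        return f"denied: unknown action {action!r}"
    if namespace in guardrails.protected_namespaces:
        return f"denied: namespace {namespace!r} is protected"
    if action == "scale" and replicas > guardrails.max_scale:
        return f"denied: replicas {replicas} exceeds cap {guardrails.max_scale}"
    # A real agent would call the Kubernetes API here; we only record the decision.
    return f"approved: {action} in {namespace}"

g = Guardrails()
print(execute("scale", "payments", g, replicas=4))  # approved: scale in payments
print(execute("drain", "kube-system", g))           # denied: namespace 'kube-system' is protected
```

The point of the pattern is that the agent's autonomy is bounded by declarative policy, so an investigation that reaches a wrong conclusion cannot take an unbounded action.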
AI SRE platform for Kubernetes built on an eBPF telemetry layer. Autonomously monitors services, spawns AI agents to root-cause issues using complete runtime context (traces, metrics, logs, K8s state, deployments, code changes), performs deployment verification, and generates fix pull requests. The eBPF layer provides the data; AI agents drive operational actions — investigations, verifications, fixes.
ML-based Kubernetes resource optimization platform. Machine learning models trained on historical utilization patterns predict optimal CPU and memory requests and limits. Optimize Live provides automated rightsizing that applies via the Kubernetes API or exports as YAML for GitOps. Node Optimization uses ML to guide cluster autoscaler decisions with predictive algorithms. Java JVM heap size recommendations included.
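Rightsizing from historical utilization can be illustrated with a simple percentile heuristic; the platform itself uses trained ML models, so treat this as a conceptual stand-in with made-up numbers:

```python
def recommend(samples_mcpu: list[int], request_pct: float = 0.90,
              limit_headroom: float = 1.5) -> dict:
    """Set the request at a high percentile of observed usage and the limit
    with multiplicative headroom. A percentile stand-in for ML-based sizing."""
    ordered = sorted(samples_mcpu)
    idx = int(request_pct * (len(ordered) - 1))
    request = ordered[idx]
    return {"request_mcpu": request, "limit_mcpu": int(request * limit_headroom)}

# Hypothetical CPU usage samples (millicores) for one container over time.
usage = [120, 135, 150, 160, 180, 200, 210, 230, 250, 400]
print(recommend(usage))  # {'request_mcpu': 250, 'limit_mcpu': 375}
```

Percentile sizing ignores the temporal patterns (daily cycles, deploy spikes) that the ML models are trained on, which is exactly the gap such products claim to close.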
Enterprise GitOps platform built on Argo CD by its original creators. Akuity Intelligence adds AI-powered Promotion Advisor and Deployment Advisor agents that autonomously analyze Kubernetes event streams and pod logs during stalled rollouts, identify root causes of deployment drift, and execute automated remediation runbooks to ensure successful cluster state reconciliation.
Cloud native application protection platform delivering full-lifecycle container and Kubernetes security with AI-powered behavioral analytics for runtime threat detection. The dedicated Secure AI module extends protection to LLM workloads, detecting OWASP Top 10 for LLM risks, model poisoning, and prompt injection while maintaining supply chain integrity through SBOM generation.
Open-source observability platform with ML-powered Sift investigations and an AI assistant that generates PromQL/LogQL queries from natural language. Adaptive Telemetry automatically drops high-cardinality data before indexing, cutting ingest costs. The open-core model lets you self-host Grafana OSS free or use managed Cloud tiers.
Enterprise CD platform with ML-based deployment verification (AIDA). Auto-detects performance and quality regressions during canary deployments by comparing metrics against historical baselines, then triggers rollback when anomalies exceed thresholds. Predictive deployment risk scoring analyzes code change characteristics to flag high-risk releases before they ship.
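Baseline-comparison verification of a canary can be sketched as follows; the 10% tolerance and the latency samples are illustrative, not the product's actual scoring:

```python
def verify_canary(baseline: list[float], canary: list[float],
                  tolerance: float = 0.10) -> str:
    """Compare the canary's mean metric (e.g. latency) against the stable
    baseline; roll back when the regression exceeds the tolerance."""
    base = sum(baseline) / len(baseline)
    cur = sum(canary) / len(canary)
    regression = (cur - base) / base
    return "rollback" if regression > tolerance else "promote"

print(verify_canary([100, 110, 105], [108, 112, 109]))  # promote
print(verify_canary([100, 110, 105], [140, 150, 145]))  # rollback
```

Real verification compares distributions rather than means and spans many metrics at once, but the decision shape (deviation from historical baseline exceeds threshold, therefore roll back) is the same.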
AI-enhanced Ansible content generation trained on Red Hat Ansible Content Collections. Synthesizes playbooks, roles, and modules from natural language prompts, applies organizational content patterns, and validates output against Ansible best practices. Built for platform teams managing heterogeneous fleets who need consistent, auditable automation — not just faster YAML writing.
Feature management platform with AI-powered Guarded Rollouts. Sequential testing engine progressively increases traffic while monitoring metrics for regressions — ML detects statistically significant negative impact and automatically pauses or rolls back the rollout. Separates deployment from release, enabling rollback without redeployment. First FedRAMP-authorized feature management solution.
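A guarded rollout loop, stripped of the statistics, is progressive traffic plus a regression check at each step. A sketch with a hypothetical fixed threshold standing in for the sequential testing engine:

```python
def guarded_rollout(steps: list[int], metric_at) -> list[tuple[int, str]]:
    """Increase traffic step by step; pause when the metric regresses.
    `metric_at(pct)` returns the error rate observed at that traffic level."""
    history = []
    baseline = metric_at(0)  # error rate with the feature fully off
    for pct in steps:
        observed = metric_at(pct)
        if observed > baseline * 1.2:  # hypothetical 20% regression threshold
            history.append((pct, "paused"))
            return history
        history.append((pct, "ok"))
    return history

# Simulated error rates: healthy until 50% traffic, then a regression appears.
rates = {0: 0.010, 5: 0.010, 25: 0.011, 50: 0.030, 100: 0.030}
print(guarded_rollout([5, 25, 50, 100], rates.get))  # [(5, 'ok'), (25, 'ok'), (50, 'paused')]
```

Sequential testing replaces the fixed threshold with a statistical test that controls false-positive rate while peeking at the data continuously; the control flow is otherwise the same.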
Unified observability platform with ingest-based pricing and New Relic AI (NRAI) for natural language querying, automated root cause analysis, and AIOps alert correlation. MCP server integration enables agentic AI workflows via AWS DevOps Agent. 100 GB/month free tier covers most small production environments.
AI-driven delivery intelligence overlay for Spinnaker and Argo CD. Automates release verification by aggregating APM telemetry and logs, using NLP and ML to generate real-time composite risk scores — Quality, Performance, Reliability, Security — that act as autonomous promotion or rollback gates in progressive delivery pipelines.
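Composite risk gating can be sketched as a weighted blend of per-dimension scores; the dimensions match those above, but the weights and gate threshold here are invented for illustration:

```python
def composite_risk(scores: dict[str, float], weights: dict[str, float],
                   gate: float = 0.7) -> dict:
    """Blend per-dimension scores (0 = risky, 1 = healthy) into one number
    that acts as an autonomous promotion/rollback gate."""
    total_w = sum(weights.values())
    blended = sum(scores[k] * w for k, w in weights.items()) / total_w
    return {"score": round(blended, 3),
            "decision": "promote" if blended >= gate else "rollback"}

release = {"quality": 0.9, "performance": 0.8, "reliability": 0.95, "security": 0.6}
weights = {"quality": 1, "performance": 1, "reliability": 2, "security": 2}
print(composite_risk(release, weights))  # {'score': 0.8, 'decision': 'promote'}
```

The hard part in practice is producing the per-dimension scores from raw telemetry and logs; once they exist, the gate itself is a weighted average and a threshold.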
Event intelligence and AIOps platform that uses ML-based alert grouping, change correlation, and probable-origin analysis to cut noise by up to 90%. Gen-AI agents (Insights, SRE, Shift, Scribe) automate triage, root-cause investigation, on-call handoffs, and incident documentation across the full respond lifecycle.
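ML-based alert grouping is beyond a snippet, but the core idea, clustering alerts that share context and arrive close together in time, can be sketched with a crude service-and-window rule (all fields hypothetical):

```python
def group_alerts(alerts: list[dict], window_s: int = 300) -> list[list[dict]]:
    """Group alerts that share a service tag and arrive within a time window,
    a crude stand-in for ML-based correlation."""
    groups: list[list[dict]] = []
    for alert in sorted(alerts, key=lambda a: a["ts"]):
        for g in groups:
            # Join an existing group if the service matches and the gap is small.
            if g[-1]["service"] == alert["service"] and alert["ts"] - g[-1]["ts"] <= window_s:
                g.append(alert)
                break
        else:
            groups.append([alert])
    return groups

alerts = [
    {"ts": 0,  "service": "checkout", "msg": "latency high"},
    {"ts": 30, "service": "checkout", "msg": "error rate spike"},
    {"ts": 60, "service": "search",   "msg": "pod restart"},
]
print(len(group_alerts(alerts)))  # 2
```

Production grouping also correlates across services using topology and change data, which is where the advertised noise reduction comes from; a per-service window is only the baseline.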
AI-powered cloud workload automation platform using machine learning to predict spot instance interruptions, capacity trends, and pricing fluctuations. Elastigroup provides intelligent cluster management with up to 90% cost savings on interruptible compute; Ocean delivers serverless Kubernetes optimization with automated node provisioning; Eco automates Reserved Instance and Savings Plan lifecycle management.
Orchestration platform for Terraform and OpenTofu stacks with AI Mate assistant, MCP Server, and Catalyst framework for building AI agents. Uses DAG orchestration to manage complex stack dependencies. AI features are built into the CLI and Cloud interface — not a wrapper — enabling context-aware infrastructure changes, code generation, and troubleshooting inside the provisioning workflow.
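DAG orchestration of stack dependencies reduces to a topological sort: each stack applies only after every stack it depends on. A sketch using Python's standard library, with hypothetical stack names:

```python
from graphlib import TopologicalSorter

def run_order(dependencies: dict[str, set[str]]) -> list[str]:
    """Resolve a stack dependency DAG into a safe apply order.
    Each key maps a stack to the set of stacks it depends on."""
    return list(TopologicalSorter(dependencies).static_order())

# Hypothetical stacks: apps depend on the cluster, which depends on the network.
stacks = {
    "network": set(),
    "cluster": {"network"},
    "app-a": {"cluster"},
    "app-b": {"cluster"},
}
print(run_order(stacks))
```

Independent stacks at the same DAG level (here, app-a and app-b) have no ordering constraint between them, which is what lets an orchestrator apply them in parallel.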
Agentless CNAPP and CSPM solution that uses an AI-powered unified risk graph to correlate vulnerabilities, misconfigurations, exposed secrets, and identity risks across AWS, Azure, GCP, Kubernetes, and Snowflake. Prioritizes risks based on actual exploitability and blast-radius analysis rather than theoretical severity, enabling teams to remediate the 1% of issues that matter.
AI-powered cloud optimization platform that autonomously manages Kubernetes resources via Kompass — continuously rightsizing pod CPU/memory, autoscaling persistent volumes, and managing spot instance migrations with sub-40-second replacement. ML models analyze real-time usage patterns for ongoing adjustments without code changes. The Commitment Manager automates AWS Savings Plan and RI purchasing with micro-commitment strategies.
Free, fast AI completion plus Windsurf, an agentic IDE. Cascade is its agent: reads the codebase, plans, edits, verifies. Supports 70+ IDEs including VS Code, JetBrains, and Neovim via plugin. Self-hosted and air-gapped option for regulated teams. Now operating under the Windsurf brand as the flagship product.
AI-first IDE forked from VS Code, by Anysphere. Tab completion is the headline (deep multi-file edit predictions), plus Composer for repo-wide changes, agent mode, and chat that understands the whole codebase via embeddings. Bring-your-own model (Claude, GPT-5, Gemini). Privacy mode keeps your code out of training.
AI pair programmer from GitHub. Inline completions, multi-line suggestions, slash commands, chat for explaining and refactoring, and agent mode that can author PRs end-to-end. Trained on public code with enterprise filters for license-safe output. Available in VS Code, JetBrains, Neovim, Visual Studio, and the GitHub web UI.