Operate. Manage steady-state fleets, orchestrate workloads, and handle capacity.
Operate is the long middle of the infrastructure lifecycle — everything between "it's deployed" and "something's on fire." The tools indexed below help platform teams automate or augment the work in this stage.
Operate tools
Botkube (Kubeshop)
AI-based collaborative Kubernetes troubleshooting platform operating within Slack, Teams, Discord, and Mattermost. AI Insights provide troubleshooting and operations guidance for cluster issues. Real-time alert enrichment adds K8s context to notifications. Enables ChatOps remediation — execute kubectl commands and runbooks directly from chat. Converts passive telemetry notifications into human-in-the-loop operational actions inside the collaboration platform.
Cloud asset management platform with Thinkerbell AI agents for autonomous infrastructure operations. Continuously scans multi-cloud and Kubernetes estates for configuration drift, then triggers AI-assisted auto-remediation. One-click remediation, natural language infrastructure queries, and Cloud Resilience Posture Management with automated cross-region failover (RTO under 1 hour). Automated IaC generation brings unmanaged ClickOps resources under Terraform/Pulumi governance.
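Drift detection of the kind described above boils down to diffing declared (IaC) state against live state. A minimal sketch, with hypothetical resource fields and values rather than this platform's actual API:

```python
def detect_drift(desired: dict, actual: dict, prefix: str = "") -> list[str]:
    """Return dotted paths where the live state diverges from the declared state."""
    drifts = []
    for key, want in desired.items():
        path = f"{prefix}{key}"
        have = actual.get(key)
        if isinstance(want, dict) and isinstance(have, dict):
            # Recurse into nested blocks (e.g. tags) to pinpoint the exact field.
            drifts.extend(detect_drift(want, have, prefix=path + "."))
        elif have != want:
            drifts.append(path)
    return drifts

declared = {"instance_type": "m5.large", "tags": {"env": "prod", "owner": "platform"}}
live = {"instance_type": "m5.xlarge", "tags": {"env": "prod"}}
print(detect_drift(declared, live))  # ['instance_type', 'tags.owner']
```

A real scanner would resolve the declared state from Terraform/Pulumi state files and the live state from cloud APIs; the diff itself is the easy part.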
AI-powered Kubernetes cluster analyzer and remediation tool. Built-in analyzers scan pods, services, deployments, ingresses, and events for misconfigurations and failures, providing plain-English explanations via multiple AI backends (OpenAI, Azure, Bedrock, local models). Operator mode enables continuous in-cluster monitoring. Experimental auto-remediation patches supported resources. MCP server exposes cluster operations as tools for AI assistants.
Autonomous AI SRE platform for cloud-native infrastructure. Klaudia AI Agents perform autonomous investigation of Kubernetes issues by correlating deployment changes, config drift, alerts, and telemetry to identify root cause. Automated remediation playbooks execute operational actions — restart, scale, cordon, drain — with governance guardrails. Continuous drift detection and dynamic pod rightsizing bridge observability data to operational action.
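Remediation with governance guardrails generally means policy checks run before any cluster mutation. A minimal sketch, assuming a hypothetical action allow-list and namespace policy (not this platform's actual interface):

```python
from dataclasses import dataclass, field

ALLOWED_ACTIONS = {"restart", "scale", "cordon", "drain"}

@dataclass
class Guardrails:
    max_scale: int = 10
    protected_namespaces: set = field(default_factory=lambda: {"kube-system"})

def execute(action: str, namespace: str, guardrails: Guardrails, replicas: int = 0) -> str:
    # Every guardrail is evaluated before the cluster is touched.
    if action not in ALLOWED_ACTIONS:
        return f"denied: unknown action {action!r}"
    if namespace in guardrails.protected_namespaces:
        return f"denied: namespace {namespace!r} is protected"
    if action == "scale" and replicas > guardrails.max_scale:
        return f"denied: replicas {replicas} exceeds cap {guardrails.max_scale}"
    # A real agent would call the Kubernetes API here; we only record the decision.
    return f"approved: {action} in {namespace}"

g = Guardrails()
print(execute("scale", "payments", g, replicas=4))  # approved: scale in payments
print(execute("drain", "kube-system", g))           # denied: namespace 'kube-system' is protected
```

The point of the pattern is that the agent's autonomy is bounded by declarative policy, so an investigation that reaches a wrong conclusion cannot take an unbounded action.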
AI SRE platform for Kubernetes built on an eBPF telemetry layer. Autonomously monitors services, spawns AI agents to root-cause issues using complete runtime context (traces, metrics, logs, K8s state, deployments, code changes), performs deployment verification, and generates fix pull requests. The eBPF layer provides the data; AI agents drive operational actions — investigations, verifications, fixes.
ML-based Kubernetes resource optimization platform. Machine learning models trained on historical utilization patterns predict optimal CPU and memory requests and limits. Optimize Live provides automated rightsizing that applies via the Kubernetes API or exports as YAML for GitOps. Node Optimization uses ML to guide cluster autoscaler decisions with predictive algorithms. Java JVM heap size recommendations included.
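Rightsizing from historical utilization can be illustrated with a simple percentile heuristic; the platform itself uses trained ML models, so treat this as a conceptual stand-in with made-up numbers:

```python
def recommend(samples_mcpu: list[int], request_pct: float = 0.90,
              limit_headroom: float = 1.5) -> dict:
    """Set the request at a high percentile of observed usage and the limit
    with multiplicative headroom. A percentile stand-in for ML-based sizing."""
    ordered = sorted(samples_mcpu)
    idx = int(request_pct * (len(ordered) - 1))
    request = ordered[idx]
    return {"request_mcpu": request, "limit_mcpu": int(request * limit_headroom)}

# Hypothetical CPU usage samples (millicores) for one container over time.
usage = [120, 135, 150, 160, 180, 200, 210, 230, 250, 400]
print(recommend(usage))  # {'request_mcpu': 250, 'limit_mcpu': 375}
```

Percentile sizing ignores the temporal patterns (daily cycles, deploy spikes) that the ML models are trained on, which is exactly the gap such products claim to close.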
Enterprise GitOps platform built on Argo CD by its original creators. Akuity Intelligence adds AI-powered Promotion Advisor and Deployment Advisor agents that autonomously analyze Kubernetes event streams and pod logs during stalled rollouts, identify root causes of deployment drift, and execute automated remediation runbooks to ensure successful cluster state reconciliation.
Cloud native application protection platform delivering full-lifecycle container and Kubernetes security with AI-powered behavioral analytics for runtime threat detection. The dedicated Secure AI module extends protection to LLM workloads, detecting OWASP Top 10 for LLM risks, model poisoning, and prompt injection while maintaining supply chain integrity through SBOM generation.
Open-source observability platform with ML-powered Sift investigations and an AI assistant that generates PromQL/LogQL queries from natural language. Adaptive Telemetry automatically drops high-cardinality data before indexing, cutting ingest costs. The open-core model lets you self-host Grafana OSS free or use managed Cloud tiers.
Enterprise CD platform with ML-based deployment verification (AIDA). Auto-detects performance and quality regressions during canary deployments by comparing metrics against historical baselines, then triggers rollback when anomalies exceed thresholds. Predictive deployment risk scoring analyzes code change characteristics to flag high-risk releases before they ship.
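Baseline-comparison verification of a canary can be sketched as follows; the 10% tolerance and the latency samples are illustrative, not the product's actual scoring:

```python
def verify_canary(baseline: list[float], canary: list[float],
                  tolerance: float = 0.10) -> str:
    """Compare the canary's mean metric (e.g. latency) against the stable
    baseline; roll back when the regression exceeds the tolerance."""
    base = sum(baseline) / len(baseline)
    cur = sum(canary) / len(canary)
    regression = (cur - base) / base
    return "rollback" if regression > tolerance else "promote"

print(verify_canary([100, 110, 105], [108, 112, 109]))  # promote
print(verify_canary([100, 110, 105], [140, 150, 145]))  # rollback
```

Real verification compares distributions rather than means and spans many metrics at once, but the decision shape (deviation from historical baseline exceeds threshold, therefore roll back) is the same.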
AI-enhanced Ansible content generation trained on Red Hat Ansible Content Collections. Synthesizes playbooks, roles, and modules from natural language prompts, applies organizational content patterns, and validates output against Ansible best practices. Built for platform teams managing heterogeneous fleets who need consistent, auditable automation — not just faster YAML writing.
Feature management platform with AI-powered Guarded Rollouts. Sequential testing engine progressively increases traffic while monitoring metrics for regressions — ML detects statistically significant negative impact and automatically pauses or rolls back the rollout. Separates deployment from release, enabling rollback without redeployment. First FedRAMP-authorized feature management solution.
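A guarded rollout loop, stripped of the statistics, is progressive traffic plus a regression check at each step. A sketch with a hypothetical fixed threshold standing in for the sequential testing engine:

```python
def guarded_rollout(steps: list[int], metric_at) -> list[tuple[int, str]]:
    """Increase traffic step by step; pause when the metric regresses.
    `metric_at(pct)` returns the error rate observed at that traffic level."""
    history = []
    baseline = metric_at(0)  # error rate with the feature fully off
    for pct in steps:
        observed = metric_at(pct)
        if observed > baseline * 1.2:  # hypothetical 20% regression threshold
            history.append((pct, "paused"))
            return history
        history.append((pct, "ok"))
    return history

# Simulated error rates: healthy until 50% traffic, then a regression appears.
rates = {0: 0.010, 5: 0.010, 25: 0.011, 50: 0.030, 100: 0.030}
print(guarded_rollout([5, 25, 50, 100], rates.get))  # [(5, 'ok'), (25, 'ok'), (50, 'paused')]
```

Sequential testing replaces the fixed threshold with a statistical test that controls false-positive rate while peeking at the data continuously; the control flow is otherwise the same.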
Unified observability platform with ingest-based pricing and New Relic AI (NRAI) for natural language querying, automated root cause analysis, and AIOps alert correlation. MCP server integration enables agentic AI workflows via AWS DevOps Agent. 100 GB/month free tier covers most small production environments.
AI-driven delivery intelligence overlay for Spinnaker and Argo CD. Automates release verification by aggregating APM telemetry and logs, using NLP and ML to generate real-time composite risk scores — Quality, Performance, Reliability, Security — that act as autonomous promotion or rollback gates in progressive delivery pipelines.
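Composite risk gating can be sketched as a weighted blend of per-dimension scores; the dimensions match those above, but the weights and gate threshold here are invented for illustration:

```python
def composite_risk(scores: dict[str, float], weights: dict[str, float],
                   gate: float = 0.7) -> dict:
    """Blend per-dimension scores (0 = risky, 1 = healthy) into one number
    that acts as an autonomous promotion/rollback gate."""
    total_w = sum(weights.values())
    blended = sum(scores[k] * w for k, w in weights.items()) / total_w
    return {"score": round(blended, 3),
            "decision": "promote" if blended >= gate else "rollback"}

release = {"quality": 0.9, "performance": 0.8, "reliability": 0.95, "security": 0.6}
weights = {"quality": 1, "performance": 1, "reliability": 2, "security": 2}
print(composite_risk(release, weights))  # {'score': 0.8, 'decision': 'promote'}
```

The hard part in practice is producing the per-dimension scores from raw telemetry and logs; once they exist, the gate itself is a weighted average and a threshold.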
Event intelligence and AIOps platform that uses ML-based alert grouping, change correlation, and probable-origin analysis to cut noise by up to 90%. Gen-AI agents (Insights, SRE, Shift, Scribe) automate triage, root-cause investigation, on-call handoffs, and incident documentation across the full respond lifecycle.
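ML-based alert grouping is beyond a snippet, but the core idea, clustering alerts that share context and arrive close together in time, can be sketched with a crude service-and-window rule (all fields hypothetical):

```python
def group_alerts(alerts: list[dict], window_s: int = 300) -> list[list[dict]]:
    """Group alerts that share a service tag and arrive within a time window,
    a crude stand-in for ML-based correlation."""
    groups: list[list[dict]] = []
    for alert in sorted(alerts, key=lambda a: a["ts"]):
        for g in groups:
            # Join an existing group if the service matches and the gap is small.
            if g[-1]["service"] == alert["service"] and alert["ts"] - g[-1]["ts"] <= window_s:
                g.append(alert)
                break
        else:
            groups.append([alert])
    return groups

alerts = [
    {"ts": 0,  "service": "checkout", "msg": "latency high"},
    {"ts": 30, "service": "checkout", "msg": "error rate spike"},
    {"ts": 60, "service": "search",   "msg": "pod restart"},
]
print(len(group_alerts(alerts)))  # 2
```

Production grouping also correlates across services using topology and change data, which is where the advertised noise reduction comes from; a per-service window is only the baseline.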
AI-powered cloud workload automation platform using machine learning to predict spot instance interruptions, capacity trends, and pricing fluctuations. Elastigroup provides intelligent cluster management with up to 90% cost savings on interruptible compute; Ocean delivers serverless Kubernetes optimization with automated node provisioning; Eco automates Reserved Instance and Savings Plan lifecycle management.
Orchestration platform for Terraform and OpenTofu stacks with AI Mate assistant, MCP Server, and Catalyst framework for building AI agents. Uses DAG orchestration to manage complex stack dependencies. AI features are built into the CLI and Cloud interface — not a wrapper — enabling context-aware infrastructure changes, code generation, and troubleshooting inside the provisioning workflow.
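DAG orchestration of stack dependencies reduces to a topological sort: each stack applies only after every stack it depends on. A sketch using Python's standard library, with hypothetical stack names:

```python
from graphlib import TopologicalSorter

def run_order(dependencies: dict[str, set[str]]) -> list[str]:
    """Resolve a stack dependency DAG into a safe apply order.
    Each key maps a stack to the set of stacks it depends on."""
    return list(TopologicalSorter(dependencies).static_order())

# Hypothetical stacks: apps depend on the cluster, which depends on the network.
stacks = {
    "network": set(),
    "cluster": {"network"},
    "app-a": {"cluster"},
    "app-b": {"cluster"},
}
print(run_order(stacks))
```

Independent stacks at the same DAG level (here, app-a and app-b) have no ordering constraint between them, which is what lets an orchestrator apply them in parallel.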
Agentless CNAPP and CSPM solution that uses an AI-powered unified risk graph to correlate vulnerabilities, misconfigurations, exposed secrets, and identity risks across AWS, Azure, GCP, Kubernetes, and Snowflake. Prioritizes risks based on actual exploitability and blast-radius analysis rather than theoretical severity, enabling teams to remediate the 1% of issues that matter.
AI-powered cloud optimization platform that autonomously manages Kubernetes resources via Kompass — continuously rightsizing pod CPU/memory, autoscaling persistent volumes, and managing spot instance migrations with sub-40-second replacement. ML models analyze real-time usage patterns for ongoing adjustments without code changes. The Commitment Manager automates AWS Savings Plan and RI purchasing with micro-commitment strategies.
Free, fast AI completion plus Windsurf, an agentic IDE. Cascade is its agent: reads the codebase, plans, edits, verifies. Supports 70+ IDEs including VS Code, JetBrains, and Neovim via plugin. Self-hosted and air-gapped option for regulated teams. Now operating under the Windsurf brand as the flagship product.
AI-first IDE forked from VS Code, by Anysphere. Tab completion is the headline (deep multi-file edit predictions), plus Composer for repo-wide changes, agent mode, and chat that understands the whole codebase via embeddings. Bring-your-own model (Claude, GPT-5, Gemini). Privacy mode keeps your code out of training.
AI pair programmer from GitHub. Inline completions, multi-line suggestions, slash commands, chat for explaining and refactoring, and agent mode that can author PRs end-to-end. Trained on public code with enterprise filters for license-safe output. Available in VS Code, JetBrains, Neovim, Visual Studio, and the GitHub web UI.