A curated directory for senior SREs, infrastructure leads, and platform teams. Every tool is organized by lifecycle stage — plus a DevEx layer for AI coding assistants — and independently rated for AI maturity, so you can tell the difference between tools where AI is core and tools that bolted on a chatbot.
AI-native application security platform converging SAST, DAST, CSPM, IaC scanning, secrets detection, container security, and malware scanning into a single developer-centric workflow. AutoTriage and AI AutoFix use ML and reachability analysis to cut false positives by 95%, with one-click remediation PRs for developers without deep security expertise.
Autonomous configuration optimization platform powered by Reinforcement Learning. It systematically explores millions of configuration permutations — Kubernetes resource limits, JVM parameters, HPA thresholds — to hit strict SLO targets while minimizing infrastructure costs. Operates both offline in pre-production and live in production environments.
Enterprise GitOps platform built on Argo CD by its original creators. Akuity Intelligence adds AI-powered Promotion Advisor and Deployment Advisor agents that autonomously analyze Kubernetes event streams and pod logs during stalled rollouts, identify root causes of deployment drift, and execute automated remediation runbooks to ensure successful cluster state reconciliation.
AI-powered incident orchestration platform with OpsIQ — an intelligent correlation engine using reasoning agents and NLP to group related alerts, cut noise by up to 68%, and suggest resolution actions. Automates escalation, on-call scheduling, and bi-directional ChatOps workflows across 200+ integrations.
Cloud native application protection platform delivering full-lifecycle container and Kubernetes security with AI-powered behavioral analytics for runtime threat detection. The dedicated Secure AI module extends protection to LLM workloads, detecting OWASP Top 10 for LLM risks, model poisoning, and prompt injection while maintaining supply chain integrity through SBOM generation.
Self-hosted Terraform pull-request automation. Runs plan and apply from VCS webhooks with no per-resource SaaS billing. Apache 2.0; operators supply compute, state backend, and platform engineering capacity.
AI-based collaborative Kubernetes troubleshooting platform operating within Slack, Teams, Discord, and Mattermost. AI Insights provide troubleshooting and operations guidance for cluster issues. Real-time alert enrichment adds K8s context to notifications. Enables ChatOps remediation — execute kubectl commands and runbooks directly from chat. Converts passive telemetry notifications into human-in-the-loop operational actions inside the collaboration platform.
Cloud and self-hosted CI/CD with a strong reputation for fast pipelines and config-as-code. Native AI features: AI Test Insights for flaky test detection, ML-driven test splitting, and pipeline anomaly detection. Resource classes from small to GPU. Integrates with every major SCM and cloud, with first-class macOS support for iOS builds.
Cloud cost intelligence platform with an AnyCost engine that ingests billing and usage data from any source without requiring perfect tags. ML-driven anomaly detection identifies cost spikes in real-time, while unit cost analytics attribute spend to specific features, customers, and deployments to align engineering decisions with business outcomes.
Free, fast AI code completion plus Windsurf, an agentic IDE. Cascade, its agent, reads the codebase, plans, edits, and verifies. Supports 70+ IDEs including VS Code, JetBrains, and Neovim via plugin. Self-hosted and air-gapped options are available for regulated teams. Now operates under the Windsurf brand, with Windsurf as the flagship product.
AI-first IDE forked from VS Code, by Anysphere. Tab completion is the headline (deep multi-file edit predictions), plus Composer for repo-wide changes, agent mode, and chat that understands the whole codebase via embeddings. Bring-your-own model (Claude, GPT-5, Gemini). Privacy mode keeps your code out of training.
Full-stack observability platform powered by Watchdog anomaly detection and Bits AI autonomous SRE. Continuously baselines metrics across hosts, containers, and traces to eliminate static thresholds and surface root causes. Bits AI handles incident investigation autonomously — correlating signals, querying logs, and proposing remediations without requiring manual runbook execution.
IaC orchestration platform with embedded AI — Cloud Analyst for cost and compliance insights, AI PR Summaries, and an IaC Code Generator. Manages Terraform, OpenTofu, Pulumi, and CloudFormation with cost estimation, policy enforcement, and workflow automation. AI features are context-aware inside the provisioning lifecycle, not a bolted-on chatbot.
Cloud asset management platform with Thinkerbell AI agents for autonomous infrastructure operations. Continuously scans multi-cloud and Kubernetes estates for configuration drift, then triggers AI-assisted auto-remediation. One-click remediation, natural language infrastructure queries, and Cloud Resilience Posture Management with automated cross-region failover (RTO under 1 hour). Automated IaC generation brings unmanaged ClickOps resources under Terraform/Pulumi governance.
FireHydrant (acquired by Freshworks December 2025)
Runbook-driven incident management platform that automates response coordination from detection through retrospective. AI Copilot auto-generates incident summaries, links similar historical incidents, transcribes war room meetings, and drafts retrospectives. Deep service catalog mapping enforces consistency across complex microservice architectures.
AI-enhanced secrets detection platform using ML for false-positive reduction (Secret Enricher) and permission-scope analysis (Secrets Analyzer) across 450+ secret types. Scans code repositories, Slack workspaces, Jira, and CI/CD pipelines to prevent secrets sprawl, with ggshield pre-commit hooks extended to AI coding assistants like Cursor and Claude Code.
GitHub-native CI/CD that runs workflows triggered by repo events. Hosted runners across Linux, macOS, and Windows or self-hosted on your infrastructure. Tight integration with GitHub Copilot for AI-assisted workflow authoring, Copilot Autofix for security findings, and Copilot agentic PR reviews. The default deploy plumbing for any team already on GitHub.
AI pair programmer from GitHub. Inline completions, multi-line suggestions, slash commands, chat for explaining and refactoring, and agent mode that can author PRs end-to-end. Trained on public code with enterprise filters for license-safe output. Available in VS Code, JetBrains, Neovim, Visual Studio, and the GitHub web UI.
Built-in CI/CD for GitLab: pipelines defined in .gitlab-ci.yml, runners on Linux, macOS, and Windows, and Auto DevOps for opinionated deploys. GitLab Duo brings AI code suggestions, vulnerability explanations, root cause analysis on failed jobs, and chat-based incident triage. The single-application platform sell remains its differentiator vs GitHub plus add-ons.
Open-source observability platform with ML-powered Sift investigations and an AI assistant that generates PromQL/LogQL queries from natural language. Adaptive Telemetry automatically drops high-cardinality data before indexing, cutting ingest costs. The open-core model lets you self-host Grafana OSS free or use managed Cloud tiers.
Enterprise CD platform with ML-based deployment verification (AIDA). Auto-detects performance and quality regressions during canary deployments by comparing metrics against historical baselines, then triggers rollback when anomalies exceed thresholds. Predictive deployment risk scoring analyzes code change characteristics to flag high-risk releases before they ship.
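The baseline comparison described above can be illustrated with a minimal sketch. It is not the vendor's algorithm: it assumes a single latency metric where higher is worse, and a simple one-sided z-score check against historical samples. The function name `canary_verdict`, the threshold, and the sample data are all hypothetical.

```python
from statistics import mean, stdev


def canary_verdict(baseline, canary, z_threshold=3.0):
    """Return "rollback" when the canary mean exceeds the baseline mean
    by more than z_threshold baseline standard deviations."""
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # Flat baseline: any increase counts as a regression.
        return "rollback" if mean(canary) > mu else "promote"
    z = (mean(canary) - mu) / sigma
    return "rollback" if z > z_threshold else "promote"


# Hypothetical latency samples in milliseconds.
baseline_latency_ms = [100, 102, 98, 101, 99, 100, 103, 97]
healthy_canary = [101, 99, 100]
degraded_canary = [140, 150, 145]

print(canary_verdict(baseline_latency_ms, healthy_canary))
print(canary_verdict(baseline_latency_ms, degraded_canary))
```

A production verifier would typically compare several metrics at once and account for seasonality; this sketch only shows the gate logic.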
Managed control plane for HashiCorp Terraform: remote state, run execution, and policy enforcement. Billed by Resources Under Management (RUM) since November 2023. Operated by IBM since the HashiCorp acquisition closed in February 2025.
AI-enhanced Ansible content generation trained on Red Hat Ansible Content Collections. Synthesizes playbooks, roles, and modules from natural language prompts, applies organizational content patterns, and validates output against Ansible best practices. Built for platform teams managing heterogeneous fleets who need consistent, auditable automation — not just faster YAML writing.
Slack-native incident management platform that auto-generates timelines, assigns action items via AI, and runs structured retrospectives without leaving the war room. AI SRE features include an assistant that investigates root cause, drafts post-mortems, and correlates signals across your observability stack.
Proactive FinOps platform that shifts cost management left into CI/CD and IDEs. Parses Terraform, CloudFormation, and CDK plans to generate cost breakdowns before deployment, and equips AI coding agents (Claude Code, GitHub Copilot, Cursor) with a live cloud pricing API covering 10M+ prices to generate budget-compliant infrastructure on the first attempt.
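To make the pre-deployment cost breakdown concrete, here is a rough sketch that reads the `resource_changes` array from Terraform's JSON plan output (`terraform show -json`) and sums a monthly estimate. The price table, the `estimate_monthly_cost` helper, and the example plan are hypothetical and heavily simplified; they do not reflect the platform's pricing API or real cloud prices.

```python
import json

HOURS_PER_MONTH = 730

# Hypothetical hourly prices keyed by (resource type, instance type).
HOURLY_PRICE_USD = {
    ("aws_instance", "t3.micro"): 0.01,
    ("aws_instance", "m5.large"): 0.10,
}


def estimate_monthly_cost(plan_json: str) -> float:
    """Sum the estimated monthly cost of resources the plan will create."""
    plan = json.loads(plan_json)
    total = 0.0
    for rc in plan.get("resource_changes", []):
        change = rc["change"]
        # Only count creations; in-place updates and deletions are ignored here.
        if "create" not in change["actions"]:
            continue
        after = change.get("after") or {}
        price = HOURLY_PRICE_USD.get((rc["type"], after.get("instance_type")))
        if price is not None:
            total += price * HOURS_PER_MONTH
    return round(total, 2)


# Hypothetical plan fragment with two creations and one deletion.
example_plan = {
    "resource_changes": [
        {"address": "aws_instance.web", "type": "aws_instance",
         "change": {"actions": ["create"], "after": {"instance_type": "t3.micro"}}},
        {"address": "aws_instance.api", "type": "aws_instance",
         "change": {"actions": ["create"], "after": {"instance_type": "m5.large"}}},
        {"address": "aws_instance.old", "type": "aws_instance",
         "change": {"actions": ["delete"], "after": None}},
    ]
}

print(estimate_monthly_cost(json.dumps(example_plan)))
```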
AI-powered Kubernetes cluster analyzer and remediation tool. Built-in analyzers scan pods, services, deployments, ingresses, and events for misconfigurations and failures, providing plain-English explanations via multiple AI backends (OpenAI, Azure, Bedrock, local models). Operator mode enables continuous in-cluster monitoring. Experimental auto-remediation patches supported resources. MCP server exposes cluster operations as tools for AI assistants.
Autonomous AI SRE platform for cloud-native infrastructure. Klaudia AI Agents perform autonomous investigation of Kubernetes issues by correlating deployment changes, config drift, alerts, and telemetry to identify root cause. Automated remediation playbooks execute operational actions — restart, scale, cordon, drain — with governance guardrails. Continuous drift detection and dynamic pod rightsizing bridge observability data to operational action.
Kubernetes cost allocation platform that breaks cloud spend down to namespace, pod, and label level with real-time monitoring. ML-driven rightsizing recommendations analyze historical usage patterns to suggest optimal resource requests and limits, while anomaly detection catches cost spikes before the bill arrives. Available as open-core self-hosted and fully managed SaaS.
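One simple way to picture namespace-level allocation is to split a node's cost in proportion to each pod's CPU request. The sketch below assumes exactly that; the `allocate_by_namespace` helper and the pod records are hypothetical, and real allocators generally also weigh memory, GPU, storage, and idle capacity.

```python
from collections import defaultdict


def allocate_by_namespace(node_cost_usd, pods):
    """Split one node's cost across namespaces by share of CPU requests."""
    total_cpu = sum(p["cpu_request"] for p in pods)
    if total_cpu == 0:
        return {}
    shares = defaultdict(float)
    for p in pods:
        shares[p["namespace"]] += node_cost_usd * p["cpu_request"] / total_cpu
    return dict(shares)


# Hypothetical pods scheduled on a single node that costs $100.
pods = [
    {"namespace": "checkout", "cpu_request": 2.0},
    {"namespace": "checkout", "cpu_request": 1.0},
    {"namespace": "search", "cpu_request": 1.0},
]

print(allocate_by_namespace(100.0, pods))
```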
Feature management platform with AI-powered Guarded Rollouts. Sequential testing engine progressively increases traffic while monitoring metrics for regressions — ML detects statistically significant negative impact and automatically pauses or rolls back the rollout. Separates deployment from release, enabling rollback without redeployment. First FedRAMP-authorized feature management solution.
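The progressive-traffic loop described above can be sketched as follows. This is an illustration only: it swaps the sequential statistical test for a fixed relative threshold on error rate, and `guarded_rollout`, the stage list, and the observed rates are hypothetical.

```python
def guarded_rollout(stages, error_rate_at, baseline_error_rate,
                    max_relative_increase=0.5):
    """Step through traffic percentages in order.

    Returns (status, last_safe_percent). The rollout stops at the first
    stage whose observed error rate exceeds the allowed limit.
    """
    limit = baseline_error_rate * (1 + max_relative_increase)
    last_safe = 0
    for pct in stages:
        observed = error_rate_at(pct)
        if observed > limit:
            return ("rolled_back", last_safe)
        last_safe = pct
    return ("completed", last_safe)


# Hypothetical error rates observed at each traffic percentage.
observed_rates = {1: 0.010, 10: 0.011, 50: 0.030, 100: 0.030}

status = guarded_rollout([1, 10, 50, 100], observed_rates.get, 0.010)
print(status)  # ('rolled_back', 10)
```

Because the flag controls exposure rather than the deployed artifact, rolling back here means reverting the traffic percentage, not redeploying.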
AI SRE platform for Kubernetes built on an eBPF telemetry layer. Autonomously monitors services, spawns AI agents to root-cause issues using complete runtime context (traces, metrics, logs, K8s state, deployments, code changes), performs deployment verification, and generates fix pull requests. The eBPF layer provides the data; AI agents drive operational actions — investigations, verifications, fixes.
Unified observability platform with ingest-based pricing and New Relic AI (NRAI) for natural language querying, automated root cause analysis, and AIOps alert correlation. MCP server integration enables agentic AI workflows via AWS DevOps Agent. 100 GB/month free tier covers most small production environments.
Deployment automation platform with AI Deployment Failure Analyzer that examines logs, process configs, and error details to identify root cause and suggest remediation. Recovery Agent diagnoses deployment failures with a single click. MCP Server enables external AI agents to query Octopus infrastructure for change management and audit workflows.
AI-driven delivery intelligence overlay for Spinnaker and Argo CD. Automates release verification by aggregating APM telemetry and logs, using NLP and ML to generate real-time composite risk scores — Quality, Performance, Reliability, Security — that act as autonomous promotion or rollback gates in progressive delivery pipelines.
Agentless cloud security platform using patented SideScanning technology to read cloud configuration and workload runtime state out-of-band without deploying agents. Embeds GenAI-powered investigation and natural language querying to explain attack paths, correlate risks across multi-cloud environments, and guide remediation, with coverage that extends to paused and stopped workloads.
LLM-powered blast radius and risk assessment for deployments. Analyzes incoming code and infrastructure changes against a real-time dependency graph of the cloud environment, delivering natural language predictions of downstream failures before rollout executes. Identifies hidden dependencies like schema changes that break downstream services — catches outages before a single user is affected.
Event intelligence and AIOps platform that uses ML-based alert grouping, change correlation, and probable-origin analysis to cut noise by up to 90%. Gen-AI agents (Insights, SRE, Shift, Scribe) automate triage, root-cause investigation, on-call handoffs, and incident documentation across the full response lifecycle.
CNCF graduated time-series database and metrics scraper. Pull-based model, multi-dimensional data, PromQL, Alertmanager. The default monitoring backbone for Kubernetes. AI angle is downstream: exemplars, vector embeddings via plugins, and AI features in Grafana, Robusta, K8sGPT, and others built on top of Prometheus data.
AI-native infrastructure agent built into Pulumi Cloud. Synthesizes, deploys, and operates infrastructure from natural language or chat, going beyond code completion to understand cloud APIs and state. Targets platform teams using Pulumi with TypeScript, Python, or Go who want to eliminate boilerplate and accelerate developer self-service without losing type safety or policy guardrails.
AI-native incident management platform built for SRE and DevOps teams. Orchestrates the entire response lifecycle from detection to retrospective with AI-powered alert grouping, root cause analysis, a conversational AI assistant in Slack, and automated post-mortem generation.
Remote operations backend for Terraform and OpenTofu with Scalr AI (launched June 2025). Provides intelligent error analysis, AI-generated plan summaries, and natural language policy explanations. Maintains run history, state management, and cost estimation in a unified control plane. Best fit for teams scaling past local Terraform execution who need an opinionated backend with embedded AI assistance.
High-velocity SAST and supply chain security platform powered by Semgrep Assistant. Uses AI Memories to auto-triage findings with 96% accuracy and generate context-aware autofix code patches tailored to your codebase style. The open-source engine drives community adoption while the cloud platform adds management, reporting, and CI/CD blocking policies.
Deployment intelligence platform that tracks every code deployment from commit to production. ML models baseline key health indicators and detect anomalies correlated with specific deployments, providing a Deploy Rating (A-F). Calculates DORA metrics with deployment-level granularity and correlates deployments with PagerDuty incidents for precise rollback targeting.
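As a rough illustration of deployment-level DORA calculations, the sketch below derives deployment frequency, median lead time, and change failure rate from a list of deployment records. The record fields (`committed_at`, `deployed_at`, `caused_incident`) and the `dora_summary` helper are hypothetical, not the platform's data model.

```python
from datetime import datetime, timedelta
from statistics import median


def dora_summary(deployments, window_days):
    """Compute three DORA-style metrics over a fixed reporting window."""
    if not deployments:
        raise ValueError("at least one deployment is required")
    lead_hours = [
        (d["deployed_at"] - d["committed_at"]).total_seconds() / 3600
        for d in deployments
    ]
    failures = sum(1 for d in deployments if d["caused_incident"])
    return {
        "deploys_per_day": len(deployments) / window_days,
        "median_lead_time_hours": median(lead_hours),
        "change_failure_rate": failures / len(deployments),
    }


# Hypothetical records: four deployments over two days, one linked to an incident.
start = datetime(2025, 1, 1, 9, 0)
deployments = [
    {"committed_at": start, "deployed_at": start + timedelta(hours=h),
     "caused_incident": h == 4}
    for h in (1, 2, 3, 4)
]

print(dora_summary(deployments, window_days=2))
```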
AI-native security platform combining DeepCode AI and Evo by Snyk to perform reachability analysis, risk-based prioritization, and auto-generated fix suggestions across SAST, SCA, container, and IaC scanning. Uses symbolic AI to determine whether a vulnerability is reachable in your specific code path, cutting noise by surfacing only exploitable issues with one-click remediation in the IDE and CI pipeline.
Policy-as-code CI/CD platform for IaC with Spacelift Intelligence (launched March 2026). Runs Terraform, OpenTofu, Pulumi, Ansible, and CloudFormation with OPA guardrails, drift detection, and a private module registry. AI features surface plan summaries, policy violations, and remediation paths inside run context — not a side chatbot. Purpose-built for platform teams needing auditability and multi-stack support.
AI-powered cloud workload automation platform using machine learning to predict spot instance interruptions, capacity trends, and pricing fluctuations. Elastigroup provides intelligent cluster management with up to 90% cost savings on interruptible compute; Ocean delivers serverless Kubernetes optimization with automated node provisioning; Eco automates Reserved Instance and Savings Plan lifecycle management.
End-to-end incident response platform with ML-based Intelligent Alert Grouping that reduces noise by grouping related alerts, AI-generated incident summaries, Auto Pause Transient Alerts for suppressing ephemeral flapping, and Past Incident Insights for historical pattern matching. SLO dashboards connect incident response to reliability engineering.
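Alert grouping is easier to picture with a toy example. The sketch below is a plain time-window heuristic, not the ML model the entry describes: alerts for the same service are merged when they arrive within a fixed gap of the previous one. The `group_alerts` helper, the alert fields, and the five-minute window are assumptions.

```python
def group_alerts(alerts, window_seconds=300):
    """Group alerts per service when each arrives within window_seconds
    of the previous alert for that service."""
    groups = []
    open_group = {}  # service name -> most recent group for that service
    for alert in sorted(alerts, key=lambda a: a["ts"]):
        current = open_group.get(alert["service"])
        if current and alert["ts"] - current[-1]["ts"] <= window_seconds:
            current.append(alert)
        else:
            current = [alert]
            groups.append(current)
            open_group[alert["service"]] = current
    return groups


# Hypothetical alerts with timestamps in seconds.
alerts = [
    {"service": "db", "ts": 0},
    {"service": "api", "ts": 30},
    {"service": "db", "ts": 60},
    {"service": "db", "ts": 120},
    {"service": "db", "ts": 1000},
]

groups = group_alerts(alerts)
print(len(alerts), "alerts ->", len(groups), "groups")
```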
AI-native agentic infrastructure platform. Uses agentic workflows (Aiden AI) to generate Terraform, Kubernetes manifests, and security policies from application context, not just prompts. Understands application dependencies and cloud-native patterns to produce production-ready, policy-compliant infrastructure. Targets platform teams building internal developer platforms who need golden-path provisioning without hand-coding every module.
ML-based Kubernetes resource optimization platform. Machine learning models trained on historical utilization patterns predict optimal CPU and memory requests and limits. Optimize Live provides automated rightsizing that applies via the Kubernetes API or exports as YAML for GitOps. Node Optimization uses ML to guide cluster autoscaler decisions with predictive algorithms. Java JVM heap size recommendations included.
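A common non-ML baseline for rightsizing is to take a high percentile of historical usage and add headroom. The sketch below shows that baseline only; it is not the product's model, and `recommend_request`, the percentile, the headroom, and the sample data are hypothetical.

```python
import math


def recommend_request(samples, percentile=90, headroom_pct=15):
    """Recommend a resource request from usage samples using the
    nearest-rank percentile plus a percentage of headroom."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(percentile * len(ordered) / 100))
    return ordered[rank - 1] * (100 + headroom_pct) / 100


# Hypothetical CPU usage samples in millicores, including one spike.
cpu_millicores = [200, 210, 220, 230, 240, 250, 260, 270, 280, 900]

print(recommend_request(cpu_millicores))
```

Using a percentile rather than the maximum keeps a single spike (900m here) from inflating the recommendation; the result could then be applied through the Kubernetes API or exported as YAML, as the entry describes.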
Orchestration platform for Terraform and OpenTofu stacks with AI Mate assistant, MCP Server, and Catalyst framework for building AI agents. Uses DAG orchestration to manage complex stack dependencies. AI features are built into the CLI and Cloud interface — not a wrapper — enabling context-aware infrastructure changes, code generation, and troubleshooting inside the provisioning workflow.
Cloud cost optimization platform featuring an Automated FinOps Agent that uses AI for virtual tagging and waste detection across multi-cloud environments. Predictive Autopilot algorithms autonomously manage AWS Savings Plans, while specialized cost tracking for generative AI infrastructure covers GPU compute, vector databases, and LLM API inference costs.
Agentless CNAPP and CSPM solution that uses an AI-powered unified risk graph to correlate vulnerabilities, misconfigurations, exposed secrets, and identity risks across AWS, Azure, GCP, Kubernetes, and Snowflake. Prioritizes risks based on actual exploitability and blast-radius analysis rather than theoretical severity, enabling teams to remediate the 1% of issues that matter.
AI-powered cloud optimization platform that autonomously manages Kubernetes resources via Kompass — continuously rightsizing pod CPU/memory, autoscaling persistent volumes, and managing spot instance migrations with sub-40-second replacement. ML models analyze real-time usage patterns for ongoing adjustments without code changes. The Commitment Manager automates AWS Savings Plan and RI purchasing with micro-commitment strategies.