Observe
Collect and analyze metrics, logs, and traces to understand system health.
Observability is the difference between knowing your system is degraded and knowing why. Tools indexed below help platform teams automate or augment the work in this stage.
Observe tools
Primary
4 tools
Datadog
verified
Full-stack observability platform powered by Watchdog anomaly detection and Bits AI autonomous SRE. Continuously baselines metrics across hosts, containers, and traces to eliminate static thresholds and surface root causes. Bits AI handles incident investigation autonomously — correlating signals, querying logs, and proposing remediations without requiring manual runbook execution.
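As a rough illustration of baselining versus a static threshold, the sketch below flags points that deviate sharply from a trailing-window baseline. It is a plain rolling z-score, not the vendor's actual algorithm; the function name, window size, threshold, and sample data are all assumptions made for the example.

```python
from statistics import mean, stdev

def find_anomalies(series, window=5, z_threshold=3.0):
    """Return indices whose value deviates from the trailing-window baseline.

    Hypothetical helper: a simple rolling z-score, used here only to show
    how a learned baseline can replace a fixed alert threshold.
    """
    anomalies = []
    for i in range(window, len(series)):
        baseline = series[i - window:i]
        mu = mean(baseline)
        sigma = stdev(baseline)
        if sigma == 0:
            # Flat baseline: treat any change at all as a deviation.
            if series[i] != mu:
                anomalies.append(i)
            continue
        if abs(series[i] - mu) / sigma > z_threshold:
            anomalies.append(i)
    return anomalies

# Made-up CPU utilisation samples (percent) with one obvious spike.
cpu_percent = [41, 43, 42, 40, 44, 42, 43, 95, 42, 41]
print(find_anomalies(cpu_percent))
```

A static rule such as "alert above 90%" would also catch this spike, but it would miss a jump from 5% to 40% on a normally idle host, which a baseline-relative check can surface.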
Open-source observability platform with ML-powered Sift investigations and an AI assistant that generates PromQL/LogQL queries from natural language. Adaptive Telemetry automatically drops high-cardinality data before indexing, cutting ingest costs. The open-core model lets you self-host Grafana OSS free or use managed Cloud tiers.
Unified observability platform with ingest-based pricing and New Relic AI (NRAI) for natural language querying, automated root cause analysis, and AIOps alert correlation. MCP server integration enables agentic AI workflows via AWS DevOps Agent. 100 GB/month free tier covers most small production environments.
CNCF graduated time-series database and metrics scraper. Pull-based model, multi-dimensional data, PromQL, Alertmanager. The default monitoring backbone for Kubernetes. AI angle is downstream: exemplars, vector embeddings via plugins, and AI features in Grafana, Robusta, K8sGPT, and others built on top of Prometheus data.
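To make the multi-dimensional model and PromQL concrete, here is a small sketch that builds a request URL for Prometheus's instant-query HTTP endpoint (`/api/v1/query`). The metric name `http_requests_total`, the `job="api"` label, and the helper name are assumptions for illustration; the server address is the conventional local default.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def instant_query_url(base_url, promql, at_time=None):
    """Build a URL for the Prometheus instant-query endpoint (hypothetical helper)."""
    params = {"query": promql}
    if at_time is not None:
        # Optional evaluation timestamp (Unix seconds).
        params["time"] = at_time
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode(params)}"

# Assumed metric and labels: per-status request rate over 5 minutes.
promql = 'sum by (status) (rate(http_requests_total{job="api"}[5m]))'
url = instant_query_url("http://localhost:9090", promql)

# Round-trip the encoded query to confirm nothing was lost in escaping.
decoded = parse_qs(urlsplit(url).query)["query"][0]
print(decoded == promql)
```

The label selector inside the braces is what "multi-dimensional" means in practice: the same metric name can be sliced by any label without defining a new series name.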
AI-based collaborative Kubernetes troubleshooting platform operating within Slack, Teams, Discord, and Mattermost. AI Insights provide troubleshooting and operations guidance for cluster issues. Real-time alert enrichment adds K8s context to notifications. Enables ChatOps remediation — execute kubectl commands and runbooks directly from chat. Converts passive telemetry notifications into human-in-the-loop operational actions inside the collaboration platform.
Cloud cost intelligence platform with an AnyCost engine that ingests billing and usage data from any source without requiring perfect tags. ML-driven anomaly detection identifies cost spikes in real time, while unit cost analytics attribute spend to specific features, customers, and deployments to align engineering decisions with business outcomes.
Kubernetes cost allocation platform that breaks cloud spend down to namespace, pod, and label level with real-time monitoring. ML-driven rightsizing recommendations analyze historical usage patterns to suggest optimal resource requests and limits, while anomaly detection catches cost spikes before the bill arrives. Available as open-core self-hosted and fully managed SaaS.
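One simple way to turn historical usage into a resource request is to take a high percentile and add headroom. The sketch below does that with integer arithmetic; it is a stand-in for the ML-driven recommendations described above, not the product's method, and the function name, percentile, headroom, and sample data are assumptions.

```python
def recommend_request(usage_samples, percentile=90, headroom_pct=20):
    """Suggest a resource request from historical usage (hypothetical helper).

    Uses the nearest-rank percentile, then adds a headroom margin.
    Inputs and result share one unit, e.g. CPU millicores.
    """
    if not usage_samples:
        raise ValueError("need at least one usage sample")
    ordered = sorted(usage_samples)
    # Nearest-rank percentile via integer ceiling division.
    rank = max(1, (percentile * len(ordered) + 99) // 100)
    baseline = ordered[rank - 1]
    # Round the padded value up to the next whole unit.
    return (baseline * (100 + headroom_pct) + 99) // 100

# Made-up CPU usage samples in millicores, including one outlier burst.
cpu_millicores = [120, 135, 110, 150, 140, 400, 130, 125, 145, 138]
print(recommend_request(cpu_millicores))
```

Choosing the 90th percentile rather than the maximum keeps a single burst from inflating the request, at the cost of occasional throttling during such bursts.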
Feature management platform with AI-powered Guarded Rollouts. Sequential testing engine progressively increases traffic while monitoring metrics for regressions — ML detects statistically significant negative impact and automatically pauses or rolls back the rollout. Separates deployment from release, enabling rollback without redeployment. First FedRAMP-authorized feature management solution.
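The guarded-rollout loop can be sketched as follows: raise traffic in stages, compare the treatment error rate with control at each stage, and roll back on a significant regression. This uses a plain one-sided two-proportion z-test per stage, which does not correct for repeated looks the way a true sequential test does. All names, stage percentages, and the threshold are assumptions.

```python
import math

def regression_z(control_errors, control_total, treat_errors, treat_total):
    """One-sided z-score for 'treatment error rate exceeds control'.

    Assumes both totals are greater than zero.
    """
    p_control = control_errors / control_total
    p_treat = treat_errors / treat_total
    pooled = (control_errors + treat_errors) / (control_total + treat_total)
    se = math.sqrt(pooled * (1 - pooled) * (1 / control_total + 1 / treat_total))
    if se == 0:
        return 0.0  # No variation observed, so no evidence of regression.
    return (p_treat - p_control) / se

def run_rollout(stages, observe, z_limit=2.33):
    """Walk through traffic stages, stopping at the first regression.

    observe(pct) is a caller-supplied function returning
    (control_errors, control_total, treat_errors, treat_total).
    Returns (status, last_healthy_percentage).
    """
    reached = 0
    for pct in stages:
        if regression_z(*observe(pct)) > z_limit:
            return ("rolled_back", reached)
        reached = pct
    return ("completed", reached)

def fake_observe(pct):
    # Made-up data: the new version starts failing once it gets 25% of traffic.
    return (10, 1000, 10, 1000) if pct < 25 else (10, 1000, 60, 1000)

print(run_rollout([5, 25, 50, 100], fake_observe))
```

Because the rollback is driven by the flag's traffic percentage, reverting means setting that percentage back to the last healthy stage rather than redeploying.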
AI SRE platform for Kubernetes built on an eBPF telemetry layer. Autonomously monitors services, spawns AI agents to root-cause issues using complete runtime context (traces, metrics, logs, K8s state, deployments, code changes), performs deployment verification, and generates fix pull requests. The eBPF layer provides the data; AI agents drive operational actions — investigations, verifications, fixes.
AI-driven delivery intelligence overlay for Spinnaker and Argo CD. Automates release verification by aggregating APM telemetry and logs, using NLP and ML to generate real-time composite risk scores — Quality, Performance, Reliability, Security — that act as autonomous promotion or rollback gates in progressive delivery pipelines.
Agentless cloud security platform using patented SideScanning technology to read cloud configuration and workload runtime state out-of-band without deploying agents. Embeds GenAI-powered investigation and natural language querying to explain attack paths, correlate risks across multi-cloud environments, and guide remediation, including for paused and stopped workloads.
Deployment intelligence platform that tracks every code deployment from commit to production. ML models baseline key health indicators and detect anomalies correlated with specific deployments, providing a Deploy Rating (A-F). Calculates DORA metrics with deployment-level granularity and correlates deployments with PagerDuty incidents for precise rollback targeting.
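For readers unfamiliar with DORA metrics at deployment-level granularity, the sketch below computes three of them from per-deployment records: deployment frequency, median lead time for changes, and change failure rate. The record fields, function name, and sample data are assumptions and do not reflect any product's schema.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import median

@dataclass
class Deployment:
    committed_at: datetime   # when the change was committed
    deployed_at: datetime    # when it reached production
    caused_incident: bool    # whether an incident was attributed to it

def dora_summary(deployments, period_days):
    """Summarise three DORA metrics over a reporting period (hypothetical helper)."""
    if not deployments:
        raise ValueError("need at least one deployment")
    lead_hours = [
        (d.deployed_at - d.committed_at).total_seconds() / 3600
        for d in deployments
    ]
    failures = sum(d.caused_incident for d in deployments)
    return {
        "deploys_per_day": len(deployments) / period_days,
        "median_lead_time_hours": median(lead_hours),
        "change_failure_rate": failures / len(deployments),
    }

# Made-up records: four deployments across a two-day window.
sample = [
    Deployment(datetime(2024, 1, 1, 8), datetime(2024, 1, 1, 10), False),
    Deployment(datetime(2024, 1, 1, 9), datetime(2024, 1, 1, 13), False),
    Deployment(datetime(2024, 1, 2, 8), datetime(2024, 1, 2, 14), True),
    Deployment(datetime(2024, 1, 2, 9), datetime(2024, 1, 2, 17), False),
]
print(dora_summary(sample, period_days=2))
```

Keeping the incident flag on each deployment record is what allows a failure to be traced to one specific release when choosing a rollback target.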
End-to-end incident response platform with ML-based Intelligent Alert Grouping that reduces noise by grouping related alerts, AI-generated incident summaries, Auto Pause Transient Alerts for suppressing ephemeral flapping, and Past Incident Insights for historical pattern matching. SLO dashboards connect incident response to reliability engineering.
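The noise-reduction idea behind alert grouping can be shown with a much simpler rule than the ML-based grouping described above: merge alerts for the same service that arrive within a short time window of each other. The function name, alert tuple layout, and five-minute window are assumptions for illustration.

```python
def group_alerts(alerts, window_seconds=300):
    """Group alerts per service when they arrive close together in time.

    alerts: iterable of (timestamp_seconds, service, message) tuples.
    A rule-based stand-in for ML grouping, shown only as a sketch.
    """
    groups = []
    latest_group = {}  # service -> index of its most recent group
    for ts, service, message in sorted(alerts):
        idx = latest_group.get(service)
        if idx is not None and ts - groups[idx]["last_seen"] <= window_seconds:
            # Still within the window: fold into the existing group.
            groups[idx]["alerts"].append(message)
            groups[idx]["last_seen"] = ts
        else:
            groups.append({
                "service": service,
                "first_seen": ts,
                "last_seen": ts,
                "alerts": [message],
            })
            latest_group[service] = len(groups) - 1
    return groups

# Made-up alert stream: two related checkout alerts, one unrelated, one later.
alerts = [
    (0, "checkout", "5xx spike"),
    (60, "checkout", "latency high"),
    (90, "search", "pod restart"),
    (1000, "checkout", "5xx spike"),
]
for g in group_alerts(alerts):
    print(g["service"], g["alerts"])
```

Four raw alerts collapse into three groups here; the responder sees one notification for the first checkout burst instead of two.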
Cloud cost optimization platform featuring an Automated FinOps Agent that uses AI for virtual tagging and waste detection across multi-cloud environments. Predictive Autopilot algorithms autonomously manage AWS Savings Plans, while specialized cost tracking for generative AI infrastructure covers GPU compute, vector databases, and LLM API inference costs.
Agentless CNAPP and CSPM solution that uses an AI-powered unified risk graph to correlate vulnerabilities, misconfigurations, exposed secrets, and identity risks across AWS, Azure, GCP, Kubernetes, and Snowflake. Prioritizes risks based on actual exploitability and blast-radius analysis rather than theoretical severity, enabling teams to remediate the 1% of issues that matter.