The AI SRE Startup Landscape
It's hot. It's raising a lot of money. And the promise is compelling: Put the machines on-call. Here's an (updating) list of AI SRE tools I'm tracking in the market.
The AI SRE category is red hot. The promise is compelling: What if machines could go on-call for you? So many startups have emerged to hunt this white whale of a product that our customers keep asking me which ones they should integrate with FireHydrant.
But WTF Even Is AI Site Reliability Engineering?
The SRE name has been bruised and battered for years. DevOps teams were renamed to SRE overnight, hoping they'd somehow "do SRE things." Now "AI" has latched onto the sacred SRE name, confusing it even more. I'm here to help the overwhelmed citizens.
Let's get this straight: SRE is a huge role that encompasses dozens of tasks. There's a reason SREs at large tech companies get paid the big bucks: It's a very demanding job with a very wide blast radius (of both success and failure).
In its current 2025 form, "AI SRE" encompasses two things:
Autonomously investigating incidents, much as an engineer would open a dashboard and hunt for the smoking gun in the logs.
Mitigating incidents by autonomously fixing the underlying cause(s) either with code fixes or rollbacks.
These AI SRE agent startups do not include features to notify on-call engineers, perform retrospectives, or create workflows for their incidents. This is the distinction between “Incident Management” and “AI SRE.”
The List
I've been tracking AI SRE startups for months, and I think this list could be valuable to businesses looking to try this new technology. I'm neutral on who's the best player, and I believe everyone should assume this category will change dramatically in the next two years.
Alright, let's get down to business!
Dedicated AI SRE Startups
These AI SRE companies focus on the investigation and remediation layer of the "R" in SRE. They're not building on-call scheduling, service catalogs, incident management, status pages, or retrospectives—just good ol' laser-focused products on helping engineers resolve incidents faster.
In alphabetical order:
Causely
Website: https://www.causely.ai/
Causely's causal reasoning engine automatically infers the single root cause when a storm of alerts begins cascading through your environment. The platform auto-discovers your environment and starts delivering insights in seconds from your existing telemetry—no setup or tuning required.
Cleric
Website: https://cleric.ai/
Cleric bills itself as the first AI for application teams that investigates like a senior SRE, autonomously digging into production issues and delivering findings directly to Slack. Backed by Zetta Venture Partners in a $4.3M seed round, Cleric reasons through problems it's never seen before by forming hypotheses and running real queries with your tools.
NeuBird
Website: https://neubird.ai/
NeuBird's Hawkeye is an AI-powered SRE co-pilot that brings the reasoning power of LLMs to telemetry data with reliable, secure Agentic AI built for enterprise IT. The company recently raised $22.5M in funding led by M12 (Microsoft's venture fund), with participation from Mayfield, StepStone Group, and Prosperity7 Ventures.
Phoebe
Website: https://phoebe.ai/
Troubleshoot faster. Agentic search for your tech stack. Investigate errors, incidents, and more.
Resolve
Website: https://resolve.ai
Created by the co-creators of OpenTelemetry, Resolve AI handles all alerts, performs root cause analysis, and troubleshoots incidents within minutes. The platform operates autonomously to handle common alerts and actions, reducing escalations and saving up to 20 hours per on-call engineer per week.
TierZero
Website: https://www.tierzero.ai/
TierZero AI automatically investigates, triages, and resolves infrastructure issues. Its pitch is that infrastructure should be self-driving, surfacing the right insights and anticipating issues. The company is SOC 2 Type II certified and hosts its production services on AWS with enterprise-grade security measures.
Traversal
Website: https://traversal.com
Traversal's agent parses logs, metrics, traces, and your codebase to narrow down root causes of errors or latency, replacing the flood of alerts and logs with easy natural language. The team consists of CS PhDs from MIT and UC Berkeley, with experience at industry leaders like Uber, Amazon, Citadel, and Mistral AI.
Vibranium Labs
Website: https://vibraniumlabs.ai/
Vibranium AI acts as your 24/7 on-call teammate, eliminating alert fatigue, pinpointing root causes, and providing actionable insights for faster incident resolution. The platform can slash Mean Time to Resolution (MTTR) by up to 82% and includes a real-time AI assistant that can join calls and transcribe discussions.
Wild Moose
Website: https://www.wildmoose.ai/
Wild Moose provides fast, efficient root cause analysis that improves with every incident, converting tribal knowledge into smart automations to navigate complex environments. The platform constantly improves performance with a system model that learns from each incident and integrates within minutes via APIs.
🕵️‍♂️ Did I miss your company? Email me: robert@firehydrant.com
Observability Tools Dipping Their Toes
Most O11Y players are dipping their toes—or cannonballing—into this space. It makes perfect sense: They have the data that investigations would be performed on anyway, so they're building agentic workflows on top of that goldmine.
None of these should come as a surprise:
Honeycomb*
Datadog
Observe, Inc. (observeinc.com)
New Relic
Coralogix
Dynatrace
...basically everyone
*Honeycomb is a genuine thought leader in this space. I recommend reading Austin Parker's post. Their blog is a treasure trove of thoughtful pieces on AI in software development.
Incident Management Startups
For the past five years, several incident management companies have emerged to fill the void that PagerDuty left wide open—and all of us started taking advantage of it. As the CEO of an incident management and on-call tool (FireHydrant.com, if you didn't know), I have just a few… million… perspectives on this space.
Here's a list of incident management startups I'm watching that are also including AI/SRE capabilities in their offerings:
Incident.io
With a name like "Incident," of course they're going to build "AI SRE." Incident.io threw their hat into the AI SRE ring in the past month with the promise of "AI SRE resolves incidents like your best engineer" featured prominently on their product page header.
Of all FireHydrant's competitors, I have the most respect for incident.io. Their team and founders have always been graceful with me and FireHydrant, and I'm curious to see how their AI SRE + everything else platform plays out.
P.S. I will destroy you, Stephen Whitworth 😈
PagerDuty
Obviously, the oldest of the bunch. Definitely not a startup anymore. They’re building an AI SRE too. They appear to be using their Rundeck acquisition from several years ago to bolster their position as an automated investigation and remediation tool. They've been integrating Slack with investigations, albeit poorly from what we hear from the field. I have yet to hear of a single company using it successfully—but maybe that's because I only speak with companies that switch to FireHydrant? Who knows.
Rootly
Rootly won't list FireHydrant on their comparison page (probably because of how much I’ve called them out publicly for the past several years).
Rootly is building an AI SRE, but it's really hard to know what it does because their screenshots are just copies (literally) of other vendors in the space.
To each their own.
Other Businesses
Many businesses have an "AI SRE" SKU of sorts in their product. Several DevOps platforms, CI/CD companies, and even code editors have begun to dabble in AI investigation and remediation. They're worth noting for completeness:
Sentry: https://sentry.io/product/seer/
GitHub: https://github.com/features/ai
Microsoft Azure: https://learn.microsoft.com/en-us/azure/sre-agent/overview
Cursor Bugbot: https://docs.cursor.com/en/bugbot
Starsling (waitlist): https://www.starsling.dev/
I'm sure there are dozens more. If I missed yours, email me: robert@firehydrant.com.
And now FireHydrant!?
Nope. Not quite.
Instead, here's something I’m more excited about: partnering with the next generation of AI SRE and observability tools. All of them.
I’ve spoken with countless technology leaders recently, and they’re all saying the same thing: “We’re looking at all of the AI SRE tools on the market – and they need to integrate with FireHydrant.”
Businesses are looking for the right AI agent(s) to fit their specific needs. Because tech stacks vary so widely in their complexity and design, no AI SRE will be a “one size fits all” tool. Businesses are even likely to purchase several AI tools for their needs.
And all of those AI tools will need a place to retrieve incident context, read retrospectives, and page the humans when they come up short. That’s the platform we’re building.
AI SRE is a genuine opportunity for FireHydrant. By letting teams connect their AI agents with our incident management platform, businesses actually get something useful. FireHydrant becomes the connective tissue between AI SREs and the real world—teams can bring whatever AI SRE agent works best for them and plug it into us as their management and on-call layer. It just makes more sense to us that way.
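To make the "connective tissue" idea a bit more concrete, here is a rough sketch in Python of how it could look. Everything in it is hypothetical and for illustration only: `IncidentPlatform`, `Agent`, `Finding`, `RetroSearchAgent`, and the `confidence_floor` threshold are invented names, not FireHydrant's API or any vendor's. The platform hands incident context (including past retrospectives) to whichever agents a team has plugged in, and pages the humans when no agent comes back with a confident answer.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Protocol


@dataclass
class Incident:
    id: str
    summary: str
    retrospectives: List[str] = field(default_factory=list)  # past write-ups


@dataclass
class Finding:
    root_cause: Optional[str]  # None means the agent came up short
    confidence: float


class Agent(Protocol):
    """Any AI SRE agent a team chooses to bring (hypothetical interface)."""

    def investigate(self, incident: Incident) -> Finding: ...


class RetroSearchAgent:
    """Toy agent: looks for a past retrospective that mentions the summary."""

    def investigate(self, incident: Incident) -> Finding:
        for retro in incident.retrospectives:
            if incident.summary.lower() in retro.lower():
                return Finding(root_cause=retro, confidence=0.9)
        return Finding(root_cause=None, confidence=0.0)


class IncidentPlatform:
    """Hypothetical management and on-call layer; not a real FireHydrant API."""

    def __init__(self, confidence_floor: float = 0.7) -> None:
        self.confidence_floor = confidence_floor  # assumed threshold
        self.pages: List[str] = []  # stands in for a real paging system

    def page_humans(self, incident: Incident, reason: str) -> None:
        self.pages.append(f"{incident.id}: {reason}")

    def handle(self, incident: Incident, agents: List[Agent]) -> Optional[Finding]:
        # Try each plugged-in agent in order; accept the first confident answer.
        for agent in agents:
            finding = agent.investigate(incident)
            if finding.root_cause and finding.confidence >= self.confidence_floor:
                return finding
        # No agent was confident enough, so escalate to the on-call humans.
        self.page_humans(incident, "no agent produced a confident root cause")
        return None
```

The design choice worth noticing is that the agents are interchangeable: the platform only cares that each one can take an incident and return a finding, so a team could swap in or stack several agents without changing the paging path.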
We at FireHydrant have our sights on something else that we’re more excited to solve 👀