Sponsored Feature
Your phone buzzes at 2 AM. The website is down. Slack has become a wall of red alerts, and customers are already tweeting. You stare at the screen, still half-asleep, trying to figure out where to even start looking.
This is the ritual that site reliability engineers (SREs) know too well. These are the folks who have to keep online services running at all costs, and when those services go down, stress levels soar. Recovery is a race against time, yet most teams burn the first hour just gathering evidence before the actual troubleshooting begins.
“The first five minutes is panic,” says Goutham Rao, chief executive officer and co-founder of NeuBird. “The next 25 minutes is assembling the team to say that we have a proxy error. Get on Slack, get on phone calls, call people.” War rooms get spun up. Bridge calls are convened. Fingers are pointed between teams while the outage clock keeps ticking.
Rao knows the pain first-hand. The serial entrepreneur once had to fly from San Francisco to Amsterdam to fix his own bug in a dark datacenter because the customer would not allow remote access. The downtime was essentially the flight time. He decided there had to be a better way, and so NeuBird was born.
The startup is backed by Microsoft and partnered with AWS, and it is making this whole dance unnecessary. Its product, Hawkeye, is an AI-powered SRE that runs the investigation while your team is still rubbing the sleep from their eyes. Rao emphasizes that this isn’t another chatbot for querying logs. It’s an agentic system that forms hypotheses, tests them against your telemetry, and tells you what actually broke.
Why cloud operations hit a breaking point
SRE automation has been a long time coming, says Rao. The architecture that makes modern software possible is also what makes it so maddening to debug. Service-oriented architectures became the industry standard over the past 20 years because they let teams build faster. However, they also create a tangled mesh of interdependencies that few fully understand. These are complex systems, where pulling a thread in one system can unravel another thousands of miles away.
Here’s a scenario Rao describes: your website times out. Intuitively, it looks like a problem with the UI or web application layer. You’d think something is wrong with the front end. But the real problem turns out to be a database running out of resources three layers down.
“The root cause of why your website is running slow is not because of anything related to your web app or your compute. It’s because you’re running out of capacity,” he explains. “Who would have thought this? And it takes a long time for people to be able to connect those dots.”
The tools meant to help have created their own problems. AWS environments now generate millions of telemetry data points across thousands of resources. You can instrument everything, but more visibility often just means less clarity. This problem is often called the observability paradox.
According to AWS, 70 percent of alerts require manual correlation across multiple services. Engineers typically spend three to four hours investigating complex incidents, and that’s before anyone starts fixing anything.
Rao is quick to point out this isn’t about replacing people. “It’s not about do the same with fewer people,” he says. “That’s never been the case in any innovation cycle. It always is do more with what you have.”
What makes agentic AI different
The AIOps market is crowded with tools that slap chatbot interfaces onto log queries and call it innovation. Hawkeye is doing something structurally different, and the distinction matters if you’re going to trust it with your production environment.
Most enterprise AI products use retrieval-augmented generation (RAG). You vectorize documents, feed the relevant pieces to an LLM, then ask questions about that content. That approach works fine for corporate knowledge bases and policy documents, but it collapses in a heap if you try to use it for IT telemetry.
“You can’t copy all of your IT telemetry into ChatGPT and say ‘help me’,” Rao explains. “That doesn’t work.” The data is a constantly changing morass of logs, traces, configuration data, and time-series metrics captured at millisecond granularity. You can’t dump all of that into a prompt window and expect useful results.
Agentic systems flip the approach. Instead of feeding content to the LLM and asking questions, you tell the LLM to figure out what information it actually needs, then surgically extract it from your data sources. The LLM generates investigation programs rather than prose answers.
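To make the contrast with "dump everything into the prompt" concrete, here is a minimal Python sketch of a hypothesis-driven investigation. It is not Hawkeye's implementation: the metric names, sample values, thresholds, and the `Hypothesis`, `fetch`, and `investigate` helpers are all hypothetical, and the hypotheses are hard-coded where an agentic system would have an LLM propose them. The point is only that each hypothesis names the one slice of telemetry it needs, and nothing else is pulled.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical in-memory telemetry standing in for real metric stores.
TELEMETRY = {
    "web.cpu_percent": [35, 38, 40, 37],
    "db.connections_used_percent": [82, 91, 97, 99],
    "lb.5xx_per_min": [0, 1, 0, 2],
}


@dataclass
class Hypothesis:
    name: str
    metric: str  # the single series this hypothesis needs
    is_supported: Callable[[List[float]], bool]


def fetch(metric: str) -> List[float]:
    """Targeted query: return only the requested series."""
    return TELEMETRY[metric]


def investigate(hypotheses: List[Hypothesis]) -> List[Tuple[str, bool]]:
    """Test each hypothesis against only the data it asks for."""
    return [(h.name, h.is_supported(fetch(h.metric))) for h in hypotheses]


# Hard-coded here; in an agentic system an LLM would propose these.
HYPOTHESES = [
    Hypothesis("web tier CPU saturated", "web.cpu_percent",
               lambda s: max(s) > 90),
    Hypothesis("database out of connection capacity",
               "db.connections_used_percent", lambda s: s[-1] > 95),
    Hypothesis("load balancer throwing errors", "lb.5xx_per_min",
               lambda s: sum(s) > 50),
]

results = investigate(HYPOTHESES)
supported = [name for name, ok in results if ok]
print(supported)
```

With the sample data above, only the database-capacity hypothesis survives, mirroring Rao's scenario of a front-end timeout whose cause sits three layers down.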
This is where context engineering becomes more important than prompt engineering. Rao uses a medical analogy to explain the difference: even the best doctor in the world cannot diagnose you accurately if you cannot describe your symptoms properly.
“The problem with LLMs is you can ask it a question and you’ll always get an answer,” he says. “That’s a problem for production systems, because you don’t want to mislead a person.” Give an LLM the wrong context and it will confidently solve the wrong problem. The trick is making sure it asks the right questions of the right data sources before it starts reasoning.
A system that learns – and writes its own instructions
Beneath Hawkeye sits something NeuBird calls the Raven AI Expression Language (RAEL). It’s a structured grammar that lets LLMs create verifiable investigation programs rather than natural language responses. These programs can be validated and compiled, which eliminates hallucinations in the investigation steps themselves.
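RAEL's grammar is not public, so the following Python sketch does not attempt to reproduce it. It only illustrates the general idea of validating a machine-generated plan before anything runs. It assumes a made-up JSON plan format with a `steps` list, and the operation names, source names, and `validate_plan` function are hypothetical.

```python
import json
from typing import List

# Hypothetical allow-lists; a real system would derive these from its grammar.
ALLOWED_OPS = {"query_metric", "query_logs", "compare"}
KNOWN_SOURCES = {"cloudwatch", "prometheus"}


def validate_plan(raw: str) -> List[str]:
    """Return a list of problems; an empty list means the plan may run."""
    try:
        plan = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]

    steps = plan.get("steps") if isinstance(plan, dict) else None
    if not isinstance(steps, list) or not steps:
        return ["plan must contain a non-empty 'steps' list"]

    errors = []
    for i, step in enumerate(steps):
        if not isinstance(step, dict):
            errors.append(f"step {i}: not an object")
            continue
        op = step.get("op")
        if op not in ALLOWED_OPS:
            # Anything outside the allow-list is rejected, not executed.
            errors.append(f"step {i}: unknown op {op!r}")
        source = step.get("source")
        if op in {"query_metric", "query_logs"} and source not in KNOWN_SOURCES:
            errors.append(f"step {i}: unknown source {source!r}")
    return errors


good = '{"steps": [{"op": "query_metric", "source": "cloudwatch"}]}'
bad = '{"steps": [{"op": "restart_service", "source": "cloudwatch"}]}'
print(validate_plan(good))
print(validate_plan(bad))
```

A step the model invents, such as the `restart_service` operation in `bad`, is caught at validation time instead of surfacing as a confident but ungrounded instruction.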
“For us an agentic system is a combination of an expert system with the cognitive capabilities that exist in Gen AI,” Rao explains. The system marries expert system reliability with generative AI creativity. This makes it structured enough to be trustworthy, but flexible enough to handle novel situations.
The ability to codify investigation strategies allows engineers to shape how investigations run over time. Tell Hawkeye in plain English to pay more attention to networking next time, and the underlying RAEL grammar (which the LLM itself creates) morphs accordingly. You’re coaching a cognitive system, not configuring a static rules engine.
One customer discovered this capability when Hawkeye couldn’t explain a sudden drop in DNS requests. The root cause was an external Cloudflare outage that Hawkeye had no visibility into. The customer responded by adding Cloudflare status checks to future investigations. The system learns.
An army of LLMs
Hawkeye doesn’t run on a single LLM, either. NeuBird uses what Rao calls a squadron of models. Some are better suited to time-series analysis, and others to parsing JSON structures. The current mix includes Anthropic’s Claude and various GPT models, though the architecture is designed to swap them as the market evolves. Enterprises can also bring their own Bedrock models, burning down committed cloud spend while using Hawkeye’s investigation framework.
The platform connects natively to AWS services including CloudWatch, EKS, Lambda, RDS, and S3, though it also works with Azure and on-premises environments. Standard observability stacks like Dynatrace, Splunk, and Prometheus are supported out of the box. For organizations running homegrown tooling, the Model Context Protocol (MCP) provides a bridge to proprietary systems.
Security will be a big concern for potential users. Hawkeye operates with read-only access and stores no telemetry data. It only persists some metadata that fingerprints your environment, like how many EC2 instances you have or what Kubernetes clusters exist. For organizations that need extra isolation, there’s a full in-virtual private cloud (VPC) option. All processing happens inside that VPC, and data never leaves the customer’s AWS environment.
Keeping your hands on the wheel
Hawkeye stops at recommendations. It won’t automatically execute fixes, and that’s deliberate. “We purposely limit it from taking actions,” Rao explains, arguing that for many people agentic systems are a bit like self-driving cars: a cool concept, but still too new for most to take their hands off the wheel completely. That said, for customers who are ready to automate repetitive actions, NeuBird offers an option to do so.
Genuinely benign actions, such as toggling feature flags, are OK. In that example, the flag itself has already been tested and the consequences are well understood. But writing code or patching Helm charts? Not yet.
The concern is that a 95 percent success rate with a spectacular 5 percent failure could poison the well for agentic systems entirely. Better to keep a human in the loop for now and build trust gradually.
When Hawkeye cannot solve a problem, it says so. The system fact-checks its conclusions against actual telemetry, so the worst case is admitting uncertainty rather than confidently pointing you in the wrong direction. It also has an interesting behind-the-scenes feature that helps to hone its results: it uses competing LLMs to argue with one another about their findings. This dialectic leads to better results that have been sanity checked.
Hawkeye’s dashboard generates reports showing estimated time saved per investigation. Model Rocket, a custom technology solutions provider running a complex environment spanning Lambda, RDS, ElastiCache, and EKS, cut mean time to recovery by over 90 percent after deploying the platform.
The cognitive shift
NeuBird sits in an enviable position. Microsoft is a backer, and the company participates in Redmond’s elite Pegasus program, which provides access to enterprise customers including Adobe, Autodesk, and Chevron. On the AWS side, NeuBird was selected for a number of AWS programs, including the Generative AI Accelerator and Generative AI Competency Partner status, with Hawkeye available on the AWS Marketplace.
Part of what got it there is its understanding that agentic AI isn’t software you configure once and forget about. “You have to treat it like a cognitive being, a cognitive system, because that’s what it’s rooted in,” Rao says. “Coach it, work with it, give it feedback, let it collaborate with it. It’s not a binary system.”
2 AM phone calls for SREs aren’t going away. Infrastructure will always break in creative ways at inconvenient hours. But if NeuBird’s bet pays off, by the time you get to your desk with slippers and coffee at the ready, Hawkeye will be well on its way to delivering a root cause analysis.
Sponsored by NeuBird.ai.