Build an agentic incident triage assistant with AWS Quick and New Relic

Automate the first 5 minutes of incident response: gather context, run diagnostics, and suggest remediation from alerts.

TL;DR: My pager went off at 2am. By the time I logged in, the logs had rotated, the metrics window had passed, and I spent 30 minutes reconstructing what happened. This architecture fixes that. Alert fires, agent gathers context, runs diagnostics, suggests remediation. All before you finish your coffee.

Incident response follows a predictable pattern. An alert fires, and the engineer on call spends the first 5-10 minutes gathering context: checking dashboards, pulling logs, reviewing recent deployments. Only then do they start diagnosing the actual problem. An agentic triage assistant can automate that context-gathering phase.

Key takeaways:

The first 5 minutes of incident response is context gathering, which is highly automatable

Architecture: alert → agent orchestration → observability data → diagnostics → remediation suggestion

AWS Quick handles the agent loop; Bedrock AgentCore provides the LLM

Works best for well-understood failure patterns with clear diagnostic procedures

Always include a human-in-the-loop for remediation: agents suggest, humans approve

What is the architecture of an agentic incident triage assistant?

The pattern from the AWS/New Relic reference architecture breaks down into four stages:

1. Alert triggers the agent. A New Relic alert fires and sends a webhook to Amazon Quick. The alert payload includes the incident type, affected service, severity, and a link to the related dashboard.

2. Context gathering. The agent receives the alert and immediately starts collecting context: recent logs from the affected service, error rates from New Relic metrics, recent deployment activity, and related alerts from the past hour.

3. Diagnostic execution. Based on the incident type, the agent runs predefined diagnostic procedures. For a high-latency alert, it checks database query performance, CPU use, and upstream dependency latency. For an error rate spike, it looks at recent code deployments and error log patterns.

4. Remediation suggestion. The agent compiles its findings into a structured triage report: what’s happening, what’s changed, likely causes, and suggested remediation steps. This goes to the on-call engineer for review.
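
The four stages can be sketched in a few dozen lines of Python. This is a minimal illustration, not the reference implementation: the payload field names, the incident type labels, the stubbed context values, and every function name below are assumptions made for the example, and none of them come from the Amazon Quick, Bedrock AgentCore, or New Relic APIs. In a real build, `gather_context` would query your observability and deployment tooling instead of returning placeholder data.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    incident_type: str  # assumed labels, e.g. "high_latency", "error_rate_spike"
    service: str
    severity: str
    dashboard_url: str


@dataclass
class TriageReport:
    alert: Alert
    context: dict
    findings: list = field(default_factory=list)
    suggestions: list = field(default_factory=list)


def parse_alert(payload: dict) -> Alert:
    """Stage 1: map the webhook payload (hypothetical field names) to an Alert."""
    return Alert(
        incident_type=payload["incident_type"],
        service=payload["service"],
        severity=payload["severity"],
        dashboard_url=payload.get("dashboard_url", ""),
    )


def gather_context(alert: Alert) -> dict:
    """Stage 2: placeholder data standing in for real log, metric, and deploy queries."""
    return {
        "recent_logs": [f"{alert.service}: placeholder error line"],
        "error_rate": 0.07,
        "recent_deployments": ["placeholder-deploy-1"],
        "related_alerts": [],
    }


def check_latency(context: dict) -> list:
    """Stage 3 procedure for high latency: list the checks named in the article."""
    return ["Check database query performance, CPU use, and upstream dependency latency"]


def check_error_spike(context: dict) -> list:
    """Stage 3 procedure for error spikes: look at recent deployments and log patterns."""
    findings = []
    if context["recent_deployments"]:
        findings.append(f"Recent deployment: {context['recent_deployments'][0]}")
    if context["recent_logs"]:
        findings.append(f"Error log pattern: {context['recent_logs'][0]}")
    return findings


# Route each incident type to its predefined diagnostic procedure.
DIAGNOSTICS = {
    "high_latency": check_latency,
    "error_rate_spike": check_error_spike,
}


def triage(payload: dict) -> TriageReport:
    """Run all four stages and return a report for the on-call engineer to review."""
    alert = parse_alert(payload)
    context = gather_context(alert)
    diagnostic = DIAGNOSTICS.get(alert.incident_type)
    if diagnostic is None:
        findings = ["No predefined diagnostic for this incident type"]
    else:
        findings = diagnostic(context)
    # Stage 4: suggestions only; nothing is executed here.
    suggestions = []
    if alert.incident_type == "error_rate_spike" and context["recent_deployments"]:
        suggestions.append(
            f"Consider rolling back {context['recent_deployments'][0]} (needs approval)"
        )
    return TriageReport(alert, context, findings, suggestions)
```

The dictionary dispatch keeps each diagnostic procedure small and testable, and an unknown incident type falls through to an explicit "no diagnostic" finding rather than failing silently.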

How can I extend this incident triage architecture?

The interesting part is what happens after the initial implementation. Once you have this pattern running, you can extend it:

Playbook automation. For well-understood failure modes, the agent can execute remediation steps directly (restart services, roll back deployments, scale resources), with human approval for each action.
Post-mortem generation. After the incident is resolved, the agent can automatically generate a post-mortem draft from the timeline, context data, and remediation actions taken.
Pattern learning. Over time, the agent can learn which diagnostic steps are most useful for each incident type and prioritize them accordingly.
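
For playbook automation, the per-action approval requirement might look something like the sketch below. Everything here is hypothetical: `RemediationAction`, `execute_with_approval`, and the `approve` callback are names invented for illustration, and in practice `approve` would wrap whatever channel your team uses to ask the on-call engineer for a yes or no.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class RemediationAction:
    name: str
    run: Callable[[], str]  # performs the step and returns a short result message


def execute_with_approval(actions, approve):
    """Run each action only if approve(action) returns True.

    Every decision is recorded, so the log can later feed a post-mortem draft.
    """
    audit_log = []
    for action in actions:
        if approve(action):
            audit_log.append((action.name, "executed", action.run()))
        else:
            audit_log.append((action.name, "skipped", None))
    return audit_log
```

Because approval is checked once per action rather than once per playbook, the engineer can accept a service restart while declining a rollback in the same incident.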

The AWS and New Relic reference architecture post is worth reading for the specific implementation details. But the pattern itself, alert-driven agent orchestration with observability tool integration, is applicable to any stack.

I’ve covered agent deployment patterns and monitoring agents in production; the incident triage pattern fits naturally into both.

FAQ

What is an agentic incident triage assistant? An AI agent that automatically responds to incidents by gathering context (logs, metrics, traces), running diagnostic checks, and suggesting remediation steps. This automates much of the manual work in the first 5 minutes of incident response.

What AWS services are used? Amazon Quick (agent orchestration), Bedrock AgentCore (LLM integration), and integration with New Relic for observability data.

Can this pattern work outside AWS? Yes: the architecture pattern of alert → context gathering → diagnostics → suggestion works with any observability platform and agent framework.

Related Posts

AI agent error handling patterns: retry strategies, circuit breakers, and graceful degradation for production agents
The policy gate every agent needs before production: how to add safety checks and human approval gates to agent tool calls
AI agent multi-step workflows: sequential chains, parallel execution, and conditional branching for orchestrated agents

This article was published on Agentic Up (https://agenticup.dev): practical guides for developers and founders building with AI agents. Reach me at hello@agenticup.dev
