Lawrence Jones, product engineer at incident.io, describes how their AI incident response system evolved from basic log summaries to agents that analyze thousands of GitHub PRs and Slack messages to draft remediation pull requests within three minutes of an alert firing. The system doesn't pursue full automation because the real value lies elsewhere: eliminating the diagnostic work that consumes the first 30-60 minutes of incident response, and filtering out the false positives that wake engineers unnecessarily at 3am.
The core architectural decision treats each organization's incident history as a unique immune system rather than forcing generic playbooks onto it. By pre-processing and indexing how a specific company has resolved incidents across dimensions like affected teams, error patterns, and system dependencies, incident.io generates ephemeral runbooks that surface the 3-4 commands that actually worked the last time this type of failure occurred. This approach emerged from recognizing that cross-customer meta-models fail because incident response is fundamentally organization-specific: one company's SEV-0 is an airline bankruptcy, another's is a stolen laptop.
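To make the indexing idea concrete, here is a minimal Go sketch of what a pre-processed incident index could look like. It is an illustration under assumptions, not incident.io's actual data model: the `Incident` and `Index` types, the field names, and the `Similar` query are hypothetical.

```go
package runbook

import "strings"

// Incident is a minimal, hypothetical record of a past incident after
// pre-processing: who owned it, what the error looked like, which systems
// were implicated, and the commands that ultimately resolved it.
type Incident struct {
	ID                 string
	OwningTeam         string
	ErrorFingerprint   string   // e.g. normalized exception class + message shape
	Dependencies       []string // systems implicated during the investigation
	ResolutionCommands []string // what actually fixed it
}

// Index is a toy in-memory index over incident history, keyed by the
// dimensions described above. A real system would persist this and keep it
// updated as new incidents close.
type Index struct {
	byTeam        map[string][]*Incident
	byFingerprint map[string][]*Incident
}

func NewIndex(history []*Incident) *Index {
	idx := &Index{
		byTeam:        map[string][]*Incident{},
		byFingerprint: map[string][]*Incident{},
	}
	for _, inc := range history {
		idx.byTeam[inc.OwningTeam] = append(idx.byTeam[inc.OwningTeam], inc)
		idx.byFingerprint[inc.ErrorFingerprint] = append(idx.byFingerprint[inc.ErrorFingerprint], inc)
	}
	return idx
}

// Similar returns past incidents that share an error fingerprint with the
// current alert and touch at least one of the same dependencies: roughly
// the "has this failure happened here before?" question.
func (idx *Index) Similar(fingerprint string, deps []string) []*Incident {
	var matches []*Incident
	for _, inc := range idx.byFingerprint[fingerprint] {
		for _, d := range deps {
			if containsFold(inc.Dependencies, d) {
				matches = append(matches, inc)
				break
			}
		}
	}
	return matches
}

func containsFold(haystack []string, needle string) bool {
	for _, h := range haystack {
		if strings.EqualFold(h, needle) {
			return true
		}
	}
	return false
}
```

The point of the structure is that a fresh alert can be turned into a fingerprint-plus-dependencies query, and the answer comes back as concrete past resolutions rather than a generic runbook page.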
The engineering challenge centers on building trust with deeply skeptical SRE teams who view AI as non-deterministic chaos in their deterministic infrastructure. Lawrence's team addresses this through custom Go tooling that enables backtest-driven development: they rerun thousands of historical investigations with different model configurations and prompt changes, then use precision-focused scorecards to prove improvements objectively before deploying. This workflow revealed that traditional product engineers struggle with AI's slow evaluation cycles, while the team succeeded by hiring for methodical ownership over velocity.
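The backtest loop can be pictured as a harness that replays historical cases under a candidate prompt or model configuration and rolls the results into a precision-focused scorecard. The Go sketch below is a hedged illustration, not the team's actual tooling: `Case`, `RunFn`, `Scorecard`, and the field names are assumptions, and in practice `RunFn` would call out to the LLM pipeline.

```go
package backtest

import "fmt"

// Case is one historical investigation: the alert the agent saw at the time
// and whether, in hindsight, flagging it was warranted.
type Case struct {
	AlertID         string
	WasRealIncident bool
}

// RunFn re-runs a single historical case under a given configuration and
// reports whether the agent flagged it as actionable. Injecting it keeps the
// harness itself deterministic and testable.
type RunFn func(c Case, config string) (flagged bool)

// Scorecard captures the precision-focused metrics compared before shipping
// a prompt or model change.
type Scorecard struct {
	TruePositives  int
	FalsePositives int
	FalseNegatives int
}

func (s Scorecard) Precision() float64 {
	if s.TruePositives+s.FalsePositives == 0 {
		return 0
	}
	return float64(s.TruePositives) / float64(s.TruePositives+s.FalsePositives)
}

func (s Scorecard) Recall() float64 {
	if s.TruePositives+s.FalseNegatives == 0 {
		return 0
	}
	return float64(s.TruePositives) / float64(s.TruePositives+s.FalseNegatives)
}

// Backtest replays every historical case under a candidate configuration and
// aggregates the outcomes into a scorecard.
func Backtest(cases []Case, config string, run RunFn) Scorecard {
	var s Scorecard
	for _, c := range cases {
		flagged := run(c, config)
		switch {
		case flagged && c.WasRealIncident:
			s.TruePositives++
		case flagged && !c.WasRealIncident:
			s.FalsePositives++
		case !flagged && c.WasRealIncident:
			s.FalseNegatives++
		}
	}
	return s
}

// Compare prints baseline vs. candidate so a change only ships when the
// numbers move in the right direction.
func Compare(baseline, candidate Scorecard) {
	fmt.Printf("precision: %.2f -> %.2f, recall: %.2f -> %.2f\n",
		baseline.Precision(), candidate.Precision(),
		baseline.Recall(), candidate.Recall())
}
```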
Topics discussed:
Balancing precision versus recall in agent outputs to earn trust from SRE teams who are "hardcore AI holdouts"
Pre-processing incident artifacts (PRs, Slack threads, transcripts) into queryable indexes that cross-reference team ownership, system dependencies, and historical resolution patterns
Model selection strategy: GPT-4.1 for cost-effective daily operations, Claude Sonnet for superior code analysis and agentic planning loops
Backtest infrastructure that reruns thousands of past investigations with modified prompts to objectively validate changes through scorecard comparisons
Building ephemeral runbooks by extracting which historical commands and fixes worked for similar incidents, filtered by what the organization learned NOT to do in subsequent incidents (see the sketch after this list)
Prioritizing alert noise reduction over autonomous remediation because the false positive problem has clearer ROI and lower risk
Why AI engineering teams fail when staffed with traditional engineers optimized for fast feedback loops rather than tolerance for non-deterministic iteration
Building entirely custom tooling in Go without vendor frameworks due to early ecosystem constraints and desire for native product integration
The evaluation problem where only engineers who invested hundreds of hours building a system can predict how prompt changes cascade through multi-step agentic workflows
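On the ephemeral-runbook topic above, the "learned not to do" filter might look like the following Go sketch: commands from past resolutions are counted, anything the organization later retracted is dropped, and only the top few survive. The `pastResolution` type, the `LaterRetracted` flag, and the ranking rule are assumptions for illustration, not the described system's behavior.

```go
package runbook

import "sort"

// pastResolution is one command from a previous incident's fix, plus whether
// a later incident or review marked it as something the org should not repeat
// (e.g. "this restart masked the bug, don't do it again").
type pastResolution struct {
	Command        string
	IncidentID     string
	LaterRetracted bool
}

// EphemeralRunbook picks the handful of commands that actually resolved
// similar incidents before, dropping anything subsequently retracted and
// keeping only the most frequently successful ones.
func EphemeralRunbook(history []pastResolution, limit int) []string {
	counts := map[string]int{}
	for _, r := range history {
		if r.LaterRetracted {
			continue // the learned-not-to-do filter
		}
		counts[r.Command]++
	}
	commands := make([]string, 0, len(counts))
	for cmd := range counts {
		commands = append(commands, cmd)
	}
	// Most frequently successful first; ties broken alphabetically for stability.
	sort.Slice(commands, func(i, j int) bool {
		if counts[commands[i]] != counts[commands[j]] {
			return counts[commands[i]] > counts[commands[j]]
		}
		return commands[i] < commands[j]
	})
	if len(commands) > limit {
		commands = commands[:limit]
	}
	return commands
}
```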