Snag · FMEA for pull requests

How snag works

Snag applies FMEA (Failure Mode and Effects Analysis, the discipline behind failure analysis in aerospace and manufacturing) to a change. Instead of reading a diff line by line, it asks a structural question: what is supposed to happen here, and in what ways could it silently fail to happen?

Why a review misses these

Line-by-line review is good at what's on the screen: this null check, that off-by-one, this awkward name. But the failures that hurt most are about what isn't there — the unhandled disconnect, the retry that runs twice, the log that's lost on a hard kill. They don't live on any single line, so a reader scanning lines has no place to notice them.

Snag reframes the change as a Happy Path → Failure Map. The blue spine is what should happen, step by step. Everything branching off it is a way it can go wrong — enumerated deliberately, the way an FMEA walks each step and asks how it fails. The result is a map you can read, argue with, and commit.

Anatomy of a map

  • Happy path: the intended flow
  • Failure mode: how a step silently goes wrong

Each map is a failure-map/v1 JSON document. It renders as an interactive graph in the viewer, and it lives in git next to the code it describes.
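The failure-map/v1 schema itself is not reproduced on this page, so the following is only a rough sketch of what such a document could contain. Every class and field name here (Step, FailureMode, branches_from, why_it_matters, and so on) is a hypothetical stand-in, not Snag's actual format; only the "failure-map/v1" identifier and the two example titles come from the text.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass, field


@dataclass
class Step:
    """One node on the happy-path spine (hypothetical shape)."""
    id: str
    title: str


@dataclass
class FailureMode:
    """One way a step can silently go wrong (hypothetical shape)."""
    id: str
    branches_from: str   # id of the spine step this mode hangs off
    title: str
    why_it_matters: str


@dataclass
class FailureMap:
    title: str
    spine: list[Step] = field(default_factory=list)
    failure_modes: list[FailureMode] = field(default_factory=list)
    schema: str = "failure-map/v1"

    def to_json(self) -> str:
        """Serialize to JSON text suitable for committing to git."""
        return json.dumps(asdict(self), indent=2)

    def dangling(self) -> list[str]:
        """Ids of failure modes that point at a step not on the spine."""
        step_ids = {s.id for s in self.spine}
        return [m.id for m in self.failure_modes
                if m.branches_from not in step_ids]


demo = FailureMap(
    title="Runner Protocol v1",
    spine=[Step("seal", "Seal: logs contiguous 1..N?")],
    failure_modes=[
        FailureMode(
            id="lost-chunk",
            branches_from="seal",
            title="Lost chunk never resent -> seal fails 1..N",
            why_it_matters="A gap is only discovered when the job is sealed.",
        )
    ],
)
print(demo.to_json())
```

Keeping failure modes as separate records that reference a spine step by id is one way to make the map easy to diff: adding or removing a failure mode touches only its own entry.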

A worked example

Here's a real map from the demo — Runner Protocol v1, generated from a protocol spec. The spine is the intended lifecycle; below it are the failure modes snag surfaced off that spine — the kind a spec review nods past.

Runner Protocol v1

happy path → failure map
  • Runner boots
  • join (boot token)
  • Validate boot token
  • Issue session token; verdict: continue
  • job:assign (def, steps, env)
  • job:ack (CP stamps ts)
  • job:started (assigned -> running)
  • log:chunk seq 1..N, acked
  • job:finished (after all acked)
  • Seal: logs contiguous 1..N?
  • Job success
  • Channel disconnect -> rejoin (session token)
  • job:cancel (user / pipeline / timeout)

15 failure modes branching off that spine
  • Boot token reused / leaked
  • Boot-token TTL window too wide
  • Boot timeout (~60s) -> never joins
  • Max boot attempts (3) exceeded
  • protocol_version mismatch
  • Duplicate job:started not idempotent
  • Re-sent job:assign -> double execution
  • Grace period (30s) expiry -> runner_lost
  • Resumes work after Job already terminal
  • Lost chunk never resent -> seal fails 1..N
  • Chunk-name collision -> log corruption
  • Integrity checked only at seal, not on stream
  • SIGKILL before flush -> lost final logs
  • Process group not fully killed -> orphans
  • Cancel-drain deadline (10s) -> force destroy

Each one is a node you can open in the interactive viewer — with the path it branches from and why it matters. None of them is a single line you'd circle in review.
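To make two of these concrete, here is a small sketch of the "Seal: logs contiguous 1..N?" step and the "Integrity checked only at seal, not on stream" failure mode. It is an illustration under assumptions, not the Runner Protocol's implementation: the function and class names are invented, and chunks are modeled as bare sequence numbers.

```python
def missing_chunks(acked: set[int], n: int) -> list[int]:
    """Return the sequence numbers in 1..n that were never acked."""
    return [seq for seq in range(1, n + 1) if seq not in acked]


def can_seal(acked: set[int], n: int) -> bool:
    """Seal-time check: every chunk 1..n must have been acked."""
    return not missing_chunks(acked, n)


class StreamChecker:
    """Checks ordering as each chunk arrives instead of waiting for the seal."""

    def __init__(self) -> None:
        self.next_expected = 1

    def accept(self, seq: int) -> str:
        if seq < self.next_expected:
            return "duplicate"   # already seen; safe to ignore
        if seq > self.next_expected:
            return "gap"         # an earlier chunk is missing; ask for a resend
        self.next_expected += 1
        return "ok"


# Chunk 3 was lost and never resent: the problem surfaces only at seal time.
print(missing_chunks({1, 2, 4}, 4))   # [3]
print(can_seal({1, 2, 4}, 4))         # False

# The same loss is visible immediately when checked on the stream.
checker = StreamChecker()
print([checker.accept(s) for s in (1, 2, 4)])   # ['ok', 'ok', 'gap']
```

The seal-time check alone is correct but late: by the time it fails, the runner may no longer hold the missing chunk. Checking on the stream leaves time to request a resend.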

What makes it different

Design altitude, not just diffs

Point it at a PRD or spec — not only a diff. Snag finds failure modes before the code exists, which a diff-only reviewer structurally can't.

Bring your own model

Your key, your provider — Anthropic, OpenAI, or OpenRouter. Nothing is proxied through a Snag service; there are no Snag servers in the loop.

A committed, diffable artifact

Maps are JSON in git. They travel with the change in review, and diffs across snapshots show how a change's failure surface moves over time.
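As a rough illustration of comparing snapshots, the sketch below diffs the failure modes of two map documents. It assumes each map has a top-level "failure_modes" array whose entries carry a "title"; those field names are hypothetical, since the actual failure-map/v1 layout is not shown here.

```python
import json


def failure_mode_titles(map_json: str) -> set[str]:
    """Collect failure-mode titles from one map snapshot (assumed layout)."""
    doc = json.loads(map_json)
    return {mode["title"] for mode in doc.get("failure_modes", [])}


def diff_failure_surface(old_json: str, new_json: str) -> dict[str, list[str]]:
    """Compare two snapshots and report which failure modes moved."""
    old = failure_mode_titles(old_json)
    new = failure_mode_titles(new_json)
    return {
        "added": sorted(new - old),
        "removed": sorted(old - new),
        "unchanged": sorted(old & new),
    }


before = json.dumps({"failure_modes": [
    {"title": "Boot token reused / leaked"},
    {"title": "Duplicate job:started not idempotent"},
]})
after = json.dumps({"failure_modes": [
    {"title": "Boot token reused / leaked"},
    {"title": "SIGKILL before flush -> lost final logs"},
]})

print(diff_failure_surface(before, after))
```

In practice the two snapshots would be read from different git revisions of the same map file rather than built inline.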