Debug a failing run

Preview the timeline, keep the stack alive, stream values, and replay the event journal.

Goal: figure out why a scenario failed without re-running blind.

Preview what it would do

Before running anything, see the whole timeline statically:

./shinari explain worker-killed-mid-task

explain prints the lifecycle (setup → steadyState → method phases → verify → teardown) with each step’s resolved verb, its kind ([action], [probe], [assertion]), any fault effect, and the finding markers. It executes nothing and touches no system. Use it to confirm the shape of a scenario, spot a misordered phase, or check which steps are faults before committing to a run.

Stream values and durations

./shinari run --verbose worker-killed-mid-task   # or -v

--verbose adds section banners and, for every step, the value it produced and how long it took (✓ jobstore.status → RUNNING (12ms)). It turns the console into a live trace, so a probe returning the wrong value or a step running far too long shows up inline rather than only in the journal.

Keep the stack up

Teardown destroys the evidence. Skip it:

./shinari run --keep-up worker-killed-mid-task

The whole teardown section is skipped: containers, toxics, and DNS overrides stay exactly as the failure left them. Poke at the system, then clean up manually (docker compose down -v). The KEEP_UP=1 environment variable does the same thing.
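As a rough sketch of the "poke, then clean up" step, the helpers below wrap the standard docker compose commands for listing containers, tailing logs, and tearing the stack down. The function names, the DOCKER override, and the default service name worker are assumptions made for illustration; none of them come from shinari itself.

```shell
# Hypothetical helpers for inspecting a stack left up by --keep-up.
# DOCKER can be overridden (for example DOCKER=echo) to preview the commands.
DOCKER="${DOCKER:-docker}"

inspect_stack() {
  # List every container, including ones a fault has stopped.
  "$DOCKER" compose ps --all
  # Tail recent logs for one service; "worker" is an assumed service name.
  "$DOCKER" compose logs --tail 50 "${1:-worker}"
}

cleanup_stack() {
  # Same manual cleanup as above: remove containers, networks, and volumes.
  "$DOCKER" compose down -v
}
```

Run inspect_stack with the service you care about from the directory that holds the compose file, then cleanup_stack once you have what you need.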

Dry-run the timeline

./shinari run --dry-run worker-killed-mid-task

--dry-run skips every action (anything that mutates: docker.kill, http.post, exec.run unless overridden) and still executes probes and assertions. Use it to check wiring (do the probes resolve, do the captures flow) without mutating the system.

Read the journal

Every run writes journal.jsonl: the complete, ordered event stream, one JSON object per line.

jq -r 'select(.type=="step.failed") | "\(.scenario) :: \(.step) :: \(.payload.error)"' shinari-out/journal.jsonl

Useful event types: fault.injected (what actually got injected, when), gate.observed (what value finally satisfied a wait_until), finding.recorded, step.failed. Timestamps let you reconstruct the exact timeline of injection versus observation.
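A minimal sketch of two more journal queries, run against a made-up sample so it can be tried without a real run. It relies only on the fields used above (.type, .scenario, .step, .payload.error); the step names and error text in the sample are invented. Because the journal is ordered, line order already gives the sequence of injection versus observation, so the sketch does not guess at the name of the timestamp field.

```shell
# Hypothetical sample journal; a real run writes shinari-out/journal.jsonl.
# Step names and the error text are invented for illustration.
journal="$(mktemp)"
cat > "$journal" <<'EOF'
{"type":"fault.injected","scenario":"worker-killed-mid-task","step":"kill-worker","payload":{}}
{"type":"gate.observed","scenario":"worker-killed-mid-task","step":"wait-requeued","payload":{}}
{"type":"step.failed","scenario":"worker-killed-mid-task","step":"verify-status","payload":{"error":"expected DONE, got RUNNING"}}
EOF

# Count events by type for a quick overview of what the run recorded.
type_counts="$(jq -r '.type' "$journal" | sort | uniq -c | awk '{print $2 "=" $1}')"
printf '%s\n' "$type_counts"

# Keep only the triage-relevant events, in journal order, as type<TAB>step.
timeline="$(jq -r 'select(.type == "fault.injected"
                          or .type == "gate.observed"
                          or .type == "step.failed")
                   | "\(.type)\t\(.step)"' "$journal")"
printf '%s\n' "$timeline"

rm -f "$journal"
```

Point the same two jq filters at shinari-out/journal.jsonl to use them on a real run.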

Watch a single check

Target one scenario by name (or a whole suite by directory name):

./shinari run worker-killed-mid-task     # one scenario
./shinari run recovery                   # the recovery suite

Common failure signatures

symptom | likely cause
--- | ---
ERRORED + compose output | stack didn’t come up: read the setup step’s error, it embeds stderr
INCONCLUSIVE | steadyState failed before any fault: the environment, not the product
wait_until: condition not observed within Ns; last observed: X | the gate never fired; X tells you what the probe actually saw
a finding flipped to ✗ finding now passes | not a bug: the gap was fixed, promote the assertion