SECTOR 02 // HOW-TO GUIDES

Wait for a service to recover

Block until a container is healthy again after an outage, polling its healthcheck instead of sleeping a fixed interval.

Goal: after a fault, continue the moment a dependency is healthy again, not a guessed number of seconds later.

The problem with a fixed sleep #

After an outage a container is often still running but transiently unhealthy: its last healthcheck failed while the fault was active, and it needs a few seconds to pass again. Sleeping a fixed interval is both racy and slow. Too short and the next step races a half-recovered service; too long and every run pays for the worst case.

# Racy: hopes recovery takes under 10s, wastes time when it takes less.
- run: sleep
  with: 10

Poll the healthcheck instead #

docker.ps reports a service’s Health (when the compose service defines a healthcheck). Drive it with wait_until, which re-runs a probe to a deadline and stops as soon as the condition holds:

- run: wait_until
  with:
    probe:
      run: docker.ps
      with: api
    read: .Health
    equals: healthy
    timeout: 60
    interval: 2

This polls api every 2 seconds and proceeds the instant its healthcheck reports healthy, failing the scenario only if 60 seconds pass first. It works the same whether the container was just started or was already running and unhealthy: the wait is a poll, not a one-shot read of a stale status.

Recover and wait, end to end #

A typical recovery sequence restarts the dependency and then waits for it to be healthy before asserting that in-flight work completed:

method:
  - phase: outage
    steps:
      - run: docker.kill
        with: api
  - phase: recover
    steps:
      - run: docker.start
        with: api
      - run: wait_until
        with:
          probe: { run: docker.ps, with: api }
          read: .Health
          equals: healthy
          timeout: 60
verify:
  - run: sut.await
    with: "${.outputs.id}"

up --wait versus polling for health #

docker.up with the default wait: true already blocks until services are healthy, so it covers the initial bring-up in setup. Use the wait_until plus docker.ps pattern for recovery, when the container is already running and you need to block until its healthcheck passes again rather than start it.

See Verbs & builtins for the full wait_until shape and the assert-operator set its condition shares.