SECTOR 03 // REFERENCE
docker
Lifecycle (up/down) plus process, network, and resource faults (kill, pause, restart, disconnect, throttle) over the docker compose CLI.
Lifecycle and process faults. Drives the docker compose CLI. This is the
lifecycle provider: it implements up/down and powers the default teardown.
providers:
docker:
config:
composeFiles: [assets/stack.yml]
project: chaos-run
composeFiles lists the compose files to drive (relative paths resolve against
the project root); project is the compose project name. Every verb except ps
returns the command’s trimmed stdout as the value, the untrimmed stdout in
output, and an empty meta.
Verbs #
up (action) #
Runs compose up -d --wait, blocking until every started service is healthy.
Pass wait: false to drop --wait (see below). Select compose profiles with
profiles:.
| arg | type | req | description |
|---|---|---|---|
services | list | no | services to start; all of them when omitted (primary) |
wait | bool | no | wait for health before returning (default true) |
profiles | list | no | compose profiles to activate |
- run: docker.up
with: { services: [api, worker] }
down (action) #
Runs compose down -v --remove-orphans, tearing the whole project down
regardless of profile. This powers the default teardown.
No args.
- run: docker.down
kill / stop (action, outage) #
Signals a running container: kill sends SIGKILL (abrupt), stop sends SIGTERM
(the graceful shutdown path). Both inject an outage.
| arg | type | req | description |
|---|---|---|---|
service | string | yes | the service to signal (primary) |
- run: docker.kill
with: worker
start (action) #
Restarts a stopped service.
| arg | type | req | description |
|---|---|---|---|
service | string | yes | the service to start (primary) |
- run: docker.start
with: worker
restart (action, outage) #
Bounces a service (stop + start) in one step: the graceful rolling-restart fault. An outage, since work in flight when the SIGTERM lands is dropped, but one that heals itself, so the interesting assertions are about what peers observed during the bounce (retries, failover, no lost writes).
| arg | type | req | description |
|---|---|---|---|
service | string | yes | the service to bounce (primary) |
- run: docker.restart
with: api
pause / unpause (action) #
Freezes (pause) or thaws (unpause) a container’s processes with SIGSTOP/
SIGCONT. pause carries effect: outage; unpause reverts it.
| arg | type | req | description |
|---|---|---|---|
service | string | yes | the service to freeze or thaw (primary) |
- run: docker.pause
with: worker
- run: docker.unpause
with: worker
disconnect / connect (action) #
Partitions a single container at the network layer (disconnect, an outage)
and reconnects it (connect). The process keeps running and co-located peers
are untouched, so the scenario observes last-known-state behavior and
reconnection on restore: a distinct failure mode from kill/stop/pause.
They target one docker network, defaulting to compose’s <project>_default
(so network is required when no project is configured); a multi-network
container is isolated by disconnecting each. connect restores the compose
service-name DNS alias, so peers resolve the container again.
| arg | type | req | description |
|---|---|---|---|
service | string | yes | the service to partition or reconnect (primary) |
network | string | no | the docker network (default <project>_default) |
- run: docker.disconnect
with: worker
- run: docker.connect
with: worker
throttle / unthrottle (action) #
Caps (throttle) or restores (unthrottle) a container’s CPU via
docker update --cpus: resource starvation as a degradation. The process keeps
running and keeps its connections, it just gets slow: a distinct failure mode
from pause (frozen) and kill (gone). throttle carries
effect: degradation; unthrottle reverts it (--cpus 0 means “no limit”).
CPU only, deliberately: a memory ceiling cannot be reset to unlimited through
docker update, so it would be a fault with no restore. Inject memory pressure
by restarting the service with compose-level limits instead.
| arg | type | req | description |
|---|---|---|---|
service | string | yes | the service to cap or restore (primary) |
cpus | number | throttle only | the CPU ceiling (e.g. 0.2 = a fifth of one core) |
- run: docker.throttle
with: { service: worker, cpus: 0.2 }
- run: docker.unthrottle
with: worker
logs (probe) #
Fetches a container’s logs. tail/since fetch an incremental slice, so a
wait_until can gate on a log line appearing.
| arg | type | req | description |
|---|---|---|---|
service | string | yes | the service whose logs to read (primary) |
tail | string | no | only the last N lines |
since | string | no | only lines since a timestamp or relative time (e.g. 30s) |
- run: wait_until
with:
probe: { run: docker.logs, with: { service: worker, tail: "20" } }
matches: "rebalanced"
timeout: 30
ps (probe) #
Returns parsed compose ps --all --format json (--all so exited and dead
containers still report). With a service named that matches exactly one
container, it binds that container’s object directly, so read:/capture:
reach .State, .ExitCode, and .Health without indexing a list; with no
service it returns the full list.
| arg | type | req | description |
|---|---|---|---|
service | string | no | one service to inspect; all containers when omitted (primary) |
Returns the container object (single match) or the list of objects.
meta.count (int) is the number of containers. output is the raw JSON.
- run: docker.ps
with: worker
capture: { state: ".State", code: ".ExitCode" }
- run: assert # crashed clean, did not hang
with: { of: "${.outputs.state}", equals: "exited" }
- run: assert
with: { of: "${.outputs.code}", equals: 0 }
exec (probe) #
Runs a command inside a running container (compose exec -T <service> sh -c)
and returns its stdout, so a scenario can read internal runtime state (thread
or fd counts, memory, an in-container metric) and baseline-then-compare it with
the standard assert operators. A probe: it observes, it does not inject a
fault. Keep the command read-only; a fault injected through exec belongs on
an action step with an explicit effect: override.
| arg | type | req | description |
|---|---|---|---|
service | string | yes | the service to run the command in |
command | string | yes | shell command line; pipes and globs work (primary) |
Returns the trimmed stdout as the value.
- run: docker.exec
with: { service: worker, command: "ls /proc/1/task | wc -l" }
as: threads_before
Starting a service that is meant to fail #
up runs compose up -d --wait, blocking until every started service is
healthy. To bring up a service that is supposed to crash or hang (so you can
assert it fails fast rather than blocks), pass wait: false: the --wait is
dropped and the step returns once the container is created.
- run: docker.up
with: { services: [worker], wait: false }
Service variants (compose profiles) #
To run the same role in different shapes (a baseline worker, a round-robin
worker, a partition-failover worker) keep one compose file and tag each variant
service with a compose profile, then
select one per scenario with profiles:. This stays hermetic and keeps a single
lifecycle owner; there is no per-scenario compose-file swapping or second docker
provider (a scenario can still override composeFiles in its own providers:
block if it genuinely needs a different stack).
- run: docker.up
with: { profiles: [rr], wait: true } # → compose --profile rr up -d --wait
down tears the whole project down regardless of profile.