SECTOR 02 // HOW-TO GUIDES
Break DNS resolution
Make a hostname disappear, point it somewhere else, or blackhole it with the net provider.
Goal: simulate DNS-level failures: the outage class that bypasses every proxy because the client never connects at all.
Prerequisites #
The net provider drives dnsmasq through conf snippets: it writes one
file per host into a directory dnsmasq watches (conf-dir=), then runs your
reload command. Your stack must resolve through that dnsmasq.
providers:
net:
config:
confDir: assets/dnsmasq.d
reloadCmd: docker compose -p chaos kill -s SIGHUP dnsmasq
Make a host vanish (NXDOMAIN) #
- run: net.nxdomain
with: api.partner.com
The host resolves to nothing: the “their domain expired” failure.
Repoint a host #
- run: net.set_dns
with:
host: db.internal
ip: 10.0.0.99
Useful for “the failover DNS record was wrong” scenarios.
Resolve a host to several endpoints #
A service name backed by several A records uses ips:
- run: net.set_dns
with:
host: backends.example.test
ips: [10.0.0.1, 10.0.0.2]
The name now resolves to both addresses at once: the multi-endpoint discovery mode where a client spreads across the live set and rides out the loss of any one member.
Change the live set #
set_dns declares the full set, so the endpoints a name resolves to change by restating it. A client
that refreshes DNS picks up an address added to the set, and stops using one dropped from it, without
restarting:
# was {10.0.0.1, 10.0.0.2}; drop one endpoint
- run: net.set_dns
with:
host: backends.example.test
ips: [10.0.0.1]
effect: degradation
Restating a smaller set is the fault being injected, so the step declares effect: degradation (the
same per-step override exec.run uses when it shells out to tc). Restating a larger set, or the
original set, is recovery and needs no effect:. Taking the records away entirely is net.nxdomain:
a client that keeps running on its last-known endpoints through the outage, then recovers once the set
returns, is the DNS-outage resilience property.
Blackhole a host #
- run: net.dns_blackhole
with: api.partner.com
Resolves to 0.0.0.0: lookups succeed, connections go nowhere, and clients hit
connect timeouts instead of resolution errors. Different failure, different
code path in your retry logic. Test both.
Restore #
net.clear lifts one host’s override mid-scenario (the recovery step of a
fault-then-recover exercise); net.reset removes every snippet the provider
wrote and reloads once. Both are idempotent, so a teardown that runs after an
early failure is a no-op:
- run: net.clear
with: api.partner.com
teardown:
- run: net.reset
- run: docker.down