You're debugging a production outage. Logs show connection refused. Your first thought: DNS. When containers can't find each other, it's always DNS: the service registry looks fine, nothing changed, and yet containers are talking to ghosts.
So how do containers actually find each other? No phone book. No static IPs. Just a few DNS records and a lot of wishful thinking.
Where This Hits You: The Site Context
According to internal training notes, beginners fail when they optimize for shortcuts before they fix the baseline.
The moment you realize DNS matters
It hits at 2:47 PM on a Tuesday. Staging is green: containers run, health checks pass, logs look quiet. But the payment service returns 502s. No code changed. No deploy ran. You SSH into a box and curl the service endpoint: nothing. Then you try the pod IP directly, and it works. The problem isn't the application. The problem is that payment-service resolved to a stale IP from six minutes ago. That's when DNS stops being an abstract diagram and becomes a fire you must extinguish before the incident post-mortem writes itself.
Common orchestration setups: Kubernetes vs Docker Compose vs Nomad
Not every container runtime handles naming the same way. Docker Compose gives you automatic discovery out of the box: docker compose up creates a network where services resolve by name. It feels like magic until you scale. Try running Compose across ten nodes; local DNS breaks instantly because there's no cluster-wide resolver. Kubernetes solves this with CoreDNS, a dedicated DNS server per cluster. Service names like api.prod.svc.cluster.local resolve reliably until they don't, usually because a pod's dnsPolicy is set wrong or the search domain list gets truncated. Nomad is typically paired with Consul, which is more flexible but adds another stateful component you must monitor. The trade-off: simpler tools fail at scale; complex ones fail in surprising ways.
'We spent three days debugging a "network partition" that was actually a stale A record cached by a sidecar proxy nobody knew about.'
— Platform engineer, post-incident review at a mid-stage fintech
Why developers treat DNS as magic
The catch: most developers never configure DNS directly. They set a container_name in YAML, run compose, and names work. That assumption is dangerous, because the resolver does not always return the truth. Containers move, IPs change, and caching layers (OS resolver, application-level DNS libraries, mesh sidecars) each hold stale records for different durations. I've seen teams add a 30-second sleep before database connections just to wait for DNS propagation, and call it a production fix. That's superstition with a retry loop. The underlying problem persists: you either design for DNS failure from the start, or accept that a random cache miss will surface as an outage during your next demo.
Foundations Everyone Gets Wrong
Search domains and ndots: the silent killer
You type curl http://api in a container, and it works. You ship to production (same image, same Compose file) and the request hangs for 30 seconds before failing. Nothing changed in your code. The difference? Your local Docker DNS appended api.namespace.svc.cluster.local automatically. In production, a different ndots setting made the resolver try api as a global query first, hit the root servers, time out, then retry with your search domains. Those five extra retries? Nine seconds of latency you never asked for.
Most teams skip this: ndots controls how many dots in a hostname force an absolute lookup. The default is 1, meaning api (zero dots) triggers search-domain expansion first. But set ndots to 5, and a hostname like svc.cluster.local (two dots) still gets the search path appended before a direct query. Wrong order. I've seen staging clusters burn two hours of engineer time because someone copy-pasted a Kubernetes DNS config with ndots:5 from a how-to guide intended for bare metal. The fix: align ndots with the dot count of the names you actually query. If your longest internal DNS entry has two dots, ndots:3 is overkill, and you'll double lookups on every request.
'We spent a sprint debugging a "random" 500ms latency spike in Redis connections. Turned out every DNS query was hitting the fallback search domain before the correct one.'
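To make the ordering concrete, here is a minimal Python sketch of how a stub resolver decides what to ask for. It is a simplified model of resolv.conf semantics rather than any resolver's actual code, and the function name candidate_queries and the search list in the usage note are made up for illustration.

```python
def candidate_queries(name, search_domains, ndots=1):
    """Return the fully qualified names a stub resolver would try, in order.

    Simplified model (an assumption, not a real resolver): no timeouts,
    no retries, no per-domain error handling.
    """
    # A trailing dot marks the name as absolute: no search expansion at all.
    if name.endswith("."):
        return [name]

    absolute = name + "."
    expanded = [f"{name}.{domain}." for domain in search_domains]

    # At or above the ndots threshold the name is tried as-is first.
    # Below it, the search list is walked first and the bare name comes last.
    if name.count(".") >= ndots:
        return [absolute] + expanded
    return expanded + [absolute]
```

With a search list of three domains, candidate_queries("svc.cluster.local", search, ndots=5) puts the bare name last, after three wasted lookups; with ndots=2 it comes first. Appending a trailing dot skips the search list entirely.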
— SRE lead, after auditing a 40-node EKS cluster
SRV records vs A records: which one do you actually need?
That sounds fine until your team decides to put three Redis replicas behind a single DNS name and expects round-robin load balancing. A records give you IPs, which is great for simple connections. But if your app needs to discover port numbers dynamically, or you're doing weighted traffic splitting, A records won't carry that metadata. SRV records bundle host + port + priority + weight in one query. The catch: most application DNS resolvers (looking at you, glibc) do not natively resolve SRV; they expect A/AAAA. You'll need a stub resolver or a service-mesh sidecar to handle it. What usually breaks first is a developer hardcoding port 80 after reading an SRV record's hostname, ignoring the port field entirely. I fixed one such incident by adding a one-line comment in the deployment ADR: "This cluster uses SRV. Your PORT env var is garbage if you read it before _redis._tcp.svc." Not elegant. But it stopped the copy-paste cycle.
SRV records also expose a subtler trap: they don't expire gracefully. An A record missing from DNS drops traffic instantly, so you notice. An SRV record that returns stale weight=0 entries still passes the hostname along, so your client connects to a dead pod but gets no error until the TCP handshake itself times out. That's 1–3 seconds of dead air per request, invisible in logs unless you correlate at the transport layer. Most teams revert to A records simply because debugging SRV failures requires dig +trace and an understanding of DNS response structure that nobody in the on-call rotation has at 3 AM.
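As an illustration of what "using the port and weight fields" means in practice, here is a small Python sketch that picks a target from already-parsed SRV records. It follows the general idea of RFC 2782 (lowest priority first, weighted choice within that tier) but simplifies it: zero-weight entries are never picked while a weighted entry exists in the same tier. The record layout (dicts with priority, weight, port, target) and the function name are assumptions for this example; fetching the records is left to whatever DNS library you use.

```python
import random


def pick_srv_target(records, rng=random):
    """Pick (host, port) from SRV-like records.

    records: list of dicts with 'priority', 'weight', 'port', 'target'
    (a hypothetical layout for this sketch).
    rng: the random module or a random.Random instance, for repeatable tests.
    """
    if not records:
        raise ValueError("no SRV records to choose from")

    # Lowest priority value wins; other tiers are only fallbacks.
    best = min(r["priority"] for r in records)
    tier = [r for r in records if r["priority"] == best]

    total = sum(r["weight"] for r in tier)
    if total == 0:
        # Every entry has weight 0: fall back to a uniform choice.
        choice = rng.choice(tier)
    else:
        # Simplification: weight-0 entries are skipped while others exist.
        weights = [r["weight"] for r in tier]
        choice = rng.choices(tier, weights=weights, k=1)[0]

    # Return the port from the record, never a hardcoded default.
    return choice["target"], choice["port"]
```

The point of returning the (host, port) pair together is that the caller never has a chance to pair the SRV hostname with a port read from somewhere else.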
TTL and caching: why updates don't propagate instantly
You update a DNS entry, wait five seconds, curl it: old IP. You curse the provider, try again ten seconds later: still the old IP. The TTL you set was 60 seconds. What you forgot: every intermediate resolver between your container and the authoritative server can apply its own minimum cache duration. The kube-dns pod itself might cache for 30 seconds regardless of your TTL. The node-level systemd-resolved adds another layer. And caches inside the container (a language runtime, an application-level DNS library) can hold a positive answer for 60 seconds no matter what the upstream says. So a 5-second TTL can effectively become 60 inside the container. That hurts.
Worth flagging: many devs treat TTL as "time until I can test my change" when it's actually "time until the new value becomes the sole value." During the overlap window, different containers get different IPs depending on which cache they hit. If you're doing blue-green deploys via DNS flip, this overlap window is where traffic splits unevenly and you see intermittent error responses. The pragmatic fix: lower TTL only during deploys (15 seconds), then ratchet it back to 300 for steady state. Don't leave it at 5 seconds forever. Every container then pounds your DNS server with lookups, and the cache-miss rate spikes, turning DNS into your bottleneck instead of your resolver.
One concrete anecdote: a team I worked with rotated database credentials by swapping the db DNS record to a new proxy. They set TTL to 30 seconds, waited a full minute, cut over, and the first 10% of reads still hit the old proxy, which had already been decommissioned. Cause: the sidecar Envoy proxy cached DNS for a hardcoded 120 seconds, ignoring the upstream TTL entirely. They had to reload the Envoy config to flush it. Now they run a pre-flight check: after any DNS change, validate that all proxy layers in the mesh reflect the new record before draining the old targets.
Patterns That Actually Work
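The clamping effect is easy to reproduce in a toy model. The Python sketch below is an illustration only: FloorCache and its min_hold parameter are hypothetical names, and real caches differ in many details. It shows how a layer with a minimum hold time turns a 5-second TTL into a 60-second one.

```python
class FloorCache:
    """Toy resolver cache that holds each answer for max(ttl, min_hold) seconds."""

    def __init__(self, upstream, min_hold=0.0):
        self.upstream = upstream   # callable: name -> (ip, ttl_seconds)
        self.min_hold = min_hold   # the layer's own minimum cache duration
        self._entries = {}         # name -> (ip, expires_at)

    def resolve(self, name, now):
        """Resolve name at time `now` (seconds), serving from cache if still held."""
        cached = self._entries.get(name)
        if cached and now < cached[1]:
            return cached[0]
        ip, ttl = self.upstream(name)
        # The upstream TTL only matters if it is longer than the floor.
        self._entries[name] = (ip, now + max(ttl, self.min_hold))
        return ip
```

Passing the clock value in explicitly keeps the model deterministic: resolve at time 0, change the upstream answer, and the old IP is still returned at time 30 even though the record's TTL was 5.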
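A pre-flight check of that kind can be sketched in a few lines of Python. This is an assumption-laden outline, not the team's actual tooling: wait_for_convergence is a made-up name, and each "layer" is just a callable you would have to implement yourself (for example, one that queries CoreDNS directly and one that asks a proxy's admin interface what it currently resolves).

```python
import time


def wait_for_convergence(layers, name, expected_ip, timeout=180.0, interval=5.0,
                         clock=time.monotonic, sleep=time.sleep):
    """Poll every caching layer until all of them return expected_ip.

    layers: dict mapping a label to a callable(name) -> ip (hypothetical probes).
    Returns {} on success, otherwise {label: last_ip_seen} for layers still stale
    when the timeout expired.
    """
    deadline = clock() + timeout
    while True:
        stale = {}
        for label, resolve in layers.items():
            ip = resolve(name)
            if ip != expected_ip:
                stale[label] = ip
        # Stop when everything agrees, or when we have run out of time.
        if not stale or clock() >= deadline:
            return stale
        sleep(interval)
```

Only drain the old targets when the function returns an empty dict. A non-empty result names the layer that is still serving the old record, which is exactly the information that was missing in the Envoy incident.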
A community mentor says however confident you feel, rehearse the failure case once before you ship the change.
Headless Services in Kubernetes: The 'No VIP' Pattern
You'd think every service needs a cluster IP. That's the Kubernetes default: one virtual IP, one load-balanced entry point. But headless services flip that: you set clusterIP: None, and suddenly DNS returns every pod IP directly. Why would you want that? Because your app needs pod identity, not load-balanced obscurity. Stateful workloads (Cassandra, ZooKeeper, Kafka) require each peer to know which pod it's talking to, not just some pod. We fixed a production split-brain by switching to headless; the client library needed individual pod IPs to order replicas, and the round-robin VIP was masking failures. The trade-off is real: headless services drop built-in load balancing. If you connect without client-side retry logic, one slow pod can starve your requests. Worth flagging: DNS resolution returns records in arbitrary order; don't assume they're shuffled. Stick to headless when you need direct pod enumeration and have clients that can handle failure themselves. For plain web tiers? Keep the cluster IP.
DNS Round-Robin: Simple Load Balancing That Bites Back
Most teams skip this: dnsConfig with ndots: 5 and multiple A records. It works. A single DNS name resolves to a set of pod IPs, the client picks one, and traffic spreads. I've seen this carry 10,000 req/s on a three-node cluster without a reverse proxy. That sounds fine until a pod crashes. What usually breaks first is the client's DNS cache. The old IP stays alive locally, requests fail silently, and nobody checks TTLs. The catch: DNS-based load balancing has no health checks. If pod #2 goes down, DNS still serves its IP for the duration of the record's TTL. You lose traffic, maybe a deploy. Best practice: keep TTLs under 30 seconds, and pair them with a readiness probe that evicts unresponsive pods from endpoints. Even then, don't treat DNS round-robin as your primary balancer for latency-sensitive APIs, because the TCP connection overhead and retransmit delays will spike. Concrete anecdote: we debugged a 12-second tail latency, only to find the Go HTTP client was re-resolving on every request with a 5-second timeout. Fixed by configuring ResolveConfig with a short cache. The lesson: DNS round-robin is free, cheap, and available, but only if you instrument cache behavior.
“DNS is not a load balancer. It's a lookup system that happens to return multiple IPs. Treat it like one and you'll regret it.”
— SRE lead, after untangling a three-hour outage from stale A records
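Client-side failure handling against a headless service might look like the following Python sketch. It assumes nothing beyond the standard library; the function names are invented for this example, and the connect parameter exists only so the failover logic can be exercised without a real network.

```python
import socket


def resolve_all(host, port):
    """Return every distinct address the resolver gives for host, in answer order."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    addresses = []
    for _family, _type, _proto, _canonname, sockaddr in infos:
        if sockaddr[0] not in addresses:
            addresses.append(sockaddr[0])
    return addresses


def connect_first_reachable(addresses, port, connect=socket.create_connection,
                            timeout=1.0):
    """Try each pod address in turn; return (address, connection) for the first success."""
    errors = {}
    for address in addresses:
        try:
            return address, connect((address, port), timeout)
        except OSError as exc:
            # Remember the failure and move on to the next peer.
            errors[address] = exc
    raise ConnectionError(f"no reachable peer among {list(errors)}")
```

Because the answer order is arbitrary, a client that wants an even spread should shuffle the list from resolve_all itself before calling connect_first_reachable.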
ExternalDNS for Multi-Cloud Discovery: The Sync Layer You Need
Containers across AWS, GCP, and on-prem? They need a shared registry, something like my-app.example.com that resolves to the right cluster. ExternalDNS watches your Kubernetes ingress and service resources, then automates DNS records in Route53, Cloud DNS, or Azure DNS. No more manual CNAME entries when you migrate workloads between clouds. We deployed this for a hybrid setup: on-prem Kafka brokers registered themselves via ExternalDNS against a private zone; cloud consumers just resolved the same hostname. The tricky bit is reconciling across providers. Each cloud DNS API has its own rate limits, propagation delays, and TTL quirks. Failover between regions? You'll need failover routing policies, because ExternalDNS doesn't do health-based routing natively. Most teams skip this: set up a separate health-check script that drops the unhealthy zone's records before DNS syncs. The long-term cost is credential sprawl: one service account per DNS zone. But the payoff is real: zero-touch DNS when you scale clusters. I've seen a team remove whole days of manual DNS effort each month after adopting this pattern. It's not useful for single-region deployments; that's overkill. But for multi-cloud teams? It's the cheapest abstraction you'll buy.
Anti-Patterns That Make Teams Revert
Hardcoding IPs in environment variables
Looks clean in a .env file. One line, one value, no mystery. The service team swears the database IP hasn't changed in six months, so why bother with DNS at all? I've seen teams ship this way on a Friday, only to spend Monday morning chasing a ghost: an upstream container got rescheduled to a new host, and suddenly every hardcoded DB_HOST=10.0.3.12 points at an empty port. The catch is that Kubernetes, Nomad, or even a simple Docker Swarm will reassign IPs arbitrarily. Miss one reschedule, one scale-up event, or one rolling deploy, and that one-liner turns into a pager storm. Worse, nobody documents the magic numbers. New devs inherit a .env file they don't trust, and eventually they layer new hardcoded values on top of old ones. A brittle lattice. What usually breaks first is staging, because staging IPs drift daily while prod lags behind, and then the staging box hits prod accidentally. That hurts.
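One cheap guard is a lint step that flags IP literals in environment configuration before they ship. The Python sketch below is a minimal version of that idea; the function name is hypothetical, and it only understands bare IPs and IPv4 ip:port pairs, so treat it as a starting point.

```python
import ipaddress


def hardcoded_ip_vars(env):
    """Return the sorted names of variables whose value is an IP literal.

    Handles bare IPv4/IPv6 addresses and IPv4 'ip:port' pairs. Bracketed
    IPv6 with a port is not handled in this sketch.
    """
    flagged = []
    for key, value in env.items():
        # Exactly one colon is treated as host:port; anything else is checked whole.
        host = value.rsplit(":", 1)[0] if value.count(":") == 1 else value
        try:
            ipaddress.ip_address(host)
        except ValueError:
            continue  # a hostname, a port number, or some other value
        flagged.append(key)
    return sorted(flagged)
```

Run against a parsed .env file, a value like DB_HOST=10.0.3.12 is flagged while DB_HOST=db.internal passes, which nudges the config toward names the orchestrator can keep current.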
Overriding /etc/hosts with scripts
Teams reach for /etc/hosts when DNS feels too slow or opaque. An entrypoint script injects entries like echo '10.0.1.5 redis-cache' >> /etc/hosts and calls it done. That sounds fine until three things happen: (1) the container restarts and the script runs again, duplicating entries; (2) a DNS server updates but the hosts file doesn't; (3) both sources disagree, and the resolver picks the file every time, silently. I watched a migration stall for two weeks because the /etc/hosts entry pointed to an old Redis cluster, while DNS had the correct endpoint. The team had assumed containers were "stateless enough" that a hosts file override would reset on reboot. Wrong. Containers reuse images; scripts persist in the entrypoint, so the stale row stays alive across redeploys. The worst part? Logging shows nothing. No DNS query, no failure, just an IP that used to work, returning data that's six hours stale.
“/etc/hosts is a scalpel for one container, but teams use it like a chainsaw across hundreds.”
— Infrastructure lead, after a three-hour incident post-mortem
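If hosts-file overrides already exist in your images, an audit that compares them against DNS makes the silent disagreement visible. The Python sketch below is one way to do it under stated assumptions: stale_hosts_entries is a made-up name, and resolve must be a callable that asks DNS directly (bypassing the hosts file), which you would supply with a DNS library of your choice.

```python
def stale_hosts_entries(hosts_text, resolve):
    """Return {name: (hosts_ip, dns_ips)} where the hosts file disagrees with DNS.

    hosts_text: contents of an /etc/hosts-style file.
    resolve: callable(name) -> list of IPs from DNS; an empty list means
             DNS has no answer, in which case the entry is not reported.
    """
    stale = {}
    for line in hosts_text.splitlines():
        # Drop comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        ip, *names = line.split()
        for name in names:
            dns_ips = resolve(name)
            if dns_ips and ip not in dns_ips:
                stale[name] = (ip, dns_ips)
    return stale
```

Running this at container start and logging a non-empty result would have surfaced the old Redis endpoint on day one instead of week two.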
Using flat DNS with no TTL strategy
Zero TTL means every lookup hits the nameserver fresh. Sounds responsible: always get the latest IP. Problem is, your DNS provider starts throttling, the container runtime queues up lookups, and suddenly a 1ms resolution turns into a 200ms blocking call. Not yet a crisis, but multiply that by fifty microservices polling every three seconds. The OS resolver library clogs, threads block, response times degrade, and the team blames the database when it's the DNS hammering that caused the timeout. The opposite extreme also fails: TTL set to 3600 seconds or more, so containers hold dead IPs for an hour after a rolling update. A team I worked with set TTL to 86400 (twenty-four hours) to "optimise" DNS queries. Then they did a blue-green deploy, cut half the instances, and watched the other half resolve to vanished hosts for the rest of the workday. Flat TTL with no strategy is a bet against change, and containers change every week.
Fix this: per-service TTL tiers. Health-check endpoints get 5-10 second TTLs. Static caches get 60 seconds. Never zero, never daily. The sticky bit is you have to coordinate with the DNS admin to expose TTL in the provider's interface. Most teams skip this, assuming the default of 300 seconds is fine. It's not, when your orchestrator cycles IPs every forty-five seconds during a deploy.
Maintenance, Drift, and Long-Term Cost
According to a practitioner we spoke with, the first fix is usually a checklist order issue, not missing talent.
Configuration drift between environments
DNS works great in your local Docker Compose setup. Three containers, one network, everything resolves instantly. Then you promote to staging, and suddenly api.service.consul returns nothing for twelve seconds. Your CI pipeline fails. The team blames networking. The real culprit is drift: the staging cluster registers services under api.prod.svc.cluster.local while your local YAML uses a hardcoded backend alias. I've watched teams burn two full sprints reconciling these mismatches. The pattern repeats: developers hardcode service names during prototyping, those names never get standardized, and each environment sprouts its own DNS taxonomy. A staging environment might use user-svc while production calls it users-api. Containers can't find each other, not because DNS is broken, but because nobody aligned the naming convention across environments.
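The tier rule can be enforced with a small check in CI. The Python sketch below encodes the two tiers named above; the floor and ceiling values, the record layout, and the function name are assumptions chosen for illustration, so adjust them to your own policy.

```python
# Upper TTL limit per tier, in seconds (taken from the tiers described above).
TTL_TIERS = {"health-check": 10, "static-cache": 60}


def validate_ttls(records, tiers=TTL_TIERS, floor=1, ceiling=3600):
    """Return a list of problems for records given as {name: (tier, ttl_seconds)}.

    floor and ceiling are assumed bounds for "never zero, never daily".
    """
    problems = []
    for name, (tier, ttl) in sorted(records.items()):
        if ttl < floor:
            problems.append(f"{name}: TTL {ttl}s is effectively zero")
        elif ttl > ceiling:
            problems.append(f"{name}: TTL {ttl}s outlives a rolling deploy")
        elif tier in tiers and ttl > tiers[tier]:
            problems.append(
                f"{name}: TTL {ttl}s exceeds the {tier} limit of {tiers[tier]}s"
            )
    return problems
```

An empty list means every record fits its tier. Anything else is a message you can fail the pipeline on, including the provider default of 300 seconds sneaking onto a health-check record.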
'DNS doesn't fail fast. It fails slowly, returning stale records for minutes before you even know something's off.'
— SRE who rebuilt three monitoring dashboards before fixing their TTL config
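Naming drift between environments is straightforward to detect once each environment's service names are exported as a set. The Python sketch below assumes you can produce those sets yourself (from manifests, a registry dump, or similar); naming_drift is a hypothetical helper, not part of any tool.

```python
def naming_drift(environments):
    """Given {env_name: set_of_service_names}, return {service: [envs missing it]}.

    A service that exists everywhere is omitted from the result.
    """
    all_names = set().union(*environments.values())
    drift = {}
    for name in sorted(all_names):
        missing = sorted(
            env for env, names in environments.items() if name not in names
        )
        if missing:
            drift[name] = missing
    return drift
```

For the example in the text, staging with user-svc and production with users-api, the result lists each name next to the environment that lacks it, which is usually enough to start the renaming conversation.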
DNS as a single point of failure
Most teams treat DNS as an infinite, always-on utility. It's not. When your core DNS resolver goes down (and it will) every service discovery call becomes a timeout lottery. Containers that depend on DNS for initial connection hang for 30 seconds before falling back to IP caches. By that time your orchestrator has already rescheduled the container. Worse: cascading failures. The auth service can't resolve the user service, so login requests queue up, memory spikes, and the node OOM-kills half the fleet. The fix often involves running redundant DNS replicas, but that introduces another issue: split-brain resolution, where different nodes return different answers for the same record. That hurts.
Scaling DNS with thousands of services
The tricky bit is that DNS was designed for a world with maybe fifty hostnames. Modern container environments routinely run five thousand services, each with multiple A records. At that scale, DNS queries generate measurable network backpressure. Every pod startup triggers a DNS lookup for every sidecar, init container, and dependency, sometimes thirty lookups per pod. If your deployment rolls two hundred pods simultaneously, that's six thousand queries in under a minute. The default ndots:5 resolver behavior amplifies the problem: any name with fewer than five dots is tried against every search domain before the resolver falls back to the bare name or returns a failure. We fixed this by reducing ndots to 1 and deploying a dedicated DNS cache per node. Query latency dropped from 120ms to 2ms. That said, most teams skip this tuning until their first production outage.
The long-term costs add up: environment-specific config files that nobody remembers to update, DNS query volumes that saturate your observability pipeline, and the daily toil of debugging "can't resolve hostname" errors at 2 AM. Container networking doesn't break catastrophically; it erodes slowly, one unresolved query at a time. Track your DNS failure rates monthly. When you see them rise above 0.1%, schedule remediation. Don't wait for the deployment freeze to force your hand.
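One way to soften a resolver outage on the client side is to remember the last good answer and serve it, clearly marked as stale, when a lookup fails. The Python sketch below illustrates that idea only. The class name is invented, the wrapped resolve callable is something you provide, and serving stale data is a trade-off you should make deliberately, since it can also keep a dead IP alive.

```python
class LastKnownGoodResolver:
    """Wrap a resolver and fall back to the last successful answer on failure."""

    def __init__(self, resolve):
        # resolve: callable(name) -> list of IPs, raising OSError on failure
        # (socket.gaierror is a subclass of OSError).
        self._resolve = resolve
        self._last_good = {}

    def lookup(self, name):
        """Return (ips, is_stale). Re-raises if there is nothing to fall back on."""
        try:
            ips = self._resolve(name)
        except OSError:
            if name in self._last_good:
                return self._last_good[name], True
            raise
        self._last_good[name] = ips
        return ips, False
```

The is_stale flag matters: callers can log it or emit a metric, so a resolver outage shows up as a visible signal instead of a 30-second hang.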
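The arithmetic behind that amplification is worth writing down. The Python sketch below computes an upper bound, assuming every lookup walks the full search list before the bare name is tried. Real traffic is lower when names resolve on an early search domain, and the record_types parameter (for example 2 if the client asks for both A and AAAA) is an assumption added for this example.

```python
def rollout_queries(pods, lookups_per_pod, search_domains=0, record_types=1):
    """Upper-bound estimate of DNS queries generated by a rollout.

    Each lookup costs one query per search domain plus one for the bare
    name, multiplied by the number of record types requested.
    """
    per_lookup = (search_domains + 1) * record_types
    return pods * lookups_per_pod * per_lookup
```

With the figures from the text, rollout_queries(200, 30) gives the six thousand baseline, and adding five search domains multiplies it sixfold in the worst case, which is the load a per-node cache and a lower ndots are meant to absorb.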
When Not to Use This Approach
Stateful services where stable network identity is critical
DNS is ephemeral by design: it resolves names to IPs that can shift, rotate, or disappear. That's fine for stateless microservices. But for stateful workloads (databases, caches, message brokers) the moment a DNS lookup resolves to a different IP than the one currently holding your session data, you're in trouble. I've seen teams spend three days debugging PostgreSQL failovers only to find that a connection pool had cached a hostname that silently changed. The fix wasn't "better DNS." It was a headless service with SRV records, combined with application-level retry logic. Or simpler: don't use DNS at all for peer discovery in a clustered database. Let ZooKeeper, etcd, or a fixed IP list handle it. DNS promises location independence, but stateful systems crave stability. Wrong trade-off.
High-throughput systems that can't tolerate DNS latency
Every DNS lookup adds at least one round trip: a few milliseconds in the best case, seconds if your resolver is slow or the upstream recursor chokes. For a request path that touches ten services, that's ten lookups. Suddenly a 3ms tail latency becomes 30ms. Worse: your language's default DNS resolver might block the event loop. We fixed this by moving to a sidecar DNS proxy with aggressive negative caching, but that introduced its own statefulness. The real lesson? If your system handles thousands of requests per second and every microsecond matters, DNS becomes a bottleneck you can't afford. Use service meshes with their own connection management, or pre-warm socket pools via environment variables. DNS is a naming system, not a transport performance layer. Treating it like one will cost you.
When a service mesh or external load balancer makes more sense
Here's a pattern that sounds clever but usually backfires: using DNS round-robin for traffic distribution across zones. "We'll just point api.svc.cluster.local at three public IPs, and clients will spread requests evenly." They won't. Clients aggressively cache DNS; the distribution is never uniform; failover is non-deterministic. What usually breaks first is the client library that ignores TTL entirely. In those cases, an external load balancer (HAProxy, AWS NLB, Envoy) gives you health checks, draining, and circuit breakers that DNS simply lacks. Service meshes (Istio, Linkerd) solve the same problem at the application layer: they stitch identity, traffic splitting, and mTLS into one abstraction. DNS still resolves names inside the mesh, but it's not doing the heavy lifting for routing decisions. Round-robin DNS sounds fine until you realize you've built a fragile DNS-dependent system that fails silently under load. Don't.
'DNS gives you discovery. It does not give you reliability, traffic control, or identity. Three different problems, three different tools.'
— Paraphrased from a platform engineering lead after ripping out round-robin DNS
One final line of thought: if your environment is small (a dozen services, a single cluster) DNS is fine. The moment you grow past that, the seams blow out. That's not a defect in DNS. It's a sign you need richer infrastructure. Know where the boundary lives. Push past it without a plan, and you'll be rewriting networking logic six months from now. Not fun.
Open Questions / FAQ
An experienced operator says the trade-off is speed now versus rework later — most shops lose on rework.
Why does DNS sometimes resolve to old IPs even with low TTL?
You set TTL to 5 seconds. Containers restart. And yet your app still hits a dead IP for 30 seconds. I've debugged this exact mess three times this year. The problem isn't your TTL. It's the layers between your application and the DNS server. Most container runtimes cache DNS responses locally within the host's resolver (looking at you, systemd-resolved) or in a caching daemon inside the container (nscd on glibc-based images, for example). That cache ignores your TTL by default in some configurations. Worse: many Kubernetes clusters run a node-local DNS cache (NodeLocal DNSCache) that respects TTL only for positive responses, while negative ones get a long, arbitrary hold.
The fix usually means hunting down each caching layer. We fixed ours by forcing the container's /etc/resolv.conf to point at a dedicated CoreDNS instance with the cache plugin tuned to the second. Even then, check your kernel's conntrack table: stale NAT entries for old ClusterIP endpoints can route traffic into a black hole long after DNS returns fresh IPs. That hurts.
Can I use mDNS in containers?
Technically yes. Practically, don't for production. Multicast DNS (Avahi, Bonjour) works beautifully on your laptop: spin up a container, it announces itself, others discover it. The catch is container orchestration. Swarm, Kubernetes, Nomad: they all assume a flat network where IPs come and go without warning. mDNS relies on multicast traffic staying inside L2 segments. The moment your architecture spans multiple hosts, VLANs, or cloud VPCs, those mDNS packets silently vanish. I've seen teams burn two weeks trying to bridge multicast across AWS subnets using mDNS reflectors. It works until traffic spikes; then the reflector drops queries, containers vanish from the registry, and deploys fail.
Local dev? Fine. Staging?
“We learned the hard way: mDNS in containers above 10 nodes turns into a silent failure generator.”
— Senior platform engineer after a three-hour incident post-mortem
For reliable service discovery at scale, switch to DNS-based or key-value store registries (Consul, etcd). Your sleep schedule will thank you.
How do service meshes like Istio change the DNS game?
Istio doesn't replace DNS — it hijacks it. The sidecar Envoy proxy intercepts all outbound traffic on port 53, redirects it to the mesh's internal DNS resolver (often embedded in the control plane's istiod or a separate CoreDNS fork). That resolver knows service endpoints at pod granularity, not just ClusterIPs. So when your app queries reviews.default.svc.cluster.local, the mesh returns a virtual IP that represents the entire service mesh — with circuit breakers, retries, and locality-aware load balancing baked in. The trade-off is complexity. Envoy's DNS resolution introduces latency (typically 2–5ms extra per query) and a tight coupling to the mesh's certificate lifecycle.
What usually breaks first is mTLS handshake failures when DNS resolves a new pod before the sidecar has its certificate ready. The fix? Tweak pilot's discovery refresh interval to outpace your rollout speed. I'd argue that for teams under 50 microservices, you're better off with plain DNS and a good retry library. The mesh's DNS trickery shines when you need fine-grained traffic splitting (canary deployments, fault injection) across hundreds of services, not for basic "find another container" lookup. Choose accordingly.