Bug#605870: pdnsd: crashes on heavy load
I can reproduce this. pdnsd isn't actually crashing, but it does effectively stop serving requests. If I set the server to 'down', pdnsd quickly starts responding again. Setting the server back to 'up' stops DNS resolution again.
If the queue time exceeds the client timeout limit, the client can no longer resolve names, even if pdnsd is responding. If upstream silently drops requests, queue time becomes ~2 min minimum. Some routers also add a 2 min delay to all DNS requests when loaded. Client timeout seems to be around 5 seconds for most programs, with 3 attempts made, so 15 seconds max.
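To put rough numbers on this, here is a small Python sketch of the arithmetic. The 5 second timeout, the 3 attempts and the 120 second stall are the estimates given above, not measurements, and the function name is made up for illustration.

```python
# Rough arithmetic only; the numbers are the estimates from this report.
CLIENT_TIMEOUT_S = 5      # typical per-attempt client timeout (estimate)
CLIENT_ATTEMPTS = 3       # typical number of attempts (estimate)
UPSTREAM_STALL_S = 120    # queue time when upstream silently drops requests


def client_gives_up_after(timeout_s: int, attempts: int) -> int:
    """Total time a client waits before reporting a resolution failure."""
    return timeout_s * attempts


patience = client_gives_up_after(CLIENT_TIMEOUT_S, CLIENT_ATTEMPTS)
print(patience)                      # 15
print(UPSTREAM_STALL_S > patience)   # True: the reply arrives after the client gave up
```

So any request that waits behind a stalled upstream query for more than about 15 seconds is lost from the client's point of view, even if pdnsd answers it eventually.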
There are 3 factors involved.
1) All requests must pass the queue before the cache is checked. This is a bug, and the main problem.
2) pdnsd cannot have a server timeout longer than its client timeout, but most clients only listen for about 5 seconds, while servers can take > 120s to reply under load. Some routers are especially bad for this.
3) pdnsd does not merge duplicate requests. A bug in and of itself, and it compounds the problem by making 1 and 2 worse.
What needs to be done:
1) If a fresh cache entry is available, pdnsd should respond immediately, even if there is a queue, and even if the queue is full. That is, the queue should only be used for cache misses and stale entries.
2) It should be possible to set the client timeout shorter than the server timeout. In particular, pdnsd should be able to send a stale record, or even SERVFAIL while still listening for a reply from upstream, or even if the request is in queue. That is, the client timeout clock should start when a request is received, but the server timeout should not start until a request is sent. There are 3 situations that could have different timeouts:
2a) On a cache miss, pdnsd could time out the client and reply SERVFAIL before the server connection times out. If enabled, this should apply even while the request is still in the queue. In that case the queued request should be demoted to low priority, or dropped. E.g. the client timeout could be 110s, while the server timeout is 150s.
2b) On a stale cache entry with no queue, it should be possible to set the client timeout shorter than the server timeout. That is, pdnsd would reply from stale cache while still listening for the server response. E.g. reply from stale cache after 4 sec, but still listen 150 sec for the server to respond.
2c) On a stale cache entry and a queue, it should be possible to tell pdnsd to reply from cache instantly, and queue the name at low priority, or drop it.
3) If a request comes in for a name that is already active, or in the queue, pdnsd should only send one request upstream, and then give the reply to all requesters. pdnsd should handle its own retries, as long as any client is waiting for the name. If all clients timed out while a request is in queue, it should be handled at lower priority, or even dropped.
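Points 1, 2c and 3 above can be sketched as a toy model in Python. This is only an illustration of the lookup order I am asking for, not pdnsd's actual code. The names `Resolver`, `CacheEntry`, `handle` and `upstream_reply` are hypothetical, and the timeouts from point 2 are left out to keep it short.

```python
from dataclasses import dataclass, field


@dataclass
class CacheEntry:
    answer: str
    expires_at: float  # absolute time after which the entry is stale


@dataclass
class Resolver:
    """Toy model of the proposed lookup order; not pdnsd's real code."""
    cache: dict = field(default_factory=dict)          # name -> CacheEntry
    waiting: dict = field(default_factory=dict)        # name -> waiting client ids
    upstream_sent: list = field(default_factory=list)  # names sent upstream

    def handle(self, client, name, now):
        """Return an immediate answer, or None if the client must wait."""
        entry = self.cache.get(name)
        if entry is not None and now < entry.expires_at:
            return entry.answer        # 1) fresh hit: never touches the queue
        if entry is not None:
            self._enqueue(None, name)  # 2c) stale hit: refresh in the background
            return entry.answer        #     and answer from cache right away
        self._enqueue(client, name)    # miss: the client waits for upstream
        return None

    def _enqueue(self, client, name):
        if name not in self.waiting:   # 3) only one upstream query per name
            self.waiting[name] = []
            self.upstream_sent.append(name)
        if client is not None:
            self.waiting[name].append(client)

    def upstream_reply(self, name, answer, ttl, now):
        """Cache the reply and return the clients that should receive it."""
        self.cache[name] = CacheEntry(answer, now + ttl)
        return self.waiting.pop(name, [])


r = Resolver()
r.cache["fresh.example"] = CacheEntry("192.0.2.1", expires_at=100.0)
r.cache["stale.example"] = CacheEntry("192.0.2.2", expires_at=5.0)

print(r.handle("a", "fresh.example", now=10.0))  # 192.0.2.1
print(r.handle("b", "stale.example", now=10.0))  # 192.0.2.2
print(r.handle("c", "miss.example", now=10.0))   # None
print(r.handle("d", "miss.example", now=11.0))   # None
print(r.upstream_sent)                           # ['stale.example', 'miss.example']
print(r.upstream_reply("miss.example", "192.0.2.3", ttl=3600, now=12.0))  # ['c', 'd']
```

Clients "c" and "d" ask for the same name, but only one query goes upstream and both get the reply. The fresh and stale lookups are answered without waiting on anything in the queue.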
My ISP uses some really flaky DNS servers that often silently drop ~50% of requests. Some servers are better than others.
I'm also using a Linksys router that suffers from DNS loading issues. When overloaded, its response time exceeds 120 seconds, and it starts silently dropping requests too. This happens every time upstream acts up, as the router starts queuing.
I have pdnsd on each machine on my network. They all exhibit this problem.
-- System Information:
Debian Release: jessie/sid
APT prefers testing
APT policy: (990, 'testing'), (500, 'unstable'), (1, 'experimental')
Architecture: amd64 (x86_64)
Foreign Architectures: i386
Kernel: Linux 3.2.0-4-amd64 (SMP w/2 CPU cores)
Locale: LANG=en_CA.UTF-8, LC_CTYPE=en_CA.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash
Versions of packages pdnsd depends on:
ii adduser 3.113+nmu3
ii debconf [debconf-2.0] 1.5.50
ii libc6 2.17-3
Versions of packages pdnsd recommends:
pn resolvconf <none>
pdnsd suggests no packages.
server_ip = 127.0.0.1; // Use eth0 here if you want to allow other
// machines on your network to query pdnsd.
status_ctl = on;
min_ttl=3600; // Retain cached entries at least 1 hour.
max_ttl=2419200; // One month.
timeout=10; // Global timeout option (10 seconds).