Intermittent 522 errors across multiple Cloudflare-fronted sites. A perfectly healthy-looking server. A wrong diagnosis that felt right. And an invisible IPv6 routing failure that was the real killer all along.
The Symptom
My web server was slow. Not a little slow — Cloudflare was returning 522 Connection Timed Out errors across multiple sites. Pages that did manage to load would then hang trying to fetch static assets like CSS and JavaScript. A fresh reboot didn't fix it. The problem came back within minutes.
The Misleading First Look
Every sysadmin's first instinct is to check the usual suspects:
| Resource | Status |
|----------|--------|
| CPU | 12 cores, 83% idle |
| RAM | 32GB total, 25GB free |
| Disk | 26% used, 0% I/O wait |
| Load Average | 0.07 |
| DNS | 38ms resolution |
Everything looked perfectly healthy. The server was practically asleep. So why were users getting timeout errors?
The Setup
The server runs a fairly common stack:
- Apache 2.4 (event MPM) serving ~250 virtual hosts
- PHP-FPM (multiple versions: 7.4, 8.2, 8.3, 8.5)
- MySQL 8.0, Redis, Memcached
- BT Panel (aaPanel) for server management
- Cloudflare in front of all sites
A quick curl to localhost confirmed the web server itself was fast: sub-millisecond responses over HTTP, about 5ms over HTTPS. The bottleneck was somewhere between Cloudflare's edge and the origin, not inside the server.
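That localhost check can be reproduced with curl's timing write-out variables. A minimal sketch, assuming the vhosts answer on the loopback interface; the hostname is a placeholder:

```shell
# Time a vhost directly against the origin, bypassing Cloudflare and DNS.
# --resolve pins the hostname to 127.0.0.1 so the real site config is exercised.
probe_origin() {
  local host="$1" scheme="${2:-http}"
  curl -sk -o /dev/null \
       --resolve "${host}:80:127.0.0.1" \
       --resolve "${host}:443:127.0.0.1" \
       -w 'connect=%{time_connect}s tls=%{time_appconnect}s total=%{time_total}s\n' \
       "${scheme}://${host}/"
}
# Usage: probe_origin example.com https
```

If the totals here are in single-digit milliseconds while edge requests time out, the problem is on the network path, not in the web server.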
The Wrong Diagnosis: Dead Reverse Proxies
Digging into the Apache error logs, I found a flood of proxy errors and "long lost child came home" warnings — classic signs of thread starvation. Across ~250 sites, about 20 had reverse proxy rules pointing to backends that no longer existed — dead local ports and unreachable external servers.
Combined with a 600-second Apache timeout and KeepAlive disabled, these dead backends were consuming worker threads and starving the entire server. I cleaned them all up, reduced the timeout to 100 seconds, enabled KeepAlive, and felt confident I'd found the root cause.
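The cleanup amounts to a couple of directives. A sketch of the changed settings, assuming a stock Apache 2.4 layout; the file path is an assumption, the values come from the changes described above:

```apache
# /etc/apache2/conf-available/tuning.conf (path is an assumption)
Timeout 100     # was 600: dead backends could pin a worker for 10 minutes
KeepAlive On    # was Off: lets Cloudflare's edge reuse connections
```

With the event MPM, each stalled proxy connection still occupies a worker until `Timeout` expires, which is why a 600-second value plus ~20 dead backends can starve the pool.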
I was wrong.
The 522 errors kept coming back. The intermittent failures persisted. The cleanup helped Apache's overall health, but it wasn't why requests were timing out at the Cloudflare edge.
The Real Investigation
With the proxy red herring out of the way, I ran a systematic availability test across all four Cloudflare-fronted domains sharing this origin server (names redacted): `c.**`, `c.**`, `j*.ee`, and `c*u.com`.
Baseline: Something Is Very Wrong
A 60-second burst test with the OS choosing its preferred IP version:
| Domain | Total | OK | Fail | Success% | Avg (ms) |
|--------|-------|----|------|----------|----------|
| `c.**` | 9 | 4 | 5 | 44.4% | 70 |
| `c.**` | 13 | 8 | 5 | 61.5% | 59 |
| `j*.ee` | 11 | 6 | 5 | 54.5% | 201 |
| `c*u.com` | 8 | 2 | 6 | 25.0% | 53 |
Over half the requests were failing. Every failure was status 000: the TCP connection never established within curl's 10-second timeout, so no HTTP response came back at all. When connections did succeed, responses were fast (50–100ms). And the failures came in periodic bursts, not randomly.
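The burst test itself is easy to reproduce. A sketch, assuming a POSIX shell and curl; the 60-second window and 10-second timeout mirror the test described above, and the URL is a placeholder:

```shell
# Hammer one URL for N seconds, counting successes vs failures.
# curl reports http_code 000 when no HTTP response was ever received.
burst_test() {
  local url="$1" secs="${2:-60}" ok=0 fail=0
  local deadline=$(( $(date +%s) + secs ))
  while [ "$(date +%s)" -lt "$deadline" ]; do
    code=$(curl -s -o /dev/null -m 10 -w '%{http_code}' "$url")
    case "$code" in
      2??|3??) ok=$((ok + 1)) ;;
      *)       fail=$((fail + 1)) ;;  # includes 000 connection timeouts
    esac
  done
  echo "$url ok=$ok fail=$fail"
}
# Usage: burst_test https://example.com 60
```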
The Clue: Which IP Version?
I checked what IP version the server was actually using for outbound connections:
| Target | Outgoing IP | Version |
|--------|-------------|---------|
| 1.1.1.1 (forced v4) | `43.***.***.98` | IPv4 |
| [2606:4700::1111] | `2400:****:****:****::2020` | IPv6 |
| `c.**` (actual) | `2400:****:****:****::2020` | IPv6 ← OS preferred v6 |
The OS was routing all Cloudflare traffic over IPv6 by default.
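That check can be scripted with curl's `%{remote_ip}` write-out variable, which reports the peer address the OS actually connected to. A sketch; hostnames are placeholders:

```shell
# Report which address family the OS chose for a given host.
outbound_family() {
  local ip
  ip=$(curl -s -o /dev/null -m 10 -w '%{remote_ip}' "https://$1/") || ip="unreachable"
  case "$ip" in
    unreachable|"") echo "$1 -> no connection" ;;
    *:*)            echo "$1 -> $ip (IPv6)" ;;  # colon means an IPv6 literal
    *)              echo "$1 -> $ip (IPv4)" ;;
  esac
}
# Usage: outbound_family one.one.one.one
```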
IPv4 vs IPv6: Head-to-Head
I ran parallel tests forcing -4 and -6 on curl for each domain:
| Domain | IPv4 Rate | IPv4 OK/Total | IPv6 Rate | IPv6 OK/Total |
|--------|-----------|---------------|-----------|---------------|
| `c.**` | 55% | 6/11 | 71% | 12/17 |
| `c.**` | 40% | 4/10 | **0%** | **0/59** |
| `j*.ee` | 25% | 2/8 | 44% | 4/9 |
| `c*u.com` | 55% | 6/11 | 64% | 9/14 |
`c.**` was **100% broken on IPv6**: all 59 attempts failed. The other domains showed failures on both protocols, but IPv6 was clearly the unstable path. The IPv4 failures were likely collateral damage from the same network-level routing trouble on shared infrastructure.
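The forced-protocol comparison boils down to running the same probe twice with curl's `-4` and `-6` flags. A sketch; a real run would loop and tally success rates, and the domain is a placeholder:

```shell
# Probe one domain over IPv4 and IPv6 explicitly and print both results.
compare_families() {
  local flag code
  for flag in -4 -6; do
    code=$(curl -s -o /dev/null -m 10 "$flag" -w '%{http_code}' "https://$1/")
    echo "$1 $flag -> $code"
  done
}
# Usage: compare_families example.com   (000 means the connection never opened)
```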
The Smoking Gun: Disable IPv6
I disabled IPv6 at the kernel level:
```shell
sysctl -w net.ipv6.conf.all.disable_ipv6=1
sysctl -w net.ipv6.conf.default.disable_ipv6=1
```
Then ran the same test:
| Domain | Total | OK | Fail | Success% | Avg (ms) | P50 (ms) | P95 (ms) |
|--------|-------|----|------|----------|----------|----------|----------|
| `c.**` | 56 | 56 | 0 | **100%** | 55 | 52 | 62 |
| `c.**` | 56 | 56 | 0 | **100%** | 64 | 59 | 69 |
| `j*.ee` | 55 | 55 | 0 | 100% | 78 | 73 | 106 |
| `c*u.com` | 56 | 56 | 0 | 100% | 51 | 46 | 58 |
223 out of 223 requests successful. Zero timeouts. Zero failures.
Manual browsing confirmed it — every site loaded instantly and consistently.
The Root Cause
An intermittent IPv6 routing failure between the server (`2400:****:****:****::2020`) and Cloudflare's HKG (Hong Kong) edge.
The symptoms:

- TCP connections over IPv6 would periodically fail to establish
- Failures manifested as complete 10-second connection timeouts (status 000), not HTTP errors
- Failures occurred in periodic bursts, not randomly
- All Cloudflare-fronted domains were affected
- `c.**` was additionally 100% unreachable over IPv6 (likely missing or misconfigured AAAA records)
The server OS (Linux) preferred IPv6 when both A and AAAA records were available, causing the majority of outbound traffic to route over the broken IPv6 path.
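That preference comes from getaddrinfo's default address selection (RFC 6724), which ranks IPv6 above IPv4. The ordering can be inspected without opening a single connection; a sketch, with the hostname as a placeholder:

```shell
# Print the unique addresses getaddrinfo returns, in preference order.
# If an IPv6 address is listed first, outbound connections try IPv6 first.
addr_order() {
  getent ahosts "$1" | awk '{print $1}' | uniq
}
# Usage: addr_order example.com
```

A less drastic alternative to disabling IPv6 outright is lowering IPv6 precedence in /etc/gai.conf, though that only influences programs that resolve through getaddrinfo.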
Before vs After
| Domain | Before (IPv6 active) | After (IPv4 only) |
|--------|----------------------|--------------------|
| `c.**` | 44–85% | **100%** |
| `c.**` | 0–62% | **100%** |
| `j*.ee` | 25–55% | 100% |
| `c*u.com` | 25–62% | 100% |
Response times:

- Before: 50–310ms with 10,000ms timeout spikes
- After: 42–170ms, median 46–73ms, no spikes
The Fix
Immediate (applied at 19:00 UTC):
```shell
sysctl -w net.ipv6.conf.all.disable_ipv6=1
sysctl -w net.ipv6.conf.default.disable_ipv6=1
```
Permanent (applied at 19:10 UTC):
Added to `/etc/sysctl.conf`:
```
net.ipv6.conf.all.disable_ipv6=1
net.ipv6.conf.default.disable_ipv6=1
```
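After editing `/etc/sysctl.conf`, `sysctl -p` reloads it without a reboot. A small read-only check to confirm the kernel actually picked the setting up (safe to run anywhere; on non-Linux systems it just reports unknown):

```shell
# Read the live value straight from /proc; 1 means IPv6 is disabled.
verify_ipv6_disabled() {
  local v
  v=$(cat /proc/sys/net/ipv6/conf/all/disable_ipv6 2>/dev/null || echo unknown)
  echo "net.ipv6.conf.all.disable_ipv6=$v"
}
# Usage: verify_ipv6_disabled
```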
Lessons Learned
1. IPv6 failures are invisible. Nothing in CPU, RAM, disk, or load average will tell you that your IPv6 path is broken. The server looks perfectly healthy while half your requests silently die.
2. Check your IP version. When debugging connectivity, always verify which protocol the OS is actually using. A server with both A and AAAA records might be routing everything over a broken IPv6 path without you knowing.
3. Obvious problems can be red herrings. The dead reverse proxy backends were real, visible in logs, and satisfying to fix. But they weren't the actual cause. Don't stop investigating just because you found a problem — make sure it's the problem.
4. Test systematically. Parallel IPv4 vs IPv6 tests with forced protocol selection made the root cause undeniable. Without that controlled comparison, I might have kept chasing Apache configuration ghosts.
5. Report upstream. IPv6 routing issues on the hosting provider's network should be reported — the path to Cloudflare HKG was unreliable and may affect other customers.
What About the Dead Proxies?
The reverse proxy cleanup from my initial (wrong) diagnosis wasn't wasted effort. Dead backends pointing to unreachable external servers with a 600-second timeout were consuming Apache threads unnecessarily. Fixing that, reducing the timeout to 100 seconds, and enabling KeepAlive were all legitimate improvements to server health.
But those were maintenance issues, not the outage cause. The outage was IPv6.
Final Thought
The irony is that I wrote an entire detailed blog post about dead reverse proxies being the root cause, complete with diagnostic commands and architectural explanations. It was thorough, well-reasoned, and wrong. The real culprit was two sysctl lines away from being revealed.
Sometimes the hardest problems aren't about what you can see in the logs. They're about what the logs never show you.