Distributed Systems · Redis · Python · Backend · 9 min read · April 11, 2026

How I Built a Credentials Service That Handles 800 req/s With Zero Errors

Solving the thundering herd problem in production with Redis, Python, and a 50-line algorithm

It was 2 AM on a Tuesday when I realized my caching layer was going to betray me.

I'd just built a credential-serving microservice - a thin layer between our data pipeline fleet and AWS Secrets Manager. Simple job: fetch secrets, cache them, serve them fast. The kind of service you sketch on a napkin and assume will "just work."

Then I simulated 500 concurrent requests hitting a cold cache.

All 500 of them blew right past the empty cache and slammed into AWS simultaneously. The secrets manager throttled us. Every request failed. The cache never got populated. The next wave of requests saw the same empty cache and tried again. And again.

I was staring at a thundering herd - and my "simple" caching service was the stampede.

Nearly two years later, that same service has handled more than a hundred million requests, with peaks of 700-800 requests per second and a total error count of exactly zero.

What's a Thundering Herd?

If you've worked with caches, you've probably heard the term. I had too, but when I actually hit the problem I didn't fully understand the mechanics. I went back to basics.

I picked up Patrick Galbraith's Developing Web Applications with Apache, MySQL, memcached, and Perl - specifically p. 353 on the dogpile effect in memcached. Old book. The problem hasn't changed.

You have a cached value. It has a TTL. The TTL expires.

In the 50 milliseconds before anyone repopulates that cache, 40 requests arrive (that is what 800 req/s works out to). Every single one checks the cache, finds it empty, and independently decides: "I'll fetch it myself."

This is the thundering herd (also called cache stampede or dogpiling). The cache was supposed to protect your backend. Instead, every cache expiration becomes a coordinated assault on it.

At 800 req/s, this isn't a hypothetical. Every TTL expiration is a guaranteed outage.

The thing that reframed it for me: this isn't a caching problem. It's a coordination problem. You need exactly one request to fetch, and everyone else to wait.
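To make the mechanics concrete, here is a minimal simulation of an uncoordinated check-then-fetch cache. It uses an in-process dict rather than Redis, and the key name, the 50 ms origin delay, and the function names are illustrative assumptions, not part of the real service.

```python
import asyncio

async def simulate_naive_cache(concurrent_requests: int = 500) -> int:
    """Return how many origin fetches a cold, uncoordinated cache triggers."""
    cache = {}
    origin_calls = 0

    async def fetch_from_origin() -> str:
        nonlocal origin_calls
        origin_calls += 1
        await asyncio.sleep(0.05)  # pretend the origin takes 50 ms
        return "secret-value"

    async def handle_request() -> str:
        # Check-then-fetch with no coordination: every miss goes to the origin.
        if "db-password" not in cache:
            cache["db-password"] = await fetch_from_origin()
        return cache["db-password"]

    await asyncio.gather(*(handle_request() for _ in range(concurrent_requests)))
    return origin_calls

if __name__ == "__main__":
    print(asyncio.run(simulate_naive_cache()))
```

Every request sees the empty cache before the first fetch returns, so the number of origin calls equals the number of requests. Nothing in the cache itself tells the latecomers that a fetch is already in flight.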

The Naive Solutions (and Why They Don't Work)

Before I landed on the right approach, I considered the obvious alternatives:

"Just set a longer TTL" - Delays the problem, doesn't solve it. And stale credentials cause their own failures.

"Use a background refresh" - Better, but adds complexity. What if the background job crashes? What if it runs on a different instance than the one serving requests? You still need coordination.

"Add jitter to TTLs" - Helps spread out expirations, but doesn't prevent multiple requests from hitting the same miss window. At high throughput, even a jittered expiration will have dozens of concurrent misses.

I needed something that was:

Distributed - works across 4 load-balanced server instances
Atomic - no race conditions, no "check-then-set" gaps
Self-healing - recovers automatically if something crashes mid-fetch
Simple - small enough to reason about under pressure at 2 AM

The Fix

This is the whole thing:

import asyncio
from redis.asyncio import Redis

async def get_or_set_with_dogpile(
    redis: Redis,
    key: str,
    fetch_func,          # async callable that retrieves the real value
    cache_ttl=3600,      # cache lifetime: 1 hour
    lock_ttl=30          # lock lifetime: 30 seconds (deadlock safety net)
):
    # Fast path: return immediately if cached
    cached = await redis.get(key)
    if cached is not None:
        return cached

    # Atomic lock acquisition - only ONE caller wins
    lock_key = f"{key}:lock"
    got_lock = await redis.set(lock_key, "1", ex=lock_ttl, nx=True)

    if got_lock:
        # Winner: fetch the value and populate cache
        try:
            value = await fetch_func()
            await redis.set(key, value, ex=cache_ttl)
        finally:
            await redis.delete(lock_key)
        return value
    else:
        # Everyone else: wait for the winner to finish
        for _ in range(lock_ttl * 2):
            await asyncio.sleep(0.5)
            cached = await redis.get(key)
            if cached is not None:
                return cached

        raise TimeoutError(f"Cache population timed out for {key}")

Well under 50 lines. The two things that do the actual work are the SET NX and the EX 30. Everything else is just bookkeeping around them.
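If you want to watch the coordination happen without a Redis server, a rough way is to swap in an in-memory stand-in. The sketch below is an assumption-heavy demo: InMemoryRedis is a hypothetical stub that implements only get, set, and delete and ignores TTLs, and the function is repeated from the listing above so the block runs on its own.

```python
import asyncio

class InMemoryRedis:
    """Tiny stand-in for the three Redis calls the function uses. TTLs are ignored."""

    def __init__(self):
        self._data = {}

    async def get(self, key):
        await asyncio.sleep(0)  # yield control, like a network round trip would
        return self._data.get(key)

    async def set(self, key, value, ex=None, nx=False):
        await asyncio.sleep(0)
        if nx and key in self._data:
            return None  # NX failed: the key already exists
        self._data[key] = value
        return True

    async def delete(self, key):
        await asyncio.sleep(0)
        return int(self._data.pop(key, None) is not None)


async def get_or_set_with_dogpile(redis, key, fetch_func, cache_ttl=3600, lock_ttl=30):
    # Same logic as the listing above, repeated so this sketch is self-contained.
    cached = await redis.get(key)
    if cached is not None:
        return cached
    lock_key = f"{key}:lock"
    if await redis.set(lock_key, "1", ex=lock_ttl, nx=True):
        try:
            value = await fetch_func()
            await redis.set(key, value, ex=cache_ttl)
        finally:
            await redis.delete(lock_key)
        return value
    for _ in range(lock_ttl * 2):
        await asyncio.sleep(0.5)
        cached = await redis.get(key)
        if cached is not None:
            return cached
    raise TimeoutError(f"Cache population timed out for {key}")


async def run_cold_cache_demo(callers: int = 500):
    """Fire `callers` concurrent requests at an empty cache and count origin fetches."""
    redis = InMemoryRedis()
    fetch_count = 0

    async def fetch_secret():
        nonlocal fetch_count
        fetch_count += 1
        await asyncio.sleep(0.1)  # simulated slow origin
        return "secret-value"

    results = await asyncio.gather(
        *(get_or_set_with_dogpile(redis, "db-password", fetch_secret) for _ in range(callers))
    )
    return fetch_count, results


if __name__ == "__main__":
    count, results = asyncio.run(run_cold_cache_demo())
    print(f"{len(results)} callers served by {count} origin fetch(es)")
```

With the stub, one caller wins the lock and fetches, and everyone else picks up the cached value on their first poll. It does not prove anything about a real Redis deployment, but it is a cheap regression test for the control flow.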

Walking Through It

Step 1: The Fast Path

99.9% of requests hit this. The value is in Redis, we return it, done. Sub-5ms response time.

Step 2: The Atomic Lock - `SET key NX EX`

Both flags together are what make this actually work:

NX (Not eXists) - "Only set this key if it doesn't already exist." Redis executes this atomically. If 500 requests all try SET lock NX in the same millisecond, exactly one gets True. The other 499 get None (redis-py's falsy "not set" reply), which is why the plain if got_lock check works.

One command. One winner. No race conditions. Distributed leader election with a single Redis command.

EX 30 (Expire in 30 seconds) - "Auto-delete this lock after 30 seconds." This is the deadlock safety net. If the winner crashes, gets OOM-killed, or hangs on a network call, the lock doesn't persist forever. It evaporates, and someone else can take over.

Why not Redlock? Redlock is designed for scenarios where you need consensus across multiple independent Redis instances. For a single Redis instance, SET NX EX gives you mutual exclusion with zero additional complexity. With a managed HA setup, a failover can lose a lock that hadn't replicated yet, but the worst case here is one duplicate fetch, not corrupted data. Don't reach for a distributed consensus algorithm when an atomic command will do.

Step 3: Winner Fetches, Losers Wait

The winning request fetches from the cloud API (100-300ms typically), writes to Redis, and deletes the lock. The try/finally ensures the lock is always cleaned up, even if the fetch throws.
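One gap is worth knowing about: the finally block deletes whatever lock currently exists. If a fetch ever ran longer than lock_ttl, the lock would expire, another caller could acquire it, and the first caller's cleanup would then remove the second caller's lock. A common hardening, described in the Redis documentation for SET, is to store a random token in the lock and delete it only if the token still matches. The sketch below shows that idea; acquire_lock and release_lock are hypothetical helper names, and nothing here is claimed to be what the service actually runs.

```python
import uuid

# Lua runs atomically inside Redis: delete the lock only if we still own it.
RELEASE_LOCK_SCRIPT = """
if redis.call("get", KEYS[1]) == ARGV[1] then
    return redis.call("del", KEYS[1])
else
    return 0
end
"""

async def acquire_lock(redis, lock_key: str, lock_ttl: int = 30):
    """Return a unique owner token if the lock was acquired, else None."""
    token = uuid.uuid4().hex
    acquired = await redis.set(lock_key, token, ex=lock_ttl, nx=True)
    return token if acquired else None

async def release_lock(redis, lock_key: str, token: str) -> bool:
    """Release the lock only if it still holds our token."""
    deleted = await redis.eval(RELEASE_LOCK_SCRIPT, 1, lock_key, token)
    return deleted == 1
```

The winner would call acquire_lock instead of the bare SET and release_lock in the finally block. With a 30-second lock and a fetch that takes a few hundred milliseconds the race is unlikely, so treat this as optional insurance rather than a required fix.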

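The losing requests poll every 500ms. In practice, the fetch completes within 1-2 polls. The loop runs lock_ttl * 2 times, so with the default lock_ttl of 30 the timeout is about 30 seconds, the same as the lock lifetime. It acts as a circuit breaker - if we're still waiting after the lock itself would have expired, something is genuinely broken and we fail loudly rather than hang silently.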

The Full Architecture

[Figure: Data flow through the AWS credentials system]

A few things to note about this architecture:

Fully async, end to end. Every I/O call - Redis, HTTP, cloud API - is awaited. A single worker handles hundreds of concurrent connections without threads. Four workers behind a load balancer gives us the throughput we need.

Redis pulls double duty. It's both the cache and the lock coordinator. This is important - if the cache and the lock lived in different systems, you'd have consistency headaches. Same Redis instance means the lock and the cached value are always in sync.

Authentication middleware, not per-route. Every request (except healthcheck) passes through signature verification. There's no code path that can accidentally skip auth. For a service that literally serves secrets, this is non-negotiable.

Zero-downtime deploys. New containers spin up and start receiving traffic before old ones drain. Since the cache lives in Redis (not in-process), new instances immediately benefit from a warm cache. No cold-start stampede.

The Layers: Not All Secrets Are Equal

Different credentials have different lifetimes, and the cache should respect that:

| Layer | TTL | Reasoning |
| --- | --- | --- |
| Static secrets (passwords, API keys) | 1 hour | Rarely change. Hourly refresh balances freshness vs. load. |
| Cloud session tokens | 4 hours | Valid for 6h. We refresh 2h early as a safety buffer. |
| OAuth access tokens | `expires_in - 5 min` | Dynamic! Respect the token's actual lifetime, minus a safety margin. |
| User auth validation | 24 hours | Permissions change infrequently. Daily revalidation is sufficient. |
| Signing keys | 15 min (in-memory) | Security-critical. Rotate frequently, but keep in-memory for speed. |

The dogpile algorithm protects every single layer. Whether it's a static password or a dynamically-minted OAuth token, the same SET NX EX pattern prevents stampedes everywhere.
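As a rough illustration of how those lifetimes might be expressed in code, the sketch below turns the table into constants and a helper for the dynamic OAuth case. The constant names, the oauth_cache_ttl helper, and the choice to return 0 for very short-lived tokens are all assumptions made for the example.

```python
STATIC_SECRET_TTL = 60 * 60        # 1 hour
SESSION_TOKEN_TTL = 4 * 60 * 60    # 4 hours (the token itself is valid for 6)
USER_AUTH_TTL = 24 * 60 * 60       # 24 hours
OAUTH_SAFETY_MARGIN = 5 * 60       # stop using a token 5 minutes before it expires

def oauth_cache_ttl(expires_in: int, margin: int = OAUTH_SAFETY_MARGIN) -> int:
    """Cache an OAuth token for its reported lifetime minus a safety margin.

    Returns 0 when the token is too short-lived to be worth caching; the
    caller should skip the cache write in that case, because Redis rejects
    a non-positive expiry on SET.
    """
    return max(expires_in - margin, 0)

if __name__ == "__main__":
    print(oauth_cache_ttl(3600))  # a one-hour token is cached for 55 minutes
```

The returned value would be passed as cache_ttl to get_or_set_with_dogpile, so the OAuth layer gets the same stampede protection as the fixed-TTL layers.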

The Sneaky Second Herd: Token Refresh Storms

Here's something I almost got wrong.

Cloud session tokens expire. When they do, the cloud provider returns an ExpiredTokenException. The natural instinct is to catch the error, refresh the token, and retry.

But think about what happens at scale: 200 requests are all using the same cached session token. The token expires. All 200 requests get ExpiredTokenException. All 200 of them try to refresh the token.

A retry storm is just a thundering herd wearing different clothes.

The fix was wrapping the refresh logic in the same dogpile algorithm:

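import functools

from botocore.exceptions import ClientError

# `redis`, `credentials_cache_key`, and `refresh_credentials` are module-level
# objects defined elsewhere in the service.

def with_retry_on_expired_token(func):
    @functools.wraps(func)  # keep the wrapped function's name and docstring
    async def wrapper(*args, **kwargs):
        try:
            return await func(*args, **kwargs)
        except ClientError as e:
            if e.response["Error"]["Code"] in (
                "ExpiredToken", "ExpiredTokenException"
            ):
                # Invalidate the stale token
                await redis.delete(credentials_cache_key)

                # Refresh - dogpile-protected, so only ONE request
                # actually hits the token endpoint
                await refresh_credentials()

                # Retry the original call once with fresh credentials
                return await func(*args, **kwargs)
            raise  # any other AWS error propagates unchanged
    return wrapper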

The refresh_credentials() call goes through get_or_set_with_dogpile. So even if 200 requests discover the expired token simultaneously, exactly one of them refreshes it. The rest wait.

Rule of thumb: Any time you write retry logic, ask yourself: "What happens if 500 requests all retry at the same time?" If the answer is "bad things," you have a hidden thundering herd.

The Third Herd: The Near-Expiry Spike I Couldn't Pin Down

There was a third problem. It was the subtlest, and for a while it was the most maddening.

The service had been stable for months. Zero errors. Then I started noticing a pattern in the latency graphs: a periodic spike in response time, always on the same keys, always lasting about one second. And exactly 5 consecutive requests for the same cache key would fail before everything recovered.

Not 3. Not 10. Always 5.

The retries were succeeding - our retry logic was eating the failures cleanly - so there were no errors in the error rate dashboard. The system looked healthy. But the latency told a different story, and I couldn't figure out why 5 requests were failing in the first place. The cache was warm, the lock algorithm was running, nothing had expired. Everything should have worked.

It took me longer than I'd like to admit to pin down what was actually happening.

The answer was in how Redis handles TTL near expiry.

When a cached value is close to its TTL - within the last fraction of a second - some Redis client configurations will treat it as effectively expired. Multiple requests arrive in that near-expiry window, all check the cache, and a small number of them race against the actual expiry moment. If the key expires between the check and the read, those requests see a miss. The dogpile lock fires, one request wins, the rest wait - but in the brief window before the lock is set, 5 requests had already checked the cache, found nothing, and queued up their own fetch attempts.

The retry logic was the red herring. The retries were working - they were resolving correctly once the new value was in cache. But because the retries masked the failures as eventual successes, the root cause was invisible in the error dashboard. I was only seeing it in the latency percentiles.

The fix was two-pronged:

1. Proactive TTL refresh. Instead of waiting for the key to expire, refresh it when the remaining TTL drops below a threshold - say, 10% of the original TTL:

import asyncio

# Keep references so pending refresh tasks aren't garbage-collected mid-flight
_background_tasks = set()

async def get_with_proactive_refresh(
    redis, key, fetch_func, cache_ttl=3600, refresh_threshold=0.1
):
    cached = await redis.get(key)

    if cached is not None:
        # TTL returns -1 (no expiry) or -2 (key gone); only a real, low TTL
        # in the last 10% of the lifetime should trigger a background refresh
        remaining_ttl = await redis.ttl(key)
        if 0 <= remaining_ttl < cache_ttl * refresh_threshold:
            task = asyncio.create_task(
                refresh_in_background(redis, key, fetch_func, cache_ttl)
            )
            _background_tasks.add(task)
            task.add_done_callback(_background_tasks.discard)
        return cached

    # Normal dogpile logic for a cold cache
    return await get_or_set_with_dogpile(redis, key, fetch_func, cache_ttl)

async def refresh_in_background(redis, key, fetch_func, cache_ttl):
    """Fire-and-forget refresh. The key is still serving while this runs."""
    lock_key = f"{key}:refresh_lock"
    got_lock = await redis.set(lock_key, "1", ex=30, nx=True)
    if not got_lock:
        return  # Another instance is already refreshing
    try:
        value = await fetch_func()
        await redis.set(key, value, ex=cache_ttl)
    finally:
        await redis.delete(lock_key)

The key is still serving stale-but-valid data while the background refresh runs. By the time TTL actually hits zero, the new value is already in cache. The expiry window that was causing the spikes ceases to exist.

2. Separate retried requests from first-attempt successes in the metrics. This was the harder lesson. The retries were doing exactly what they were supposed to do - recovering from transient failures. But they were hiding a systematic problem by making it look like random noise. Once I split "requests that succeeded on first attempt" from "requests that required a retry", the pattern became immediately obvious.

The near-expiry spike is a ghost. It only exists in the latency data. If your retry system is swallowing errors, you will never see it in your error rate. Instrument your retry path separately - count every retry as a signal, not just the ones that ultimately fail.
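One way to instrument the retry path is sketched below. It is a minimal illustration, not the service's actual metrics code: the Counter stands in for whatever metrics client you use, and call_with_retry_metrics and the counter names are hypothetical.

```python
import asyncio
from collections import Counter

metrics = Counter()  # stand-in for a real metrics client

async def call_with_retry_metrics(func, max_attempts: int = 3):
    """Call func, retrying on exception, and record how the call resolved."""
    for attempt in range(1, max_attempts + 1):
        try:
            result = await func()
        except Exception:
            metrics["attempt_failed"] += 1  # every failed attempt is a signal
            if attempt == max_attempts:
                metrics["gave_up"] += 1
                raise
        else:
            outcome = "success_first_attempt" if attempt == 1 else "success_after_retry"
            metrics[outcome] += 1
            return result

if __name__ == "__main__":
    async def always_ok():
        return "ok"

    asyncio.run(call_with_retry_metrics(always_ok))
    print(dict(metrics))
```

A dashboard built on success_after_retry and attempt_failed would have shown the recurring five-request pattern directly, even while the overall error rate stayed flat.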

This is the one that cost me the most hours to find. The thundering herd is obvious when your error rate spikes. It's much harder to spot when your retry logic is quietly absorbing the damage.

Security: Because This Service Holds All the Keys

A credential service is a crown jewel target. If it's compromised, everything it protects is compromised. Security can't be an afterthought:

Cryptographic signature verification on every request. Callers present a digitally signed authentication ticket. The service verifies the DSA signature against the issuer's public key. No valid signature = instant 401. Not a bearer token, not an API key - a cryptographic proof of identity.

Explicit allowlisting. Valid signature isn't enough. The caller must also be on a hardcoded allowlist. This prevents lateral movement - if Service A is compromised, it can't use its own valid credentials to fetch Service B's secrets.

TLS everywhere. Client-to-service: HTTPS. Service-to-Redis: TLS. Redis-to-disk: encrypted. Secrets never travel in plaintext.

No long-lived secrets in environment variables. The service authenticates to AWS using short-lived session tokens (6 hours) that are automatically rotated. The session token itself is cached and refreshed through the dogpile algorithm. Turtles all the way down.

Middleware enforcement. Auth is middleware, not per-route. There's no @skip_auth decorator, no "I'll add auth later" TODO. If a request reaches a route handler, it's already authenticated. Period.
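To show what "auth as middleware" can look like, here is a framework-neutral sketch written against the plain ASGI interface. Several things are assumptions made for the example: the x-auth-ticket header name, the caller names in the allowlist, the 403 status for a valid but non-allowlisted caller, and the verify_ticket callable, which stands in for the real signature check and returns a caller id or None.

```python
from typing import Callable, Optional

ALLOWED_CALLERS = frozenset({"pipeline-worker", "report-builder"})  # hypothetical names

class AuthMiddleware:
    """Pure-ASGI middleware: rejects unauthenticated requests before any route runs."""

    def __init__(self, app, verify_ticket: Callable[[bytes], Optional[str]],
                 open_paths=("/healthcheck",)):
        self.app = app
        self.verify_ticket = verify_ticket  # returns caller id, or None if invalid
        self.open_paths = open_paths

    async def __call__(self, scope, receive, send):
        # Only startup/shutdown events and the explicit open paths skip auth.
        if scope["type"] == "lifespan" or scope.get("path") in self.open_paths:
            await self.app(scope, receive, send)
            return
        headers = dict(scope["headers"])  # ASGI headers are (bytes, bytes) pairs
        caller = self.verify_ticket(headers.get(b"x-auth-ticket", b""))
        if caller is None:
            await self._reject(send, 401)  # no valid signature
        elif caller not in ALLOWED_CALLERS:
            await self._reject(send, 403)  # valid identity, but not allowlisted
        else:
            await self.app(scope, receive, send)

    @staticmethod
    async def _reject(send, status: int):
        await send({"type": "http.response.start", "status": status,
                    "headers": [(b"content-length", b"0")]})
        await send({"type": "http.response.body", "body": b""})
```

Because the wrapper sits in front of the whole application, a new route cannot forget to opt in. The sketch assumes the service speaks HTTP only.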

The Results

After nearly two years of production operation:

[Figure: Request throughput metrics - 128M requests, 800 r/s peak, 0 errors]

The number that matters most: zero errors.

Not "five nines." Not "an occasional timeout we retry away." Zero. None. When every cache miss is handled by exactly one fetch, and every token refresh is deduplicated before it happens, the failure modes simply vanish.

Every request in the live logs returns 200 OK, distributed evenly across all four workers, handling traffic from dozens of different services simultaneously.

Load Testing: Trust, But Verify

I didn't just deploy and hope. Before going to production, I ran a custom load test:

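from concurrent.futures import ThreadPoolExecutor
from time import sleep

# 50 concurrent threads × 500 requests per round × 36 rounds, 5 min apart = 3 hours
# authenticated_request() is defined elsewhere: it sends one signed request
# and returns the response
with ThreadPoolExecutor(max_workers=50) as pool:
    for round_number in range(36):
        futures = [pool.submit(authenticated_request) for _ in range(500)]
        results = [f.result() for f in futures]
        assert all(r.status_code == 200 for r in results)
        sleep(300)  # 5 min cooldown between rounds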

Every. Single. Request. Returned. 200.

The test specifically targets the scenarios that kill naive caching:

Cold start - empty cache, 500 requests queued at once with 50 in flight at any moment
TTL expiration - cache expires mid-round, triggering dogpile protection
Sustained load - 3 hours of repeated bursts to catch memory leaks and connection exhaustion

When Should You Use This Pattern?

Good fit:

Many consumers hitting a single rate-limited origin (cloud APIs, databases, external services)
Read-heavy data that doesn't change every request
Distributed setup (multiple server instances behind a load balancer)

Probably overkill:

Single server process (use threading.Lock or asyncio.Lock instead)
Origin has no rate limits and sub-10ms latency
Data changes on every request

The Short Version

If I had to distill this down: it's a coordination problem, not a caching problem. SET NX EX gives you coordination in a single Redis command. The rest is just making sure you don't accidentally recreate the same problem in your retry paths - which I did, twice, before I stopped being clever about it.

The service has been running for nearly two years. I haven't thought about it much since, which is the best thing I can say about any piece of infrastructure.

Ketan Maurya - Backend Engineer specialising in FastAPI, AWS, and distributed systems.

GitHub · LinkedIn · ketanmaurya.com

If you've solved this differently - or hit a version of the thundering herd I haven't described - I'm curious. Reach out.