How to design, implement, and operate caches that make systems fast, cheap, and reliable.
Caching is one of the highest-leverage tools in systems engineering: it can cut tail latency by orders of magnitude, reduce database load, and shrink cloud bills. But poorly designed caches cause subtle correctness bugs, “thundering herd” outages, and stale-data nightmares.
This guide goes deep on what to cache, where to cache, how to keep it fresh, and how to measure it, with diagrams and runnable code.
1) Where Caches Live (Multi-Level Caching) #
flowchart LR
A["Client\n(Browser/App)"] --> B["CDN/Edge Cache\n(L0)"]
B --> C["Reverse Proxy Cache\n(e.g., NGINX)"]
C --> D["App Server L1\n(in-process cache)"]
D --> E["Distributed Cache L2\n(e.g., Redis/Memcached)"]
E --> F["Primary Datastore\n(SQL/NoSQL/Search)"]
- L0 Edge/CDN: Static assets, static JSON, and cacheable API GET responses (driven by cache headers).
- Reverse proxy: Response caching, compression, ETag handling.
- L1 in-process: Nanosecond access; great for per-instance hot keys. Loses data on restart.
- L2 distributed: Cross-instance shared cache; survives app restarts.
- DB: Source of truth.
Rule of thumb: Promote the most frequently accessed & stable data up the pyramid.
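To make the L1/L2 split concrete, here is a minimal sketch of a two-level read path, assuming a plain dict as the in-process L1 and a local Redis as the shared L2 (the fetch_from_db loader is a placeholder for your real query):

import json
import redis

r = redis.Redis(decode_responses=True)
l1 = {}  # in-process L1: per instance, lost on restart

def read_user(key: str, fetch_from_db) -> dict:
    # L1: no network hop
    if key in l1:
        return l1[key]
    # L2: shared across instances, survives app restarts
    cached = r.get(key)
    if cached is not None:
        value = json.loads(cached)
        l1[key] = value
        return value
    # Miss everywhere: hit the source of truth and populate both tiers
    value = fetch_from_db(key)
    r.setex(key, 300, json.dumps(value))  # L2: longer TTL
    l1[key] = value                       # L1: keep small; consider a shorter TTL or max size
    return value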
2) Core Caching Patterns #
2.1 Cache-Aside (Lazy Loading)
The application fetches from cache first; on miss, read from DB and populate the cache.
sequenceDiagram
participant App
participant Cache
participant DB
App->>Cache: GET key
alt Hit
Cache-->>App: value
else Miss
App->>DB: SELECT ...
DB-->>App: value
App->>Cache: SET key=value TTL
Cache-->>App: OK
end
- Pros: Simple, minimal coupling to cache.
- Cons: First request pays DB cost; risk of stampede on popular keys.
2.2 Read-Through
The cache itself knows how to load from DB on misses (via loader/adapter).
sequenceDiagram
participant App
participant Cache
participant DB
App->>Cache: GET key
alt Hit
Cache-->>App: value
else Miss
Cache->>DB: Load data
DB-->>Cache: value
Cache->>Cache: Store value
Cache-->>App: value
end
- Pros: Centralized logic; fewer stampedes.
- Cons: Cache becomes smarter (more moving parts).
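A hand-rolled sketch of the read-through idea (not any particular library's API): the cache object owns a loader, so callers never talk to the DB directly; load_user is a stand-in for your real data access.

import time
from typing import Callable, Optional

class ReadThroughCache:
    """Cache that knows how to load values itself on a miss."""
    def __init__(self, loader: Callable[[str], Optional[dict]], ttl: float = 300.0):
        self._loader = loader
        self._ttl = ttl
        self._store = {}  # key -> (expires_at, value)

    def get(self, key: str) -> Optional[dict]:
        entry = self._store.get(key)
        if entry and entry[0] > time.time():
            return entry[1]                 # hit
        value = self._loader(key)           # miss: the cache loads on the caller's behalf
        if value is not None:
            self._store[key] = (time.time() + self._ttl, value)
        return value

def load_user(key: str) -> Optional[dict]:
    # Placeholder for a real SELECT
    return {"id": key, "name": "Ada"}

users = ReadThroughCache(load_user, ttl=60)
print(users.get("42"))  # first call loads from the "DB", second call is a hit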
2.3 Write-Through
Writes go to cache and DB synchronously.
sequenceDiagram
participant App
participant Cache
participant DB
App->>Cache: SET key=value
Cache->>DB: Write data
DB-->>Cache: OK
Cache-->>App: OK
- Pros: Reads always hot; strong consistency (cache == DB).
- Cons: Write latency includes cache + DB; higher write amplification.
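A minimal write-through sketch against Redis; save_user_to_db is a placeholder for the real persistence call. This version writes the DB first so the cache never publishes a value the source of truth does not have:

import json
import redis

r = redis.Redis(decode_responses=True)

def save_user_to_db(user: dict) -> None:
    # Placeholder: in a real system this is an INSERT/UPDATE in the primary store
    pass

def write_through(key: str, user: dict, ttl: int = 300) -> None:
    # Both writes happen on the request path: the caller only gets OK once
    # the DB and the cache agree.
    save_user_to_db(user)
    r.setex(key, ttl, json.dumps(user))

write_through("prod:user:42:v1", {"id": 42, "tier": "gold"})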
2.4 Write-Behind (Write-Back)
Write to cache first; asynchronously flush to DB.
sequenceDiagram
participant App
participant Cache
participant DB
App->>Cache: SET key=value
Cache-->>App: OK
Note over Cache, DB: Async operation
Cache->>DB: Write data
- Pros: Low write latency; coalesces writes.
- Cons: Risk of data loss on crash; needs durable queue/commit log.
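A toy write-behind sketch using an in-memory queue and a background flusher thread. It only shows the shape: a real deployment backs the queue with something durable (Kafka, a Redis stream, a commit log), because this version loses queued writes on a crash.

import json
import queue
import threading
import time
import redis

r = redis.Redis(decode_responses=True)
pending = queue.Queue()  # in-memory only; use a durable log in production

def write_behind(key: str, value: dict, ttl: int = 300) -> None:
    r.setex(key, ttl, json.dumps(value))  # fast path: cache only
    pending.put((key, value))             # enqueue for async persistence

def persist_to_db(key: str, value: dict) -> None:
    time.sleep(0.01)                      # placeholder for the real DB write

def flusher() -> None:
    while True:
        key, value = pending.get()        # blocks until a write arrives
        try:
            persist_to_db(key, value)
        finally:
            pending.task_done()

threading.Thread(target=flusher, daemon=True).start()
write_behind("prod:user:42:v1", {"id": 42, "tier": "gold"})
pending.join()                            # in this demo, wait for the async flush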
2.5 Refresh-Ahead (Prefetch)
Proactively refresh items nearing expiration to avoid misses.
sequenceDiagram
participant App
participant Cache
participant DB
Note over Cache: Item nearing expiration
Cache->>DB: Prefetch data
DB-->>Cache: Updated value
App->>Cache: GET key
Cache-->>App: Fresh value
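A refresh-ahead sketch on top of Redis TTLs: when a read finds the key inside the refresh window, it kicks off a background reload before expiry. REFRESH_WINDOW and load_report are illustrative.

import json
import threading
import time
import redis

r = redis.Redis(decode_responses=True)

TTL = 300            # hard TTL enforced by Redis
REFRESH_WINDOW = 60  # start refreshing when less than 60s of TTL remain

def get_refresh_ahead(key: str, loader) -> dict:
    raw = r.get(key)
    if raw is None:
        value = loader()
        r.setex(key, TTL, json.dumps(value))
        return value
    if r.ttl(key) < REFRESH_WINDOW:
        # Refresh in the background; the current request still gets the cached value
        threading.Thread(
            target=lambda: r.setex(key, TTL, json.dumps(loader())),
            daemon=True,
        ).start()
    return json.loads(raw)

def load_report() -> dict:
    time.sleep(0.1)  # simulate an expensive query
    return {"rows": 123}

print(get_refresh_ahead("prod:report:v1", load_report))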
3) Invalidation: The Two Hard Things #
“There are only two hard things in Computer Science: cache invalidation and naming things.” (Phil Karlton)
3.1 Cache Invalidation Triggers
- Time-based: TTL/expirations (+ jitter to avoid synchronized expiry).
- Event-based: On data change (CDC, outbox pattern, domain events).
- Versioned keys: Include a version/etag in the key (user:42:v17); see the sketch after this list.
- Negative caching: Cache “not found” briefly to stop repeated misses.
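A sketch of the versioned-key idea: keep a small version counter per entity and fold it into the key, so a write “invalidates” by bumping the counter instead of hunting down every derived key. Old versions simply age out via TTL.

import json
import redis

r = redis.Redis(decode_responses=True)

def versioned_key(entity: str) -> str:
    version = r.get(f"ver:{entity}") or "0"   # counter starts at 0 if never written
    return f"prod:{entity}:v{version}"

def read(entity: str, loader) -> dict:
    key = versioned_key(entity)
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)
    value = loader()
    r.setex(key, 300, json.dumps(value))
    return value

def invalidate(entity: str) -> None:
    # Readers immediately switch to the new key; the old one expires on its own
    r.incr(f"ver:{entity}")

print(read("user:42", lambda: {"id": 42, "tier": "gold"}))
invalidate("user:42")  # next read uses ...:v1 and reloads from the source of truth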
3.2 Stampede & Herding
When a hot key expires, thousands of requests hit the DB.
A cache stampede (also called the dogpile effect) happens when a popular cache entry expires (or is missing) and many clients request it at the same time.
Since the cache doesn’t have the value, all requests hit the backend/database simultaneously.
This sudden surge can overwhelm the database or service, causing high latency or even outages.
Example:
A “hot” key like /trending or /hot_deals expires at 12:00.
Thousands of users request it at 12:01.
All those requests miss the cache and hit the DB together → spike in load → possible crash.
Common mitigations:
- Singleflight / mutex locks: only one request regenerates the value; the others wait.
- Staggered expiration (jitter): avoid all keys expiring at the same moment.
- Background refresh / cache warming: refresh before expiry.
In short: a cache stampede is when cache misses turn into a “thundering herd” against your backend.
sequenceDiagram
participant Clients
participant App
participant Cache
participant DB
Clients->>App: GET /hot
App->>Cache: GET hot_key (MISS)
par Without protection
App->>DB: SELECT hot
Note right of App: Many apps repeat this query
and With singleflight/mutex
App->>App: Acquire lock hot_key
App->>DB: SELECT hot (one query)
DB-->>App: value
App->>Cache: SET hot_key=value TTL
App->>App: Release lock
end
Mitigations: request coalescing (singleflight), soft TTL + background refresh, jittered TTLs, per-key mutex, probabilistic early refresh.
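The code in Section 8 shows the per-key mutex; here is a sketch of probabilistic early refresh (often called XFetch): each reader may recompute slightly before expiry, with a probability that grows as expiry approaches, so one lucky request refreshes a hot key instead of a herd doing it at expiry. beta and the stored delta (last recompute cost) are the tuning knobs; loader is a placeholder.

import json
import math
import random
import time
import redis

r = redis.Redis(decode_responses=True)

def xfetch(key: str, loader, ttl: int = 300, beta: float = 1.0) -> dict:
    """Probabilistic early expiration: recompute *before* expiry with rising probability."""
    raw = r.get(key)
    now = time.time()
    if raw is not None:
        meta = json.loads(raw)
        delta = meta["delta"]    # how long the last recompute took (seconds)
        expiry = meta["expiry"]  # logical expiry timestamp
        # Serve the cached value unless the randomized early-expiry check fires
        if now - delta * beta * math.log(1.0 - random.random()) < expiry:
            return meta["value"]
    start = time.time()
    value = loader()             # recompute from the source of truth
    delta = time.time() - start
    r.setex(key, ttl, json.dumps({"value": value, "delta": delta, "expiry": time.time() + ttl}))
    return value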
4) Keys, Values, and Eviction #
- Keys: Namespace + version + identifiers, e.g., prod:user:42:v17.
- Values: Compact, compressible, and serialization-friendly (JSON/MsgPack).
- Eviction: LRU, LFU, ARC/2Q. Tune per workload. Enforce max memory.
Eviction in caching refers to the process of removing items from the cache when it becomes full or when certain policies dictate that data should no longer be stored. Since cache storage is limited, eviction ensures space is freed up for newer or more relevant data.
flowchart TD
A[Cache Eviction Policies] --> B[LRU<br>Least Recently Used]
A --> C[LFU<br>Least Frequently Used]
A --> D[FIFO<br>First In First Out]
A --> E[TTL<br>Time To Live]
A --> F[Random<br>Selection]
If you don’t measure item size and reference frequency, eviction is guesswork.
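To make the most common policy concrete, here is a tiny in-process LRU sketch built on OrderedDict (for caching function results, Python’s functools.lru_cache gives you this behavior out of the box):

from collections import OrderedDict

class LRUCache:
    """Evicts the least recently used entry once capacity is exceeded."""
    def __init__(self, capacity: int):
        self.capacity = capacity
        self._data = OrderedDict()

    def get(self, key):
        if key not in self._data:
            return None
        self._data.move_to_end(key)         # mark as most recently used
        return self._data[key]

    def put(self, key, value) -> None:
        if key in self._data:
            self._data.move_to_end(key)
        self._data[key] = value
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # evict the LRU entry

cache = LRUCache(2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")         # "a" becomes most recently used
cache.put("c", 3)      # evicts "b"
print(cache.get("b"))  # None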
5) Consistency & Freshness #
- Strong (write-through, short TTL): Cache ≈ DB all the time.
- Eventual (cache-aside): Acceptable staleness window.
- Stale-While-Revalidate: Serve stale content while a background refresh runs.
- Per-route freshness: Mission-critical APIs use tighter TTL/events; analytics tolerate larger staleness.
flowchart TD
A[Incoming Read] --> B{Cache hit & fresh?}
B -- Yes --> C[Serve cache]
B -- No --> D{Stale allowed?}
D -- Yes --> E[Serve stale & refresh async]
D -- No --> F[Read DB -> update cache -> serve fresh]
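The flowchart translates almost directly into code. A sketch, assuming each entry carries its own soft-expiry timestamp and that refresh_async/read_db are placeholders supplied by the caller:

import time

def serve(entry, stale_grace: float, refresh_async, read_db):
    """entry is (value, soft_expiry) or None; read_db should also repopulate the cache."""
    now = time.time()
    if entry is not None:
        value, soft_expiry = entry
        if now < soft_expiry:                # hit & fresh -> serve cache
            return value
        if now < soft_expiry + stale_grace:  # stale allowed -> serve stale, refresh async
            refresh_async()
            return value
    return read_db()                         # miss / too stale -> read DB, serve fresh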
6) Security & Multi-Tenancy #
- Never cache secrets/PII without isolation (per-tenant namespaces, encryption).
- Vary-by-auth: Include user/role in key if output differs by identity.
- Authorization leaks happen when a shared cache serves personalized content globally.
flowchart LR
A[Request] --> B{Auth Header?}
B -- Yes --> C[Include in Cache Key]
B -- No --> D[Use Generic Key]
C --> E[Cache Lookup]
D --> E
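A small sketch of a key builder that varies on identity, so personalized responses can never be served from a shared entry; the tenant/user/role fields are illustrative:

from typing import Optional

def cache_key(route: str, tenant: Optional[str] = None,
              user_id: Optional[str] = None, role: Optional[str] = None) -> str:
    # Public content shares one key; anything identity-dependent gets its own namespace.
    if tenant is None and user_id is None and role is None:
        return f"public:{route}:v1"
    return f"tenant:{tenant}:user:{user_id}:role:{role}:{route}:v1"

print(cache_key("/api/feed"))                                             # public:/api/feed:v1
print(cache_key("/api/feed", tenant="acme", user_id="42", role="admin"))  # per-identity key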
7) Observability: What to Measure #
- Hit ratio (overall + per-endpoint).
- Latency p50/p90/p99 (cache vs DB).
- Evictions and OOM events.
- Keyspace misses (cache penetration by non-existent keys).
- Queue depth for write-behind/refresh jobs.
flowchart LR
A[Requests] --> B["Metrics: hit/miss"]
A --> C["Tracing: spans (cache, DB)"]
A --> D["Logging: key, action, ms, size"]
B --> E["Dashboards/Alerts"]
C --> E
D --> E
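A minimal sketch of the counters behind a hit-ratio dashboard; in practice you would export these through your metrics library rather than module-level dicts:

import time
from collections import defaultdict

counters = defaultdict(int)  # e.g. {"GET /user:hit": 123, "GET /user:miss": 4}
latencies_ms = []            # feed these into your histogram/percentile tooling

def instrumented_get(cache_get, key: str, endpoint: str):
    start = time.perf_counter()
    value = cache_get(key)
    latencies_ms.append((time.perf_counter() - start) * 1000)
    counters[f"{endpoint}:{'hit' if value is not None else 'miss'}"] += 1
    return value

def hit_ratio(endpoint: str) -> float:
    hits, misses = counters[f"{endpoint}:hit"], counters[f"{endpoint}:miss"]
    return hits / (hits + misses) if (hits + misses) else 0.0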
8) Python + Redis: Cache-Aside with Stampede Protection #
8.1 Setup
Run Redis locally: docker run -p 6379:6379 redis:7
Install deps: pip install redis tenacity
8.2 Code (Cache-Aside + Per-Key Mutex + TTL Jitter)
import os, json, random, time
from contextlib import contextmanager
from typing import Callable, Optional
import redis
from tenacity import retry, stop_after_attempt, wait_exponential
REDIS_URL = os.getenv("REDIS_URL", "redis://localhost:6379/0")
r = redis.Redis.from_url(REDIS_URL, decode_responses=True)
def get_with_cache(key: str,
                   loader: Callable[[], dict],
                   base_ttl_sec: int = 300,
                   jitter_pct: float = 0.1,
                   lock_ttl_sec: int = 10) -> dict:
    """Cache-aside with per-key mutex and TTL jitter."""
    # 1) Try cache
    val = r.get(key)
    if val is not None:
        return json.loads(val)

    # 2) Cache miss -> acquire per-key lock
    lock_key = f"lock:{key}"
    with acquire_lock(lock_key, ttl=lock_ttl_sec):
        # Double-check after lock (another worker may have populated)
        val = r.get(key)
        if val is not None:
            return json.loads(val)

        # 3) Load from source of truth (DB/service)
        data = loader()

        # 4) Negative caching: if not found, cache placeholder briefly
        if data is None:
            r.setex(key, 30, json.dumps({"_not_found": True}))
            return {"_not_found": True}

        # 5) Set with jittered TTL to avoid synchronized expiry
        ttl = int(base_ttl_sec * (1 + random.uniform(-jitter_pct, jitter_pct)))
        r.setex(key, ttl, json.dumps(data))
        return data

@contextmanager
def acquire_lock(lock_key: str, ttl: int = 10):
    """Best-effort per-key lock with auto-expiry."""
    token = str(time.time())
    ok = r.set(lock_key, token, nx=True, ex=ttl)
    if not ok:
        # Brief backoff loop to reduce the storm; in production use a queue or a
        # singleflight library. If the lock never frees up, we proceed without it
        # and accept a duplicate load rather than failing the request.
        for _ in range(20):
            time.sleep(0.05)
            ok = r.set(lock_key, token, nx=True, ex=ttl)
            if ok:
                break
    try:
        yield
    finally:
        # Best-effort unlock: delete only if we still own it
        # (avoid deleting someone else's newly acquired lock).
        try:
            if r.get(lock_key) == token:
                r.delete(lock_key)
        except Exception:
            pass

# --- Example usage ---
FAKE_DB = {"user:42": {"id": 42, "name": "Ada Lovelace", "tier": "gold"}}

@retry(stop=stop_after_attempt(3), wait=wait_exponential(min=0.05, max=0.5))
def load_user_from_db(user_key: str) -> Optional[dict]:
    # Simulate DB latency
    time.sleep(0.2)
    return FAKE_DB.get(user_key)

if __name__ == "__main__":
    key = "user:42"
    data = get_with_cache(f"prod:{key}:v1", lambda: load_user_from_db(key))
    print("Result:", data)
What this demonstrates
- Cache-aside: only hits DB on misses.
- Per-key mutex: avoids dogpiles.
- TTL jitter: randomized expiry prevents synchronized stampedes.
- Negative caching: reduces penetration for missing keys.
9) HTTP Caching (CDN & Reverse Proxy) #
Leverage the platform before writing code.
9.1 Cache-Control Playbook
- Cache-Control: public, max-age=300, stale-while-revalidate=60
- ETag + If-None-Match for validation.
- Vary on headers that change response (e.g., Authorization, Accept-Encoding).
sequenceDiagram
participant Client
participant CDN
participant Origin
Client->>CDN: GET /api/report
alt CDN Hit Fresh
CDN-->>Client: 200 (cached)
else CDN Stale but SWR
CDN-->>Client: 200 (stale)
CDN->>Origin: Revalidate/Fetch
Origin-->>CDN: 200/304
else Miss
CDN->>Origin: GET
Origin-->>CDN: 200 + Cache-Control
CDN-->>Client: 200
end
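Since the appendix already uses FastAPI, here is a sketch of emitting the playbook’s headers from an endpoint; the payload and the ETag scheme are illustrative:

import hashlib
import json
from fastapi import FastAPI, Request, Response

app = FastAPI()

@app.get("/api/report")
def get_report(request: Request):
    body = json.dumps({"total": 123})  # illustrative payload
    etag = hashlib.sha256(body.encode()).hexdigest()[:16]
    # Validation: let the CDN/client revalidate cheaply with If-None-Match
    if request.headers.get("if-none-match") == etag:
        return Response(status_code=304)
    return Response(
        content=body,
        media_type="application/json",
        headers={
            "Cache-Control": "public, max-age=300, stale-while-revalidate=60",
            "ETag": etag,
        },
    )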
10) Partitioning, Hot Keys, and Scale-Out #
10.1 Partitioning (Sharding)
Why: A single cache node has limits (memory, CPU, network). As data grows, you split it across multiple nodes.
How:
- Key-based hashing: The cache key is hashed and mapped to a specific node.
- Consistent hashing: Minimizes re-shuffling when nodes are added/removed.
- Challenge: Monitoring balance — if keys aren’t evenly distributed, some nodes may get overloaded.
10.2 Hot Keys
What: A “hot key” is a cache key that’s accessed way more frequently than others (e.g., homepage_feed, top_trending).
Problem: Even with partitioning, the node holding that key gets disproportionate load, becoming a bottleneck.
Mitigations:
- Key replication: Store the hot key on multiple nodes. Clients randomly read from replicas.
- Local caching (L1): App servers cache hot keys in-process to reduce pressure on distributed cache.
- Request coalescing / singleflight: Prevent multiple clients from regenerating the same hot key simultaneously.
10.3 Scale-Out
- Vertical scale: Add more memory/CPU to a single cache node. Works only up to a point.
- Horizontal scale: Add more cache nodes and distribute data (via partitioning).
- Dynamic scaling: In cloud systems (Redis Cluster, Memcached with consistent hashing), nodes can be added/removed while keeping service online.
Key Takeaway
- Partitioning ensures load distribution; consistent hashing spreads keys evenly across shards and minimizes remapping on reshard.
- Hot-key management prevents single-node overload: replicate heavy-read values across multiple shards, or cache them in a local L1.
- Scale-out is how caches evolve from single-node setups to large distributed clusters.
- Large objects: compress them; consider segmentation (chunking) if values are near size limits.
flowchart LR
RING((Consistent Hash Ring))
A[keyA] --> RING
B[keyB] --> RING
C[keyC] --> RING
RING --> S1[Shard 1]
RING --> S2[Shard 2]
RING --> S3[Shard 3]
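A compact consistent-hash ring sketch with virtual nodes, to make the diagram concrete; the shard names are placeholders:

import bisect
import hashlib

class HashRing:
    def __init__(self, nodes, vnodes: int = 100):
        self._ring = []
        for node in nodes:
            for i in range(vnodes):                  # virtual nodes smooth the distribution
                self._ring.append((self._hash(f"{node}#{i}"), node))
        self._ring.sort()
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(s: str) -> int:
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def node_for(self, key: str) -> str:
        idx = bisect.bisect(self._keys, self._hash(key)) % len(self._keys)  # wrap around
        return self._ring[idx][1]

ring = HashRing(["shard-1", "shard-2", "shard-3"])
print(ring.node_for("prod:user:42:v1"))  # stable mapping; adding a shard moves ~1/N of keys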
11) Refresh-Ahead & Background Jobs #
- Soft TTL: Serve stale if within grace window; kick off background refresh.
- Batch refreshers: Periodically refresh top-N hot keys from logs/metrics.
- Change Data Capture (CDC): Invalidate keys on DB row changes through an event stream.
flowchart LR
DB[(Database)] --> CDC[Change Data Capture]
CDC --> MQ[Message Queue]
MQ --> Worker[Cache Worker]
Worker --> Cache[(Cache)]
Cache --> App[Application]
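A sketch of the worker at the end of that pipeline: it consumes change events and deletes the affected keys so the next read repopulates from the DB. A plain Python queue stands in for the real message queue (Kafka/Debezium topic, Redis stream):

import queue
import threading
import redis

r = redis.Redis(decode_responses=True)
change_events = queue.Queue()  # stand-in for a CDC / message-queue consumer

def cache_invalidation_worker() -> None:
    while True:
        event = change_events.get()                           # e.g. {"entity": "user", "id": 42}
        r.delete(f"prod:{event['entity']}:{event['id']}:v1")  # next read repopulates from DB
        change_events.task_done()

threading.Thread(target=cache_invalidation_worker, daemon=True).start()
change_events.put({"entity": "user", "id": 42})  # simulate a row change arriving from CDC
change_events.join()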
12) Anti-Patterns to Avoid #
- Caching personalized responses without including identity in the key.
- Global TTLs everywhere (ignore data semantics).
- Ignoring object size and eviction policy interactions.
- Not measuring miss cost → a cache that hides load most of the time but melts the DB on expiry.
flowchart TD
A[Cache Anti-Patterns] --> B["Cache Invalidation Bugs"]
A --> C["Thundering Herd Problems"]
A --> D["Cache Pollution"]
A --> E["Security Issues"]
A --> F["Consistency Problems"]
13) A Minimal Checklist #
- Clear key schema with namespaces & versions.
- Choose pattern per route (aside, read-through, write-through).
- Implement stampede control (mutex/singleflight + TTL jitter).
- Separate L1 (in-proc) and L2 (Redis) with different TTLs.
- Add negative caching for 404/empty results (short TTL).
- Wire metrics: hit ratio, p99 latency, evictions, keyspace misses.
- Define SLA by endpoint (freshness vs performance).
- Test failure modes: cache down, partial shard loss, mass expiry.
14) Putting It Into Practice (Learning Path) #
- Start with cache-aside on a read-heavy endpoint; add TTL jitter.
- Add per-key locks to remove thundering herds.
- Introduce L1 in-proc cache for super-hot items; measure p99.
- Add event-based invalidation from your DB’s outbox/CDC.
- Push static endpoints to CDN with stale-while-revalidate.
- Build dashboards: hit ratio by keyspace and miss penalty.
timeline
title Caching Implementation Timeline
section Phase 1
Basic Cache-Aside : Implement cache-aside pattern
Add TTL Jitter : Prevent synchronized expiration
section Phase 2
Stampede Protection : Add per-key mutex locks
Negative Caching : Cache missing entities
section Phase 3
Multi-Level Cache : Add L1 in-process cache
Event Invalidation : Implement CDC-based invalidation
section Phase 4
Advanced Patterns : Write-behind/refresh-ahead
Observability : Comprehensive metrics & tracing
15) Appendix: Tiny FastAPI Wrapper (Optional) #
If you prefer HTTP endpoints to try locally:
from fastapi import FastAPI
from pydantic import BaseModel
import uvicorn

app = FastAPI()

class User(BaseModel):
    id: int
    name: str
    tier: str

# reuse get_with_cache(...) and load_user_from_db(...) from the previous snippet here

@app.get("/user/{uid}")
def get_user(uid: int):
    key = f"prod:user:{uid}:v1"
    data = get_with_cache(key, lambda: load_user_from_db(f"user:{uid}"))
    return data

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)
Final Thoughts #
Caching is not a single feature; it’s an architecture. When you line up placement (L0–L2), pattern (aside/read-through/etc.), invalidation, and observability, you get systems that feel instant and never fall over when they’re popular.
Fast is a feature. Design your caches like you mean it.