How to design, implement, and operate caches that make systems fast, cheap, and reliable.
Caching is one of the highest-leverage tools in systems engineering: it can cut tail latency by orders of magnitude, reduce database load, and shrink cloud bills. But poorly designed caches cause subtle correctness bugs, “thundering herd” outages, and stale-data nightmares.
This guide goes deep on what to cache, where to cache, how to keep it fresh, and how to measure it, with diagrams and runnable code.
1) Where Caches Live (Multi-Level Caching) #
flowchart LR A["Client\n(Browser/App)"] --> B["CDN/Edge Cache\n(L0)"] B --> C["Reverse Proxy Cache\n(e.g., NGINX)"] C --> D["App Server L1\n(in-process cache)"] D --> E["Distributed Cache L2\n(e.g., Redis/Memcached)"] E --> F["Primary Datastore\n(SQL/NoSQL/Search)"]
- L0 Edge/CDN: Static assets, static JSON, and cacheable API GET responses (controlled via cache headers).
- Reverse proxy: Response caching, compression, ETag handling.
- L1 in-process: Nanosecond access; great for per-instance hot keys. Loses data on restart.
- L2 distributed: Cross-instance shared cache; survives app restarts.
- DB: Source of truth.
Rule of thumb: promote the most frequently accessed and most stable data up the pyramid. The sketch below shows one read walking L1 → L2 → DB.
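To make the layering concrete, here is a minimal sketch of an L1 in-process dict in front of an L2 Redis cache in front of the database. The `load_from_db` function, TTL values, and key names are illustrative assumptions, not part of any specific stack.

```python
import json
import time
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# L1: tiny in-process cache with per-entry expiry (per instance, lost on restart)
_l1: dict = {}
L1_TTL = 5        # seconds; keep short so instances don't drift far apart
L2_TTL = 300      # seconds

def load_from_db(key: str) -> dict:
    """Hypothetical source-of-truth lookup; replace with a real query."""
    return {"key": key, "loaded_at": time.time()}

def get(key: str) -> dict:
    # 1) L1: in-process (nanosecond-to-microsecond access)
    entry = _l1.get(key)
    if entry and entry[0] > time.time():
        return entry[1]

    # 2) L2: shared distributed cache (survives app restarts)
    raw = r.get(key)
    if raw is not None:
        value = json.loads(raw)
    else:
        # 3) Source of truth, then populate L2
        value = load_from_db(key)
        r.setex(key, L2_TTL, json.dumps(value))

    # Populate L1 on the way back up
    _l1[key] = (time.time() + L1_TTL, value)
    return value
```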
2) Core Caching Patterns #
2.1 Cache-Aside (Lazy Loading)
The application fetches from cache first; on miss, read from DB and populate the cache.
```mermaid
sequenceDiagram
    participant App
    participant Cache
    participant DB
    App->>Cache: GET key
    alt Hit
        Cache-->>App: value
    else Miss
        App->>DB: SELECT ...
        DB-->>App: value
        App->>Cache: SET key=value TTL
        Cache-->>App: OK
    end
```
- Pros: Simple, minimal coupling to cache.
- Cons: First request pays DB cost; risk of stampede on popular keys.
2.2 Read-Through
The cache itself knows how to load from DB on misses (via loader/adapter).
```mermaid
sequenceDiagram
    participant App
    participant Cache
    participant DB
    App->>Cache: GET key
    alt Hit
        Cache-->>App: value
    else Miss
        Cache->>DB: Load data
        DB-->>Cache: value
        Cache->>Cache: Store value
        Cache-->>App: value
    end
```
- Pros: Centralized logic; fewer stampedes.
- Cons: Cache becomes smarter (more moving parts).
2.3 Write-Through
Writes go to cache and DB synchronously.
```mermaid
sequenceDiagram
    participant App
    participant Cache
    participant DB
    App->>Cache: SET key=value
    Cache->>DB: Write data
    DB-->>Cache: OK
    Cache-->>App: OK
```
- Pros: Reads always hot; strong consistency (cache == DB).
- Cons: Write latency includes cache + DB; higher write amplification.
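A minimal write-through sketch, using Redis for the cache and an in-memory SQLite table standing in for the primary store. For brevity the application performs both writes itself; in the diagram above, a true write-through cache layer would own the DB write.

```python
import json
import sqlite3
import redis

r = redis.Redis(decode_responses=True)
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE IF NOT EXISTS users (id INTEGER PRIMARY KEY, doc TEXT)")

def write_through(user_id: int, doc: dict, ttl: int = 300) -> None:
    """Write the DB and the cache in the same request path.

    If either step fails we raise, so the caller sees the write as not
    committed; a production version would add a transaction/retry strategy.
    """
    payload = json.dumps(doc)
    # 1) Durable write first (source of truth)
    db.execute(
        "INSERT INTO users (id, doc) VALUES (?, ?) "
        "ON CONFLICT(id) DO UPDATE SET doc = excluded.doc",
        (user_id, payload),
    )
    db.commit()
    # 2) Synchronously refresh the cache so the next read is hot
    r.setex(f"user:{user_id}", ttl, payload)
```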
2.4 Write-Behind (Write-Back)
Write to cache first; asynchronously flush to DB.
```mermaid
sequenceDiagram
    participant App
    participant Cache
    participant DB
    App->>Cache: SET key=value
    Cache-->>App: OK
    Note over Cache,DB: Async operation
    Cache->>DB: Write data
```
- Pros: Low write latency; coalesces writes.
- Cons: Risk of data loss on crash; needs durable queue/commit log.
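A minimal write-behind sketch. The in-process `queue.Queue` and the `persist_to_db` function are illustrative stand-ins; as the cons above note, a crash-safe version needs a durable queue or commit log instead.

```python
import json
import queue
import threading
import redis

r = redis.Redis(decode_responses=True)
flush_q: "queue.Queue[tuple[str, dict]]" = queue.Queue()

def write_behind(key: str, doc: dict, ttl: int = 300) -> None:
    """Acknowledge after the cache write; persist to the DB asynchronously."""
    r.setex(key, ttl, json.dumps(doc))
    flush_q.put((key, doc))          # in production: a durable queue/commit log

def persist_to_db(key: str, doc: dict) -> None:
    print("persisting", key, doc)    # stand-in for the real DB write

def flusher() -> None:
    """Background worker draining queued writes into the source of truth."""
    while True:
        key, doc = flush_q.get()
        try:
            persist_to_db(key, doc)
        finally:
            flush_q.task_done()

threading.Thread(target=flusher, daemon=True).start()
```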
2.5 Refresh-Ahead (Prefetch)
Proactively refresh items nearing expiration to avoid misses.
```mermaid
sequenceDiagram
    participant App
    participant Cache
    participant DB
    Note over Cache: Item nearing expiration
    Cache->>DB: Prefetch data
    DB-->>Cache: Updated value
    App->>Cache: GET key
    Cache-->>App: Fresh value
```
3) Invalidation: The Two Hard Things #
The old joke, usually attributed to Phil Karlton, still holds: “There are only two hard things in Computer Science: cache invalidation and naming things.”
3.1 Cache Invalidation Triggers
- Time-based: TTL/expirations (+ jitter to avoid synchronized expiry).
- Event-based: On data change (CDC, outbox pattern, domain events).
- Versioned keys: Include a version/ETag in the key (e.g., user:42:v17); see the sketch after this list.
- Negative caching: Cache “not found” briefly to stop repeated misses.
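To illustrate versioned keys, here is a small sketch (the key layout and the per-entity version counter are assumptions, using the same Redis client as Section 8). Reads build the key from the counter; a write bumps the counter, so stale entries simply stop being referenced and age out via TTL.

```python
import redis

r = redis.Redis(decode_responses=True)

def user_key(user_id: int) -> str:
    """Build a versioned key: bumping the version invalidates all old entries."""
    version = r.get(f"user:{user_id}:ver") or "0"
    return f"prod:user:{user_id}:v{version}"

def invalidate_user(user_id: int) -> None:
    """Event-based invalidation: bump the version instead of deleting keys."""
    r.incr(f"user:{user_id}:ver")   # old keys become unreachable and expire by TTL
```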
3.2 Stampede & Herding
A cache stampede (also called the dogpile effect) happens when a popular cache entry expires or is missing and many clients request it at the same time. Because the cache has no value, every request falls through to the backend or database simultaneously, and the sudden surge can overwhelm it, causing high latency or even an outage.
Example:
- A hot key like /trending or /hot_deals expires at 12:00.
- Thousands of users request it at 12:01.
- All of those requests miss the cache and hit the DB together → load spike → possible crash.
Common mitigations:
- Singleflight / mutex locks: only one request regenerates the value; the others wait.
- Staggered expiration (jitter): avoid all keys expiring at the same moment.
- Background refresh / cache warming: refresh before expiry.
In short: a cache stampede is when cache misses turn into a “thundering herd” against your backend.
```mermaid
sequenceDiagram
    participant Clients
    participant App
    participant Cache
    participant DB
    Clients->>App: GET /hot
    App->>Cache: GET hot_key (MISS)
    par Without protection
        App->>DB: SELECT hot
        Note right of App: Many apps repeat this query
    and With singleflight/mutex
        App->>App: Acquire lock hot_key
        App->>DB: SELECT hot (one query)
        DB-->>App: value
        App->>Cache: SET hot_key=value TTL
        App->>App: Release lock
    end
```
Beyond the mitigations above, soft TTLs with background refresh and probabilistic early refresh also help by spreading regeneration over time instead of concentrating it at expiry.
4) Keys, Values, and Eviction #
- Keys: Namespace + version + identifiers, e.g., prod:user:42:v17.
- Values: Compact, compressible, and serialization-friendly (JSON/MsgPack).
- Eviction: LRU, LFU, ARC/2Q. Tune per workload. Enforce max memory.
Eviction in caching refers to the process of removing items from the cache when it becomes full or when certain policies dictate that data should no longer be stored. Since cache storage is limited, eviction ensures space is freed up for newer or more relevant data.
```mermaid
flowchart TD
    A[Cache Eviction Policies] --> B[LRU<br>Least Recently Used]
    A --> C[LFU<br>Least Frequently Used]
    A --> D[FIFO<br>First In First Out]
    A --> E[TTL<br>Time To Live]
    A --> F[Random<br>Selection]
```
If you don’t measure item size and reference frequency, eviction is guesswork.
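To make one policy concrete, here is a minimal in-process LRU sketch built on OrderedDict. It is illustrative only; a distributed cache such as Redis enforces its own memory limit and eviction policy (e.g., maxmemory with allkeys-lru).

```python
from collections import OrderedDict

class LRUCache:
    """Minimal LRU: evicts the least recently used key once capacity is hit."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._data = OrderedDict()

    def get(self, key: str):
        if key not in self._data:
            return None
        self._data.move_to_end(key)          # mark as most recently used
        return self._data[key]

    def put(self, key: str, value) -> None:
        if key in self._data:
            self._data.move_to_end(key)
        self._data[key] = value
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)   # evict the least recently used entry

# Usage: with capacity 2, inserting "c" after reading "b" evicts "a".
cache = LRUCache(2)
cache.put("a", 1); cache.put("b", 2)
cache.get("b")
cache.put("c", 3)
print(cache.get("a"))   # None -> "a" was evicted
```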
5) Consistency & Freshness #
- Strong (write-through, short TTL): Cache ≈ DB all the time.
- Eventual (cache-aside): Acceptable staleness window.
- Stale-While-Revalidate: Serve the stale value while a background refresh runs (sketched after the diagram below).
- Per-route freshness: Mission-critical APIs use tighter TTL/events; analytics tolerate larger staleness.
```mermaid
flowchart TD
    A[Incoming Read] --> B{"Cache hit & fresh?"}
    B -- Yes --> C[Serve cache]
    B -- No --> D{"Stale allowed?"}
    D -- Yes --> E["Serve stale & refresh async"]
    D -- No --> F["Read DB -> update cache -> serve fresh"]
```
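A minimal stale-while-revalidate sketch, assuming the same Redis setup as Section 8. The payload shape (a `fresh_until` soft-TTL timestamp stored alongside the value) and the `loader` callback are illustrative assumptions; in practice you would combine this with the per-key lock from Section 8 so only one refresh runs per key.

```python
import json
import threading
import time
from typing import Callable, Optional

import redis

r = redis.Redis(decode_responses=True)

def get_with_soft_ttl(key: str, loader: Callable[[], dict],
                      soft_ttl: int = 60, hard_ttl: int = 300) -> Optional[dict]:
    """Serve stale data inside the grace window while refreshing in the background."""
    raw = r.get(key)
    if raw is not None:
        entry = json.loads(raw)
        if entry["fresh_until"] >= time.time():
            return entry["value"]                      # fresh: serve directly
        # Stale but present: serve it, refresh asynchronously
        threading.Thread(target=_refresh, args=(key, loader, soft_ttl, hard_ttl),
                         daemon=True).start()
        return entry["value"]
    # Miss: load synchronously (no stale value to fall back on)
    return _refresh(key, loader, soft_ttl, hard_ttl)

def _refresh(key: str, loader: Callable[[], dict],
             soft_ttl: int, hard_ttl: int) -> dict:
    value = loader()
    entry = {"value": value, "fresh_until": time.time() + soft_ttl}
    r.setex(key, hard_ttl, json.dumps(entry))          # hard TTL is the Redis expiry
    return value
```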
6) Security & Multi-Tenancy #
- Never cache secrets/PII without isolation (per-tenant namespaces, encryption).
- Vary-by-auth: Include the user/role in the cache key if the output differs by identity (see the key-builder sketch after the diagram).
- Authorization leaks happen when a shared cache serves personalized content globally.
```mermaid
flowchart LR
    A[Request] --> B{"Auth Header?"}
    B -- Yes --> C[Include in Cache Key]
    B -- No --> D[Use Generic Key]
    C --> E[Cache Lookup]
    D --> E
```
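A minimal key-builder sketch for multi-tenant, vary-by-auth caching. The key layout (namespace, tenant, version, route, identity segments) is an illustrative assumption; the point is that identity enters the key only when the response actually varies by it, and raw identifiers are hashed so they never appear in key names.

```python
import hashlib
from typing import Optional

def cache_key(tenant_id: str, route: str,
              user_id: Optional[str] = None,
              role: Optional[str] = None,
              version: str = "v1") -> str:
    """Build a namespaced, per-tenant key; add identity only when output varies by it."""
    parts = ["prod", tenant_id, version, route]
    if user_id is not None:
        # Hash the identity so raw user identifiers never leak into key names
        parts.append("u:" + hashlib.sha256(user_id.encode()).hexdigest()[:12])
    if role is not None:
        parts.append("r:" + role)
    return ":".join(parts)

# Personalized response: identity is part of the key, never shared across users
print(cache_key("acme", "/api/orders", user_id="42", role="admin"))
# Public response: one generic key per tenant
print(cache_key("acme", "/api/catalog"))
```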
7) Observability: What to Measure #
- Hit ratio (overall + per-endpoint).
- Latency p50/p90/p99 (cache vs DB).
- Evictions and OOM events.
- Keyspace misses (cache penetration by non-existent keys).
- Queue depth for write-behind/refresh jobs.
```mermaid
flowchart LR
    A[Requests] --> B["Metrics: hit/miss"]
    A --> C["Tracing: spans (cache, DB)"]
    A --> D["Logging: key, action, ms, size"]
    B --> E["Dashboards/Alerts"]
    C --> E
    D --> E
```
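These metrics only help if they are actually emitted on the hot path. Here is a tiny instrumentation sketch; the in-memory Counter and latency lists are stand-ins for a real metrics client (Prometheus, StatsD, etc.), and `fake_cache` plus the sleep are placeholders for the real cache and DB calls.

```python
import time
from collections import Counter, defaultdict

hits_misses = Counter()            # e.g. {"cache.hit": 10, "cache.miss": 2}
latency_ms = defaultdict(list)     # per-source latency samples

class timed:
    """Record how long a lookup took, labelled by source ("cache" vs "db")."""
    def __init__(self, source: str):
        self.source = source
    def __enter__(self):
        self.t0 = time.perf_counter()
    def __exit__(self, *exc):
        latency_ms[self.source].append((time.perf_counter() - self.t0) * 1000)

fake_cache = {"user:42": {"id": 42}}   # toy cache just to exercise the counters

def get_user(key: str) -> dict:
    with timed("cache"):
        value = fake_cache.get(key)
    if value is not None:
        hits_misses["cache.hit"] += 1
        return value
    hits_misses["cache.miss"] += 1
    with timed("db"):
        time.sleep(0.01)               # stand-in for the DB query
        value = {"id": key}
    fake_cache[key] = value
    return value

get_user("user:42"); get_user("user:7")
total = hits_misses["cache.hit"] + hits_misses["cache.miss"]
print("hit ratio:", hits_misses["cache.hit"] / total)
samples = sorted(latency_ms["cache"])
print("median cache latency ms:", samples[len(samples) // 2])
```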
8) Python + Redis: Cache-Aside with Stampede Protection #
8.1 Setup
Run Redis locally: docker run -p 6379:6379 redis:7
Install deps: pip install redis tenacity
8.2 Code (Cache-Aside + Per-Key Mutex + TTL Jitter)
```python
import json
import os
import random
import time
from contextlib import contextmanager
from typing import Callable, Optional

import redis
from tenacity import retry, stop_after_attempt, wait_exponential

REDIS_URL = os.getenv("REDIS_URL", "redis://localhost:6379/0")
r = redis.Redis.from_url(REDIS_URL, decode_responses=True)


def get_with_cache(key: str,
                   loader: Callable[[], Optional[dict]],
                   base_ttl_sec: int = 300,
                   jitter_pct: float = 0.1,
                   lock_ttl_sec: int = 10) -> dict:
    """Cache-aside with per-key mutex and TTL jitter."""
    # 1) Try cache
    val = r.get(key)
    if val is not None:
        return json.loads(val)

    # 2) Cache miss -> acquire per-key lock
    lock_key = f"lock:{key}"
    with acquire_lock(lock_key, ttl=lock_ttl_sec):
        # Double-check after lock (another worker may have populated)
        val = r.get(key)
        if val is not None:
            return json.loads(val)

        # 3) Load from source of truth (DB/service)
        data = loader()

        # 4) Negative caching: if not found, cache a placeholder briefly
        if data is None:
            r.setex(key, 30, json.dumps({"_not_found": True}))
            return {"_not_found": True}

        # 5) Set with jittered TTL to avoid synchronized expiry
        ttl = int(base_ttl_sec * (1 + random.uniform(-jitter_pct, jitter_pct)))
        r.setex(key, ttl, json.dumps(data))
        return data


@contextmanager
def acquire_lock(lock_key: str, ttl: int = 10):
    """Non-blocking per-key lock with auto-expiry."""
    token = str(time.time())
    ok = r.set(lock_key, token, nx=True, ex=ttl)
    if not ok:
        # Tiny busy-wait backoff to reduce the storm; in production use a
        # queue or a singleflight library instead.
        for _ in range(20):
            time.sleep(0.05)
            ok = r.set(lock_key, token, nx=True, ex=ttl)
            if ok:
                break
        # If the lock never arrives, fall through: the caller's double-check
        # usually finds the value the lock holder just wrote.
    try:
        yield
    finally:
        # Best-effort unlock: delete only if we still own it
        # (avoid deleting someone else's lock after ours expired).
        try:
            if r.get(lock_key) == token:
                r.delete(lock_key)
        except Exception:
            pass


# --- Example usage ---
FAKE_DB = {"user:42": {"id": 42, "name": "Ada Lovelace", "tier": "gold"}}


@retry(stop=stop_after_attempt(3), wait=wait_exponential(min=0.05, max=0.5))
def load_user_from_db(user_key: str) -> Optional[dict]:
    # Simulate DB latency
    time.sleep(0.2)
    return FAKE_DB.get(user_key)


if __name__ == "__main__":
    key = "user:42"
    data = get_with_cache(f"prod:{key}:v1", lambda: load_user_from_db(key))
    print("Result:", data)
```
What this demonstrates
- Cache-aside: only hits DB on misses.
- Per-key mutex: avoids dogpiles.
- TTL jitter: randomized expiry prevents synchronized stampedes.
- Negative caching: reduces penetration for missing keys.
9) HTTP Caching (CDN & Reverse Proxy) #
Leverage the platform before writing code.
9.1 Cache-Control Playbook
- Cache-Control: public, max-age=300, stale-while-revalidate=60
- ETag + If-None-Match for validation (see the sketch after the diagram).
- Vary on headers that change the response (e.g., Authorization, Accept-Encoding).
```mermaid
sequenceDiagram
    participant Client
    participant CDN
    participant Origin
    Client->>CDN: GET /api/report
    alt CDN Hit Fresh
        CDN-->>Client: 200 (cached)
    else CDN Stale but SWR
        CDN-->>Client: 200 (stale)
        CDN->>Origin: Revalidate/Fetch
        Origin-->>CDN: 200/304
    else Miss
        CDN->>Origin: GET
        Origin-->>CDN: 200 + Cache-Control
        CDN-->>Client: 200
    end
```
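A minimal origin-side sketch of this playbook using FastAPI (matching the appendix); the route and payload are illustrative. The handler emits Cache-Control, ETag, and Vary, and answers conditional requests with 304 so the CDN can revalidate cheaply.

```python
import hashlib
import json

from fastapi import FastAPI, Request, Response

app = FastAPI()

@app.get("/api/report")
def report(request: Request) -> Response:
    payload = {"total": 1234, "period": "2024-Q1"}   # illustrative body
    body = json.dumps(payload)
    etag = '"' + hashlib.sha256(body.encode()).hexdigest()[:16] + '"'

    headers = {
        # CDN/browser may cache for 5 min, then serve stale for 60s while revalidating
        "Cache-Control": "public, max-age=300, stale-while-revalidate=60",
        "ETag": etag,
        "Vary": "Accept-Encoding",
    }

    # Conditional request: client/CDN already has this version -> 304, no body
    if request.headers.get("if-none-match") == etag:
        return Response(status_code=304, headers=headers)

    return Response(content=body, media_type="application/json", headers=headers)
```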
10) Partitioning, Hot Keys, and Scale-Out #
10.1 Partitioning (Sharding)
Why: A single cache node has limits (memory, CPU, network). As data grows, you split it across multiple nodes.
How:
- Key-based hashing: The cache key is hashed and mapped to a specific node.
- Consistent hashing: Minimizes re-shuffling when nodes are added/removed (see the ring sketch after this list).
- Challenge: Monitoring balance — if keys aren’t evenly distributed, some nodes may get overloaded.
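A minimal consistent-hash ring sketch; the node names, MD5 hashing, and vnode count are illustrative choices, not a prescription. Virtual nodes smooth out the balance problem noted above, and adding or removing a shard only remaps the small slice of keys adjacent to it on the ring.

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Map keys to shards; resizing the cluster only remaps a small key slice."""

    def __init__(self, nodes: list, vnodes: int = 100) -> None:
        self._ring = []                                   # sorted (hash, node) pairs
        for node in nodes:
            for i in range(vnodes):                       # virtual nodes smooth balance
                self._ring.append((self._hash(f"{node}#{i}"), node))
        self._ring.sort()

    @staticmethod
    def _hash(s: str) -> int:
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def node_for(self, key: str) -> str:
        h = self._hash(key)
        # First ring position clockwise from the key's hash, wrapping around
        idx = bisect.bisect(self._ring, (h, "")) % len(self._ring)
        return self._ring[idx][1]

ring = ConsistentHashRing(["cache-1", "cache-2", "cache-3"])
print(ring.node_for("user:42"), ring.node_for("user:43"))
```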
10.2 Hot Keys
What: A “hot key” is a cache key that’s accessed way more frequently than others (e.g., homepage_feed, top_trending).
Problem: Even with partitioning, the node holding that key gets disproportionate load, becoming a bottleneck.
Mitigations:
- Key replication: Store the hot key on multiple nodes. Clients randomly read from replicas.
- Local caching (L1): App servers cache hot keys in-process to reduce pressure on distributed cache.
- Request coalescing / singleflight: Prevent multiple clients from regenerating the same hot key simultaneously.
10.3 Scale-Out
- Vertical scale: Add more memory/CPU to a single cache node. Works only up to a point.
- Horizontal scale: Add more cache nodes and distribute data (via partitioning).
- Dynamic scaling: In cloud systems (Redis Cluster, Memcached with consistent hashing), nodes can be added/removed while keeping service online.
Key takeaways:
- Partitioning ensures load distribution.
- Hot key management prevents single-node overload.
- Scale-out is how caches evolve from single-node setups to large distributed clusters.
- Consistent hashing distributes keys evenly across cache shards and minimizes remapping on reshard.
- Hot keys (heavy read traffic) → replicate the value across multiple shards, or add local L1 caching.
- Large objects → compress; consider object segmentation (chunking) if near size limits.
```mermaid
flowchart LR
    RING((Consistent Hash Ring))
    A[keyA] --> RING
    B[keyB] --> RING
    C[keyC] --> RING
    RING --> S1[Shard 1]
    RING --> S2[Shard 2]
    RING --> S3[Shard 3]
```
11) Refresh-Ahead & Background Jobs #
- Soft TTL: Serve stale if within grace window; kick off background refresh.
- Batch refreshers: Periodically refresh top-N hot keys from logs/metrics.
- Change Data Capture (CDC): Invalidate keys on DB row changes through an event stream (worker sketch after the diagram).
```mermaid
flowchart LR
    DB[(Database)] --> CDC[Change Data Capture]
    CDC --> MQ[Message Queue]
    MQ --> Worker[Cache Worker]
    Worker --> Cache[(Cache)]
    Cache --> App[Application]
```
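A minimal cache-worker sketch for the CDC path. The event shape and key names are assumptions for illustration, and the sample messages replayed in-process stand in for a real message queue consumer (Kafka, SQS, etc.).

```python
import json
import redis

r = redis.Redis(decode_responses=True)

def handle_change_event(raw_event: str) -> None:
    """Translate one CDC/outbox event into cache invalidations.

    Assumed (illustrative) event shape: {"table": "users", "op": "UPDATE", "id": 42}
    """
    event = json.loads(raw_event)
    if event["table"] == "users":
        # Delete the cached entity; readers repopulate it via cache-aside
        r.delete(f"prod:user:{event['id']}:v1")
        # Derived/aggregate keys that embed this row should be invalidated too
        r.delete("prod:user_leaderboard:v1")

# In production these events arrive from the message queue; here we replay samples.
for msg in ['{"table": "users", "op": "UPDATE", "id": 42}',
            '{"table": "orders", "op": "INSERT", "id": 7}']:
    handle_change_event(msg)
```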
12) Anti-Patterns to Avoid #
- Caching personalized responses without including identity in the key.
- Global TTLs everywhere (ignore data semantics).
- Ignoring object size and eviction policy interactions.
- Not measuring miss cost → a cache that hides load most of the time but melts the DB on expiry.
```mermaid
flowchart TD
    A[Cache Anti-Patterns] --> B["Cache Invalidation Bugs"]
    A --> C["Thundering Herd Problems"]
    A --> D["Cache Pollution"]
    A --> E["Security Issues"]
    A --> F["Consistency Problems"]
```
13) A Minimal Checklist #
- Clear key schema with namespaces & versions.
- Choose pattern per route (aside, read-through, write-through).
- Implement stampede control (mutex/singleflight + TTL jitter).
- Separate L1 (in-proc) and L2 (Redis) with different TTLs.
- Add negative caching for 404/empty results (short TTL).
- Wire metrics: hit ratio, p99 latency, evictions, keyspace misses.
- Define SLA by endpoint (freshness vs performance).
- Test failure modes: cache down, partial shard loss, mass expiry.
14) Putting It Into Practice (Learning Path) #
- Start with cache-aside on a read-heavy endpoint; add TTL jitter.
- Add per-key locks to remove thundering herds.
- Introduce L1 in-proc cache for super-hot items; measure p99.
- Add event-based invalidation from your DB’s outbox/CDC.
- Push static endpoints to CDN with stale-while-revalidate.
- Build dashboards: hit ratio by keyspace and miss penalty.
```mermaid
timeline
    title Caching Implementation Timeline
    section Phase 1
        Basic Cache-Aside : Implement cache-aside pattern
        Add TTL Jitter : Prevent synchronized expiration
    section Phase 2
        Stampede Protection : Add per-key mutex locks
        Negative Caching : Cache missing entities
    section Phase 3
        Multi-Level Cache : Add L1 in-process cache
        Event Invalidation : Implement CDC-based invalidation
    section Phase 4
        Advanced Patterns : Write-behind/refresh-ahead
        Observability : Comprehensive metrics & tracing
```
15) Appendix: Tiny FastAPI Wrapper (Optional) #
If you prefer HTTP endpoints to try locally:
```python
from fastapi import FastAPI
from pydantic import BaseModel
import uvicorn

app = FastAPI()


class User(BaseModel):
    id: int
    name: str
    tier: str


# reuse get_with_cache(...) from the previous snippet here


@app.get("/user/{uid}")
def get_user(uid: int):
    key = f"prod:user:{uid}:v1"
    data = get_with_cache(key, lambda: load_user_from_db(f"user:{uid}"))
    return data


if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)
```
Final Thoughts #
Caching is not a single feature; it’s an architecture. When you line up placement (L0–L2), pattern (aside/read-through/etc.), invalidation, and observability, you get systems that feel instant and never fall over when they’re popular.
Fast is a feature. Design your caches like you mean it.