How to design, implement, and operate caches that make systems fast, cheap, and reliable.
Caching is one of the highest-leverage tools in systems engineering: it can cut tail latency by orders of magnitude, reduce database load, and shrink cloud bills. But poorly designed caches cause subtle correctness bugs, “thundering herd” outages, and stale-data nightmares.
This guide goes deep on what to cache, where to cache, how to keep it fresh, and how to measure it, with diagrams and runnable code.
1) Where Caches Live (Multi-Level Caching) #
flowchart LR A["Client\n(Browser/App)"] --> B["CDN/Edge Cache\n(L0)"] B --> C["Reverse Proxy Cache\n(e.g., NGINX)"] C --> D["App Server L1\n(in-process cache)"] D --> E["Distributed Cache L2\n(e.g., Redis/Memcached)"] E --> F["Primary Datastore\n(SQL/NoSQL/Search)"]
- L0 Edge/CDN: Static assets, static JSON, and cacheable API GET responses (controlled via cache headers).
- Reverse proxy: Response caching, compression, ETag handling.
- L1 in-process: Nanosecond access; great for per-instance hot keys. Loses data on restart.
- L2 distributed: Cross-instance shared cache; survives app restarts.
- DB: Source of truth.
Rule of thumb: promote the most frequently accessed and most stable data up the pyramid. The sketch below shows one read walking L1 → L2 → DB.
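To make the layering concrete, here is a minimal sketch of an L1 in-process dict in front of an L2 Redis cache in front of the database. The `load_from_db` function, TTL values, and key names are illustrative assumptions, not part of any specific stack.

```python
import json
import time
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# L1: tiny in-process cache with per-entry expiry (per instance, lost on restart)
_l1: dict = {}
L1_TTL = 5        # seconds; keep short so instances don't drift far apart
L2_TTL = 300      # seconds

def load_from_db(key: str) -> dict:
    """Hypothetical source-of-truth lookup; replace with a real query."""
    return {"key": key, "loaded_at": time.time()}

def get(key: str) -> dict:
    # 1) L1: in-process (nanosecond-to-microsecond access)
    entry = _l1.get(key)
    if entry and entry[0] > time.time():
        return entry[1]

    # 2) L2: shared distributed cache (survives app restarts)
    raw = r.get(key)
    if raw is not None:
        value = json.loads(raw)
    else:
        # 3) Source of truth, then populate L2
        value = load_from_db(key)
        r.setex(key, L2_TTL, json.dumps(value))

    # Populate L1 on the way back up
    _l1[key] = (time.time() + L1_TTL, value)
    return value
```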
2) Core Caching Patterns #
2.1 Cache-Aside (Lazy Loading)
The application fetches from cache first; on miss, read from DB and populate the cache.
```mermaid
sequenceDiagram
    participant App
    participant Cache
    participant DB
    App->>Cache: GET key
    alt Hit
        Cache-->>App: value
    else Miss
        App->>DB: SELECT ...
        DB-->>App: value
        App->>Cache: SET key=value TTL
        Cache-->>App: OK
    end
```
- Pros: Simple, minimal coupling to cache.
- Cons: First request pays DB cost; risk of stampede on popular keys.
2.2 Read-Through
The cache itself knows how to load from DB on misses (via loader/adapter).
```mermaid
sequenceDiagram
    participant App
    participant Cache
    participant DB
    App->>Cache: GET key
    alt Hit
        Cache-->>App: value
    else Miss
        Cache->>DB: Load data
        DB-->>Cache: value
        Cache->>Cache: Store value
        Cache-->>App: value
    end
```
- Pros: Centralized logic; fewer stampedes.
- Cons: Cache becomes smarter (more moving parts).
2.3 Write-Through
Writes go to cache and DB synchronously.
```mermaid
sequenceDiagram
    participant App
    participant Cache
    participant DB
    App->>Cache: SET key=value
    Cache->>DB: Write data
    DB-->>Cache: OK
    Cache-->>App: OK
```
- Pros: Reads always hot; strong consistency (cache == DB).
- Cons: Write latency includes cache + DB; higher write amplification.
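A minimal write-through sketch, using Redis for the cache and an in-memory SQLite table standing in for the primary store. For brevity the application performs both writes itself; in the diagram above, a true write-through cache layer would own the DB write.

```python
import json
import sqlite3
import redis

r = redis.Redis(decode_responses=True)
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE IF NOT EXISTS users (id INTEGER PRIMARY KEY, doc TEXT)")

def write_through(user_id: int, doc: dict, ttl: int = 300) -> None:
    """Write the DB and the cache in the same request path.

    If either step fails we raise, so the caller sees the write as not
    committed; a production version would add a transaction/retry strategy.
    """
    payload = json.dumps(doc)
    # 1) Durable write first (source of truth)
    db.execute(
        "INSERT INTO users (id, doc) VALUES (?, ?) "
        "ON CONFLICT(id) DO UPDATE SET doc = excluded.doc",
        (user_id, payload),
    )
    db.commit()
    # 2) Synchronously refresh the cache so the next read is hot
    r.setex(f"user:{user_id}", ttl, payload)
```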
2.4 Write-Behind (Write-Back)
Write to cache first; asynchronously flush to DB.
```mermaid
sequenceDiagram
    participant App
    participant Cache
    participant DB
    App->>Cache: SET key=value
    Cache-->>App: OK
    Note over Cache,DB: Async operation
    Cache->>DB: Write data
```
- Pros: Low write latency; coalesces writes.
- Cons: Risk of data loss on crash; needs durable queue/commit log.
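A minimal write-behind sketch. The in-process `queue.Queue` and the `persist_to_db` function are illustrative stand-ins; as the cons above note, a crash-safe version needs a durable queue or commit log instead.

```python
import json
import queue
import threading
import redis

r = redis.Redis(decode_responses=True)
flush_q: "queue.Queue[tuple[str, dict]]" = queue.Queue()

def write_behind(key: str, doc: dict, ttl: int = 300) -> None:
    """Acknowledge after the cache write; persist to the DB asynchronously."""
    r.setex(key, ttl, json.dumps(doc))
    flush_q.put((key, doc))          # in production: a durable queue/commit log

def persist_to_db(key: str, doc: dict) -> None:
    print("persisting", key, doc)    # stand-in for the real DB write

def flusher() -> None:
    """Background worker draining queued writes into the source of truth."""
    while True:
        key, doc = flush_q.get()
        try:
            persist_to_db(key, doc)
        finally:
            flush_q.task_done()

threading.Thread(target=flusher, daemon=True).start()
```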
2.5 Refresh-Ahead (Prefetch)
Proactively refresh items nearing expiration to avoid misses.
```mermaid
sequenceDiagram
    participant App
    participant Cache
    participant DB
    Note over Cache: Item nearing expiration
    Cache->>DB: Prefetch data
    DB-->>Cache: Updated value
    App->>Cache: GET key
    Cache-->>App: Fresh value
```
3) Invalidation: The Two Hard Things #
The old joke, usually attributed to Phil Karlton, still holds: “There are only two hard things in Computer Science: cache invalidation and naming things.”
3.1 Cache Invalidation Triggers
- Time-based: TTL/expirations (+ jitter to avoid synchronized expiry).
- Event-based: On data change (CDC, outbox pattern, domain events).
- Versioned keys: Include a version/ETag in the key (e.g., user:42:v17); see the sketch after this list.
- Negative caching: Cache “not found” briefly to stop repeated misses.
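To illustrate versioned keys, here is a small sketch (the key layout and the per-entity version counter are assumptions, using the same Redis client as Section 8). Reads build the key from the counter; a write bumps the counter, so stale entries simply stop being referenced and age out via TTL.

```python
import redis

r = redis.Redis(decode_responses=True)

def user_key(user_id: int) -> str:
    """Build a versioned key: bumping the version invalidates all old entries."""
    version = r.get(f"user:{user_id}:ver") or "0"
    return f"prod:user:{user_id}:v{version}"

def invalidate_user(user_id: int) -> None:
    """Event-based invalidation: bump the version instead of deleting keys."""
    r.incr(f"user:{user_id}:ver")   # old keys become unreachable and expire by TTL
```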
3.2 Stampede & Herding
A cache stampede (also called the dogpile effect) happens when a popular cache entry expires or is missing and many clients request it at the same time. Because the cache has no value, every request falls through to the backend or database simultaneously, and the sudden surge can overwhelm it, causing high latency or even an outage.
Example:
- A hot key like /trending or /hot_deals expires at 12:00.
- Thousands of users request it at 12:01.
- All of those requests miss the cache and hit the DB together → load spike → possible crash.
Common mitigations:
- Singleflight / mutex locks: only one request regenerates the value; the others wait.
- Staggered expiration (jitter): avoid all keys expiring at the same moment.
- Background refresh / cache warming: refresh before expiry.
In short: a cache stampede is when cache misses turn into a “thundering herd” against your backend.
```mermaid
sequenceDiagram
    participant Clients
    participant App
    participant Cache
    participant DB
    Clients->>App: GET /hot
    App->>Cache: GET hot_key (MISS)
    par Without protection
        App->>DB: SELECT hot
        Note right of App: Many apps repeat this query
    and With singleflight/mutex
        App->>App: Acquire lock hot_key
        App->>DB: SELECT hot (one query)
        DB-->>App: value
        App->>Cache: SET hot_key=value TTL
        App->>App: Release lock
    end
```
Beyond the mitigations above, soft TTLs with background refresh and probabilistic early refresh also help by spreading regeneration over time instead of concentrating it at expiry.
4) Keys, Values, and Eviction #
- Keys: Namespace + version + identifiers, e.g., prod:user:42:v17.
- Values: Compact, compressible, and serialization-friendly (JSON/MsgPack).
- Eviction: LRU, LFU, ARC/2Q. Tune per workload. Enforce max memory.
Eviction in caching refers to the process of removing items from the cache when it becomes full or when certain policies dictate that data should no longer be stored. Since cache storage is limited, eviction ensures space is freed up for newer or more relevant data.
```mermaid
flowchart TD
    A[Cache Eviction Policies] --> B[LRU<br>Least Recently Used]
    A --> C[LFU<br>Least Frequently Used]
    A --> D[FIFO<br>First In First Out]
    A --> E[TTL<br>Time To Live]
    A --> F[Random<br>Selection]
```
If you don’t measure item size and reference frequency, eviction is guesswork.
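To make one policy concrete, here is a minimal in-process LRU sketch built on OrderedDict. It is illustrative only; a distributed cache such as Redis enforces its own memory limit and eviction policy (e.g., maxmemory with allkeys-lru).

```python
from collections import OrderedDict

class LRUCache:
    """Minimal LRU: evicts the least recently used key once capacity is hit."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._data = OrderedDict()

    def get(self, key: str):
        if key not in self._data:
            return None
        self._data.move_to_end(key)          # mark as most recently used
        return self._data[key]

    def put(self, key: str, value) -> None:
        if key in self._data:
            self._data.move_to_end(key)
        self._data[key] = value
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)   # evict the least recently used entry

# Usage: with capacity 2, inserting "c" after reading "b" evicts "a".
cache = LRUCache(2)
cache.put("a", 1); cache.put("b", 2)
cache.get("b")
cache.put("c", 3)
print(cache.get("a"))   # None -> "a" was evicted
```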
5) Consistency & Freshness #
- Strong (write-through, short TTL): Cache ≈ DB all the time.
- Eventual (cache-aside): Acceptable staleness window.
- Stale-While-Revalidate: Serve the stale value while a background refresh runs (sketched after the diagram below).
- Per-route freshness: Mission-critical APIs use tighter TTL/events; analytics tolerate larger staleness.
```mermaid
flowchart TD
    A[Incoming Read] --> B{"Cache hit & fresh?"}
    B -- Yes --> C[Serve cache]
    B -- No --> D{"Stale allowed?"}
    D -- Yes --> E["Serve stale & refresh async"]
    D -- No --> F["Read DB -> update cache -> serve fresh"]
```
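A minimal stale-while-revalidate sketch, assuming the same Redis setup as Section 8. The payload shape (a `fresh_until` soft-TTL timestamp stored alongside the value) and the `loader` callback are illustrative assumptions; in practice you would combine this with the per-key lock from Section 8 so only one refresh runs per key.

```python
import json
import threading
import time
from typing import Callable, Optional

import redis

r = redis.Redis(decode_responses=True)

def get_with_soft_ttl(key: str, loader: Callable[[], dict],
                      soft_ttl: int = 60, hard_ttl: int = 300) -> Optional[dict]:
    """Serve stale data inside the grace window while refreshing in the background."""
    raw = r.get(key)
    if raw is not None:
        entry = json.loads(raw)
        if entry["fresh_until"] >= time.time():
            return entry["value"]                      # fresh: serve directly
        # Stale but present: serve it, refresh asynchronously
        threading.Thread(target=_refresh, args=(key, loader, soft_ttl, hard_ttl),
                         daemon=True).start()
        return entry["value"]
    # Miss: load synchronously (no stale value to fall back on)
    return _refresh(key, loader, soft_ttl, hard_ttl)

def _refresh(key: str, loader: Callable[[], dict],
             soft_ttl: int, hard_ttl: int) -> dict:
    value = loader()
    entry = {"value": value, "fresh_until": time.time() + soft_ttl}
    r.setex(key, hard_ttl, json.dumps(entry))          # hard TTL is the Redis expiry
    return value
```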
6) Security & Multi-Tenancy #
- Never cache secrets/PII without isolation (per-tenant namespaces, encryption).
- Vary-by-auth: Include the user/role in the cache key if the output differs by identity (see the key-builder sketch after the diagram).
- Authorization leaks happen when a shared cache serves personalized content globally.
```mermaid
flowchart LR
    A[Request] --> B{"Auth Header?"}
    B -- Yes --> C[Include in Cache Key]
    B -- No --> D[Use Generic Key]
    C --> E[Cache Lookup]
    D --> E
```
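A minimal key-builder sketch for multi-tenant, vary-by-auth caching. The key layout (namespace, tenant, version, route, identity segments) is an illustrative assumption; the point is that identity enters the key only when the response actually varies by it, and raw identifiers are hashed so they never appear in key names.

```python
import hashlib
from typing import Optional

def cache_key(tenant_id: str, route: str,
              user_id: Optional[str] = None,
              role: Optional[str] = None,
              version: str = "v1") -> str:
    """Build a namespaced, per-tenant key; add identity only when output varies by it."""
    parts = ["prod", tenant_id, version, route]
    if user_id is not None:
        # Hash the identity so raw user identifiers never leak into key names
        parts.append("u:" + hashlib.sha256(user_id.encode()).hexdigest()[:12])
    if role is not None:
        parts.append("r:" + role)
    return ":".join(parts)

# Personalized response: identity is part of the key, never shared across users
print(cache_key("acme", "/api/orders", user_id="42", role="admin"))
# Public response: one generic key per tenant
print(cache_key("acme", "/api/catalog"))
```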
7) Observability: What to Measure #
- Hit ratio (overall + per-endpoint).
- Latency p50/p90/p99 (cache vs DB).
- Evictions and OOM events.
- Keyspace misses (cache penetration by non-existent keys).
- Queue depth for write-behind/refresh jobs.
```mermaid
flowchart LR
    A[Requests] --> B["Metrics: hit/miss"]
    A --> C["Tracing: spans (cache, DB)"]
    A --> D["Logging: key, action, ms, size"]
    B --> E["Dashboards/Alerts"]
    C --> E
    D --> E
```
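These metrics only help if they are actually emitted on the hot path. Here is a tiny instrumentation sketch; the in-memory Counter and latency lists are stand-ins for a real metrics client (Prometheus, StatsD, etc.), and `fake_cache` plus the sleep are placeholders for the real cache and DB calls.

```python
import time
from collections import Counter, defaultdict

hits_misses = Counter()            # e.g. {"cache.hit": 10, "cache.miss": 2}
latency_ms = defaultdict(list)     # per-source latency samples

class timed:
    """Record how long a lookup took, labelled by source ("cache" vs "db")."""
    def __init__(self, source: str):
        self.source = source
    def __enter__(self):
        self.t0 = time.perf_counter()
    def __exit__(self, *exc):
        latency_ms[self.source].append((time.perf_counter() - self.t0) * 1000)

fake_cache = {"user:42": {"id": 42}}   # toy cache just to exercise the counters

def get_user(key: str) -> dict:
    with timed("cache"):
        value = fake_cache.get(key)
    if value is not None:
        hits_misses["cache.hit"] += 1
        return value
    hits_misses["cache.miss"] += 1
    with timed("db"):
        time.sleep(0.01)               # stand-in for the DB query
        value = {"id": key}
    fake_cache[key] = value
    return value

get_user("user:42"); get_user("user:7")
total = hits_misses["cache.hit"] + hits_misses["cache.miss"]
print("hit ratio:", hits_misses["cache.hit"] / total)
samples = sorted(latency_ms["cache"])
print("median cache latency ms:", samples[len(samples) // 2])
```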
8) Python + Redis: Cache-Aside with Stampede Protection #
8.1 Setup
Run Redis locally: docker run -p 6379:6379 redis:7
Install deps: pip install redis tenacity
8.2 Code (Cache-Aside + Per-Key Mutex + TTL Jitter)
```python
import json
import os
import random
import time
from contextlib import contextmanager
from typing import Callable, Optional

import redis
from tenacity import retry, stop_after_attempt, wait_exponential

REDIS_URL = os.getenv("REDIS_URL", "redis://localhost:6379/0")
r = redis.Redis.from_url(REDIS_URL, decode_responses=True)


def get_with_cache(key: str,
                   loader: Callable[[], Optional[dict]],
                   base_ttl_sec: int = 300,
                   jitter_pct: float = 0.1,
                   lock_ttl_sec: int = 10) -> dict:
    """Cache-aside with per-key mutex and TTL jitter."""
    # 1) Try cache
    val = r.get(key)
    if val is not None:
        return json.loads(val)

    # 2) Cache miss -> acquire per-key lock
    lock_key = f"lock:{key}"
    with acquire_lock(lock_key, ttl=lock_ttl_sec):
        # Double-check after lock (another worker may have populated)
        val = r.get(key)
        if val is not None:
            return json.loads(val)

        # 3) Load from source of truth (DB/service)
        data = loader()

        # 4) Negative caching: if not found, cache a placeholder briefly
        if data is None:
            r.setex(key, 30, json.dumps({"_not_found": True}))
            return {"_not_found": True}

        # 5) Set with jittered TTL to avoid synchronized expiry
        ttl = int(base_ttl_sec * (1 + random.uniform(-jitter_pct, jitter_pct)))
        r.setex(key, ttl, json.dumps(data))
        return data


@contextmanager
def acquire_lock(lock_key: str, ttl: int = 10):
    """Non-blocking per-key lock with auto-expiry."""
    token = str(time.time())
    ok = r.set(lock_key, token, nx=True, ex=ttl)
    if not ok:
        # Tiny busy-wait backoff to reduce the storm; in production use a
        # queue or a singleflight library instead.
        for _ in range(20):
            time.sleep(0.05)
            ok = r.set(lock_key, token, nx=True, ex=ttl)
            if ok:
                break
        # If the lock never arrives, fall through: the caller's double-check
        # usually finds the value the lock holder just wrote.
    try:
        yield
    finally:
        # Best-effort unlock: delete only if we still own it
        # (avoid deleting someone else's lock after ours expired).
        try:
            if r.get(lock_key) == token:
                r.delete(lock_key)
        except Exception:
            pass


# --- Example usage ---
FAKE_DB = {"user:42": {"id": 42, "name": "Ada Lovelace", "tier": "gold"}}


@retry(stop=stop_after_attempt(3), wait=wait_exponential(min=0.05, max=0.5))
def load_user_from_db(user_key: str) -> Optional[dict]:
    # Simulate DB latency
    time.sleep(0.2)
    return FAKE_DB.get(user_key)


if __name__ == "__main__":
    key = "user:42"
    data = get_with_cache(f"prod:{key}:v1", lambda: load_user_from_db(key))
    print("Result:", data)
```
What this demonstrates
- Cache-aside: only hits DB on misses.
- Per-key mutex: avoids dogpiles.
- TTL jitter: randomized expiry prevents synchronized stampedes.
- Negative caching: reduces penetration for missing keys.
9) HTTP Caching (CDN & Reverse Proxy) #
Leverage the platform before writing code.
9.1 Cache-Control Playbook
- Cache-Control: public, max-age=300, stale-while-revalidate=60
- ETag + If-None-Match for validation (see the sketch after the diagram).
- Vary on headers that change the response (e.g., Authorization, Accept-Encoding).
```mermaid
sequenceDiagram
    participant Client
    participant CDN
    participant Origin
    Client->>CDN: GET /api/report
    alt CDN Hit Fresh
        CDN-->>Client: 200 (cached)
    else CDN Stale but SWR
        CDN-->>Client: 200 (stale)
        CDN->>Origin: Revalidate/Fetch
        Origin-->>CDN: 200/304
    else Miss
        CDN->>Origin: GET
        Origin-->>CDN: 200 + Cache-Control
        CDN-->>Client: 200
    end
```
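A minimal origin-side sketch of this playbook using FastAPI (matching the appendix); the route and payload are illustrative. The handler emits Cache-Control, ETag, and Vary, and answers conditional requests with 304 so the CDN can revalidate cheaply.

```python
import hashlib
import json

from fastapi import FastAPI, Request, Response

app = FastAPI()

@app.get("/api/report")
def report(request: Request) -> Response:
    payload = {"total": 1234, "period": "2024-Q1"}   # illustrative body
    body = json.dumps(payload)
    etag = '"' + hashlib.sha256(body.encode()).hexdigest()[:16] + '"'

    headers = {
        # CDN/browser may cache for 5 min, then serve stale for 60s while revalidating
        "Cache-Control": "public, max-age=300, stale-while-revalidate=60",
        "ETag": etag,
        "Vary": "Accept-Encoding",
    }

    # Conditional request: client/CDN already has this version -> 304, no body
    if request.headers.get("if-none-match") == etag:
        return Response(status_code=304, headers=headers)

    return Response(content=body, media_type="application/json", headers=headers)
```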
10) Partitioning, Hot Keys, and Scale-Out #
10.1 Partitioning (Sharding)
Why: A single cache node has limits (memory, CPU, network). As data grows, you split it across multiple nodes.
How:
- Key-based hashing: The cache key is hashed and mapped to a specific node.
- Consistent hashing: Minimizes re-shuffling when nodes are added/removed (see the ring sketch after this list).
- Challenge: Monitoring balance — if keys aren’t evenly distributed, some nodes may get overloaded.
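A minimal consistent-hash ring sketch; the node names, MD5 hashing, and vnode count are illustrative choices, not a prescription. Virtual nodes smooth out the balance problem noted above, and adding or removing a shard only remaps the small slice of keys adjacent to it on the ring.

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Map keys to shards; resizing the cluster only remaps a small key slice."""

    def __init__(self, nodes: list, vnodes: int = 100) -> None:
        self._ring = []                                   # sorted (hash, node) pairs
        for node in nodes:
            for i in range(vnodes):                       # virtual nodes smooth balance
                self._ring.append((self._hash(f"{node}#{i}"), node))
        self._ring.sort()

    @staticmethod
    def _hash(s: str) -> int:
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def node_for(self, key: str) -> str:
        h = self._hash(key)
        # First ring position clockwise from the key's hash, wrapping around
        idx = bisect.bisect(self._ring, (h, "")) % len(self._ring)
        return self._ring[idx][1]

ring = ConsistentHashRing(["cache-1", "cache-2", "cache-3"])
print(ring.node_for("user:42"), ring.node_for("user:43"))
```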
10.2 Hot Keys
What: A “hot key” is a cache key that’s accessed way more frequently than others (e.g., homepage_feed, top_trending).
Problem: Even with partitioning, the node holding that key gets disproportionate load, becoming a bottleneck.
Mitigations:
- Key replication: Store the hot key on multiple nodes. Clients randomly read from replicas.
- Local caching (L1): App servers cache hot keys in-process to reduce pressure on distributed cache.
- Request coalescing / singleflight: Prevent multiple clients from regenerating the same hot key simultaneously.
10.3 Scale-Out
- Vertical scale: Add more memory/CPU to a single cache node. Works only up to a point.
- Horizontal scale: Add more cache nodes and distribute data (via partitioning).
- Dynamic scaling: In cloud systems (Redis Cluster, Memcached with consistent hashing), nodes can be added/removed while keeping service online.
Key takeaways:
- Partitioning ensures load distribution.
- Hot key management prevents single-node overload.
- Scale-out is how caches evolve from single-node setups to large distributed clusters.
- Consistent hashing distributes keys evenly across cache shards and minimizes remapping on reshard.
- Hot keys (heavy read traffic) → replicate the value across multiple shards, or add local L1 caching.
- Large objects → compress; consider object segmentation (chunking) if near size limits.
```mermaid
flowchart LR
    RING((Consistent Hash Ring))
    A[keyA] --> RING
    B[keyB] --> RING
    C[keyC] --> RING
    RING --> S1[Shard 1]
    RING --> S2[Shard 2]
    RING --> S3[Shard 3]
```
11) Refresh-Ahead & Background Jobs #
- Soft TTL: Serve stale if within grace window; kick off background refresh.
- Batch refreshers: Periodically refresh top-N hot keys from logs/metrics.
- Change Data Capture (CDC): Invalidate keys on DB row changes through an event stream (worker sketch after the diagram).
```mermaid
flowchart LR
    DB[(Database)] --> CDC[Change Data Capture]
    CDC --> MQ[Message Queue]
    MQ --> Worker[Cache Worker]
    Worker --> Cache[(Cache)]
    Cache --> App[Application]
```
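A minimal cache-worker sketch for the CDC path. The event shape and key names are assumptions for illustration, and the sample messages replayed in-process stand in for a real message queue consumer (Kafka, SQS, etc.).

```python
import json
import redis

r = redis.Redis(decode_responses=True)

def handle_change_event(raw_event: str) -> None:
    """Translate one CDC/outbox event into cache invalidations.

    Assumed (illustrative) event shape: {"table": "users", "op": "UPDATE", "id": 42}
    """
    event = json.loads(raw_event)
    if event["table"] == "users":
        # Delete the cached entity; readers repopulate it via cache-aside
        r.delete(f"prod:user:{event['id']}:v1")
        # Derived/aggregate keys that embed this row should be invalidated too
        r.delete("prod:user_leaderboard:v1")

# In production these events arrive from the message queue; here we replay samples.
for msg in ['{"table": "users", "op": "UPDATE", "id": 42}',
            '{"table": "orders", "op": "INSERT", "id": 7}']:
    handle_change_event(msg)
```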
12) Anti-Patterns to Avoid #
- Caching personalized responses without including identity in the key.
- Global TTLs everywhere (ignore data semantics).
- Ignoring object size and eviction policy interactions.
- Not measuring miss cost → a cache that hides load most of the time but melts the DB on expiry.
```mermaid
flowchart TD
    A[Cache Anti-Patterns] --> B["Cache Invalidation Bugs"]
    A --> C["Thundering Herd Problems"]
    A --> D["Cache Pollution"]
    A --> E["Security Issues"]
    A --> F["Consistency Problems"]
```
13) A Minimal Checklist #
- Clear key schema with namespaces & versions.
- Choose pattern per route (aside, read-through, write-through).
- Implement stampede control (mutex/singleflight + TTL jitter).
- Separate L1 (in-proc) and L2 (Redis) with different TTLs.
- Add negative caching for 404/empty results (short TTL).
- Wire metrics: hit ratio, p99 latency, evictions, keyspace misses.
- Define SLA by endpoint (freshness vs performance).
- Test failure modes: cache down, partial shard loss, mass expiry.
14) Putting It Into Practice (Learning Path) #
- Start with cache-aside on a read-heavy endpoint; add TTL jitter.
- Add per-key locks to remove thundering herds.
- Introduce L1 in-proc cache for super-hot items; measure p99.
- Add event-based invalidation from your DB’s outbox/CDC.
- Push static endpoints to CDN with stale-while-revalidate.
- Build dashboards: hit ratio by keyspace and miss penalty.
```mermaid
timeline
    title Caching Implementation Timeline
    section Phase 1
        Basic Cache-Aside : Implement cache-aside pattern
        Add TTL Jitter : Prevent synchronized expiration
    section Phase 2
        Stampede Protection : Add per-key mutex locks
        Negative Caching : Cache missing entities
    section Phase 3
        Multi-Level Cache : Add L1 in-process cache
        Event Invalidation : Implement CDC-based invalidation
    section Phase 4
        Advanced Patterns : Write-behind/refresh-ahead
        Observability : Comprehensive metrics & tracing
```
15) Appendix: Tiny FastAPI Wrapper (Optional) #
If you prefer HTTP endpoints to try locally:
```python
from fastapi import FastAPI
from pydantic import BaseModel
import uvicorn

app = FastAPI()


class User(BaseModel):
    id: int
    name: str
    tier: str


# reuse get_with_cache(...) from the previous snippet here


@app.get("/user/{uid}")
def get_user(uid: int):
    key = f"prod:user:{uid}:v1"
    data = get_with_cache(key, lambda: load_user_from_db(f"user:{uid}"))
    return data


if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)
```
Final Thoughts #
Caching is not a single feature; it’s an architecture. When you line up placement (L0–L2), pattern (aside/read-through/etc.), invalidation, and observability, you get systems that feel instant and never fall over when they’re popular.
Fast is a feature. Design your caches like you mean it.