Level 3 · 25 min

Troubleshooting

Redis performance issues manifest as high latency, high miss rates, or memory exhaustion. Systematic diagnosis using Redis's built-in observability tools is essential for senior engineers.

Memory and Eviction

INFO memory returns the core memory health metrics. used_memory is the number of bytes Redis has obtained from its allocator; used_memory_rss is the resident set size (RSS) reported by the operating system. mem_fragmentation_ratio = used_memory_rss / used_memory. A ratio above 1.5 is an alert threshold: Redis holds significantly more OS memory than its data occupies, usually because of allocator fragmentation from repeated allocations and frees of varying sizes. Above 2.0 is critical: consider enabling active defragmentation or scheduling a restart during off-peak hours.

CONFIG SET activedefrag yes enables online defragmentation (Redis 4.0+, and only when Redis is built with jemalloc). Tune it with active-defrag-ignore-bytes 100mb (minimum amount of fragmentation waste before defrag starts), active-defrag-threshold-lower 10 (minimum fragmentation percentage before defrag starts), and active-defrag-max-scan-fields 1000 (maximum number of set, hash, zset, or list fields processed per scan step).

MEMORY USAGE key SAMPLES 5 returns the bytes consumed by a key, including structural overhead. For nested types the total is estimated from the sampled elements; SAMPLES 0 samples every element, which is exact but expensive for large collections. MEMORY DOCTOR returns a human-readable diagnostic summary. OBJECT ENCODING key reveals whether a data structure uses a compact encoding (listpack, called ziplist before Redis 7.0, or intset) or a full encoding (hashtable, skiplist).
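The fragmentation thresholds above can be checked mechanically. The following is a minimal sketch, assuming the raw text of an INFO memory reply is already available as a string; the helper names parse_info and fragmentation_status are hypothetical, and the sample values are made up for illustration.

```python
def parse_info(text):
    """Parse the 'field:value' lines of an INFO reply into a dict of strings."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and section headers such as "# Memory"
        key, _, value = line.partition(":")
        fields[key] = value
    return fields


def fragmentation_status(info_text, alert=1.5, critical=2.0):
    """Return (ratio, status) using the 1.5 / 2.0 thresholds from the text."""
    info = parse_info(info_text)
    used = int(info["used_memory"])
    rss = int(info["used_memory_rss"])
    ratio = rss / used  # same definition as mem_fragmentation_ratio
    if ratio > critical:
        status = "critical"
    elif ratio > alert:
        status = "alert"
    else:
        status = "ok"
    return ratio, status


# Illustrative sample, not real server output.
sample = """# Memory
used_memory:1073741824
used_memory_rss:1932735283
"""
ratio, status = fragmentation_status(sample)
print(f"{ratio:.2f} {status}")
```

Computing the ratio from the two raw counters, rather than reading mem_fragmentation_ratio directly, keeps the sketch independent of how a given Redis version rounds the reported value.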

Diagnosing High Miss Rate

A production miss-rate investigation: a recommendations service showed a 61% hit rate despite a correct Cache-Aside implementation. INFO keyspace showed db0:keys=12000,expires=11800, so nearly all keys had TTLs. INFO stats showed evicted_keys growing at 8,000/minute: Redis was under memory pressure and was evicting keys before they could be read again. maxmemory was set to 2GB; used_memory was 1.98GB. The miss cascade: a request misses the cache (the key was evicted) → a DB query fires → the cache is populated → the key is evicted again almost immediately by the next eviction cycle → the next request misses again. Fix: increased maxmemory to 4GB (with an additional replica for HA) and switched the eviction policy from volatile-lru to allkeys-lfu to favor keys that are actually hot. Hit rate recovered to 94% within 10 minutes.

OBJECT FREQ key (available only when maxmemory-policy is volatile-lfu or allkeys-lfu) returns the 8-bit logarithmic access frequency counter. Use it to identify which keys Redis considers cold or hot before deciding what to evict manually or which TTLs to extend.

Production insight from Redis in Action: Carlson documents that comparing application throughput against redis-benchmark output is instructive. If your application achieves only 25–30% of single-client redis-benchmark throughput (vs. the expected 50–60%), the most common root cause is creating a new connection per command rather than reusing a connection pool, which adds connection setup and round-trip overhead that dominates command execution time. Check connected_clients and blocked_clients in INFO clients as a first diagnostic step before investigating slower Redis-internal causes.
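Because keyspace_hits, keyspace_misses, and evicted_keys are cumulative counters, the figures in an investigation like this one come from the difference between two INFO stats samples. A minimal sketch follows, assuming each sample has already been parsed into a dict of integers; the function name interval_rates and the counter values are hypothetical, chosen to reproduce the 61% and 8,000/minute figures above.

```python
def interval_rates(before, after, interval_seconds):
    """Compare two INFO stats snapshots taken interval_seconds apart.

    Each snapshot is a dict holding the cumulative integer counters
    keyspace_hits, keyspace_misses and evicted_keys.
    Returns (hit_rate, evictions_per_minute); hit_rate is None when
    no lookups happened during the interval.
    """
    hits = after["keyspace_hits"] - before["keyspace_hits"]
    misses = after["keyspace_misses"] - before["keyspace_misses"]
    evicted = after["evicted_keys"] - before["evicted_keys"]
    lookups = hits + misses
    hit_rate = hits / lookups if lookups else None
    evictions_per_minute = evicted * 60 / interval_seconds
    return hit_rate, evictions_per_minute


# Made-up counters sampled 60 seconds apart.
t0 = {"keyspace_hits": 1_000_000, "keyspace_misses": 500_000, "evicted_keys": 200_000}
t1 = {"keyspace_hits": 1_006_100, "keyspace_misses": 503_900, "evicted_keys": 208_000}

hit_rate, evictions = interval_rates(t0, t1, 60)
print(f"hit rate {hit_rate:.0%}, evictions/min {evictions:.0f}")
```

Working on deltas matters here: the lifetime hit rate of a long-running server can look healthy while the most recent interval is dominated by misses.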

Latency Issues and Persistence

SLOWLOG mechanics: CONFIG SET slowlog-log-slower-than 10000 captures commands whose execution takes longer than 10ms (the value is in microseconds). SLOWLOG LEN shows the number of buffered entries. SLOWLOG GET N returns the N most recent entries; each includes an entry ID, a Unix timestamp, the execution time in microseconds, the full command with arguments, and the client IP:port. SLOWLOG RESET clears the buffer.

Latency monitoring: there is no LATENCY MONITOR command. The monitor is enabled with CONFIG SET latency-monitor-threshold <milliseconds> (0, the default, disables it) and records spikes under event names such as command, fast-command, fork, aof-write, and expire-cycle. LATENCY HISTORY event returns a time series of latency spikes for that event. LATENCY LATEST shows the most recent and the peak spike per event.

DEBUG SLEEP 0.1 injects an artificial 100ms delay, which is a simple way to verify client timeout behavior and circuit breaker logic without touching data. It blocks the whole server for that duration, so every connected client stalls; run it against a test or staging instance rather than production. Redis 7.0+ also disables DEBUG by default (see enable-debug-command).

Persistence tradeoffs: appendfsync always calls fsync after every write, giving maximum durability at roughly 1/3 the throughput of everysec. appendfsync everysec (recommended) calls fsync once per second; it risks about one second of writes on a crash and is roughly 2-5× faster than always. Both ratios are rough and depend on the storage device. no-appendfsync-on-rewrite yes skips fsync while a BGSAVE or BGREWRITEAOF child process is running, reducing rewrite-induced latency spikes at the cost of potential data loss during the rewrite window.
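Raw SLOWLOG GET output is easier to act on once it is grouped by command. The sketch below assumes the entries have already been fetched and converted into dicts with a command string (or bytes) and a duration in microseconds; that shape is an assumption, loosely modelled on what Python clients tend to return, and the function name summarize_slowlog and the sample entries are hypothetical.

```python
from collections import defaultdict


def summarize_slowlog(entries):
    """Group slow log entries by command name.

    Returns a list of (name, count, total_us, max_us) tuples,
    sorted so the command with the largest total time comes first.
    """
    totals = defaultdict(lambda: [0, 0, 0])  # count, total_us, max_us
    for entry in entries:
        command = entry["command"]
        if isinstance(command, bytes):
            command = command.decode("utf-8", errors="replace")
        name = command.split(" ", 1)[0].upper()  # first token is the command name
        bucket = totals[name]
        bucket[0] += 1
        bucket[1] += entry["duration"]
        bucket[2] = max(bucket[2], entry["duration"])
    rows = [(name, c, total, peak) for name, (c, total, peak) in totals.items()]
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows


# Made-up entries for illustration.
entries = [
    {"command": "KEYS user:*", "duration": 48000},
    {"command": b"HGETALL big:hash", "duration": 15000},
    {"command": "keys session:*", "duration": 52000},
]
for name, count, total_us, max_us in summarize_slowlog(entries):
    print(f"{name:8} count={count} total={total_us / 1000:.0f}ms max={max_us / 1000:.0f}ms")
```

Sorting by total time rather than by single worst entry tends to surface commands that are only moderately slow but called often.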

Key Takeaways

SLOWLOG GET is the first tool to check for latency issues — identifies specific slow commands.
KEYS pattern in production is O(N) and blocks all other commands — use SCAN instead.
A high mem_fragmentation_ratio (above 1.5) means Redis is holding much more OS memory than its data needs; consider active defragmentation or an off-peak restart.

Code example

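# Run from a shell; redis-cli connects to 127.0.0.1:6379 by default.

# Diagnose slow commands
redis-cli SLOWLOG GET 10   # 10 most recent slow commands
redis-cli SLOWLOG RESET    # clear the slow log

# Memory analysis
redis-cli INFO memory | grep -E '^(used_memory|used_memory_rss|mem_fragmentation_ratio):'
# used_memory:1073741824        (1GB data)
# used_memory_rss:2147483648    (2GB OS allocation)
# mem_fragmentation_ratio:2.00  ← high, consider activedefrag or a restart

# Hit rate
redis-cli INFO stats | grep -E '^keyspace_(hits|misses):'
# keyspace_hits:950000
# keyspace_misses:50000
# hit_rate = hits / (hits + misses) = 950000 / 1000000 = 95%

# Safe alternative to KEYS (use in production)
redis-cli SCAN 0 MATCH 'user:*' COUNT 100   # returns the next cursor and one batch of keys
# Repeat with the returned cursor until it comes back as 0.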