Level 3 · 30 min

Garbage Collection

Garbage collection is the most complex tuning dimension in production Java systems. Choosing the wrong GC algorithm or misconfiguring pause budgets is a primary source of latency spikes, SLA violations, and unexpected outages. A senior engineer must be able to read GC logs, diagnose problems, and tune intelligently.

GC Algorithms: G1GC, ZGC, Shenandoah

G1GC (default since Java 9) divides the heap into equal-sized regions and collects incrementally, targeting a pause goal set via -XX:MaxGCPauseMillis (default 200 ms). It handles large heaps (> 4 GB) well but can still fall back to a full GC pause if concurrent marking falls behind the allocation rate. ZGC (production-ready since Java 15) is designed for very short pauses, typically under a millisecond in recent JDKs, that do not grow with heap size; it uses colored pointers and load barriers to do most of its work concurrently. Shenandoah (from Red Hat, in OpenJDK since Java 12 and production-ready since Java 15) similarly targets low pauses with concurrent compaction. Serial GC and Parallel GC are fully stop-the-world collectors: Parallel GC is optimized for throughput, Serial GC for small heaps and single-CPU machines, and both have pauses that grow with heap size.
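Before tuning, it helps to confirm which collector the JVM actually selected. G1 is the default only on machines the JVM treats as server-class (at least 2 CPUs and roughly 2 GB of memory); in small containers it can silently fall back to Serial GC. A minimal sketch using the standard java.lang.management API follows; the class name ActiveCollectors is only illustrative.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

public class ActiveCollectors {
    /** Returns the names of the collectors registered by the running JVM. */
    public static List<String> collectorNames() {
        List<String> names = new ArrayList<>();
        for (GarbageCollectorMXBean bean : ManagementFactory.getGarbageCollectorMXBeans()) {
            names.add(bean.getName());
        }
        return names;
    }

    public static void main(String[] args) {
        // With G1 the list typically includes "G1 Young Generation" and
        // "G1 Old Generation"; exact names depend on the collector and JDK version.
        collectorNames().forEach(System.out::println);
    }
}
```

Running it with the same flags and container limits as the production service shows whether the intended collector is really in use.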

GC Log Analysis and Key Metrics

Enable GC logging with -Xlog:gc*:file=gc.log:time,uptime in Java 9+. Key metrics: pause time (stop-the-world duration), allocation rate (MB/s; high rates overwhelm the collector), promotion rate (objects moving from Young to Old), heap occupancy before/after GC (shows whether the heap is sized correctly), and GC throughput (% of time NOT in GC). A full GC is a red flag: it means concurrent collection cannot keep up with the allocation rate.

G1 sizes its regions between 1 and 32 MB, aiming for roughly 2048 regions, and prioritizes collection of the regions with the most garbage (the 'Garbage First' heuristic). Its key tuning knob, -XX:MaxGCPauseMillis (default 200 ms), is a soft target, not a hard limit: G1 tries to stay under it by limiting the number of regions it collects per pause. If the allocation rate exceeds what concurrent marking can keep up with, G1 falls back to a stop-the-world Full GC, which appears in the log as a 'Pause Full (...)' entry, for example 'Pause Full (G1 Evacuation Pause)'; the cause shown in parentheses varies by JDK version.

ZGC (Java 15+) and Shenandoah aim for sub-millisecond pauses at the cost of roughly 5-15% throughput overhead. Production insight: an allocation rate above ~1 GB/s on a 4 GB heap typically causes G1 to struggle, so reduce object churn or increase the heap before tuning pause targets.
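The metrics above can be pulled out of a log with a few lines of code. The following is a rough sketch, not a replacement for GCEasy or GCViewer: it assumes G1-style pause summary lines that carry the uptime decorator, heap sizes printed in M, and a trailing duration in ms. The class GcLogSummary and its fields are hypothetical names, and lines in other shapes (for example ZGC pauses, which carry no heap transition) are simply skipped.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GcLogSummary {
    // Assumed line shape (uptime decorator, heap sizes in M, pause in ms), e.g.
    // [12.345s] GC(7) Pause Young (Normal) (G1 Evacuation Pause) 512M->128M(1024M) 8.250ms
    private static final Pattern PAUSE = Pattern.compile(
        "\\[(\\d+\\.\\d+)s\\].*?\\bPause (\\w+).*?(\\d+)M->(\\d+)M\\(\\d+M\\)\\s+(\\d+\\.\\d+)ms\\s*$");

    public static final class Summary {
        public int pauseCount;
        public int fullGcCount;
        public double totalPauseMs;
        public double maxPauseMs;
        public double allocatedMb;   // growth between one GC's "after" and the next GC's "before"
        public double firstUptimeSec;
        public double lastUptimeSec;

        /** Percentage of JVM uptime (up to the last logged pause) not spent paused. */
        public double throughputPercent() {
            double uptimeMs = lastUptimeSec * 1000.0;
            return uptimeMs <= 0 ? 100.0 : 100.0 * (uptimeMs - totalPauseMs) / uptimeMs;
        }

        /** Average allocation rate in MB/s between the first and last logged pause. */
        public double allocationRateMbPerSec() {
            double span = lastUptimeSec - firstUptimeSec;
            return span <= 0 ? 0.0 : allocatedMb / span;
        }
    }

    public static Summary summarize(List<String> lines) {
        Summary s = new Summary();
        long previousAfterMb = -1;
        for (String line : lines) {
            Matcher m = PAUSE.matcher(line);
            if (!m.find()) continue;   // skips start lines, phase details, concurrent phases
            double uptime = Double.parseDouble(m.group(1));
            long beforeMb = Long.parseLong(m.group(3));
            long afterMb = Long.parseLong(m.group(4));
            double pauseMs = Double.parseDouble(m.group(5));

            if (s.pauseCount == 0) s.firstUptimeSec = uptime;
            s.lastUptimeSec = uptime;
            s.pauseCount++;
            if (m.group(2).equals("Full")) s.fullGcCount++;
            s.totalPauseMs += pauseMs;
            s.maxPauseMs = Math.max(s.maxPauseMs, pauseMs);
            if (previousAfterMb >= 0) s.allocatedMb += Math.max(0, beforeMb - previousAfterMb);
            previousAfterMb = afterMb;
        }
        return s;
    }

    public static void main(String[] args) throws java.io.IOException {
        // Usage: java GcLogSummary gc.log
        Summary s = summarize(java.nio.file.Files.readAllLines(java.nio.file.Paths.get(args[0])));
        System.out.printf("pauses=%d full=%d max=%.3fms throughput=%.2f%% alloc=%.1fMB/s%n",
            s.pauseCount, s.fullGcCount, s.maxPauseMs, s.throughputPercent(),
            s.allocationRateMbPerSec());
    }
}
```

The allocation rate here is an estimate: it treats heap growth between consecutive pauses as newly allocated memory, which ignores anything reclaimed concurrently in between.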

Throughput vs Latency Trade-offs

No GC algorithm gives you both maximum throughput and minimum latency simultaneously. Parallel GC maximizes throughput by doing all collection work in stop-the-world phases, at the cost of long STW pauses. G1GC balances the two with a pause target. ZGC/Shenandoah minimize pause time but use more CPU for concurrent work, reducing application throughput by ~5-15%. The classic tuning knobs: heap size (more heap → less frequent GC but longer full GC), -XX:NewRatio (the Old-to-Young size ratio; a lower value means a larger Young gen → less promotion → fewer Old gen GCs), and -XX:MaxGCPauseMillis (lower target → shorter pauses but potentially more frequent GC). With G1, avoid fixing the Young generation size explicitly, since that overrides the pause-time goal.

Key Takeaways

ZGC is the first choice when the requirement is sub-millisecond GC pauses. It supports multi-terabyte heaps, with pauses typically measured in microseconds.
A full GC under G1GC means concurrent marking cannot keep up. Fix: increase heap size, reduce allocation rate, or tune -XX:InitiatingHeapOccupancyPercent.
GC throughput < 95% (more than 5% of time in GC) is a serious problem. Profile allocation hot spots before tuning GC flags.
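The 95% guideline can also be checked from inside a running JVM, without parsing logs. The sketch below sums the collection time reported by the standard GarbageCollectorMXBean API and compares it with JVM uptime. The class GcOverhead is an illustrative name. One caveat is that for concurrent collectors such as ZGC or Shenandoah the reported time may include concurrent phases, so the figure can overstate time actually spent paused.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcOverhead {
    /** Share of elapsed time not attributed to GC, as a percentage. */
    public static double throughputPercent(long gcMillis, long uptimeMillis) {
        if (uptimeMillis <= 0) return 100.0;
        return 100.0 * (uptimeMillis - gcMillis) / uptimeMillis;
    }

    /** Sums the accumulated collection time across all registered collectors. */
    public static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean bean : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = bean.getCollectionTime();   // -1 if the collector does not report it
            if (t > 0) total += t;
        }
        return total;
    }

    public static void main(String[] args) {
        long uptime = ManagementFactory.getRuntimeMXBean().getUptime();
        double pct = throughputPercent(totalGcMillis(), uptime);
        System.out.printf("GC throughput: %.2f%%%s%n",
            pct, pct < 95.0 ? "  (below the 95% guideline)" : "");
    }
}
```

In a real service the same two calls would typically be sampled periodically and exported as a metric, so that the throughput is computed over a recent window rather than over the whole process lifetime.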

Code example

# Java 9+ unified logging — essential for diagnosis
-Xlog:gc*:file=gc.log:time,uptime:filecount=5,filesize=20m

# G1GC with aggressive pause target
-XX:+UseG1GC -XX:MaxGCPauseMillis=100
-XX:InitiatingHeapOccupancyPercent=35

# ZGC for ultra-low latency
-XX:+UseZGC -Xms4g -Xmx4g

# Shenandoah
-XX:+UseShenandoahGC

# Analyze with GCEasy or GCViewer, or search the log directly:
# grep "Pause Full" gc.log               (full GC events)
# grep -i "to-space exhausted" gc.log    (G1 failure mode; capitalization varies by JDK version)