Level 2 · 25 min

Aggregations

Elasticsearch aggregations transform your data into analytics — counts, sums, histograms, and multi-dimensional facets. Understanding the three aggregation families and how to combine them is essential for building dashboards and search facets.

Metric Aggregations

Metric aggregations compute numeric values from a set of documents. Single-value metrics return one number: avg, sum, max, min, value_count, and cardinality (an approximate distinct count based on HyperLogLog++). Multi-value metrics return several numbers: stats (min, max, avg, sum, and count in one pass), extended_stats (adds sum_of_squares, variance, and std_deviation, among others), percentiles (approximate, using the t-digest algorithm by default), and percentile_ranks. Cardinality and percentiles use probabilistic algorithms: they trade some precision for bounded memory. The precision_threshold parameter on cardinality controls this trade-off: counts below the threshold are expected to be close to exact, and a higher threshold costs more memory.
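As a rough sketch of how these metrics might look in a request, the Python snippet below builds a search body that combines a stats, a cardinality, and a percentiles aggregation. It only constructs and prints JSON and does not contact a cluster. The price field is borrowed from the code example at the end of this lesson; the customer_id field, the aggregation names, and the helper function are assumptions made for illustration.

```python
import json


def metric_aggs_body(precision_threshold=3000):
    """Build a search body with single-value and multi-value metric aggregations."""
    return {
        "size": 0,  # return aggregation results only, no search hits
        "aggs": {
            # multi-value metric: min, max, avg, sum, and count in one pass
            "price_stats": {"stats": {"field": "price"}},
            # single-value metric: approximate distinct count
            "unique_customers": {
                "cardinality": {
                    "field": "customer_id",  # hypothetical field name
                    "precision_threshold": precision_threshold,
                }
            },
            # multi-value metric: approximate percentiles
            "price_percentiles": {
                "percentiles": {"field": "price", "percents": [50, 95, 99]}
            },
        },
    }


body = metric_aggs_body()
print(json.dumps(body, indent=2))
```

The printed JSON could be sent as the body of POST /orders/_search. Raising precision_threshold should make the distinct count more accurate at the cost of more memory per aggregation.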

Bucket Aggregations

Bucket aggregations group documents into buckets and optionally apply sub-aggregations within each bucket. terms groups by field value and returns the top N values by document count (or ordered by a sub-aggregation metric). date_histogram creates time buckets (calendar_interval: day/week/month, or fixed_interval: 1h/7d) and is essential for time-series analysis. range creates custom numeric ranges. nested handles fields mapped with the nested type (arrays of objects), allowing sub-aggregations to reference fields of the nested documents. composite paginates through all unique combinations of values from multiple sources.

Aggregating on a field has a memory cost that depends on how the field is stored. Gormley and Tong describe the problem with fielddata: "Fielddata is used in several places in Elasticsearch: sorting on a field, aggregations on a field, certain filters, scripts that refer to fields. This can consume a lot of memory, especially for high-cardinality string fields — string fields that have many unique values — like the body of an email." (Clinton Gormley & Zachary Tong, Elasticsearch: The Definitive Guide)

Modern Elasticsearch relies on doc_values (on-disk columnar storage, enabled by default for numeric, date, and keyword fields) instead of fielddata, which avoids heap pressure. Text fields have no doc_values; since 5.0, aggregating on a text field requires fielddata to be explicitly enabled (fielddata: true in the mapping), a setting that is best avoided in production. Aggregating on a keyword sub-field is the usual alternative.

The cardinality metric aggregation, often used as a sub-aggregation inside buckets, estimates unique value counts using the HyperLogLog++ algorithm. It trades perfect accuracy for bounded memory: the default precision_threshold is 3000, and memory use is roughly precision_threshold × 8 bytes, or about 24 KB per aggregation instance at the default. Accuracy is not strictly guaranteed, but the error typically stays within a few percent even for very high cardinalities.
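To make the bucket types in this section more concrete, here is a minimal Python sketch that builds two request bodies: a terms aggregation ordered by a sub-aggregation metric, and one page of a composite aggregation. It only builds JSON and does not contact a cluster. The category, price, and date fields come from the code example at the end of this lesson; the function names, aggregation names, and the sample after value are hypothetical.

```python
import json


def top_categories_by_revenue(size=5):
    """terms buckets ordered by a sub-aggregation metric instead of document count."""
    return {
        "size": 0,
        "aggs": {
            "by_category": {
                "terms": {
                    "field": "category",
                    "size": size,
                    # order buckets by the sum computed in the sub-aggregation
                    "order": {"total_revenue": "desc"},
                },
                "aggs": {"total_revenue": {"sum": {"field": "price"}}},
            }
        },
    }


def composite_page(after_key=None, page_size=100):
    """One page of a composite aggregation over (category, day) combinations."""
    composite = {
        "size": page_size,
        "sources": [
            {"category": {"terms": {"field": "category"}}},
            {"day": {"date_histogram": {"field": "date", "calendar_interval": "day"}}},
        ],
    }
    if after_key is not None:
        # resume after the last bucket key returned by the previous page
        composite["after"] = after_key
    return {"size": 0, "aggs": {"category_by_day": {"composite": composite}}}


print(json.dumps(top_categories_by_revenue(), indent=2))
# hypothetical after_key, shaped like the one a previous response would return
print(json.dumps(composite_page({"category": "books", "day": 1700000000000}), indent=2))
```

In a real pagination loop, the after value for each request would be taken from the after_key field of the previous response, and the loop would stop once a response returns no buckets.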

Pipeline Aggregations

Pipeline aggregations operate on the output of other aggregations rather than directly on documents. Parent pipeline aggregations sit inside a bucket aggregation and compute a value for each of its buckets: derivative (rate of change between buckets), cumulative_sum, moving_fn (sliding-window functions such as a rolling average; it replaces the older moving_avg, which was removed in 8.0), and bucket_selector (drops buckets based on their metric values). Sibling pipeline aggregations sit next to a bucket aggregation and compute a single value from all of its buckets: max_bucket, min_bucket, avg_bucket, sum_bucket. Pipeline aggregations support analytics such as trend analysis, simple anomaly detection, and threshold filtering without post-processing in application code.
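The difference between parent and sibling pipelines is mostly a matter of where the aggregation is placed and how buckets_path is written. The Python sketch below builds a request body that shows both; it only constructs JSON and does not contact a cluster. The date and price fields come from the code example at the end of this lesson, while the aggregation names, the helper function, and the revenue threshold of 1000 are assumptions for illustration.

```python
import json


def daily_revenue_trend_body(min_revenue=1000):
    """Daily revenue with parent pipelines per bucket and one sibling pipeline."""
    return {
        "size": 0,
        "aggs": {
            "per_day": {
                "date_histogram": {"field": "date", "calendar_interval": "day"},
                "aggs": {
                    "revenue": {"sum": {"field": "price"}},
                    # parent pipelines: placed inside the histogram, one value per bucket
                    "revenue_change": {"derivative": {"buckets_path": "revenue"}},
                    "running_total": {"cumulative_sum": {"buckets_path": "revenue"}},
                    # parent pipeline that drops buckets below a hypothetical threshold
                    "busy_days_only": {
                        "bucket_selector": {
                            "buckets_path": {"rev": "revenue"},
                            "script": f"params.rev > {min_revenue}",
                        }
                    },
                },
            },
            # sibling pipeline: placed next to the histogram, one value overall
            "best_day": {"max_bucket": {"buckets_path": "per_day>revenue"}},
        },
    }


print(json.dumps(daily_revenue_trend_body(), indent=2))
```

Parent pipelines reference a metric in the same bucket by name (revenue), while the sibling pipeline uses the aggregation>metric path syntax (per_day>revenue) to reach into the neighbouring histogram.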

Key Takeaways

Metric aggregations compute statistics from document values. Bucket aggregations group documents. Pipeline aggregations compute from other aggregation results.
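cardinality and percentiles use approximate algorithms (HyperLogLog++, t-digest). They are memory-efficient but not exact. value_count is exact, but it counts values, not distinct values; for an exact distinct count, one option is to page through a composite aggregation on the field and count the buckets.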
Sub-aggregations nest inside bucket aggregations, enabling powerful multi-level analytics — e.g., top categories by revenue, with average price per category.

Code example

POST /orders/_search
{
  "size": 0,
  "query": {"range": {"date": {"gte": "now-30d"}}},
  "aggs": {
    "by_category": {
      "terms": {"field": "category", "size": 5},
      "aggs": {
        "total_revenue": {"sum": {"field": "price"}},
        "avg_price": {"avg": {"field": "price"}}
      }
    }
  }
}