Level 2 · 25 min

Aggregation Pipeline

The MongoDB aggregation pipeline processes documents through a sequence of stages, each transforming the data. It is the primary tool for analytics, reporting, and complex data transformations in MongoDB.

Core Pipeline Stages

$match filters documents (like WHERE in SQL) — put it first to reduce the working set. Uses indexes when first in the pipeline. $group groups documents by an _id expression and computes accumulators ($sum, $avg, $max, $min, $push, $addToSet) — like SQL GROUP BY. $project reshapes documents: include/exclude fields, compute new fields with expressions, rename fields. $sort sorts documents (can use an index if immediately after $match). $limit and $skip for pagination. $unwind deconstructs an array field — creates one output document per array element. Essential for aggregating on nested array values. $addFields adds new computed fields without removing existing ones.

$lookup and $unwind

$lookup performs a left outer join between the current collection and another collection. Basic syntax: from (target collection), localField (field in current doc), foreignField (field in target collection), as (output array field name). The result of $lookup is an array — even for single matches. Use $unwind after $lookup to flatten the joined documents. $lookup with pipeline (MongoDB 3.6+) allows complex join conditions and additional filtering on the joined collection. Important performance note: $lookup is not like a SQL JOIN — it executes a separate query per document in the pipeline. Ensure the foreignField has an index. Key insight from MongoDB: The Definitive Guide (3rd ed., Bradshaw, Brazil, Chodorow): '$lookup' was introduced in MongoDB 3.2, with more complex joins using multiple conditions and unrelated subqueries added in 3.6. The authors are explicit that 'MongoDB's treatment of joins were architectural decisions to allow for greater scalability, because both of those features are difficult to provide in a distributed system.' Use $unwind carefully — when applied to an array with N elements, it produces N output documents, potentially multiplying your pipeline working set by orders of magnitude before the $group stage collapses it back.

Pipeline Optimization

MongoDB automatically optimizes some pipeline patterns: consecutive $match stages are merged, $sort + $limit is optimized into a top-N sort. Manual optimization best practices: put $match as early as possible to reduce the number of documents processed by subsequent stages. Put $project to drop unused fields early — reduces memory usage for later stages. Use $match before $unwind to filter documents before array expansion (avoids multiplying document count). allowDiskUse: true enables spilling to disk for large aggregations that exceed the 100 MB memory limit — use for batch analytics, not real-time queries.

Key Takeaways

$match and $project early in the pipeline are the most impactful optimizations — they reduce the working set for all subsequent stages.
$lookup executes a query per input document — index the foreignField. Use $lookup with pipeline for complex conditions instead of multiple separate queries.
$unwind before $lookup avoids joining inflated documents. $unwind after $lookup flattens results. Order matters significantly for both correctness and performance.

Code example

db.orders.aggregate([\n  // Stage 1: filter last 90 days (uses index)\n  {$match: {createdAt: {$gte: new Date(Date.now() - 90*24*60*60*1000)}}}  ,\n  // Stage 2: join users collection\n  {$lookup: {from: "users", localField: "userId", foreignField: "_id", as: "user"}}',\n  // Stage 3: flatten joined user array\n  {$unwind: "$user"}',\n  // Stage 4: group by user, compute totals\n  {$group: {_id: "$userId", totalValue: {$sum: "$amount"}, orderCount: {$sum: 1}, avgValue: {$avg: "$amount"}}},\n  // Stage 5: sort by total value desc\n  {$sort: {totalValue: -1}}',\n  // Stage 6: top 10\n  {$limit: 10}\n])