Level 1 · 20 min

Data Modeling

MongoDB's flexible document model enables rich data structures, but schema design decisions made early are hard to reverse. Understanding when to embed vs reference data is the most critical MongoDB design decision — it determines query patterns, performance, and scalability.

Embedding vs Referencing

Embedding places related data inside the same document. It works best for data that is accessed together (no join needed), data with bounded growth (embedded arrays stay manageable), one-to-one or one-to-few relationships, and data that belongs to the parent (an address belongs to a user).

Referencing stores the _id of a document in another collection (typically an ObjectId), much like a foreign key, although MongoDB does not enforce it. It works best for large or unbounded data (comments on a viral post), data shared across many documents (a product referenced by many orders), and data accessed independently (a user profile read without the user's orders).

The key insight: MongoDB does not resolve references automatically. Every reference costs either a separate query from the application or a $lookup aggregation stage, which performs the join on the server at additional cost.
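To make the cost of a reference concrete, here is a minimal sketch of what resolving one involves. The sample documents and the helper lookupInMemory are hypothetical. The helper is only an in-memory stand-in for the left outer join that a $lookup stage performs, and it ignores the array and null matching rules of the real operator.

```javascript
// Hypothetical sample data; collection and field names are illustrative.
const posts = [
  { _id: "post_1", title: "MongoDB Data Modeling" },
  { _id: "post_2", title: "Indexing Basics" },
];
const comments = [
  { _id: "c1", post_id: "post_1", body: "Great overview" },
  { _id: "c2", post_id: "post_1", body: "Thanks" },
];

// The aggregation stage that would resolve the reference on the server,
// e.g. db.posts.aggregate(lookupPipeline).
const lookupPipeline = [
  {
    $lookup: {
      from: "comments",
      localField: "_id",
      foreignField: "post_id",
      as: "comments",
    },
  },
];

// In-memory stand-in for that stage (an assumption for illustration):
// a left outer join that attaches every matching child as an array.
function lookupInMemory(parents, children, { localField, foreignField, as }) {
  return parents.map((parent) => ({
    ...parent,
    [as]: children.filter((child) => child[foreignField] === parent[localField]),
  }));
}

const joined = lookupInMemory(posts, comments, lookupPipeline[0].$lookup);
console.log(JSON.stringify(joined, null, 2));
```

A post without comments still appears in the result with an empty array, which is the left outer join behavior. With embedding, the same read would be a single document fetch with no join step at all.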

1-N Relationship Patterns

Three patterns cover one-to-many relationships.

Embedded array (one-to-few): embed the N side as an array in the parent document, e.g., a blog post with its tags. Reads are simple and fast, but the array must stay bounded because the whole document is subject to the 16 MB limit.

Referenced (one-to-many): each N-side document carries a parent_id field, e.g., orders with a customer_id. This allows unbounded N and lets N-side documents be queried independently.

Extended reference (one-to-squillions): when N is truly huge (IoT events, log entries), store the parent_id on the child documents and denormalize a subset of parent fields into each child so that reads do not need a separate lookup. This trades storage, and extra writes whenever a copied parent field changes, for read performance.

Key insight from MongoDB: The Definitive Guide (3rd ed., Bradshaw, Brazil, Chodorow): the 16 MB document limit is not arbitrary; it enforces schema discipline. As the authors note, 'all documents must be smaller than 16 MB... it is mostly intended to prevent bad schema design and ensure consistent performance.' Use Object.bsonsize(doc) in the legacy mongo shell, or bsonsize(doc) in mongosh, to measure a document's actual BSON size before committing to an embedded design.

The Bucket Pattern is designed for time-series workloads: group N measurements into one bucket document (e.g., one hour of sensor readings), with each bucket holding start and end timestamps. This can reduce document count by orders of magnitude and enables efficient range queries within a time window.
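The Bucket Pattern can be sketched in plain JavaScript as follows. This is an illustration under assumptions, not a prescribed schema: the function bucketByHour, the field names (sensor_id, start, end, count, measurements), and the sample readings are all hypothetical, and buckets are aligned to UTC hours.

```javascript
const HOUR_MS = 60 * 60 * 1000;

// Group raw readings into one bucket document per sensor per UTC hour.
function bucketByHour(readings) {
  const buckets = new Map();
  for (const { sensor_id, ts, value } of readings) {
    // Round the timestamp down to the start of its hour.
    const startMs = Math.floor(ts.getTime() / HOUR_MS) * HOUR_MS;
    const key = `${sensor_id}:${startMs}`;
    if (!buckets.has(key)) {
      buckets.set(key, {
        sensor_id,
        start: new Date(startMs),
        end: new Date(startMs + HOUR_MS), // exclusive upper bound
        count: 0,
        measurements: [],
      });
    }
    const bucket = buckets.get(key);
    bucket.measurements.push({ ts, value });
    bucket.count += 1;
  }
  return [...buckets.values()];
}

// Hypothetical sample readings.
const sampleReadings = [
  { sensor_id: "s1", ts: new Date("2024-01-01T10:05:00Z"), value: 21.5 },
  { sensor_id: "s1", ts: new Date("2024-01-01T10:35:00Z"), value: 21.9 },
  { sensor_id: "s1", ts: new Date("2024-01-01T11:02:00Z"), value: 22.4 },
];

const hourlyBuckets = bucketByHour(sampleReadings);
console.log(hourlyBuckets.map((b) => `${b.start.toISOString()} count=${b.count}`));
```

Three readings collapse into two bucket documents here. With buckets shaped like this, a time-window query could match on sensor_id together with the start and end fields rather than scanning individual readings.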

Schema Anti-Patterns

Massive arrays: embedding an unbounded array (all comments on a post) risks hitting the 16 MB document limit and slows updates, because the entire document is rewritten.
Bloated documents: embedding infrequently accessed data (full order history in a user document) increases memory pressure, because MongoDB loads the entire document into the working set even when only a few fields are needed.
Unnecessary indexes: every index adds write overhead and consumes RAM.
Too many collections: MongoDB can handle many collections, but overly granular ones (one per tenant) create management complexity.
Normalization for its own sake: unlike SQL, normalizing to eliminate all duplication often hurts performance in MongoDB. Controlled denormalization is idiomatic.
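One common way to avoid the massive-array anti-pattern is to embed only the newest few items and keep the full history in a referenced collection. The sketch below assumes a hypothetical recent_comments field and hypothetical helper names. $push with the $each and $slice modifiers are standard update operators, while applyCappedPush is only an in-memory stand-in that shows the effect of the update.

```javascript
// Build an update document that appends one comment and keeps only the
// newest `cap` entries embedded. A negative $slice keeps the last N items.
function buildCappedPush(comment, cap) {
  if (!Number.isInteger(cap) || cap < 1) {
    throw new RangeError("cap must be a positive integer");
  }
  return {
    $push: {
      recent_comments: { $each: [comment], $slice: -cap },
    },
  };
}

// In-memory stand-in (an assumption for illustration) showing what the
// update above would do to a document.
function applyCappedPush(doc, comment, cap) {
  const next = [...(doc.recent_comments ?? []), comment];
  return { ...doc, recent_comments: next.slice(-cap) };
}

// Hypothetical post that embeds at most three recent comments.
let post = { _id: "post_1", title: "MongoDB Data Modeling", recent_comments: [] };
for (let i = 1; i <= 5; i++) {
  post = applyCappedPush(post, { body: `comment ${i}` }, 3);
}

// Only comments 3, 4, and 5 remain embedded.
console.log(post.recent_comments.map((c) => c.body));
```

Against a live database the update document would be passed to something like db.posts.updateOne({ _id: "post_1" }, buildCappedPush(comment, 3)), with each comment also inserted into a separate comments collection so that nothing is lost when it ages out of the embedded array.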

Key Takeaways

Embed when data is accessed together and has bounded growth. Reference when data is unbounded, shared, or independently queried.
MongoDB does not join automatically: each reference needs a second query or a $lookup stage, and $lookup adds cost. Design schemas around your query patterns, not normalization theory.
The 16 MB document limit is a hard constraint on embedded arrays. Design with growth in mind — what fits today may not fit in 6 months.

Code example

// Embedding (one-to-few, bounded)
{
  "_id": "post_1",
  "title": "MongoDB Data Modeling",
  "tags": ["mongodb", "nosql", "database"],
  "author": {"name": "Alice", "email": "alice@example.com"}
}

// Referencing (one-to-many, unbounded)
// Comment document: {"post_id": "post_1", "body": "...", "user_id": "user_1"}
// Query comments: db.comments.find({post_id: "post_1"})