Level 3 · 25 min

Analyzers

Analyzers control how text is transformed before being indexed and how query strings are parsed. Choosing the right analyzer determines whether users find what they are looking for — wrong analyzer choices are invisible bugs that show up as missing results.

Analyzer Pipeline

An analyzer consists of three components applied in sequence. Character filters transform raw text before tokenization (for example, html_strip removes HTML tags and mapping replaces characters). The tokenizer splits text into tokens: standard splits on Unicode word boundaries and drops most punctuation, keyword keeps the entire string as one token, and whitespace splits only on whitespace characters. Token filters then transform the tokens: lowercasing, stop word removal, stemming, synonym injection, edge n-grams. The analyze API lets you test any analyzer: send POST /_analyze with an analyzer and a text. Every text field uses one analyzer at index time and, optionally, a different one (search_analyzer) at query time.
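The three stages can be modeled in a few lines of Python. This is a toy sketch, not how Elasticsearch implements analysis: html_strip, whitespace_tokenizer, lowercase, and stop here are simplified stand-ins written for this example, and the stop word list is an assumption.

```python
import re
from typing import Callable, List

CharFilter = Callable[[str], str]
Tokenizer = Callable[[str], List[str]]
TokenFilter = Callable[[List[str]], List[str]]

# Illustrative stop word list (an assumption, not the built-in English list).
STOP_WORDS = {"the", "is", "at"}


def html_strip(text: str) -> str:
    """Crude stand-in for a character filter: replace anything tag-like with a space."""
    return re.sub(r"<[^>]+>", " ", text)


def whitespace_tokenizer(text: str) -> List[str]:
    """Split on runs of whitespace only."""
    return text.split()


def lowercase(tokens: List[str]) -> List[str]:
    return [t.lower() for t in tokens]


def stop(tokens: List[str]) -> List[str]:
    """Drop tokens that exactly match a stop word (case-sensitive on purpose)."""
    return [t for t in tokens if t not in STOP_WORDS]


def analyze(
    text: str,
    char_filters: List[CharFilter],
    tokenizer: Tokenizer,
    token_filters: List[TokenFilter],
) -> List[str]:
    """Apply character filters, then one tokenizer, then token filters, in order."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens


raw = "<p>The Laptop is at HOME</p>"
good_order = analyze(raw, [html_strip], whitespace_tokenizer, [lowercase, stop])
bad_order = analyze(raw, [html_strip], whitespace_tokenizer, [stop, lowercase])

print(good_order)  # ['laptop', 'home']
print(bad_order)   # ['the', 'laptop', 'home']
```

The second call shows why the order of token filters matters: running the stop filter before lowercasing lets 'The' slip through, because it is compared against the stop list while still capitalized.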

Language Analyzers and Stemming

Language-specific analyzers handle the morphological rules of each language. The english analyzer lowercases, removes English stop words (the, is, at), and applies Porter stemming (running → run, runs → run). Porter stemming strips suffixes by rule, so irregular forms are not conflated: 'ran' stays 'ran'. The snowball token filter is an alternative stemmer. Language analyzers improve recall (finding 'running' when searching 'run') at the cost of some precision. Stemming can be too aggressive: 'universal' and 'university' may both stem to 'univers'. For multilingual content, use the icu_tokenizer (via the ICU plugin), which handles Unicode word boundaries correctly across scripts.

The Definitive Guide describes how the choice of analyzer changes the output for the string 'Set the shape to semi-transparent by calling set_trans(5)'. The standard analyzer produces the tokens 'set, the, shape, to, semi, transparent, by, calling, set_trans, 5': it splits on Unicode word boundaries and lowercases. The simple analyzer produces 'set, the, shape, to, semi, transparent, by, calling, set, trans': it splits on any non-letter, so the underscore-joined token is broken apart and the digit is dropped. (Clinton Gormley & Zachary Tong, Elasticsearch: The Definitive Guide)

A custom analyzer composes the three stages in the index settings: zero or more character filters (strip HTML, normalize characters), exactly one tokenizer (whitespace, standard, uax_url_email, pattern, ngram), and zero or more token filters (lowercase, stop words, stemming, synonyms, ASCII folding). Define it when you create the index where possible. Adding an analyzer to an existing index requires closing the index, updating the settings, and reopening it. The analyzer of an already-mapped field cannot be changed in place, so existing data must be reindexed to pick up a different one.
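The standard versus simple comparison on the Definitive Guide's example string can be approximated with two regular expressions. This is a rough sketch: standard_like and simple_like are hypothetical helper names, and the patterns are only meant to reproduce the token lists quoted above for this one ASCII string. The real standard tokenizer follows the Unicode text segmentation rules, which differ from \w+ on other inputs.

```python
import re
from typing import List

TEXT = "Set the shape to semi-transparent by calling set_trans(5)"


def standard_like(text: str) -> List[str]:
    """Approximation: lowercase, then keep runs of letters, digits, and underscores."""
    return re.findall(r"\w+", text.lower())


def simple_like(text: str) -> List[str]:
    """Approximation: lowercase, then keep runs of letters only."""
    return re.findall(r"[^\W\d_]+", text.lower())


standard_tokens = standard_like(TEXT)
simple_tokens = simple_like(TEXT)

print(standard_tokens)
# ['set', 'the', 'shape', 'to', 'semi', 'transparent', 'by', 'calling', 'set_trans', '5']
print(simple_tokens)
# ['set', 'the', 'shape', 'to', 'semi', 'transparent', 'by', 'calling', 'set', 'trans']
```

A query for 'set_trans' would find the document only under the first tokenization, which is the kind of silent difference worth checking with the analyze API before committing to a mapping.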

Custom Analyzers for Autocomplete

Search-as-you-type (autocomplete) requires matching partial words. The standard approach uses edge n-grams: at index time, 'laptop' generates 'l', 'la', 'lap', 'lapt', 'lapto', 'laptop'. At search time the query must not be expanded into n-grams. Use an analyzer that tokenizes and lowercases but omits the edge n-gram filter, such as standard, or a keyword tokenizer with a lowercase filter when the whole input should be matched as a single prefix. If the index-time analyzer also ran at search time, 'lapt' would be expanded into 'l', 'la', 'lap', 'lapt', and the short grams would match every indexed word that starts with 'l' or 'la', producing false positives. The search_analyzer parameter in the field mapping specifies the query-time analyzer.
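The effect of n-gramming the query can be seen in a small simulation. This is a minimal sketch under stated assumptions: edge_ngrams, index_terms, and the three sample product names are made up for illustration, tokenization is a plain whitespace split, and a hit is any shared term (mirroring the default OR behavior of a match query).

```python
from typing import Dict, List, Set


def edge_ngrams(token: str, min_gram: int = 1, max_gram: int = 20) -> List[str]:
    """Return the prefixes of token from min_gram up to max_gram characters."""
    upper = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, upper + 1)]


def index_terms(text: str) -> Set[str]:
    """Index-time analysis: lowercase, split on whitespace, expand to edge n-grams."""
    terms: Set[str] = set()
    for token in text.lower().split():
        terms.update(edge_ngrams(token))
    return terms


def plain_query_terms(query: str) -> Set[str]:
    """Search-time analysis without n-grams: lowercase and split only."""
    return set(query.lower().split())


def hits(index: Dict[str, Set[str]], query_terms: Set[str]) -> List[str]:
    """A document matches if it shares at least one term with the query."""
    return sorted(name for name, terms in index.items() if terms & query_terms)


# Hypothetical product names, indexed with edge n-grams.
index = {name: index_terms(name) for name in ["Laptop", "Lamp", "Lens"]}

print(edge_ngrams("laptop"))
# ['l', 'la', 'lap', 'lapt', 'lapto', 'laptop']

correct = hits(index, plain_query_terms("lapt"))   # query kept whole
noisy = hits(index, index_terms("lapt"))           # query n-grammed too

print(correct)  # ['Laptop']
print(noisy)    # ['Lamp', 'Laptop', 'Lens']
```

With the query left whole, only documents whose indexed prefixes include 'lapt' match. When the query is n-grammed as well, the single gram 'l' is enough to pull in every product starting with that letter.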

Key Takeaways

Analyzer pipeline: character filters → tokenizer → token filters. Test with POST /_analyze before indexing production data.
Language analyzers improve recall via stemming but can reduce precision. Test with domain-specific vocabulary — stemming rules are not always correct for technical terms.
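For autocomplete with edge n-grams: use a custom analyzer at index time (generates n-grams) and an analyzer without the n-gram filter at search time, such as standard (prevents n-gram vs n-gram false positives). The search-time analyzer must still lowercase if the index-time analyzer does.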

Code example

// Test an analyzer before indexing
POST /_analyze
{
  "analyzer": "english",
  "text": "The quick brown foxes are jumping"
}
// Returns tokens: quick, brown, fox, jump (stemmed, stops removed)

// Custom autocomplete analyzer in mapping
// edge_ngram_filter must be defined under "filter" before the analyzer can
// reference it; the gram sizes below are illustrative
PUT /products
{
  "settings": {
    "analysis": {
      "filter": {
        "edge_ngram_filter": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 20
        }
      },
      "analyzer": {
        "autocomplete": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "edge_ngram_filter"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "autocomplete",
        "search_analyzer": "standard"
      }
    }
  }
}