In our projects, we use Elasticsearch for indexing data. It’s a full-text search engine built on top of Lucene. Recently I was investigating a bug where the system wasn’t finding something it was supposed to find with a specific query. Looking through the index metadata, I saw something like this:
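A mapping along these lines is what prompted the question (a reconstructed sketch; the field and analyzer names here are placeholders, not our actual mapping):

```json
{
  "title": {
    "type": "string",
    "index_analyzer": "my_edge_ngram_analyzer",
    "search_analyzer": "standard"
  }
}
```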
This was something I hadn’t seen before. Usually we just specified “analyzer”, but for this field we had two. Why would one use different analyzers for indexing and for search?
It turns out that the edgeNGram tokenizer we use there is exactly a case where this is useful. But first, a bit about analyzers and tokenizers.
Elasticsearch uses tokenizers to split text into tokens, and token filters to apply additional processing to those tokens. An analyzer combines one tokenizer with zero or more token filters. There are built-in analyzers: standard, simple, whitespace, keyword, etc. There are quite a few built-in tokenizers and filters, too. But if they don’t fit the bill, one can always define custom analyzers, and custom tokenizers based on regular expressions.
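As a sketch of what a custom analyzer definition looks like (the name "my_analyzer" is made up for illustration), here is one combining the built-in whitespace tokenizer with the lowercase token filter, placed in the index settings:

```json
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": ["lowercase"]
        }
      }
    }
  }
}
```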
To see how an analyzer processes text, you can use the Elasticsearch Analyze API like this:
curl -XGET 'http://localhost:9200/_analyze?analyzer=english&pretty=true' -d 'This is a test, as you might have noticed.'
curl -XGET 'http://localhost:9200/_analyze?tokenizer=whitespace&pretty=true' -d 'This is a test, as you might have noticed.'
The nGram tokenizer is used to search for partial matches. Its min_gram and max_gram parameters set the minimum and maximum length of the n-grams it produces. The edgeNGram tokenizer is similar, except that it only produces the n-grams anchored at the beginning of a token. So, if the token is “elastic”, min_gram is 3 and max_gram is 10, it’ll index “ela”, “elas”, “elast”, “elasti” and “elastic”.
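The token generation above can be sketched in a few lines of Python (an illustration of the idea, not Elasticsearch’s implementation):

```python
def edge_ngrams(token, min_gram, max_gram):
    """Produce the prefixes of `token` with lengths min_gram..max_gram."""
    return [token[:n] for n in range(min_gram, min(max_gram, len(token)) + 1)]

print(edge_ngrams("elastic", 3, 10))
# ['ela', 'elas', 'elast', 'elasti', 'elastic']
```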
And this is exactly the case where it makes sense to use a different analyzer at search time. When we type “elastic” into the search box, we don’t want to search for “ela”, “elas”, “elast”, “elasti” and “elastic”; we just want to find “elastic”. Running the query through all those shorter n-grams would give us too wide a result set. So this is where we use our edgeNGram analyzer as index_analyzer, but something else as search_analyzer.
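A toy sketch of the effect (made-up documents, and a deliberately simplified model of an inverted index) shows how analyzing the query with edge n-grams widens the result set:

```python
def edge_ngrams(token, min_gram=3, max_gram=10):
    return [token[:n] for n in range(min_gram, min(max_gram, len(token)) + 1)]

# Index-time: every document token is expanded into its edge n-grams.
docs = {"doc1": "elastic", "doc2": "elaborate"}
index = {doc_id: set(edge_ngrams(tok)) for doc_id, tok in docs.items()}

def search(query_terms, index):
    """Return the ids of documents whose indexed terms overlap the query terms."""
    return sorted(d for d, grams in index.items() if grams & set(query_terms))

# Search-time with a plain analyzer: the query stays one token, "elastic",
# and matches only doc1, whose indexed n-grams contain "elastic".
print(search(["elastic"], index))            # ['doc1']

# Search-time with the edgeNGram analyzer: the query also expands,
# and the shared "ela" prefix drags in doc2 as well.
print(search(edge_ngrams("elastic"), index)) # ['doc1', 'doc2']
```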
(Thanks to javanna from Stack Overflow for the great explanation.)