Elasticsearch – why have index_analyzer AND search_analyzer

In our projects, we use Elasticsearch for indexing data. It’s a text search engine built on top of Lucene. Recently I was investigating a bug where the system wasn’t finding something it should have found with a specific query. Looking through the index metadata, I saw something like this:

"url": {
    "include_in_all": false,
    "index_analyzer": "edgyUrlText",
    "search_analyzer": "edgyQuery",
    "type": "string"
}

This was something I hadn’t seen before. Usually we just specified an “analyzer”, but this field had two. Why would one use different analyzers for indexing and for search?

It turns out that the edgeNGram tokenizer we use there is exactly the kind of case where this is useful. But first, a bit about analyzers and tokenizers.

Elasticsearch uses tokenizers to split text into tokens and token filters to apply additional processing. An analyzer usually has one tokenizer and can have several (or no) token filters. There are standard analyzers: standard, simple, whitespace, keyword, etc. There are quite a few standard tokenizers and filters, too. But if they don’t fit the bill, one can always define custom tokenizers (for example, based on a regexp) and analyzers.
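As a sketch of what that looks like (the analyzer name `my_analyzer` is made up, and the syntax is for the pre-2.0 Elasticsearch of that era), a custom analyzer is declared in the index settings by combining a tokenizer with token filters:

```json
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": ["lowercase", "stop"]
        }
      }
    }
  }
}
```

A field can then reference `my_analyzer` by name in its mapping.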

To see how an analyzer processes text, you can use the Elasticsearch analyze API like this:

curl -XGET 'http://localhost:9200/_analyze?analyzer=english&pretty=true' -d 'This is a test, as you might have noticed.'
curl -XGET 'http://esd01.dev.dal.us.publishthis.com:9200/_analyze?tokenizer=whitespace&pretty=true' -d 'This is a test, as you might have noticed.'

The ngram tokenizer is used to search for partial matches. It has min_gram and max_gram configuration parameters that set the length of the ngrams. EdgeNGram is much like it, except that it only takes the ngrams anchored at the beginning of a token. So if the token is “elastic”, min_gram is, say, 3 and max_gram is 10, it’ll index “ela”, “elas”, “elast”, “elasti” and “elastic”.
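A hedged sketch of how such an analyzer might be set up (the name edgyUrlText comes from our mapping above, but the actual tokenizer settings in our index are an assumption on my part):

```json
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "edge_tokenizer": {
          "type": "edgeNGram",
          "min_gram": 3,
          "max_gram": 10
        }
      },
      "analyzer": {
        "edgyUrlText": {
          "type": "custom",
          "tokenizer": "edge_tokenizer",
          "filter": ["lowercase"]
        }
      }
    }
  }
}
```

Running the _analyze API with this analyzer on “elastic” would show exactly the prefix tokens listed above.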

And this is exactly the case where it makes sense to use a different analyzer at search time. When we search for “elastic”, we don’t want to look up “ela”, “elas”, “elast” and so on; we just want to find “elastic”. Expanding the query into all those shorter ngrams would give us far too wide a result set. So this is where we use our edgeNGram analyzer as the index_analyzer, but something else as the search_analyzer.
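To sketch the other half (again an assumption: I don’t know how edgyQuery is actually defined in our index, and the “page” type name is invented), the search analyzer could simply tokenize the query without generating any ngrams, with the mapping wiring the two together:

```json
{
  "settings": {
    "analysis": {
      "analyzer": {
        "edgyQuery": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "page": {
      "properties": {
        "url": {
          "type": "string",
          "include_in_all": false,
          "index_analyzer": "edgyUrlText",
          "search_analyzer": "edgyQuery"
        }
      }
    }
  }
}
```

This way the index contains all the prefixes, but a query for “elastic” is matched only as the whole token.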

(Thanks javanna from StackOverflow. This was a great explanation.)


About Maryna Cherniavska

I have productively spent 10+ years in the IT industry, designing, developing, building and deploying desktop and web applications, designing database structures and otherwise proving that females have a place among software developers. And this is a good place.
This entry was posted in ElasticSearch.
