MongoDB 2.4 and the new (not yet improved) text search

In the 2.4 release, MongoDB acquired some very useful and long-awaited features. One of them is full-text search. It’s still in beta, though, and therefore not recommended for production use. Still, you can enable it to experiment and see whether it works for you.

As a beta feature, it’s not enabled by default, so you first have to enable it explicitly. This can be done in two ways: either with a startup parameter

mongod --setParameter textSearchEnabled=true

or you can set a parameter in mongo shell:

use admin
db.runCommand({ "setParameter": 1, "textSearchEnabled": true })

Then you have to create a text index on the selected fields.
Suppose we have a collection of documents describing pages on the web. An example document:

{ 
	"_id" : ObjectId("5155ed42e3b71cacf23cfcd7"), 
	"id" : "1f5e2f6b89ef796b2fad9ad7d42df7e3c0028b37", 
	"url" : "http://www.arkansasonline.com/news/2013/feb/08/regrets-bomb-plot-man-says-plea-20130208/", 
	"title" : "Regrets bomb plot, man says in plea", 
	"summary" : "A Bangladesh native accused of trying to blow up the Federal Reserve Bank in New York with what he thought was a 1,000-pound car bomb pleaded guilty Thursday to terrorism charges stemming from an FBI sting." 
}

We want to search for text in title and summary. Note that a collection can have at most one text index, so two separate text indexes on the same collection are not an option. A text index on a single field looks like this:

db.articles.ensureIndex({ "title" : "text" })

To cover both fields, create one compound text index instead (drop the single-field index first if you have already created it):

db.articles.ensureIndex(
	{
		"title" : "text",
		"summary" : "text" 
	},
	{
		default_language: "english",
		weights: 
		{
			title: 2
		}
	})

The weights set the relative importance of the fields when matches are scored. The default weight is 1, so in this case we want a match in the title to count twice as much as a match in the summary.
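To get a feel for what the weights do, here is a deliberately simplified sketch in plain JavaScript. It is not MongoDB’s actual scoring formula, which is more involved; the function name `toyScore` and the rule "weight times number of occurrences" are assumptions made purely for illustration. It runs under Node.js; in the mongo shell, replace `console.log` with `print`.

```javascript
// Toy model only: score = sum over fields of (weight * occurrences of the term).
// MongoDB's real scoring is more involved; this just shows the effect of weights.
function toyScore(doc, term, weights) {
  var total = 0;
  var needle = term.toLowerCase();
  Object.keys(weights).forEach(function (field) {
    var words = String(doc[field] || "").toLowerCase().split(/[^a-z0-9]+/);
    var hits = words.filter(function (w) { return w === needle; }).length;
    total += weights[field] * hits;
  });
  return total;
}

// Title taken from the example document; the summary is shortened.
var doc = {
  title: "Regrets bomb plot, man says in plea",
  summary: "he thought was a 1,000-pound car bomb"
};

console.log(toyScore(doc, "bomb", { title: 2, summary: 1 })); // 3
console.log(toyScore(doc, "bomb", { title: 1, summary: 1 })); // 2
```

With the title weighted at 2, the single title match contributes as much as two summary matches would.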
To search for text, you should run a text command against the collection:

db.articles.runCommand( "text", { search: "china" } )

The text command is case-insensitive, so the case of the query doesn’t matter. The result looks like this:

{ 
	"queryDebugString" : "china||||||", 
	"language" : "english", 
	"results" : [ 
		{ 
			"score" : 1.1428571428571428, 
			"obj" : { 
				"_id" : ObjectId("5155ed42e3b71cacf23cfcdb"), 
				"id" : "1261e9d8a597677d6390860b2c3c01ddbf6cbf7c", 
				"url" : "http://www.npr.org/templates/story/story.php?storyId=171456338&ft=1&f=", 
				"title" : "Nissan Quarterly Profit Dives On China Sales Slump", 
				"summary" : "Nissan quarterly profit dives on China sales slump" 
			} 
		} 
	], 
	"stats" : { 
		"nscanned" : 1, 
		"nscannedObjects" : 0, 
		"n" : 1, 
		"nfound" : 1, 
		"timeMicros" : 229 
	}, 
	"ok" : 1 
}
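The response is an ordinary document, so it is easy to post-process. Below is a minimal sketch of a helper that reduces it to score and title pairs. The name `summarizeResults` is hypothetical, and the sample object is a trimmed copy of the response above, keeping only the fields the helper reads. It runs under Node.js; in the mongo shell, replace `console.log` with `print`.

```javascript
// Hypothetical helper: reduce a text-command response to score + title pairs.
function summarizeResults(response) {
  if (!response || response.ok !== 1) {
    return [];
  }
  return response.results.map(function (r) {
    return { score: r.score, title: r.obj.title };
  });
}

// Trimmed copy of the response shown above (only the fields used here).
var sample = {
  results: [
    {
      score: 1.1428571428571428,
      obj: { title: "Nissan Quarterly Profit Dives On China Sales Slump" }
    }
  ],
  stats: { n: 1, nfound: 1 },
  ok: 1
};

var rows = summarizeResults(sample);
console.log(rows[0].title + " (" + rows[0].score.toFixed(2) + ")");

// In the mongo shell the helper could be fed the live response instead:
// summarizeResults(db.articles.runCommand("text", { search: "china" }))
```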

To search for any of several words (a logical OR), just separate them with spaces in the search query:

db.articles.runCommand( "text", { search: "native pilot bomb reserve" } ) 
{ 
	"queryDebugString" : "bomb|nativ|pilot|reserv||||||", 
	"language" : "english", 
	"results" : [ 
		{ 
			"score" : 4.145833333333334, 
			"obj" : { 
				"_id" : ObjectId("5155ed42e3b71cacf23cfcd7"), 
				"id" : "1f5e2f6b89ef796b2fad9ad7d42df7e3c0028b37", 
				"url" : "http://www.arkansasonline.com/news/2013/feb/08/regrets-bomb-plot-man-says-plea-20130208/", 
				"title" : "Regrets bomb plot, man says in plea", 
				"summary" : "A Bangladesh native accused of trying to blow up the Federal Reserve Bank in New York with what he thought was a 1,000-pound car bomb pleaded guilty Thursday to terrorism charges stemming from an FBI sting." 
			} 
		} 
	], 
	"stats" : { 
		"nscanned" : 3, 
		"nscannedObjects" : 0, 
		"n" : 1, 
		"nfound" : 1, 
		"timeMicros" : 234 
	}, 
	"ok" : 1 
}

The debug string shows the actual query run against the db (as opposed to a query constructed by a user):

"queryDebugString" : "bomb|nativ|pilot|reserv||||||"

As you can see, the query is tokenized and stemmed, that is, only the stem of each word is left. The suffixes are stripped so that every form of the word matches the query.
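To make the idea concrete, here is a toy stemmer in plain JavaScript that reproduces the stems from the debug string. MongoDB relies on a proper stemming algorithm; the suffix list and the names `toyStem` and `toyDebugString` are assumptions made for illustration only, and the sketch leaves out the trailing `||||||` part of the real debug string.

```javascript
// Toy stemmer: strips a few common English suffixes. The rule list is an
// assumption for illustration; it is not what MongoDB actually runs.
var SUFFIXES = ["ing", "ed", "es", "s", "e"];

function toyStem(word) {
  var w = word.toLowerCase();
  for (var i = 0; i < SUFFIXES.length; i++) {
    var s = SUFFIXES[i];
    // Keep at least three characters so short words survive intact.
    if (w.length - s.length >= 3 && w.slice(-s.length) === s) {
      return w.slice(0, w.length - s.length);
    }
  }
  return w;
}

// Stem every word, drop duplicates, sort, and join with "|".
function toyDebugString(query) {
  var stems = query.split(/\s+/).filter(Boolean).map(function (w) {
    return toyStem(w);
  });
  var unique = stems.filter(function (s, i) { return stems.indexOf(s) === i; });
  return unique.sort().join("|");
}

console.log(toyDebugString("native pilot bomb reserve")); // bomb|nativ|pilot|reserv
```

Because "reserve", "reserved" and "reserves" all reduce to the same stem here, any of them in a document would match a query for any other.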

To search for a phrase, put it in double quotes (escaped with backslashes, since they sit inside the query string):

db.articles.runCommand( "text", { search: "\"native accused\"" } ) 
{ 
	"queryDebugString" : "accus|nativ||||native accused||", 
	"language" : "english", 
	"results" : [ 
		{ 
			"score" : 2.041666666666667, 
			"obj" : { 
				"_id" : ObjectId("5155ed42e3b71cacf23cfcd7"), 
				"id" : "1f5e2f6b89ef796b2fad9ad7d42df7e3c0028b37", 
				"url" : "http://www.arkansasonline.com/news/2013/feb/08/regrets-bomb-plot-man-says-plea-20130208/", 
				"title" : "Regrets bomb plot, man says in plea", 
				"summary" : "A Bangladesh native accused of trying to blow up the Federal Reserve Bank in New York with what he thought was a 1,000-pound car bomb pleaded guilty Thursday to terrorism charges stemming from an FBI sting." 
			} 
		} 
	], 
	"stats" : { 
		"nscanned" : 2, 
		"nscannedObjects" : 0, 
		"n" : 1, 
		"nfound" : 1, 
		"timeMicros" : 221 
	}, 
	"ok" : 1 
}
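Escaping quotes by hand gets error-prone quickly, so a small helper can assemble the search string. This is only a sketch: `buildSearch` and its option names (`any`, `phrases`, `not`) are hypothetical, and the "-" prefix for excluded terms is the negation syntax covered in the comparison table below. It runs under Node.js; in the mongo shell, replace `console.log` with `print`.

```javascript
// Hypothetical helper: assemble a search string for the text command.
//   any     - terms where a match on any one is enough (space-separated)
//   phrases - exact phrases, each wrapped in double quotes
//   not     - terms to exclude, each prefixed with "-"
function buildSearch(parts) {
  var any = parts.any || [];
  var phrases = (parts.phrases || []).map(function (p) {
    return '"' + p + '"';
  });
  var negated = (parts.not || []).map(function (t) {
    return "-" + t;
  });
  return any.concat(phrases, negated).join(" ");
}

var search = buildSearch({ phrases: ["native accused"], not: ["pilot"] });
console.log(search); // "native accused" -pilot

// In the mongo shell the string would then be passed to the text command:
// db.articles.runCommand("text", { search: search })
```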

Still, MongoDB text search is not yet very advanced compared to a dedicated text search engine. The table below compares MongoDB full-text search with a full-fledged text search library (Lucene).

| Feature | MongoDB 2.4 | Lucene |
| Term search | Yes | Yes |
| Phrase search | Yes. Put the phrase in double quotes (backslashed): "\"native accused\"" | Yes |
| Boolean operators: OR | Yes. Just type the two terms in a query: "native accused" | Yes. There’s an "OR" operator, or just a space between the terms |
| Boolean operators: AND | Yes. If you quote each of the terms as in phrase search, that works like an "AND" query: db.articles.runCommand("text", { search: "\"one\" \"two\"" }) | Yes. There’s an "AND" operator (or a "+" operator, which means that the word is required) |
| Boolean operators: NOT | Yes. Add a "-" in front of the negated word: "something -other" means you want to search for "something" but not for "other" | Yes. There’s a "NOT" operator (and a "-", which works the same way) |
| Grouping | No | Yes. The terms can be grouped by parentheses, like: (one two) AND three, which means either "one" or "two" should be in the text, and "three" is required |
| Wildcard searches | No. Not in the text index. There’s a regex search, though, which can be used for the purpose: db.articles.find( { title: /^regrets.*/i } ) | Yes. "*" and "?" are supported |
| Fuzzy searches | No | Yes. Tilde notation is used: roam~ means "find me all the items containing a word similar to roam" |
| Proximity searches | No. Similar functionality can be achieved with regex, but it would be complicated | Yes. "one two"~10 means "find me the words 'one' and 'two' within 10 words of each other" |
| Boosting a term | No | Yes. "one^2 two" means that "one" is more important than "two" |
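To show what "complicated" means for the regex route to proximity search, here is a sketch in plain JavaScript. The function name `proximityRegex` is hypothetical, and "within N words" is interpreted here as "at most N words in between", which is an assumption. Keep in mind that such a regex query does not use the text index at all.

```javascript
// Build a case-insensitive regex that matches when words a and b appear
// (in either order) with at most maxGap other words between them.
function proximityRegex(a, b, maxGap) {
  function esc(s) {
    // Escape characters that have a special meaning in a regex.
    return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  }
  // One separator, then up to maxGap whole words, each followed by a separator.
  var gap = "\\W+(?:\\w+\\W+){0," + maxGap + "}?";
  var A = esc(a);
  var B = esc(b);
  return new RegExp("\\b(?:" + A + gap + B + "|" + B + gap + A + ")\\b", "i");
}

var summary = "A Bangladesh native accused of trying to blow up " +
  "the Federal Reserve Bank in New York";

console.log(proximityRegex("native", "reserve", 10).test(summary)); // true
console.log(proximityRegex("native", "reserve", 2).test(summary));  // false

// In the mongo shell the regex could be used in an ordinary find:
// db.articles.find({ summary: proximityRegex("native", "reserve", 10) })
```

Compare this with the one-liner in the Lucene column, and the difference in convenience is obvious.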

Text indexes will have an impact on your database, in terms of both memory use and performance:

  • They change the space allocation method for all future record allocations in a collection to allocate sizes in powers of two.
  • They can be large. They contain one index entry for each unique post-stemmed word in each indexed field for each document inserted.
  • Building a text index is very similar to building a large multi-key index and will take longer than building a simple ordered (scalar) index on the same data.
  • When building a large text index on an existing collection, ensure that you have a sufficiently high limit on open file descriptors.
  • Text indexes will impact insertion throughput because MongoDB must add an index entry for each unique post-stemmed word in each indexed field of each new source document.
  • Additionally, text indexes do not store phrases or information about the proximity of words in the documents. As a result, phrase queries will run much more effectively when the entire collection fits in RAM.
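To get a rough idea of how quickly the index grows, the sketch below counts unique words per indexed field for one document. The name `estimateIndexEntries` is hypothetical, and the sketch assumes no stemming and no stop-word removal, so the real number of entries is likely to be lower. It runs under Node.js; in the mongo shell, replace `console.log` with `print`.

```javascript
// Rough estimate of text index entries for one document: unique lowercase
// words per indexed field. Assumption: no stemming and no stop-word removal.
function estimateIndexEntries(doc, fields) {
  var total = 0;
  fields.forEach(function (field) {
    var seen = {};
    String(doc[field] || "").toLowerCase().split(/[^a-z0-9]+/).forEach(function (w) {
      if (w) {
        seen[w] = true;
      }
    });
    total += Object.keys(seen).length;
  });
  return total;
}

// Title and summary taken from the search result shown earlier.
var doc = {
  title: "Nissan Quarterly Profit Dives On China Sales Slump",
  summary: "Nissan quarterly profit dives on China sales slump"
};

console.log(estimateIndexEntries(doc, ["title", "summary"])); // 16
```

Even this short document yields 16 entries by this estimate, because the same word is counted once per field it appears in.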

There are several more things to take into consideration when deciding whether you need a text search engine in addition to MongoDB. State-of-the-art text search systems offer many options that give you detailed control over tokenizing and stemming, and let you create your own analyzers and tokenizers. As you can see, none of this currently exists in MongoDB, but then again, for MongoDB text search is just a supplementary feature and should be treated as such. It might be adequate for your needs, though.

About Maryna Cherniavska

I have productively spent 10+ years in the IT industry, designing, developing, building and deploying desktop and web applications, designing database structures and otherwise proving that females have a place among software developers. And this is a good place.
One Response to MongoDb 2.4 and the new (not yet improved) text search

  1. Faliorn says:

    Impressive! Thanks for the overview. It’s really worth a try 😀
