Configure Stop Words in Solr for Drupal 7

This page is archived

We're keeping this page up as a courtesy to folks who may need to refer to old instructions. We don't plan to update this page.

Depending on the data being searched, some short, general words like "a", "the", or "is" can adversely affect search result relevancy. Consider the word "the", which in a standard description of a fish in our database could easily appear hundreds of times or more. When a search is performed, part of the algorithm that calculates the relevancy of any document in the index counts the number of times a word appears in the text being searched. The more often it appears, the more relevant the document. Words like "the", however, often have little to no real bearing on a document's actual relevancy. These words should instead be excluded from the ranking algorithm.
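
To make the counting idea concrete, here is a minimal sketch in Python. It is not Solr's actual ranking algorithm, only a plain term count with and without a stop word list. The sample description, the `STOP_WORDS` set, and the function names are all made up for illustration.

```python
from collections import Counter
import re

# Illustrative stop word list, taken from the examples in the text above.
STOP_WORDS = {"a", "the", "is"}

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def term_counts(text, stop_words=frozenset()):
    """Count how often each term appears, skipping any stop words."""
    return Counter(t for t in tokenize(text) if t not in stop_words)

# Hypothetical fish description, not real data from the tutorial's site.
description = "The clownfish is a small fish. The fish lives in the reef."

raw = term_counts(description)
filtered = term_counts(description, STOP_WORDS)

print(raw.most_common(1))       # [('the', 3)]
print(filtered.most_common(1))  # [('fish', 2)]
```

Without the stop word list, "the" is the most frequent term in the description. With it, the count is led by a word that actually says something about the document.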

Stop words can also serve another purpose. You can filter out words that are so common in a particular set of data that the system can't handle them in a useful way. For example, consider the word "fish" in our dataset. It's probably very common. With only 500 fish being indexed, it's not really going to make much difference, but what if we were indexing five million fish, and each one had the word "fish" in the description even just five times? That's 25 million occurrences of the word "fish". Eventually we might start to hit the upper limit of what Solr can handle. The word "fish" in this case is probably also not very useful in a search query. You're browsing a fish database. Are you really likely to search for "fish" and expect any meaningful results? Most likely it would return every result instead. It would be like going to Drupal.org, searching for the word "drupal", and expecting to get something useful. Not going to happen.

Solr can read in a list of stop words, that is, words that should be ignored during indexing, so that those words do not clutter your index or influence result relevancy. In this tutorial we'll take a look at configuring stop words for Solr.
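
As a rough illustration of what ignoring words during indexing means, the Python sketch below parses a stop word list written one word per line, with blank lines and lines starting with "#" skipped, and then drops those words from a piece of text. The sample file contents and the function names are hypothetical, and the whitespace tokenizing and lowercasing are simplifications rather than a description of Solr's own analysis chain.

```python
def parse_stop_words(file_text):
    """Read one stop word per line, skipping blank lines and '#' comments."""
    words = set()
    for line in file_text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            words.add(line.lower())
    return words

def index_terms(text, stop_words):
    """Return the terms that remain once stop words are dropped."""
    return [t for t in text.lower().split() if t not in stop_words]

# Hypothetical file contents, for illustration only.
SAMPLE_STOPWORDS_TXT = """\
# common English words
a
the
is

# domain-specific word for this dataset
fish
"""

stop_words = parse_stop_words(SAMPLE_STOPWORDS_TXT)
kept = index_terms("The blue fish is a reef fish", stop_words)
print(kept)  # ['blue', 'reef']
```

Only the terms in `kept` would reach the index in this simplified model, which is why content has to be re-indexed after the list changes.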

First, we'll use the Solr web UI to see the most common terms in our index for the body field. Then, based on that list and the list of common stop words provided by the Solr team, we'll configure our stopwords.txt file. Finally, we'll re-index all the content of our site so that it makes use of the new stop words configuration, and re-examine the most common terms to confirm that our stop words no longer appear in the list.
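
The tutorial itself uses the web UI for the first step. For readers who prefer a scriptable route, Solr's Luke request handler can report top terms for a field through its `fl` and `numTerms` parameters. The sketch below only builds such a URL and does not send a request. The base URL, the handler path, and the field name "body" are assumptions that will differ by Solr version and by how your Drupal search module names its fields.

```python
from urllib.parse import urlencode

def top_terms_url(base_url, field, num_terms=20):
    """Build a Luke request handler URL asking for a field's top terms."""
    params = {"fl": field, "numTerms": num_terms, "wt": "json"}
    return f"{base_url}/admin/luke?{urlencode(params)}"

# Assumed values: a local Solr instance and a field literally named "body".
url = top_terms_url("http://localhost:8983/solr", "body")
print(url)
# http://localhost:8983/solr/admin/luke?fl=body&numTerms=20&wt=json
```

Requesting the same URL before and after re-indexing gives a simple way to compare the two term lists.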

By the end of this tutorial you should be able to use the Solr web UI to get a list of the most common terms in your index, and know how to add terms to Solr's stopwords.txt file to prevent them from showing up in your index.

Additional resources