Configure Search API Solr: Use Stopwords When Indexing Content

Drupalize.Me Tutorial

Most written text contains a lot of functional words, like "a", "the", or "is". These words matter to the person reading the content, because they help it flow cohesively, but they aren't necessarily as important to someone searching the content of your site. Consider the word "the", which could easily appear hundreds of times or more in a standard blog post. When a user performs a search, part of the algorithm that calculates the relevancy of any document in the search index counts the number of times a word appears in the text being searched. The more often it appears, the more relevant the document. Words like "the", however, often have little to no real bearing on a document's actual relevancy, and should instead be excluded from the ranking algorithm.
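
To make the effect concrete, here is a minimal Python sketch of naive term counting with and without a stop word list. It is only an illustration of the idea, not how Solr scores documents internally; the sample text, the stop word set, and the term_counts helper are all made up for this example.

```python
import re
from collections import Counter

# Hypothetical sample text, for illustration only.
TEXT = "The cat sat on the mat. The dog chased the cat."

# Hypothetical stop word list; a real one lives in Solr's stopwords.txt.
STOP_WORDS = {"a", "is", "on", "the"}


def term_counts(text, stop_words=frozenset()):
    """Count lowercase word occurrences, skipping any stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in stop_words)


raw = term_counts(TEXT)
filtered = term_counts(TEXT, STOP_WORDS)

print(raw.most_common(1))       # [('the', 4)]
print(filtered.most_common(1))  # [('cat', 2)]
```

Without filtering, "the" is the most frequent term even though it says nothing about the subject of the text. With the stop words removed, the most frequent term is one that actually describes the content.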

This is generally done in one of two ways: either by ignoring those utility words when they are present in the search query and simply not passing them on to the search appliance, or by removing them during the indexing process so that those words simply don't appear in your search index. The default configuration that comes with the Search API Solr module actually has both of these features enabled, but doesn't provide any list of words to ignore.
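
The two approaches can be sketched with a toy inverted index in Python. This is a simplified model under stated assumptions, not Solr's implementation: the stop word set, the sample documents, and the tokenize, build_index, and search helpers are hypothetical names invented for the example.

```python
import re

# Hypothetical stop word list, for illustration only.
STOP_WORDS = {"a", "in", "of", "or", "the"}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def remove_stop_words(tokens):
    """Drop any token that appears in the stop word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def build_index(docs):
    """Index time: map each non-stop-word term to the ids of documents containing it."""
    index = {}
    for doc_id, text in docs.items():
        for term in remove_stop_words(tokenize(text)):
            index.setdefault(term, set()).add(doc_id)
    return index


def search(index, query):
    """Query time: strip stop words, then require every remaining term to match."""
    terms = remove_stop_words(tokenize(query))
    if not terms:
        return set()
    return set.intersection(*(index.get(t, set()) for t in terms))


# Hypothetical documents keyed by id.
docs = {
    1: "The history of Drupal",
    2: "Configuring Solr in Drupal",
}
index = build_index(docs)

print("the" in index)                          # False
print(search(index, "the history of Drupal"))  # {1}
```

Because the same list is applied on both sides, a query that contains stop words still matches: the stop words are dropped from the query before it is compared against an index that never stored them.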

This tutorial, based on the free Configure Stop Words in Solr videos from the Improving Drupal's Search with Apache Solr series, expects that you've already got the Search API module installed and configured to use Apache Solr as a backend.

Our goal: Filter out words that are so common in a particular set of data that the system cannot handle them in any useful way. In Solr, and in most search indexing applications, these are referred to as "stop words".

Hands-on

Alright, let's create some stop words. We're going to locate the Solr server configuration and edit the included stopwords.txt file, then restart Solr so that those changes are picked up, and re-index our content so that the index is recreated with those stop words removed.

  1. Locate your stopwords.txt file. This is part of your Solr server's configuration, not Drupal's configuration. The location of this file will vary depending on your installation, but it should be with the rest of the Solr configuration files you installed when setting up your server. Common locations include solr-4.10.3/example/solr/collection1/conf/stopwords.txt if you're doing local development and using the example Solr server, or somewhere like /usr/local/tomcat/solr/stopwords.txt if you're using Tomcat. Note: stop words configuration is per-core, so each Solr core will have its own stopwords.txt file.
  2. Once you've located the file, open it in your text editor of choice. If this is your first time editing the file, it is likely either completely empty or has a handful of lines at the top starting with "#". These are all comments, and are ignored by the indexer.
  3. Add a list of common stop words to your stopwords.txt file. The documentation for the syntax that can be used in this file can be found at http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.StopFilterFactory. The short version: one word per line. So your file might look something like this:
    
    #Stopwords for example.com
    a
    an
    and
    are
    as
    at
    be
    but
    by
    for
    if
    ...
    
  4. Once you've made your changes to the stopwords.txt file, you'll need to restart Solr so that your new configuration is used. Again, this will vary depending on your configuration, but common examples are to either kill the running localhost process and restart the example application with java -jar start.jar or, if using Tomcat and an init.d-like service, run sudo service tomcat restart, which should do the trick.
  5. Finally, you'll need to rebuild your search index. With the Search API module this can be done either through the UI (in the Administrative menu, go to Configuration > Search and Metadata > Search API (admin/config/search/search_api)), or with Drush: drush search-api-rebuild && drush search-api-index. Once the indexing is complete, queries to Solr should ignore the stop words you've configured above.
  6. Tada!
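
If you want to sanity-check a stop word list before restarting Solr, the file format described in steps 2 and 3 (one word per line, lines starting with "#" treated as comments) can be read with a few lines of Python. This is a rough sketch, not Solr's own parser: the parse_stop_words helper and the sample text are hypothetical, and lowercasing each word is an assumption made here for convenience.

```python
def parse_stop_words(text):
    """Return the set of stop words found in stopwords.txt-style text."""
    words = set()
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and comment lines, which start with "#".
        if not line or line.startswith("#"):
            continue
        # Assumption: treat words case-insensitively by lowercasing them.
        words.add(line.lower())
    return words


# Hypothetical file contents, mirroring the start of the example above.
SAMPLE = """\
#Stopwords for example.com
a
an
and
"""

stop_words = parse_stop_words(SAMPLE)
print(sorted(stop_words))  # ['a', 'an', 'and']
```

To check a real file, read its contents with open(path).read() and pass the result to the same function, where path is wherever your core's stopwords.txt lives.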

Conclusion

That's it. Pretty easy, and with a bit of tweaking over time, hopefully this helps you make your search results more relevant for your users. If you want to learn more about how Solr uses stop words, here are a couple of additional resources that I've found helpful:

Comments

Thanks! Great stuff on eliminating stop words from the index. My question is about the other side of the equation. When I search for 'TERM1 in/or/the TERM2' I get no results. But when I search 'TERM1 TERM2' with no stop word, I get the expected results.

How do I set up Solr (or Search API) to allow the user to type stop words but have the search results ignore them, giving the appropriate results?

I think it's simpler than I thought! Looks like when stop words are added to the txt file, the search box ignores them when someone searches!? Cool!