Most written text contains a lot of functional words, like "a", "the", or "is", which are important to the person reading the content because they help it flow in a cohesive manner, but aren't necessarily as important to someone searching the content of your site. Consider the word "the", which in a standard blog post could easily appear hundreds of times or more. When a user performs a search, part of the algorithm that calculates the relevancy of any document in the search index counts the number of times a word appears in the text being searched. The more often it appears, the more relevant the document. Words like "the", however, often have little to no real bearing on a document's actual relevancy, and should instead be excluded from the ranking algorithm.
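To make the problem concrete, here is a minimal Python sketch of a purely count-based score. It is a deliberate simplification and not Solr's actual ranking formula; the function names and the two sample documents are made up for illustration.

```python
import re
from collections import Counter


def term_counts(text):
    """Lowercase the text, split it into words, and count each word."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def naive_score(query, document):
    """Add up how often each query word occurs in the document."""
    counts = term_counts(document)
    return sum(counts[word] for word in re.findall(r"[a-z']+", query.lower()))


# Hypothetical sample documents, not taken from a real site.
doc_about_solr = "Solr is the search server. The Solr index stores the terms."
doc_about_cats = (
    "The cat sat on the mat. The dog chased the cat "
    "around the yard and the house."
)

# With "the" counted, the unrelated document scores higher.
print(naive_score("the solr", doc_about_solr))  # 5
print(naive_score("the solr", doc_about_cats))  # 6

# Without "the" in the query, only the relevant document scores at all.
print(naive_score("solr", doc_about_solr))  # 2
print(naive_score("solr", doc_about_cats))  # 0
```

In this toy example the document that never mentions Solr wins the query "the solr" simply because it repeats "the" more often, which is the distortion stop words are meant to remove.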
This is generally done in one of two ways: either by ignoring those utility words when they are present in the search query and simply not passing them on to the search appliance, or by removing them during the indexing process so that those words simply don't appear in your search index. The default configuration that comes with the Search API Solr module actually has both of these features enabled, but doesn't provide any list of words to ignore.
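The two approaches can be sketched with a toy inverted index in Python. This only illustrates the idea and is not how Solr or the Search API Solr module implements it; the stop word list, document ids, and function names are all assumptions made for the example.

```python
from __future__ import annotations

# Illustrative stop word list; a real list would come from stopwords.txt.
STOP_WORDS = {"a", "an", "the", "is"}


def tokenize(text: str) -> list[str]:
    """Split text into lowercase, whitespace-separated tokens."""
    return text.lower().split()


def remove_stop_words(tokens: list[str]) -> list[str]:
    """Drop every token that appears in the stop word list."""
    return [token for token in tokens if token not in STOP_WORDS]


def build_index(documents: dict[str, str]) -> dict[str, set[str]]:
    """Index time: map each non-stop-word term to the ids of documents containing it."""
    index: dict[str, set[str]] = {}
    for doc_id, text in documents.items():
        for term in remove_stop_words(tokenize(text)):
            index.setdefault(term, set()).add(doc_id)
    return index


def search(index: dict[str, set[str]], query: str) -> set[str]:
    """Query time: drop stop words from the query, then match any remaining term."""
    matches: set[str] = set()
    for term in remove_stop_words(tokenize(query)):
        matches |= index.get(term, set())
    return matches


# Hypothetical documents keyed by made-up ids.
docs = {
    "node/1": "the search index is a list",
    "node/2": "the cat is asleep",
}
index = build_index(docs)

print(sorted(index))                       # ['asleep', 'cat', 'index', 'list', 'search']
print(sorted(search(index, "the index")))  # ['node/1']
print(sorted(search(index, "the")))        # []
```

Filtering at both stages keeps the two sides consistent: the index never stores "the", and a query for "the" never asks for it.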
This tutorial, based on the free Configure Stop Words in Solr videos from the Improving Drupal's Search with Apache Solr series, expects that you've already got the Search API module installed and configured to use Apache Solr as a backend.
Our goal: Filter out words that are so common in a particular set of data that the system cannot handle them in any useful way. In Solr, and in most search indexing applications, these are referred to as "stop words".
Alright, let's create some stop words. We're going to locate the Solr server configuration and edit the included stopwords.txt file, then restart Solr so that those changes are picked up, and re-index our content so that the index is recreated with those stop words removed.
- Locate your stopwords.txt file. This is part of your Solr server's configuration, not Drupal's. The location of this file will vary depending on your installation, but it should be with the rest of the Solr configuration files you installed when setting up your server. Common locations include solr-4.10.3/example/solr/collection1/conf/stopwords.txt if you're doing local development and using the example Solr server, or somewhere like /usr/local/tomcat/solr/stopwords.txt if you're using Tomcat. Note: stop words configuration is per-core, so each Solr core will have its own stopwords.txt file.
- Once you've located the file, open it in your text editor of choice. If this is your first time editing the file, it is likely either completely empty or has a handful of lines at the top starting with "#". These are all comments, and are ignored by the indexer.
- Add a list of common stop words to your stopwords.txt file. The documentation for the syntax that can be used in this file can be found at http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.StopFilterFactory. The short version: one word per line. So your file might look something like this:
    #Stopwords for example.com
    a
    an
    and
    are
    as
    at
    be
    but
    by
    for
    if
    ...
- Once you've made your changes to the stopwords.txt file, you'll need to restart Solr so that your new configuration is used. Again, this will vary depending on your configuration, but common examples are to either kill the running localhost process and restart the example application with java -jar start.jar or, if using Tomcat and an init.d-like service, sudo service tomcat restart should do the trick.
- Finally, you'll need to rebuild your search index. With the Search API module this can either be done through the UI (in the Administrative menu, go to Configuration > Search and Metadata > Search API (admin/config/search/search_api)), or with Drush: drush search-api-rebuild && drush search-api-index. Once the indexing is complete, queries to Solr should ignore the stop words you've configured above.
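Before restarting Solr, it can be handy to check which words an edited file actually contains. The Python sketch below applies only the two rules described in the steps above (lines starting with "#" are comments, and each remaining line holds one word). It is not Solr's own parser, and `load_stop_words` is a hypothetical helper written for this example.

```python
import tempfile
from pathlib import Path


def load_stop_words(path):
    """Return the set of words in a stopwords.txt-style file.

    Blank lines and lines starting with "#" are skipped; every other
    line is treated as a single stop word.
    """
    words = set()
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        words.add(line)
    return words


# Write a small sample file to a temporary directory and read it back.
sample = "#Stopwords for example.com\na\nan\nand\n\nare\n"
with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "stopwords.txt"
    sample_path.write_text(sample, encoding="utf-8")
    stop_words = load_stop_words(sample_path)

print(sorted(stop_words))  # ['a', 'an', 'and', 'are']
```

To check a real file, pass the path to your own core's stopwords.txt instead of the temporary sample.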
That's it. Pretty easy, and with a bit of tweaking over time, hopefully this helps you make your search results more relevant for your users. If you want to learn more about how Solr uses stop words, here are a couple of additional resources that I've found helpful: