Check your version

This video covers a topic in Drupal 7 which may or may not be the version you're using. We're keeping this tutorial online as a courtesy to users of Drupal 7, but we consider it archived.

Configure Stop Words in Solr

  • 0:00
    Configure Stop Words in Solr with Joe Shindelar
  • 0:03
    In this tutorial, we're going to take a look at
  • 0:06
    configuring stop words for use with Solr.
  • 0:09
    We'll start by defining what stop words are.
  • 0:12
    And then we'll look at how to use the Solr Admin UI
  • 0:16
    in order to get a list of the most common words in our index.
  • 0:20
    Hopefully this will allow us to determine what we might want to add to our stop words list.
  • 0:26
    And finally, we'll update our stop words configuration
  • 0:30
    and then restart Solr so that Solr finds our list of stop words and excludes them from the index.
  • 0:36
    So what are stop words? Really stop words comes from this idea that really in any particular piece of content,
  • 0:43
    there are words that are so common that they're kind of useless in terms of creating relevant search results.
  • 0:49
    Think about the word the, for example, or a or is. These words are so commonly used in the English language
  • 0:58
    that searching for them or the fact that they exist in a title or in the body of our content
  • 1:04
    doesn't really have much impact on how relevant that piece of content is.
  • 1:08
    So what stop words allows us to do is configure a set of words that we would like to exclude
  • 1:14
    from the index. The thing that's tricky about stop words though is there is no one perfect set
  • 1:20
    for every application. In some cases, the word the might have important meaning in the context
  • 1:25
    of the content of your particular application. So I can't say with certainty that you should exclude the word the or a
  • 1:34
    or that you should include it. There are a lot of articles online that publish common stop words
  • 1:40
    in the English language. So I recommend taking a look at some of those for motivation or insight
  • 1:46
    into what might be good words to include in your list of stop words.
  • 1:50
    Another thing to consider too is that a big part of why stop words exist is that in older times, for sites that had
  • 1:59
    so much content and so many words in them, trying to add words like the and is and what and are
  • 2:05
    that don't have a lot of meaning into your index just took a lot of extra processing power and space
  • 2:11
    making your search slower or harder to scale. These days I don't think that's as big of a concern
  • 2:16
    and excluding words like the from your index isn't really going to have any meaningful impact
  • 2:21
    on how fast Solr works, but it could influence the relevancy of results.
  • 2:26
    Anyways, let's get started configuring Solr to use stop words.
  • 2:31
    I think maybe the best place to demonstrate this is by first looking at the list of common terms
  • 2:37
    in our Solr index and then showing how we can exclude some of those using Solr's stopwords.txt file.
  • 2:43
    The easiest way to get this list of words is probably to use the Apache Solr web UI.
  • 2:48
    So if you've got your Solr server running and you connect to localhost:8983 or wherever it is that Solr's running,
  • 2:55
    you should see this admin UI. And then I can select the core that contains my index.
  • 3:02
    And then down here on the left is this option for Schema Browser. So I'm going to use Schema Browser
  • 3:08
    and under the field here, I can choose from any of the fields. In our particular application,
  • 3:14
    the most meaningful one is probably the tm_body$value field. This is the node body field in our instance.
  • 3:22
    Depending on which fields you have in your index, you might have different ones here, but select any field
  • 3:27
    like so, and you get an analysis of that field. So we've got 550 documents in our index
  • 3:33
    that have content in this field, some information about yes it's tokenized, yes it's stored, et cetera.
  • 3:40
    I'm going to click on this Load Term Info button here. And what this does is it loads up the list
  • 3:46
    of the top terms, 10 by default. I could expand this. I could say, "Show me the top 50 terms."
  • 3:51
    So I can see the top 50 terms in my index. And you start to see that there's just the, to, of, is, and, in.
  • 4:01
    Fish in our case is a really common one, which makes sense because our application includes information
  • 4:06
    about fish. So what I want to look at is excluding some of these common words from the index
  • 4:13
    by using Solr's stop words. So you see them there now. What I'm going to do is switch over to my text editor.
  • 4:20
    In my text editor what I'm looking at is the root of my Drupal site. And I want to edit the configuration
  • 4:26
    for my Solr server. In this case, I'm going to actually edit the configuration that's part of the index
  • 4:33
    and not the configuration that came with the Search API Solr module that we had installed previously.
  • 4:38
    If I edit that configuration, it's not going to have any impact on Solr itself.
  • 4:42
    So I'm going to edit the example application, and I'm going to go to solr,
  • 4:47
    collection1 because I want to edit the configuration for this particular core. And then under conf or config,
  • 4:56
    there is a bunch of files, some of these which we copied in in a different tutorial from the Search API Solr module
  • 5:03
    into the Solr application, like, for example, the schema.xml and solrconfig.xml, et cetera.
  • 5:10
    One of them is this stopwords.txt file. This is the one that we're going to be configuring.
  • 5:16
    Before we do that though, let's take a look at the schema.xml file.
  • 5:20
    So if I open this up, and I just search for the word stopwords.txt,
  • 5:26
    I want to illustrate here the fact that how this is all configured within Solr
  • 5:30
    is that when you define a new field type, so here I'm defining a field type name equals text,
  • 5:37
    I can configure certain properties about how those fields work. And one of them that I can do
  • 5:41
    is configure a filter class or multiple filter classes. In this case I'm saying use the Solr StopFilterFactory.
  • 5:48
    This is the default configuration that we've been using that came with the Search API Solr module.
  • 5:53
    And it says, "Go ahead and look for a file named stopwords.txt and use the contents of that file
  • 5:59
    in conjunction with the StopFilterFactory in order to figure out what to exclude from the index."
  • 6:06
    The other thing that I'll point out here too is that in this case,
  • 6:10
    because we're inside of this analyzer child of the field type with the name of text,
  • 6:16
    we're talking about configuration that happens on text when it's being processed or indexed.
  • 6:22
    You can also add stop words to queries, so you can say, "Exclude the stop word from the query,"
  • 6:29
    but have it indexed if you'd like to.
  • 6:33
    There's some interesting debate about whether or not you should include words like the
  • 6:36
    in your index but then just exclude them from the query and how that affects meaning.
  • 6:41
    We're going to stick with what Drupal does by default though and exclude them from the index.
  • 6:45
    Okay. So that's an example. You could continue if you kept searching for other instances
  • 6:51
    of the stopwords.txt text in this file. You'll see which other places it's used or which other field types
  • 6:58
    allow for stop words. It's really just the text fields though. So there we go.
  • 7:03
    So we've got our stopwords.txt file. That corresponds to this file right here, stopwords.txt
  • 7:10
    inside of our Solr configuration. If I edit this file, it's got some documentation here at the top.
  • 7:17
    This link in particular is probably the most interesting part, because it explains the syntax
  • 7:21
    of this file. You can go ahead and open that link and take a look at the syntax if you want.
  • 7:26
    But it really boils down to this. One word per line for the words that you would like to exclude.
  • 7:32
    So if we switch back over to our site quick, in the Apache Solr Admin UI,
  • 7:37
    let's take a look at excluding these few words up here at the top: the, to, of, is, and, and in.
  • 7:46
    Okay. So I can go over and I'll edit this file. The, to, of, is, and, and in I think are all the ones we said
  • 7:54
    we wanted to exclude. I'll go ahead and save that. In the real world, I would maybe recommend
  • 7:59
    putting these in alphabetical order so that the list is a bit easier to scan and you can find words that are in it.
  • 8:05
    Sometimes these stop words lists can get to be hundreds or more words long.
  • 8:10
    We'll save that. The thing is, this isn't going to have any effect on our index until we do a couple of things.
  • 8:17
    First, we're going to need to restart Solr to pick up the stopwords.txt that we added.
  • 8:22
    And second, we're going to need to re-index all of our existing content so that it gets re-indexed
  • 8:29
    and knows there is this list of stop words here that it needs to exclude.
  • 8:33
    So let's do that. I'm going to switch to my terminal.
  • 8:36
    The first thing I'll do is in the tab where I've got Solr running, I'll just quit Solr, Control+C to quit it.
  • 8:43
    And then I'll simply start it up again. So I'm just using the simple start.jar version
  • 8:48
    that came with the download of Apache Solr.
  • 8:51
    So configuration changes generally require restarting Solr. The other thing I'll do is if I'm in the root directory
  • 9:02
    for my Drupal site—so you can see here I'm in the users/joe/Sites/docroot directory
  • 9:07
    for our fish finder application. I'm going to use drush to clear and re-index this.
  • 9:13
    I like to do this. I like to say drush, then pipe the output to grep, and then search for the word search.
  • 9:20
    And this gives me a list of all of the commands related to the Search API module.
  • 9:25
    And I often can't remember exactly what they are. So search-api-clear allows me to clear the index.
  • 9:31
    And then I'm also going to use the search-api-index to re-index all of our content.
  • 9:37
    So first we'll clear like so, and then we'll run it again. But this time instead we'll say index,
  • 9:45
    and this will take a second to go through all of our content and re-index it.
  • 9:49
    I'm going to switch back over to the Apache Solr web UI now that all of this has been re-indexed.
  • 9:56
    And what we're going to do is reload the term info here.
  • 10:00
    However, we're going to have to wait 120 seconds, because you remember that Solr has this idea
  • 10:05
    of making commits. And so when you re-index things in the way Drupal is configured by default,
  • 10:10
    it takes 120 seconds for those changes to the index to appear.
  • 10:14
    So go ahead and wait before you hit refresh.
  • 10:17
    After your changes have been committed to the index though and you hit refresh, notice how those words
  • 10:23
    that we added to the stop words list are no longer appearing in the top terms.
  • 10:27
    So we're still looking at the 50 top terms in the body, but the words the, of, is, and, and so forth
  • 10:33
    aren't appearing there. So that's how we would make use of stop words.
  • 10:38
    Configuring stop words is the easy part. You add them to your stopwords.txt file and re-index your content.
  • 10:44
    Knowing which stop words to use though is the tricky part.
  • 10:48
    It will involve having a better understanding of your own content.
  • 10:52
    In this tutorial, we talked about the use case for stop words.
  • 10:56
    Really it comes down to excluding words from appearing in your Apache Solr index
  • 11:01
    in order to exclude them from having any relevancy or influencing the relevancy of search results.
  • 11:08
    We then looked at a list of common terms
  • 11:11
    in our own index using the Apache Solr web UI to get some idea of what terms we might want to exclude.
  • 11:17
    Finally we added some words to the Solr stopwords.txt
  • 11:22
    configuration file, words like the and and, that were really common and we didn't want to include in the index,
  • 11:30
    restarted Solr, and then re-indexed all of our content, checked the list of common terms again,
  • 11:36
    and noticed that those were all gone. As an exercise, I recommend taking a look
  • 11:40
    at the most common words in the data for your site and trying to determine if any of them should be included
  • 11:46
    in your own stopwords.txt file.
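
The transcript describes stopwords.txt as one word per line and suggests keeping the list in alphabetical order. A minimal Python sketch of that housekeeping follows. The function names parse_stopwords and format_stopwords are hypothetical helpers, not part of Solr or Drupal, and treating lines that start with "#" as comments is an assumption about the file's syntax that you should check against the link at the top of your own stopwords.txt.

```python
def parse_stopwords(text):
    """Return the stop words in a stopwords.txt-style string as a sorted list."""
    words = set()
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and comment lines (assumed here to start with "#").
        if not line or line.startswith("#"):
            continue
        words.add(line)
    return sorted(words)


def format_stopwords(words):
    """Render stop words one per line, de-duplicated and in alphabetical order."""
    return "\n".join(sorted(set(words))) + "\n"


# The six words added in the tutorial, in the order they were typed.
example = """# Stop words added after reviewing the top terms
the
to
of
is
and
in
"""

stop_words = parse_stopwords(example)
print(stop_words)
print(format_stopwords(stop_words), end="")
```

Running the output of format_stopwords back into the file would give an alphabetized list that is easier to scan once it grows to hundreds of words. Solr itself only reads the file after a restart, and existing content still has to be re-indexed.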

Depending on the data that is being searched, some shorter general words, like "a", "the", or "is", can adversely affect search result relevancy. Consider the word "the", which in a standard description of a fish in our database could easily appear hundreds of times or more. When a search is performed, part of the algorithm that calculates the relevancy of any document in the index is to count the number of times a word appears in the text being searched. The more often it appears, the more relevant the document. Words like "the", however, often have little to no real bearing on a document's actual relevancy. These words should instead be excluded from the ranking algorithm.
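
To make the counting idea concrete, here is a rough Python sketch that scores a document purely by how often the query's words occur in it. This is an illustration under simplifying assumptions, not how Solr actually ranks results (its real scoring involves more than raw counts). The function names and the sample fish descriptions are made up for the example.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def count_score(query, document, stop_words=frozenset()):
    """Score a document by how often the query's words occur in it."""
    doc_tokens = [t for t in tokenize(document) if t not in stop_words]
    query_terms = {t for t in tokenize(query) if t not in stop_words}
    return sum(doc_tokens.count(term) for term in query_terms)


# Hypothetical fish descriptions; only the first one mentions bass.
docs = {
    "bass": "Striped bass are popular. Anglers prize bass.",
    "trout": "The trout lives in the cold water of the river near the lake.",
}
query = "the bass"

for name, body in docs.items():
    # Compare the raw score with the score once "the" is treated as a stop word.
    print(name, count_score(query, body), count_score(query, body, {"the"}))
```

Without stop words, the trout description outscores the bass description for the query "the bass" simply because it repeats "the" so often. Once "the" is ignored, only the document that actually mentions bass gets a score.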

Stop words can also serve another purpose. You can filter out words that are so common in a particular set of data that the system can't handle them in a useful way. For example, consider the word "fish" in our dataset. It's probably very common. With only 500 fish being indexed it's not really going to make much difference, but what if we were indexing five million fish, and each one had the word "fish" in the description even just five times? That's 25 million occurrences of the word "fish". Eventually we might start to hit the upper limit of what Solr can handle. The word "fish" in this case is probably also not very useful in a search query. You're browsing a fish database. Are you really likely to search for the query fish and expect any meaningful results? Likely it would instead return every result. It would be like going to Drupal.org and searching for the word "drupal" and expecting to get something useful. Not going to happen.

Solr has the ability to read in a list of stop words, or words that should be ignored during indexing, so that those words do not clutter your index and are removed from influencing result relevancy. In this tutorial we'll take a look at configuring stop words for Solr.
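
As a loose illustration of what ignoring words during indexing means, the Python sketch below builds a tiny term-to-documents index and drops stop words before anything is stored. It is a toy model, not Solr's StopFilterFactory. The analyze and build_index names, the tokenizer, and the sample documents are all assumptions made for the example.

```python
import re
from collections import defaultdict


def analyze(text, stop_words):
    """Tokenize text, lowercase it, and drop any stop words."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [t for t in tokens if t not in stop_words]


def build_index(documents, stop_words):
    """Map each remaining term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in analyze(text, stop_words):
            index[term].add(doc_id)
    return dict(index)


# Two made-up documents keyed by id.
documents = {
    1: "The bass is a freshwater fish.",
    2: "The trout is in the river.",
}
stop_words = {"the", "is", "a", "in"}

index = build_index(documents, stop_words)
print(sorted(index))
```

Because the stop words are removed before terms are stored, they never become entries in the index at all, which is why existing content has to be re-indexed after the list changes.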

First, we'll use the Solr web UI to see the most common terms in our index for the body field. Then, based on that list, and the list of common stop words provided by the Solr team, we'll configure our stopwords.txt file. Finally, we'll re-index all the content of our site so that it makes use of the new stop words configuration and re-examine the most common terms noting that our stop words no longer appear in the list.
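
The before-and-after comparison of top terms can be imitated outside Solr with a short Python sketch. The top_terms function below is a hypothetical stand-in for the term list shown in the Solr web UI, and the sample body text is invented. It only counts words in plain strings and does not talk to a Solr server.

```python
import re
from collections import Counter


def top_terms(documents, limit=10, stop_words=frozenset()):
    """Return the most frequent terms as (term, count) pairs."""
    counts = Counter()
    for text in documents:
        for token in re.findall(r"[a-z0-9']+", text.lower()):
            if token not in stop_words:
                counts[token] += 1
    # Sort by count (highest first), then alphabetically so ties are stable.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:limit]


# Invented body text standing in for indexed node bodies.
bodies = [
    "The bass is a fish found in the lakes of the north.",
    "The trout is a fish of the cold rivers.",
    "The pike is a fish and the perch is a fish.",
]

# The six words added to stopwords.txt in the tutorial.
stop_words = {"the", "to", "of", "is", "and", "in"}

print(top_terms(bodies, limit=5))
print(top_terms(bodies, limit=5, stop_words=stop_words))
```

In the first list "the" dominates. In the second list it is gone, and a content word like "fish" rises toward the top, much as it does in the tutorial's own index.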

By the end of this tutorial you should be able to use the Solr web UI to get a list of the most common terms in your index, and know how to add terms to Solr's stopwords.txt file to prevent them from showing up in your index.
