Check your version

This video covers a topic in Drupal 7 which may or may not be the version you're using. We're keeping this tutorial online as a courtesy to users of Drupal 7, but we consider it archived.

Data Alterations and Processors

  • 0:00
    Data Alterations and Processors with Joe Shindelar
  • 0:04
    In this tutorial, we're going to look at how the Search API module's
  • 0:08
    data alterations and processors
  • 0:10
    allow us to modify the information that's being indexed
  • 0:14
    or modify the way that search results are being displayed.
  • 0:18
    So the first thing we'll do is take a look at what data alterations are
  • 0:23
    and what processors are and the effect that they have on indexing.
  • 0:26
    Basically, we'll just look at the list that Search API module provides
  • 0:30
    and talk about what they all do.
  • 0:32
    We'll look at using an alteration to exclude content types from our index.
  • 0:37
    So on our site we want to make sure that we're only indexing content of the type fish.
  • 0:42
    And then we'll use a processor
  • 0:44
    to highlight keywords in search results.
  • 0:47
    The truth is most of the data alterations and processors
  • 0:51
    that come with the Search API module aren't really that relevant
  • 0:55
    when we're using Solr as a back end
  • 0:57
    because Solr actually handles most of these things for us
  • 1:00
    and probably does a better job of it
  • 1:03
    than the PHP code written into the Search API module.
  • 1:07
    However, I think that they're important things to understand,
  • 1:10
    and we'll talk about how each of them works.
  • 1:12
    By the end of this lesson, you should have a better understanding
  • 1:15
    of how data alterations and processors work
  • 1:18
    and be able to configure them for your Search API index.
  • 1:22
    When I'm looking at the site for our fish finder application,
  • 1:26
    if I click on the configuration tab at the top,
  • 1:28
    and then go to Search and metadata, Search API,
  • 1:32
    we've got the configuration for our Search API.
  • 1:36
    We've got our server and our index.
  • 1:38
    What I'm going to do is I'm going to edit our nodes index,
  • 1:41
    and I'm going to click on the filters link here
  • 1:44
    because what I'm interested in is the filters section,
  • 1:48
    which contains both the data alterations and the processors.
  • 1:51
    I think the term filters here is useful in understanding what these do.
  • 1:54
    They allow us to better filter the things that will ultimately end up in our index
  • 1:59
    and also to filter search results before they're displayed for end users
  • 2:04
    so that we can perform additional operations on them.
  • 2:06
    We configured some of these in an earlier tutorial
  • 2:10
    when we first set up our index.
  • 2:12
    I want to take a look at these again.
  • 2:14
    Data alterations are probably the thing that we'll end up using most of the time
  • 2:19
    when we're using Solr as a back end.
  • 2:22
    Data alterations are all about executing
  • 2:26
    before items are indexed and modifying what's sent to the index.
  • 2:32
    We can modify things like saying exclude an entire content type.
  • 2:36
    So right now, the idea is that our index is indexing all nodes.
  • 2:41
    But when we enable the bundle filter data alteration,
  • 2:45
    and then use the configuration down here
  • 2:47
    to say only those from the selected bundles,
  • 2:50
    fish, we're telling the Search API to exclude
  • 2:54
    any node type from this index that isn't the type fish.
  • 2:58
    That's one thing that we can do with alterations,
  • 3:01
    sort of change what's getting sent.
  • 3:04
    Another data alteration is the node access data alteration.
  • 3:08
    This one, combined with exclude unpublished nodes,
  • 3:11
    I think is really important, especially if what you're indexing is nodes.
  • 3:15
    The thing is Search API doesn't really care
  • 3:18
    about access control by default.
  • 3:21
    So without using the node access alteration,
  • 3:24
    all of the nodes get added to your index.
  • 3:27
    And somebody could search for content
  • 3:29
    that would ultimately return a result that they didn't have access to see.
  • 3:33
    Now, they wouldn't necessarily be able to click the link and get to the page
  • 3:38
    because Drupal would block that from happening.
  • 3:40
    But they would see whatever was in the index.
  • 3:43
    The node access data alteration works by adding
  • 3:47
    an additional node access information field to your index.
  • 3:51
    And if that field is present in the index,
  • 3:54
    appropriate filters will be automatically added to all the searches
  • 3:57
    so that they only return results that the current user is allowed to view.
  • 4:02
    So keep that in mind, and I highly recommend turning on node access
  • 4:06
    if you're indexing nodes.
  • 4:08
    Exclude unpublished nodes doesn't necessarily control access,
  • 4:13
    but what it does do is ensure that nodes that aren't marked as published
  • 4:17
    don't get sent to the indexer.
  • 4:20
    The index hierarchy data alteration is kind of neat.
  • 4:23
    What it does is it allows you to index hierarchical fields along with all of their parents.
  • 4:28
    Most importantly, this can be used to index things like taxonomy term references,
  • 4:32
    along with all of the parent terms.
  • 4:34
    So for example, if you've tagged something
  • 4:36
    with the term "New York,"
  • 4:39
    but it has parent terms of "USA" and "North America,"
  • 4:43
    if you use the hierarchical indexer here,
  • 4:46
    what could happen is if somebody searches for the term "North America,"
  • 4:49
    your result could still show up.
  • 4:52
    The complete entity view data alteration
  • 4:55
    adds a field that contains the whole HTML content of the entity as it's viewed on the site.
  • 5:01
    The view mode can be selected, so you can use, like,
  • 5:03
    the search results view mode or the teaser or whatever.
  • 5:06
    This allows your index to contain exactly what the user sees,
  • 5:10
    which is often what is expected.
  • 5:12
    This has some performance implications too,
  • 5:14
    as it allows for data to be rendered directly out of the Solr server
  • 5:18
    rather than having to do so in Drupal.
  • 5:20
    So that's one to play around with.
  • 5:23
    Aggregated fields offer the ability to add additional fields
  • 5:27
    to an entity that's being indexed
  • 5:29
    and have that sent to the Solr server.
  • 5:33
    An example might be if you wanted to set the type of the title field,
  • 5:38
    you could clone the node title field.
  • 5:40
    So you could have a full text version of it, but you could also have a string version of it
  • 5:44
    and have them be indexed differently by the Solr server.
  • 5:46
    So aggregated fields allow you to do that.
  • 5:49
    They also allow you to compound multiple fields together
  • 5:52
    into an individual blob of text to apply a full text search to or something along those lines.
  • 5:57
    Let's take a look at the information about processors.
  • 6:01
    Processors are things that run pre-indexing and post-indexing.
  • 6:07
    There's a note here that most processors only influence full text fields,
  • 6:11
    but you need to refer to their individual descriptions for details regarding their effect.
  • 6:16
    What processors do is either
  • 6:19
    run some additional manipulation on fields before they're sent to the index,
  • 6:25
    so, for example, you might use the ignore case processor
  • 6:28
    to say before you send this field to be indexed,
  • 6:30
    lowercase all of the content in the field.
  • 6:34
    That way the search becomes case insensitive for full text fields.
  • 6:38
    There's an HTML filter processor,
  • 6:41
    which also runs prior to information being sent to the indexer.
  • 6:46
    We could use this if we wanted to strip HTML tags out of full text fields
  • 6:50
    so that they don't appear in our search results.
  • 6:52
    Tokenizer and stop words are also things that can happen
  • 6:56
    prior to being sent to the indexer.
  • 6:59
    Tokenizing being the process of breaking a string of text
  • 7:02
    up into the individual words, sort of finding word boundaries
  • 7:05
    and knowing how to split that string into something that can be indexed.
  • 7:08
    And stop words being the words that we would like to specifically exclude from our index.
  • 7:14
    All of these things—ignore case, HTML filter,
  • 7:18
    tokenizer, and stop words—
  • 7:20
    are actually handled by Solr for us already,
  • 7:22
    so there's really no need to enable these processors
  • 7:26
    within the Search API module.
  • 7:28
    If we wanted to make changes to the way that these work,
  • 7:32
    we would probably want to change our Solr configuration,
  • 7:36
    that is, schema.xml or solrconfig.xml, and not make the changes here.
  • 7:41
    So I'm not actually going to enable any of these,
  • 7:44
    but I think it's nice to know what they are and what they're doing.
  • 7:47
    In a future tutorial, we'll look at configuring stop words within Solr.
  • 7:51
    What I am going to do is enable this search highlighting feature.
  • 7:55
    Highlighting, unlike the rest of these,
  • 7:57
    actually runs post-indexing,
  • 8:00
    so when somebody performs a search and the results are returned,
  • 8:03
    the highlighter allows us to highlight key phrases or words,
  • 8:08
    the keywords, wherever they appear in the search results.
  • 8:11
    Once I've enabled it, there's some additional configuration that I can do down here.
  • 8:16
    Basically, what HTML tags would you like to use
  • 8:18
    in order to wrap the highlighted text?
  • 8:21
    By default it's just strong tags.
  • 8:24
    In order to make this really stand out,
  • 8:26
    I'm going to change this.
  • 8:29
    Then we'll just apply some inline CSS.
  • 8:31
    Probably a better way to do this would be to add a class,
  • 8:34
    but for purposes of this demonstration, we'll do it inline here.
  • 8:37
    We'll change the background to yellow and add a little bit of padding around it.
  • 8:40
    We can say, for this particular post processor,
  • 8:44
    which field would we like it to work on?
  • 8:47
    We can select specific fields,
  • 8:49
    or if we leave them all unchecked,
  • 8:51
    it will just assume all of them.
  • 8:53
    And then we can configure when we want highlighted data to be returned.
  • 8:59
    So there's some options there for always highlight the data,
  • 9:02
    highlight the data if the server returns fields, or never.
  • 9:06
    We're going to just leave those as-is and click Save.
  • 9:09
    If I close this configuration,
  • 9:11
    and I perform a search like for, say, the word trout,
  • 9:15
    when results are returned, you can now see that there's an excerpt
  • 9:20
    of the search document that was indexed,
  • 9:23
    and the keyword is highlighted in the body field whenever it's found.
  • 9:28
    Kind of useful in some situations.
  • 9:31
    It's a nice way to display for people kind of where words are found.
  • 9:34
    You also really commonly see this used in Google.
  • 9:37
    For example, when you perform a search on Google,
  • 9:39
    you get the title and then an excerpt of the page
  • 9:42
    and where the keyword or phrase was found.
  • 9:45
    One final note about data alterations and processors:
  • 9:50
    depending on the configuration and what you've changed,
  • 9:53
    you may need to reindex your content
  • 9:55
    in order for that configuration change to be picked up.
  • 9:58
    For example, if we had said, "Enable the tokenizer,"
  • 10:02
    or if we had modified the list of content types that were allowed with the bundle filter,
  • 10:07
    we need to reindex all of our existing content
  • 10:10
    to make sure that those changes are picked up.
  • 10:12
    In this tutorial, we talked about what data alterations and processors
  • 10:17
    in the Search API module are and a bit about how they're used.
  • 10:21
    We didn't actually do much with them
  • 10:23
    because the truth is Solr handles most of this for us.
  • 10:26
    But it's nice to know what exists.
  • 10:28
    It's also good to point out that as you enable additional modules
  • 10:32
    that integrate with the Search API module,
  • 10:34
    they may also be adding new data alterations and processors that you can make use of.
  • 10:39
    We then looked at using the bundle filter
  • 10:42
    and how that would allow us to exclude a content type from indexing,
  • 10:44
    talking about how alterations are all about modifying what is sent
  • 10:48
    to the indexer.
  • 10:50
    And then we used the highlighted keywords processor
  • 10:53
    in order to wrap the text in the search results excerpt
  • 10:57
    and give it a yellow background so we could point out to people,
  • 11:00
    when you did your search, here's where that keyword occurred.
  • 11:03
    That's data alterations and processors for Search API.
  • 11:07
    In the next tutorial, we're going to take a look at displaying search results
  • 11:11
    using the views module.
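
The index hierarchy alteration described in the transcript (the "New York" / "USA" / "North America" example) can be pictured with a small Python sketch. This is only an illustration of the idea, not Search API's implementation; the `PARENTS` mapping and the `with_ancestors` function are hypothetical names invented for this example.

```python
# Hypothetical term -> parent mapping; None marks a top-level term.
PARENTS = {"New York": "USA", "USA": "North America", "North America": None}


def with_ancestors(term, parents):
    """Return the term followed by every ancestor, nearest first."""
    chain = []
    while term is not None:
        chain.append(term)
        term = parents.get(term)
    return chain


# The values that would be indexed for an item tagged "New York".
print(with_ancestors("New York", PARENTS))
# ['New York', 'USA', 'North America']
```

Because all three values end up in the indexed field, a search for "North America" can match an item that was only tagged "New York".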

Data Alterations and Processors


The Search API module supports a handful of data alterations and processors: additional operations that can be performed on a document before it's indexed or during the display of search results. While Solr already handles the majority of these for us, this tutorial will look at the available options, talk about what each one does, and explain which ones are still relevant when using Solr as a backend.

Looking at data alterations in the Search API module also raises an important point about security. By default, Search API doesn't care about your content's access control settings. To prevent people from seeing search results that contain data they shouldn't have access to, we need to make sure we account for that in our configuration.
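
To make the access control point concrete, here is a minimal Python sketch of the general idea: store access information alongside each indexed item, then filter every search by it. It is not Search API's actual code or data model. The `IndexedNode` class, the `view_roles` field, and the role names are assumptions made up for this illustration.

```python
from dataclasses import dataclass, field


@dataclass
class IndexedNode:
    """A simplified indexed document; not Search API's real data model."""
    title: str
    published: bool
    # Hypothetical stand-in for the node access information field:
    # the set of roles allowed to view this node.
    view_roles: set = field(default_factory=set)


def build_index(nodes):
    """Mimic 'Exclude unpublished nodes': drop them before indexing."""
    return [node for node in nodes if node.published]


def search(index, keyword, user_roles):
    """Match on title, then apply the access filter added to every search."""
    keyword = keyword.lower()
    return [
        node.title
        for node in index
        if keyword in node.title.lower() and node.view_roles & set(user_roles)
    ]


nodes = [
    IndexedNode("Rainbow trout", True, {"anonymous", "member"}),
    IndexedNode("Secret trout spots", True, {"member"}),
    IndexedNode("Draft trout notes", False, {"member"}),
]
index = build_index(nodes)

print(search(index, "trout", ["anonymous"]))  # ['Rainbow trout']
print(search(index, "trout", ["member"]))     # ['Rainbow trout', 'Secret trout spots']
```

Without the role check in `search`, the anonymous user would also see "Secret trout spots" in the results, which is the situation the node access alteration is meant to prevent.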

Here's a good list of the currently available data alterations and processors, though it's worth noting that not all of them are available for all search backends. Also, as we'll see, not all of them are recommended when using Solr, even if they are available. Solr's tokenizer, for example, is much more full-featured than the Search API tokenizer, so when using Solr as a backend it's best to keep the Search API tokenizer turned off and let Solr do its thing.
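
For a rough sense of what the pre-indexing processors do (HTML filter, ignore case, tokenizer, stop words), here is a short Python sketch of that kind of pipeline. It is an illustration only and does not reproduce Search API's or Solr's behavior; the function names and the stop word list are assumptions chosen for the example.

```python
import re
from html import unescape

STOP_WORDS = {"a", "an", "and", "in", "of", "the"}  # illustrative list only


def strip_html(text):
    """HTML filter: remove tags, then decode entities."""
    return unescape(re.sub(r"<[^>]+>", " ", text))


def tokenize(text):
    """Tokenizer: split on anything that is not a letter, digit, or underscore."""
    return [token for token in re.split(r"\W+", text) if token]


def preprocess(text):
    """Apply HTML filter, ignore case, tokenizer, and stop words in order."""
    tokens = tokenize(strip_html(text).lower())
    return [token for token in tokens if token not in STOP_WORDS]


body = "<p>The <strong>Rainbow Trout</strong> lives in cold streams &amp; lakes.</p>"
print(preprocess(body))
# ['rainbow', 'trout', 'lives', 'cold', 'streams', 'lakes']
```

Solr's analysis chain performs comparable steps with far more options, which is why these processors are usually left disabled when Solr is the backend.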

By the end of this lesson you should be able to use data alterations and processors to filter out specific content types from your Solr index and to highlight keywords found when displaying search results. You'll also be able to explain why some alterations and processors are better left off so that Solr can handle those tasks directly.
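
The two features configured in this lesson can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the module's code: `bundle_filter` and `highlight` are hypothetical helper names, and the inline style (including the 2px padding value) is only an example of the kind of markup you might enter in the highlighting settings.

```python
import re


def bundle_filter(items, allowed_bundles):
    """Data alteration: keep only items of the selected bundles."""
    return [item for item in items if item["type"] in allowed_bundles]


def highlight(text, keyword, prefix="<strong>", suffix="</strong>"):
    """Post-search processor: wrap each case-insensitive keyword match."""
    pattern = re.compile(re.escape(keyword), re.IGNORECASE)
    return pattern.sub(lambda m: f"{prefix}{m.group(0)}{suffix}", text)


items = [
    {"type": "fish", "body": "Trout prefer cold water."},
    {"type": "page", "body": "About trout fishing."},
]

# Only the "fish" item is sent to the index.
indexed = bundle_filter(items, {"fish"})

# Wrap matches in a span with a yellow background instead of the default tags.
excerpt = highlight(
    indexed[0]["body"],
    "trout",
    prefix='<span style="background: yellow; padding: 2px;">',
    suffix="</span>",
)
print(excerpt)
# <span style="background: yellow; padding: 2px;">Trout</span> prefer cold water.
```

Note that the filter runs before indexing while the highlighter runs on returned results, so changing the filter requires reindexing but changing the highlight markup does not.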
