Speed up Your Drupal Migrations with High Water Marks

As part of our workflow for writing code to migrate Drupalize.Me from Drupal 7 to Drupal 9, we're using Pantheon's multidev feature and building new environments for every pull request in GitHub. We use CircleCI to connect all the pieces. As part of the environment build script in CircleCI, we run all of the site's migrations using Drush.

This ensures that if we add a new migration to the codebase it'll get run on the preview environment. Also, someone reviewing the new code can see results in the preview without having to run it locally. When we run the migrations on a new multi-dev environment a few things happen:

  • Existing migrations are executed, and any new content that was added to the source site since the last time it ran on dev gets imported.
  • New migrations that have never been run before are executed.

The relevant part of the script looks like this:

# Make sure we have the latest version of the dmemigration module's
# configuration imported.
terminus drush $TERMINUS_SITE.$TERMINUS_ENV -- config-import --partial --source=modules/custom/dmemigration/config/install/ -y

# Report the status of the migrations before running them.
terminus drush $TERMINUS_SITE.$TERMINUS_ENV -- migrate:status --tag=dme

# Run the migrations.
terminus drush $TERMINUS_SITE.$TERMINUS_ENV -- migrate:import --tag=file
terminus drush $TERMINUS_SITE.$TERMINUS_ENV -- migrate:import --tag=media
terminus drush $TERMINUS_SITE.$TERMINUS_ENV -- migrate:import --tag=configuration
terminus drush $TERMINUS_SITE.$TERMINUS_ENV -- migrate:import --tag=users
terminus drush $TERMINUS_SITE.$TERMINUS_ENV -- migrate:import --tag=content
terminus drush $TERMINUS_SITE.$TERMINUS_ENV -- migrate:import --tag=dme

# Report the status of the migrations after running them.
terminus drush $TERMINUS_SITE.$TERMINUS_ENV -- migrate:status --tag=dme

Normally this is pretty quick, because the migrations have already run on our development environment and thus don't need to do much work on new multi-dev environments. But lately it's been super slow -- sometimes so slow that CircleCI will time out waiting for a response! So I wanted to figure out why.

Migrate import is slow even for migrations that have already completed

TL;DR: When you run a drush migrate:import command, it does a bunch of work to figure out how many rows there are to migrate and whether any of them have already been migrated. Part of that work is a query that checks for fields on fieldable entities, and that query runs once per row.

After checking for rows that need importing, the drush migrate:import command imports any rows that haven't already been migrated. The command also flags them so that they're not imported again next time. Subsequent calls to drush migrate:import for the same migration should output something like the following -- indicating no new records were processed.

[notice] Processed 0 items (0 created, 0 updated, 0 failed, 0 ignored) - done with 'dmemigration_flaggings'

If there were new records added to the source database, that handful of new records would get imported at this point. So why does running terminus drush $TERMINUS_SITE.$TERMINUS_ENV -- migrate:import --tag=dme take so long?

One of the migrations we recently completed migrates the data for a 'read' state flag that we use to allow members to mark a tutorial as read. There are currently 524,136 flaggings that need to get migrated. They've all been migrated on the dev server, but calling drush mim dmemigration_flaggings is still taking nearly 5 minutes to complete.

Digging into this, it's because when \Drupal\migrate\MigrateExecutable::import is executed, it initializes the source plugin ($source = $this->getSource();) and then calls $source->rewind(), which populates the iterator object with a list of all the rows of source data. See \Drupal\migrate\Plugin\migrate\source\SqlBase::initializeIterator. It's a fancy way of creating an array of all the source records and then looping over it.

Once the list exists, it loops over each item in the list (see \Drupal\migrate\Plugin\migrate\source\SourcePluginBase::next) and checks whether that record has already been imported. If it has, it just moves on to the next item pretty quickly. However, here there be dragons!

The SourcePluginBase::next method makes a call to the source plugin's prepareRow() method, once for each row. In the case of our flaggings migration, the source plugin's prepareRow() method checks to see if the flag being migrated has any fields attached to it (this happens for any fieldable entity, including nodes, users, etc.), and if it does, it loads the data for those fields. This is great when you need that data to import. But when you're just checking to see whether a row has been previously imported, it's...well...slow. You're running at least one additional SQL query for every row, regardless of whether that row is going to be imported during this pass. And when you've got 524,136 rows, that adds up to a lot of time. Memory consumption can become a bottleneck here as well.
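To make the cost concrete, here's a conceptual sketch in plain Python -- not Drupal's actual code; the class and method names are made up -- of an import pass where every row has already been migrated. The per-row prepareRow() work still happens for all 524,136 rows, even though nothing gets imported.

```python
# Conceptual sketch (NOT Drupal code) of why migrate:import is slow even when
# every row has already been imported: the prepareRow() step runs for every
# source row, and in our case each call costs at least one extra SQL query.

class FakeSource:
    """Stand-in for a migrate source plugin; names are illustrative only."""

    def __init__(self, row_ids):
        self.row_ids = row_ids
        self.queries_run = 0  # counts the per-row field lookups

    def prepare_row(self, row_id):
        # In the real plugin this loads field data for the fieldable entity,
        # which issues at least one SQL query per row.
        self.queries_run += 1

    def run_import(self, already_imported):
        imported = 0
        for row_id in self.row_ids:
            # prepareRow() runs for every row in the list, even rows that
            # the "already imported?" check will skip a moment later.
            self.prepare_row(row_id)
            if row_id in already_imported:
                continue
            imported += 1
        return imported

source = FakeSource(range(524_136))
# Pretend everything was already migrated on a previous run.
new_rows = source.run_import(already_imported=set(range(524_136)))
print(new_rows)            # 0 rows actually imported...
print(source.queries_run)  # ...but 524136 per-row lookups still ran
```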

High water marks

The Migrate API can use high water marks to limit the list of rows to import to only those that are above the mark. Anything below the mark is considered already processed and can be ignored. The two most common use cases are:

  • Only import records that have been added to the source data since the last time the migration ran.
  • Only migrate records that have been updated since the last time the migration ran.

Consider Drupal 7 nodes as an example. You could use either the node.nid field or the node.changed field as a high water mark. Every time the migration is run, it'll keep track of the last nid value or the latest changed timestamp it sees. The next time the migration runs, it'll filter out rows that are below that value. In the case where you're using a timestamp as the high water mark, if a record was previously imported but changed, it'll still be in the list of records to import. Then, if you're using the track changes feature the record created in Drupal 9 will be updated with new values.
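Here's a small illustrative sketch in plain Python -- not Drupal APIs; the data and helper names are invented -- of using a changed-style timestamp as a high water mark: each run only selects rows above the last recorded mark, so an edited row comes back for re-import while untouched rows are skipped.

```python
# Illustrative sketch (invented data, not Drupal code) of a "changed"
# timestamp used as a high water mark.

def select_rows(rows, high_water):
    """Mimics the filter on the source query: only rows above the mark."""
    return [r for r in rows if r["changed"] > high_water]

def run_migration(rows, high_water):
    # Process rows in high-water-field order (see the note on sorting below).
    picked = select_rows(sorted(rows, key=lambda r: r["changed"]), high_water)
    # The mark advances to the value on the last row processed.
    new_high_water = picked[-1]["changed"] if picked else high_water
    return picked, new_high_water

rows = [
    {"nid": 1, "changed": 100},
    {"nid": 2, "changed": 150},
    {"nid": 3, "changed": 200},
]

# First run: no mark recorded yet, so every row is selected.
picked, mark = run_migration(rows, high_water=0)
print([r["nid"] for r in picked])  # [1, 2, 3]

# Someone edits node 2 in the source site, bumping its changed timestamp.
rows[1]["changed"] = 250

# Second run: only the edited row is above the mark and gets re-imported.
picked, mark = run_migration(rows, high_water=mark)
print([r["nid"] for r in picked])  # [2]
```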

High water marks are configured per-migration, in the source plugin configuration of the migration's YAML (.yml) file.

Example: high_water_property in source plugin


source:
  plugin: d7_flagging
  batch_size: 5000
  high_water_property:
    # The value, timestamp, is the name of a field on the source row to use as
    # the high water mark.
    name: timestamp

It's a good idea to ensure that the source data is sorted by the same field that you're using as the high water mark. If you don't, it's possible you could end up with rows slipping through the cracks.
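Here's a hypothetical sketch in plain Python -- invented data and numbers, not Drupal code -- of how rows can slip through the cracks when the source data isn't ordered by the high water field: if a run stops partway through (a limit, a timeout, a crash), the recorded mark can jump past rows that were never imported.

```python
# Sketch (NOT Drupal code) of the "rows slipping through the cracks" failure
# mode when the source query isn't ordered by the high water field.

def run_batch(rows, high_water, limit):
    """Process up to `limit` rows above the mark, in the order given."""
    picked = [r for r in rows if r["timestamp"] > high_water][:limit]
    # The mark becomes the high water value of the last row processed.
    new_high_water = picked[-1]["timestamp"] if picked else high_water
    return picked, new_high_water

# Source rows NOT sorted by timestamp.
unsorted_rows = [{"timestamp": 2}, {"timestamp": 9}, {"timestamp": 5}]

# First run only gets through two rows (say, a limit or an interruption).
picked, mark = run_batch(unsorted_rows, high_water=0, limit=2)
print([r["timestamp"] for r in picked])  # [2, 9] -- mark is now 9

# Next run: timestamp 5 is below the mark, so that row is never imported.
picked, mark = run_batch(unsorted_rows, high_water=mark, limit=2)
print([r["timestamp"] for r in picked])  # [] -- row 5 slipped through

# With the rows sorted by timestamp, nothing is skipped.
sorted_rows = sorted(unsorted_rows, key=lambda r: r["timestamp"])
picked, mark = run_batch(sorted_rows, high_water=0, limit=2)
print([r["timestamp"] for r in picked])  # [2, 5] -- mark is 5; 9 comes next
```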

The logic for high water marks lives in the \Drupal\migrate\Plugin\migrate\source\SqlBase::initializeIterator method, which, as I explained above, is responsible for determining the list of rows that will be considered for migration. If a row never ends up in that list, prepareRow() is never called for it to query for any fields, and it can't slow down the migration process.

So in our case, by adding a high water mark to our dmemigration_flaggings migration, we can cut the processing time of drush mim dmemigration_flaggings -- when there are no (or few) records that actually need importing -- from a few minutes to a few seconds!

There might be some other implications to this that we'll run into and find frustrating later on, but for now using high water marks on a couple of migrations in our codebase that are migrating 500k plus rows has dramatically sped up the time it takes to deploy a new preview environment.

Recap

When a migration is executed, all the logic in the prepareRow() and hook_migrate_prepare_row() implementations is executed once for every row being considered -- even if the row has already been imported. Depending on what that logic does it can have a big impact on how long it takes the drush migrate:import command to figure out that it doesn't actually have anything to do.

Using a high water mark you can limit the list of rows that are considered for import by drawing a line that says, "Everything below this line is done and doesn't need to be processed again, and anything above the line we need to consider for importing or updating." Setting a high water mark for imports with a larger number of rows can help speed up subsequent calls to drush migrate:import by reducing the list of rows for which the prepareRow() code is called.

Comments

Hello, thanks for the info! This was very helpful.
Do we know exactly how high_water_property works? I am working with some data-sensitive migrations and I am a bit wary of this part:
"It's a good idea to ensure that the source data is sorted by the same field that you're using as the high water mark. If you don't, it's possible you could end up with rows slipping through the cracks."

The code where this is implemented is in SqlBase::initializeIterator. And essentially it adds a WHERE clause like WHERE {highwater_field} > {last_recorded_value} to the query used to retrieve source data.

The reason you want to make sure you order the query by that field is that {last_recorded_value} will be the value of the highwater field on the last row processed. And which row that is depends on the sort order. It'll be different if you order by node.nid vs. node.changed, for example.
