As part of our workflow for writing code to migrate Drupalize.Me from Drupal 7 to Drupal 9 we're using Patheon's multi-dev feature and building new environments for every pull request in GitHub. We use CircleCI to help connect all the pieces. As part of the environment build script in CircleCI we run all the migrations for the site using Drush.
This ensures that if we add a new migration to the codebase it'll get run on the preview environment. Also, someone reviewing the new code can see results in the preview without having to run it locally. When we run the migrations on a new multi-dev environment a few things happen:
- Existing migrations are executed, and any new content that was added to the source site since the last time it ran on dev gets imported.
- New migrations that have never been run before are executed.
The relevant part of the script looks like this:
Make sure we have the latest version of the dmemigration module's configuration imported
terminus drush $TERMINUS_SITE.$TERMINUS_ENV -- config-import --partial --source=modules/custom/dmemigration/config/install/ -y
Report the status of the migrations before and after running them
terminus drush $TERMINUS_SITE.$TERMINUS_ENV -- migrate:status --tag dme
Run the migrations
terminus drush $TERMINUS_SITE.$TERMINUS_ENV -- migrate:import --tag=file
terminus drush $TERMINUS_SITE.$TERMINUS_ENV -- migrate:import --tag=media
terminus drush $TERMINUS_SITE.$TERMINUS_ENV -- migrate:import --tag=configuration
terminus drush $TERMINUS_SITE.$TERMINUS_ENV -- migrate:import --tag=users
terminus drush $TERMINUS_SITE.$TERMINUS_ENV -- migrate:import --tag=content
terminus drush $TERMINUS_SITE.$TERMINUS_ENV -- migrate:import --tag=dme
Report the status of the migration after running them
terminus drush $TERMINUS_SITE.$TERMINUS_ENV -- migrate:status --tag=dme
Normally this is pretty quick, because the migrations have already run on our development environment and thus don't need to do much work on new multi-dev environments. But lately it's been super slow -- sometimes so slow that CircleCI will timeout waiting for a response! So I wanted to figure out why.
Migrate import is slow even for migrations that have already completed
TL;DR: When you run a drush migrate:import
command, it does a bunch of work to figure out how many rows there are to migrate, and whether or not any of them have already been migrated. Part of that work is a query that checks for fields on fieldable entities which runs once per row.
After checking for rows that need importing, the drush migrate:import
command imports any rows that haven't already been migrated. The command also flags them so that they're not imported again next time. Subsequent calls to drush migrate:import
for the same migration should output something like the following -- indicating no new records were processed.
[notice] Processed 0 items (0 created, 0 updated, 0 failed, 0 ignored) - done with 'dmemigration_flaggings'
If there were new records added to the source database, that handful of new records would get imported at this point. So why does running terminus drush $TERMINUS_SITE.$TERMINUS_ENV -- migrate:import --tag=dme
take so long?
One of the migrations we recently completed migrates the data for a 'read' state flag that we use to allow members to mark a tutorial as read. There are currently 524,136 flaggings that need to get migrated. They've all been migrated on the dev server, but calling drush mim dmemigration_flaggings
is still taking nearly 5 minutes to complete.
Digging in to this, it's because when \Drupal\migrate\MigrateExecutable::import
is executed it initializes the source plugin ($source = $this->getSource();
) and then calls $source->rewind()
which populates the iterator object with a list of all the rows of source data. See \Drupal\migrate\Plugin\migrate\source\SqlBase::initializeIterator
. It's a fancy way of creating an array of all the source records, and then looping over it.
Once the list exists, it loops over each item in the list \Drupal\migrate\Plugin\migrate\source\SourcePluginBase::next
and checks to see whether that record has already been imported. If it has, it just moves on to the next item pretty quickly. However, here there be dragons!
The SourcePluginBase::next
method makes a call to the source plugin's prepareRow()
method, once for each row. In the case of our flaggings migration, the source plugin's prepareRow()
method checks to see if the flag being migrated has any fields attached to it (this happens for any fieldable entity including Nodes, User, etc) and then if it does it loads the data for those fields. This is great when you need that data to import. But, when you're just checking to see if a row has been previously imported or not it’s...well...slow. You're running at least one additional SQL query for every row regardless of whether that row is going to be imported during this pass or not. And when you've got 524,136 rows that adds up to a lot of time. Memory consumption can become a bottleneck here as well.
Highwater marks
The Migrate API can use highwater marks to limit the list of rows to import to only those that are above the mark. Anything below the mark is considered already processed and can be ignored. The 2 most common use cases are:
- Only import records that have been added to the source data since the last time the migration ran.
- Only migrate records that have been updated since the last time the migration ran.
Consider Drupal 7 nodes as an example. You could use either the node.nid
field or the node.changed
field as a high water mark. Every time the migration is run, it'll keep track of the last nid
value or the latest changed
timestamp it sees. The next time the migration runs, it'll filter out rows that are below that value. In the case where you're using a timestamp as the highwater mark, if a record was previously imported but changed, it'll still be in the list of records to import. Then, if you're using the track changes feature the record created in Drupal 9 will be updated with new values.
Highwater marks are configured per-migration, in the source
plugin configuration of the migration's YAML (.yml) file.
Example: high_water_property
in source plugin
source:
plugin: d7_flagging
batch_size: 5000
high_water_property:
# The value, timestamp, is the name of a field on the source row to use as
# the high water mark.
name: timestamp
It's a good idea to ensure that the source data is sorted by the same field that you're using as the high water mark. If you don't, it's possible you could end up with rows slipping through the cracks.
The logic for highwater marks happens in the \Drupal\migrate\Plugin\migrate\source\SqlBase::initializeIterator
method. Which, as I explained above, is responsible for discerning the list of rows that will be considered for migration. If a row never ends up in that list, it'll never have its prepareRow()
method called to query for any fields, and can't slow down the migration process.
So in our case, by adding a highwater mark to our dmemigration_flaggings
migration we can cut the processing time of calling drush mim dmemigration_flaggings
when there are no (or few) records that actually need importing to a few seconds instead of a few minutes!
There might be some other implications to this that we'll run into and find frustrating later on, but for now using highwater marks on a couple of migrations in our codebase that are migrating 500k plus rows has dramatically sped up the time it takes to deploy a new preview environment.
Recap
When a migration is executed, all the logic in the prepareRow()
and hook_migrate_prepare_row()
implementations is executed once for every row being considered — even if the row has already been imported. Depending on what that logic does it can have a big impact on how long it takes the drush migrate:import
command to figure out that it doesn't actually have anything to do.
Using a highwater mark you can limit the list of rows that are considered for import by drawing a line that says, “Everything below this line is done and doesn't need to be processed again, and anything above the line we need to consider for importing or updating.” Setting a highwater mark for imports with a larger number of rows can help speed up subsequent calls to drush migrate:import
by reducing the list of rows for which the prepareRow()
code is called.
Comments
Hello, thanks for the info! This was very helpful.
Do we know how exactly does the high_water_property works? I am working with some data-sensitive migrations and I am a bit scared of the part:
"It's a good idea to ensure that the source data is sorted by the same field that you're using as the high water mark. If you don't, it's possible you could end up with rows slipping through the cracks."
The code where this is implemented is in SqlBase::initializeIterator. And essentially it adds a WHERE clause like
WHERE {highwater_field} > {last_recorded_value}
to the query used to retrieve source data.The reason you want to make sure you order the query by that field is that
{last_recorded_value}
will be the value of the highwater field on the last row processed. And what the row is is going to depend on the sort order. It'll be different if you order by node.nid vs. node.changed for example.Add new comment