Problem/Motivation

Steps to reproduce

Start with a fresh (or fresh enough) install where the releases have not been synchronized yet.
Visit /admin/config/l10n-server/connectors
For "Drupal.org packages from Rest API", choose "Scan" from "Operations" column.
In the confirm form, click "Confirm".

Expected:
A batch starts.
Every batch iteration only takes a limited amount of time.
When I close the browser tab, the current iteration will finish and then it will stop.

Actual:
A batch starts.
The batch takes forever for the first iteration.
When I close the browser tab, the process continues in the background.
(Only "ddev stop" ends it.)

In mysql, "show full processlist" shows repeated queries of this format:

SELECT "base_table"."rid" AS "rid", "base_table"."rid" AS "base_table_rid"
FROM
"l10n_server_release" "base_table"
INNER JOIN "l10n_server_release" "l10n_server_release" ON "l10n_server_release"."rid" = "base_table"."rid"
WHERE "l10n_server_release"."download_link" LIKE 'https://ftp.drupal.org/files/projects/schemadotorg\\_starterkit\\_medical-1.0.0-alpha35.tar.gz' ESCAPE '\\'

It seems this happens in ScannerService::storeReleaseList().

    foreach ($this->releases as $release) {
      $download_link = "https://ftp.drupal.org/files/projects/{$release['machine_name']}-{$release['version']}.tar.gz";
      if ($release_storage->getQuery()->accessCheck(TRUE)->condition('download_link', $download_link)->execute()) {

The number of releases is 205061, they are coming from drupal.org.
(This is on first run, when the 'l10n_drupal_rest.last_sync_time' state value has not been written yet. Subsequent runs can have a lot fewer releases to handle, but I am not sure.)

Interestingly, other methods like parseProjectList() deal with the same number of releases: they also loop over all of these releases, yet they complete much faster. It must be the SQL query.

Proposed resolution

Either we break this loop into batch chunks, or we optimize that query.
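One possible way to "optimize that query" is to avoid running it once per release. The sketch below is plain PHP, not module code: it assumes the already stored download links can be fetched with a single query up front (that query is not shown), and the function names and sample releases are hypothetical. Note that it compares links exactly, whereas the query above uses LIKE.

```php
<?php

declare(strict_types=1);

/**
 * Builds the download link the same way the loop in the issue does.
 */
function build_download_link(array $release): string {
  return "https://ftp.drupal.org/files/projects/{$release['machine_name']}-{$release['version']}.tar.gz";
}

/**
 * Returns only the releases whose download link is not stored yet.
 *
 * @param array[] $releases
 *   Parsed releases with 'machine_name' and 'version' keys.
 * @param string[] $known_links
 *   Links already stored. Assumption: fetched with ONE query beforehand.
 */
function filter_new_releases(array $releases, array $known_links): array {
  // Flip once, so every check is a hash lookup instead of a SQL query.
  $known = array_flip($known_links);
  $new = [];
  foreach ($releases as $release) {
    if (!isset($known[build_download_link($release)])) {
      $new[] = $release;
    }
  }
  return $new;
}

// Hypothetical sample data.
$releases = [
  ['machine_name' => 'foo', 'version' => '1.0.0'],
  ['machine_name' => 'bar', 'version' => '2.0.0-alpha1'],
];
$known_links = ['https://ftp.drupal.org/files/projects/foo-1.0.0.tar.gz'];
$new = filter_new_releases($releases, $known_links);
```

With roughly 200000 links the lookup table should fit in memory, but that is an assumption worth checking on the real data.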

Remaining tasks

User interface changes

API changes

Data model changes

Comments

donquixote created an issue. See original summary.

donquixote’s picture

Problem 1:
The 'download_link' column is not indexed.
-> I am trying to have it indexed, by modifying L10nServerReleaseStorageSchema.

Problem 2:
Trying to have it indexed, I find the next problem:
The L10nServerReleaseStorageSchema class is not used, because it is misspelled in the L10nServerRelease entity definition.
-> I fix the entity definition.

With both of these changes, the process is now significantly faster.
Before: ~1 minute per 1000 releases.
After: ~2.5 seconds per 1000 releases.
(I will have to repeat this measurement; there may be some mistake.)
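To put these rates in perspective for the full 205061 releases, here is a small back-of-the-envelope calculation. It assumes the measured rates stay constant over the whole run, which may not hold; the function name is hypothetical.

```php
<?php

declare(strict_types=1);

/**
 * Estimates the seconds needed for $count releases at a per-1000 rate.
 */
function estimate_seconds(int $count, float $seconds_per_1000): float {
  return $count / 1000 * $seconds_per_1000;
}

$total = 205061;
// Before the index: ~1 minute per 1000 releases.
$before = estimate_seconds($total, 60.0);
// After the index: ~2.5 seconds per 1000 releases.
$after = estimate_seconds($total, 2.5);
// The ~130000 releases processed when the browser batch timed out.
$at_timeout = estimate_seconds(130000, 2.5);

printf("before: %.1f hours\n", $before / 3600);
printf("after: %.1f minutes\n", $after / 60);
printf("at timeout: %.0f seconds\n", $at_timeout);
```

So, under these assumptions, the full scan would go from about 3.4 hours to about 8.5 minutes, which is still longer than a single browser request is likely to be allowed to run.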

The batch in the browser hits a time limit at ~130000 of the total 205061 releases.
But the process continues in the background.

Still this is not an ideal situation.

donquixote’s picture

Another smart idea would be to process releases in reverse order (oldest first), stop after a given time limit, and go to the next batch iteration.
We can write the 'l10n_drupal_rest.last_sync_time' state value, and in the next iteration we can pick up from there.

One question would be whether to download and parse the releases.tsv in every iteration, or keep it in the tmp dir.
It seems a good idea to keep the TSV file between batch iterations and delete it after the final iteration. I am not sure what could go wrong with this.
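The "stop after a given time limit and continue in the next iteration" idea could look roughly like the sketch below. It is plain PHP, not module code: the $context array only mimics the 'sandbox' and 'finished' keys of a Drupal batch operation callback, and it resumes from a stored array position rather than from the 'l10n_drupal_rest.last_sync_time' value. The function name, the $store callback and the sample data are hypothetical.

```php
<?php

declare(strict_types=1);

/**
 * Processes releases until the time budget runs out, then yields.
 *
 * @param array[] $releases
 *   All releases, assumed to be sorted oldest first.
 * @param float $time_limit
 *   Seconds this call may spend.
 * @param callable $store
 *   Hypothetical callback that stores one release.
 * @param array $context
 *   Progress carried over between calls.
 */
function process_release_chunk(array $releases, float $time_limit, callable $store, array &$context): void {
  $total = count($releases);
  // Resume where the previous call stopped; start at 0 on the first call.
  $position = $context['sandbox']['position'] ?? 0;
  $deadline = microtime(TRUE) + $time_limit;

  while ($position < $total) {
    $store($releases[$position]);
    $position++;
    // Check the clock after each item, so every call makes progress.
    if (microtime(TRUE) >= $deadline) {
      break;
    }
  }

  $context['sandbox']['position'] = $position;
  // 1 means done; anything lower asks the caller to call again.
  $context['finished'] = $total > 0 ? $position / $total : 1;
}

// Hypothetical usage: a zero budget forces one release per call.
$releases = [['version' => '1.0.0'], ['version' => '1.0.1'], ['version' => '1.1.0']];
$stored = [];
$store = function (array $release) use (&$stored): void {
  $stored[] = $release['version'];
};
$context = [];
$calls = 0;
do {
  process_release_chunk($releases, 0.0, $store, $context);
  $calls++;
} while ($context['finished'] < 1);
```

In a real batch, the caller would be the batch runner, and the parsed releases would have to come from somewhere that survives between requests, which is where the question about keeping the file in the tmp dir comes in.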

donquixote’s picture

Alternatively we could say that the initial scan should be done with drush, not the UI.

donquixote’s picture

The MR fixes the indexing problem, but does not go further.

Setting to "Needs review", even though I don't think this is a complete solution.

donquixote’s picture

Status: Active » Needs review

donquixote’s picture

Title: Connector scan will process all releases in one batch » Connector scan will process all releases in one batch, and it is slow

fmb made their first commit to this issue’s fork.

  • fmb committed 1249e489 on 3.0.x
    Merge branch '3563318-release-connector-scan-slow' into '3.0.x'
    
    Resolve...

  • fmb committed f2743076 on 3.0.x authored by donquixote
    Issue #3563318: Add index for 'download_link' field.
    

  • fmb committed c38b1568 on 3.0.x authored by donquixote
    Issue #3563318: Fix storage schema class in server release entity...

fmb’s picture

Status: Needs review » Active

I have just merged the indexing part for the time being.