diff options
| author | Daniel Axtens <dja@axtens.net> | 2020-03-18 00:59:13 +1100 | 
|---|---|---|
| committer | Stephen Finucane <stephen@that.guru> | 2020-03-19 11:47:42 +0000 | 
| commit | 97155c0bc8881787f6c536031b678a4c3f89bda6 (patch) | |
| tree | 19950ee97db5d4458ca4fbc1c186446f145475cb | |
| parent | 3cc21f320f0de13c9df897204fc512458b658d05 (diff) | |
| download | patchwork-97155c0bc8881787f6c536031b678a4c3f89bda6.tar patchwork-97155c0bc8881787f6c536031b678a4c3f89bda6.tar.gz | |
REST: massively improve the patch counting query under filters
The DRF web view counts the patches as part of pagination.
The query it uses is a disaster zone:
  SELECT Count(*) FROM (
    SELECT DISTINCT
      `patchwork_submission`.`id`           AS Col1,
      `patchwork_submission`.`msgid`        AS Col2,
      `patchwork_submission`.`date`         AS Col3,
      `patchwork_submission`.`submitter_id` AS Col4,
      `patchwork_submission`.`project_id`   AS Col5,
      `patchwork_submission`.`name`         AS Col6,
      `patchwork_patch`.`submission_ptr_id` AS Col7,
      `patchwork_patch`.`commit_ref`        AS Col8,
      `patchwork_patch`.`pull_url`          AS Col9,
      `patchwork_patch`.`delegate_id`       AS Col10,
      `patchwork_patch`.`state_id`          AS Col11,
      `patchwork_patch`.`archived`          AS Col12,
      `patchwork_patch`.`hash`              AS Col13,
      `patchwork_patch`.`patch_project_id`  AS Col14,
      `patchwork_patch`.`series_id`         AS Col15,
      `patchwork_patch`.`number`            AS Col16,
      `patchwork_patch`.`related_id`        AS Col17
    FROM `patchwork_patch`
    INNER JOIN `patchwork_submission`
      ON (`patchwork_patch`.`submission_ptr_id`=`patchwork_submission`.`id`)
    WHERE `patchwork_submission`.`project_id`=1
  )
This is because django-filters adds a DISTINCT qualifier on a
ModelMultiChoiceFilter by default. I guess it makes sense and they do a
decent job of justifying it, but it causes the count to be made with
this awful subquery. (The justification is that they don't know if you're
filtering on a to-many relationship, in which case there could be
duplicate values that need to be removed.)
While fixing that, we can also tell the filter to filter on patch_project
rather than submission's project, which allows us in some cases to avoid
the join entirely.
The resultant SQL is beautiful when filtering by project only:
  SELECT COUNT(*) AS `__count`
  FROM `patchwork_patch`
  WHERE `patchwork_patch`.`patch_project_id` = 1
On my test setup (2x canonical kernel mailing list in the db, warm cache,
my laptop) this query goes from >1s to ~10ms, a ~100x improvement.
If we filter by project and date the query is still nice, but still also
very slow:
  SELECT COUNT(*) AS `__count`
  FROM `patchwork_patch`
  INNER JOIN `patchwork_submission`
    ON (`patchwork_patch`.`submission_ptr_id`=`patchwork_submission`.`id`)
  WHERE (`patchwork_patch`.`patch_project_id`=1 AND `patchwork_submission`.`date`>='2010-11-01 00:00:00')
This us from ~1.3s to a bit under 400ms - still not ideal, but I'll take
the 3x improvement!
Reported-by: Konstantin Ryabitsev <konstantin@linuxfoundation.org>
Signed-off-by: Daniel Axtens <dja@axtens.net>
Reviewed-by: Stephen Finucane <stephen@that.guru>
| -rw-r--r-- | patchwork/api/filters.py | 3 | 
1 files changed, 2 insertions, 1 deletions
| diff --git a/patchwork/api/filters.py b/patchwork/api/filters.py index 60f1a36..a3b6f03 100644 --- a/patchwork/api/filters.py +++ b/patchwork/api/filters.py @@ -188,7 +188,8 @@ class CoverLetterFilterSet(TimestampMixin, BaseFilterSet):  class PatchFilterSet(TimestampMixin, BaseFilterSet): -    project = ProjectFilter(queryset=Project.objects.all()) +    project = ProjectFilter(queryset=Project.objects.all(), distinct=False, +                            field_name='patch_project')      # NOTE(stephenfin): We disable the select-based HTML widgets for these      # filters as the resulting query is _huge_      series = BaseFilter(queryset=Series.objects.all(), |