remove perseus from storage calculations #5601

ozer550 · 2025-12-11T06:38:15Z

Summary

WIP
…

References

…

Reviewer guidance

…

bjester · 2025-12-12T20:08:23Z

contentcuration/contentcuration/models.py

+        files_qs = cte.join(
+            self.files.get_queryset(), contentnode__tree_id=cte.col.tree_id
+        ).with_cte(cte)


As we looked at together, the combination of self.files.get_queryset() and the tree filtering is blowing up the performance of the query. Breaking this down into smaller blocks makes it more performant and allows for the additional filtering you're adding. I think something like this might work:

files_cte = With(
    self.files.get_queryset().values("checksum", "contentnode_id", "file_format_id")
)
files_qs = (
    files_cte.queryset()
    .with_cte(files_cte)
    .filter(
        Exists(
            cte.join(ContentNode.objects.all(), tree_id=cte.col.tree_id)
            .with_cte(cte)
            .filter(id=OuterRef("contentnode_id"))
        )
    )
)
files_qs = self._filter_storage_billable_files(files_qs)

See if you can apply some of the same ideas to the more complex check_channel_space method too.

The main files_qs might also need .with_cte(cte). I'm a bit unsure.

rtibbles · 2025-12-12T20:29:12Z

contentcuration/contentcuration/models.py

+        if queryset is None:
+            return queryset
+        return queryset.exclude(file_format_id__isnull=True).exclude(
+            file_format_id=file_formats.PERSEUS


Not an immediate concern, but just a heads-up that when QTI assessments are more broadly available and we are generating QTI ZIP files, we may need to filter these too (and it would need to be on the format preset, rather than the file format id, because the format id would be 'zip'!).
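A minimal sketch of what filtering on both the file format and the preset could look like, written in plain Python rather than the ORM. The `FileRecord` shape, the `is_storage_billable` helper, and the `"qti_zip"` preset value are assumptions for illustration only; they are not taken from this PR or confirmed against le-utils.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical values for illustration; the real constants live elsewhere.
EXCLUDED_FORMATS = {"perseus"}
EXCLUDED_PRESETS = {"qti_zip"}  # assumed preset name, not confirmed


@dataclass
class FileRecord:
    checksum: str
    file_format_id: Optional[str]
    preset_id: Optional[str]
    file_size: int


def is_storage_billable(f: FileRecord) -> bool:
    """Drop files with no format, excluded formats, and excluded presets."""
    if f.file_format_id is None:
        return False
    if f.file_format_id in EXCLUDED_FORMATS:
        return False
    return f.preset_id not in EXCLUDED_PRESETS


files = [
    FileRecord("a1", "mp4", "high_res_video", 100),
    FileRecord("b2", "perseus", "exercise", 40),
    FileRecord("c3", "zip", "qti_zip", 25),    # same format id as the next row
    FileRecord("d4", "zip", "html5_zip", 60),  # only the preset tells them apart
    FileRecord("e5", None, None, 10),
]

billable_size = sum(f.file_size for f in files if is_storage_billable(f))
```

The two `zip` rows show why a check on the format id alone would not be enough: only the preset distinguishes them.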

ozer550 · 2025-12-19T09:00:18Z

This was the analysis after latest changes:

Unique  (cost=7.20..1007.44 rows=1 width=33)
        (actual time=1.205..1.213 rows=0 loops=1)
  Output: contentcuration_file.checksum

  -> Nested Loop  (cost=7.20..1007.44 rows=1 width=33)
                  (actual time=1.205..1.212 rows=0 loops=1)
        Output: contentcuration_file.checksum
        Inner Unique: true

        -> Merge Anti Join  (cost=7.05..1002.77 rows=1 width=66)
                             (actual time=1.204..1.211 rows=0 loops=1)
              Output: contentcuration_file.checksum,
                      contentcuration_file.contentnode_id
              Merge Cond:
                ((contentcuration_file.checksum)::text =
                 (user_files.checksum)::text)

              -> Sort  (cost=3.53..3.59 rows=25 width=66)
                        (actual time=0.056..0.059 rows=25 loops=1)
                    Output: contentcuration_file.checksum,
                            contentcuration_file.contentnode_id
                    Sort Key: contentcuration_file.checksum
                    Sort Method: quicksort  Memory: 28kB

                    -> Seq Scan on contentcuration_file
                          (cost=0.00..2.94 rows=25 width=66)
                          (actual time=0.010..0.029 rows=25 loops=1)
                          Filter:
                            (file_format_id IS NOT NULL
                             AND file_format_id <> 'perseus'
                             AND uploaded_by_id = 1)
                          Rows Removed by Filter: 38

              -> Unique  (cost=3.53..998.72 rows=12 width=33)
                           (actual time=1.138..1.144 rows=1 loops=1)
                    Output: user_files.checksum

                    -> Subquery Scan on user_files
                          (cost=3.53..998.69 rows=12 width=33)
                          (actual time=1.137..1.143 rows=1 loops=1)
                          Output: user_files.checksum
                          Filter:
                            (alternatives: SubPlan 1 or hashed SubPlan 2)

                          -> Unique  (cost=3.53..3.78 rows=24 width=72)
                                       (actual time=0.067..0.068 rows=1 loops=1)
                                Output:
                                  contentcuration_file_1.checksum,
                                  contentcuration_file_1.contentnode_id,
                                  contentcuration_file_1.file_format_id

                                -> Sort  (cost=3.53..3.59 rows=25 width=72)
                                            (actual time=0.067 rows=1 loops=1)
                                      Output:
                                        contentcuration_file_1.checksum,
                                        contentcuration_file_1.contentnode_id,
                                        contentcuration_file_1.file_format_id
                                      Sort Key:
                                        contentcuration_file_1.checksum,
                                        contentcuration_file_1.contentnode_id,
                                        contentcuration_file_1.file_format_id
                                      Sort Method: quicksort  Memory: 28kB

                                      -> Seq Scan on contentcuration_file
                                            contentcuration_file_1
                                            (cost=0.00..2.94 rows=25 width=72)
                                            (actual time=0.003..0.019
                                             rows=25 loops=1)
                                            Filter:
                                              (file_format_id IS NOT NULL
                                               AND file_format_id <> 'perseus'
                                               AND uploaded_by_id = 1)
                                            Rows Removed by Filter: 38

                          SubPlan 1
                            -> Nested Loop  (cost=33.37..41.44 rows=1 width=0)
                                  (never executed)
                                  ...

                          SubPlan 2
                            -> Hash Join  (cost=33.28..47.16 rows=17 width=32)
                                          (actual time=0.523..1.028
                                           rows=58 loops=1)
                                  Output: u0_1.id
                                  Hash Cond:
                                    (u0_1.tree_id =
                                     contentcuration_contentnode_2.tree_id)

                                  -> Seq Scan on contentcuration_contentnode
                                        u0_1
                                        (cost=0.00..13.42 rows=142 width=37)
                                        (actual time=0.219..0.695
                                         rows=143 loops=1)

                                  -> Hash  (cost=33.26..33.26 rows=2 width=4)
                                            (actual time=0.282..0.284
                                             rows=5 loops=1)
                                        Output:
                                          contentcuration_contentnode_2.tree_id

                                        -> Unique
                                             (cost=33.23..33.24
                                              rows=2 width=4)
                                             (actual time=0.273..0.277
                                              rows=5 loops=1)

                                              -> Sort
                                                   (cost=33.23..33.23
                                                    rows=2 width=4)
                                                   (actual time=0.272..0.274
                                                    rows=5 loops=1)

                                                    -> Nested Loop Left Join
                                                         (cost=4.32..33.22
                                                          rows=2 width=4)
                                                         (actual time=0.239..0.261
                                                          rows=5 loops=1)

                                                          -> Nested Loop
                                                               (cost=4.17..21.61
                                                                rows=2 width=82)
                                                               (actual time=0.226..0.235
                                                                rows=5 loops=1)
                                                               Join Filter:
                                                                 (channel.id =
                                                                  channel_editors.channel_id)
                                                               Rows Removed by Join Filter: 10

                                                               -> Seq Scan on
                                                                    contentcuration_channel
                                                                    (cost=0.00..10.10
                                                                     rows=5 width=164)
                                                                    (actual time=0.007..0.010
                                                                     rows=5 loops=1)
                                                                    Filter: (NOT deleted)

                                                               -> Materialize
                                                                    -> Bitmap Heap Scan on
                                                                         contentcuration_channel_editors
                                                                         (cost=4.17..11.28
                                                                          rows=3 width=82)
                                                                         (actual time=0.206..0.208
                                                                          rows=5 loops=1)
                                                                         Recheck Cond:
                                                                           (user_id = 1)

                                                                         -> Bitmap Index Scan on
                                                                              contentcuration_channel_editors_user_id_446ae41b
                                                                              (cost=0.00..4.17
                                                                               rows=3 width=0)
                                                                              (actual time=0.015
                                                                               rows=5 loops=1)

                                                          -> Index Scan on
                                                               contentcuration_contentnode
                                                               contentcuration_contentnode_2
                                                               (cost=0.14..5.76
                                                                rows=1 width=37)
                                                               (actual time=0.004
                                                                rows=1 loops=5)

        -> Index Scan using
             contentcuration_contentnode_id_2b2d9339_like
             on contentcuration_contentnode
             (cost=0.14..5.76 rows=1 width=33)
             (actual time=0.007..0.007 rows=0 loops=0)

Planning Time: 2.860 ms
Execution Time: 1.085 ms

bjester · 2025-12-19T15:21:19Z

This was the analysis after latest changes:

That was for the check_channel_space queries?

bjester · 2025-12-19T15:25:33Z

contentcuration/contentcuration/models.py

+        staging_files_qs = self._filter_storage_billable_files(
+            self.files.filter(contentnode__tree_id=channel.staging_tree.tree_id)
+        )


This still has the same issue as the original queries: it queries on too many things at once. The user_files_cte can be reused for both editable and staged trees. So you can essentially duplicate editable_files_qs, but instead of joining on tree_cte, just check existence where tree_id=channel.staging_tree.tree_id.

Then in the core SELECT query, where it diffs between existing and new checksums, you can also filter off file_format_id
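The shape described above can be sketched at the SQL level with the standard-library `sqlite3` module. This is an illustration under assumptions, not the query Django generates: the table names (`file`, `node`, `editable_tree`), the sample rows, and the CTE names are simplified stand-ins, and the distinct-checksum step is left out for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE node (id TEXT PRIMARY KEY, tree_id INTEGER);
    CREATE TABLE editable_tree (tree_id INTEGER);
    CREATE TABLE file (
        id INTEGER PRIMARY KEY, checksum TEXT, contentnode_id TEXT,
        file_format_id TEXT, file_size INTEGER, uploaded_by_id INTEGER
    );
    INSERT INTO node VALUES ('n1', 1), ('n2', 1), ('s1', 9), ('s2', 9), ('s3', 9);
    INSERT INTO editable_tree VALUES (1);
    INSERT INTO file VALUES
        (1, 'aaa', 'n1', 'mp4', 100, 1),
        (2, 'bbb', 'n2', 'perseus', 40, 1),
        (3, 'aaa', 's1', 'mp4', 100, 1),
        (4, 'ccc', 's2', 'pdf', 70, 1),
        (5, 'ddd', 's3', 'perseus', 30, 1),
        (6, 'eee', 's3', 'pdf', 55, 2);
    """
)

# One narrow CTE over the user's files, reused for both tree checks.
NEW_STAGED_SIZE_SQL = """
WITH user_files AS (
    SELECT id, checksum, contentnode_id, file_format_id, file_size
    FROM file
    WHERE uploaded_by_id = :user_id
),
editable_files AS (
    SELECT uf.checksum FROM user_files uf
    WHERE EXISTS (
        SELECT 1 FROM node n
        JOIN editable_tree t ON t.tree_id = n.tree_id
        WHERE n.id = uf.contentnode_id
    )
),
staging_files AS (
    SELECT uf.* FROM user_files uf
    WHERE EXISTS (
        SELECT 1 FROM node n
        WHERE n.tree_id = :staging_tree_id AND n.id = uf.contentnode_id
    )
)
SELECT COALESCE(SUM(sf.file_size), 0)
FROM staging_files sf
WHERE sf.file_format_id IS NOT NULL
  AND sf.file_format_id <> 'perseus'
  AND NOT EXISTS (
      SELECT 1 FROM editable_files ef WHERE ef.checksum = sf.checksum
  )
"""

new_staged_size = conn.execute(
    NEW_STAGED_SIZE_SQL, {"user_id": 1, "staging_tree_id": 9}
).fetchone()[0]
```

With this sample data, only file 4 is staged, billable, and not already present in an editable tree.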

bjester

Overall, this is looking pretty great! Just some small comments

bjester · 2026-01-05T19:29:48Z

contentcuration/contentcuration/models.py

-            name="files",
+
+        user_files_cte = With(
+            self.files.get_queryset().only(


Django's queryset method only() is meant for use with model objects, if I understand correctly. Since this doesn't deal with model objects, values() seems more appropriate. Under the hood, they may result in the same SELECT query, but I'm unsure.

bjester · 2026-01-05T19:33:29Z

contentcuration/contentcuration/models.py

+            .filter(
+                Exists(
+                    tree_cte.join(
+                        ContentNode.objects.only("id", "tree_id"),


only or values should be unnecessary here because Django should eventually make this subquery (because it uses Exists) simply SELECT 1. Something like .all() should work

bjester · 2026-01-05T19:37:22Z

contentcuration/contentcuration/models.py

+                    ContentNode.objects.only("id").filter(
+                        tree_id=channel.staging_tree.tree_id,
+                        id=OuterRef("contentnode_id"),
+                    )


Like we talked about, I think this query adjustment should bring some improvement! Secondly, similar comment about only here

bjester · 2026-01-05T19:38:39Z

contentcuration/contentcuration/models.py

+
+        staging_files_qs = self._filter_storage_billable_files(staging_files_qs)
+
+        staging_files_qs = (


Maybe for clarity, call this queryset something else? new_staging_files_qs or something like that, since this is post-comparison with existing checksums

bjester · 2026-01-05T19:42:08Z

contentcuration/contentcuration/models.py

+                        checksum=OuterRef("checksum"),
+                        file_format_id=OuterRef("file_format_id"),


It's possible someone could craft two different files, with different formats, but the same checksum. Although I don't know that we need to be concerned about that, i.e. we can filter solely on checksum. We also reduce the results to the ids of distinct checksums, meaning we'd only count one of the files anyway. There was existing potential for this anyway, but I think limited usefulness for exploitation. Any particular scenario you're thinking about?
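To make the difference concrete, here is a small plain-Python sketch comparing the two matching rules. The row dictionaries and helper names are hypothetical; the real comparison happens in the ORM subquery.

```python
existing = [
    {"checksum": "abc", "file_format_id": "mp4"},
]
staged = [
    {"checksum": "abc", "file_format_id": "pdf", "file_size": 100},
    {"checksum": "def", "file_format_id": "pdf", "file_size": 50},
]


def new_by_checksum(staged_rows, existing_rows):
    """Treat a staged file as new only if its checksum is unseen."""
    seen = {e["checksum"] for e in existing_rows}
    return [s for s in staged_rows if s["checksum"] not in seen]


def new_by_checksum_and_format(staged_rows, existing_rows):
    """Treat a staged file as new unless checksum AND format both match."""
    seen = {(e["checksum"], e["file_format_id"]) for e in existing_rows}
    return [
        s for s in staged_rows
        if (s["checksum"], s["file_format_id"]) not in seen
    ]


size_checksum_only = sum(
    s["file_size"] for s in new_by_checksum(staged, existing)
)
size_checksum_and_format = sum(
    s["file_size"] for s in new_by_checksum_and_format(staged, existing)
)
```

The two rules only diverge when the same checksum appears under different formats, which is the crafted-file case described above.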

bjester · 2026-01-05T19:45:22Z

contentcuration/contentcuration/models.py

+            )
+        )
+
+        staging_files_qs = self._filter_storage_billable_files(staging_files_qs)


Instead of filtering both editable_files_qs and staging_files_qs, I think we could just filter the resulting queryset after we find the new files (after the checksum check)? I could envision some tradeoffs: eliminating file formats we won't bother to count later on could reduce the size of the checksum comparison, but it means we do that twice instead of once. Thoughts?

bjester · 2026-01-05T19:48:50Z

contentcuration/contentcuration/models.py

        )
        staged_size = float(
-            staging_tree_files.aggregate(used=Sum("file_size"))["used"] or 0
+            staging_files_qs.filter(id__in=Subquery(unique_staging_ids)).aggregate(


We should keep the in-subquery in mind later during unstable/hotfixes testing. I believe the query planner should make similar decisions to an EXISTS check, but maybe not.
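For reference, a small `sqlite3` sketch of the two query forms being compared. It only shows that the `IN (subquery)` and `EXISTS` variants return the same aggregate; it uses SQLite with a simplified, assumed `file` table, so it says nothing about which plan PostgreSQL will pick for the real query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE file (id INTEGER PRIMARY KEY, checksum TEXT, file_size INTEGER);
    INSERT INTO file VALUES
        (1, 'aaa', 100), (2, 'aaa', 100), (3, 'bbb', 40), (4, 'ccc', 70);
    """
)

# Form 1: sum sizes for one row id per distinct checksum, via IN (subquery).
IN_SQL = """
SELECT COALESCE(SUM(file_size), 0) FROM file
WHERE id IN (SELECT MIN(id) FROM file GROUP BY checksum)
"""

# Form 2: the same restriction expressed as a correlated EXISTS check.
EXISTS_SQL = """
SELECT COALESCE(SUM(f.file_size), 0) FROM file f
WHERE EXISTS (
    SELECT 1
    FROM (SELECT MIN(id) AS id FROM file GROUP BY checksum) u
    WHERE u.id = f.id
)
"""

size_in = conn.execute(IN_SQL).fetchone()[0]
size_exists = conn.execute(EXISTS_SQL).fetchone()[0]
```

Running EXPLAIN ANALYZE on both forms against production-sized data would be the way to confirm the planner treats them alike.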

remove perseus from storage calculations

e19fedb

bjester reviewed Dec 12, 2025

rtibbles reviewed Dec 12, 2025

optimise query by shuffling cte order

eaba504

ozer550 requested a review from bjester December 19, 2025 09:09

bjester reviewed Dec 19, 2025

ozer550 added 2 commits January 1, 2026 09:48

fix failing tests

d509c37

reuse user_files_cte in staging_files logic

218f8d7

bjester reviewed Jan 5, 2026

