-
Notifications
You must be signed in to change notification settings - Fork 264
remove perseus from storage calculations #5601
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: unstable
Are you sure you want to change the base?
remove perseus from storage calculations #5601
Conversation
| files_qs = cte.join( | ||
| self.files.get_queryset(), contentnode__tree_id=cte.col.tree_id | ||
| ).with_cte(cte) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As we looked at together, the combination of self.files.get_queryset() and the tree filtering is blowing up the performance of the query. Breaking this down into smaller blocks makes it more performant and allows for the additional filtering you're adding. I think something like this might work:
files_cte = With(self.files.get_queryset().values("checksum", "contentnode_id", "file_format_id"))
files_qs = (
files_cte.queryset()
.with_cte(files_cte)
.filter(
Exists(
cte.join(ContentNode.objects.all(), tree_id=cte.col.tree_id)
.with_cte(cte)
.filter(id=OuterRef("contentnode_id"))
)
)
)
files_qs = self._filter_storage_billable_files(files_qs)See if you can apply some of the same ideas to the more complex check_channel_space method too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The main files_qs might also need .with_cte(cte) too. I'm a bit unsure
| if queryset is None: | ||
| return queryset | ||
| return queryset.exclude(file_format_id__isnull=True).exclude( | ||
| file_format_id=file_formats.PERSEUS |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not an immediate concern, but just a heads that when QTI assessments are more broadly available, and we are generating QTI ZIP files, then we may need to filter these too (and it would need to be on the format preset, rather than the file format id, because the format id would be 'zip'!)
|
This was the analysis after latest changes: |
That was for the |
| staging_files_qs = self._filter_storage_billable_files( | ||
| self.files.filter(contentnode__tree_id=channel.staging_tree.tree_id) | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This still has the same issue as the original queries-- it queries on too many things at once. The user_files_cte can be reused for both editable and staged trees. So you can essentially duplicate editable_files_qs but instead of joining on tree_cte just check existence where tree_id=channel.staging_tree.tree_id.
Then in the core SELECT query, where it diffs between existing and new checksums, you can also filter off file_format_id
bjester
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, this is looking pretty great! Just some small comments
| name="files", | ||
|
|
||
| user_files_cte = With( | ||
| self.files.get_queryset().only( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The django queryset method only has connection with using model objects, if I understand correctly. Since this doesn't deal with model objects, values seems more appropriate. Under the hood, they may result in the same SELECT query, but I'm unsure.
| .filter( | ||
| Exists( | ||
| tree_cte.join( | ||
| ContentNode.objects.only("id", "tree_id"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
only or values should be unnecessary here because Django should eventually make this subquery (because it uses Exists) simply SELECT 1. Something like .all() should work
| ContentNode.objects.only("id").filter( | ||
| tree_id=channel.staging_tree.tree_id, | ||
| id=OuterRef("contentnode_id"), | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Like we talked about, I think this query adjustment should bring some improvement! Secondly, similar comment about only here
|
|
||
| staging_files_qs = self._filter_storage_billable_files(staging_files_qs) | ||
|
|
||
| staging_files_qs = ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe for clarity, call this queryset something else? new_staging_files_qs or something like that, since this is post-comparison with existing checksums
| checksum=OuterRef("checksum"), | ||
| file_format_id=OuterRef("file_format_id"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's possible someone could craft two different files, with different formats, but the same checksum. Although I don't know that we need to be concerned about that, i.e. we can filter solely on checksum. We also reduce the results to the ids of distinct checksums, meaning we'd only count one of the files anyway. There was existing potential for this anyway, but I think limited usefulness for exploitation. Any particular scenario you're thinking about?
| ) | ||
| ) | ||
|
|
||
| staging_files_qs = self._filter_storage_billable_files(staging_files_qs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of filtering both editable_files_qs and staging_files_qs, I think we could just filter the resulting queryset after we find the new files (after the checksum check)? I could envision some tradeoffs-- eliminating file formats we won't bother to count later on could reduce the size of the checksum comparison, but it means we do that twice instead of once. Thoughts?
| ) | ||
| staged_size = float( | ||
| staging_tree_files.aggregate(used=Sum("file_size"))["used"] or 0 | ||
| staging_files_qs.filter(id__in=Subquery(unique_staging_ids)).aggregate( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should keep the in-subquery in mind later during unstable/hotfixes testing. I believe the query planner should make similar decisions to an EXISTS check, but maybe not.
Summary
WIP
…
References
…
Reviewer guidance
…