-
Notifications
You must be signed in to change notification settings - Fork 2.1k
Add a memory bound FileStatisticsCache for the Listing Table #20047
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
mkleen
wants to merge
93
commits into
apache:main
Choose a base branch
from
mkleen:file-stats-cache
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
Changes from all commits
Commits
Show all changes
93 commits
Select commit
Hold shift + click to select a range
fd0f896
Add a default FileStatisticsCache implementation for the ListingTable
mkleen 2ff77a9
fixup! Add a default FileStatisticsCache implementation for the Listi…
mkleen 8b2d6ef
Adapt memory usage when removing entries
mkleen 95a08d8
Adapt heapsize for &str
mkleen 6da78ce
Fix formatting
mkleen bea408d
Adapt heapsize for &str and add another scalarvalue
mkleen 8f7f9d2
Add better error message
mkleen cd56471
Add todo to add heapsize for ordering in CachedFileMetadata
mkleen e2c2463
Fix comment/docs on DefaultFileStatisticsCache
mkleen 07b0c00
Simplify test data generation
mkleen 37385bb
Remove potential stale entry, if entry is too large
mkleen 960357d
Fix typo in sql logic test comment
mkleen 03ac26d
Fix comment about default behaviour in cache manager
mkleen 213e7b3
Fix variable name in test
mkleen 83c1ce5
Fix variable name in test
mkleen d41416b
Disable cache for sql logic test
mkleen 0426037
Include key into memory estimation
mkleen 264b251
Fix fmt
mkleen ddf5b6b
Fix clippy
mkleen dacbcf2
minor
mkleen d7066b0
Add more key memory accounting
mkleen d6aa6c4
Fix Formatting
mkleen 4d93c31
Account path as string and remove dependency to object_store
mkleen 98ea7fa
Improve error handling
mkleen 20cd255
Fix fmt
mkleen c4c782f
Remove path.clone
mkleen 0dead2e
Simplify accounting for statistics
mkleen 63201a0
Adapt offset buffer
mkleen 71ff66e
Fix heap size for Arc
mkleen 4a2de63
Adapt estimate in test
mkleen 71405fc
Fix sql logic test
mkleen 326b46e
Register cache from cachemanager at listing table
mkleen 12b2d81
Revert slt
mkleen fbb0dbb
Add tablescoping for file stats cache
mkleen 7ff7021
Adapt slt
mkleen f1db937
Fix linter
mkleen 1069914
Remove uneeded clone
mkleen 36ebf45
Rename cache_unit to file_statistics_cache
mkleen ecedf49
Simplify heap size accounting
mkleen ef4bb12
Adapt comments in test
mkleen bfa78c3
Seperate drop table clean-ups
mkleen 242f151
fixup! Seperate drop table clean-ups
mkleen 210ba77
Increase default limit to 10 mb
mkleen f2ea873
Increase default limit to 20 mb
mkleen 6eae740
Fix comment
mkleen 25e3a03
Fix deregister logic
mkleen 37cdf7a
Fix slt
mkleen daa3cb7
Add table reference to FileStatisticsCacheEntry
mkleen be0c170
fixup! Add table reference to FileStatisticsCacheEntry
mkleen 105ec9a
Fix comment
mkleen ff29eb4
Fix runtime_env entry
mkleen 9ed435f
Add cache for all benchmark runs
mkleen 1f9fe95
Add cache to listing table creation
mkleen 7babcd4
fixup! Add cache to listing table creation
mkleen 87b0773
Adapt limit to 20M in configs.md
mkleen 0759d2e
fixup! Adapt limit to 20M in configs.md
mkleen e35230a
Fix linter
mkleen 4f71c32
Add cache to listing table in _read_type()
mkleen e7f4607
Add ListView and LargeListView to heapsize
mkleen bfa54a9
fixup! Add ListView and LargeListView to heapsize
mkleen 8ea3af1
Remove array.slt
mkleen be1c766
Add table ref to ListingTableUrl
mkleen 3b30524
Add heapsize for table-scoped-path
mkleen ffcd39d
Make list_entries table-scoped
mkleen 78abfb1
fixup! Make list_entries table-scoped
mkleen 87d7418
fixup! fixup! Make list_entries table-scoped
mkleen 174328e
Improve heap size estimation for Arc
mkleen 2b84a82
fixup! Improve heap size estimation for Arc
mkleen eda9d32
Update migration guide
mkleen 9c185e3
fixup! Update migration guide
mkleen 98c3ca9
Improve heapsize estimation for TableReference
mkleen 640a0e7
Improve memory handling when inserting
mkleen 280897a
Fix comments in Cache Manager
mkleen e0a8ee9
Improve upgrade guide
mkleen 990e506
Fix upgrade guide
mkleen f679906
Return stale entries from cache
mkleen 89bbc7d
Fix upgrade guide
mkleen 358dd43
Fix Arc<str> heapsize test
mkleen 680da7f
Remove const i32 cast from heapsize estimation
mkleen 89b4e34
Fix heapsize estimation for Arc<T>
mkleen 027dbaa
Fix comment in cache_manager
mkleen ba71bde
Fix linter + clippy
mkleen 535352b
Adapt test acording to heapsize estimation changes
mkleen 14d97bd
Always add tableref to partioned files
mkleen 3034387
fixup! Always add tableref to partioned files
mkleen 90264e6
Add table to statistics_cache output
mkleen 06ad6d9
Adopt test to new output
mkleen df2ec4d
Adopt configuration with '0' value
mkleen 8cc9b3d
Update configs.md
mkleen ce0baeb
Add reset after show in slt
mkleen 63fafe4
Extract cache invalidation logic
mkleen b3649dd
Merge branch 'main' into file-stats-cache
mkleen f340fe8
Merge branch 'main' into file-stats-cache
mkleen File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Some comments aren't visible on the classic Files Changed page.
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -21,9 +21,11 @@ use std::mem; | |
| use std::sync::Arc; | ||
|
|
||
| use datafusion_catalog::Session; | ||
| use datafusion_common::{HashMap, Result, ScalarValue, assert_or_internal_err}; | ||
| use datafusion_datasource::ListingTableUrl; | ||
| use datafusion_common::{ | ||
| HashMap, Result, ScalarValue, TableReference, assert_or_internal_err, | ||
| }; | ||
| use datafusion_datasource::PartitionedFile; | ||
| use datafusion_datasource::{FileExtensions, ListingTableUrl}; | ||
| use datafusion_expr::{BinaryExpr, Operator, lit, utils}; | ||
|
|
||
| use arrow::{ | ||
|
|
@@ -382,6 +384,7 @@ fn try_into_partitioned_file( | |
|
|
||
| let mut pf: PartitionedFile = object_meta.into(); | ||
| pf.partition_values = partition_values; | ||
| pf.table_reference.clone_from(table_path.get_table_ref()); | ||
|
|
||
| Ok(Some(pf)) | ||
| } | ||
|
|
@@ -416,8 +419,15 @@ pub async fn pruned_partition_list<'a>( | |
| table_path | ||
| ); | ||
|
|
||
| // if no partition col => simply list all the files | ||
| Ok(objects.map_ok(|object_meta| object_meta.into()).boxed()) | ||
| // if no partition col => list all the files | ||
| Ok(objects | ||
| .try_filter_map(|object_meta| { | ||
| futures::future::ready(object_meta_to_partitioned_file( | ||
| object_meta, | ||
| table_path.get_table_ref(), | ||
|
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Table-reference needs to be always passed on the |
||
| )) | ||
| }) | ||
| .boxed()) | ||
| } else { | ||
| let df_schema = DFSchema::from_unqualified_fields( | ||
| partition_cols | ||
|
|
@@ -442,6 +452,22 @@ pub async fn pruned_partition_list<'a>( | |
| } | ||
| } | ||
|
|
||
| fn object_meta_to_partitioned_file( | ||
| object_meta: ObjectMeta, | ||
| table_ref: &Option<TableReference>, | ||
| ) -> Result<Option<PartitionedFile>> { | ||
| Ok(Some(PartitionedFile { | ||
| object_meta, | ||
| partition_values: vec![], | ||
| range: None, | ||
| statistics: None, | ||
| ordering: None, | ||
| extensions: FileExtensions::new(), | ||
| metadata_size_hint: None, | ||
| table_reference: table_ref.clone(), | ||
| })) | ||
| } | ||
|
|
||
| /// Extract the partition values for the given `file_path` (in the given `table_path`) | ||
| /// associated to the partitions defined by `table_partition_cols` | ||
| pub fn parse_partitions_for_path<'a, I>( | ||
|
|
||
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Uh oh!
There was an error while loading. Please reload this page.