-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Enable parquet filter pushdown by default #19477
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
run benchmarks |
|
🤖 |
|
🤖: Benchmark completed Details
|
|
What a 🎄 🎁 ! I see mostly speedups, vastly outweighing the slowdowns, but there are still slowdowns... any idea what those cases have in common? Any way I can help brainstorm solutions? |
Thanks @adriangb -- I am not working full time the next few days but I am hoping to work on this particular item as it is something I desparately want to be done with (and I find it fun!) I spent some time analyzing where the time is going. I haven't written up my notes yet, but I will do so (probably tomorrow as I have a long train ride). I found a few places to improve, but nothing that will obviously get back all the time (yet). I'll kep at it |
# Which issue does this PR close? - RElated to apache/datafusion#19477 # Rationale for this change While profiling apache/datafusion#19477 I noticed some additional clones we could avoid <img width="1504" height="927" alt="Screenshot 2025-12-26 at 12 03 00 PM" src="https://github.com/user-attachments/assets/958f76e2-53d0-4008-b224-d9275984cd1a" /> I doubt this will be a huge deal but it does remove some allocations int he parquet read path # What changes are included in this PR? Use `into_data` rather than `to_data` # Are these changes tested? <!-- We typically require tests for all PRs in order to: 1. Prevent the code from being accidentally broken by subsequent changes 2. Serve as another way to document the expected behavior of the code If tests are not included in your PR, please explain why (for example, are they covered by existing tests)? --> # Are there any user-facing changes? <!-- If there are user-facing changes then we may require documentation to be updated before approving the PR. If there are any breaking changes to public APIs, please call them out. -->
|
I also wrote up some notes about current status and plan here: |
|
run benchmark tpch |
|
🤖 |
|
run benchmark tpcds |
|
🤖: Benchmark completed Details
|
|
🤖 |
|
Benchmark script failed with exit code 1. Last 10 lines of output: Click to expand |
This doesn't look great.... |
|
run benchmark tpch |
|
🤖 |
|
🤖: Benchmark completed Details
|
# Which issue does this PR close? <!-- We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. --> - related to apache/datafusion#19477 # Rationale for this change While working on apache/datafusion#19477, and profiling ClickBench q7, I noticed that the RowSelectors was being cloned to resolve the strategy -- for a large number of selections this is expensive and shows up in the traces <img width="1724" height="1074" alt="Screenshot 2025-12-28 at 4 49 49 PM" src="https://github.com/user-attachments/assets/72c6fd22-9377-48ef-ba80-6bc03b177cf7" /> ```shell samply record -- ./datafusion-cli-alamb_enable_pushdown -f q.sql > /dev/null 2>& ``` We should change the code to avoid cloning the RowSelectors when resolving the strategy. # Changes Don't clone / allocate while resolving the strategy. I don't expect this to have a massive impact, but it did show up in the profile FYI @hhhizzz -- perhaps you could review this PR # Are these changes tested? Yes by CI # Are there any user-facing changes? small performance improvement
# Which issue does this PR close? - related to apache/datafusion#19477 # Rationale for this change I am tracking down allocations in various parts of the parquet reader (to remove them) and I ran across this in the cached reader. # What changes are included in this PR? Use `Arc::clone` to make it clear there is no deep cloning going on . I don't expect this will have any impact on actual performance # Are these changes tested? By CI # Are there any user-facing changes? <!-- If there are user-facing changes then we may require documentation to be updated before approving the PR. If there are any breaking changes to public APIs, please call them out. -->
|
Given that the tpch queries show a clear slowdown I am going to try profiling them and see if I can find any additional issues. I will start with a query that is relatively simple (small number of joins): |
Which issue does this PR close?
filter_pushdown) by default #3463Rationale for this change
I am testing where we currently are with performance
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?