-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Make scan filter push-down idempotent #20003
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
236ef40 to
6697447
Compare
|
Will fix CI soon. |
6697447 to
b3c756b
Compare
Part of apache#19929 Let "t" be a table provider that supports exactly any single filter but not a conjunction. Consider the following optimizer pipeline: 1. Try to push `a = 1, b = 1`. `supports_filters_pushdown` returns [Exact, Inexact] OK: the optimizer records that a = 1 is pushed and creates a filter node for b = 1. ... Another optimization iteration. 2. Try to push `b = 1`. `supports_filters_pushdown` returns [Exact]. Of course, the table provider can't remember all previously pushed filters, so it has no choice but to answer `Exact`. Now, the optimizer thinks the conjunction a = 1 AND b = 1 is supported exactly, but it is not. To prevent this problem, this patch passes filters that were already pushed into the scan earlier to `supports_filters_pushdown`.
Consider the following optimizer-run scenario: 1. `supports_filters_pushdown` returns `Exact` on some filter, e.g. "a = 1", where the column "a" is not required by the query projection. 2. "a" is removed from the table provider projection by "optimize projection" rule. 3. `supports_filters_pushdown` changes a decision and returns `Inexact` on this filter the next time. e.g., input filters are changed and it prefers to use a new one. 4. "a" is not returned to the table provider projection which leads to filter that references a column which is not a part of the input schema. This patch fixes issue introducing the following logic within a filter push-down rule: 1. Collect columns that are not used in the current table provider scan projection, but required for filter expressions. Call it `additional_projection`. 2. If `additional_projection` is empty -- leave logic as is prior the patch. 3. Otherwise extend a table provider projection and wrap a plan with an additional projection node to preserve schema used prior to the rule.
b3c756b to
52cb896
Compare
| // Extend a projection. | ||
| projection.extend(additional_projection); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There should be a tradeoff between fetching an extra column from storage vs the selectivity of the related filter
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems for me that the change only touches correctness of the rule (not performance), here we must extend projection (because filter is explicitly not supported by the provider).
|
|
||
| if additional_projection.is_empty() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The branch of scan is expanding, any thoughts to split the branch
Which issue does this PR close?
Rationale for this change
Non-idempotent rule logic leads to the artifacts described in the issue above if the rule is applied several times.
What changes are included in this PR?
supports_filter_pushdown(...)call.Are these changes tested?
There are unit tests.