-
Notifications
You must be signed in to change notification settings - Fork 1.9k
feat: optimise copying in left for Utf8 and LargeUtf8
#19980
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Improve performance to O(1) by eliminating more string copies. Discover a byte offset for the last character for both positive and negative length argument and slice bytes directly. For LargeUtf8, implement a zero-copy slice operation reusing same Arrow buffers. An Arrow view construction helper `shrink_string_view_array_view` is included to this PR (upstream to Arrow crates)
Jefffrey
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this PR. Would be good to see some benchmark results too, see:
Moreover, I find some of the points in the PR body confusing:
Improve performance to O(1) by eliminating more string copies. Discover a byte offset for the last character for both positive and negative length arguments and slice bytes directly.
Could you help me understand which changes here make it O(1)?
For LargeUtf8 (
StringViewArray), implement a zero-copy slice operation reusing the same Arrow buffers. It is possible for both views since the string only shrinks. We only need to tune a German prefix.
LargeUtf8 and Utf8View are different types, so it is confusing to see them used interchangeably.
|
Run benchmarks |
|
🤖 |
|
🤖: Benchmark completed Details
|
|
Run benchmarks |
|
🤖 Hi @theirix, thanks for the request (#19980 (comment)). |
|
Thank you for the review!
It's for memory complexity. We avoid an extra copy of the string into I cannot say about time complexity - it is improved, but not for all queries (
My bad, corrected it in the description. |
Jefffrey
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we'd have to see benchmarks from the microbenchmark of left itself; the benchmark queries from the bot don't use this function I think? (Unless its used internally in joins, etc.)
Agree, here are the results of a more targeted |
|
I think it would be a good idea to enhance the benchmark to add Utf8View case as well; we can simply compare it to Utf8 numbers as the before case. Looks like we got a ~80% improvement on Utf8/LargeUtf8 by simply removing the intermediate |
Good idea, I'll add it shortly.
Indeed! Memory allocation and copying consume time. If we can go with zero copy, we should. |
Which issue does this PR close?
leftfunction #19749.Rationale for this change
A follow-up to an optimisation of the
leftfunction in #19571What changes are included in this PR?
Improve memory performance to O(1) by eliminating more string copies. Discover a byte offset for the last character for both positive and negative length arguments and slice bytes directly.
For
Utf8View(StringViewArray), implement a zero-copy slice operation reusing the same Arrow buffers. It is possible for both views since the string only shrinks. We only need to tune a German prefix.An Arrow view construction helper
shrink_string_view_array_viewis included in this PR. Unfortunately, string view builders cannot provide a way to reuse Arrow buffers. I believe it should better reside in the core Arrow crates instead - I can follow up on it.Are these changes tested?
Are there any user-facing changes?
No