Optimize selection of part list during query planning#85535
Conversation
|
Thank you, James! Would love to get your feedback, @bharatnc and @alexey-milovidov. Switching to a shared lock demonstrated a significant performance boost in our case. |
|
Workflow [PR], commit [a9265bb] Summary: ❌
|
e1b5cf2 to
fc97360
Compare
943d18e to
03217db
Compare
|
Hey @Avogar I noticed I made a somewhat nuanced conflict with your changes here from last week #83877 I've tried to resolve that in a way I think is valid but I'm open to your feedback on it. Basically I'm taking the I think we shouldn't need the unlock since now the parts lock is shared, SELECTs and other read only operations can still make progress while this lock is held, but this would mean INSERTs for example can't progress under that circumstance. |
|
I did some experiments with it - https://pastila.nl/?010a5e28/09fb8843f199ea5f21e56ff25a3ea8f2#inJq/SZSB0djgkosoErDwg== And yes it makes queries with lots of parts in the table ~50% faster I also saw that INSERT became slower, 5-6x, Yes, the INSERT is artificial, it create parts with only one row, but still, maybe we call @jawm it will be interesting to see your numbers as well |
Thanks for those experiments. I did try to add a benchmark test locally but it seemed like I was going to have to modify the actual framework a bit so I dropped it, but that did show a healthy speedup. Also fwiw you're making 10k parts, these tables on our side have uh, quite a bit more than that (100k) 🙈
So we made these changes in two phases
I realize those numbers are extremely vague, but basically it resolved a significant scaling issue we were having. As for INSERT performance, we do typically insert batches of a million rows so whatever cost we incurred was considered acceptable. I'm looking at a chart of insert durations since January, when we started having this scaling issue, and I see that insert durations have actually barely changed (while we saw SELECT degrade a lot until we made those fixes). As for IMO the extra cost of an INSERT is in making an extra copy of the parts vector, which actually you've already got happening on every single SELECT query currently. So it basically should just be moving that cost from one place to another. If you had a usecase where there's more inserts than selects then I guess it's a bad change, but for us at least we get a lot more selects. Possibly I am missing something else that adds cost too, lmk what you think Thanks for the reviews :) |
|
Thanks for sharing!
You are right, I was not suggesting to use
I doubt that copying vector will slows INSERTs 5x times, maybe there is something else? |
Ah right I see what you mean. For the most part the compiler already forced this; Because I removed
I'm looking now at a flamegraph (CPU) of insert queries from our production cluster and see ~8% of cycles going into this |
|
Thanks! Indeed, looks like copying of parts is the only added overhead, and it is enough to give 5.x slowdown, by improving |
786f601 to
e3c9d0a
Compare
|
I'm not really sure how to interpret these test failures
As for the AST fuzzer, it fails on an assertion seemingly in tcp handler code. Nothing in the stack trace looks close to code that I've touched. |
…efreshDataParts()
|
So:
And it is ready to land |
|
Final results - SELECT became 8x faster (was only 2x in the initial version) - https://pastila.nl/?003033a7/21217bba3c0e5174038bb4f998d6c930#HxLYLrj0MFz2a283lYl/Ug== |
44a10a9
|
We should also remove |
|
Woah, that's awesome, thanks for helping to get it over the line :) |
There was a problem hiding this comment.
Pull request overview
This PR optimizes query planning for SELECT queries on MergeTree tables by splitting the data parts lock into read/write locks and introducing shared cached copies of active parts and part ranges. This reduces lock contention and eliminates redundant copying of part lists during query planning, particularly benefiting tables with high part counts (10K+) under concurrent SELECT load.
Key changes:
- Introduced
DataPartsSharedLockfor read-only access alongside the existing exclusiveDataPartsLock - Added cached
shared_parts_listandshared_ranges_in_partsthat are shared across SELECT queries - Modified lock acquisition throughout codebase to use shared locks where write access is not needed
Reviewed changes
Copilot reviewed 18 out of 18 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| src/Storages/MergeTree/MergeTreeData.h | Added DataPartsSharedLock, DataPartsAnyLock classes and changed mutex from std::mutex to SharedMutex |
| src/Storages/MergeTree/MergeTreeData.cpp | Implemented lock classes, added getPossiblySharedVisibleDataPartsRanges() method, converted many lock acquisitions to shared locks |
| src/Storages/StorageReplicatedMergeTree.cpp | Changed lock acquisition calls from lockParts() to readLockParts() where appropriate |
| src/Storages/StorageMergeTree.cpp | Changed lock acquisition calls from lockParts() to readLockParts() where appropriate |
| src/Storages/MergeTree/MergeTreeDataSelectExecutor.h | Changed function signatures to accept RangesInDataPartsPtr and return filtered parts instead of mutating input |
| src/Storages/MergeTree/MergeTreeDataSelectExecutor.cpp | Refactored filtering functions to avoid mutation, work with const references |
| src/Processors/QueryPlan/ReadFromMergeTree.h | Changed prepared_parts field type to RangesInDataPartsPtr |
| src/Processors/QueryPlan/ReadFromMergeTree.cpp | Updated to work with shared pointer types and const references |
| tests/queries/0_stateless/02340_parts_refcnt_mergetree.sh | Updated expected refcount from 6 to 8 to account for additional references |
| tests/integration/test_cancel_freeze/test.py | Added sanitizer skip and fixed rm command flag |
| src/Common/ProfileEvents.cpp | Added new profile events for tracking shared lock metrics |
|
|
||
| /// Convert read to write lock, and re-check the shared_parts_list | ||
| shared_lock_holder.reset(); | ||
| lock_holder.emplace(data_parts_mutex, /*data_=*/ nullptr); |
There was a problem hiding this comment.
Passing nullptr for the data parameter in DataPartsLock constructor means the destructor will not reset shared_parts_list and shared_ranges_in_parts. This could leave stale cached data after an exclusive lock is acquired and released. The data parameter should be this instead of nullptr.
| lock_holder.emplace(data_parts_mutex, /*data_=*/ nullptr); | |
| lock_holder.emplace(data_parts_mutex, /*data_=*/ this); |
There was a problem hiding this comment.
Comment is useless, but the place is correct - #93022
Imagine the following scenario:
T1: TRUNCATE TABLE T2: OPTIMIZE TABLE
lockParts() # exclusive lock!
shared_parts_list = nullptr
getPossiblySharedVisibleDataPartsRanges()
shared_lock_holder = readLockParts() # shared lock, shared_parts_list == nullptr
auto parts = getDataPartsVectorForInternalUsage({DataPartState::Active}, *shared_lock_holder);
shared_lock_holder.reset();
lockParts() # exclusive lock
remove parts from list
shared_parts_list = nullptr
lock_holder.emplace(data_parts_mutex, /*data_=*/ nullptr); # exclusive lock!
shared_ranges_in_parts = std::make_shared<const RangesInDataParts>(parts);
shared_parts_list = std::make_shared<const DataPartsVector>(std::move(parts));
It will lead to T2 will cache outdated parts in `shared_parts_list`.
And I think this is why `01443_merge_truncate_long` started to fail
Fixes: ClickHouse#92098
Fixes: ClickHouse#85535
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
Up to 8x faster
SELECTqueries with heavy partition pruning on tables with 10K+ partsDocumentation entry for user-facing changes
Benchmark - https://pastila.nl/?003033a7/21217bba3c0e5174038bb4f998d6c930#HxLYLrj0MFz2a283lYl/Ug==
Implementation details
SELECTqueries, that is updated on eachSELECTif required (each acquire of parts lock for write reset them)Refs
Details
We (Cloudflare) initially built out this change on an older checkout of ClickHouse, to help resolve a scaling issue in one of our clusters. This cluster has tables with quite a high part count, and a high level of select concurrency. Over time, we observed an increasing amount of time being spent in query planning, eventually more than half of the query duration.
This was initially observed to be due to lock contention in MergeTreeData, so I made a change that read-only functions could use a shared lock instead of a unique lock. After this it was still visible that a surprising amount of time was being spent copying the parts vector, so I've also made a shared copy of it, which is shared amongst all SELECT queries, until the set of parts is modified, at which point the cached list gets recomputed.
Overall these changes have significantly improved the performance in this cluster, reducing leaf query duration by around two thirds. We recognise that the table was probably not set up entirely in alignment with recommended guidelines, but I think it's still worthwhile applying this change, as it overall only brought improvements for us. In more conventional tables, which do follow guidelines, we observed no negative impact from these changes.
The PR is formed of two commits, which I cherry-picked from our internal checkout, with some conflict-resolution needed too. Each of those brings a separate optimisation as described above, and they can be reviewed individually, although the second commit builds on the first, so it does make sense to review them in order.