Skip to content

Optimize selection of part list during query planning#85535

Merged
azat merged 30 commits into
ClickHouse:masterfrom
jawm:jawm/readlock
Nov 14, 2025
Merged

Optimize selection of part list during query planning#85535
azat merged 30 commits into
ClickHouse:masterfrom
jawm:jawm/readlock

Conversation

@jawm
Copy link
Copy Markdown
Contributor

@jawm jawm commented Aug 13, 2025

Changelog category (leave one):

  • Performance Improvement

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Up to 8x faster SELECT queries with heavy partition pruning on tables with 10K+ parts

Documentation entry for user-facing changes

Benchmark - https://pastila.nl/?003033a7/21217bba3c0e5174038bb4f998d6c930#HxLYLrj0MFz2a283lYl/Ug==

Implementation details

  • Split data parts lock into read and write
  • Introduce shared version of active parts and parts ranges for SELECT queries, that is updated on each SELECT if required (each acquire of parts lock for write reset them)

Refs

Details

We (Cloudflare) initially built out this change on an older checkout of ClickHouse, to help resolve a scaling issue in one of our clusters. This cluster has tables with quite a high part count, and a high level of select concurrency. Over time, we observed an increasing amount of time being spent in query planning, eventually more than half of the query duration.

This was initially observed to be due to lock contention in MergeTreeData, so I made a change that read-only functions could use a shared lock instead of a unique lock. After this it was still visible that a surprising amount of time was being spent copying the parts vector, so I've also made a shared copy of it, which is shared amongst all SELECT queries, until the set of parts is modified, at which point the cached list gets recomputed.

Overall these changes have significantly improved the performance in this cluster, reducing leaf query duration by around two thirds. We recognise that the table was probably not set up entirely in alignment with recommended guidelines, but I think it's still worthwhile applying this change, as it overall only brought improvements for us. In more conventional tables, which do follow guidelines, we observed no negative impact from these changes.

The PR is formed of two commits, which I cherry-picked from our internal checkout, with some conflict-resolution needed too. Each of those brings a separate optimisation as described above, and they can be reviewed individually, although the second commit builds on the first, so it does make sense to review them in order.

@jawm jawm changed the title Jawm/readlock Optimize selection of part list during query planning Aug 13, 2025
@victor-perov
Copy link
Copy Markdown

Thank you, James!

Would love to get your feedback, @bharatnc and @alexey-milovidov. Switching to a shared lock demonstrated a significant performance boost in our case.

@alexey-milovidov alexey-milovidov added the can be tested Allows running workflows for external contributors label Aug 13, 2025
@clickhouse-gh
Copy link
Copy Markdown
Contributor

clickhouse-gh Bot commented Aug 13, 2025

Workflow [PR], commit [a9265bb]

Summary:

job_name test_name status info comment
Stateless tests (amd_tsan, s3 storage, parallel) failure
03652_memory_usage_headers FAIL cidb
BuzzHouse (amd_debug) failure
Buzzing result failure cidb

@clickhouse-gh clickhouse-gh Bot added the pr-performance Pull request with some performance improvements label Aug 13, 2025
@jawm jawm force-pushed the jawm/readlock branch 4 times, most recently from e1b5cf2 to fc97360 Compare August 14, 2025 12:52
@nickitat nickitat self-assigned this Aug 19, 2025
@jawm jawm force-pushed the jawm/readlock branch 2 times, most recently from 943d18e to 03217db Compare August 27, 2025 14:12
@jawm
Copy link
Copy Markdown
Contributor Author

jawm commented Aug 27, 2025

Hey @Avogar

I noticed I made a somewhat nuanced conflict with your changes here from last week #83877

I've tried to resolve that in a way I think is valid but I'm open to your feedback on it. Basically I'm taking the columns_and_secondary_indices_sizes_mutex while holding a shared lock of the parts mutex (rather than a unique lock). I then remove the explicit unlock which you'd added.

I think we shouldn't need the unlock since now the parts lock is shared, SELECTs and other read only operations can still make progress while this lock is held, but this would mean INSERTs for example can't progress under that circumstance.

Comment thread src/Storages/MergeTree/MergeTreeData.cpp
Comment thread src/Storages/MergeTree/MergeTreeData.cpp Outdated
Comment thread src/Storages/MergeTree/MergeTreeData.h Outdated
Comment thread src/Storages/MergeTree/MergeTreeData.cpp Outdated
@azat
Copy link
Copy Markdown
Member

azat commented Sep 8, 2025

I did some experiments with it - https://pastila.nl/?010a5e28/09fb8843f199ea5f21e56ff25a3ea8f2#inJq/SZSB0djgkosoErDwg==

And yes it makes queries with lots of parts in the table ~50% faster

I also saw that INSERT became slower, 5-6x, Yes, the INSERT is artificial, it create parts with only one row, but still, maybe we call lockParts where we can call readLockParts instead?

@jawm it will be interesting to see your numbers as well

@jawm
Copy link
Copy Markdown
Contributor Author

jawm commented Sep 9, 2025

I did some experiments with it - https://pastila.nl/?010a5e28/09fb8843f199ea5f21e56ff25a3ea8f2#inJq/SZSB0djgkosoErDwg==
And yes it makes queries with lots of parts in the table ~50% faster

Thanks for those experiments. I did try to add a benchmark test locally but it seemed like I was going to have to modify the actual framework a bit so I dropped it, but that did show a healthy speedup. Also fwiw you're making 10k parts, these tables on our side have uh, quite a bit more than that (100k) 🙈

I also saw that INSERT became slower, 5-6x, Yes, the INSERT is artificial, it create parts with only one row, but still, maybe we call lockParts where we can call readLockParts instead?

@jawm it will be interesting to see your numbers as well

So we made these changes in two phases

  1. Adding the shared mutex / shared locking of the parts list gave us a ~3x improvement to leaf select durations.
  2. Making a shared copy of the parts list between all selects gave us a further ~40% reduction to leaf select durations

I realize those numbers are extremely vague, but basically it resolved a significant scaling issue we were having.

As for INSERT performance, we do typically insert batches of a million rows so whatever cost we incurred was considered acceptable. I'm looking at a chart of insert durations since January, when we started having this scaling issue, and I see that insert durations have actually barely changed (while we saw SELECT degrade a lot until we made those fixes).

As for lockParts vs readLockParts, I think we must call lockParts, since an INSERT is actually modifying the parts list (by adding new parts), so it would not be safe to do that concurrently with another insert or merge which would also be mutating the parts list at the same time.

IMO the extra cost of an INSERT is in making an extra copy of the parts vector, which actually you've already got happening on every single SELECT query currently. So it basically should just be moving that cost from one place to another. If you had a usecase where there's more inserts than selects then I guess it's a bad change, but for us at least we get a lot more selects. Possibly I am missing something else that adds cost too, lmk what you think

Thanks for the reviews :)

@azat
Copy link
Copy Markdown
Member

azat commented Sep 9, 2025

Thanks for sharing!

As for lockParts vs readLockParts, I think we must call lockParts, since an INSERT is actually modifying the parts list (by adding new parts), so it would not be safe to do that concurrently with another insert or merge which would also be mutating the parts list at the same time.

You are right, I was not suggesting to use readLockParts in INSERT, but more of a find missing places where we can use readLockParts instead, maybe there are some left.

IMO the extra cost of an INSERT is in making an extra copy of the parts vector, which actually you've already got happening on every single SELECT query currently. So it basically should just be moving that cost from one place to another. If you had a usecase where there's more inserts than selects then I guess it's a bad change, but for us at least we get a lot more selects. Possibly I am missing something else that adds cost too, lmk what you think

I doubt that copying vector will slows INSERTs 5x times, maybe there is something else?

@jawm
Copy link
Copy Markdown
Contributor Author

jawm commented Sep 10, 2025

You are right, I was not suggesting to use readLockParts in INSERT, but more of a find missing places where we can use readLockParts instead, maybe there are some left.

Ah right I see what you mean. For the most part the compiler already forced this; Because I removed const from lockParts, all our non-const functions already had to be changed to take a read lock instead. I've done a quick scan of remaining callsites and perhaps some of them could be changed, but my approach was mainly to keep the diff as small as possible to make it easier to validate and review. I was identifying things to change according to hotspots I was seeing on flamegraphs, and I don't see any remaining problems from remaining calls to lockParts.

I doubt that copying vector will slows INSERTs 5x times, maybe there is something else?

I'm looking now at a flamegraph (CPU) of insert queries from our production cluster and see ~8% of cycles going into this recomputeSharedPartsList. So I could imagine that in a benchmark where you do lots of small inserts to a table which has a lot of parts, that the percentage could easily be quite a lot higher than that, but I've not looked especially closely at that scenario. On that flamegraph I don't see any other significant hotspots on the codepaths which I've modified.
Considering that I previously could see selects spending ~40% of their overall duration just copying this vector, I don't think it's too crazy to think it could explain why we see a large part of insert duration for this particular benchmark doing the same operation.
If you want I can try to dig a bit deeper about what's happening there.

Comment thread src/Processors/QueryPlan/ReadFromMergeTree.cpp Outdated
@azat
Copy link
Copy Markdown
Member

azat commented Sep 10, 2025

Thanks! Indeed, looks like copying of parts is the only added overhead, and it is enough to give 5.x slowdown, by improving MarkRanges copying we can improve it - #86933

Comment thread tests/queries/0_stateless/02340_parts_refcnt_mergetree.sh Outdated
@jawm jawm force-pushed the jawm/readlock branch 2 times, most recently from 786f601 to e3c9d0a Compare September 11, 2025 12:58
@jawm
Copy link
Copy Markdown
Contributor Author

jawm commented Sep 12, 2025

I'm not really sure how to interpret these test failures

test_cancel_freeze seems to fail in the cleanup step to drop the shadow directory. Considering it only fails in one of the sanitizer builds but not of the others, I'm hoping maybe it's just something flaky?

As for the AST fuzzer, it fails on an assertion seemingly in tcp handler code. Nothing in the stack trace looks close to code that I've touched.

@azat
Copy link
Copy Markdown
Member

azat commented Nov 14, 2025

So:

  • The INSERT performance restored (now shared parts updated on first SELECT, and each write lock resets it)
  • Redundant copy of RangesInDataParts has been removed which gives 3.5x performance benefit with 100K parts for simple SELECT (when partition pruning removes everything except one part)

And it is ready to land

@azat azat enabled auto-merge November 14, 2025 09:33
@azat azat added this pull request to the merge queue Nov 14, 2025
@azat
Copy link
Copy Markdown
Member

azat commented Nov 14, 2025

Final results - SELECT became 8x faster (was only 2x in the initial version) - https://pastila.nl/?003033a7/21217bba3c0e5174038bb4f998d6c930#HxLYLrj0MFz2a283lYl/Ug==

Merged via the queue into ClickHouse:master with commit 44a10a9 Nov 14, 2025
127 of 130 checks passed
@robot-ch-test-poll robot-ch-test-poll added the pr-synced-to-cloud The PR is synced to the cloud repo label Nov 14, 2025
@azat
Copy link
Copy Markdown
Member

azat commented Nov 14, 2025

We should also remove shared_ptr<RangesInDataParts>, MergeTreeDataSelectExecutor::filterPartsByPartition should accept const DataPartsVector & not RangesInDataParts, but there are some interface issues that makes it not so simple

@jawm
Copy link
Copy Markdown
Contributor Author

jawm commented Nov 14, 2025

Woah, that's awesome, thanks for helping to get it over the line :)
I guess we can look forward to a further improvement then with your extra changes

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR optimizes query planning for SELECT queries on MergeTree tables by splitting the data parts lock into read/write locks and introducing shared cached copies of active parts and part ranges. This reduces lock contention and eliminates redundant copying of part lists during query planning, particularly benefiting tables with high part counts (10K+) under concurrent SELECT load.

Key changes:

  • Introduced DataPartsSharedLock for read-only access alongside the existing exclusive DataPartsLock
  • Added cached shared_parts_list and shared_ranges_in_parts that are shared across SELECT queries
  • Modified lock acquisition throughout codebase to use shared locks where write access is not needed

Reviewed changes

Copilot reviewed 18 out of 18 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
src/Storages/MergeTree/MergeTreeData.h Added DataPartsSharedLock, DataPartsAnyLock classes and changed mutex from std::mutex to SharedMutex
src/Storages/MergeTree/MergeTreeData.cpp Implemented lock classes, added getPossiblySharedVisibleDataPartsRanges() method, converted many lock acquisitions to shared locks
src/Storages/StorageReplicatedMergeTree.cpp Changed lock acquisition calls from lockParts() to readLockParts() where appropriate
src/Storages/StorageMergeTree.cpp Changed lock acquisition calls from lockParts() to readLockParts() where appropriate
src/Storages/MergeTree/MergeTreeDataSelectExecutor.h Changed function signatures to accept RangesInDataPartsPtr and return filtered parts instead of mutating input
src/Storages/MergeTree/MergeTreeDataSelectExecutor.cpp Refactored filtering functions to avoid mutation, work with const references
src/Processors/QueryPlan/ReadFromMergeTree.h Changed prepared_parts field type to RangesInDataPartsPtr
src/Processors/QueryPlan/ReadFromMergeTree.cpp Updated to work with shared pointer types and const references
tests/queries/0_stateless/02340_parts_refcnt_mergetree.sh Updated expected refcount from 6 to 8 to account for additional references
tests/integration/test_cancel_freeze/test.py Added sanitizer skip and fixed rm command flag
src/Common/ProfileEvents.cpp Added new profile events for tracking shared lock metrics

Comment thread src/Storages/MergeTree/MergeTreeData.h
Comment thread src/Storages/MergeTree/MergeTreeData.h

/// Convert read to write lock, and re-check the shared_parts_list
shared_lock_holder.reset();
lock_holder.emplace(data_parts_mutex, /*data_=*/ nullptr);
Copy link

Copilot AI Dec 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Passing nullptr for the data parameter in DataPartsLock constructor means the destructor will not reset shared_parts_list and shared_ranges_in_parts. This could leave stale cached data after an exclusive lock is acquired and released. The data parameter should be this instead of nullptr.

Suggested change
lock_holder.emplace(data_parts_mutex, /*data_=*/ nullptr);
lock_holder.emplace(data_parts_mutex, /*data_=*/ this);

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment is useless, but the place is correct - #93022

Comment thread tests/queries/0_stateless/02340_parts_refcnt_mergetree.sh
Comment thread src/Storages/MergeTree/MergeTreeData.cpp
Comment thread src/Processors/QueryPlan/ReadFromMergeTree.cpp
azat added a commit to azat/ClickHouse that referenced this pull request Dec 24, 2025
Imagine the following scenario:

    T1: TRUNCATE TABLE                       T2: OPTIMIZE TABLE

          lockParts() # exclusive lock!
            shared_parts_list = nullptr
                                               getPossiblySharedVisibleDataPartsRanges()
                                                 shared_lock_holder = readLockParts() # shared lock, shared_parts_list == nullptr
                                                 auto parts = getDataPartsVectorForInternalUsage({DataPartState::Active}, *shared_lock_holder);
                                                 shared_lock_holder.reset();
          lockParts() # exclusive lock
            remove parts from list
            shared_parts_list = nullptr
                                                 lock_holder.emplace(data_parts_mutex, /*data_=*/ nullptr); # exclusive lock!
                                                 shared_ranges_in_parts = std::make_shared<const RangesInDataParts>(parts);
                                                 shared_parts_list = std::make_shared<const DataPartsVector>(std::move(parts));

It will lead to T2 will cache outdated parts in `shared_parts_list`.

And I think this is why `01443_merge_truncate_long` started to fail

Fixes: ClickHouse#92098
Fixes: ClickHouse#85535
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

can be tested Allows running workflows for external contributors pr-performance Pull request with some performance improvements pr-synced-to-cloud The PR is synced to the cloud repo

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants