Optimize selection of part list during query planning by jawm · Pull Request #85535 · ClickHouse/ClickHouse

jawm · 2025-08-13T10:42:47Z

Changelog category (leave one):

Performance Improvement

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Up to 8x faster SELECT queries with heavy partition pruning on tables with 10K+ parts

Documentation entry for user-facing changes

Benchmark - https://pastila.nl/?003033a7/21217bba3c0e5174038bb4f998d6c930#HxLYLrj0MFz2a283lYl/Ug==

Implementation details

Split data parts lock into read and write
Introduce shared version of active parts and parts ranges for SELECT queries, that is updated on each SELECT if required (each acquire of parts lock for write reset them)

Refs

RFC: Optimize MarkRanges by using devector over deque #86933

Details

We (Cloudflare) initially built out this change on an older checkout of ClickHouse, to help resolve a scaling issue in one of our clusters. This cluster has tables with quite a high part count, and a high level of select concurrency. Over time, we observed an increasing amount of time being spent in query planning, eventually more than half of the query duration.

This was initially observed to be due to lock contention in MergeTreeData, so I made a change that read-only functions could use a shared lock instead of a unique lock. After this it was still visible that a surprising amount of time was being spent copying the parts vector, so I've also made a shared copy of it, which is shared amongst all SELECT queries, until the set of parts is modified, at which point the cached list gets recomputed.

Overall these changes have significantly improved the performance in this cluster, reducing leaf query duration by around two thirds. We recognise that the table was probably not set up entirely in alignment with recommended guidelines, but I think it's still worthwhile applying this change, as it overall only brought improvements for us. In more conventional tables, which do follow guidelines, we observed no negative impact from these changes.

The PR is formed of two commits, which I cherry-picked from our internal checkout, with some conflict-resolution needed too. Each of those brings a separate optimisation as described above, and they can be reviewed individually, although the second commit builds on the first, so it does make sense to review them in order.

victor-perov · 2025-08-13T10:50:17Z

Thank you, James!

Would love to get your feedback, @bharatnc and @alexey-milovidov. Switching to a shared lock demonstrated a significant performance boost in our case.

clickhouse-gh · 2025-08-13T15:06:39Z

Workflow [PR], commit [a9265bb]

Summary: ❌

job_name	test_name	status	info
Stateless tests (amd_tsan, s3 storage, parallel)		failure
	03652_memory_usage_headers	FAIL	cidb
BuzzHouse (amd_debug)		failure
	Buzzing result	failure	cidb

jawm · 2025-08-27T14:19:25Z

Hey @Avogar

I noticed I made a somewhat nuanced conflict with your changes here from last week #83877

I've tried to resolve that in a way I think is valid but I'm open to your feedback on it. Basically I'm taking the columns_and_secondary_indices_sizes_mutex while holding a shared lock of the parts mutex (rather than a unique lock). I then remove the explicit unlock which you'd added.

I think we shouldn't need the unlock since now the parts lock is shared, SELECTs and other read only operations can still make progress while this lock is held, but this would mean INSERTs for example can't progress under that circumstance.

azat · 2025-09-08T21:28:02Z

I did some experiments with it - https://pastila.nl/?010a5e28/09fb8843f199ea5f21e56ff25a3ea8f2#inJq/SZSB0djgkosoErDwg==

And yes it makes queries with lots of parts in the table ~50% faster

I also saw that INSERT became slower, 5-6x, Yes, the INSERT is artificial, it create parts with only one row, but still, maybe we call lockParts where we can call readLockParts instead?

@jawm it will be interesting to see your numbers as well

jawm · 2025-09-09T16:07:03Z

I did some experiments with it - https://pastila.nl/?010a5e28/09fb8843f199ea5f21e56ff25a3ea8f2#inJq/SZSB0djgkosoErDwg==
And yes it makes queries with lots of parts in the table ~50% faster

Thanks for those experiments. I did try to add a benchmark test locally but it seemed like I was going to have to modify the actual framework a bit so I dropped it, but that did show a healthy speedup. Also fwiw you're making 10k parts, these tables on our side have uh, quite a bit more than that (100k) 🙈

I also saw that INSERT became slower, 5-6x, Yes, the INSERT is artificial, it create parts with only one row, but still, maybe we call lockParts where we can call readLockParts instead?

@jawm it will be interesting to see your numbers as well

So we made these changes in two phases

Adding the shared mutex / shared locking of the parts list gave us a ~3x improvement to leaf select durations.
Making a shared copy of the parts list between all selects gave us a further ~40% reduction to leaf select durations

I realize those numbers are extremely vague, but basically it resolved a significant scaling issue we were having.

As for INSERT performance, we do typically insert batches of a million rows so whatever cost we incurred was considered acceptable. I'm looking at a chart of insert durations since January, when we started having this scaling issue, and I see that insert durations have actually barely changed (while we saw SELECT degrade a lot until we made those fixes).

As for lockParts vs readLockParts, I think we must call lockParts, since an INSERT is actually modifying the parts list (by adding new parts), so it would not be safe to do that concurrently with another insert or merge which would also be mutating the parts list at the same time.

IMO the extra cost of an INSERT is in making an extra copy of the parts vector, which actually you've already got happening on every single SELECT query currently. So it basically should just be moving that cost from one place to another. If you had a usecase where there's more inserts than selects then I guess it's a bad change, but for us at least we get a lot more selects. Possibly I am missing something else that adds cost too, lmk what you think

Thanks for the reviews :)

azat · 2025-09-09T16:21:18Z

Thanks for sharing!

As for lockParts vs readLockParts, I think we must call lockParts, since an INSERT is actually modifying the parts list (by adding new parts), so it would not be safe to do that concurrently with another insert or merge which would also be mutating the parts list at the same time.

You are right, I was not suggesting to use readLockParts in INSERT, but more of a find missing places where we can use readLockParts instead, maybe there are some left.

IMO the extra cost of an INSERT is in making an extra copy of the parts vector, which actually you've already got happening on every single SELECT query currently. So it basically should just be moving that cost from one place to another. If you had a usecase where there's more inserts than selects then I guess it's a bad change, but for us at least we get a lot more selects. Possibly I am missing something else that adds cost too, lmk what you think

I doubt that copying vector will slows INSERTs 5x times, maybe there is something else?

jawm · 2025-09-10T12:45:50Z

You are right, I was not suggesting to use readLockParts in INSERT, but more of a find missing places where we can use readLockParts instead, maybe there are some left.

Ah right I see what you mean. For the most part the compiler already forced this; Because I removed const from lockParts, all our non-const functions already had to be changed to take a read lock instead. I've done a quick scan of remaining callsites and perhaps some of them could be changed, but my approach was mainly to keep the diff as small as possible to make it easier to validate and review. I was identifying things to change according to hotspots I was seeing on flamegraphs, and I don't see any remaining problems from remaining calls to lockParts.

I doubt that copying vector will slows INSERTs 5x times, maybe there is something else?

I'm looking now at a flamegraph (CPU) of insert queries from our production cluster and see ~8% of cycles going into this recomputeSharedPartsList. So I could imagine that in a benchmark where you do lots of small inserts to a table which has a lot of parts, that the percentage could easily be quite a lot higher than that, but I've not looked especially closely at that scenario. On that flamegraph I don't see any other significant hotspots on the codepaths which I've modified.
Considering that I previously could see selects spending ~40% of their overall duration just copying this vector, I don't think it's too crazy to think it could explain why we see a large part of insert duration for this particular benchmark doing the same operation.
If you want I can try to dig a bit deeper about what's happening there.

azat · 2025-09-10T13:42:56Z

Thanks! Indeed, looks like copying of parts is the only added overhead, and it is enough to give 5.x slowdown, by improving MarkRanges copying we can improve it - #86933

jawm · 2025-09-12T15:08:00Z

I'm not really sure how to interpret these test failures

test_cancel_freeze seems to fail in the cleanup step to drop the shadow directory. Considering it only fails in one of the sanitizer builds but not of the others, I'm hoping maybe it's just something flaky?

As for the AST fuzzer, it fails on an assertion seemingly in tcp handler code. Nothing in the stack trace looks close to code that I've touched.

…efreshDataParts()

azat · 2025-11-14T09:25:00Z

So:

The INSERT performance restored (now shared parts updated on first SELECT, and each write lock resets it)
Redundant copy of RangesInDataParts has been removed which gives 3.5x performance benefit with 100K parts for simple SELECT (when partition pruning removes everything except one part)

And it is ready to land

azat · 2025-11-14T09:43:51Z

Final results - SELECT became 8x faster (was only 2x in the initial version) - https://pastila.nl/?003033a7/21217bba3c0e5174038bb4f998d6c930#HxLYLrj0MFz2a283lYl/Ug==

azat · 2025-11-14T12:06:32Z

We should also remove shared_ptr<RangesInDataParts>, MergeTreeDataSelectExecutor::filterPartsByPartition should accept const DataPartsVector & not RangesInDataParts, but there are some interface issues that makes it not so simple

jawm · 2025-11-14T14:27:07Z

Woah, that's awesome, thanks for helping to get it over the line :)
I guess we can look forward to a further improvement then with your extra changes

Copilot

Pull request overview

This PR optimizes query planning for SELECT queries on MergeTree tables by splitting the data parts lock into read/write locks and introducing shared cached copies of active parts and part ranges. This reduces lock contention and eliminates redundant copying of part lists during query planning, particularly benefiting tables with high part counts (10K+) under concurrent SELECT load.

Key changes:

Introduced DataPartsSharedLock for read-only access alongside the existing exclusive DataPartsLock
Added cached shared_parts_list and shared_ranges_in_parts that are shared across SELECT queries
Modified lock acquisition throughout codebase to use shared locks where write access is not needed

Reviewed changes

Copilot reviewed 18 out of 18 changed files in this pull request and generated 6 comments.

Show a summary per file

File	Description
src/Storages/MergeTree/MergeTreeData.h	Added `DataPartsSharedLock`, `DataPartsAnyLock` classes and changed mutex from `std::mutex` to `SharedMutex`
src/Storages/MergeTree/MergeTreeData.cpp	Implemented lock classes, added `getPossiblySharedVisibleDataPartsRanges()` method, converted many lock acquisitions to shared locks
src/Storages/StorageReplicatedMergeTree.cpp	Changed lock acquisition calls from `lockParts()` to `readLockParts()` where appropriate
src/Storages/StorageMergeTree.cpp	Changed lock acquisition calls from `lockParts()` to `readLockParts()` where appropriate
src/Storages/MergeTree/MergeTreeDataSelectExecutor.h	Changed function signatures to accept `RangesInDataPartsPtr` and return filtered parts instead of mutating input
src/Storages/MergeTree/MergeTreeDataSelectExecutor.cpp	Refactored filtering functions to avoid mutation, work with const references
src/Processors/QueryPlan/ReadFromMergeTree.h	Changed `prepared_parts` field type to `RangesInDataPartsPtr`
src/Processors/QueryPlan/ReadFromMergeTree.cpp	Updated to work with shared pointer types and const references
tests/queries/0_stateless/02340_parts_refcnt_mergetree.sh	Updated expected refcount from 6 to 8 to account for additional references
tests/integration/test_cancel_freeze/test.py	Added sanitizer skip and fixed rm command flag
src/Common/ProfileEvents.cpp	Added new profile events for tracking shared lock metrics

Copilot · 2025-12-24T16:43:46Z

+
+            /// Convert read to write lock, and re-check the shared_parts_list
+            shared_lock_holder.reset();
+            lock_holder.emplace(data_parts_mutex, /*data_=*/ nullptr);


Passing nullptr for the data parameter in DataPartsLock constructor means the destructor will not reset shared_parts_list and shared_ranges_in_parts. This could leave stale cached data after an exclusive lock is acquired and released. The data parameter should be this instead of nullptr.

Suggested change

lock_holder.emplace(data_parts_mutex, /*data_=*/ nullptr);

lock_holder.emplace(data_parts_mutex, /*data_=*/ this);

Comment is useless, but the place is correct - #93022

Imagine the following scenario: T1: TRUNCATE TABLE T2: OPTIMIZE TABLE lockParts() # exclusive lock! shared_parts_list = nullptr getPossiblySharedVisibleDataPartsRanges() shared_lock_holder = readLockParts() # shared lock, shared_parts_list == nullptr auto parts = getDataPartsVectorForInternalUsage({DataPartState::Active}, *shared_lock_holder); shared_lock_holder.reset(); lockParts() # exclusive lock remove parts from list shared_parts_list = nullptr lock_holder.emplace(data_parts_mutex, /*data_=*/ nullptr); # exclusive lock! shared_ranges_in_parts = std::make_shared<const RangesInDataParts>(parts); shared_parts_list = std::make_shared<const DataPartsVector>(std::move(parts)); It will lead to T2 will cache outdated parts in `shared_parts_list`. And I think this is why `01443_merge_truncate_long` started to fail Fixes: ClickHouse#92098 Fixes: ClickHouse#85535

jawm changed the title ~~Jawm/readlock~~ Optimize selection of part list during query planning Aug 13, 2025

jawm force-pushed the jawm/readlock branch from 8c6f290 to af67c64 Compare August 13, 2025 13:30

alexey-milovidov added the can be tested Allows running workflows for external contributors label Aug 13, 2025

clickhouse-gh Bot added the pr-performance Pull request with some performance improvements label Aug 13, 2025

jawm force-pushed the jawm/readlock branch 4 times, most recently from e1b5cf2 to fc97360 Compare August 14, 2025 12:52

nickitat self-assigned this Aug 19, 2025

jawm force-pushed the jawm/readlock branch 2 times, most recently from 943d18e to 03217db Compare August 27, 2025 14:12

jawm force-pushed the jawm/readlock branch from 03217db to 2ff1596 Compare August 28, 2025 14:03

nickitat reviewed Aug 28, 2025

View reviewed changes

Comment thread src/Storages/MergeTree/MergeTreeData.cpp

Comment thread src/Storages/MergeTree/MergeTreeData.cpp Outdated

Comment thread src/Storages/MergeTree/MergeTreeData.h Outdated

jawm force-pushed the jawm/readlock branch from 2ff1596 to 00ec85f Compare September 8, 2025 14:15

azat reviewed Sep 8, 2025

View reviewed changes

Comment thread src/Storages/MergeTree/MergeTreeData.cpp Outdated

azat reviewed Sep 10, 2025

View reviewed changes

Comment thread src/Processors/QueryPlan/ReadFromMergeTree.cpp Outdated

azat mentioned this pull request Sep 10, 2025

RFC: Optimize MarkRanges by using devector over deque #86933

Merged

azat reviewed Sep 10, 2025

View reviewed changes

Comment thread tests/queries/0_stateless/02340_parts_refcnt_mergetree.sh Outdated

jawm force-pushed the jawm/readlock branch 2 times, most recently from 786f601 to e3c9d0a Compare September 11, 2025 12:58

azat added 9 commits November 13, 2025 15:49

Fix some variable names in MergeTreeDataSelectExecutor

2e3d2ed

Return order of args for preparePartForCommit()

b15c5ce

Update some comments

7da3641

Polish MergeTreeData::getPossiblySharedVisibleDataPartsRanges()

3c45e61

Avoid holding data parts lock during loading part in MergeTreeData::r…

9bc62fa

…efreshDataParts()

Fix typo

f507e39

Fix race in MergeTreeData::getPossiblySharedVisibleDataPartsRanges()

07f9c38

Revert changes in tests since shared parts updated on SELECT now

822e9f6

Fix obtaining parts for SELECT in transactions

a9265bb

azat enabled auto-merge November 14, 2025 09:33

azat added this pull request to the merge queue Nov 14, 2025

Merged via the queue into ClickHouse:master with commit 44a10a9 Nov 14, 2025
127 of 130 checks passed

robot-ch-test-poll added the pr-synced-to-cloud The PR is synced to the cloud repo label Nov 14, 2025

Avogar mentioned this pull request Nov 21, 2025

Crash in MergeTreeData::getConditionSelectivityEstimator #90626

Closed

azat mentioned this pull request Nov 22, 2025

Fix potential crash when executing mutations with subqueries and allow_statistics_optimize=1 #90664

Merged

c-end mentioned this pull request Dec 9, 2025

Experimental part pruning by partition ID prefix #91805

Open

1 task

azat requested a review from Copilot December 24, 2025 16:42

Copilot AI reviewed Dec 24, 2025

View reviewed changes

azat mentioned this pull request Dec 24, 2025

Flaky test: 01443_merge_truncate_long #92098

Closed

azat mentioned this pull request Dec 24, 2025

Fix possible usage of outdated parts due to TOCTOU race for shared parts #93022

Merged

azat mentioned this pull request Jan 26, 2026

Try to improve JSON column size lazy calculation #83877

Merged

1 task

alexey-milovidov mentioned this pull request Feb 15, 2026

Fix missing move of query_metadata_cache in BlockIO::operator= #96995

Merged

azat mentioned this pull request Feb 23, 2026

Minor reduction of parts lock scope #97727

Merged

	lock_holder.emplace(data_parts_mutex, /data_=/ nullptr);
	lock_holder.emplace(data_parts_mutex, /data_=/ this);

Conversation

jawm commented Aug 13, 2025 • edited by azat Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Changelog category (leave one):

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Documentation entry for user-facing changes

Implementation details

Refs

Details

Uh oh!

victor-perov commented Aug 13, 2025

Uh oh!

clickhouse-gh Bot commented Aug 13, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jawm commented Aug 27, 2025

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

azat commented Sep 8, 2025

Uh oh!

jawm commented Sep 9, 2025

Uh oh!

azat commented Sep 9, 2025

Uh oh!

jawm commented Sep 10, 2025

Uh oh!

Uh oh!

azat commented Sep 10, 2025

Uh oh!

Uh oh!

jawm commented Sep 12, 2025

Uh oh!

azat commented Nov 14, 2025

Uh oh!

azat commented Nov 14, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

azat commented Nov 14, 2025

Uh oh!

jawm commented Nov 14, 2025

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Copilot AI Dec 24, 2025

Choose a reason for hiding this comment

Uh oh!

azat Dec 24, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

8 participants

jawm commented Aug 13, 2025 •

edited by azat

Loading

clickhouse-gh Bot commented Aug 13, 2025 •

edited

Loading

azat commented Nov 14, 2025 •

edited

Loading