gh-152100: Fuse set-operation character classes into a single charset#152214
Open
serhiy-storchaka wants to merge 1 commit into
Open
gh-152100: Fuse set-operation character classes into a single charset#152214serhiy-storchaka wants to merge 1 commit into
serhiy-storchaka wants to merge 1 commit into
Conversation
…harset Add a compile-time optimization pass (Lib/re/_optimizer.py) that rewrites set-operation character classes into a single character set where the engine's charset() representation allows it. charset() treats every NEGATE as a polarity toggle, so a mid-list NEGATE expresses set difference and a flat run expresses union. Set difference -- [A--B], emitted by the parser as A(?<![B]) -- fuses into the charset [NEGATE] B [NEGATE] A, matching A minus B in one test instead of a charset match plus a lookbehind rescan. _optimize_charset is made segment-aware so the interior NEGATE compiles correctly. A union with a non-flat operand, such as [0-9||[a-z--b]], is emitted by the parser as a BRANCH that it cannot merge. Once its alternatives are all one-character matchers, their item lists are concatenated into a single IN. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
A compile-time optimization for the set operations added in gh-152100. No behaviour change -- same matches, fewer engine steps.
The parser lowers set difference
[A--B]toA(?<![B]): a character class followed by a negative lookbehind. But the engine'scharset()walk treats everyNEGATEas a polarity toggle (seeModules/_sre/sre_lib.h), so one character set can express the difference directly --[NEGATE] B [NEGATE] AmatchesAminusBin a single charset test instead of a charset match plus a lookbehind rescan.A new pass in
Lib/re/_optimizer.pyperforms two folds:[A--B](or an explicitA(?<![B])) into the single charset[NEGATE] B [NEGATE] A._optimize_charsetis made segment-aware so the interiorNEGATEcompiles correctly; the parser keeps emitting the plainA(?<![B])form.[0-9||[a-z--b]]is left by the parser as aBRANCHit cannot merge; once its alternatives are all one-character matchers, their item lists are concatenated into oneIN.Besides removing the lookbehind rescan, the fused charset also becomes the
INFOprefix, so the engine can fast-skip non-matching positions. On scan-heavy input this is several times faster (findallover 50 KB of non-matching text, best of 7):On match-dense input the per-match win is smaller, masked by match-object construction.
🤖 Generated with Claude Code