Optimize matching of category escapes (\d, \w, ...) outside character sets #152033

Description

@serhiy-storchaka

The character class escapes \d, \D, \s, \S, \w and \W are currently always compiled to an IN block containing a single CATEGORY item, even when they occur outside a character set:

>>> re.compile(r'\d', re.DEBUG)
...
 5: IN 4 (to 10)
 7.   CATEGORY UNI_DIGIT
 9.     FAILURE
10: SUCCESS

The IN wrapper costs three extra code words (IN, the skip count, and FAILURE), an extra opcode dispatch, and an indirect SRE(charset) call per character, even though the set contains only a single item.
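To check what a given interpreter emits for such a pattern, the re.DEBUG output can be captured as text. This is only a sketch: `debug_dump` is a hypothetical helper name, and the exact dump format differs between Python versions.

```python
import contextlib
import io
import re


def debug_dump(pattern: str) -> str:
    """Return the text that re.DEBUG prints while compiling `pattern`.

    Hypothetical helper, not part of the re module.
    """
    buf = io.StringIO()
    # re.DEBUG writes its dump to sys.stdout, so redirect it temporarily.
    with contextlib.redirect_stdout(buf):
        re.compile(pattern, re.DEBUG)
    return buf.getvalue()


dump = debug_dump(r"\d")
print(dump)
# Rough indicator only: whether the dump mentions an IN block at all.
print("IN wrapper present:", any(
    line.split()[:1] == ["IN"] or " IN " in line for line in dump.splitlines()
))
```

On a build without the proposed change the last line should report that an IN block is present; the check is deliberately loose because it relies on the textual dump rather than on the compiled code itself.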

A category escape that appears outside a set can instead be compiled directly to a bare CATEGORY opcode:

>>> re.compile(r'\d', re.DEBUG)
...
 5: CATEGORY UNI_DIGIT
 6: SUCCESS

This also makes such an escape a "simple" repeatable unit, so \d+ uses the REPEAT_ONE fast path (handled by SRE(count)) instead of the generic REPEAT/MAX_UNTIL loop; a CATEGORY case is added to SRE(count) accordingly.

The transformation preserves behaviour exactly (the engine already matched the same category) and only changes the compiled byte code.
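One way to spot-check the claim that behaviour is preserved is to compare each bare escape with the same escape wrapped in a one-item set, which should always match the same characters. The sketch below assumes a sample range of code points (`limit`) chosen arbitrarily; `find_mismatches` is a hypothetical name, and the check is a sanity test rather than a proof.

```python
import re

# The six category escapes named in the issue.
ESCAPES = [r"\d", r"\D", r"\s", r"\S", r"\w", r"\W"]


def find_mismatches(limit: int = 0x3000) -> list:
    """Compare each bare escape with its one-item set form, e.g. \\d vs [\\d].

    Returns (escape, code point) pairs where the two forms disagree.
    `limit` is an arbitrary sample bound, not a value from the issue.
    """
    mismatches = []
    for esc in ESCAPES:
        bare = re.compile(esc)
        in_set = re.compile("[" + esc + "]")
        for cp in range(limit):
            ch = chr(cp)
            bare_hit = bare.fullmatch(ch) is not None
            set_hit = in_set.fullmatch(ch) is not None
            if bare_hit != set_hit:
                mismatches.append((esc, cp))
    return mismatches


print(find_mismatches())
```

An empty list means the two forms agreed on every sampled character. Running the same script before and after the change gives a quick regression check for the bare form.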

In a release build I measure a ~1.3x geometric-mean speedup across a range of category-heavy patterns (roughly 1.7–2.0x on scans like \d+, \s+, \S+, and ~1.1–1.2x on realistic tokenizing, date, and IP-address patterns), together with ~20% smaller compiled byte code for those patterns. Patterns that do not use bare category escapes are unaffected.
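A rough way to take comparable measurements is sketched below. It is not the benchmark behind the numbers above: the sample text, the repeat counts, and the `time_pattern` helper are all assumptions made for illustration. Run the same script on a build with and without the change and compare per-pattern ratios.

```python
import re
import statistics
import timeit

# Scan patterns mentioned in the issue; the input text is a made-up sample.
PATTERNS = [r"\d+", r"\s+", r"\S+"]
TEXT = "abc 12345  def 67890\t" * 1000


def time_pattern(pattern: str, text: str = TEXT,
                 number: int = 20, repeat: int = 3) -> float:
    """Return the best observed seconds per findall() call (hypothetical helper)."""
    compiled = re.compile(pattern)
    runs = timeit.repeat(lambda: compiled.findall(text),
                         number=number, repeat=repeat)
    # Take the minimum to reduce noise from other processes.
    return min(runs) / number


timings = {p: time_pattern(p) for p in PATTERNS}
for pattern, seconds in timings.items():
    print(f"{pattern!r}: {seconds * 1e6:.1f} us per call")

# Geometric mean of the per-pattern times; the speedup between two builds
# is the ratio of their geometric means.
geo = statistics.geometric_mean(timings.values())
print(f"geometric mean: {geo * 1e6:.1f} us per call")
```

Absolute times depend heavily on the machine and build flags, so only ratios between two builds measured on the same machine are meaningful.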
