Optimize matching of category escapes (\d, \w, ...) outside character sets

The character class escapes ``\d``, ``\D``, ``\s``, ``\S``, ``\w`` and ``\W`` are currently always compiled to an ``IN`` block containing a single ``CATEGORY`` item, even when they occur outside a character set:

```pycon
>>> re.compile(r'\d', re.DEBUG)
...
 5: IN 4 (to 10)
 7.   CATEGORY UNI_DIGIT
 9.   FAILURE
10: SUCCESS
```

The ``IN`` wrapper costs three extra code words (``IN``, the skip, and ``FAILURE``), an extra dispatch, and an indirect ``SRE(charset)`` call per character, even though the set contains only a single item.

A category escape that appears outside a set can instead be compiled directly to a bare ``CATEGORY`` opcode:

```pycon
>>> re.compile(r'\d', re.DEBUG)
...
 5: CATEGORY UNI_DIGIT
 6: SUCCESS
```
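The code-word accounting can be sketched with a toy emitter. This is only an illustrative model: the opcode names mirror the DEBUG output above, but ``emit_wrapped`` and ``emit_bare`` are hypothetical helpers, not functions from the real compiler.

```python
# Toy model only: opcode names mirror the DEBUG output, but these helpers
# are hypothetical and are not part of the real re compiler.
IN, CATEGORY, FAILURE = "IN", "CATEGORY", "FAILURE"


def emit_wrapped(category):
    """Current form: IN <skip> CATEGORY <category> FAILURE."""
    body = [CATEGORY, category, FAILURE]
    # The skip is counted from the skip word itself, so it covers the
    # skip word plus the body and lands just past FAILURE.
    return [IN, len(body) + 1, *body]


def emit_bare(category):
    """Proposed form: CATEGORY <category>."""
    return [CATEGORY, category]


wrapped = emit_wrapped("UNI_DIGIT")
bare = emit_bare("UNI_DIGIT")
print(wrapped)                   # ['IN', 4, 'CATEGORY', 'UNI_DIGIT', 'FAILURE']
print(bare)                      # ['CATEGORY', 'UNI_DIGIT']
print(len(wrapped) - len(bare))  # 3
```

The difference of three words corresponds to ``IN``, the skip, and ``FAILURE``.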

This also makes such an escape a "simple" repeatable unit, so ``\d+`` uses the ``REPEAT_ONE`` fast path (handled by ``SRE(count)``) instead of the generic ``REPEAT``/``MAX_UNTIL`` loop. A ``CATEGORY`` case is added to ``SRE(count)`` accordingly.
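As a rough Python model of what a ``CATEGORY`` case in a counting loop does (the real ``SRE(count)`` is C code and is not reproduced here), the sketch below counts how many consecutive characters belong to a category, up to a limit. The name ``count_category`` and its parameters are assumptions made for illustration; ``str.isdecimal`` stands in for the Unicode digit category.

```python
def count_category(text, pos, maxcount, in_category):
    """Count consecutive characters starting at `pos` that satisfy
    `in_category`, stopping after at most `maxcount` characters."""
    end = min(len(text), pos + maxcount)
    i = pos
    # One tight loop with a direct predicate call per character,
    # with no per-character set lookup in between.
    while i < end and in_category(text[i]):
        i += 1
    return i - pos


print(count_category("2024-01", 0, 65535, str.isdecimal))  # 4
print(count_category("2024-01", 0, 2, str.isdecimal))      # 2
print(count_category("2024-01", 4, 65535, str.isdecimal))  # 0
```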

The transformation preserves behaviour exactly (the engine already matched the same category) and only changes the compiled byte code.
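One way to spot-check this claim from Python, assuming that an escape inside a one-item set (for example ``[\d]``) still takes the ``IN`` path, is to compare it with the bare escape over a range of code points. This is a sketch, not the test suite from the linked PR; ``differing_chars`` and the ``0x3000`` sampling limit are arbitrary choices made here.

```python
import re

ESCAPES = [r"\d", r"\D", r"\s", r"\S", r"\w", r"\W"]


def differing_chars(escape, flags=0, limit=0x3000):
    """Return code points below `limit` where the bare escape and the
    same escape inside a one-item set disagree."""
    bare = re.compile(escape, flags)
    in_set = re.compile("[" + escape + "]", flags)
    return [
        cp for cp in range(limit)
        if (bare.fullmatch(chr(cp)) is None) != (in_set.fullmatch(chr(cp)) is None)
    ]


# Check both Unicode (default) and ASCII-only category tables.
for flags in (0, re.ASCII):
    for escape in ESCAPES:
        assert differing_chars(escape, flags) == [], escape
print("bare and in-set escapes agree on every sampled code point")
```

Raising ``limit`` to ``sys.maxunicode + 1`` covers every code point at the cost of a longer run.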

In a release build I measure a ~1.3x geometric-mean speedup across a range of category-heavy patterns: roughly 1.7–2.0x on scans like ``\d+``, ``\s+``, ``\S+``, and ~1.1–1.2x on realistic tokenizing, date, and IP-address patterns. Compiled byte code for those patterns is ~20% smaller. Patterns that do not use bare category escapes are unaffected.
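A minimal sketch of how such a comparison could be reproduced with ``timeit`` follows. It is not the benchmark used for the numbers above: the sample text, the pattern list, and the helper names are assumptions, and results will vary by machine and build.

```python
import re
import statistics
import timeit

# Hypothetical sample input; the issue does not say what text was measured.
SAMPLE = "order 12345 shipped 2024-01-31 to 192.168.0.1\t ok\n" * 200
PATTERNS = [r"\d+", r"\s+", r"\S+"]


def time_pattern(pattern, text=SAMPLE, number=20, repeat=5):
    """Best-of-`repeat` seconds per findall() call for one pattern."""
    compiled = re.compile(pattern)
    runs = timeit.repeat(lambda: compiled.findall(text),
                         number=number, repeat=repeat)
    return min(runs) / number


def geometric_mean_speedup(baseline, candidate):
    """Geometric mean of baseline/candidate time ratios per pattern."""
    ratios = [baseline[p] / candidate[p] for p in baseline]
    return statistics.geometric_mean(ratios)


if __name__ == "__main__":
    timings = {p: time_pattern(p) for p in PATTERNS}
    for pattern, seconds in timings.items():
        print(f"{pattern!r}: {seconds * 1e6:.1f} us per call")
```

To compare two builds, run the script under each interpreter, keep the two ``timings`` dictionaries, and pass them to ``geometric_mean_speedup(baseline, candidate)``.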



### Linked PRs
* gh-152035
