Commit bd4bd3e

gh-152100: Support set operations in character classes (GH-152153)
Implement set difference [A--B], intersection [A&&B] and union [A||B] in regular expression character classes (Unicode Technical Standard #18), including nested, complemented and compound set operands. Symmetric difference [A~~B] remains reserved.

Also use the new syntax in the standard library (_strptime, textwrap, doctest, pkgutil).

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
1 parent a6c2d4a commit bd4bd3e

9 files changed

Lines changed: 324 additions & 162 deletions

Doc/library/re.rst

Lines changed: 34 additions & 12 deletions
@@ -279,25 +279,47 @@ The special characters are:
 ``[]()[{}]`` will match a right bracket, as well as left bracket, braces,
 and parentheses.
 
-.. .. index:: single: --; in regular expressions
-.. .. index:: single: &&; in regular expressions
-.. .. index:: single: ~~; in regular expressions
-.. .. index:: single: ||; in regular expressions
-
-* Support of nested sets and set operations as in `Unicode Technical
-  Standard #18`_ might be added in the future. This would change the
-  syntax, so to facilitate this change a :exc:`FutureWarning` will be raised
-  in ambiguous cases for the time being.
-  That includes sets starting with a literal ``'['`` or containing literal
-  character sequences ``'--'``, ``'&&'``, ``'~~'``, and ``'||'``. To
-  avoid a warning escape them with a backslash.
+.. index::
+   single: --; in regular expressions
+   single: &&; in regular expressions
+   single: ||; in regular expressions
+
+* A character set may contain a nested set written in square brackets, and
+  two sets may be combined with a set operator, as in `Unicode Technical
+  Standard #18`_:
+
+  * ``[A--B]`` (*difference*) matches a character that is in *A* but not
+    in *B*; for example ``[a-z--[aeiou]]`` matches an ASCII lowercase
+    consonant.
+  * ``[A&&B]`` (*intersection*) matches a character that is in both *A*
+    and *B*; for example ``[\w&&[a-z]]`` matches an ASCII lowercase letter.
+  * ``[A||B]`` (*union*) matches a character that is in *A* or in *B*; this
+    is the same as listing the members of both sets in a single set, but
+    allows combining nested sets.
+
+  Operators have no precedence and are applied from left to right. To
+  group, write a nested set as the operand after an operator, as in
+  ``[a-z--[aeiou]]``. A leading ``'^'`` complements the whole result.
+  A ``'['`` begins a nested set only immediately after a set operator;
+  anywhere else -- including at the start of a character set -- it is an
+  ordinary character, so existing patterns keep their meaning. Escape it
+  as ``'\['`` to include a literal ``'['`` right after an operator.
 
 .. _Unicode Technical Standard #18: https://unicode.org/reports/tr18/
 
+.. note::
+
+   Symmetric difference (``A~~B``) is not yet supported; a literal ``'~~'``
+   in a character set still raises a :exc:`FutureWarning`.
+
 .. versionchanged:: 3.7
    :exc:`FutureWarning` is raised if a character set contains constructs
    that will change semantically in the future.
 
+.. versionchanged:: next
+   Added support for nested sets and the set operators ``--``, ``&&``
+   and ``||``.
+
 .. index:: single: | (vertical bar); in regular expressions
 
 ``|``
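
The evaluation rule documented in this hunk (no precedence, operators applied left to right) can be modelled with ordinary Python sets. This is only a sketch of the described semantics, not the `re` implementation: `apply_set_ops` is a hypothetical helper introduced here for illustration, and its operands are plain strings of characters rather than parsed character classes.

```python
import string


def apply_set_ops(first, *steps):
    """Fold (operator, operand) pairs over `first`, strictly left to right.

    Hypothetical helper: it models the documented semantics with Python
    sets. It does not parse patterns and is not part of the re module.
    """
    result = set(first)
    for op, operand in steps:
        chars = set(operand)
        if op == "--":      # difference: in result but not in operand
            result -= chars
        elif op == "&&":    # intersection: in both
            result &= chars
        elif op == "||":    # union: in either
            result |= chars
        else:
            raise ValueError(f"unsupported set operator: {op!r}")
    return result


# Models [a-z--[aeiou]]: the ASCII lowercase consonants.
consonants = apply_set_ops(string.ascii_lowercase, ("--", "aeiou"))

# No precedence: "abc" -- "a" || "a" is evaluated as ((abc -- a) || a),
# so the final union puts 'a' back.
left_to_right = apply_set_ops("abc", ("--", "a"), ("||", "a"))
```

Because the fold has no precedence, grouping has to be expressed by computing the right-hand operand first, which is the role the documentation assigns to a nested set written after an operator.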

Doc/whatsnew/3.16.rst

Lines changed: 12 additions & 0 deletions
@@ -181,6 +181,18 @@ os
 (Contributed by Maurycy Pawłowski-Wieroński in :gh:`149464`.)
 
 
+re
+--
+
+* :mod:`re` now supports set operations and nested sets in character classes,
+  as described in `Unicode Technical Standard #18
+  <https://unicode.org/reports/tr18/>`__: set difference (``[A--B]``),
+  intersection (``[A&&B]``) and union (``[A||B]``), where an operand may be a
+  nested set written in square brackets. For example, ``[a-z--[aeiou]]``
+  matches an ASCII lowercase consonant.
+  (Contributed by Serhiy Storchaka in :gh:`152100`.)
+
+
 shlex
 -----
 

Lib/_strptime.py

Lines changed: 1 addition & 1 deletion
@@ -238,7 +238,7 @@ def __calc_date_time(self):
 current_format = current_format.replace(tz, "%Z")
 # Transform all non-ASCII digits to digits in range U+0660 to U+0669.
 if not current_format.isascii() and self.LC_alt_digits is None:
-    current_format = re_sub(r'\d(?<![0-9])',
+    current_format = re_sub(r'[\d--0-9]',
                             lambda m: chr(0x0660 + int(m[0])),
                             current_format)
 for old, new in replacement_pairs:
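
For comparison, the replaced pattern can be exercised on interpreters that predate this commit. The sketch below assumes the new `[\d--0-9]` is meant to match exactly what the old `\d(?<![0-9])` matched, namely any Unicode decimal digit other than ASCII `0`-`9`. The wrapper `to_arabic_indic` is hypothetical and does not exist in `_strptime`.

```python
import re

# Pre-change pattern from the diff: a Unicode decimal digit that is not 0-9.
NON_ASCII_DIGIT = re.compile(r'\d(?<![0-9])')


def to_arabic_indic(text):
    """Map each non-ASCII decimal digit into U+0660..U+0669.

    Hypothetical wrapper around the substitution shown in the hunk.
    """
    return NON_ASCII_DIGIT.sub(lambda m: chr(0x0660 + int(m[0])), text)


# DEVANAGARI DIGIT THREE (U+0969) is rewritten; the ASCII '1' is left alone.
converted = to_arabic_indic("1\u0969")
```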

Lib/doctest.py

Lines changed: 1 addition & 1 deletion
@@ -1768,7 +1768,7 @@ def check_output(self, want, got, optionflags):
 '', want)
 # If a line in got contains only spaces, then remove the
 # spaces.
-got = re.sub(r'(?m)^[^\S\n]+$', '', got)
+got = re.sub(r'(?m)^[\s--\n]+$', '', got)
 if got == want:
     return True
 
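
The intent of this substitution can be checked with the pre-change pattern, which runs on existing interpreters. This sketch assumes `[\s--\n]` is intended as a more direct spelling of the double negative `[^\S\n]`, that is, any whitespace character except a newline. The sample string is made up for illustration.

```python
import re

# Pre-change pattern from the diff: a line consisting only of whitespace
# characters other than newline.
WHITESPACE_ONLY_LINE = re.compile(r'(?m)^[^\S\n]+$')

# Illustrative input: two whitespace-only lines and one line with
# trailing spaces after real text.
got = "a\n   \n\t\nb  \n"
cleaned = WHITESPACE_ONLY_LINE.sub('', got)
```

Only lines made up entirely of spaces or tabs are emptied; trailing whitespace after visible text is kept, because the match is anchored at the start of the line.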

Lib/pkgutil.py

Lines changed: 1 addition & 1 deletion
@@ -443,7 +443,7 @@ def resolve_name(name, *, strict=False):
 within the imported package to get to the desired object.
 """
 global _LENIENT_PATTERN, _STRICT_PATTERN
-dotted_words = r'(?!\d)(\w+)(\.(?!\d)(\w+))*'
+dotted_words = r'([\w--\d]\w*)(\.([\w--\d]\w*))*'
 if strict:
     if _STRICT_PATTERN is None:
         _STRICT_PATTERN = re.compile(