gh-149079: Fix O(n^2) canonical ordering in unicodedata.normalize() by sethmlarson · Pull Request #149080 · python/cpython

sethmlarson · 2026-04-27T21:38:23Z

Replace the insertion sort used for canonical ordering of combining characters with a hybrid approach: insertion sort for short runs (< 20) and counting sort for longer runs, reducing worst-case complexity from O(n^2) to O(n). This prevents denial of service via crafted Unicode strings that contain long runs of combining characters with a large number of inversions in canonical combining class (CCC) order.
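
For illustration, the hybrid approach described above can be sketched in Python. This is a minimal sketch under stated assumptions, not the C code from the patch: the function names are hypothetical, and only the threshold of 20 comes from the description. Both branches are stable, which canonical ordering requires.

```python
import unicodedata

SHORT_RUN = 20  # threshold quoted in the PR description


def insertion_sort_by_ccc(run):
    """Stable insertion sort on combining class; quadratic in the worst case."""
    out = list(run)
    for i in range(1, len(out)):
        ch = out[i]
        ccc = unicodedata.combining(ch)
        j = i - 1
        # Shift only strictly larger classes so equal classes keep their order.
        while j >= 0 and unicodedata.combining(out[j]) > ccc:
            out[j + 1] = out[j]
            j -= 1
        out[j + 1] = ch
    return out


def counting_sort_by_ccc(run):
    """Stable counting sort on combining class (0..255); linear in len(run)."""
    counts = [0] * 256
    for ch in run:
        counts[unicodedata.combining(ch)] += 1
    # Turn the per-class counts into starting offsets in the output.
    offset = 0
    for ccc in range(256):
        count = counts[ccc]
        counts[ccc] = offset
        offset += count
    out = [None] * len(run)
    for ch in run:
        ccc = unicodedata.combining(ch)
        out[counts[ccc]] = ch
        counts[ccc] += 1
    return out


def reorder_run(run):
    """Reorder one run of combining characters (hypothetical helper)."""
    if len(run) < SHORT_RUN:
        return insertion_sort_by_ccc(run)
    return counting_sort_by_ccc(run)
```

The counting sort makes a fixed number of passes over the run plus one pass over the 256 possible class values, so its cost does not depend on how many inversions the input contains.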

Issue: O(n²) insertion sort in unicodedata.normalize("NFC") canonical ordering #149079

…ze()

Replace the insertion sort used for canonical ordering of combining characters with a hybrid approach: insertion sort for short runs (< 20) and counting sort for longer runs, reducing worst-case complexity from O(n^2) to O(n). This prevents denial of service via crafted Unicode strings with many combining characters in alternating CCC order.

Co-authored-by: Seokchan Yoon <13852925+ch4n3-yoon@users.noreply.github.com>

StanFromIreland · 2026-04-27T21:46:49Z

Reviewers: Note that there are pending changes from previous reviews.

maurycy · 2026-04-27T22:04:21Z

        self.assertEqual(self.db.normalize('NFC', a), b)

+    def test_long_combining_mark_run(self):
+        # GH-XXXXX: avoid quadratic canonical ordering.


-        # GH-XXXXX: avoid quadratic canonical ordering.
+        # gh-149079: avoid quadratic canonical ordering.

maurycy · 2026-04-27T22:04:51Z

+        self.assertEqual(self.db.normalize("NFKC", payload), nfc)
+
+    def test_combining_mark_run_fast_paths(self):
+        # GH-XXXXX: cover short runs and already-sorted long runs.


-        # GH-XXXXX: cover short runs and already-sorted long runs.
+        # gh-149079: cover short runs and already-sorted long runs.
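
As a rough sketch of the kind of input a long-run test could exercise: the helper names and the choice of U+0301 and U+0316 below are assumptions for illustration, not taken from the PR's tests. Each pair places a class-230 mark before a class-220 mark, so every pair is out of canonical order.

```python
import unicodedata

ACUTE = "\u0301"        # COMBINING ACUTE ACCENT, combining class 230
GRAVE_BELOW = "\u0316"  # COMBINING GRAVE ACCENT BELOW, combining class 220


def make_payload(pairs):
    """Return a base letter followed by `pairs` out-of-order mark pairs."""
    return "a" + (ACUTE + GRAVE_BELOW) * pairs


def expected_nfd(pairs):
    """Stable canonical order: all class-220 marks, then all class-230 marks."""
    return "a" + GRAVE_BELOW * pairs + ACUTE * pairs


# The result must be the same whichever sorting strategy the C code uses.
payload = make_payload(1_000)
assert unicodedata.normalize("NFD", payload) == expected_nfd(1_000)
```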

maurycy · 2026-04-27T22:06:19Z

+
+        if (run_length > sortbuflen) {
+            Py_UCS4 *new_sortbuf = PyMem_Realloc(sortbuf,
+                                                 run_length * sizeof(Py_UCS4));


Maybe PyMem_Resize instead of calculating manually?

cpython/Include/pymem.h, lines 58 to 60 in 005555a:

 * or NULL if the request was too large or memory allocation failed.  Use
 * these macros rather than doing the multiplication yourself so that proper
 * overflow checking is always done.

serhiy-storchaka

There is a potential for optimization, but in general LGTM. 👍

serhiy-storchaka · 2026-04-28T10:43:51Z

    Py_ssize_t i, o, osize;
-    int kind;
-    const void *data;
+    int input_kind, result_kind;


Why not reuse the same variable?

serhiy-storchaka · 2026-04-28T10:44:45Z

-    data = PyUnicode_DATA(result);
+    result_kind = PyUnicode_KIND(result);
+    result_data = PyUnicode_DATA(result);
+    length = PyUnicode_GET_LENGTH(result);


It is the same as o.

serhiy-storchaka · 2026-04-28T10:59:57Z

Ideas for optimization:

We already have the Py_UCS4 output buffer. It would be better to sort that buffer directly, avoiding the more costly PyUnicode_READ and PyUnicode_WRITE.
It is perhaps possible to combine the sorting routines with the code that determines the length. This would reduce the number of costly _getrecord_ex() calls, but requires heavy rewriting.
Since Unicode characters need only 21 of the 32 bits, each can be packed together with its 8-bit combining class in the temporary buffer, reducing the number of costly _getrecord_ex() calls. But this would make the code more difficult to read.
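
The last idea can be sketched in Python as follows. This is only an illustration of the bit layout, assuming the combining class is stored in the bits directly above the 21-bit code point; the names are hypothetical and the real change would be in C.

```python
import unicodedata

CODE_BITS = 21                    # code points go up to U+10FFFF, which fits in 21 bits
CODE_MASK = (1 << CODE_BITS) - 1


def pack(ch):
    """Store the combining class in the bits above the 21-bit code point."""
    return (unicodedata.combining(ch) << CODE_BITS) | ord(ch)


def ccc_of(packed):
    """Recover the combining class without another database lookup."""
    return packed >> CODE_BITS


def char_of(packed):
    """Recover the original character."""
    return chr(packed & CODE_MASK)


def reorder_packed(run):
    packed = [pack(ch) for ch in run]  # one combining-class lookup per character
    packed.sort(key=ccc_of)            # list.sort is stable; compares classes only
    return "".join(char_of(p) for p in packed)
```

Sorting on the whole packed value would also order characters by code point within a class, which canonical ordering does not do, so the sketch compares the class bits only.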

sethmlarson requested review from malemburg and tim-one April 27, 2026 21:38

bedevere-app Bot added the awaiting review label Apr 27, 2026

sethmlarson added the type-security and topic-unicode labels and removed the awaiting review label Apr 27, 2026

bedevere-app Bot mentioned this pull request Apr 27, 2026

O(n²) insertion sort in unicodedata.normalize("NFC") canonical ordering #149079

Open

maurycy reviewed Apr 27, 2026

serhiy-storchaka self-requested a review April 27, 2026 22:16

serhiy-storchaka approved these changes Apr 28, 2026

bedevere-app Bot added the awaiting merge label Apr 28, 2026

sethmlarson wants to merge 1 commit into python:main from sethmlarson:quadratic-unicodedata-normalize

sethmlarson commented Apr 27, 2026 • edited by tim-one

5 participants
