Skip to content

gh-149079: Fix O(n^2) canonical ordering in unicodedata.normalize()#149080

Open
sethmlarson wants to merge 1 commit intopython:mainfrom
sethmlarson:quadratic-unicodedata-normalize
Open

gh-149079: Fix O(n^2) canonical ordering in unicodedata.normalize()#149080
sethmlarson wants to merge 1 commit intopython:mainfrom
sethmlarson:quadratic-unicodedata-normalize

Conversation

@sethmlarson
Copy link
Copy Markdown
Contributor

@sethmlarson sethmlarson commented Apr 27, 2026

Replace the insertion sort used for canonical ordering of combining characters with a hybrid approach: insertion sort for short runs (< 20) and counting sort for longer runs, reducing worst-case complexity from O(n^2) to O(n). This prevents denial of service via crafted Unicode strings with many combining characters with a large number of inversions in combing class order.

…ze()

Replace the insertion sort used for canonical ordering of combining
characters with a hybrid approach: insertion sort for short runs (< 20)
and counting sort for longer runs, reducing worst-case complexity from
O(n^2) to O(n). This prevents denial of service via crafted Unicode
strings with many combining characters in alternating CCC order.

Co-authored-by: Seokchan Yoon <13852925+ch4n3-yoon@users.noreply.github.com>
@StanFromIreland
Copy link
Copy Markdown
Member

Reviewers: Note that there are pending changes from previous reviews.

self.assertEqual(self.db.normalize('NFC', a), b)

def test_long_combining_mark_run(self):
# GH-XXXXX: avoid quadratic canonical ordering.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

-        # GH-XXXXX: avoid quadratic canonical ordering.
+        # gh-149079: avoid quadratic canonical ordering.

self.assertEqual(self.db.normalize("NFKC", payload), nfc)

def test_combining_mark_run_fast_paths(self):
# GH-XXXXX: cover short runs and already-sorted long runs.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

-        # GH-XXXXX: cover short runs and already-sorted long runs.
+        # gh-149079: cover short runs and already-sorted long runs.

Comment thread Modules/unicodedata.c

if (run_length > sortbuflen) {
Py_UCS4 *new_sortbuf = PyMem_Realloc(sortbuf,
run_length * sizeof(Py_UCS4));
Copy link
Copy Markdown
Contributor

@maurycy maurycy Apr 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe PyMem_Resize instead of calculating manually?

cpython/Include/pymem.h

Lines 58 to 60 in 005555a

* or NULL if the request was too large or memory allocation failed. Use
* these macros rather than doing the multiplication yourself so that proper
* overflow checking is always done.

@serhiy-storchaka serhiy-storchaka self-requested a review April 27, 2026 22:16
Copy link
Copy Markdown
Member

@serhiy-storchaka serhiy-storchaka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a potential for optimization, but in general LGTM. 👍

Comment thread Modules/unicodedata.c
Py_ssize_t i, o, osize;
int kind;
const void *data;
int input_kind, result_kind;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not reuse the same variable?

Comment thread Modules/unicodedata.c
data = PyUnicode_DATA(result);
result_kind = PyUnicode_KIND(result);
result_data = PyUnicode_DATA(result);
length = PyUnicode_GET_LENGTH(result);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is the same as o.

@serhiy-storchaka
Copy link
Copy Markdown
Member

Ideas for optimization:

  • We already have the Py_UCS4 output buffer. It is better to sort it, without using more costly PyUnicode_READ and PyUnicode_WRITE.
  • It is perhaps possible to combine sorting routines with the code that determines the length. This will reduce the number of costly _getrecord_ex() calls but requires heavy rewriting.
  • Since Unicode characters only need 21 bits of 32, they can be combined with 8-bit combining in the temporary buffer, reducing the number of costly _getrecord_ex() calls. But this will make the code more difficult to read.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants