Skip to content

Speed up multiline regexes anchored by ^. #148762

@haampie

Description

@haampie

Feature or enhancement

Proposal:

Very similar to #148729. I noticed that for regexes of the form

re.compile("^foo", re.MULTILINE)

there's still a character by character loop calling SRE(match).

The SRE(match) function call is expensive and can be avoided by scanning for newline characters first. The newline character scan can be vectorized in the UCS1 case by calling memchr for additional speedup.

A patch would look like this:

diff --git a/Modules/_sre/sre_lib.h b/Modules/_sre/sre_lib.h
index 4f1269988b9..11d341ea2ab 100644
--- a/Modules/_sre/sre_lib.h
+++ b/Modules/_sre/sre_lib.h
@@ -1869,12 +1869,42 @@ SRE(search)(SRE_STATE* state, SRE_CODE* pattern)
             state->start = state->ptr = ptr = end;
             return 0;
         }
-        while (status == 0 && ptr < end) {
-            ptr++;
-            RESET_CAPTURE_GROUP();
-            TRACE(("|%p|%p|SEARCH\n", pattern, ptr));
-            state->start = state->ptr = ptr;
-            status = SRE(match)(state, pattern, 0);
+        if (pattern[0] == SRE_OP_AT &&
+            pattern[1] == SRE_AT_BEGINNING_LINE)
+        {
+            /* Skip to line boundary */
+            while (status == 0 && ptr < end) {
+                ptr++;
+                if (!SRE_IS_LINEBREAK((int) ptr[-1])) {
+#if SIZEOF_SRE_CHAR == 1
+                    ptr = (SRE_CHAR *)memchr(ptr, '\n', end - ptr);
+                    if (!ptr) {
+                        break;
+                    }
+#else
+                    while (ptr < end && !SRE_IS_LINEBREAK((int) *ptr)) {
+                        ptr++;
+                    }
+                    if (ptr >= end) {
+                        break;
+                    }
+#endif
+                    ptr++;
+                }
+                RESET_CAPTURE_GROUP();
+                TRACE(("|%p|%p|SEARCH\n", pattern, ptr));
+                state->start = state->ptr = ptr;
+                status = SRE(match)(state, pattern, 0);
+            }
+        }
+        else {
+            while (status == 0 && ptr < end) {
+                ptr++;
+                RESET_CAPTURE_GROUP();
+                TRACE(("|%p|%p|SEARCH\n", pattern, ptr));
+                state->start = state->ptr = ptr;
+                status = SRE(match)(state, pattern, 0);
+            }
         }
     }
 

Has this already been discussed elsewhere?

This is a minor feature, which does not need previous discussion elsewhere

Links to previous discussion of this feature:

No response

Linked PRs

Metadata

Metadata

Assignees

No one assigned

    Labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions