Skip to content

GH-115060: Speed up pathlib.Path.glob() by removing redundant regex matching#115061

Merged
barneygale merged 10 commits intopython:mainfrom
barneygale:gh-115060
Feb 10, 2024
Merged

GH-115060: Speed up pathlib.Path.glob() by removing redundant regex matching#115061
barneygale merged 10 commits intopython:mainfrom
barneygale:gh-115060

Conversation

@barneygale
Copy link
Copy Markdown
Contributor

@barneygale barneygale commented Feb 6, 2024

When expanding and filtering paths for a ** wildcard segment, build an re.Pattern object from the subsequent pattern parts, rather than the entire pattern, and match against the os.DirEntry object prior to instantiating a path object.

Also skip compiling a pattern when expanding a * wildcard segment.

… regex matching

When expanding and filtering paths for a `**` wildcard segment, build an
`re.Pattern` object from the subsequent pattern parts, rather than the
entire pattern.

Also skip compiling a pattern when expanding a `*` wildcard segment.
@barneygale
Copy link
Copy Markdown
Contributor Author

barneygale commented Feb 6, 2024

Notable improvements:

$ ./python -m timeit -s "from pathlib import Path" "list(Path.cwd().glob('*', follow_symlinks=False))"
2000 loops, best of 5: 180 usec per loop  # before
2000 loops, best of 5: 159 usec per loop  # after
# --> 1.13x faster

$ ./python -m timeit -s "from pathlib import Path" "list(Path.cwd().glob('**/*.py', follow_symlinks=False))"
5 loops, best of 5: 54   msec per loop  # before
5 loops, best of 5: 40.9 msec per loop  # after
# --> 1.32x faster

Everything else is about the same.

@zooba
Copy link
Copy Markdown
Member

zooba commented Feb 8, 2024

For whatever reason, every time I try to review this, I struggle to figure out what the change is doing :D

Since it doesn't require changing any test cases, and I know the tests cases are pretty thorough for this area, I don't think there's any reason to not sign off. Maybe trigger a buildbot run with the tag to make sure it doesn't behave strangely on any of those setups - they can occasionally be a bit unusual and find some edge cases.

@barneygale
Copy link
Copy Markdown
Contributor Author

barneygale commented Feb 8, 2024

Thanks Steve.

For whatever reason, every time I try to review this, I struggle to figure out what the change is doing :D

The algorithm might be worthy of a blog post at this point!

The main change is that we now filter partial paths through a regex corresponding to a partial pattern in _select_recursive, rather than complete paths through a regex corresponding to a complete pattern in PathBase.glob(). We can do this because previous parts have already been filtered by _select_children(), and so there's no need to re-filter them.

The secondary change (which includes the addition of _entry_str()) is to match against os.DirEntry.path directly, which allows us to skip construction of path objects for files that don't match.

@zooba
Copy link
Copy Markdown
Member

zooba commented Feb 8, 2024

Okay, today it made sense :) Guess I'm more awake right now. Reading the changes from the bottom up might have helped as well.

Personally, I don't think you can have too many comments in an algorithm like this, particularly when it's recursive and split between a couple of functions. I'll suggest a few comments that would've helped me, but I don't think there are any code changes needed.

Copy link
Copy Markdown
Member

@zooba zooba left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just comments that may help make it more understandable. No changes required

Comment thread Lib/pathlib/_abc.py
Comment thread Lib/pathlib/_abc.py
Comment thread Lib/pathlib/_abc.py
Comment thread Lib/pathlib/__init__.py Outdated
@barneygale barneygale merged commit 6f93b4d into python:main Feb 10, 2024
fsc-eriker pushed a commit to fsc-eriker/cpython that referenced this pull request Feb 14, 2024
… regex matching (python#115061)

When expanding and filtering paths for a `**` wildcard segment, build an `re.Pattern` object from the subsequent pattern parts, rather than the entire pattern, and match against the `os.DirEntry` object prior to instantiating a path object. Also skip compiling a pattern when expanding a `*` wildcard segment.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

performance Performance or resource usage topic-pathlib

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants