CI: add Linux x86_64 clang ASAN+LSAN job #30800
Conversation
Neat, thanks for working on this! I'll try to take a close look at this next week sometime.
Sorry for dropping this! I'll look this over today.
```shell
`find numpy -name "test*.py" | xargs grep -E -l "import threading|ThreadPoolExecutor" | tr '\n' ' '` \
  -- -v -s --timeout=600 --durations=10
```
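As a quick local illustration of what that selection pipeline does (the scratch directory and file names below are made up for the demo):

```shell
# Create a scratch tree with one threaded and one non-threaded test file
d=$(mktemp -d)
printf 'import threading\n' > "$d/test_threads.py"
printf 'x = 1\n' > "$d/test_plain.py"

# Same pipeline as in the workflow: keep only test files that use threading
selected=$(find "$d" -name "test*.py" \
  | xargs grep -E -l "import threading|ThreadPoolExecutor" \
  | tr '\n' ' ')
echo "$selected"  # only test_threads.py is selected
```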
`gcc_ASAN:`
Sorry for taking two weeks to notice this.
For a while we were using the GCC compiler sanitizers in CI but turned them off after encountering mysterious flaky failures. Perhaps things have improved since then?
If I were setting this up from scratch on Linux, I'd use an LLVM toolchain.
I have a vague feeling that the tests will also run faster under LLVM's ASan implementation. Of course it's apples-to-oranges because of the different CPU and OS, but the Mac ASan builds using LLVM run the tests in 8.5 minutes while the GCC ASan tests take 15 minutes.
Thankfully, we have a neat fix for this already. @nascheme set up ASan docker images that have a CPython build using an LLVM toolchain, see https://github.com/nascheme/cpython_sanity. You should be able to follow the setup of the docker images for the TSan tests, which also use the cpython_sanity images and an LLVM toolchain. This also nicely sidesteps the need to build CPython from source under TSan, and the docker images already have a build environment installed.
I haven't seen flakiness running gcc, but in any case moving to clang reduces run time a bit, mostly numpy build time; see this workflow run in my fork.
I can move to clang in this PR if this is the preferred option.
As for the ASan docker image, it is not usable without further investigation into the leaks reported when running the free-threaded Python build, which is the only one available in those images, as mentioned in my 3rd remark in the description: "Trying to use ghcr.io/nascheme/numpy-asan:3.14t shows many leaks and at this point, I did not check any of those and felt it was better to get something working reliably with the GIL-enabled interpreter rather than nothing."
I'm trying to narrow down the tests causing those leaks, but it will take weeks given the time I can spend on this in my spare time.
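One way to do that narrowing (a sketch of the `__lsan_do_recoverable_leak_check` idea mentioned in the PR description; this assumes a Linux build, and it only does anything useful under an ASan/LSan-instrumented interpreter) is to trigger a recoverable leak check between test files via ctypes:

```python
import ctypes

def lsan_recoverable_leak_check():
    """Trigger LeakSanitizer's recoverable leak check. Returns nonzero if
    leaks were found since the last check, or None when the interpreter is
    not running under ASan/LSan (i.e. a regular CPython build)."""
    try:
        runtime = ctypes.CDLL(None)  # look up symbols already loaded in-process
        fn = getattr(runtime, "__lsan_do_recoverable_leak_check")
    except (OSError, AttributeError):
        return None  # sanitizer runtime not present
    fn.restype = ctypes.c_int
    return fn()

# Calling this between test files (e.g. from a pytest fixture) would attribute
# leaks to the file that ran since the previous check.
result = lsan_recoverable_leak_check()
```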
> which is the only one available in those images as mentioned in my 3rd remark in the description
Sorry, missed this.
I'd like to fix these issues on the free-threaded build, but no worries if you're not interested.
> I can move to clang in this PR if this is the preferred option.
Let's do that.
> Sorry, missed this.

no worries

> I'd like to fix these issues on the free-threaded build, but no worries if you're not interested.

I'll try to post the list of test files that show leaks even if I can't narrow it down to individual tests just yet. This first pass should be faster and can help further investigation by others. Maybe this is better posted as a new issue in the bug tracker.

> Let's do that.

Done. Rebased on main and added clang-21 installation.
Force-pushed from 8c96728 to fc808db
This now comes in at 25 minutes, which is only about a minute longer than our longest QEMU-based test jobs. I think there are plenty of opportunities to speed this up and I'm OK with adding another long test job - particularly one that will help us avoid memory leaks in CI.

I agree, I think this is awesome to have on regular CI, thanks for all the work!
ngoldbaum left a comment
I looked over the CI logs and didn't see anything besides clang-21 adding some compiler warnings, which isn't something to worry about here. One question inline.
```yaml
- name: Test
  run: |
    # pass -s to pytest to see ASAN errors and warnings, otherwise pytest captures them
    export ASAN_OPTIONS="detect_leaks=1:symbolize=1:strict_init_order=true:allocator_may_return_null=1:use_sigaltstack=0"
```
First time I've seen `use_sigaltstack` or `strict_init_order`. Should we maybe be enabling these in other sanitizer jobs?
I've never used them myself before and never saw them elsewhere either. They come from a grep on ASAN_OPTIONS in the numpy code base, which led me to the following line:
numpy/pixi-packages/asan/pixi.toml, line 16 in 089ceb7
Given it seemed to be used for the linux-x64 build in there, I copied the same options into the new job.
Agreed, sounds like a good idea to me - I don't see a problem with one more heavy job. Thanks for doing this.
rgommers left a comment
Overall LGTM, approving - just two small suggestions
```yaml
- name: Install clang 21
  run: |
    curl -fsSLo /tmp/llvm.sh https://apt.llvm.org/llvm.sh
```
This looks a bit questionable, if acceptable. We should replace it as soon as Clang 21 packages are available from Ubuntu though; it looks like that's in the 26.04 image, which should arrive in a month or two. Maybe add a comment?
Comment added.

> This looks a bit questionable, if acceptable.

What kind of issue are you worried about?
I don't want to speak for Ralf, but I guess just depending on upstream binaries instead of system binaries is a little odd. Maybe also the security implications of downloading and executing binaries like this?
Yes, just a general security principle. The more random downloads there are, with no pinning or hashes, the wider the trust circle. If it's the best option for right now, then sure it's not a major issue. But when we can replace it with trusting Ubuntu packages, we should.
I don't think this can be considered a major supply-chain security risk given the source (not so random: the upstream LLVM repos) and the fact that this is only a test job which does not expose sensitive secrets and does not produce any artefacts. It could have been better to manually add the repo and the public key rather than download an unchecked script, but I did not consider the risk enough to warrant the extra complexity.
I agree however that this can count as an example of a bad habit regarding supply-chain security risk.
@nascheme published some new images including a GIL-enabled ASAN python build. See #30905 for an update of this job. This moves the supply chain security risk to what's been accepted for the TSAN job.
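For reference, a hypothetical version of the step that adds the repo and key manually instead of piping `llvm.sh` might look like this (the URLs are the standard apt.llvm.org ones; this narrows trust to the LLVM signing key but still isn't hash-pinned):

```yaml
- name: Install clang 21 (pinned repo, sketch)
  run: |
    # Fetch and dearmor the LLVM apt signing key explicitly
    curl -fsSL https://apt.llvm.org/llvm-snapshot.gpg.key \
      | sudo gpg --dearmor -o /usr/share/keyrings/llvm.gpg
    # Register the repo for the runner's Ubuntu release (noble assumed here)
    echo "deb [signed-by=/usr/share/keyrings/llvm.gpg] http://apt.llvm.org/noble/ llvm-toolchain-noble-21 main" \
      | sudo tee /etc/apt/sources.list.d/llvm.list
    sudo apt-get update && sudo apt-get install -y clang-21
```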
Yep, agreed with all your thoughts - thanks for the follow-up!
I wonder if it would be good to change my docker images to use Ubuntu 25 rather than 24. The 25 release comes with Clang 21 as an Ubuntu package. For testing with TSAN or ASAN, perhaps it's better to be running on the newest stable OS version that's supported.
That does seem useful indeed, newer Clang versions can help. But no hurry to change it right now on account of NumPy.
```yaml
- name: Build Python with address sanitizer
  run: |
    CONFIGURE_OPTS="--with-address-sanitizer" pyenv install 3.14
```
The line numbers in the suppressions file might change with bug fix releases. Might be better to use 3.14.3 here?
The line numbers in the suppressions file are just illustrative (the same goes for the numpy line numbers in there).
They are not used to determine suppressions; it's just a comment.
I used the same major.minor, minus free-threading, as what's found in https://github.com/nascheme/cpython_sanity (although the cpython 3.14 images have not been rebuilt in a while, so I stuck to 3.14.0 for now).
I see both pros and cons in pinning Python to an exact release. On the pro side, obviously stability of the job: it won't fail because a new CPython version is available. On the other hand, I think it would be best to test with the latest patch version, as is done in most of the other jobs.
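For context, LSan suppressions match symbol/module/source patterns, never line numbers, so a file like the following (entries here are hypothetical) stays valid across patch releases:

```text
# Passed via LSAN_OPTIONS=suppressions=/path/to/lsan-suppressions.txt
# CPython 3.14 Modules/somefile.c:123  <- line references like this are comments only
leak:_PyMem_RawMalloc
leak:some_extension_module.so
```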
I'm in the process of building images for 3.14.3. They will also be built automatically when there is a new release, which will be nice.
- add a comment to use regular ubuntu package once ubuntu 26.04 is available
Let's bring this in and see if it is noisy or, even better, starts telling us when we write memory leaks. Thanks so much for the contribution @mayeut and also for your work on manylinux.
This PR adds a Linux x86_64 clang AddressSanitizer+LeakSanitizer job as discussed in #30761 (comment) and #30583
I'll try to address the various comments found in #30583 regarding leak checks in this description.
For now, AddressSanitizer is only run on macOS arm64, so it does make sense to add a job for Linux x86_64, which likely covers a much larger share of users and uses different optimisations not yet covered by existing AddressSanitizer tests. LeakSanitizer should not add much overhead to testing time compared to AddressSanitizer tests.
Limitations:

- The job runs with `detect_leaks=1` and thus, leaks in CPython might be revealed here on CPython upgrades.

Remarks:

- I did not try `__lsan_do_leak_check`/`__lsan_do_recoverable_leak_check` to check if it could help narrow down which specific tests are actually leaking. This could be investigated as a future enhancement.
- (e.g. `python -c "import numpy"` or some subprocess tests in the test suite).
- Trying to use `ghcr.io/nascheme/numpy-asan:3.14t` shows many leaks and at this point, I did not check any of those and felt it was better to get something working reliably with the GIL-enabled interpreter rather than nothing.
- `LSAN_OPTIONS` in environment variables. I did not want to overcomplicate things to determine if leak sanitizer was actually enabled. This also means that if the suppression file evolves to be empty, it should probably be kept, rather than removed, with no updates to the job; or this skip mechanism should be reworked.