ENH: add maxlag and lags parameters to correlate/convolve#31261
HoniSanders wants to merge 13 commits into numpy:main from
Conversation
Brings the maxlag2025 branch up to date with current numpy/numpy main (843 commits since branch base). Conflict resolution in numpy/_core/src/multiarray/multiarraymodule.c:

- Adopt upstream's _pyarray_correlate signature change from `int typenum` to `PyArray_Descr *typec` (PR numpy#30931).
- Keep our refactor: the function still takes (minlag, maxlag, lagstep) instead of (mode, *inverted), and handles array swap, negative-step normalization, and output reversal internally.
- Update PyArray_Correlate, PyArray_Correlate2, and PyArray_CorrelateLags to use upstream's PyArray_DTypeFromObject + NPY_DT_CALL_ensure_canonical pattern for type resolution, with proper Py_DECREF(typec) cleanup.
- Update the new array_correlatelags argument parser to use upstream's brace syntax for npy_parse_arguments.

Add NPY_2_6_API_VERSION 0x00000016 and a matching version-string clause in numpy/_core/include/numpy/numpyconfig.h to support the bumped C API version that registers PyArray_CorrelateLags at slot 369.

Tests: numpy/_core/tests/test_numeric.py: 512 passed, 1 skipped.
Update pythoncapi-compat to match upstream
Thanks for the PR, I just enabled CI. If it fails you may need to enable CI on your fork to get CI passing in a timely manner.

I don't have time to go through the backlog of discussion about this and am not registering an opinion about whether we want to add this or not. That said, NumPy is generally pretty conservative about adding new keywords and functions, since the API surface is already very large and complicated. One of the rules we have is that new features require sending a message to the numpy-discussion mailing list with a proposal to add the new feature. That gives the wider contributor base, who don't actively follow GitHub, a chance to weigh in.

We're also generally conservative about adding new things to the C API, so your proposal to add a new C API item needs some justification. Keep in mind that basically no one will be able to use the NumPy 2.6 C API until perhaps 2028, due to projects keeping support for older NumPy releases. Is it really worth it to add the new C API? You'll need to explain why we should add the function.

Would you mind sending such a message to the mailing list about this proposal?
Thanks. I did post about this on the numpy-discussion list at the time (see https://mail.python.org/archives/list/numpy-discussion@python.org/thread/JMXNOCFQRFJYISFKSRI7MRL7GPK5ZD4S/#JMXNOCFQRFJYISFKSRI7MRL7GPK5ZD4S and especially https://mail.python.org/archives/list/numpy-discussion@python.org/thread/6FC27XPOD4SBH3KJ7VY5DDAPS5Q46GPO/#6FC27XPOD4SBH3KJ7VY5DDAPS5Q46GPO, where several people chimed in saying this would be useful). I figured that was sufficient, but I can re-post since it's been a while. You can also see several people hopping in to support the feature on the previous PR and closed issue.

As for the C API question, I believe it needs a new C API function because the core vector-calculation implementation needs to be in C and the existing arguments are not sufficient to support the new implementation. Feel free to let me know if there is another way for the Python function to call a modified version of the existing C functions.
Ah, I didn't search the archives. I'll take a look at those discussions and let you know if I think a new mailing list ping is warranted.
But why does the new function need to be publicly exposed in the C API? Can't NumPy just call an internal C function?
You could not expose it publicly and instead just use the new C implementation internally.
Oh sure, I didn't realize limiting it to internal use was an option. I don't know who uses the public C API. I guess the value of a public C API function would just be to expose the functionality to whoever uses the C API, as long as we are building the functionality anyway. But I certainly wouldn't want the question of the C API to delay the inclusion of the functionality in the Python API.
Oh great, good that it was just a misconception. We definitely add features to the Python API without adding equivalent features in the public C API.
I just took a look, and given that it's been more than ten years since the mailing list discussion, I do think a new ping is warranted. When you send your message, please include context about prior discussions and how this proposal differs from earlier proposals, if it does. I haven't checked, but it would also be good to see what other array libraries do and what the array API standard says about this case. If adding this makes us more conformant with the array API, or otherwise makes it easier to write code that is agnostic to the underlying array library, that would argue towards including it. However, if this is an entirely new feature that only NumPy would implement, that makes it a harder sell. Also, please feel free to come to a NumPy community meeting to chat face-to-face about this over Zoom:
Ok, I will send an email to the list, and I will try to attend this week's Zoom. What do you mean by "other array libraries" and "array API standard"?
Array API Standard. In particular, note the Stakeholders section, which lists those other array libraries.
You may want to consider the option to implement more specialized functions in
Thanks for the pointer.
If you want, we can remove the returns_lagvector argument. That was a suggestion proposed in the first round of feedback on this PR ten years ago. We could have a separate function to calculate lags, like scipy has, if you prefer.
I pushed a commit to remove the returns_lagvector argument and instead have a companion function like scipy has. |
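For reference, such a companion function can be sketched along the lines of scipy.signal.correlation_lags; the name and signature below are illustrative only, not necessarily what the PR's companion function or PyArray_CorrelateLags exposes:

```python
import numpy as np

def correlation_lags_sketch(in1_len, in2_len, maxlag=None):
    # Illustrative sketch modeled on scipy.signal.correlation_lags:
    # the lag values matching a mode='full' correlation, optionally
    # clipped to the [-maxlag, maxlag] window.
    lags = np.arange(-(in2_len - 1), in1_len)
    if maxlag is not None:
        lags = lags[(lags >= -maxlag) & (lags <= maxlag)]
    return lags

lags = correlation_lags_sketch(4, 4, maxlag=2)  # 2*maxlag + 1 lag values
```

A user would pair this with the correlation result to label each output element with its lag.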
That's not quite the logic at play here. It's not like a dependency (e.g. scipy depends on numpy, if a new optional argument is added in a backwards-compatible fashion, that's fine because all of the existing usages of |
PR summary
This PR adds `maxlag` and `lags` keyword parameters to `np.correlate` and `np.convolve`, plus a `returns_lagvector` flag that returns the corresponding lag indices alongside the result. It also adds a new public C API function `PyArray_CorrelateLags`.

What was troubling me is that numpy.correlate does not have a maxlag feature. This means that even if I only want to see correlations between two time series with lags between -100 and +100 ms, for example, it will still calculate the correlation for every lag between -20000 and +20000 ms (which is the length of the time series).
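To make the motivation concrete, here is a minimal pure-NumPy sketch of what the proposed `maxlag` behavior would return; `correlate_maxlag` is a hypothetical helper, not the PR's implementation, and it still wastefully computes the full correlation before slicing, which is exactly the cost the PR avoids:

```python
import numpy as np

def correlate_maxlag(a, v, maxlag):
    # Hypothetical helper: emulate the proposed maxlag semantics by
    # computing the full correlation and keeping lags -maxlag..+maxlag.
    # The PR's point is to avoid computing the discarded lags at all.
    full = np.correlate(a, v, mode="full")  # lags -(len(v)-1) .. len(a)-1
    zero = len(v) - 1                       # index of lag 0 in `full`
    return full[zero - maxlag : zero + maxlag + 1]

a = np.array([1.0, 2.0, 3.0, 4.0])
res = correlate_maxlag(a, a, maxlag=2)      # 2*maxlag + 1 = 5 values
```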
This PR revives #5978, which I opened back in 2015 and never finished pushing through. I had introduced this question as issues on numpy and scipy: #5954, scipy/scipy#4940 and on the scipy-dev list. It also got attention on stackoverflow at the time. The new issue that this resolves is #17286.
Apologies for the very long downtime. I'm excited to be able to come back to this and finally get this much-needed feature included.
Also, I plan on squashing the commits in this branch and naming the single commit the same as the name of this PR. I expect the review process to require more commits, so I pushed my branch as is. I hope this is ok.
Proposed API
Parameter Design
Both parameters are keyword-only and mutually exclusive. The default mode auto-resolves to 'lags' when either is supplied.
Major file changes
Benchmark
For a user interested in a small number of lags around 0, there is a huge speedup:
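The source of the speedup can be sketched in pure Python: with a lag window, only O(n·maxlag) multiply-adds are needed instead of O(n²) for all lags. `correlate_only_lags` below is an illustrative helper assuming equal-length inputs, not the PR's C implementation:

```python
import numpy as np

def correlate_only_lags(a, v, maxlag):
    # Illustrative helper (assumes len(a) == len(v)): compute only the
    # requested lags, O(n * maxlag) work instead of O(n**2) for all lags.
    n = len(a)
    out = np.empty(2 * maxlag + 1)
    for i, lag in enumerate(range(-maxlag, maxlag + 1)):
        if lag >= 0:
            out[i] = np.dot(a[lag:], v[:n - lag])
        else:
            out[i] = np.dot(a[:n + lag], v[-lag:])
    return out

a = np.array([1.0, 2.0, 3.0, 4.0])
res = correlate_only_lags(a, a, maxlag=2)
```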
`maxlag=5` (From `benchmarks/benchmarks/bench_core.py::CorrConvLags`, included in this PR.)

First time committer introduction
Hi! I have used NumPy for more than a decade now, both directly and as the foundation of the whole scientific Python ecosystem (e.g. pandas, PyTorch, etc.). When I started working on this in 2015, I was using NumPy for the first time, building simulations of biological neural networks that were more performant, and open source, compared to the Matlab code I had previously been using. This was a feature that would have helped me quite a bit at the time, as I was cross-correlating very long time series but only needed a relatively small window of time lags around 0. I have wanted to come back to it because of the continued interest of others, even though I myself no longer need this specific functionality.
AI Disclosure
Claude Code is what finally allowed me to bring this PR to completion.
I used Claude as a pair-programming assistant for: