### What is the problem this feature will solve?

Modern sky surveys produce catalogs that fundamentally exceed available RAM on a typical workstation.

Any user attempting to read such a catalog eagerly either crashes immediately or is forced to abandon astropy entirely for `vaex`, `polars`, or raw `pandas`. All of these alternatives silently drop astropy-native types: units (`Quantity`), coordinates (`SkyCoord`), and time (`Time` columns).
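To make the scale concrete, here is a back-of-envelope estimate using the Gaia DR3 source count quoted later in this issue; the number of float64 columns actually read is an illustrative assumption:

```python
# Back-of-envelope: in-memory size of an eagerly loaded Gaia DR3 catalog.
rows = 1_811_709_771      # Gaia DR3 source count (from FITS NAXIS2)
cols = 20                 # assumed number of float64 columns read
bytes_per_value = 8       # float64

total_gib = rows * cols * bytes_per_value / 2**30
print(f"{total_gib:.0f} GiB")  # 270 GiB, far beyond a typical workstation's RAM
```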
This is not a new pain point. Two previous issues were opened but never resolved:
- Integrate with dask #7748 — opened Aug 2018, closed by the stale bot with zero implementation
- Experiment with interfacing with Dask #8227 — opened Dec 2018, still open after 6+ years, labeled only `Experimental` with no concrete deliverable
Those issues identified specific blockers (Quantity eagerness, repr densification, SkyCoord as ndarray subclass) but no scoped implementation plan was ever proposed. This issue aims to fix that.
astropy is used by 55,000+ downstream packages including LSST pipelines, JWST reduction tools, and eROSITA analysis software. All of them hit this wall.
### Describe the desired outcome
Add an optional `lazy=True` keyword to `Table.read()`, `QTable.read()`, and `TimeSeries.read()` that returns a `LazyTable` — a thin `Table` subclass backed by `dask.array` columns — without loading any data into RAM until explicitly requested.
#### Proposed API
```python
from astropy.table import Table

# No data loaded — only FITS header / ECSV metadata parsed eagerly
t = Table.read('gaia_dr3_source.fits', lazy=True)

print(t.colnames)    # ['source_id', 'ra', 'dec', ...] — instant, no I/O
print(t['ra'].unit)  # deg — from FITS header, no data loaded
print(len(t))        # 1_811_709_771 — from FITS NAXIS2, no data loaded

# Build a lazy filtered view
bright = t[t['phot_g_mean_mag'] < 10.0]

# Materialize only when needed
result = bright.compute()  # returns a normal astropy Table

# Stream through large files without ever loading everything
for chunk in t.iterchunks(chunk_size=100_000):
    process(chunk)  # each chunk is a normal Table of ~100k rows
```
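The streaming pattern can be prototyped without dask at all. A minimal sketch of the `iterchunks` idea, with a hypothetical `load_rows` callable standing in for a real FITS/ECSV chunk reader:

```python
# Minimal sketch of iterchunks: yield successive row ranges, loading each
# chunk on demand. `load_rows(start, stop)` is a hypothetical stand-in for
# a real per-chunk file reader.
def iterchunks(load_rows, n_rows, chunk_size):
    """Yield (start, stop, data) for successive row chunks, loaded lazily."""
    for start in range(0, n_rows, chunk_size):
        stop = min(start + chunk_size, n_rows)
        yield start, stop, load_rows(start, stop)  # only this slice is in RAM

# Toy loader: pretend the "file" is a range of integers.
chunks = list(iterchunks(lambda a, b: list(range(a, b)), n_rows=10, chunk_size=4))
print([(s, e, len(d)) for s, e, d in chunks])  # [(0, 4, 4), (4, 8, 4), (8, 10, 2)]
```

Because the generator never holds more than one chunk, peak memory is bounded by `chunk_size` regardless of file size.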
#### Phase 1 Scope
In scope:
- `LazyTable` and `LazyColumn` classes with a repr that does not trigger data reads
- Lazy FITS binary table reader via `dask.array.from_delayed` per column
- Lazy ECSV reader
- Column selection and scalar/numeric boolean row filtering (lazy)
- `.compute()` to materialize to a normal `Table`
- `.iterchunks(chunk_size=N)` for streaming pipelines
- Unit metadata preserved on `LazyColumn` without triggering computation
- `QTable.read(..., lazy=True)` — `Quantity` constructed only on `.compute()`
- `TimeSeries.read(..., lazy=True)` with `time_column` preserved in metadata
- Raises `ImportError` with a helpful message if dask is not installed
Out of scope (explicit non-goals for Phase 1):

- `SkyCoord` column support (blocked by the `Quantity`/ndarray subclass refactor discussed in #8227)
- `Time` column support (same blocker)

#### Key Design Decisions

Handling the `Quantity` eagerness blocker (identified in #8227): rather than modifying `Quantity` globally, `LazyColumn` stores the unit as metadata and constructs `Quantity` only during `.compute()`. No changes to `astropy.units` are needed.

Handling `repr` densification (identified in #8227): `LazyTable.__repr__` explicitly avoids accessing column data.
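Both design decisions can be illustrated with a dependency-free sketch. Plain Python stands in for dask and `Quantity` here; the class and method names are illustrative, not the proposed implementation:

```python
class LazyColumn:
    """Holds a deferred loader plus unit metadata; no data touched until compute()."""

    def __init__(self, name, loader, unit=None, length=None):
        self.name = name
        self._loader = loader   # zero-argument callable that reads the data
        self.unit = unit        # plain metadata: no Quantity constructed here
        self._length = length

    def __repr__(self):
        # Data-free repr: reports only metadata, never calls the loader.
        return f"<LazyColumn {self.name!r} unit={self.unit} length={self._length}>"

    def compute(self):
        # Only here is the data loaded and the unit attached. A real
        # implementation would build an astropy Quantity; this sketch
        # returns a (data, unit) tuple instead.
        return (self._loader(), self.unit)

col = LazyColumn('ra', loader=lambda: [10.68, 56.75], unit='deg', length=2)
print(repr(col))            # metadata only, no I/O triggered
data, unit = col.compute()  # data loaded and unit attached here
```

Keeping the unit as inert metadata until `.compute()` is what sidesteps the `Quantity` eagerness problem without touching `astropy.units`.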
### Additional context

_No response_