Out of core computation

Overview and motivation

Analyses often involve performing some computation not implemented by a plugin (e.g. plotting) that requires loading more data than can fit into memory. These kinds of tasks are commonly referred to as out-of-core computations. Out-of-core algorithms usually involve a few repeating steps, illustrated by the sketch after this list:

  1. chunk the dataset into manageable sizes

  2. load the data chunk by chunk

  3. perform some computation on each chunk

  4. save a summary of the results for each chunk

  5. combine the per-chunk results into a final result.
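
To make these steps concrete, here is a toy sketch that computes the mean of a large on-disk array one chunk at a time. The file name and chunk size are hypothetical placeholders; the point is only to show the chunk/compute/combine pattern.

import numpy as np

# hypothetical file: a large float64 array on disk that does not fit in memory;
# a memory map lets us slice it without loading everything at once
data = np.memmap('big_array.dat', dtype=np.float64, mode='r')

chunk_size = 1_000_000                                    # 1. pick a manageable chunk size
partial_sums, counts = [], []

for start in range(0, len(data), chunk_size):
    chunk = np.asarray(data[start:start + chunk_size])    # 2. load one chunk into memory
    partial_sums.append(chunk.sum())                      # 3./4. compute and store a per-chunk summary
    counts.append(len(chunk))

mean = np.sum(partial_sums) / np.sum(counts)              # 5. combine the summaries into the final result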

While it is of course possible to implement these operations yourself, doing so is tedious and repetitive, and the code tends to become rigidly tied to the specific calculation being performed. A better approach is to use abstractions of commonly performed operations that run out-of-core algorithms under the hood and produce the same result as if the operations were performed on the entire dataset. Code written against these abstractions can then run on in-memory and out-of-core datasets alike. More importantly, the implementations of these algorithms can be written once, packaged, and reused by everyone.

Data chunking

The zarr package provides an abstraction of the numpy array data-access API for chunked and compressed data stored in memory or on disk. A zarr array behaves like a numpy array when accessing data, but the underlying data is actually a collection of (optionally compressed) chunks. The strax context provides a convenience method for loading data directly into zarr arrays.

import strax

context = strax.Context(**CONTEXT_KWARGS)

# you can pass the same arguments you pass to context.get_array()
zgrp = context.get_zarr(RUN_IDs, DATA_TYPES, **GET_ARRAY_KWARGS)

# the zarr group contains multiple arrays, one for each data type
z = zgrp.data_type

# individual arrays are also accessible via the __getitem__ interface
z = zgrp['data_type']

# numpy-like data access, abstracting away the underlying data reading,
# which may involve reading multiple chunks from disk/memory,
# decompressing them, and concatenating to return an in-memory numpy array
z[:100]
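
Because the data is chunked, a zarr array also exposes its layout and storage details without reading any chunks. The attributes below are part of the standard zarr array interface:

# inspect the array without loading any data
z.shape    # total number of rows across all chunks
z.dtype    # numpy (structured) dtype of the data type
z.chunks   # size of each stored chunk
z.nbytes   # uncompressed size of the full array in bytes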

Data processing

The dask package provides abstractions for most of the numpy and pandas APIs. The dask.Array and dask.DataFrame objects implement their respective APIs using fully distributed algorithms, loading only a fraction of the total data into memory at any given moment for a given computing partition (thread/process/HPC job).

import dask.array as da

# easily convert to dask.Array abstraction for processing
darr = da.from_zarr(z)

# it's recommended to rechunk to sizes more appropriate for processing;
# see the dask documentation for details (rechunk returns a new array)
darr = darr.rechunk(CHUNK_SIZE)

# you can also convert the dask.Array abstraction
# to a dask.DataFrame abstraction if you need the pandas api
ddf = darr.to_dask_dataframe()
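
Operations on these dask objects are lazy: they only build a task graph, and the chunked, out-of-core execution happens when you request a concrete result with .compute(). The field name below is a hypothetical placeholder; substitute a field from your own data type.

# nothing is loaded yet; dask only records which operations to perform
# 'area' is a hypothetical field name, replace it with one from your data type
mean_area = ddf['area'].mean()

# trigger the chunked execution and collect the (small) final result in memory
result = mean_area.compute()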