strax package

Subpackages

Submodules

strax.chunk module

exception strax.chunk.CannotSplit[source]

Bases: Exception

class strax.chunk.Chunk(*, data_type, data_kind, dtype, run_id, start, end, data, subruns=None, target_size_mb=200)[source]

Bases: object

Single chunk of strax data of one data type.

classmethod concatenate(chunks)[source]

Create a chunk by concatenating chunks of the same data type. You can pass None’s; they will be ignored.

data: ndarray
data_kind: str
data_type: str
dtype: dtype
property duration
end: int
property first_subrun
property is_superrun
property last_subrun
classmethod merge(chunks, data_type='<UNKNOWN>')[source]

Create chunk by merging columns of chunks of same data kind.

Parameters:
  • chunks – Chunks to merge. None is allowed and will be ignored.

  • data_type – data_type name of the newly created chunk. Set to <UNKNOWN> if not provided.

property nbytes
run_id: str
split(t: int | None, allow_early_split=False)[source]

Return (chunk_left, chunk_right) split at time t.

Parameters:
  • t – Time at which to split the data. All data in the left chunk will have their (exclusive) end <= t; all data in the right chunk will have (inclusive) start >= t.

  • allow_early_split – If False, raise CannotSplit if the requirements above cannot be met. If True, split at the closest possible time before t.

start: int
subruns: dict
target_size_mb: int
strax.chunk.continuity_check(chunk_iter)[source]

Check continuity of chunks yielded by chunk_iter as they are yielded.

strax.chunk.split_array(data, t, allow_early_split=False)[source]

Return (data left of t, data right of t, t), or raise CannotSplit if that would split a data element in two.

Parameters:
  • data – strax numpy data

  • t – Time to split data

  • allow_early_split – Instead of raising CannotSplit, split at t_split as close as possible before t where a split can happen. The new split time replaces t in the return value.
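
A minimal sketch of how this could be used on toy strax-like data, assuming split_array and record_dtype are exported at the package level and choosing a split time that falls between the two records:

    import numpy as np
    import strax

    # Toy data: two records of 10 samples, 1 ns per sample.
    data = np.zeros(2, dtype=strax.record_dtype(samples_per_record=10))
    data['time'] = [0, 100]
    data['length'] = 10
    data['dt'] = 1

    # t=50 lies between the two records, so no element is split in two.
    left, right, t = strax.split_array(data, t=50)
    print(len(left), len(right), t)   # -> 1 1 50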

strax.chunk.transform_chunk_to_superrun_chunk(superrun_id, chunk)[source]

Function which transforms/creates a new superrun chunk from a subrun chunk.

Parameters:
  • superrun_id – id/name of the superrun.

  • chunk – strax.Chunk of a superrun subrun.

Returns:

strax.Chunk

strax.config module

class strax.config.Config(**kwargs)[source]

Bases: Option

An alternative to the takes_config class decorator which uses the descriptor protocol to return the config value when the attribute is accessed from within a plugin.

fetch(plugin)[source]

This function is called when the attribute is being accessed.

Should be overridden by subclasses to customize behavior.
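
A hedged sketch of the descriptor-style usage; the plugin, its data types, and the option name below are hypothetical:

    import numpy as np
    import strax

    class PeakFlags(strax.Plugin):
        """Hypothetical plugin using strax.Config as a class attribute."""
        depends_on = ('peaks',)
        provides = 'peak_flags'
        dtype = [(('Peak passes area threshold', 'passes'), np.bool_)] + strax.time_fields

        # Accessing self.min_area inside compute returns the configured value.
        min_area = strax.Config(default=100, type=(int, float),
                                help='Minimum peak area to flag')

        def compute(self, peaks):
            result = np.zeros(len(peaks), dtype=self.dtype)
            result['time'] = peaks['time']
            result['endtime'] = strax.endtime(peaks)
            result['passes'] = peaks['area'] > self.min_area
            return result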

exception strax.config.InvalidConfiguration[source]

Bases: Exception

class strax.config.Option(name: str, type: str | type | tuple | list = '<OMITTED>', default: Any = '<OMITTED>', default_factory: str | Callable = '<OMITTED>', default_by_run='<OMITTED>', child_option: bool = False, parent_option_name: str | None = None, track: bool = True, infer_type=False, help: str = '')[source]

Bases: object

Configuration option taken by a strax plugin.

get_default(run_id, run_defaults: dict | None = None)[source]

Return default value for the option.

taken_by: str
validate(config, run_id=None, run_defaults=None, set_defaults=True)[source]

Checks if the option is in config and sets defaults if needed.

strax.config.combine_configs(old_config, new_config=None, mode='update')[source]
strax.config.takes_config(*options)[source]

Decorator for plugin classes, to specify which options it takes.

Parameters:

options – Option instances of options this plugin takes.
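
A hedged sketch of the decorator in use; the plugin, its data types, and the option name are hypothetical:

    import numpy as np
    import strax

    @strax.takes_config(
        strax.Option('to_pe', type=float, default=0.005,
                     help='Illustrative ADC-to-PE conversion factor'),
    )
    class RecordAreas(strax.Plugin):
        """Hypothetical plugin; options are available through self.config."""
        depends_on = ('records',)
        provides = 'record_areas'
        dtype = [(('Area in PE', 'area'), np.float32)] + strax.time_fields

        def compute(self, records):
            result = np.zeros(len(records), dtype=self.dtype)
            result['time'] = records['time']
            result['endtime'] = strax.endtime(records)
            result['area'] = records['data'].sum(axis=1) * self.config['to_pe']
            return result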

strax.context module

class strax.context.Context(storage=None, config=None, register=None, register_all=None, **kwargs)[source]

Bases: object

Context for strax analysis.

A context holds info on HOW to process data, such as which plugins provide what data types, where to store which results, and configuration options for the plugins.

You start all strax processing through a context.

accumulate(run_id: str, targets: Tuple[str] | List[str], fields=None, function=None, store_first_for_others=True, function_takes_fields=False, **kwargs)[source]

Return a dictionary with the sum of the result of get_array.

Parameters:
  • function

    Apply this function to the array before summing the results. Will be called as function(array), where array is a chunk of the get_array result. Should return either:

    • A scalar or 1d array -> accumulated result saved under ‘result’

    • A record array or dict -> fields accumulated individually

    • None -> nothing accumulated

    If not provided, the identity function is used.

    NB: Additionally and independently, if there are any functions registered under context_config[‘apply_data_function’] these are applied first directly after loading the data.

  • fields – Fields of the function output to accumulate. If not provided, all output fields will be accumulated.

  • store_first_for_others – if True (default), for fields included in the data but not in fields, store the first value seen in the data (if any value is seen).

  • function_takes_fields – If True, function will be called as function(data, fields) instead of function(data).

All other options are as for get_iter.

Return dictionary:

Dictionary with the accumulated result; see function and store_first_for_others arguments. Four fields are always added:

start: start time of the first processed chunk
end: end time of the last processed chunk
n_chunks: number of chunks in run
n_rows: number of data entries in run
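
For illustration, a hedged usage sketch; st is assumed to be an existing strax.Context, and the run id, target, and field names are hypothetical:

    # Sum the 'area' field over a run without holding all peaks in memory.
    result = st.accumulate('012345', 'peak_basics', fields=('area',))
    print(result['area'], result['n_chunks'], result['n_rows'])

    # Custom reduction: count entries above an illustrative threshold, chunk by chunk.
    result = st.accumulate('012345', 'peak_basics',
                           function=lambda arr: {'n_large': (arr['area'] > 1000).sum()})
    print(result['n_large'])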

classmethod add_method(f)[source]

Add f as a new Context method.

available_for_run(run_id: str, include_targets: None | list | tuple | str = None, exclude_targets: None | list | tuple | str = None, pattern_type: str = 'fnmatch') DataFrame

For a given run, check whether each target is stored. Targets that are never stored anyway are excluded.

Parameters:
  • run_id – requested run

  • include_targets – targets to include, e.g. raw_records, raw_records* or *_nv. If multiple targets (e.g. a list) are provided, the target should match any of the arguments!

  • exclude_targets – targets to exclude, e.g. raw_records, raw_records* or *_nv. If multiple targets (e.g. a list) are provided, the target should match none of the arguments!

  • pattern_type – either ‘fnmatch’ (Unix filename pattern matching) or ‘re’ (Regular expression operations).

Returns:

Table of available data per target

compare_metadata(data1, data2, return_results=False)[source]

Compare the metadata between two strax data.

Parameters:

  • data1, data2 – each either a (run_id, target) tuple/list, a path to a metadata JSON file, or a dictionary of the metadata.

  • return_results – bool; if True, return a dictionary with the metadata and lineages found for the inputs, and do not perform the comparison.

Example usage:

    context.compare_metadata(("053877", "peak_basics"), "./my_path_to/JSONfile.json")

    first_metadata = context.get_metadata(run_id, "events")
    context.compare_metadata(("053877", "peak_basics"), first_metadata)

    context.compare_metadata(("053877", "records"), ("053899", "records"))

    results_dict = context.compare_metadata(
        ("053877", "peak_basics"), ("053877", "events_info"),
        return_results=True)

config: dict
context_config: dict
copy_to_frontend(run_id: str, target: str, target_frontend_id: int | None = None, target_compressor: str | None = None, rechunk: bool = False, rechunk_to_mb: int = 200)[source]

Copy data from one frontend to another.

Parameters:
  • run_id – run_id

  • target – target datakind

  • target_frontend_id – index of the frontend that the data should go to in context.storage. If no index is specified, try all.

  • target_compressor – if specified, recompress with this compressor.

  • rechunk – allow re-chunking for saving

  • rechunk_to_mb – rechunk to specified target size. Only works if rechunk is True.

data_info(data_name: str) DataFrame[source]

Return pandas DataFrame describing fields in data_name.

define_run(name: str, data: ndarray | DataFrame | dict | list | tuple, from_run: str | None = None)

Function for defining new superruns from a list of run_ids.

Note:

The function also allows creating a superrun from data (numpy arrays/pandas DataFrames). However, this is currently not supported from the data loading side.

Parameters:
  • name – Name/run_id of the superrun. Superrun names must start with an underscore.

  • data – Data from which the superrun should be created. Can be either one of the following: a tuple/list of run_ids or a numpy.array/pandas.DataFrame containing some data.

  • from_run – List of run_ids which were used to create the numpy.array/pandas.DataFrame passed in data.
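
A hedged sketch; st is assumed to be an existing strax.Context and the run ids are hypothetical:

    # Superrun names must start with an underscore.
    st.define_run('_kr83m_calibration', ['012345', '012346', '012347'])

    # The superrun can then be addressed like a normal run id, e.g.
    # st.get_array('_kr83m_calibration', 'peak_basics')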

deregister_plugins_with_missing_dependencies()[source]

Deregister plugins in case a data_type the plugin depends on is not provided by any other plugin.

estimate_run_start_and_end(run_id, targets=None)[source]

Return run start and end time in ns since epoch.

This fetches from run metadata, and if this fails, it estimates it using data metadata from the targets or the underlying data-types (if it is stored).

get_array(run_id: str | tuple | list, targets, save=(), max_workers=None, **kwargs) ndarray[source]

Compute target for run_id and return as numpy array.

Parameters:
  • run_id – run id to get

  • targets – list/tuple of strings of data type names to get

  • ignore_errors – Return the data for the runs that successfully loaded, even if some runs failed executing.

  • save – extra data types you would like to save to cache, if they occur in intermediate computations. Many plugins save automatically anyway.

  • max_workers – Number of worker threads/processes to spawn. In practice more CPUs may be used due to strax’s multithreading.

  • allow_multiple – Allow multiple targets to be computed simultaneously without merging the results of the target. This can be used when mass producing plugins that are not of the same datakind. Don’t try to use this in get_array or get_df because the data is not returned.

  • add_run_id_field – Whether to add a run_id field in case of multi-runs.

  • run_id_as_bytes – If True, use a byte string instead of a unicode string for the run_id added to a multi-run array. This can save a lot of memory when loading many runs.

  • selection – Query string, sequence of strings, or simple function to apply. The function must take a single argument which represents the structured numpy array of the loaded data.

  • selection_str – Same as selection (deprecated)

  • keep_columns – Array field/dataframe column names to keep. Useful to reduce amount of data in memory. (You can only specify either keep or drop column.)

  • drop_columns – Array field/dataframe column names to drop. Useful to reduce amount of data in memory. (You can only specify either keep or drop column.)

  • time_range – (start, stop) range to load, in ns since the epoch

  • seconds_range – (start, stop) range of seconds since the start of the run to load.

  • time_within – row of strax data (e.g. event) to use as time range

  • time_selection – Kind of time selection to apply: - fully_contained: (default) select things fully contained in the range - touching: select things that (partially) overlap with the range - skip: Do not select a time range, even if other arguments say so

  • _chunk_number – For internal use: return data from one chunk.

  • progress_bar – Display a progress bar if metadata exists.

  • multi_run_progress_bar – Display a progress bar for loading multiple runs.
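
A hedged usage sketch; st is assumed to be an existing strax.Context, and the run ids, targets, and field names are hypothetical. get_df works the same way but returns a pandas DataFrame:

    # Load one data type for a single run.
    peaks = st.get_array('012345', 'peak_basics')

    # Load several runs at once; a run_id field can be added for multi-run loads.
    peaks = st.get_array(['012345', '012346'], 'peak_basics',
                         keep_columns=('time', 'area'),
                         add_run_id_field=True)

    # Time and selection cuts are applied while loading.
    large = st.get_array('012345', 'peak_basics',
                         seconds_range=(0, 60),
                         selection='area > 1000')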

get_components(run_id: str, targets=(), save=(), time_range=None, chunk_number=None) ProcessorComponents[source]

Return components for setting up a processor.

Parameters:
  • run_id – run id to get

  • targets – list/tuple of strings of data type names to get

  • ignore_errors – Return the data for the runs that successfully loaded, even if some runs failed executing.

  • save – extra data types you would like to save to cache, if they occur in intermediate computations. Many plugins save automatically anyway.

  • max_workers – Number of worker threads/processes to spawn. In practice more CPUs may be used due to strax’s multithreading.

  • allow_multiple – Allow multiple targets to be computed simultaneously without merging the results of the target. This can be used when mass producing plugins that are not of the same datakind. Don’t try to use this in get_array or get_df because the data is not returned.

  • add_run_id_field – Whether to add a run_id field in case of multi-runs.

  • run_id_as_bytes – If True, use a byte string instead of a unicode string for the run_id added to a multi-run array. This can save a lot of memory when loading many runs.

  • selection – Query string, sequence of strings, or simple function to apply. The function must take a single argument which represents the structured numpy array of the loaded data.

  • selection_str – Same as selection (deprecated)

  • keep_columns – Array field/dataframe column names to keep. Useful to reduce amount of data in memory. (You can only specify either keep or drop column.)

  • drop_columns – Array field/dataframe column names to drop. Useful to reduce amount of data in memory. (You can only specify either keep or drop column.)

  • time_range – (start, stop) range to load, in ns since the epoch

  • seconds_range – (start, stop) range of seconds since the start of the run to load.

  • time_within – row of strax data (e.g. event) to use as time range

  • time_selection – Kind of time selection to apply: - fully_contained: (default) select things fully contained in the range - touching: select things that (partially) overlap with the range - skip: Do not select a time range, even if other arguments say so

  • _chunk_number – For internal use: return data from one chunk.

  • progress_bar – Display a progress bar if metadata exists.

  • multi_run_progress_bar – Display a progress bar for loading multiple runs.

get_df(run_id: str | tuple | list, targets, save=(), max_workers=None, **kwargs) DataFrame[source]

Compute target for run_id and return as pandas DataFrame.

Parameters:
  • run_id – run id to get

  • targets – list/tuple of strings of data type names to get

  • ignore_errors – Return the data for the runs that successfully loaded, even if some runs failed executing.

  • save – extra data types you would like to save to cache, if they occur in intermediate computations. Many plugins save automatically anyway.

  • max_workers – Number of worker threads/processes to spawn. In practice more CPUs may be used due to strax’s multithreading.

  • allow_multiple – Allow multiple targets to be computed simultaneously without merging the results of the target. This can be used when mass producing plugins that are not of the same datakind. Don’t try to use this in get_array or get_df because the data is not returned.

  • add_run_id_field – Whether to add a run_id field in case of multi-runs.

  • run_id_as_bytes – If True, use a byte string instead of a unicode string for the run_id added to a multi-run array. This can save a lot of memory when loading many runs.

  • selection – Query string, sequence of strings, or simple function to apply. The function must take a single argument which represents the structured numpy array of the loaded data.

  • selection_str – Same as selection (deprecated)

  • keep_columns – Array field/dataframe column names to keep. Useful to reduce amount of data in memory. (You can only specify either keep or drop column.)

  • drop_columns – Array field/dataframe column names to drop. Useful to reduce amount of data in memory. (You can only specify either keep or drop column.)

  • time_range – (start, stop) range to load, in ns since the epoch

  • seconds_range – (start, stop) range of seconds since the start of the run to load.

  • time_within – row of strax data (e.g. event) to use as time range

  • time_selection – Kind of time selection to apply: - fully_contained: (default) select things fully contained in the range - touching: select things that (partially) overlap with the range - skip: Do not select a time range, even if other arguments say so

  • _chunk_number – For internal use: return data from one chunk.

  • progress_bar – Display a progress bar if metadata exists.

  • multi_run_progress_bar – Display a progress bar for loading multiple runs.

get_iter(run_id: str, targets: Tuple[str] | List[str], save=(), max_workers=None, time_range=None, seconds_range=None, time_within=None, time_selection='fully_contained', selection=None, selection_str=None, keep_columns=None, drop_columns=None, allow_multiple=False, progress_bar=True, _chunk_number=None, **kwargs) Iterator[Chunk][source]

Compute target for run_id and iterate over results.

Do NOT interrupt the iterator (i.e. break): it will keep running stuff in background threads…

Parameters:
  • run_id – run id to get

  • targets – list/tuple of strings of data type names to get

  • ignore_errors – Return the data for the runs that successfully loaded, even if some runs failed executing.

  • save – extra data types you would like to save to cache, if they occur in intermediate computations. Many plugins save automatically anyway.

  • max_workers – Number of worker threads/processes to spawn. In practice more CPUs may be used due to strax’s multithreading.

  • allow_multiple – Allow multiple targets to be computed simultaneously without merging the results of the target. This can be used when mass producing plugins that are not of the same datakind. Don’t try to use this in get_array or get_df because the data is not returned.

  • add_run_id_field – Whether to add a run_id field in case of multi-runs.

  • run_id_as_bytes – If True, use a byte string instead of a unicode string for the run_id added to a multi-run array. This can save a lot of memory when loading many runs.

  • selection – Query string, sequence of strings, or simple function to apply. The function must take a single argument which represents the structured numpy array of the loaded data.

  • selection_str – Same as selection (deprecated)

  • keep_columns – Array field/dataframe column names to keep. Useful to reduce amount of data in memory. (You can only specify either keep or drop column.)

  • drop_columns – Array field/dataframe column names to drop. Useful to reduce amount of data in memory. (You can only specify either keep or drop column.)

  • time_range – (start, stop) range to load, in ns since the epoch

  • seconds_range – (start, stop) range of seconds since the start of the run to load.

  • time_within – row of strax data (e.g. event) to use as time range

  • time_selection – Kind of time selection to apply: - fully_contained: (default) select things fully contained in the range - touching: select things that (partially) overlap with the range - skip: Do not select a time range, even if other arguments say so

  • _chunk_number – For internal use: return data from one chunk.

  • progress_bar – Display a progress bar if metadata exists.

  • multi_run_progress_bar – Display a progress bar for loading multiple runs.
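
A hedged sketch of chunk-wise iteration; st is assumed to be an existing strax.Context and the names are hypothetical:

    # Process a run chunk by chunk to keep the memory footprint small.
    n_large = 0
    for chunk in st.get_iter('012345', 'peak_basics'):
        data = chunk.data                        # numpy array for this chunk
        n_large += (data['area'] > 1000).sum()
        print(chunk.start, chunk.end, len(data))
    print('peaks above threshold:', n_large)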

get_meta(run_id, target) dict[source]

Return metadata for target for run_id, or raise DataNotAvailable if data is not yet available.

Parameters:
  • run_id – run id to get

  • target – data type to get

get_metadata(run_id, target) dict

Return metadata for target for run_id, or raise DataNotAvailable if data is not yet available.

Parameters:
  • run_id – run id to get

  • target – data type to get

get_save_when(target: str) SaveWhen | int[source]

For a given target, get the save_when attribute of its plugin, which is either a dict or a number.

get_single_plugin(run_id, data_name)[source]

Return a single fully initialized plugin that produces data_name for run_id.

For use in custom processing.

get_source(run_id: str, target: str, check_forbidden: bool = True) set | None[source]

For a given run_id and target, get the stored bases from which we can start processing. If no base is available, return None.

Parameters:
  • run_id – run_id

  • target – target

  • check_forbidden – Check that we are not requesting to make a plugin whose creation is forbidden by the context.

Returns:

set of plugin names that are needed to start processing from and are needed in order to build this target.

get_source_sf(run_id, target, should_exist=False)[source]

Get the source storage frontends for a given run_id and target.

Parameters:
  • run_id, target – run_id and target for which to find the source storage frontends

  • should_exist – Raise a ValueError if we cannot find one (e.g. we already checked the data is stored)

Returns:

list of strax.StorageFrontend (when should_exist is False)

get_zarr(run_ids, targets, storage='./strax_temp_data', progress_bar=False, overwrite=True, **kwargs)[source]

Get persistent arrays using zarr. This is useful when loading large amounts of data that cannot fit in memory; zarr is very compatible with dask. Targets are loaded into separate arrays and runs are merged. The data is added to any existing data in the storage location.

Parameters:
  • run_ids – (Iterable) Run ids you wish to load.

  • targets – (Iterable) targets to load.

  • storage – (str, optional) fsspec path to store array. Defaults to ‘./strax_temp_data’.

  • overwrite – (boolean, optional) whether to overwrite existing arrays for targets at given path.

Return zarr.Group:

zarr group containing the persistent arrays available at the storage location after loading the requested data. The runs loaded into a given array can be seen in the array’s .attrs[‘RUNS’] field.

is_stored(run_id, target, detailed=False, **kwargs)[source]

Return whether data type target has been saved for run_id through any of the registered storage frontends.

Note that even if False is returned, the data type may still be made with a trivial computation.

key_for(run_id, target)[source]

Get the DataKey for a given run and a given target plugin. The DataKey is inferred from the plugin lineage. The lineage can come either from the _fixed_plugin_cache or computed on the fly.

Parameters:
  • run_id – run id to get

  • target – data type to get

Returns:

strax.DataKey of the target

keys_for_runs(target: str, run_ids: ndarray | list | tuple | str) List[DataKey]

Get the data-keys for a multitude of runs. If use_per_run_defaults is False, which it preferably is (#246), getting many keys should be fast as we only compute the lineage once.

Parameters:
  • run_ids – Runs to get datakeys for

  • target – datatype requested

Returns:

list of datakeys of the target for the given runs.

lineage(run_id, data_type)[source]

Return lineage dictionary for data_type and run_id, based on the options in this context.

list_available(target, runs=None, **kwargs) list

Return a sorted list of run_ids for which target is available.

Parameters:
  • target – Data type to check

  • runs – Runs to check. If None, check all runs.

make(run_id: str | tuple | list, targets, save=(), max_workers=None, _skip_if_built=True, **kwargs) None[source]

Compute target for run_id. Returns nothing (None).

Parameters:
  • run_id – run id to get

  • targets – list/tuple of strings of data type names to get

  • ignore_errors – Return the data for the runs that successfully loaded, even if some runs failed executing.

  • save – extra data types you would like to save to cache, if they occur in intermediate computations. Many plugins save automatically anyway.

  • max_workers – Number of worker threads/processes to spawn. In practice more CPUs may be used due to strax’s multithreading.

  • allow_multiple – Allow multiple targets to be computed simultaneously without merging the results of the target. This can be used when mass producing plugins that are not of the same datakind. Don’t try to use this in get_array or get_df because the data is not returned.

  • add_run_id_field – Whether to add a run_id field in case of multi-runs.

  • run_id_as_bytes – If True, use a byte string instead of a unicode string for the run_id added to a multi-run array. This can save a lot of memory when loading many runs.

  • selection – Query string, sequence of strings, or simple function to apply. The function must take a single argument which represents the structured numpy array of the loaded data.

  • selection_str – Same as selection (deprecated)

  • keep_columns – Array field/dataframe column names to keep. Useful to reduce amount of data in memory. (You can only specify either keep or drop column.)

  • drop_columns – Array field/dataframe column names to drop. Useful to reduce amount of data in memory. (You can only specify either keep or drop column.)

  • time_range – (start, stop) range to load, in ns since the epoch

  • seconds_range – (start, stop) range of seconds since the start of the run to load.

  • time_within – row of strax data (e.g. event) to use as time range

  • time_selection – Kind of time selection to apply: - fully_contained: (default) select things fully contained in the range - touching: select things that (partially) overlap with the range - skip: Do not select a time range, even if other arguments say so

  • _chunk_number – For internal use: return data from one chunk.

  • progress_bar – Display a progress bar if metadata exists.

  • multi_run_progress_bar – Display a progress bar for loading multiple runs.

new_context(storage=(), config=None, register=None, register_all=None, replace=False, **kwargs)[source]

Return a new context with new settings added to those in this context.

Parameters:

replace – If True, replaces settings rather than adding them. See Context.__init__ for documentation on other parameters.

provided_dtypes(runid='0')[source]

Summarize dtype information provided by this context.

Returns:

dictionary of provided dtypes with their corresponding lineage hash, save_when, version

purge_unused_configs()[source]

Purge unused configs from the context.

register(plugin_class)[source]

Register plugin_class as provider for data types in provides.

Parameters:

plugin_class – class inheriting from strax.Plugin. You can also pass a sequence of plugins to register, but then you must omit the provides argument. If a plugin class omits the .provides attribute, we will construct one from its class name (CamelCase -> snake_case).

Returns:

plugin_class (so this can be used as a decorator)

register_all(module)[source]

Register all plugins defined in module.

Can pass a list/tuple of modules to register all in each.
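
A hedged sketch of both registration methods; the module name and storage path below are hypothetical:

    import strax
    import my_analysis_plugins          # hypothetical module containing Plugin subclasses

    st = strax.Context(storage=strax.DataDirectory('./strax_data'))

    # Register a single plugin class (register can also be used as a class decorator):
    st.register(my_analysis_plugins.PeakBasics)

    # Register every Plugin subclass defined in a module:
    st.register_all(my_analysis_plugins)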

register_cut_list(cut_list)[source]

Register cut lists to strax context.

Parameters:

cut_list – cut lists to be registered. Can be a cutlist object or a list/tuple of cutlist objects.

run_defaults(run_id)[source]

Get configuration defaults from the run metadata (if these exist)

This will only call the rundb once for each run while the context is in existence; further calls to this will return a cached value.

run_metadata(run_id, projection=None) dict[source]

Return run-level metadata for run_id, or raise DataNotAvailable if this is not available.

Parameters:
  • run_id – run id to get

  • projection – Selection of fields to get, following MongoDB syntax. May not be supported by frontend.

runs: DataFrame | None = None
scan_runs(check_available=(), if_check_available='raise', store_fields=()) DataFrame

Update and return self.runs with runs currently available in all storage frontends.

Parameters:
  • check_available – Check whether these data types are available. Availability of xxx is stored as a boolean in the xxx_available column.

  • if_check_available – ‘raise’ (default) or ‘skip’, whether to do the check

  • store_fields – Additional fields from run doc to include as rows in the dataframe. The context options scan_availability and store_run_fields list data types and run fields, respectively, that will always be scanned.

search_field(pattern: str, include_code_usage: bool = True, return_matches: bool = False)[source]

Find and print which plugin(s) provides a field that matches pattern (fnmatch).

Parameters:
  • pattern – pattern to match, e.g. ‘time’ or ‘tim*’

  • include_code_usage – Also include the code occurrences of the fields that match the pattern.

  • return_matches – If set, return a dictionary with the matching fields and the occurrences in code.

Returns:

When return_matches is set, return a dictionary with the matching fields and their occurrences in code. Otherwise, nothing is returned and the results are only printed.

search_field_usage(search_string: str, plugin: Plugin | List[Plugin] | None = None) List[str][source]

Find and return which plugin(s) use a given field.

Parameters:
  • search_string – a field name to match exactly

  • plugin – plugin(s) in which to look for the field

Returns:

list of code occurrences in the form of PLUGIN.FUNCTION

select_runs(run_mode=None, run_id=None, include_tags=None, exclude_tags=None, available=(), pattern_type='fnmatch', ignore_underscore=True, force_reload=False)

Return pandas.DataFrame with basic info from runs that match selection criteria.

Parameters:
  • run_mode – Pattern to match run modes (reader.ini.name)

  • run_id – Pattern to match a run_id or run_ids

  • available – str or tuple of strs of data types for which data must be available according to the runs DB.

  • include_tags – String or list of strings of patterns for required tags

  • exclude_tags – String / list of strings of patterns for forbidden tags. Exclusion criteria have higher priority than inclusion criteria.

  • pattern_type – Type of pattern matching to use. Defaults to ‘fnmatch’, which means you can use unix shell-style wildcards (?, *). The alternative is ‘re’, which means you can use full python regular expressions.

  • ignore_underscore – Ignore the underscore at the start of tags (indicating some degree of officialness or automation).

  • force_reload – Force reloading of runs from storage. Otherwise, runs are cached after the first time they are loaded in self.runs.

Examples:
  • run_selection(include_tags='blinded')

    select all datasets with a blinded or _blinded tag.

  • run_selection(include_tags='*blinded')

    … with blinded or _blinded, unblinded, blablinded, etc.

  • run_selection(include_tags=['blinded', 'unblinded'])

    … with blinded OR unblinded, but not blablinded.

  • run_selection(include_tags='blinded', exclude_tags=['bad', 'messy'])

    … select blinded datasets that aren’t bad or messy.

set_config(config=None, mode='update')[source]

Set new configuration options.

Parameters:
  • config – dict of new options

  • mode – can be either - update: Add to or override current options in context - setdefault: Add to current options, but do not override - replace: Erase config, then set only these options

set_context_config(context_config=None, mode='update')[source]

Set new context configuration options.

Parameters:
  • context_config – dict of new context configuration options

  • mode – can be either - update: Add to or override current options in context - setdefault: Add to current options, but do not override - replace: Erase config, then set only these options

show_config(data_type=None, pattern='*', run_id='99999999999999999999')[source]

Return configuration options that affect data_type.

Parameters:
  • data_type – Data type name

  • pattern – Show only options that match (fnmatch) pattern

  • run_id – Run id to use for run-dependent config options. If omitted, will show defaults active for new runs.

size_mb(run_id, target)[source]

Return megabytes of memory required to hold data.

storage: List[StorageFrontend]
stored_dependencies(run_id: str, target: str | list | tuple, check_forbidden: bool = True, _targets_stored: dict | None = None) dict | None[source]

For a given run_id and target(s) get a dictionary of all the datatypes that are required to build the requested target.

Parameters:
  • run_id – run_id

  • target – target or a list of targets

  • check_forbidden – Check that we are not requesting to make a plugin whose creation is forbidden by the context.

Returns:

dictionary of data types (keys) required for building the requested target(s) and whether they are stored (values)

Raises:

strax.DataNotAvailable – if there is at least one data type that is not stored and has no dependency or if it cannot be created

takes_config = immutabledict({'storage_converter': <strax.config.Option object>, 'fuzzy_for': <strax.config.Option object>, 'fuzzy_for_options': <strax.config.Option object>, 'allow_incomplete': <strax.config.Option object>, 'allow_rechunk': <strax.config.Option object>, 'allow_multiprocess': <strax.config.Option object>, 'allow_shm': <strax.config.Option object>, 'allow_lazy': <strax.config.Option object>, 'forbid_creation_of': <strax.config.Option object>, 'store_run_fields': <strax.config.Option object>, 'check_available': <strax.config.Option object>, 'max_messages': <strax.config.Option object>, 'timeout': <strax.config.Option object>, 'saver_timeout': <strax.config.Option object>, 'use_per_run_defaults': <strax.config.Option object>, 'free_options': <strax.config.Option object>, 'apply_data_function': <strax.config.Option object>, 'write_superruns': <strax.config.Option object>})
to_absolute_time_range(run_id, targets=None, time_range=None, seconds_range=None, time_within=None, full_range=None)[source]

Return (start, stop) time in ns since unix epoch corresponding to time range.

Parameters:
  • run_id – run id to get

  • time_range – (start, stop) time in ns since unix epoch. Will be returned without modification

  • targets – data types. Used only if run metadata is unavailable, so run start time has to be estimated from data.

  • seconds_range – (start, stop) seconds since start of run

  • time_within – row of strax data (e.g. event)

  • full_range – If True, return the full time range of the run.

exception strax.context.OutsideException[source]

Bases: Exception

strax.corrections module

I/O format for corrections using MongoDB.

This module lists, reads, and writes corrections, among other functionality, to a MongoDB where corrections are stored.

class strax.corrections.CorrectionsInterface(client=None, database_name='corrections', host=None, username=None, password=None)[source]

Bases: object

A class to manage corrections stored in a MongoDB. Corrections are defined as a pandas.DataFrame with a pandas.DatetimeIndex in UTC; a v1 and an online version must be specified. Online versions are meant for online processing, whereas v1, v2, v3, … are meant for offline processing.

A global configuration can be set, meaning a unique set of correction maps.

static after_date_query(date, limit=1)[source]
static before_date_query(date, limit=1)[source]
static check_timezone(date)[source]

Smart logic to check that the date is given in the UTC time zone.

Raises ValueError if not.

Parameters:

date – date, e.g. datetime(2020, 8, 12, 21, 4, 32, 7, tzinfo=pytz.utc)

Returns:

the inserted date

get_context_config(when, global_config='global', global_version='v1')[source]

Global configuration logic.

Parameters:
  • when – date e.g. datetime(2020, 8, 12, 21, 4, 32, 7, tzinfo=pytz.utc)

  • global_config – a map of corrections

  • global_version – global configuration’s version

Returns:

configuration (type: dict)

interpolate(what, when, how='interpolate', **kwargs)[source]

Interpolate values of a given quantity (‘what’) of a given correction.

For information on interpolation methods see: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.interpolate.html

Parameters:
  • what – what you want to interpolate, i.e. which correction (DataFrame)

  • when – date, e.g. datetime(2020, 8, 12, 21, 4, 32, 7, tzinfo=pytz.utc)

  • how – Interpolation method, can be either ‘interpolate’ or ‘fill’

  • kwargs – forwarded to the interpolation

Returns:

DataFrame of the correction with the interpolated time (‘when’)

list_corrections()[source]

Smart logic to list all corrections available in the corrections database.

read(correction)[source]

Smart logic to read corrections.

Parameters:

correction – pandas.DataFrame object name in the DB (str type).

Returns:

DataFrame as read from the corrections database with time index or None if an empty DataFrame is read from the database

read_at(correction, when, limit=1)[source]

Smart logic to read corrections at a given time (index), i.e. by datetime index.

Parameters:
  • correction – pandas.DataFrame object name in the DB (str type).

  • when – when, datetime to read the corrections, e.g. datetime(2020, 8, 12, 21, 4, 32, 7, tzinfo=pytz.utc)

  • limit – how many indexes after and before when, i.e. limit=1 will return 1 index before and 1 after

Returns:

DataFrame as read from the corrections database with time index or None if an empty DataFrame is read from the database

static sort_by_index(df)[source]

Smart logic to sort dataframe by index using time column.

Returns:

df sorted by index (time)

write(correction, df, required_columns=('ONLINE', 'v1'))[source]

Smart logic to write corrections to the corrections database.

Parameters:
  • correction – corrections name (str type)

  • df – pandas.DataFrame object with a DatetimeIndex

  • required_columns – the DataFrame must include two columns: an ONLINE version and an OFFLINE version (e.g. v1)

strax.dtypes module

Fundamental dtypes for use in strax.

Note that if you change the dtype titles (comments), numba will crash if there is an existing numba cache. Clear __pycache__ and restart. TODO: file numba issue.

strax.dtypes.copy_to_buffer(source: ndarray, buffer: ndarray, func_name: str, field_names: Tuple[str] | None = None)[source]

Copy the data from the source to the destination, e.g. raw_records to records. To this end, we dynamically create an njitted function with the name func_name (which should start with an underscore).

Parameters:
  • source – array of input

  • buffer – array of buffer to fill with values from input

  • func_name – name under which to store the dynamically created function; should start with an underscore

  • field_names – dtype names to copy (if None, use all fields in the source)
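
A minimal sketch on toy structured arrays; the field names are illustrative, and only fields present in the source are copied:

    import numpy as np
    import strax

    source = np.zeros(3, dtype=[('time', np.int64), ('area', np.float32)])
    source['time'] = [0, 10, 20]
    source['area'] = [1.5, 2.5, 3.5]

    # The buffer may carry extra fields; those are left untouched.
    buffer = np.zeros(3, dtype=[('time', np.int64), ('area', np.float32), ('flag', np.bool_)])

    strax.copy_to_buffer(source, buffer, '_copy_example_to_buffer')
    print(buffer['area'])   # -> [1.5 2.5 3.5]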

strax.dtypes.hitlet_dtype()[source]

Hitlet dtype: the same as the peaklet or peak dtype, but for hit-kind objects.

strax.dtypes.hitlet_with_data_dtype(n_samples=2)[source]

Hitlet dtype with data field. Required within the plugins to compute hitlet properties.

Parameters:

n_samples – Buffer length of the data field. Make sure it can hold the longest hitlet.

strax.dtypes.peak_dtype(n_channels=100, n_sum_wv_samples=200, n_widths=11, digitize_top=True, hits_timing=True)[source]

Data type for peaks - ranges across all channels in a detector. Remember to set channel to -1 (todo: make enum).

strax.dtypes.raw_record_dtype(samples_per_record=110)[source]

Data type for a waveform raw_record.

Length can be shorter than the number of samples in data; this indicates a record with zero-padding at the end.

strax.dtypes.record_dtype(samples_per_record=110)[source]

Data type for a waveform record.

Length can be shorter than the number of samples in data; this indicates a record with zero-padding at the end.

strax.io module

Read/write numpy arrays to/from compressed files or file-like objects.

strax.io.dry_load_files(dirname, chunk_number=None)[source]
strax.io.load_file(f, compressor, dtype)[source]

Read and return data from file.

Parameters:
  • f – file name or handle to read from

  • compressor – compressor to use for decompressing. If not passed, will try to load it from json metadata file.

  • dtype – numpy dtype of data to load

strax.io.save_file(f, data, compressor='zstd')[source]

Save data to file and return number of bytes written.

Parameters:
  • f – file name or handle to save to

  • data – data (numpy array) to save

  • compressor – compressor to use

strax.mailbox module

exception strax.mailbox.InvalidMessageNumber[source]

Bases: MailboxException

exception strax.mailbox.MailBoxAlreadyClosed[source]

Bases: MailboxException

class strax.mailbox.Mailbox(name='mailbox', timeout=None, lazy=False, max_messages=None)[source]

Bases: object

Publish/subscribe mailbox for building complex pipelines out of simple iterators, using multithreading.

A sender can be any iterable. To read from the mailbox, either:
  1. Use .subscribe() to get an iterator. You can only use one of these per thread.

  2. Use .add_subscriber(f) to subscribe the function f. f should take an iterator as its first argument (and actually iterate over it, of course).

Each sender and receiver is wrapped in a thread, so they can be paused:
  • senders, if the mailbox is full;

  • readers, if they call next() but the next message is not yet available.

Any futures sent in are awaited before they are passed to receivers.

Exceptions in a sender cause MailboxKilled to be raised in each reader. If the reader doesn’t catch this, and it writes to another mailbox, this therefore kills that mailbox (raises MailboxKilled for each reader) as well. Thus MailboxKilled exceptions travel downstream in pipelines.

Sender threads are not killed by exceptions raised in readers. To kill sender threads too, use .kill(upstream=True). Even this does not propagate further upstream than the immediate sender threads.
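
A hedged sketch of the publish/subscribe flow, modeled on the patterns used in strax’s own tests; the names are illustrative:

    import strax

    mb = strax.Mailbox(name='squares')

    # Sender: any iterable; each element becomes one message.
    mb.add_sender(iter([1, 2, 3, 4]), name='source:squares')

    # Reader: a function that consumes the message iterator.
    results = []
    mb.add_reader(lambda source: results.extend(x ** 2 for x in source),
                  name='read_0:squares')

    mb.start()      # spawn the sender and reader threads
    mb.cleanup()    # wait for them to finish and clean up
    print(results)  # -> [1, 4, 9, 16]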

DEFAULT_MAX_MESSAGES = 4
DEFAULT_TIMEOUT = 300
add_reader(subscriber, name=None, can_drive=True, **kwargs)[source]

Subscribe a function to the mailbox.

Parameters:
  • subscriber – Function which accepts a generator over messages as the first argument. Any kwargs will also be passed to the function.

  • name – Name of the thread in which the function will run. Defaults to read_<number>:<mailbox_name>

  • can_drive – Whether this reader can cause new messages to be generated when in lazy mode.

add_sender(source, name=None)[source]

Configure mailbox to read from an iterable source.

Parameters:
  • source – Iterable to read from

  • name – Name of the thread in which the function will run. Defaults to source:<mailbox_name>

cleanup()[source]
close()[source]
kill(upstream=True, reason=None)[source]
kill_from_exception(e, reraise=True)[source]

Kill the mailbox following a caught exception e.

send(msg, msg_number: int | None = None)[source]

Send a message.

If the message is a future, receivers will be passed its result (possibly waiting for completion if needed).

If the mailbox is currently full, sleep until there is room for your message (or a timeout occurs).

start()[source]
subscribe(can_drive=True)[source]

Return generator over messages in the mailbox.

exception strax.mailbox.MailboxException[source]

Bases: Exception

exception strax.mailbox.MailboxFullTimeout[source]

Bases: MailboxException

exception strax.mailbox.MailboxKilled[source]

Bases: MailboxException

exception strax.mailbox.MailboxReadTimeout[source]

Bases: MailboxException

strax.mailbox.divide_outputs(source, mailboxes: Dict[str, Mailbox], lazy=False, flow_freely=(), outputs=None)[source]

This code is a ‘mail sorter’ which gets dicts of arrays from source and sends the right array to the right mailbox.

strax.plugin module

Plugin system for strax.

A ‘plugin’ is something that outputs an array and gets arrays from one or more other plugins.

exception strax.plugin.InputTimeoutExceeded

Bases: Exception

class strax.plugin.Plugin

Bases: object

Plugin containing strax computation.

You should NOT instantiate plugins directly. Do NOT add unpickleable things (e.g. loggers) as attributes.
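
A minimal illustrative subclass (the plugin, its data types, and field names are hypothetical): a plugin declares depends_on, provides and a dtype, and implements compute:

    import numpy as np
    import strax

    class PeakWidths(strax.Plugin):
        """Hypothetical plugin computing one quantity per peak."""
        __version__ = '0.0.1'
        depends_on = ('peaks',)
        provides = 'peak_widths'
        dtype = [(('Peak duration in ns', 'width'), np.float32)] + strax.time_fields

        def compute(self, peaks):
            result = np.zeros(len(peaks), dtype=self.dtype)
            result['time'] = peaks['time']
            result['endtime'] = strax.endtime(peaks)
            result['width'] = strax.endtime(peaks) - peaks['time']
            return result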

can_rechunk(data_type)
child_plugin = False
chunk(*, start, end, data, data_type=None, run_id=None)
chunk_target_size_mb = 200
cleanup(wait_for)
compressor = 'blosc'
compute(**kwargs)
compute_takes_chunk_i = False
compute_takes_start_end = False
config: Dict
data_kind: str | immutabledict | dict
data_kind_for(data_type)
dependencies_by_kind()

Return dependencies grouped by data kind i.e. {kind1: [dep0, dep1], kind2: [dep, dep]}

Parameters:

require_time – If True, one dependency of each kind must provide time information. It will be put first in the list. If require_time is omitted, we will require time only if there is more than one data kind in the dependencies.

depends_on: tuple
deps: Dict
do_compute(chunk_i=None, **kwargs)

Wrapper for the user-defined compute method.

This is the ‘job’ that gets executed in different processes/threads during multiprocessing

dtype: tuple | dtype | immutabledict | dict
dtype_for(data_type)

Provide the dtype of one of the provide arguments of the plugin.

NB: does not simply provide the dtype of any datatype but must be one of the provide arguments known to the plugin.

empty_result()
fix_dtype()
infer_dtype()

Return dtype of computed data; used only if no dtype attribute defined.

input_buffer: Dict[str, Chunk]
input_timeout = 80
is_ready(chunk_i)

Return whether the chunk chunk_i is ready for reading.

Returns True by default; override if you make an online input plugin.

iter(iters, executor=None)

Iterate over dependencies and yield results.

Parameters:
  • iters – dict with iterators over dependencies

  • executor – Executor to punt computation tasks to. If None, will compute inside the plugin’s thread.

property log
max_messages = None
metadata(run_id, data_type)

Metadata to save along with produced data.

property multi_output
parallel: str | bool = False
provides: tuple
rechunk_on_save = True
run_i: int
run_id: str
save_when = 3
setup()

Hook if plugin wants to do something on initialization.

source_finished()

Return whether all chunks the plugin wants to read have been written.

Only called for online input plugins.

takes_config = immutabledict({})
version(run_id=None)

Return version number applicable to the run_id.

Most plugins just have a single version (in .__version__) but some may be at different versions for different runs (e.g. time-dependent corrections).

exception strax.plugin.PluginGaveWrongOutput

Bases: Exception

class strax.plugin.SaveWhen(value)

Bases: IntEnum

Plugin’s preference for having its data saved.

ALWAYS = 3
EXPLICIT = 1
NEVER = 0
TARGET = 2

strax.processor module

class strax.processor.ProcessorComponents(plugins: Dict[str, Plugin], loaders: Dict[str, Callable], loader_plugins: Dict[str, Plugin], savers: Dict[str, List[Saver]], targets: Tuple[str])[source]

Bases: NamedTuple

Specification to assemble a processor.

loader_plugins: Dict[str, Plugin]

Alias for field number 2

loaders: Dict[str, Callable]

Alias for field number 1

plugins: Dict[str, Plugin]

Alias for field number 0

savers: Dict[str, List[Saver]]

Alias for field number 3

targets: Tuple[str]

Alias for field number 4

class strax.processor.ThreadedMailboxProcessor(components: ProcessorComponents, allow_rechunk=True, allow_shm=False, allow_multiprocess=False, allow_lazy=True, max_workers=None, max_messages=4, timeout=60, is_superrun=False)[source]

Bases: object

iter()[source]
mailboxes: Dict[str, Mailbox]

strax.run_selection module

Context methods dealing with run scanning and selection.

strax.run_selection.flatten_run_metadata(md)[source]

strax.testutils module

strax.utils module

class strax.utils.NumpyJSONEncoder(*, skipkeys=False, ensure_ascii=True, check_circular=True, allow_nan=True, sort_keys=False, indent=None, separators=None, default=None)[source]

Bases: JSONEncoder

Special JSON encoder for numpy types. Edited from mpld3: mpld3/_display.py

default(obj)[source]

Implement this method in a subclass such that it returns a serializable object for o, or calls the base implementation (to raise a TypeError).

For example, to support arbitrary iterators, you could implement default like this:

def default(self, o):
    try:
        iterable = iter(o)
    except TypeError:
        pass
    else:
        return list(iterable)
    # Let the base class default method raise the TypeError
    return JSONEncoder.default(self, o)
strax.utils.apply_selection(x, selection=None, selection_str=None, keep_columns=None, drop_columns=None, time_range=None, time_selection='fully_contained')[source]

Return x after applying selections.

Parameters:
  • x – Numpy structured array

  • selection – Query string, sequence of strings, or simple function to apply.

  • selection_str – Same as selection (deprecated)

  • time_range – (start, stop) range to load, in ns since the epoch

  • keep_columns – Field names of the columns to keep.

  • drop_columns – Field names of the columns to drop.

  • time_selection – Kind of time selection to apply: - skip: Do not select a time range, even if other arguments say so - touching: select things that (partially) overlap with the range - fully_contained: (default) select things fully contained in the range

The right bound is, as always in strax, considered exclusive. Thus, data that ends (exclusively) exactly at the right edge of a fully_contained selection is returned.
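
A minimal sketch on toy structured data; the field names are illustrative and apply_selection is assumed to be exported at the package level:

    import numpy as np
    import strax

    data = np.zeros(4, dtype=[('time', np.int64), ('endtime', np.int64), ('area', np.float64)])
    data['time'] = [0, 10, 20, 30]
    data['endtime'] = data['time'] + 5
    data['area'] = [1., 50., 200., 3.]

    # Keep entries with area > 10 that are fully contained in [0, 26) ns.
    selected = strax.apply_selection(data, selection='area > 10',
                                     time_range=(0, 26),
                                     time_selection='fully_contained')
    print(selected['area'])   # -> [ 50. 200.]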

strax.utils.camel_to_snake(x)[source]

Convert x from CamelCase to snake_case.

strax.utils.compare_dict(old: dict, new: dict)[source]

Compare two dictionaries and print the differences.

strax.utils.convert_tuple_to_list(init_func_input)[source]

Convert tuples into lists in an arbitrarily nested dictionary.

strax.utils.count_tags(ds)[source]

Return how often each tag occurs in the datasets DataFrame ds.

strax.utils.deterministic_hash(thing, length=10)[source]

Return a base32 lowercase string of the given length, determined from hashing a container hierarchy.

strax.utils.dict_to_rec(x, dtype=None)[source]

Convert a dictionary {field_name: array} to a record array. Optionally, provide the dtype.

strax.utils.exporter(export_self=False)[source]

Export utility modified from https://stackoverflow.com/a/41895194. Returns the export decorator and the __all__ list.

strax.utils.flatten_dict(d, separator=':', _parent_key='', keep=())[source]

Flatten nested dictionaries into a single dictionary, indicating levels by separator.

Don’t set _parent_key argument, this is used for recursive calls. Stolen from http://stackoverflow.com/questions/6027558

Parameters:

keep – key or list of keys whose values should not be flattened.

strax.utils.formatted_exception()[source]

Return human-readable multiline string with info about the exception that is currently being handled.

If no exception, or StopIteration, is being handled, returns an empty string.

For MailboxKilled exceptions, we return the original exception instead.

strax.utils.group_by_kind(dtypes, plugins=None, context=None) Dict[str, List][source]

Return dtypes grouped by data kind i.e. {kind1: [d, d, …], kind2: [d, d, …], …}

Parameters:
  • plugins – plugins providing the dtypes.

  • context – context to get plugins from if not given.

strax.utils.growing_result(dtype=<class 'numpy.int64'>, chunk_size=10000)[source]

Decorator factory for functions that fill numpy arrays.

Functions must obey following API:

  • accept _result_buffer keyword argument with default None; this will be the buffer array of specified dtype and length chunk_size (it’s an optional argument so this decorator preserves signature)

  • ‘yield N’ from the function will cause the first N elements of the buffer to be saved

  • function is responsible for tracking offset, calling yield on time, and clearing the buffer afterwards.

  • optionally, accept result_dtype argument with default None; this allows function user to specify return dtype

See test_utils.py for a simple example (I can’t get it to run as a doctest unfortunately)
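
A hedged sketch of the expected API, close to the example in test_utils.py; the function below is illustrative:

    import numpy as np
    import strax

    @strax.growing_result(dtype=np.int64, chunk_size=4)
    def find_evens(values, _result_buffer=None):
        buffer = _result_buffer
        offset = 0
        for v in values:
            if v % 2 == 0:
                buffer[offset] = v
                offset += 1
                if offset == len(buffer):
                    yield offset      # buffer full: save all its elements, then reuse it
                    offset = 0
        yield offset                  # save whatever remains

    print(find_evens(np.arange(20)))  # -> [ 0  2  4  6  8 10 12 14 16 18]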

strax.utils.hashablize(obj)[source]

Convert a container hierarchy into one that can be hashed.

See http://stackoverflow.com/questions/985294

strax.utils.inherit_docstring_from(cls)[source]

Decorator for inheriting doc strings, stolen from https://groups.google.com/forum/#!msg/comp.lang.python/HkB1uhDcvdk/lWzWtPy09yYJ

strax.utils.iter_chunk_meta(md)[source]

Iterate over chunk info from metadata md adding n_from and n_to fields.

strax.utils.merge_arrs(arrs, dtype=None)[source]

Merge structured arrays of equal length. On field name collisions, data from later arrays is kept.

If you pass one array, it is returned without copying. TODO: hmm… inconsistent

Much faster than the similar function in numpy.lib.recfunctions.

strax.utils.merged_dtype(dtypes)[source]
strax.utils.multi_run(exec_function, run_ids, *args, max_workers=None, throw_away_result=False, multi_run_progress_bar=True, ignore_errors=False, log=None, **kwargs)[source]

Execute exec_function(run_id, *args, **kwargs) over multiple runs, then return list of result arrays, each with a run_id column added.

Parameters:
  • exec_function – Function to run

  • run_ids – list/tuple of run_ids

  • max_workers – number of worker threads/processes to spawn. If set to None, defaults to 1.

  • throw_away_result – instead of collecting result, return None.

  • multi_run_progress_bar – show a tqdm progressbar for multiple runs.

  • ignore_errors – Return the data for the runs that successfully loaded, even if some runs failed executing.

  • log – logger to be used. Other (kw)args will be passed to the exec_function.

strax.utils.parse_selection(x, selection)[source]

Parse a selection string into a mask that can be used to filter data.

Parameters:

selection – Query string, sequence of strings, or simple function to apply.

Returns:

Boolean mask indicating the selected items.

strax.utils.print_record(x, skip_array=True)[source]

Print record(s) x in human-readable format.

Parameters:

skip_array – Omit printing array fields.

strax.utils.profile_threaded(filename)[source]
strax.utils.remove_titles_from_dtype(dtype)[source]

Return np.dtype with titles removed from fields.

strax.utils.to_numpy_dtype(field_spec)[source]
strax.utils.to_str_tuple(x) Tuple[Any, ...][source]
strax.utils.unpack_dtype(dtype)[source]

Return list of tuples needed to construct the dtype.

dtype == np.dtype(unpack_dtype(dtype))

Module contents