Recompressing & moving data

There are two options for recompressing data:
  • via the context context.copy_to_frontend()

  • via a dedicated script rechunker that only works for filesystem backends and works outside the context.

In order to recompress data with another compression algorithm the context.copy_to_frontend() function can be used. The function works on a per run_id-, per datatype- basis. In the example below, peaks data is copied to a second frontend.

import strax
import os
# Naturally, these plugins (Records and Peaks) only serve as examples
# and are best replaced by a fully constructed context
from strax.testutils import Records, Peaks, run_id

# Initialize context (st):
st = strax.Context(register=[Records, Peaks])

# Initialize frontends
storage_frontend_A = strax.DataDirectory('./folder_A')
storage_frontend_B = strax.DataDirectory('./folder_B',
                                      readonly=True)
st.storage = [storage_frontend_A,
              storage_frontend_B]

# In this example, we will only consider records
target = "records"

print(f'Are records stored?\n{st.is_stored(run_id, target)}')

# Make the data (stores to every frontend available)
st.get_array(run_id, 'records')

for sf in st.storage:
    print(f'{target} stored in\n\t{sf}?\n\t{st._is_stored_in_sf(run_id, target, sf)}')

Which prints:

Are records stored?
False
records stored in
    strax.storage.files.DataDirectory, path: ./folder_A?
    True
records stored in
    strax.storage.files.DataDirectory, readonly: True, path: ./folder_B?
    False

Copy

In the example above the storage_frontend_B was readonly, therefore, when creating records, no is data stored there. Below, we will copy the data from storage_frontend_A to storage_frontend_B.

# First set the storage_frontend_B for readonly=False such that we can copy
# data there
storage_frontend_B.readonly = False

# In the st.storage-list, storage_frontend_B is index 1
index_frontend_B = 1
st.copy_to_frontend(run_id, target,
                    target_frontend_id=index_frontend_B)

for sf in [storage_frontend_A,  storage_frontend_B]:
    print(f'{target} stored in\n\t{sf}?\n\t{st._is_stored_in_sf(run_id, target, sf)}')

Which prints the following (so we can see that the copy to folder_B was successful.

records stored in
    strax.storage.files.DataDirectory, path: ./folder_A?
    True
records stored in
    strax.storage.files.DataDirectory, path: ./folder_B?
    True

Copy and recompress

Now, with a third storage frontend, we will recompress the data to reduce the size on disk.

# Recompression with a different compressor
# See strax.io.COMPRESSORS for more compressors
target_compressor = 'bz2'

# Add the extra storage frontend
index_frontend_C = 2
storage_frontend_C = strax.DataDirectory('./folder_C')
st.storage.append(storage_frontend_C)

# Copy and recompress
st.copy_to_frontend(run_id, target,
                    target_frontend_id=index_frontend_C,
                    target_compressor=target_compressor)

for sf in st.storage:
    first_cunk = os.path.join(sf.path,
                             '0-records-sqcyyhsfpv',
                             'records-sqcyyhsfpv-000000')
    print(f'In {sf.path}, the first chunk is {os.path.getsize(first_cunk)} kB')

Which outputs:

In ./folder_A, the first chunk is 275 kB
In ./folder_B, the first chunk is 275 kB
In ./folder_C, the first chunk is 65 kB

From the output we can see that the size of the first chunk of folder_C, the data much smaller than in folder_A/folder_B. This comes from the fact that bz2 compresses the data much more than the default compressor blosc.

How does this work?

Strax knows from the metadata stored with the data with witch compressor the data was written. It is possible to use a different compressor when re-writing the data to disk (as done for strax knows from the metadata stored with the data with witch compressor the data was written. It is possible to use a different compressor when re-writing the data to disk (as done folder_C in the example above).

As such, for further use, it does not matter if the data is coming from either of folders folder_A-folder_C as the metadata will tell strax which compressor to use. Different compressors may have different performance for loading/writing data.

Rechunker script

From strax v1.2.2 onwards, a rechunker script is automatically installed with strax. It can be used to re-write data in the FileSystem backend.

For example:

rechunker --source 009104-raw_records_aqmon-rfzvpzj4mf --compressor zstd

will output:

Will write to /tmp/tmpoj0xpr78 and make sub-folder 009104-raw_records_aqmon-rfzvpzj4mf
Rechunking 009104-raw_records_aqmon-rfzvpzj4mf to /tmp/tmpoj0xpr78/009104-raw_records_aqmon-rfzvpzj4mf
move /tmp/tmpoj0xpr78/009104-raw_records_aqmon-rfzvpzj4mf to 009104-raw_records_aqmon-rfzvpzj4mf
Re-compressed 009104-raw_records_aqmon-rfzvpzj4mf
        backend_key             009104-raw_records_aqmon-rfzvpzj4mf
        load_time               0.4088103771209717
        write_time              0.07699322700500488
        uncompressed_mb         1.178276
        source_compressor       zstd
        dest_compressor         zstd
        source_mb               0.349217
        dest_mb                 0.349218

This script can easily be used to profile different compressors:

for COMPRESSOR in zstd bz2 lz4 blosc zstd; \
    do echo $COMPRESSOR; \
    rechunker \
        --source 009104-raw_records-rfzvpzj4mf \
        --write_stats_to test.csv \
        --compressor $COMPRESSOR; \
    done

We can check the output in python using:

>>> import pandas as pd
>>> df = pd.read_csv('test.csv')
>>> df['read_mbs'] = df['uncompressed_mb']/df['load_time']
>>> df['write_mbs'] = df['uncompressed_mb']/df['write_time']
>>> print(df[['source_compressor', 'read_mbs', 'dest_compressor', 'write_mbs']].to_string())
    source_compressor    read_mbs dest_compressor   write_mbs
  0              zstd  313.922890            zstd  298.429123
  1              zstd  284.530054             bz2    8.932259
  2               bz2   20.289876             lz4  228.932498
  3               lz4  372.491150           blosc  433.494794
  4             blosc  725.154966            zstd  215.765177