Recompressing & moving data
===========================
There are two options for recompressing data:

- via the context, using :py:func:`context.copy_to_frontend`;
- via a dedicated script, ``rechunker``, that only works for filesystem
  backends and runs outside the context.

To recompress data with another compression algorithm, the
:py:func:`context.copy_to_frontend` function can be used.
The function works on a per-run_id, per-datatype basis. In the example
below, records data is copied to a second frontend.

.. code-block:: python

    import strax
    import os
    # Naturally, these plugins (Records and Peaks) only serve as examples
    # and are best replaced by a fully constructed context
    from strax.testutils import Records, Peaks, run_id

    # Initialize context (st):
    st = strax.Context(register=[Records, Peaks])

    # Initialize frontends
    storage_frontend_A = strax.DataDirectory('./folder_A')
    storage_frontend_B = strax.DataDirectory('./folder_B',
                                             readonly=True)
    st.storage = [storage_frontend_A,
                  storage_frontend_B]

    # In this example, we will only consider records
    target = "records"

    print(f'Are records stored?\n{st.is_stored(run_id, target)}')

    # Make the data (stores to every writable frontend available)
    st.get_array(run_id, target)

    for sf in st.storage:
        print(f'{target} stored in\n\t{sf}?\n\t{st._is_stored_in_sf(run_id, target, sf)}')

Which prints:

.. code-block:: rst

    Are records stored?
    False
    records stored in
        strax.storage.files.DataDirectory, path: ./folder_A?
        True
    records stored in
        strax.storage.files.DataDirectory, readonly: True, path: ./folder_B?
        False

Copy
____
In the example above, `storage_frontend_B` was readonly; therefore, when
creating records, no data is stored there.
Below, we will copy the data from `storage_frontend_A` to `storage_frontend_B`.

.. code-block:: python

    # First set readonly=False on storage_frontend_B so that we can copy
    # data there
    storage_frontend_B.readonly = False

    # In the st.storage-list, storage_frontend_B is index 1
    index_frontend_B = 1
    st.copy_to_frontend(run_id, target,
                        target_frontend_id=index_frontend_B)

    for sf in [storage_frontend_A, storage_frontend_B]:
        print(f'{target} stored in\n\t{sf}?\n\t{st._is_stored_in_sf(run_id, target, sf)}')

Which prints the following (so we can see that the copy to `folder_B` was
successful):

.. code-block:: rst

    records stored in
        strax.storage.files.DataDirectory, path: ./folder_A?
        True
    records stored in
        strax.storage.files.DataDirectory, path: ./folder_B?
        True

Copy and recompress
___________________
Now, with a third storage frontend, we will recompress the data to reduce the
size on disk.

.. code-block:: python

    # Recompression with a different compressor
    # See strax.io.COMPRESSORS for more compressors
    target_compressor = 'bz2'

    # Add the extra storage frontend
    index_frontend_C = 2
    storage_frontend_C = strax.DataDirectory('./folder_C')
    st.storage.append(storage_frontend_C)

    # Copy and recompress
    st.copy_to_frontend(run_id, target,
                        target_frontend_id=index_frontend_C,
                        target_compressor=target_compressor)

    for sf in st.storage:
        first_chunk = os.path.join(sf.path,
                                   '0-records-sqcyyhsfpv',
                                   'records-sqcyyhsfpv-000000')
        # os.path.getsize returns the size in bytes
        print(f'In {sf.path}, the first chunk is {os.path.getsize(first_chunk)} bytes')

Which outputs:

.. code-block:: rst

    In ./folder_A, the first chunk is 275 bytes
    In ./folder_B, the first chunk is 275 bytes
    In ./folder_C, the first chunk is 65 bytes

From the output we can see that the first chunk in folder_C is much smaller
than in folder_A/folder_B. This comes from the fact that `bz2` compresses the
data much more than the default compressor `blosc`.

How does this work?
___________________
Strax knows from the metadata stored with the data which compressor was used to write it.
It is possible to use a different compressor when re-writing the data to disk
(as done for folder_C in the example above).

As such, for further use, it does not matter whether the data comes from
folder_A, folder_B or folder_C, as the metadata will tell strax which
compressor to use. Different compressors may have different performance for
loading/writing data.

Rechunker script
================
From strax v1.2.2 onwards, a ``rechunker`` script is automatically installed
with strax. It can be used to re-write data in the ``FileSystem`` backend.

For example:

.. code-block:: bash

    rechunker --source 009104-raw_records_aqmon-rfzvpzj4mf --compressor zstd

will output:

.. code-block:: rst

    Will write to /tmp/tmpoj0xpr78 and make sub-folder 009104-raw_records_aqmon-rfzvpzj4mf
    Rechunking 009104-raw_records_aqmon-rfzvpzj4mf to /tmp/tmpoj0xpr78/009104-raw_records_aqmon-rfzvpzj4mf
    move /tmp/tmpoj0xpr78/009104-raw_records_aqmon-rfzvpzj4mf to 009104-raw_records_aqmon-rfzvpzj4mf
    Re-compressed 009104-raw_records_aqmon-rfzvpzj4mf
        backend_key         009104-raw_records_aqmon-rfzvpzj4mf
        load_time           0.4088103771209717
        write_time          0.07699322700500488
        uncompressed_mb     1.178276
        source_compressor   zstd
        dest_compressor     zstd
        source_mb           0.349217
        dest_mb             0.349218

Using script to profile write/read rates for compressors
--------------------------------------------------------
This script can easily be used to profile different compressors:

.. code-block:: bash

    # Each pass re-writes the same folder, so the source compressor of one
    # pass is the destination compressor of the previous pass
    for COMPRESSOR in zstd bz2 lz4 blosc zstd; \
        do echo $COMPRESSOR; \
        rechunker \
            --source 009104-raw_records-rfzvpzj4mf \
            --write_stats_to test.csv \
            --compressor $COMPRESSOR; \
        done

We can check the output in Python using:

.. code-block:: python

    >>> import pandas as pd
    >>> df = pd.read_csv('test.csv')
    >>> df['read_mbs'] = df['uncompressed_mb']/df['load_time']
    >>> df['write_mbs'] = df['uncompressed_mb']/df['write_time']
    >>> print(df[['source_compressor', 'read_mbs', 'dest_compressor', 'write_mbs']].to_string())
      source_compressor    read_mbs dest_compressor   write_mbs
    0              zstd  313.922890            zstd  298.429123
    1              zstd  284.530054             bz2    8.932259
    2               bz2   20.289876             lz4  228.932498
    3               lz4  372.491150           blosc  433.494794
    4             blosc  725.154966            zstd  215.765177
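
Besides read and write rates, the same statistics can be used to compare how
strongly each compressor shrinks the data. The sketch below is only an
illustration under stated assumptions: it does not read ``test.csv`` but
builds a small DataFrame by hand, using the column names from the summary
printed by the rechunker (``uncompressed_mb``, ``dest_mb``,
``dest_compressor``). It is an assumption that ``test.csv`` contains the
``dest_mb`` column as well. Only the ``zstd`` row is copied from that summary;
the ``bz2`` and ``lz4`` sizes are made-up placeholders, not measurements.

.. code-block:: python

    import pandas as pd

    # Hand-built stand-in for the rechunker statistics.
    # Only the zstd row is taken from the summary shown earlier; the bz2 and
    # lz4 sizes are hypothetical placeholders.
    df = pd.DataFrame({
        'dest_compressor': ['zstd', 'bz2', 'lz4'],
        'uncompressed_mb': [1.178276, 1.178276, 1.178276],
        'dest_mb': [0.349218, 0.30, 0.50],
    })

    # Compression ratio: uncompressed size divided by the size on disk
    df['compression_ratio'] = df['uncompressed_mb'] / df['dest_mb']

    # Sort so that the strongest compression comes first
    ranking = df.sort_values('compression_ratio', ascending=False)
    print(ranking[['dest_compressor', 'compression_ratio']].to_string(index=False))

With real measurements, the ratio can be weighed against ``read_mbs`` and
``write_mbs`` to pick a compressor that balances disk usage and speed.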