I/O with HDF5

Background

HDF5 is a format often used in computational physics and other data-science applications because it can store huge amounts of structured numerical data. Many datasets can be stored in a single file, categorized, linked together, and so on. A variety of python modules leverage HDF5 for input and output; they often rely on h5py or PyTables, pythonic interfaces that interoperate with numpy but have no native support for general python objects.

More complex data structures therefore need some other route into HDF5, and a variety of choices are possible. Any python object can be pickled and stored as a binary blob in HDF5, but the resulting blobs are not usable outside of python. The pandas data analysis module can read_hdf and export to_hdf, but even though the data is written in a usable way, the data layouts are nontrivial to read without pandas.

We aim for a happy medium by providing a class, ReadWriteable. Other python classes that contain a variety of data fields can inherit from it so that they can be easily serialized to and from HDF5. A ReadWriteable object will be saved as a group that contains its properties written into groups and datasets, each with the same name as the property itself. If a property is one of a slew of known types then it will be written natively as an H5 field; otherwise it will be pickled.

The data types that are not pickled are:

ReadWriteable, or anything that inherits from ReadWriteable.
bool, int, float, and complex.
tuple, list, and dict (with some limitations on valid keys).
numpy.ndarray.
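
The native-or-pickle rule can be pictured as a type check on each property. What follows is a minimal stand-alone sketch, not the library's implementation: plan_storage and NATIVE_TYPES are hypothetical names, and numpy.ndarray and ReadWriteable are left out of the tuple only to keep the sketch free of third-party imports. It also assumes that no custom strategy has been registered for the extra type.

```python
from fractions import Fraction

# Hypothetical stand-in for the natively supported types listed above.
# numpy.ndarray and ReadWriteable are omitted to keep this sketch stdlib-only.
NATIVE_TYPES = (bool, int, float, complex, tuple, list, dict)


def plan_storage(obj):
    """Map each property name of obj to 'native' or 'pickle'."""
    return {
        name: 'native' if isinstance(value, NATIVE_TYPES) else 'pickle'
        for name, value in vars(obj).items()
    }


class Example:
    def __init__(self):
        self.steps = 1000            # int: a known type
        self.coupling = 0.25         # float: a known type
        self.shape = (8, 8)          # tuple: a known type
        self.ratio = Fraction(1, 3)  # not in the list, so it falls back to pickle


plan = plan_storage(Example())
```

Only the ratio property is marked for pickling here; the other three would be readable from the file without python.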

Serializing Objects

ReadWriteable objects inherit methods

class supervillain.h5.ReadWriteable

Bases: object

to_h5(group, _top=True): Write the object as an HDF5 group. Member data will be stored as groups or datasets inside group, with the same name as the property itself.

Note

PEP 8 treats names with a _single_leading_underscore as weakly marked for internal use. All such properties will be stored in a single group named _.

classmethod from_h5(group, strict=True, _top=True): Construct a fresh object from the HDF5 group.

Warning

If there is no known strategy for writing data to HDF5, objects will be pickled.

Loading pickled data received from untrusted sources can be unsafe.

See: https://docs.python.org/3/library/pickle.html for more.

h5/readwriteable.py

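import h5py as h5  # assumed alias: the excerpt refers to h5py as h5

def _example_readwrite(cls, filename, original):
    with h5.File(filename, 'w') as f:
        # Write the object into a fresh group in the file ...
        original.to_h5(f.create_group('object'))
        # ... and construct a new object from that same group.
        from_disk = cls.from_h5(f['object'])

    return from_disk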

One can also provide custom strategies and custom to_h5 and from_h5 methods. Even then, it is advisable to inherit from ReadWriteable for typechecking purposes.

Serializing Raw Data

To provide custom methods for H5ing otherwise-unknown types that cannot be made ReadWriteable, a user can write a small strategy. A strategy is a class that is never instantiated and has just three static methods: applies, write, and read. For example, the strategy for writing a single integer is

h5/strategy/int.py

from supervillain.h5 import Data

class Integer(Data, name='integer'):

    @staticmethod
    def applies(value):
        return isinstance(value, int)

    @staticmethod
    def read(group, strict):
        return int(group[()])

    @staticmethod
    def write(group, key, value):
        group[key] = value
        return group[key]

However, in most circumstances it is probably simplest to just inherit from ReadWriteable. See supervillain/h5/strategy/ for the default strategies. If the ReadWriteable strategy is desired but the class cannot be made to inherit from ReadWriteable, just create a new strategy that inherits from h5.strategy.ReadWriteable and overrides the applies method.
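
To make the applies/write/read contract concrete, here is a minimal stand-alone sketch of a strategy for a type outside the default list. Everything in it is an assumption made for illustration: the Data class below is a toy stand-in rather than supervillain.h5.Data, the registry built with `__init_subclass__` is only one plausible way a `name=` keyword could be consumed, FractionStrategy and write_with_first_match are hypothetical names, and a plain dict plays the role of an HDF5 group.

```python
from fractions import Fraction


class Data:
    """Toy stand-in for supervillain.h5.Data (an assumption, not the real class)."""

    registry = {}  # maps strategy name -> strategy class

    def __init_subclass__(cls, name, **kwargs):
        super().__init_subclass__(**kwargs)
        Data.registry[name] = cls


class FractionStrategy(Data, name='fraction'):
    """Hypothetical strategy for fractions.Fraction."""

    @staticmethod
    def applies(value):
        return isinstance(value, Fraction)

    @staticmethod
    def read(node, strict):
        # strict is accepted to mirror the signature above but is unused here.
        numerator, denominator = node
        return Fraction(numerator, denominator)

    @staticmethod
    def write(group, key, value):
        # Store the two integers rather than an opaque pickled blob.
        group[key] = (value.numerator, value.denominator)
        return group[key]


def write_with_first_match(group, key, value):
    """Write value using the first registered strategy that applies."""
    for name, strategy in Data.registry.items():
        if strategy.applies(value):
            strategy.write(group, key, value)
            return name
    raise TypeError(f'no strategy applies to {type(value).__name__}')


group = {}  # a plain dict stands in for an HDF5 group
used = write_with_first_match(group, 'half', Fraction(1, 2))
restored = FractionStrategy.read(group['half'], strict=True)
```

The point of the pattern is that the stored form (here a pair of integers) stays readable without python, while read reconstructs the original object.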

Extendable Data and Objects

Certain data can meaningfully be extended. For example, you might do Monte Carlo generation, make measurements, and realize you don’t have enough for your desired precision. In that case you might want to extend the ensemble, adding new configurations and measurements to disk. There is a datatype for wrapping numpy arrays,

class supervillain.h5.extendable.array(input_array)

Bases: ndarray

which indicates that the array should be written with its zeroth (batch) dimension resizable. When an Observable is attached to an ensemble it is automatically an extendable array.
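
For readers unfamiliar with resizable HDF5 datasets, the sketch below shows the underlying h5py feature directly: a dataset created with maxshape=(None, ...) can later grow along its zeroth axis. This illustrates the mechanism only and makes no claim about how extendable.array is written by the library; the dataset name, the helper append_rows, and the in-memory file driver are assumptions chosen for the example.

```python
import h5py
import numpy as np


def append_rows(dataset, new_rows):
    """Grow dataset along axis 0 and copy new_rows into the new space."""
    old_length = dataset.shape[0]
    dataset.resize(old_length + new_rows.shape[0], axis=0)
    dataset[old_length:] = new_rows


first = np.arange(6.0).reshape(3, 2)  # three "configurations" of two numbers each
more = np.ones((2, 2))                # two more, generated later

# driver='core' with backing_store=False keeps the file in memory only.
with h5py.File('sketch.h5', 'w', driver='core', backing_store=False) as f:
    # maxshape=(None, 2) makes the zeroth (batch) dimension unlimited;
    # resizable datasets must be chunked, so let h5py choose a chunk shape.
    dset = f.create_dataset('measurements', data=first,
                            maxshape=(None, 2), chunks=True)
    append_rows(dset, more)
    result = dset[()]  # read everything back before the file closes
```

A dataset created without maxshape has a fixed shape, which is why data that may be extended later has to be marked as such when it is first written.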

An object which contains extendable data won’t know how to extend itself unless it inherits from

class supervillain.h5.Extendable

Bases: ReadWriteable

extend_h5(group, _top=True)

h5/extendable.py

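import h5py as h5  # assumed alias: the excerpt refers to h5py as h5

def _example_extend(cls, first, then, filename):
    with h5.File(filename, 'w') as f:
        # Write the first object into a fresh group ...
        first.to_h5(f.create_group('object'))
        # ... extend what is on disk with the second object ...
        then.extend_h5(f['object'])
        # ... and read back the combined result.
        result = cls.from_h5(f['object'])

    return result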

The extend_h5 method can be overridden if custom handling is needed; the object should still inherit from Extendable for typechecking purposes.