I/O with HDF5
Background
HDF5 is a format often used in computational physics and other data-science applications because of its ability to store huge amounts of structured numerical data. Many datasets can be stored in a single file, categorized, linked together, and so on. A variety of python modules leverage HDF5 for input and output; often they rely on h5py or PyTables, pythonic interfaces that interoperate with numpy but offer no native support for arbitrary python objects.
However, more complex data structures have no native support for H5ing; a variety of choices are possible. Any python object can be pickled and stored as a binary blob in HDF5, but the resulting blobs are not usable outside of python. The pandas data analysis module can read_hdf and export to_hdf, but even though the data is written in a usable way, the data layouts are nontrivial to read without pandas.
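To make the pickling option concrete, here is a minimal sketch of storing a pickled object as an opaque binary blob with h5py. It assumes h5py and numpy are installed; the helpers store_pickled and load_pickled are hypothetical names used only for illustration and are not part of supervillain.

```python
import pickle

import h5py
import numpy as np


def store_pickled(group, key, obj):
    """Pickle obj and store the bytes as an opaque HDF5 scalar dataset."""
    # Wrapping the bytes in np.void makes h5py store them as raw, opaque data.
    group.create_dataset(key, data=np.void(pickle.dumps(obj)))


def load_pickled(group, key):
    """Read the opaque scalar back and unpickle it."""
    return pickle.loads(group[key][()].tobytes())


# An in-memory file (core driver, no backing store) keeps the sketch off the disk.
with h5py.File('sketch.h5', 'w', driver='core', backing_store=False) as f:
    payload = {'name': 'example', 'values': {1, 2, 3}}
    store_pickled(f, 'blob', payload)
    restored = load_pickled(f, 'blob')
```

The blob round-trips inside python, but other HDF5 tools see only an opaque byte string, which is the portability cost mentioned above.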
We aim for a happy medium by providing a class, ReadWriteable, from which other python classes, which contain a variety of data fields, can inherit to allow them to be easily serialized to and from HDF5.
A ReadWriteable object will be saved as a group that contains its properties, written into groups and datasets with the same names as the properties themselves.
If a property is one of a slew of known types then it will be written natively as an H5 field; otherwise it will be pickled.
The data types that are not pickled are
- ReadWriteable or anything that inherits from ReadWriteable
- bool, int, float, and complex
- tuple, list, dict (with some limitations on valid keys)
- numpy.ndarray
Serializing Objects
ReadWriteable objects inherit methods
- class supervillain.h5.ReadWriteable
  Bases: object
  - to_h5(group, _top=True)
    Write the object as an HDF5 group. Member data will be stored as groups or datasets inside group, with the same name as the property itself.
    Note
    PEP8 considers _single_leading_underscores as weakly marked for internal use. All of these properties will be stored in a single group named _.
  - classmethod from_h5(group, strict=True, _top=True)
    Construct a fresh object from the HDF5 group.
    Warning
    If there is no known strategy for writing data to HDF5, objects will be pickled.
    Loading pickled data received from untrusted sources can be unsafe.
    See: https://docs.python.org/3/library/pickle.html for more.
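import h5py as h5  # assumed alias; the original snippet uses h5 without showing the import

def _example_readwrite(cls, filename, original):
    with h5.File(filename, 'w') as f:
        # Write the object into a fresh group, then rebuild it from that group.
        original.to_h5(f.create_group('object'))
        from_disk = cls.from_h5(f['object'])
    return from_disk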
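The layout described for to_h5 can be pictured with nested dictionaries standing in for HDF5 groups. The following is only a toy sketch: the function layout and the class Ensemble are hypothetical, and whether private members keep their leading underscore inside the _ group is an assumption, not something the documentation above states.

```python
def layout(obj):
    """Return a nested dict mimicking the group layout described for to_h5."""
    public, private = {}, {}
    for name, value in vars(obj).items():
        # Assumption: private members keep their full name inside the '_' group.
        target = private if name.startswith('_') else public
        target[name] = value
    if private:
        public['_'] = private
    return public


class Ensemble:
    """Hypothetical class, used only for illustration."""

    def __init__(self):
        self.beta = 0.5
        self.steps = 100
        self._cache = [1, 2, 3]


tree = layout(Ensemble())
```

Public members appear at the top level under their own names, while everything marked with a single leading underscore is collected under the one key '_'.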
One can also provide custom strategies and to_h5 and from_h5 methods. It is nevertheless advisable to inherit from ReadWriteable for typechecking purposes.
Serializing Raw Data
To provide custom methods for H5ing otherwise-unknown types that cannot be made ReadWriteable, a user can write a small strategy.
A strategy is an instance-free class with just static methods applies, write, and read.
For example, the strategy for writing a single integer is
from supervillain.h5 import Data

class Integer(Data, name='integer'):

    @staticmethod
    def applies(value):
        return isinstance(value, int)

    @staticmethod
    def read(group, strict):
        return int(group[()])

    @staticmethod
    def write(group, key, value):
        group[key] = value
        return group[key]
However, it is probably simplest in most circumstances to just inherit from ReadWriteable.
See supervillain/h5/strategy/ for the default strategies.
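As a rough picture of how strategies with applies, write, and read might be selected, here is a self-contained toy that uses plain dictionaries in place of HDF5 groups. Everything in it (the Strategy base class, its registry, and the write and read dispatchers) is hypothetical and is not supervillain's implementation; it only illustrates trying each registered strategy and falling back to pickling.

```python
import pickle


class Strategy:
    """Toy base class; subclasses register themselves under a name."""

    registry = {}

    def __init_subclass__(cls, name, **kwargs):
        # The name= keyword in the class statement arrives here.
        super().__init_subclass__(**kwargs)
        Strategy.registry[name] = cls


class Integer(Strategy, name='integer'):

    @staticmethod
    def applies(value):
        return isinstance(value, int)

    @staticmethod
    def write(group, key, value):
        group[key] = value

    @staticmethod
    def read(node, strict):
        return int(node)


def write(group, key, value):
    """Use the first strategy that applies; otherwise pickle the value."""
    for name, strategy in Strategy.registry.items():
        if strategy.applies(value):
            node = {'strategy': name}
            strategy.write(node, 'payload', value)
            group[key] = node
            return
    group[key] = {'strategy': 'pickled', 'payload': pickle.dumps(value)}


def read(group, key, strict=True):
    """Look up the recorded strategy name and let that strategy read the payload."""
    node = group[key]
    if node['strategy'] == 'pickled':
        return pickle.loads(node['payload'])
    return Strategy.registry[node['strategy']].read(node['payload'], strict)


store = {}
write(store, 'n', 7)         # handled by the Integer strategy
write(store, 'odd', {1, 2})  # no strategy applies, so it is pickled
```

Recording the strategy name next to the payload is one simple way for the reader to know which read method to call; the real library may record this differently.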
If the ReadWriteable strategy is desired but the class cannot be made to inherit from ReadWriteable, just create a new strategy that inherits from h5.strategy.ReadWriteable and overrides the applies method.
Extendable Data and Objects
Certain data can meaningfully be extended. For example, you might do Monte Carlo generation, make measurements, and realize you don’t have enough for your desired precision. In that case you might want to extend the ensemble, adding new configurations and measurements to disk. There is a datatype for wrapping numpy arrays, which indicates that the array should be written with its zeroth (batch) dimension resizable.
When an Observable is attached to an ensemble it is automatically an extendable array.
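At the HDF5 level, a resizable batch dimension corresponds to a chunked dataset with an unlimited first axis. The sketch below shows that mechanism directly in h5py, assuming h5py and numpy are installed; the helpers write_extendable and extend are hypothetical names, and this is not necessarily how supervillain implements its extendable arrays.

```python
import h5py
import numpy as np


def write_extendable(group, key, array):
    """Create a chunked dataset whose axis 0 may grow without bound."""
    return group.create_dataset(
        key,
        data=array,
        maxshape=(None,) + array.shape[1:],  # None marks axis 0 as unlimited
        chunks=True,                         # resizable datasets must be chunked
    )


def extend(dataset, more):
    """Append the rows of `more` along axis 0."""
    old = dataset.shape[0]
    dataset.resize(old + more.shape[0], axis=0)
    dataset[old:] = more


# An in-memory file (core driver, no backing store) keeps the sketch off the disk.
with h5py.File('sketch.h5', 'w', driver='core', backing_store=False) as f:
    first = np.arange(6.0).reshape(3, 2)       # three initial measurements
    then = np.arange(6.0, 10.0).reshape(2, 2)  # two more, generated later
    dset = write_extendable(f, 'measurements', first)
    extend(dset, then)
    combined = dset[()]
```

Only the batch axis grows; the trailing dimensions are fixed when the dataset is created.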
An object which contains extendable data won’t know how to extend itself unless it inherits from
- class supervillain.h5.Extendable
  Bases: ReadWriteable
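import h5py as h5  # assumed alias; the original snippet uses h5 without showing the import

def _example_extend(cls, first, then, filename):
    with h5.File(filename, 'w') as f:
        # Write the first object, append the second, then read back the combination.
        first.to_h5(f.create_group('object'))
        then.extend_h5(f['object'])
        result = cls.from_h5(f['object'])
    return result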
The extend_h5 method can be overridden if custom handling is needed; the object should still inherit from Extendable for typechecking purposes.