Working with HDF5 Files in Python

1. Overview

Test file: https://huggingface.co/datasets/yifengzhu-hf/LIBERO-datasets/blob/main/libero_10/STUDY_SCENE1_pick_up_the_book_and_place_it_in_the_back_compartment_of_the_caddy_demo.hdf5

2. Installing h5py

With Anaconda or Miniconda:

conda install h5py

If a pre-built wheel is available for your platform (macOS, Linux, or Windows on x86) and you do not need MPI, you can install h5py with pip:

pip install h5py

To install from source, see the Installation page of the h5py documentation.
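
Whichever route you take, a quick check can confirm that the package imports and show which HDF5 library it was built against. The snippet below is a minimal sketch that assumes h5py is already installed in the active environment; the variable names are illustrative only.

```python
import h5py

# Version of the h5py package itself.
h5py_version = h5py.version.version
# Version of the underlying HDF5 C library that h5py was built against.
hdf5_version = h5py.version.hdf5_version
# True only if h5py was built with MPI (parallel HDF5) support.
mpi_enabled = h5py.get_config().mpi

print(f"h5py {h5py_version}, HDF5 {hdf5_version}, MPI enabled: {mpi_enabled}")
```

For wheels installed with pip, MPI support is normally reported as disabled, which matches the note above about not needing MPI.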

3. Core Concepts

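An HDF5 file is a container for two kinds of objects: datasets, which are array-like collections of data, and groups, which are folder-like containers that hold datasets and other groups. The most important thing to remember when using h5py is: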

Groups work like dictionaries, and datasets work like NumPy arrays.
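
The two halves of that rule can be seen side by side in a small sketch. It writes to a temporary scratch file, and the group and dataset names ("sensors", "temperature") are made up for illustration.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical scratch file; any writable path would do.
path = os.path.join(tempfile.mkdtemp(), "concepts.hdf5")

with h5py.File(path, "w") as f:
    grp = f.create_group("sensors")  # a group: dictionary-like container
    grp.create_dataset("temperature", data=np.arange(5.0))

    # Dictionary-style access on groups.
    member_names = list(f["sensors"].keys())
    dset = f["sensors"]["temperature"]

    # NumPy-style access on datasets: slicing returns an ndarray.
    first_three = dset[:3]
    mean_value = float(np.mean(dset[...]))

print(member_names, first_three, mean_value)
```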

3.1. Reading a File

Open the file for reading:

import h5py
import numpy as np


file_path = "/Users/timchow/Desktop/STUDY_SCENE1_pick_up_the_book_and_place_it_in_the_back_compartment_of_the_caddy_demo.hdf5"
f = h5py.File(file_path, 'r')

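The returned File object is the starting point for all read operations. h5py.File behaves like a Python dictionary, so its members can be walked recursively with items():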

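def get_all_datasets(group):
    """
    Recursively collect every dataset under an HDF5 group (or file).
    """
    datasets = []
    for key, val in group.items():
        if isinstance(val, h5py.Group):
            # Descend into subgroups.
            datasets.extend(get_all_datasets(val))
        elif isinstance(val, h5py.Dataset):
            # Skip anything that is neither a group nor a dataset.
            datasets.append(val)
            print(f"Found dataset: {val.name}")
    return datasets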


all_datasets = get_all_datasets(f)
print(f"Found {len(all_datasets)} datasets")

Like NumPy arrays, Dataset objects have a shape and a data type:

print(all_datasets[0].name)
print(all_datasets[0].shape)
print(all_datasets[0].dtype)

Datasets also support array-style slicing. This is how data is read from and written to a dataset in the file:

print(all_datasets[0][0:100:10])
print(all_datasets[0][0])
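
The test file above is opened read-only, so it only demonstrates reading. A minimal sketch of both directions is shown here on a temporary scratch file; the dataset name "values" is an assumption made for the example.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical scratch file used only for this sketch.
path = os.path.join(tempfile.mkdtemp(), "slicing.hdf5")

with h5py.File(path, "w") as f:
    dset = f.create_dataset("values", (100,), dtype="i")

    dset[...] = np.arange(100)  # write the whole dataset
    dset[0:10] = -1             # overwrite only the first ten elements

    every_tenth = dset[0:100:10]  # strided read, returns a NumPy array
    first = dset[0]               # single element

print(every_tenth)
print(first)
```

Only the selected region is transferred between the file and memory, which is why slicing is the usual way to work with datasets that are too large to load at once.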

3.2. Creating a File

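Create a file by setting mode to w when initializing the File object. Note that w overwrites an existing file of the same name. Other modes include a (read/write/create) and r+ (read/write). The full list of access modes is given in the File Objects page of the h5py documentation.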

create_dataset creates a dataset with the given shape and data type:

import h5py
import numpy as np

f = h5py.File("mytestfile.hdf5", "w")
dset = f.create_dataset("mydataset", (100,), dtype='i')

File objects are context managers, so the following code also works:

import h5py
import numpy as np

with h5py.File("mytestfile.hdf5", "w") as f:
    f.create_dataset("mydataset", (100,), dtype='i')
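
To make the difference between the access modes concrete, the sketch below opens one temporary file in w, a, and r mode in turn. The dataset names are hypothetical, and the exact exception type raised on a read-only write attempt may differ between h5py versions, so the example catches Exception broadly.

```python
import os
import tempfile

import h5py

# Hypothetical scratch file used only for this sketch.
path = os.path.join(tempfile.mkdtemp(), "modes.hdf5")

# "w": create the file, truncating it if it already exists.
with h5py.File(path, "w") as f:
    f.create_dataset("first", (10,), dtype="i")

# "a": read/write if the file exists, create it otherwise; contents are kept.
with h5py.File(path, "a") as f:
    f.create_dataset("second", (10,), dtype="f")

# "r": read-only; attempts to modify the file fail.
with h5py.File(path, "r") as f:
    names = sorted(f.keys())
    try:
        f.create_dataset("third", (1,), dtype="i")
        write_failed = False
    except Exception:  # exact exception type is version-dependent
        write_failed = True

print(names, write_failed)
```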

4. Groups and Hierarchical Organization

HDF stands for Hierarchical Data Format. Every object in an HDF5 file has a name, and the objects are arranged in a POSIX-style hierarchy with / as the separator:

    print(dset.name)
    # Output: /mydataset

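The "directories" in this system are called groups. Subgroups are created with create_group. Because the file was created earlier, reopen it in "append" mode a (read/write if the file exists, create it otherwise) so that its existing contents are kept: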

import h5py
import numpy as np

with h5py.File("mytestfile.hdf5", "a") as f:
    grp = f.create_group("subgroup")

Like File objects, Group objects also have create_* methods:

import h5py
import numpy as np

with h5py.File("mytestfile.hdf5", "a") as f:
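    # require_group returns the existing group or creates it if missing;
    # calling create_group again would raise because "subgroup" already exists.
    grp = f.require_group("subgroup")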
    dset2 = grp.create_dataset("another_dataset", (50,), dtype='f')
    print(dset2.name)
    # Output: /subgroup/another_dataset

You do not need to create all the intermediate groups by hand; specifying the full path is enough:

import h5py
import numpy as np

with h5py.File("mytestfile.hdf5", "a") as f:
    dset3 = f.create_dataset("subgroup2/dataset_three", (10,), dtype='i')
    print(dset3.name)
    # Output: /subgroup2/dataset_three
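
As a quick check that the intermediate groups really are created, the following sketch builds a deeper path in a temporary scratch file and inspects the result. The path "a/b/c/data" is invented for illustration.

```python
import os
import tempfile

import h5py

# Hypothetical scratch file used only for this sketch.
path = os.path.join(tempfile.mkdtemp(), "nested.hdf5")

with h5py.File(path, "w") as f:
    # Groups "a", "a/b" and "a/b/c" are created implicitly.
    dset = f.create_dataset("a/b/c/data", (10,), dtype="i")

    intermediate_is_group = isinstance(f["a/b"], h5py.Group)
    parent_name = dset.parent.name  # the group that directly holds the dataset

    # require_group returns the existing group instead of raising.
    existing_name = f.require_group("a/b").name

print(intermediate_is_group, parent_name, existing_name)
```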

Groups support most of the Python dictionary-style interface. Objects are retrieved with item-access syntax:

import h5py
import numpy as np

with h5py.File("mytestfile.hdf5", "a") as f:
    dset3 = f["subgroup2/dataset_three"]
    print(dset3.name)
    # Output: /subgroup2/dataset_three

Iterating over a group yields the names of its members:

    for name in f:
        print(name)

Membership tests also use names:

    print("mydataset" in f)

Full path names work as well:

    print("subgroup2/dataset_three" in f)

The familiar keys(), values(), items() and iter() methods are supported too, as is get().
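
These dictionary-style methods can be tried out together in a short sketch. It rebuilds part of the layout used above in a temporary scratch file so that it runs on its own; the missing name "no_such_object" is deliberately made up.

```python
import os
import tempfile

import h5py

# Hypothetical scratch file used only for this sketch.
path = os.path.join(tempfile.mkdtemp(), "dict_api.hdf5")

with h5py.File(path, "w") as f:
    f.create_dataset("mydataset", (100,), dtype="i")
    f.create_dataset("subgroup2/dataset_three", (10,), dtype="i")

    # keys() lists only the direct members of the root group.
    names = sorted(f.keys())

    # items() yields (name, object) pairs.
    kinds = {name: type(obj).__name__ for name, obj in f.items()}

    # get() returns None (or a supplied default) instead of raising KeyError.
    missing = f.get("no_such_object")
    found_shape = f.get("subgroup2/dataset_three").shape

print(names, kinds, missing, found_shape)
```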

Because iterating over a group yields only its direct members, walking an entire file is done with the Group methods visit() and visititems(), both of which take a callable:

import h5py
import numpy as np

with h5py.File("mytestfile.hdf5", "a") as f:
    def print_name(name):
        print(name)
    f.visit(print_name)
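
visit() passes only names to the callable, while visititems() passes both the name and the object. A minimal sketch of the latter, collecting the shape of every dataset in a temporary scratch file, might look like this; the helper name record_dataset is hypothetical.

```python
import os
import tempfile

import h5py

# Hypothetical scratch file used only for this sketch.
path = os.path.join(tempfile.mkdtemp(), "visit_demo.hdf5")

dataset_shapes = {}


def record_dataset(name, obj):
    """Store the shape of each dataset; names are relative to the visited group."""
    if isinstance(obj, h5py.Dataset):
        dataset_shapes[name] = obj.shape
    # Returning None tells h5py to keep walking.
    return None


with h5py.File(path, "w") as f:
    f.create_dataset("mydataset", (100,), dtype="i")
    f.create_dataset("subgroup/another_dataset", (50,), dtype="f")
    f.create_dataset("subgroup2/dataset_three", (10,), dtype="i")
    f.visititems(record_dataset)

print(dataset_shapes)
```

Returning any value other than None from the callable stops the traversal early and makes visititems() return that value, which is handy for searches.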

5. Attributes

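One of the features of HDF5 is that metadata can be stored right next to the data it describes. All groups and datasets support attached, named attributes.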

Attributes are accessed through the attrs proxy object, which also implements the dictionary interface:

    f.attrs["test_attr"] = "test_value"
    print("test_attr" in f.attrs)
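
Attributes work the same way on datasets and groups as they do on the file. The sketch below attaches two attributes to a dataset in a temporary scratch file and reads them back; the attribute names "temperature" and "units" are invented for the example.

```python
import os
import tempfile

import h5py

# Hypothetical scratch file used only for this sketch.
path = os.path.join(tempfile.mkdtemp(), "attrs_demo.hdf5")

with h5py.File(path, "w") as f:
    dset = f.create_dataset("mydataset", (100,), dtype="i")

    # Attach metadata directly to the dataset it describes.
    dset.attrs["temperature"] = 99.5
    dset.attrs["units"] = "celsius"
    # The file (root group) carries attributes too.
    f.attrs["test_attr"] = "test_value"

    temperature = float(dset.attrs["temperature"])
    units = dset.attrs["units"]
    attr_names = sorted(dset.attrs.keys())
    has_file_attr = "test_attr" in f.attrs

print(temperature, units, attr_names, has_file_attr)
```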