GR00T Series 2 - How to Prepare a Modality Configuration

How to Prepare a Modality Configuration

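Overview

The modality configuration defines how the model loads, processes, and interprets robot data. It connects the dataset's physical structure (defined in meta/modality.json) to the model's data-processing pipeline.

Each embodiment needs a Python configuration file that specifies:

Which observations to use (video cameras, proprioceptive state)

How to sample data over time (current frame, history frames, future action horizon)

How to interpret and transform actions

Which language annotations to use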

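Configuration structure

A modality configuration is a Python dictionary with four top-level keys: "video", "state", "action", and "language". Each key maps to a ModalityConfig object.

Here is an SO-100 example: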

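from gr00t.configs.data.embodiment_configs import register_modality_config
# EmbodimentTag is used by the registration call; module path assumed, verify against your GR00T checkout
from gr00t.data.embodiment_tags import EmbodimentTag
from gr00t.data.types import ModalityConfig, ActionConfig, ActionRepresentation, ActionType, ActionFormat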

so100_config = {
    "video": ModalityConfig(...),
    "state": ModalityConfig(...),
    "action": ModalityConfig(...),
    "language": ModalityConfig(...),
}

register_modality_config(so100_config, embodiment_tag=EmbodimentTag.NEW_EMBODIMENT)

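Understanding `ModalityConfig`

Each ModalityConfig specifies two required fields and several optional fields:

Required fields

1. delta_indices (list[int])

Defines which time offsets to sample relative to the current timestep:

Current observation: use [0] for the current timestep (recommended for video and state)

Future actions: use non-negative indices for the action prediction horizon (e.g., list(range(0, 16)))

Note: The data loader supports negative indices (e.g., [-2, -1, 0]) for historical context, but none of the current N1.7 embodiment configurations use them. Unless you genuinely need frame stacking, use [0] for video and state.

Examples: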

# Single current frame for video
delta_indices=[0]

# 16-step action prediction horizon
delta_indices=list(range(0, 16))
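
To make the offset semantics concrete, here is a minimal sketch of how delta_indices could map to absolute frame indices. The helper name `sample_indices` is hypothetical, and clamping at the episode boundaries is an assumption; the actual GR00T data loader may pad or truncate differently.

```python
def sample_indices(current_step: int, delta_indices: list[int], episode_length: int) -> list[int]:
    """Map relative offsets to absolute frame indices.

    Assumption: indices that fall outside the episode are clamped to the
    first or last frame. The real loader may handle boundaries differently.
    """
    last = episode_length - 1
    return [min(max(current_step + d, 0), last) for d in delta_indices]


# Single current frame (video, state)
video_frames = sample_indices(10, [0], episode_length=100)

# Two history frames plus the current frame, near the start of an episode
history_frames = sample_indices(1, [-2, -1, 0], episode_length=100)

# 16-step action horizon, near the end of an episode
action_steps = sample_indices(95, list(range(0, 16)), episode_length=100)

print(video_frames)
print(history_frames)
print(action_steps)
```

Whatever the boundary policy, the number of sampled steps always equals len(delta_indices), which is why the action horizon length shows up in the shape of the training data.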

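Note: If you change delta_indices for the action modality (e.g., shortening the action horizon from 16 to 8), you must regenerate the dataset statistics by rerunning python gr00t/data/stats.py --dataset-path <dataset_path> --embodiment-tag <embodiment_tag>. The normalization statistics (meta/relative_stats.json in particular) are computed for the original delta_indices length, and a mismatch causes errors during training.

Example: what happens if you change delta_indices without regenerating the statistics?

Suppose the action configuration initially uses a 16-step horizon: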

"action": ModalityConfig(
    delta_indices=list(range(0, 16)),  # 16 steps
    ...
)

Running python gr00t/data/stats.py generates meta/relative_stats.json, which contains per-step statistics of shape (16, D), where D is the action dimension.

Suppose you later change the horizon to 8 steps:

"action": ModalityConfig(
    delta_indices=list(range(0, 8)),  # 8 steps
    ...
)

The training data now has shape (8, D), but the normalization parameters from relative_stats.json still have shape (16, D). During normalization, the dimension mismatch raises an IndexError:

IndexError: boolean index did not match indexed array along dimension 0;
dimension is 8 but corresponding boolean dimension is 16

Fix: After changing delta_indices, rerun python gr00t/data/stats.py --dataset-path <dataset_path> --embodiment-tag <embodiment_tag> to regenerate matching statistics.
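
The mismatch can be reproduced in isolation with NumPy. The sketch below does not read a real relative_stats.json; the arrays are stand-ins with the shapes described above, and `horizon_matches` is a hypothetical pre-flight check you could run before training. The exact wording of the error message varies between NumPy versions.

```python
import numpy as np

D = 6  # action dimension (illustrative)

# Stand-in for per-step statistics computed for a 16-step horizon
stats_mask = np.ones((16, D), dtype=bool)

# Stand-in for an action chunk produced after switching to an 8-step horizon
actions = np.zeros((8, D))


def horizon_matches(stats_shape: tuple, delta_indices: list) -> bool:
    """Hypothetical check: the statistics must have one row per action step."""
    return stats_shape[0] == len(delta_indices)


try:
    actions[stats_mask]  # boolean index with the wrong leading dimension
    mismatch_error = None
except IndexError as exc:
    mismatch_error = exc

print(type(mismatch_error).__name__)
print(horizon_matches(stats_mask.shape, list(range(0, 8))))
print(horizon_matches(stats_mask.shape, list(range(0, 16))))
```

A check of this kind fails fast with a clear message, rather than surfacing as an IndexError deep inside the normalization step.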

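2. modality_keys (list[str])

Specifies which keys to load from the dataset. These keys must match the keys defined in the meta/modality.json file.

For the SO-100 example:

Video keys: must match the keys under "video" in meta/modality.json (e.g., "front", "wrist")

State keys: must match the keys under "state" in meta/modality.json (e.g., "single_arm", "gripper")

Action keys: must match the keys under "action" in meta/modality.json (e.g., "single_arm", "gripper")

Language keys: must match the keys under "annotation" in meta/modality.json (e.g., "annotation.human.task_description" for SO-100; in the examples here the config key carries an annotation. prefix, while meta/modality.json lists the entry as "human.task_description" under "annotation")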
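
A quick way to catch key typos before training is to compare the configured keys against the contents of meta/modality.json. The helper `missing_keys` below is hypothetical (it is not part of GR00T), and the prefix handling for language keys is an assumption based on the SO-100 example.

```python
def missing_keys(modality_json: dict, section: str, modality_keys: list[str]) -> list[str]:
    """Return the configured keys that have no entry under `section`."""
    available = modality_json.get(section, {})
    prefix = f"{section}."
    missing = []
    for key in modality_keys:
        # Assumption: language keys are written as "annotation.<name>" in the
        # config, so the section prefix is stripped before the lookup.
        name = key[len(prefix):] if key.startswith(prefix) else key
        if name not in available:
            missing.append(key)
    return missing


# In practice this dict would come from json.load() on meta/modality.json.
modality_json = {
    "state": {"single_arm": {"start": 0, "end": 5}, "gripper": {"start": 5, "end": 6}},
    "action": {"single_arm": {"start": 0, "end": 5}, "gripper": {"start": 5, "end": 6}},
    "video": {
        "front": {"original_key": "observation.images.front"},
        "wrist": {"original_key": "observation.images.wrist"},
    },
    "annotation": {"human.task_description": {"original_key": "task_index"}},
}

video_missing = missing_keys(modality_json, "video", ["front", "wrist"])
state_missing = missing_keys(modality_json, "state", ["single_arm", "griper"])  # typo on purpose
language_missing = missing_keys(modality_json, "annotation", ["annotation.human.task_description"])

print(video_missing, state_missing, language_missing)
```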

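Optional fields

3. sin_cos_embedding_keys (list[str] | None)

Specifies which state keys use sine/cosine encoding. This works best for dimensions expressed in radians (e.g., joint angles). Keys not listed here use min-max normalization. Note that the encoding doubles the number of dimensions and is recommended only for proprioceptive state.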

"state": ModalityConfig(
    delta_indices=[0],
    modality_keys=["single_arm", "gripper"],
    sin_cos_embedding_keys=["single_arm"],  # Apply sin/cos to joint angles
)
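
The dimension doubling is easy to see in a small NumPy sketch. This is an illustration of sine/cosine encoding in general, not GR00T's implementation; in particular, the order in which the sine and cosine parts are concatenated is an assumption.

```python
import numpy as np


def sin_cos_encode(angles) -> np.ndarray:
    """Encode angles (radians) as [sin(a_1..a_n), cos(a_1..a_n)].

    Assumption: sines first, then cosines. The real layout may differ.
    """
    angles = np.asarray(angles, dtype=np.float64)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)


joint_angles = np.array([0.0, np.pi / 2, np.pi, -np.pi / 2, 0.5])  # 5 joints
encoded = sin_cos_encode(joint_angles)

print(joint_angles.shape, "->", encoded.shape)

# +pi and -pi describe the same joint pose and receive the same encoding,
# which a min-max scaling of the raw angle would not provide.
wrap_a = sin_cos_encode([np.pi])
wrap_b = sin_cos_encode([-np.pi])
print(np.allclose(wrap_a, wrap_b))
```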

4. mean_std_embedding_keys (list[str] | None)

Specifies which keys use mean/standard-deviation normalization instead of min-max normalization.
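
The difference between the two schemes can be sketched as follows. Both functions are illustrative; the [-1, 1] target range for min-max normalization and the epsilon guard are assumptions, not details confirmed by the GR00T documentation.

```python
import numpy as np


def min_max_normalize(x, lo, hi, eps=1e-8):
    """Scale each dimension to [-1, 1] (assumed target range)."""
    return 2.0 * (x - lo) / np.maximum(hi - lo, eps) - 1.0


def mean_std_normalize(x, mean, std, eps=1e-8):
    """Shift each dimension to zero mean and scale to unit variance."""
    return (x - mean) / np.maximum(std, eps)


# Column 0 contains an outlier (100.0); column 1 is evenly spread.
data = np.array([[0.0, 10.0], [1.0, 20.0], [2.0, 30.0], [100.0, 40.0]])

mm = min_max_normalize(data, data.min(axis=0), data.max(axis=0))
ms = mean_std_normalize(data, data.mean(axis=0), data.std(axis=0))

print(mm[:, 0])  # the outlier squeezes the other values toward -1
print(ms[:, 0])
```

With min-max scaling, a single outlier compresses the remaining values into a narrow band, which is one reason to prefer mean/standard-deviation normalization for keys with heavy-tailed values.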

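5. action_configs (list[ActionConfig] | None)

Required for the "action" modality. Defines how each action modality is interpreted and transformed. The list must have the same length and order as modality_keys: action_configs[0] applies to modality_keys[0], action_configs[1] applies to modality_keys[1], and so on. An ordering mismatch silently applies the wrong representation (e.g., RELATIVE to a gripper that should be ABSOLUTE). See the Action modality section for details.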
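
Because the pairing is purely positional, it can help to print it explicitly before training. The sketch below uses stand-in classes (`Rep`, `StubActionConfig`) so it runs without GR00T installed; with the real library you would pass the actual ActionConfig objects. The helper `pair_keys_with_configs` is hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Rep(Enum):  # stand-in for ActionRepresentation
    RELATIVE = "relative"
    ABSOLUTE = "absolute"


@dataclass
class StubActionConfig:  # stand-in for ActionConfig
    rep: Rep


def pair_keys_with_configs(modality_keys: list, action_configs: list) -> dict:
    """Pair keys and configs by position, rejecting a length mismatch."""
    if len(modality_keys) != len(action_configs):
        raise ValueError(
            f"{len(modality_keys)} modality_keys but {len(action_configs)} action_configs"
        )
    return dict(zip(modality_keys, action_configs))


pairing = pair_keys_with_configs(
    ["single_arm", "gripper"],
    [StubActionConfig(Rep.RELATIVE), StubActionConfig(Rep.ABSOLUTE)],
)

for key, cfg in pairing.items():
    print(f"{key}: {cfg.rep.name}")
```

A length check catches a missing entry, but a swapped order still passes silently, so reading the printed pairing is the actual safeguard.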

Configuring each modality

Video modality

Defines which camera views to use:

"video": ModalityConfig(
    delta_indices=[0],  # Current frame only
    modality_keys=[
        "front",  # Must match a key in meta/modality.json under "video"
    ],
)

For multiple cameras:

"video": ModalityConfig(
    delta_indices=[0],
    modality_keys=["front", "wrist"],
)

State modality

Defines the proprioceptive observations (joint positions, gripper state, and so on):

"state": ModalityConfig(
    delta_indices=[0],  # Current state
    modality_keys=[
        "single_arm",      # Must match keys in meta/modality.json under "state"
        "gripper",
    ],
)

Action modality

Defines the action space and prediction horizon:

"action": ModalityConfig(
    delta_indices=list(range(0, 16)),  # Predict 16 steps into the future
    modality_keys=[
        "single_arm",      # Must match keys in meta/modality.json under "action"
        "gripper",
    ],
    action_configs=[
        # One ActionConfig per modality_key
        # single_arm
        ActionConfig(
            rep=ActionRepresentation.RELATIVE,  # relative control of the single arm
            type=ActionType.NON_EEF,
            format=ActionFormat.DEFAULT,
        ),
        # gripper
        ActionConfig(
            rep=ActionRepresentation.ABSOLUTE,  # absolute control of the gripper
            type=ActionType.NON_EEF,
            format=ActionFormat.DEFAULT,
        ),
    ],
)

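Understanding `ActionConfig`

Each ActionConfig has three required fields and one optional field:

1. rep (ActionRepresentation)

Defines how actions are interpreted:

RELATIVE: actions are deltas relative to the current state (introduced in the UMI paper)

ABSOLUTE: actions are target positions

Relative actions produce smoother motion but can be affected by drift. If you use relative actions, make sure the state and action stored in the dataset are absolute values; the absolute-to-relative conversion is handled in the processor.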
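
The conversion between the two representations can be sketched for joint-space actions as a simple subtraction. This is a minimal illustration that assumes every step in the action chunk is measured against the same reference, namely the state at the current timestep; the actual processor may differ in detail, and the function names are hypothetical.

```python
import numpy as np


def to_relative(abs_actions: np.ndarray, current_state: np.ndarray) -> np.ndarray:
    """(H, D) absolute targets minus the (D,) current state, via broadcasting."""
    return abs_actions - current_state


def to_absolute(rel_actions: np.ndarray, current_state: np.ndarray) -> np.ndarray:
    """Inverse of to_relative: recover the absolute targets for execution."""
    return rel_actions + current_state


state = np.array([0.10, -0.20, 0.30])          # absolute joint positions now
abs_chunk = np.array([[0.12, -0.20, 0.33],     # absolute targets, step 0
                      [0.15, -0.18, 0.36]])    # absolute targets, step 1

rel_chunk = to_relative(abs_chunk, state)
recovered = to_absolute(rel_chunk, state)

print(rel_chunk)
print(np.allclose(recovered, abs_chunk))
```

This is why the dataset must store absolute values: if the stored actions were already deltas, subtracting the state a second time would produce meaningless targets.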

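2. type (ActionType)

Specifies the control space:

EEF: end-effector/Cartesian-space control (expects a 9-dimensional vector: X, Y, Z position + 6D rotation)

NON_EEF: joint-space control and other non-EEF control spaces (joint angles, positions, gripper position, and so on)

3. format (ActionFormat)

Defines the action representation format:

DEFAULT: the standard format (e.g., joint angles, gripper position)

XYZ_ROT6D: 3D position + 6D rotation representation for end-effector control

XYZ_ROTVEC: 3D position + rotation vector for end-effector control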
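
For readers unfamiliar with the 6D rotation representation, the sketch below builds a 9-dimensional end-effector vector from a position and a rotation matrix, then recovers the matrix with Gram-Schmidt orthogonalization. It assumes the common convention of keeping the first two columns of the rotation matrix; GR00T's element ordering may differ, so treat the layout as illustrative.

```python
import numpy as np


def matrix_to_rot6d(R: np.ndarray) -> np.ndarray:
    """Keep the first two columns of a 3x3 rotation matrix -> shape (6,)."""
    return np.concatenate([R[:, 0], R[:, 1]])


def rot6d_to_matrix(r6: np.ndarray) -> np.ndarray:
    """Rebuild an orthonormal rotation matrix with Gram-Schmidt."""
    a1, a2 = r6[:3], r6[3:]
    b1 = a1 / np.linalg.norm(a1)
    a2 = a2 - np.dot(b1, a2) * b1       # remove the component along b1
    b2 = a2 / np.linalg.norm(a2)
    b3 = np.cross(b1, b2)               # third axis completes the frame
    return np.stack([b1, b2, b3], axis=1)


theta = np.pi / 4  # 45-degree rotation about the Z axis
R = np.array([
    [np.cos(theta), -np.sin(theta), 0.0],
    [np.sin(theta),  np.cos(theta), 0.0],
    [0.0,            0.0,           1.0],
])
position = np.array([0.3, 0.0, 0.2])

eef_action = np.concatenate([position, matrix_to_rot6d(R)])  # 3 + 6 = 9 values
R_back = rot6d_to_matrix(eef_action[3:])

print(eef_action.shape)
print(np.allclose(R_back, R))
```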

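4. state_key (str | None)

Optional. Specifies the reference state key used to compute relative actions when rep=RELATIVE. If it is not provided, the system uses the action key as the reference state key.

Example with state_key: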

# ActionConfig entry for the "joint_pos_action_left" action key
ActionConfig(
    rep=ActionRepresentation.RELATIVE,
    type=ActionType.NON_EEF,
    format=ActionFormat.DEFAULT,
    state_key="joint_pos_obs_left",  # Use this state to compute relative action
)
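
The fallback rule can be written in a few lines. The helper `reference_state_key` is hypothetical and only mirrors the behavior described above; the state values are made up for illustration.

```python
from typing import Optional

import numpy as np


def reference_state_key(action_key: str, state_key: Optional[str] = None) -> str:
    """Use state_key when given, otherwise fall back to the action key."""
    return state_key if state_key is not None else action_key


# Illustrative absolute states, keyed the way the state modality would be
states = {
    "joint_pos_obs_left": np.array([0.10, 0.20]),
    "single_arm": np.array([0.00, 0.50]),
}

# Action and state keys differ, so state_key must be set explicitly
explicit_key = reference_state_key("joint_pos_action_left", "joint_pos_obs_left")

# Action and state share the key "single_arm", so the fallback is enough
fallback_key = reference_state_key("single_arm")

abs_action = np.array([0.15, 0.25])
rel_action = abs_action - states[explicit_key]

print(explicit_key, fallback_key, rel_action)
```

In short, state_key is only needed when the dataset names the action and its matching state differently.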

Language modality

Defines which language annotations to use:

"language": ModalityConfig(
    delta_indices=[0],
    modality_keys=["annotation.human.task_description"],  # Must match annotation keys in meta/modality.json
)

Complete example: SO-100

Here is the complete SO-100 configuration:

so100_config = {
    "video": ModalityConfig(
        delta_indices=[0],
        modality_keys=["front", "wrist"],
    ),
    "state": ModalityConfig(
        delta_indices=[0],
        modality_keys=[
            "single_arm",
            "gripper",
        ],
    ),
    "action": ModalityConfig(
        delta_indices=list(range(0, 16)),
        modality_keys=[
            "single_arm",
            "gripper",
        ],
        action_configs=[
            ActionConfig(
                rep=ActionRepresentation.RELATIVE,
                type=ActionType.NON_EEF,
                format=ActionFormat.DEFAULT,
            ),
            ActionConfig(
                rep=ActionRepresentation.ABSOLUTE,
                type=ActionType.NON_EEF,
                format=ActionFormat.DEFAULT,
            ),
        ],
    ),
    "language": ModalityConfig(
        delta_indices=[0],
        modality_keys=["annotation.human.task_description"],
    ),
}

Key relationship with `meta/modality.json`

The modality_keys in the modality configuration must reference keys that exist in the dataset's meta/modality.json:

Example meta/modality.json:

{
    "state": {
        "single_arm": {"start": 0, "end": 5},
        "gripper": {"start": 5, "end": 6}
    },
    "action": {
        "single_arm": {"start": 0, "end": 5},
        "gripper": {"start": 5, "end": 6}
    },
    "video": {
        "front": {"original_key": "observation.images.front"},
        "wrist": {"original_key": "observation.images.wrist"}
    },
    "annotation": {
        "human.task_description": {
            "original_key": "task_index"
        }
    }
}

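The system will:

Use modality_keys to look up the corresponding entries in meta/modality.json

Extract the slices from the concatenated state/action arrays

Apply the specified transforms (normalization, action representation conversion)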
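
The lookup and slicing steps can be sketched with NumPy. The helper `split_by_modality` is hypothetical; it assumes "end" is exclusive, which matches the SO-100 example where a 6-dimensional vector splits into 5 arm joints and 1 gripper value.

```python
import numpy as np

# The "state" section of the example meta/modality.json
state_meta = {
    "single_arm": {"start": 0, "end": 5},
    "gripper": {"start": 5, "end": 6},
}


def split_by_modality(array: np.ndarray, meta: dict, modality_keys: list) -> dict:
    """Slice the last axis of a concatenated array into per-key chunks."""
    return {
        key: array[..., meta[key]["start"]:meta[key]["end"]]
        for key in modality_keys
    }


# One concatenated 6-D state vector
state_vector = np.arange(6, dtype=np.float32)
parts = split_by_modality(state_vector, state_meta, ["single_arm", "gripper"])

# The same helper works on a (T, 6) trajectory
trajectory = np.zeros((16, 6), dtype=np.float32)
traj_parts = split_by_modality(trajectory, state_meta, ["single_arm", "gripper"])

print(parts["single_arm"], parts["gripper"])
print(traj_parts["single_arm"].shape, traj_parts["gripper"].shape)
```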

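Registering the configuration

After defining the configuration, register it so that the training and inference pipelines can use it: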

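from gr00t.configs.data.embodiment_configs import register_modality_config
# EmbodimentTag is used by the registration call; module path assumed, verify against your GR00T checkout
from gr00t.data.embodiment_tags import EmbodimentTag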

your_modality_config = {
    ...
}

register_modality_config(your_modality_config, embodiment_tag=EmbodimentTag.NEW_EMBODIMENT)

Save the configuration to a Python file and pass its path to the modality_config_path argument when running the fine-tuning script.