Hand–object interaction (HOI) reconstruction and synthesis are becoming central to embodied AI and AR/VR. Yet, despite rapid progress, existing HOI generation research remains fragmented across three disjoint tracks: (1) pose-only synthesis, which predicts MANO trajectories without producing pixels; (2) single-image HOI generation, which hallucinates appearance from masks or 2D cues but lacks dynamics; and (3) video generation methods that require both the entire pose sequence and the ground-truth first frame as inputs, precluding true sim-to-real deployment. We argue that HOI generation instead requires a unified engine that brings pose, appearance, and motion together within one coherent framework. To this end, we introduce PAM: a Pose–Appearance–Motion Engine for controllable HOI video generation. We validate the engine on four fronts: (1) on DexYCB, PAM achieves an FVD of 29.13 (vs. 38.83 for InterDyn) and an MPJPE of 19.37 mm (vs. 30.05 mm for CosHand), while generating higher-resolution 480×720 videos compared to the 256×256/256×384 baselines; (2) on OAKINK2, our full multi-condition model improves FVD from 68.76 to 46.31; (3) an ablation over input conditions on DexYCB shows that combining depth, segmentation, and keypoints consistently yields the best results; and (4) for a downstream hand pose estimation task using SimpleHand, augmenting training with 3,400 synthetic videos (207k frames) lets a model trained on only 50% of the real data match the 100%-real baseline.
Overview of our three-stage generation pipeline. (1) Pose Generation: a pretrained pose model predicts the intermediate hand-object interaction (HOI) poses from the initial and target poses and the object mesh. (2) Appearance Generation: a controllable image diffusion model synthesizes the first frame of the video, conditioned on multi-modal inputs (depth maps, semantic masks, and keypoint annotations). (3) Motion Generation: a video diffusion model renders the generated HOI sequence and the first frame into a full video, conditioned on the same multi-modal inputs used in the appearance stage.
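The three stages above compose as follows; this is a minimal structural sketch in which all callables (`pose_model`, `image_model`, `video_model`, `render_conditions`) are hypothetical placeholders injected by the caller, not the released API:

```python
def generate_hoi_video(init_pose, target_pose, object_mesh,
                       pose_model, image_model, video_model,
                       render_conditions):
    """Structural sketch of the three-stage pipeline.

    All four callables are placeholders standing in for the actual
    models; only the data flow between stages is illustrated.
    """
    # Stage 1: pose generation - intermediate HOI poses from the
    # initial/target poses and the object mesh.
    poses = pose_model(init_pose, target_pose, object_mesh)

    # Render per-frame control signals (depth, segmentation,
    # keypoints) from each generated pose.
    conds = [render_conditions(p, object_mesh) for p in poses]

    # Stage 2: appearance generation - synthesize the first frame
    # from the first frame's multi-modal conditions.
    first_frame = image_model(conds[0])

    # Stage 3: motion generation - animate the first frame under the
    # same per-frame control signals.
    return video_model(first_frame, conds)
```

The key design point this sketch captures is that stages 2 and 3 consume the *same* multi-modal control signals, so appearance and motion stay consistent with the generated pose sequence.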
| Method | FVD (↓) | MF (↑) | LPIPS (↓) | SSIM (↑) | PSNR (↑) | MPJPE (mm, ↓) | Resolution |
|---|---|---|---|---|---|---|---|
| CosHand | 58.51 | 0.591 | 0.139 | 0.767 | 23.20 | 30.05 | 256×256 |
| InterDyn | 38.83 | 0.680 | 0.119 | 0.848 | 24.86 | - | 256×384 |
| ManiVideo | - | - | 0.079 | 0.913 | 30.10 | 57.30 | - |
| Ours w/ all | 29.13 | 0.712 | 0.069 | 0.914 | 30.17 | 19.37 | 480×720 |
Quantitative comparison on the DexYCB dataset. Our method is evaluated against CosHand, InterDyn, and ManiVideo. Results for InterDyn and ManiVideo are taken from their original papers. For a fair comparison, CosHand was fine-tuned on the same s0-split training set as ours. Our approach achieves state-of-the-art performance across all evaluated metrics while generating high-resolution 480×720 videos.
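MPJPE, reported above in millimetres, is the mean Euclidean distance between predicted and ground-truth 3D hand joints. A minimal NumPy sketch follows; the per-frame root-relative alignment is a common convention and an assumption here, since this excerpt does not spell out the exact evaluation protocol:

```python
import numpy as np

def mpjpe(pred, gt, root_idx=0):
    """Mean Per-Joint Position Error, in the same units as the inputs
    (mm here), after root-relative alignment per frame.

    pred, gt: (T, J, 3) arrays of 3D joints over T frames.
    root_idx: index of the root (e.g. wrist) joint used for alignment.
    """
    # Express both trajectories relative to the root joint per frame,
    # so a global translation of the whole hand does not count as error.
    pred_rel = pred - pred[:, root_idx:root_idx + 1]
    gt_rel = gt - gt[:, root_idx:root_idx + 1]
    # Euclidean distance per joint, averaged over joints and frames.
    return np.linalg.norm(pred_rel - gt_rel, axis=-1).mean()
```

With this alignment, a constant global offset between prediction and ground truth yields zero error; only relative joint placement is penalized.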
| Method | FVD (↓) | MF (↑) | LPIPS (↓) | SSIM (↑) | PSNR (↑) | MPJPE (↓) |
|---|---|---|---|---|---|---|
| CosHand | 68.76 | 0.651 | 0.156 | 0.765 | 23.84 | 14.49 |
| Ours w/ seg | 48.97 | 0.708 | 0.084 | 0.831 | 25.76 | 9.61 |
| Ours w/ depth | 50.85 | 0.702 | 0.086 | 0.845 | 26.98 | 10.07 |
| Ours w/ hand | 52.41 | 0.671 | 0.113 | 0.838 | 25.66 | 8.01 |
| Ours w/ all | 46.31 | 0.777 | 0.081 | 0.851 | 28.36 | 7.01 |
Quantitative results on the OAKINK2 dataset. Comparison of our method with CosHand. For a fair evaluation, both models are trained on the same dataset. Our approach achieves state-of-the-art performance, outperforming CosHand across all evaluated metrics.
Data augmentation analysis with varying ratios of real data. We augment different portions of the DexYCB training set (25%, 50%, 75%, 100%) with our generated synthetic data. The baseline (dashed line) indicates performance when training solely on 100% of the real DexYCB data without synthetic augmentation.
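The augmentation protocol above (a fraction of the real training set plus all synthetic clips) can be sketched as below; the function name and signature are illustrative, not from the released codebase:

```python
import random

def build_training_set(real_videos, synthetic_videos, real_ratio, seed=0):
    """Keep a random fraction `real_ratio` of the real training videos
    and append all synthetic clips.

    Illustrative sketch of the augmentation protocol; the exact
    sampling scheme used in the paper is an assumption here.
    """
    rng = random.Random(seed)  # fixed seed for a reproducible subset
    n_real = int(len(real_videos) * real_ratio)
    kept_real = rng.sample(real_videos, n_real)
    return kept_real + list(synthetic_videos)
```

For example, `real_ratio=0.5` reproduces the 50%-real setting reported above, with the synthetic videos appended to the retained real subset.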
@inproceedings{gao2025PAM,
  title={PAM: A Pose–Appearance–Motion Engine for Sim-to-Real HOI Video Generation},
  author={Gao, Mingju and Yang, Kaisen and Gao, Huan-ang and Li, Bohan and Ding, Ao and Li, Wenyi and Yu, Yangcheng and Liu, Jinkun and Xu, Shaocong and Niu, Yike and Chi, Haohan and Chen, Hao and Tang, Hao and Zhang, Yu and Yi, Li and Zhao, Hao},
  booktitle={CVPR},
  year={2026}
}