Hand–object interaction (HOI) reconstruction and synthesis are becoming central to embodied AI and AR/VR. Yet, despite rapid progress, existing HOI generation research remains fragmented across three disjoint tracks: (1) pose-only synthesis, which predicts MANO trajectories without producing pixels; (2) single-image HOI generation, which hallucinates appearance from masks or 2D cues but lacks dynamics; and (3) video generation methods that require both the entire pose sequence and the ground-truth first frame as inputs, precluding true sim-to-real deployment. We argue that HOI generation instead requires a unified engine that brings pose, appearance, and motion together within one coherent framework. To this end, we introduce PAM: a Pose–Appearance–Motion Engine for controllable HOI video generation. We validate the engine on four fronts: (1) on DexYCB, PAM achieves an FVD of 29.13 (vs. 38.83 for InterDyn) and an MPJPE of 19.37 mm (vs. 30.05 mm for CosHand), while generating higher-resolution 480×720 videos compared to the 256×256/256×384 baselines; (2) on OAKINK2, our full multi-condition model improves FVD from 68.76 to 46.31; (3) an ablation over input conditions on DexYCB shows that combining depth, segmentation, and keypoints consistently yields the best results; and (4) for a downstream hand pose estimation task using SimpleHand, augmenting training with 3,400 synthetic videos (207k frames) lets a model trained on only 50% of the real data match the 100%-real baseline.
Overview of our three-stage generation pipeline. (1) Pose Generation: a pretrained pose model predicts the intermediate hand-object interaction (HOI) poses from the initial and target poses and the object mesh. (2) Appearance Generation: a controllable image diffusion model synthesizes the first frame of the video, conditioned on multi-modal inputs (depth maps, semantic masks, and keypoint annotations). (3) Motion Generation: a video diffusion model renders the generated HOI sequence and the first frame into a full video, conditioned on the same multi-modal inputs used in the appearance stage.
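The three stages above compose as follows; this is a minimal structural sketch in which all callables (`pose_model`, `image_model`, `video_model`, `render_conditions`) are hypothetical placeholders injected by the caller, not the released API:

```python
def generate_hoi_video(init_pose, target_pose, object_mesh,
                       pose_model, image_model, video_model,
                       render_conditions):
    """Structural sketch of the three-stage pipeline.

    All four callables are placeholders standing in for the actual
    models; only the data flow between stages is illustrated.
    """
    # Stage 1: pose generation - intermediate HOI poses from the
    # initial/target poses and the object mesh.
    poses = pose_model(init_pose, target_pose, object_mesh)

    # Render per-frame control signals (depth, segmentation,
    # keypoints) from each generated pose.
    conds = [render_conditions(p, object_mesh) for p in poses]

    # Stage 2: appearance generation - synthesize the first frame
    # from the first frame's multi-modal conditions.
    first_frame = image_model(conds[0])

    # Stage 3: motion generation - animate the first frame under the
    # same per-frame control signals.
    return video_model(first_frame, conds)
```

The key design point this sketch captures is that stages 2 and 3 consume the *same* multi-modal control signals, so appearance and motion stay consistent with the generated pose sequence.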
| Method | FVD (↓) | MF (↑) | LPIPS (↓) | SSIM (↑) | PSNR (↑) | MPJPE (mm, ↓) | Resolution |
|---|---|---|---|---|---|---|---|
| CosHand | 58.51 | 0.591 | 0.139 | 0.767 | 23.20 | 30.05 | 256×256 |
| InterDyn | 38.83 | 0.680 | 0.119 | 0.848 | 24.86 | - | 256×384 |
| ManiVideo | - | - | 0.079 | 0.913 | 30.10 | 57.30 | - |
| Ours w/ all | 29.13 | 0.712 | 0.069 | 0.914 | 30.17 | 19.37 | 480×720 |
Quantitative comparison on the DexYCB dataset. Our method is evaluated against CosHand, InterDyn, and ManiVideo. Results for InterDyn and ManiVideo are taken from their original papers. For a fair comparison, CosHand was fine-tuned on the same s0-split training set as ours. Our approach achieves state-of-the-art performance across all evaluated metrics while generating high-resolution 480×720 videos.
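MPJPE, reported above in millimetres, is the mean Euclidean distance between predicted and ground-truth 3D hand joints. A minimal NumPy sketch follows; the per-frame root-relative alignment is a common convention and an assumption here, since this excerpt does not spell out the exact evaluation protocol:

```python
import numpy as np

def mpjpe(pred, gt, root_idx=0):
    """Mean Per-Joint Position Error, in the same units as the inputs
    (mm here), after root-relative alignment per frame.

    pred, gt: (T, J, 3) arrays of 3D joints over T frames.
    root_idx: index of the root (e.g. wrist) joint used for alignment.
    """
    # Express both trajectories relative to the root joint per frame,
    # so a global translation of the whole hand does not count as error.
    pred_rel = pred - pred[:, root_idx:root_idx + 1]
    gt_rel = gt - gt[:, root_idx:root_idx + 1]
    # Euclidean distance per joint, averaged over joints and frames.
    return np.linalg.norm(pred_rel - gt_rel, axis=-1).mean()
```

With this alignment, a constant global offset between prediction and ground truth yields zero error; only relative joint placement is penalized.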
| Method | FVD (↓) | MF (↑) | LPIPS (↓) | SSIM (↑) | PSNR (↑) | MPJPE (↓) |
|---|---|---|---|---|---|---|
| CosHand | 68.76 | 0.651 | 0.156 | 0.765 | 23.84 | 14.49 |
| Ours w/ seg | 48.97 | 0.708 | 0.084 | 0.831 | 25.76 | 9.61 |
| Ours w/ depth | 50.85 | 0.702 | 0.086 | 0.845 | 26.98 | 10.07 |
| Ours w/ hand | 52.41 | 0.671 | 0.113 | 0.838 | 25.66 | 8.01 |
| Ours w/ all | 46.31 | 0.777 | 0.081 | 0.851 | 28.36 | 7.01 |
Quantitative results on the OAKINK2 dataset. Comparison of our method with CosHand. For a fair evaluation, both models are trained on the same dataset. Our approach achieves state-of-the-art performance, outperforming CosHand across all evaluated metrics.
Data augmentation analysis with varying ratios of real data. We augment different portions of the DexYCB training set (25%, 50%, 75%, 100%) with our generated synthetic data. The baseline (dashed line) indicates performance when training solely on 100% of the real DexYCB data without synthetic augmentation.
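The augmentation protocol above (a fraction of the real training set plus all synthetic clips) can be sketched as below; the function name and signature are illustrative, not from the released codebase:

```python
import random

def build_training_set(real_videos, synthetic_videos, real_ratio, seed=0):
    """Keep a random fraction `real_ratio` of the real training videos
    and append all synthetic clips.

    Illustrative sketch of the augmentation protocol; the exact
    sampling scheme used in the paper is an assumption here.
    """
    rng = random.Random(seed)  # fixed seed for a reproducible subset
    n_real = int(len(real_videos) * real_ratio)
    kept_real = rng.sample(real_videos, n_real)
    return kept_real + list(synthetic_videos)
```

For example, `real_ratio=0.5` reproduces the 50%-real setting reported above, with the synthetic videos appended to the retained real subset.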
@inproceedings{gao2025PAM,
  title={PAM: A Pose–Appearance–Motion Engine for Sim-to-Real HOI Video Generation},
  author={Gao, Mingju and Yang, Kaisen and Gao, Huan-ang and Li, Bohan and Ding, Ao and Li, Wenyi and Yu, Yangcheng and Liu, Jinkun and Xu, Shaocong and Niu, Yike and Chi, Haohan and Chen, Hao and Tang, Hao and Zhang, Yu and Yi, Li and Zhao, Hao},
  booktitle={CVPR},
  year={2026}
}