GenDexHand: Generative Simulation for Dexterous Hands

1 The University of Hong Kong; 2 Transcengram; 3 Shanghai Jiao Tong University; 4 The Chinese University of Hong Kong
* Equal Contribution
Corresponding authors
GenDexHand Teaser

A showcase of 15 diverse and realistic task scenes automatically generated by GenDexHand.

Abstract

Data scarcity remains a fundamental bottleneck for embodied intelligence. Existing approaches use large language models (LLMs) to automate gripper-based simulation generation, but they transfer poorly to dexterous manipulation, which demands more specialized environment design. Dexterous manipulation tasks are also inherently harder because of the hand's higher degrees of freedom, so generating feasible and trainable dexterous-hand tasks at scale remains an open challenge. To this end, we present GenDexHand, a generative simulation pipeline that autonomously produces diverse robotic tasks and environments for dexterous manipulation. GenDexHand introduces a closed-loop refinement process that adjusts object placements and scales based on vision-language model (VLM) feedback, substantially improving the average quality of generated environments. Each task is further decomposed into sub-tasks to enable sequential reinforcement learning, reducing training time and increasing success rates. By offering a simulation-based solution to synthetic data generation, our work provides a viable path toward scalable training of diverse dexterous hand behaviors in embodied intelligence.

GenDexHand Overview

Overview of the GenDexHand pipeline for task generation. The process consists of four stages: Environment Proposal, Environment Creation, MLLM Refinement, and Trajectory Generation. Embodied assets and object assets are first provided to the Generator to produce an environment proposal. The simulator then renders multi-view images of the proposed scene, which are refined using an MLLM. Finally, the refined environment and proposal are combined to generate the resulting dexterous hand trajectory.
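Read as code, the four stages amount to a generate-render-refine loop followed by trajectory synthesis. The sketch below is a minimal illustration of that control flow under assumed interfaces; every helper name (propose_environment, render_multiview, critique_scene, rollout, apply_directives) is hypothetical and does not reflect the paper's released API.

```python
# Minimal sketch of a generate-render-refine loop in the spirit of the pipeline
# above. All helper names and the data layout are illustrative assumptions,
# not the authors' actual implementation.
def generate_task(llm, vlm, simulator, apply_directives, assets,
                  max_refine_rounds=3):
    # 1. Environment Proposal: the LLM selects assets and an initial layout.
    scene = llm.propose_environment(assets)             # assumed call
    # 2. Environment Creation: instantiate the proposal in the simulator.
    simulator.load(scene)                               # assumed call
    # 3. MLLM Refinement: render multi-view images, critique, adjust, repeat.
    for _ in range(max_refine_rounds):
        images = simulator.render_multiview(scene)      # assumed call
        directives = vlm.critique_scene(scene, images)  # e.g. "apple - scale 0.5"
        if not directives:                              # scene judged acceptable
            break
        scene = apply_directives(scene, directives)     # caller-supplied helper
        simulator.load(scene)
    # 4. Trajectory Generation: decompose into sub-tasks and solve them
    #    sequentially to obtain the dexterous-hand trajectory.
    trajectory = simulator.rollout(scene)               # assumed call
    return scene, trajectory
```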

Generation result

Experiment

TASK QUALITY OF GENDEXHAND

Task refinement using MLLM
Two examples of task refinement using the MLLM. Modification directives include Scale_Action, formatted as "object - scale value"; Position_Action, formatted as "object - move_[x/y/z] value"; and Pose_Action, formatted as "object - rotate_[x/y/z] value".
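The directive formats above are simple enough to parse mechanically. The sketch below shows one plausible way to parse and apply such strings to a scene dictionary; the scene layout, the assumption that the scale value replaces the previous one, and the apply_directives helper itself are all illustrative, not the paper's code.

```python
import re

# Pattern for directives such as "apple - scale 0.5", "bowl - move_x 0.1",
# or "bottle - rotate_z 90", following the formats in the caption above.
_DIRECTIVE = re.compile(
    r"^\s*(?P<obj>\S+)\s*-\s*(?P<action>scale|move_[xyz]|rotate_[xyz])\s+"
    r"(?P<value>-?\d+(?:\.\d+)?)\s*$"
)


def apply_directives(scene, directives):
    """Apply Scale_/Position_/Pose_Action strings to a scene dict.

    `scene` is assumed to map object names to
    {"scale": float, "pos": [x, y, z], "rot": [rx, ry, rz]}.
    """
    axes = {"x": 0, "y": 1, "z": 2}
    for line in directives:
        m = _DIRECTIVE.match(line)
        if m is None:
            continue  # skip anything outside the expected directive format
        obj, action, value = m["obj"], m["action"], float(m["value"])
        entry = scene[obj]
        if action == "scale":
            entry["scale"] = value            # assumed: value replaces old scale
        elif action.startswith("move_"):
            entry["pos"][axes[action[-1]]] += value
        else:  # rotate_[x/y/z]
            entry["rot"][axes[action[-1]]] += value
    return scene


# Toy usage example:
scene = {"apple": {"scale": 1.0, "pos": [0.0, 0.0, 0.0], "rot": [0.0, 0.0, 0.0]}}
print(apply_directives(scene, ["apple - scale 0.5", "apple - move_z 0.05"]))
```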
Method        all-MiniLM-L6-v2   all-mpnet-base-v2   all-distilroberta-v1
GenDexHand    0.2880             0.2836              0.3156
RoboGen       0.1906             0.2174              0.1952
RoboTwin      0.3237             0.3589              0.3945
Bi-DexHands   0.2212             0.2110              0.2030
Meta-World    0.5213             0.5335              0.5981
Average cosine similarity of text-based task descriptions under three sentence-embedding models.
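For reference, the sketch below shows one plausible way to compute an average cosine similarity over task descriptions with the sentence-embedding models named in the table header (it assumes average pairwise similarity and uses the sentence-transformers package); the tiny description list is illustrative only, not the evaluated task set.

```python
# Sketch: average pairwise cosine similarity of task descriptions, using a
# sentence-embedding model from the table (requires `sentence-transformers`).
from itertools import combinations

import numpy as np
from sentence_transformers import SentenceTransformer


def average_pairwise_cosine(descriptions, model_name="all-MiniLM-L6-v2"):
    model = SentenceTransformer(model_name)
    # Normalized embeddings turn dot products into cosine similarities.
    emb = model.encode(descriptions, normalize_embeddings=True)
    sims = [float(np.dot(emb[i], emb[j]))
            for i, j in combinations(range(len(emb)), 2)]
    return float(np.mean(sims))


# Toy example (not the paper's task set):
tasks = [
    "Open the cabinet door with the dexterous hand.",
    "Pick up the bottle from the table.",
    "Put the apple into the bowl.",
]
print(average_pairwise_cosine(tasks))
```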

EFFICIENCY OF POLICY LEARNING

Policy learning efficiency – results figure 1
Policy learning efficiency – results figure 2
Bar charts comparing three tasks: “Open Cabinet,” “Pick up Bottle,” and “Put the Apple into Bowl.” The Y-axes denote the success rate (higher is better) and the number of environment steps (lower is better) required to collect 1000 successful trajectories in evaluation. Four methods are evaluated: (i) w/o subgoal, baseline RL without subtask decomposition; (ii) w/ subgoals, RL with tasks decomposed into short-horizon subgoals; (iii) w/ freeze-DOFs, RL with selective freezing of redundant degrees of freedom; and (iv) w/ motion planning (Ours), which handles the approaching subtasks with motion planning instead of RL.
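To make the compared settings concrete, the sketch below outlines how a long-horizon task could be split into sub-tasks trained sequentially, with approach-style sub-tasks optionally handled by motion planning rather than RL; the environment, trainer, and planner interfaces are assumptions for illustration, not the paper's training code.

```python
# Sketch of sequential sub-task training (assumed env/trainer/planner interfaces).
def train_task_with_subgoals(env, subgoals, rl_trainer, motion_planner=None):
    """Train one policy per sub-goal, chaining final states as the next start.

    `subgoals` is assumed to be a list of dicts such as
    {"name": "approach", "reward_fn": fn, "use_planning": bool}.
    """
    policies, state = [], env.reset()
    for goal in subgoals:
        if goal.get("use_planning") and motion_planner is not None:
            # Approach sub-tasks: solve with motion planning instead of RL,
            # so no environment steps are spent learning a reaching motion.
            plan = motion_planner.plan(state, goal)        # assumed call
            state = env.execute(plan)                      # assumed call
            policies.append(plan)
        else:
            # Manipulation sub-tasks: short-horizon RL from the chained state.
            policy = rl_trainer.train(env, start_state=state,
                                      reward_fn=goal["reward_fn"])  # assumed call
            state = env.rollout_final_state(policy, start_state=state)
            policies.append(policy)
    return policies
```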

BibTeX

@misc{chen2025gendexhandgenerativesimulationdexterous,
  title={GenDexHand: Generative Simulation for Dexterous Hands},
  author={Feng Chen and Zhuxiu Xu and Tianzhe Chu and Xunzhe Zhou and Li Sun and Zewen Wu and Shenghua Gao and Zhongyu Li and Yanchao Yang and Yi Ma},
  year={2025},
  eprint={2511.01791},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2511.01791},
}