Reconstructing Objects along Hand Interaction Timelines in Egocentric Video

1University of Bristol, UK
2Max Planck Institute for Intelligent Systems, Tübingen, Germany
arXiv 2025

tl;dr: A constrained optimisation framework that reconstructs object poses in egocentric video
by modelling temporal pose constraints as the object transitions from static to stably grasped.

Results on challenging in-the-wild sequences

Top-left: input egocentric frame. Top-centre: input frame with overlaid 3D results. Top-right: rotated views of the 3D results.
Bottom: 3D results with camera poses in world coordinates (camera icon: camera pose).
Frames showing objects only correspond to static segments.

P01_05_right_bottle_73544_74450 P02_09_left_bottle_29247_30547 P01_01_left_pan_85223_86793 P01_09_left_plate_93267_94233 P02_01_left_mug_5973_7186 P09_02_left_bowl_21343_22185 P11_104_right_pan_29148_29746 P12_101_left_mug_29154_30004 P22_107_right_cup_5903_6940 P28_112_right_bowl_11825_13280 P23_02_left_plate_43624_44088 P25_101_left_bottle_379_1456

Abstract

We introduce the task of Reconstructing Objects along Hand Interaction Timelines (ROHIT). We first define the Hand Interaction Timeline (HIT) from a rigid object's perspective. In a HIT, an object is first static relative to the scene, then is held in hand following contact, during which its pose changes. This is usually followed by a firm grip during use, before the object is released and becomes static w.r.t. the scene again. We model these pose constraints over the HIT and propose to propagate the object's pose along the HIT, enabling superior reconstruction using our proposed Constrained Optimisation and Propagation (COP) framework. Importantly, we focus on timelines with stable grasps, i.e. where the hand stably holds the object, effectively maintaining constant contact during use. This allows us to efficiently annotate, study, and evaluate object reconstruction in videos without 3D ground truth.
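To make the two segment-level pose constraints concrete, the minimal Python sketch below (illustrative only, not the authors' code) shows one way they could be encoded and propagated: during static segments the object's world pose is held fixed, and during stable grasps the object's pose relative to the hand is held fixed, so the object follows the tracked hand. The 4x4 homogeneous pose representation, function names, and residual form are all assumptions.

# Illustrative sketch of HIT pose constraints and propagation (not the paper's code).
# Poses are assumed to be 4x4 homogeneous transforms, e.g. T_world_obj maps object -> world.
import numpy as np

def static_constraint(T_world_obj_t, T_world_obj_ref):
    # Residual for a static segment: the object should not move w.r.t. the world.
    return np.linalg.inv(T_world_obj_ref) @ T_world_obj_t - np.eye(4)

def stable_grasp_constraint(T_world_obj_t, T_world_hand_t, T_hand_obj_grasp):
    # Residual for a stable-grasp segment: the object should stay fixed w.r.t. the hand.
    T_hand_obj_t = np.linalg.inv(T_world_hand_t) @ T_world_obj_t
    return np.linalg.inv(T_hand_obj_grasp) @ T_hand_obj_t - np.eye(4)

def propagate_static(T_world_obj_ref, num_frames):
    # During a static segment, the object's world pose is copied forward unchanged.
    return [T_world_obj_ref.copy() for _ in range(num_frames)]

def propagate_grasp(T_hand_obj_grasp, hand_poses_world):
    # During a stable grasp, the object pose follows the tracked hand pose.
    return [T_world_hand @ T_hand_obj_grasp for T_world_hand in hand_poses_world]

In a full pipeline, residuals of this kind would be stacked over all frames of a HIT and minimised jointly, which is how a pose estimated reliably in one segment can be propagated to harder frames elsewhere on the timeline.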

We evaluate our proposed task, ROHIT, over two egocentric datasets, HOT3D and in-the-wild EPIC-Kitchens. In HOT3D, we curate 1.2K clips of stable grasps. In EPIC-Kitchens, we annotate 2.4K clips of stable grasps, covering 390 object instances across 9 categories, from videos of daily interactions in 141 environments. Without 3D ground truth, we utilise 2D projection error to assess the reconstruction. Quantitatively, COP improves stable grasp reconstruction by 6.2-11.3% and HIT reconstruction by up to 24.5% with constrained pose propagation.
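As a rough illustration of a 2D projection metric (not necessarily the exact metric used in the paper), the sketch below projects the object's 3D model points into the image using the estimated pose and camera intrinsics, then measures the mean pixel distance to corresponding 2D reference points. The function names and the form of the 2D reference (e.g. annotated points with known correspondence) are assumptions.

# Illustrative 2D projection error (assumed formulation, not the paper's exact metric).
import numpy as np

def project_points(points_obj, T_cam_obj, K):
    # Project Nx3 object-frame points into the image with a pinhole camera model.
    points_h = np.hstack([points_obj, np.ones((points_obj.shape[0], 1))])  # Nx4 homogeneous
    points_cam = (T_cam_obj @ points_h.T).T[:, :3]                         # Nx3 in camera frame
    uv_h = (K @ points_cam.T).T                                            # Nx3
    return uv_h[:, :2] / uv_h[:, 2:3]                                      # Nx2 pixel coordinates

def projection_error(points_obj, T_cam_obj_est, reference_uv, K):
    # Mean pixel distance between model points projected under the estimated pose
    # and corresponding 2D reference points (correspondence is assumed given).
    uv_est = project_points(points_obj, T_cam_obj_est, K)
    return float(np.mean(np.linalg.norm(uv_est - reference_uv, axis=1)))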

More In-the-Wild Stable Grasp Results

P01_14_left_hand_58993_59073_action P02_09_left_hand_67864_67947_action P06_03_left_hand_19268_19549_action P03_120_left_hand_37223_37347_action P08_16_left_hand_625_808_action P22_107_right_hand_20145_20246_action P02_09_left_hand_10416_10752_action P30_05_left_hand_68640_68737_action P01_09_left_hand_174538_174756_action P27_101_right_hand_57039_57087_action P02_101_left_hand_11611_11690_action P04_110_left_hand_7540_8536_action P01_01_left_hand_7430_7541_action P01_05_right_hand_20658_20696_action P01_09_left_hand_158727_159675_action P25_107_left_hand_102550_102635_action P30_07_right_hand_4697_4760_action

BibTeX

@misc{xxxx,
    title={xxxxxxxxx},
    author={xxxxxxxxxxx},
    year={xxxxxxxx},
    eprint={xxxxx.xxxxx},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}