
LostFound

[RA-L] Lost & Found dynamically tracks object poses from egocentric videos while updating a scene graph, enabling richer semantic 3D understanding for robotic downstream tasks.


<div align='center'> <h2 align="center"> Lost & Found: Tracking Changes from Egocentric Observations in 3D Dynamic Scene Graphs </h2> <div class="is-size-5 publication-authors"> <span class="author-block"> <a href="https://www.linkedin.com/in/tjark-behrens">Tjark Behrens</a><sup>1</sup>,</span> <span class="author-block"> <a href="https://renezurbruegg.github.io/">René Zurbrügg</a><sup>1</sup>,</span> <span class="author-block"> <a href="https://people.inf.ethz.ch/marc.pollefeys/">Marc Pollefeys</a><sup>1,2</sup>, </span> <span class="author-block"> <a href="https://zuriabauer.com/">Zuria Bauer</a><sup>1,*</sup>, </span> <span class="author-block"> <a href="https://hermannblum.net/">Hermann Blum</a><sup>1,3,*</sup> </span> </div> <div class="is-size-5 publication-authors"> <span class="author-block"><sup>1</sup>ETH Zürich,</span> <span class="author-block"><sup>2</sup>Microsoft,</span> <span class="author-block"><sup>3</sup>Uni Bonn</span> <span class="author-block">&nbsp;&nbsp;&nbsp;<sup>*</sup>Equal supervision</span> </div> <br>

teaser

Abstract

Recent approaches have successfully focused on the segmentation of static reconstructions, thereby equipping downstream applications with semantic 3D understanding. However, the world in which we live is dynamic, characterized by numerous interactions between the environment and humans or robotic agents. Static semantic maps are unable to capture this information, and the naive solution of rescanning the environment after every change is both costly and ineffective in tracking e.g. objects being stored away in drawers. With Lost & Found we present an approach that addresses this limitation. Based solely on egocentric recordings with corresponding hand position and camera pose estimates, we are able to track the 6DoF poses of the moving object within the detected interaction interval. These changes are applied online to a transformable scene graph that captures object-level relations. Compared to state-of-the-art object pose trackers, our approach is more reliable in handling the challenging egocentric viewpoint and the lack of depth information. It outperforms the second-best approach by 34% and 56% for translational and orientational error, respectively, and produces visibly smoother 6DoF object trajectories. In addition, we illustrate how the acquired interaction information in the dynamic scene graph can be employed in the context of robotic applications that would otherwise be unfeasible: We show how our method allows to command a mobile manipulator through teach & repeat, and how information about prior interaction allows a mobile manipulator to retrieve an object hidden in a drawer.

[Project Webpage] [Paper] [Teaser Video]

</div>

News :newspaper:

  • April 22nd: Our paper has been accepted as a 4-page abstract to the Workshop on Computer Vision for Mixed Reality that is held in conjunction with CVPR 2025! More information here.
  • March 5th: We published the evaluation dataset. Have a look on Zenodo in order to reproduce our results or run your own pipeline.
  • February 4th: Our paper has been accepted to the IEEE Robotics and Automation Letters (RA-L)! Check it out here.

Environment Setup :memo:

  • Set up the conda environment
    # create conda environment
    conda create -n lost_found -c conda-forge python=3.10.12

    # activate conda environment
    conda activate lost_found

    # install PyTorch for your respective architecture, tested with CUDA 11.7:
    conda install pytorch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 pytorch-cuda=11.7 -c pytorch -c nvidia

    # install and build dependencies for co-tracker
    cd thirdparty/cotracker && pip install -e . && cd ../..

    # install and build dependencies for hand-object-detector
    cd thirdparty/detector/lib && python setup.py build develop && cd ../../..

    # install remaining dependencies in main repository
    pip install -r requirements.txt

If problems arise with the thirdparty modules on your machine, have a look at the respective git repositories for more detailed installation guides: co-tracker and hand-object-detector.
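Before moving on, it can help to confirm that the pinned PyTorch packages actually ended up in the active environment. The following is a minimal sketch, not part of the repository: the helper names `installed_version` and `check_pins` are hypothetical, and the sketch assumes that the conda `pytorch` package registers itself under the distribution name `torch`.

```python
from importlib import metadata

# Pins copied from the conda install command above.
# Assumption: the conda "pytorch" package is visible as distribution "torch".
EXPECTED = {"torch": "2.0.1", "torchvision": "0.15.2", "torchaudio": "2.0.2"}


def installed_version(package):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_pins(expected, lookup=installed_version):
    """Map each package to (expected, found, ok).

    Local build tags such as '+cu117' are ignored for the comparison.
    """
    report = {}
    for name, want in expected.items():
        found = lookup(name)
        ok = found is not None and found.split("+")[0] == want
        report[name] = (want, found, ok)
    return report


if __name__ == "__main__":
    for name, (want, found, ok) in check_pins(EXPECTED).items():
        status = "OK" if ok else "MISMATCH"
        print(f"{name}: expected {want}, found {found} -> {status}")
```

A mismatch does not necessarily mean the pipeline will fail, since only the versions listed above are stated as tested.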

Downloads :droplet:

  1. The pre-trained model weights for hand-object detection are available here. Place them under the folder thirdparty/detector/models/res101_handobj_100K/pascal_voc:

    mkdir -p thirdparty/detector/models/res101_handobj_100K/pascal_voc
    cd thirdparty/detector/models/res101_handobj_100K/pascal_voc
    gdown "https://drive.google.com/uc?id=1H2tWsZkS7tDF8q1-jdjx6V9XrK25EDbE"
    cd ../../../../..
    
  2. The pre-trained CoTracker2 weights for the online version are available here. Place them under the folder thirdparty/cotracker/checkpoint:

    mkdir -p thirdparty/cotracker/checkpoint
    cd thirdparty/cotracker/checkpoint
    wget https://huggingface.co/facebook/cotracker/resolve/main/cotracker2.pth
    cd ../../..
    
  3. [Optional] Download the demo data for a shoe scene and extract it under the folder demo_data/. Also download the respective 3D scan for the demo and extract it under the folder Scan/.

  4. [Optional] Download the full Evaluation Dataset. Extract the Dataset and the 3D Scan into the Data folder.

  5. [Optional] There is an easy Docker setup available for the YOLO drawer detection algorithm. Pull the Docker image from the hub (docker pull craiden/yolodrawer:v1.0), start the container (docker run -p 5004:5004 --gpus all -it craiden/yolodrawer:v1.0), and run the module (python3 app.py). You then need to activate the respective flag for drawer detection in the preprocess_scan function and when building the respective scene graph, as described in the demo section below. The functional elements for light switches are included in this repo as well. For the setup of the detection module, we refer to this work, which also demonstrates how robotic agents profit from the proposed scene graph structure in the case of light switches.
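To check that the required downloads from steps 1 and 2 landed in the expected places, a small script along the following lines may be useful. This is a sketch under stated assumptions: the function `missing_downloads` is hypothetical and not part of the repository, and because the file name of the hand-object detector weights is not given here, the check only verifies that the target folder is non-empty.

```python
from pathlib import Path

# Paths taken from the download steps above. The detector weight file name is
# not stated in this README, so only a non-empty directory is checked.
DETECTOR_DIR = Path("thirdparty/detector/models/res101_handobj_100K/pascal_voc")
COTRACKER_CKPT = Path("thirdparty/cotracker/checkpoint/cotracker2.pth")


def missing_downloads(repo_root="."):
    """Return a list of human-readable problems; an empty list means all good."""
    root = Path(repo_root)
    problems = []

    detector_dir = root / DETECTOR_DIR
    if not detector_dir.is_dir() or not any(detector_dir.iterdir()):
        problems.append(f"no detector weights found in {detector_dir}")

    ckpt = root / COTRACKER_CKPT
    if not ckpt.is_file():
        problems.append(f"missing CoTracker2 checkpoint {ckpt}")

    return problems


if __name__ == "__main__":
    for problem in missing_downloads():
        print(problem)
```

Run it from the repository root. The optional demo data, evaluation dataset and Docker image are deliberately not covered.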

Run demo

If you have not yet downloaded the demo data and the detection modules, do so as described in the section above. The file run_demo.py contains an example that steps through the different options when creating a scene graph for that tracking sequence. Fill in the respective directories for the variables SCAN_DIR and ARIA_DATA at the beginning of the file.

In preprocess_scan, we can optionally run an additional drawer or light switch detection algorithm on the scan. If we have done so, we can integrate those detections into the scene graph within its build function.

When creating the scene graph, we can set a minimum confidence threshold for objects that should be added to the graph, as well as a list of objects that we would like to mark as immovable throughout the tracking. The remove_category function proves useful when you want to get rid of certain object categories for better visualization. To actually visualize the graph, it is sufficient to call the corresponding visualize() function. The flags centroid, connections and labels toggle the visibility of these elements within the scene graph. For tracking, one can choose to create a video of the sequence by providing a corresponding path.

    python run_demo.py
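To make the options described above more concrete, here is a toy illustration of how a confidence threshold, an immovable list and category removal might interact. It is not the repository's scene graph implementation: the classes `Node` and `ToySceneGraph` and the method names `add_object` and `translate` are hypothetical, and only `remove_category` borrows its name from the text above.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    centroid: tuple          # (x, y, z) in scan coordinates
    confidence: float
    movable: bool = True


@dataclass
class ToySceneGraph:
    """Hypothetical stand-in, not the scene graph class used by run_demo.py."""
    min_confidence: float = 0.5
    immovable: tuple = ()
    nodes: dict = field(default_factory=dict)
    _next_id: int = 0

    def add_object(self, label, centroid, confidence):
        """Add a detection unless it falls below the confidence threshold."""
        if confidence < self.min_confidence:
            return None
        node_id = self._next_id
        self._next_id += 1
        movable = label not in self.immovable
        self.nodes[node_id] = Node(label, tuple(centroid), confidence, movable)
        return node_id

    def remove_category(self, label):
        """Drop every node of the given category, e.g. to declutter a plot."""
        self.nodes = {i: n for i, n in self.nodes.items() if n.label != label}

    def translate(self, node_id, offset):
        """Shift a node's centroid; immovable nodes are left untouched."""
        node = self.nodes[node_id]
        if not node.movable:
            return False
        node.centroid = tuple(c + d for c, d in zip(node.centroid, offset))
        return True


if __name__ == "__main__":
    graph = ToySceneGraph(min_confidence=0.3, immovable=("table",))
    shoe = graph.add_object("shoe", (1.0, 0.0, 0.5), 0.9)
    graph.add_object("table", (1.0, 0.0, 0.0), 0.8)
    graph.add_object("wall", (0.0, 0.0, 0.0), 0.1)   # dropped: below threshold
    graph.translate(shoe, (0.5, 0.0, 0.0))
    print(graph.nodes[shoe].centroid)
```

The sketch only moves centroids. The actual pipeline tracks full 6DoF poses and object-level relations, which this toy omits.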

Evaluation/Dataset

In this section, we report the results of our paper. To reproduce the results, download the dataset as mentioned above and place it in a suitable location. With the two commands below, you are able to generate the 6DoF trajectories with corresponding timesteps for (i) the Head Pose baseline and (ii) Lost & Found:

    # (i): Head Pose
    python run_dataset.py --scan_dir Data/Final_Scan --data_dir Data/Final_Dataset --headpose --save_pose

    # (ii): Lost & Found
    python run_dataset.py --scan_dir Data/Final_Scan --data_dir Data/Final_Dataset --save_pose
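Once trajectories are saved, they are typically compared against ground truth pose by pose. The sketch below shows one common way to compute a translational and an orientational error for a single pose pair. The exact error definitions used in the paper are not given in this README, so treat this as an assumption. The function names are hypothetical, and rotations are assumed to be 3x3 matrices stored as nested lists.

```python
import math


def translation_error(t_est, t_gt):
    """Euclidean distance between estimated and ground-truth positions."""
    return math.dist(t_est, t_gt)


def rotation_error_deg(R_est, R_gt):
    """Geodesic angle in degrees between two 3x3 rotation matrices."""
    # trace(R_est^T @ R_gt) equals the sum of element-wise products.
    trace = sum(R_est[i][j] * R_gt[i][j] for i in range(3) for j in range(3))
    # Clamp to [-1, 1] to guard against floating-point round-off.
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))


def rot_z(deg):
    """Rotation about the z-axis by `deg` degrees (helper for the example)."""
    a = math.radians(deg)
    c, s = math.cos(a), math.sin(a)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


if __name__ == "__main__":
    print(translation_error((0.0, 0.0, 0.0), (3.0, 4.0, 0.0)))
    print(rotation_error_deg(rot_z(0.0), rot_z(90.0)))
```

Averaging these per-pose values over a trajectory gives summary numbers, though the aggregation used for the reported results may differ.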

We used custom implementations of BundleTrack, BundleSDF and FoundationPose for the baseline comparison, in the sense that we introduced Metric3Dv2 for depth and SAM2 for mask generation. Please refer to the respective code bases for more detailed information.

We state the main findings of our approach compared to the baselines below. For more information, please refer to the paper.

Results table

Run Pipeline on your own data

This setup requires access to Aria glasses as part of the Aria Research Kit (ARK).

3D Scan

To run the pipeline o
