# L3ROcc
L3ROcc is a high-performance visual geometry framework designed to transform standard RGB video sequences into high-precision 3D Point Clouds, 3D Occupancy Grids, and 4D Temporal Observation Data.
This project employs $\pi^3$ (Permutation-Equivariant Visual Geometry Learning) as its foundational reconstruction engine and implements a fully automated data labeling and alignment pipeline tailored to navigation learning tasks. All processed data adheres to the LeRobotDataset v2.1 specification. In practical testing, processing a 16-second video segment with this pipeline takes roughly 15 seconds to produce occupancy (occ) and mask data.
## ✨ Key Features
- End-to-End Reconstruction: Directly predicts affine-invariant camera poses and scale-invariant global point clouds from RGB video streams.
- Automated Voxelization: Converts unstructured point clouds into structured Occupancy Grids.
- Visibility Analysis: Performs real-time ray casting based on camera intrinsics and extrinsics to compute visible regions (visible masks) and occlusion relationships.
- 4D Data Serialization:
  - Sparse OCC: Utilizes Sparse CSR matrices to store temporal occupancy, significantly reducing disk usage.
  - Packed Mask: Implements bit-packing (via `np.packbits`) for visibility masks to optimize storage efficiency.
- Multi-Dataset Adaptation: Built-in generators for both `SimpleVideo` (single video) and `InternData-N1` (large-scale datasets).
- Professional Visualization: Mayavi-based 3D rendering tools for generating side-by-side comparison videos of point clouds, trajectories, and occupancy.
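The two serialization strategies above can be sketched as follows; the grid sizes and occupancy densities here are illustrative assumptions, not the project's actual on-disk layout:

```python
import numpy as np
from scipy import sparse

# Illustrative sizes (assumptions, not the shipped configuration).
num_frames, num_voxels = 5, 40 * 40 * 40

# Sparse OCC: flatten each frame's occupancy to one row of a CSR matrix,
# so only the occupied voxels consume storage.
occ = np.random.rand(num_frames, num_voxels) > 0.98   # mostly-empty grid
occ_csr = sparse.csr_matrix(occ)

# Packed Mask: bit-pack boolean visibility masks, 8 voxels per byte.
mask = np.random.rand(num_frames, num_voxels) > 0.5
packed = np.packbits(mask, axis=-1)                   # uint8, ~1/8 the size

# Round trip: unpack and trim any padding bits added by packbits.
restored = np.unpackbits(packed, axis=-1)[:, :num_voxels].astype(bool)
assert np.array_equal(restored, mask)
```

A CSR row per frame keeps per-frame slicing cheap, while bit-packing trades a trivial unpack step for an 8x reduction in mask storage.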
## 💡 Future Work
- [ ] Semantic Point Cloud: Integrate semantic and instance segmentation to enhance reconstruction quality.
- [ ] Multi-modal Fusion: Integrate depth maps to enhance reconstruction quality.
## 🚀 Quick Start
### 1. Clone & Install Dependencies
(1). Clone the Repository

```bash
git clone --recursive https://github.com/SCEIRobotics/L3ROcc.git
cd L3ROcc
```
(2). Install Python Dependencies

i. For Production (Generating OCC data for InternData-N1/LeRobot):

Python 3.10+ is recommended. Install the following dependencies:

```bash
conda create -n <env> python=3.10 -y
conda activate <env>
pip install -e .
pip install -e third_party/pi3
```
ii. For Visualization (Rendering dynamic videos & 3D inspection):

3D rendering requires a GUI environment. Set up this environment on your local machine (Windows/macOS/Linux), not on a remote server:

```bash
# Run on your local machine
conda create -n <env> python=3.8 -y
conda activate <env>
pip install -r requirements_visual.txt
conda install -c conda-forge mayavi
```

(Note: Ensure you have a working OpenGL environment for Mayavi rendering.)
### 2. Model Checkpoints
Place the $\pi^3$ model weights (model.safetensors) and configuration files in the ckpt/ directory at the project root. If the automatic download from Hugging Face is slow, you can download the model checkpoint manually from here.
### 3. Run Example

The pipeline supports three primary modes:
Mode A: Generate Visualized Dynamic Video

Use this to create side-by-side comparison videos from your own footage, with history frames.

```bash
python tools/run_normal_data_occ.py --video_path data/examples/office.mp4 --save_dir data/examples/outputs/ --pcd_save True --mode visual --mesh False
```
Mode B: Generate LeRobot-Compatible Data

Use this to generate the standard dataset structure for model training.

```bash
python tools/run_normal_data_occ.py --video_path data/examples/office.mp4 --save_dir data/examples/outputs/ --pcd_save True --mode run --mesh False
```
Mode C: Batch Process InternData-N1 Dataset

Use this to process the full InternData-N1 directory with scale alignment enabled. This mode supports breakpoint resumption with the following logic:

- `mask_sequence.npz` (final result file) exists: skip processing and continue with the next trajectory.
- `overwrite=True`: force reprocessing and overwrite existing files.
- Otherwise: proceed with processing.

```bash
python tools/run_intern_nav_occ.py --dataset_root data/examples/small_vln_n1/traj_data --output_root data/examples/small_vln_n1_4/traj_data --pcd_save True --overwrite False --mesh False
```
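The resumption rules amount to a simple per-trajectory check. The helper below is hypothetical, and the result-file path is inferred from the output layout described later in this README; the actual script may structure this differently:

```python
from pathlib import Path

def should_process(traj_dir: str, overwrite: bool = False) -> bool:
    # Hypothetical helper illustrating the breakpoint-resumption logic.
    result = (Path(traj_dir) / "videos" / "chunk-000"
              / "observation.occ.mask" / "mask_sequence.npz")
    if overwrite:
        return True      # force reprocessing, overwriting existing files
    if result.exists():
        return False     # final result already present: skip this trajectory
    return True          # otherwise: process
```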
## 🛠️ Pipeline Details
### 1. Data Generators
Located in L3ROcc/generater/, the project includes two core generators:
- `SimpleVideoDataGenerator`: Best for individual videos; automatically builds standard directory structures including `meta/`, `videos/`, and `data/`.
- `InternData-N1DataGenerator`: Designed for large-scale InternData-N1 data enhancement; supports scale alignment using Sim3 to ensure reconstruction coordinates match ground truth.
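Sim3 scale alignment of this kind can be computed in closed form with the Umeyama method. The sketch below shows the generic technique under the assumption of matched 3D point pairs; it is not the project's actual implementation:

```python
import numpy as np

def sim3_align(src: np.ndarray, dst: np.ndarray):
    """Closed-form Sim(3) fit (Umeyama): find scale s, rotation R, and
    translation t minimizing ||dst - (s * src @ R.T + t)||^2."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    xs, xd = src - mu_s, dst - mu_d
    cov = xd.T @ xs / len(src)                  # 3x3 cross-covariance
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0                          # keep R a proper rotation
    R = U @ S @ Vt
    var_s = (xs ** 2).sum() / len(src)          # variance of source points
    scale = np.trace(np.diag(D) @ S) / var_s
    t = mu_d - scale * R @ mu_s
    return scale, R, t
```

Applying the recovered `scale` is what maps the model's relative units onto the metric ground-truth coordinates.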
### 2. Core Configuration
Parameters can be tuned in L3ROcc/configs/config.yaml:
- `pc_range`: Spatial clipping and perception range for the point cloud, `[x_min, y_min, z_min, x_max, y_max, z_max]`, where x is the left-right direction, y is the height direction (-y is upward), and z is the front-back direction.
- `voxel_size`: Base size for occupancy voxels (e.g., 0.02 m), which directly determines the sparsity of the occupancy voxel map.
- `occ_size`: Number of voxel grids in each spatial dimension, derived from `(pc_range_max - pc_range_min) / voxel_size`, with no independent configuration.
- `interval`: Frame sampling interval for video processing.
- `history_len`: Number of past frames to include in history (default: 10).
- `history_step`: Step size for history frame sampling (default: 2).
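The derived grid resolution follows directly from the range and voxel size; the numbers below are illustrative, not the shipped defaults:

```python
# Illustrative values showing how occ_size is derived from pc_range and
# voxel_size (these are not the project's actual defaults).
pc_range = [-4.0, -1.0, 0.0, 4.0, 1.0, 8.0]   # [x_min, y_min, z_min, x_max, y_max, z_max]
voxel_size = 0.02

occ_size = [
    round((pc_range[i + 3] - pc_range[i]) / voxel_size)
    for i in range(3)
]
print(occ_size)  # → [400, 100, 400]
```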
### 3. Dataset Structure & Contents
(1). InternData-N1 Format
The following structure is generated under each trajectory directory (e.g., trajectory_1) to ensure compatibility with robotics learning frameworks:
```
trajectory_1/
├── data/
│   └── chunk-000/                        # Core Geometric Assets
│       ├── all_occ.npz                   # Global scene occupancy grid
│       ├── origin_pcd.ply                # Downsampled global point cloud
│       └── episode_000000.parquet        # Per-frame poses and intrinsics
├── meta/                                 # Metadata & Statistics
│   ├── info.json                         # Dataset schema and feature definitions
│   ├── episodes.jsonl                    # Episode metadata and Sim3 scale factors
│   ├── episodes_stats.jsonl              # Feature statistics (min/max/mean/std)
│   └── tasks.jsonl                       # Task descriptions
└── videos/
    └── chunk-000/                        # Temporal Sequences
        ├── observation.occ.mask/
        │   └── mask_sequence.npz         # Temporal visibility bitmask
        ├── observation.occ.view/
        │   └── occ_sequence.npz          # Temporal egocentric occupancy
        └── observation.video.trajectory/
            └── reference.mp4             # Original RGB source video
```
i. data/chunk-000/ (Core Geometric Assets)
- all_occ.npz: Stores the global occupancy grid of the entire scene in world coordinates.
- origin_pcd.ply: The initial global point cloud reconstructed from the video, optimized via voxel downsampling for efficient processing.
- `episode_000000.parquet`: A structured data table containing per-frame high-level features:
  - Camera `Intrinsics_occ`: 3x3 matrices re-estimated via Least Squares/DLT based on local geometry.
  - Camera `Extrinsics_occ`: 4x4 extrinsic matrices predicted by the $\pi^3$ model and aligned to world coordinates.
ii. meta/ (Metadata & Statistics)
- info.json: Defines the dataset schema, including the data types and shapes for observation.camera_extrinsic_occ and observation.camera_intrinsic_occ.
- episodes.jsonl: Contains episode-level constants, most notably the Sim3 Scale Factor used to align the model's relative units to real-world metric scales.
- episodes_stats.jsonl: Automatically calculates the statistical distribution (min, max, mean, std) for all observation vectors.
- tasks.jsonl: Provides task descriptions and objectives for the dataset.
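These metadata files follow the JSONL convention of one JSON object per line, which makes them easy to read without any special tooling. A minimal reader sketch (the `sim3_scale` field name in the usage note is an assumption for illustration; the actual key depends on the generator):

```python
import json

def read_jsonl(path: str):
    # One JSON object per line, as in meta/episodes.jsonl.
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]
```

For example, `read_jsonl("meta/episodes.jsonl")` returns one dict per episode, from which an episode's Sim3 scale factor can be looked up.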
iii. videos/chunk-000/ (Temporal Sequences)
- observation.occ.mask/mask_sequence.npz: A time-series of visibility masks. It uses an optimized Bit-packing format to store which voxels are currently visible within the camera's frustum.
- observation.occ.view/occ_sequence.npz: A time-series of egocentric occupancy data. Each frame represents the occupied voxels in the current camera coordinate system, stored as a Sparse CSR Matrix to minimize storage overhead.
- observation.video.trajectory/reference.mp4: The original RGB video sequence used as input for reconstruction.
(2). Visual Format
Outputs generated by the visual_pipeline are tailored for rendering and manual inspection:
| Directory/File | Description |
|----------------|-------------|
| `merge_npy_sequ
