Room-Across-Room (RxR) Dataset

Room-Across-Room (RxR) is a large-scale multilingual dataset for Vision-and-Language Navigation (VLN) in Matterport3D environments. In contrast to related datasets such as Room-to-Room (R2R), RxR is 10x larger, comprising 126k navigation instructions in English, Hindi and Telugu plus 126k navigation following demonstrations, with longer and more variable paths, and it includes fine-grained visual groundings that relate each word to pixels and surfaces in the environment.

RxR is released as gzipped JSON Lines and numpy archives, and has four components: guide annotations, follower annotations, pose traces, and text features. The guide annotations alone are akin to R2R and sufficient to run the standard VLN setup.

Reference

The RxR dataset is described in Room-Across-Room: Multilingual Vision-and-Language Navigation with Dense Spatiotemporal Grounding.

Bibtex:

@inproceedings{rxr,
  title={{Room-Across-Room}: Multilingual Vision-and-Language Navigation with Dense Spatiotemporal Grounding},
  author={Alexander Ku and Peter Anderson and Roma Patel and Eugene Ie and Jason Baldridge},
  booktitle={Conference on Empirical Methods for Natural Language Processing (EMNLP)},
  year={2020}
}

Dataset Download

To download the full 161GB dataset (guide annotations, follower annotations, pose traces, text features, grounded landmarks, and model-generated instructions), install the gsutil tool and run:

gsutil -m cp -R gs://rxr-data .

The -m flag makes gsutil copy with multiple threads and processes, with the number of each determined by the parallel_thread_count and parallel_process_count values set in the boto configuration file. You may want to experiment with these values to find the fastest download.
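For example, these counts live in the [GSUtil] section of the boto configuration file; the values below are illustrative starting points, not recommendations:

```
[GSUtil]
parallel_process_count = 4
parallel_thread_count = 10
```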

Alternatively, see the instructions below for downloading separate components of the dataset.

Downloading Guide Annotations

Each JSON Lines entry contains a guide annotation for a path in the environment.

Data schema:

{'split': str,
 'instruction_id': int,
 'annotator_id': int,
 'language': str,
 'path_id': int,
 'scan': str,
 'path': Sequence[str],
 'heading': float,
 'instruction': str,
 'timed_instruction': Sequence[Mapping[str, Union[str, float]]],
 'edit_distance': float}

Field descriptions:

  • split: The annotation split: train, val_seen, val_unseen, test_standard.
  • instruction_id: Uniquely identifies the guide annotation.
  • annotator_id: Uniquely identifies the guide annotator.
  • language: The IETF BCP 47 language tag: en-IN, en-US, hi-IN, te-IN.
  • path_id: Uniquely identifies a path sampled from the Matterport3D environment.
  • scan: Uniquely identifies a scan in the Matterport3D environment.
  • path: A sequence of panoramic viewpoints along the path.
  • heading: The initial heading in radians. Following R2R, the heading angle is zero facing the y-axis with z-up, and increases by turning right.
  • instruction: The navigation instruction.
  • timed_instruction: A sequence of time-aligned words in the instruction. Note that a small number of words are missing the start_time and end_time fields.
    • word: The aligned utterance.
    • start_time: The start of the time span, w.r.t. the recording.
    • end_time: The end of the time span, w.r.t. the recording.
  • edit_distance: Edit distance between the manually transcribed instruction and the automatic transcript generated by the Google Cloud Speech-to-Text API.
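To illustrate how the timed_instruction alignments can be consumed, here is a small sketch (the helper and example data are ours, not part of the dataset tooling) that collects the words spoken up to a given time, skipping words with missing alignments:

```python
def words_spoken_by(timed_instruction, t):
    """Return the aligned words whose span starts at or before time t (seconds).

    Words missing the start_time/end_time fields are skipped, per the
    caveat in the field description above.
    """
    return [w["word"] for w in timed_instruction
            if "start_time" in w and w["start_time"] <= t]

# Example entries with the same shape as the dataset's timed_instruction.
timed = [
    {"word": "Okay,", "start_time": 0.4, "end_time": 1.0},
    {"word": "now", "start_time": 1.0, "end_time": 1.3},
    {"word": "turn"},  # a word with no time alignment
]
print(words_spoken_by(timed, 1.1))  # -> ['Okay,', 'now']
```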

Sample entry:

{'path_id': 11,
 'split': 'val_seen',
 'scan': '2n8kARJN3HM',
 'heading': 3.105381634905035,
 'path': ['d38a4c31821c48ac9082d896e628c128',
  '1d6a100cf3d34326936ef7d0a50840d9',
  '87998608d4844fcfaca266bd5aba6516',
  '5248918af65645a28a65f59d3424598a',
  'e0b2917ecb6d4e31846957451348f80a'],
 'instruction_id': 26,
 'annotator_id': 19,
 'language': 'en-IN',
 'instruction': 'Okay, now you are in a room facing towards two bathtubs, one on the right side ...',
 'timed_instruction': [{'start_time': 0.4, 'word': 'Okay,', 'end_time': 1.0}, ...],
 'edit_distance': 0.11137440758293839}

Guide annotations can be downloaded from these links:

Downloading Follower Annotations

Each JSON Lines entry contains a follower annotation for an instruction in the guide annotations.

Data schema:

{'demonstration_id': int,
 'instruction_id': int,
 'annotator_id': int,
 'path': Sequence[str],
 'metrics': {'ne': float,
  'sr': float,
  'spl': float,
  'dtw': float,
  'ndtw': float,
  'sdtw': float}}

Field descriptions:

  • demonstration_id: Uniquely identifies the follower annotation.
  • instruction_id: Uniquely identifies the guide annotation being followed.
  • annotator_id: Uniquely identifies the follower annotator.
  • path: A sequence of panoramic viewpoints along the path traversed by the follower (which may differ from the guide path).
  • metrics: Evaluation metrics for the follower path measured against the guide path:
    • ne: Navigation Error, the shortest-path distance between the final viewpoints of the follower and guide paths.
    • sr: Success Rate, whether ne is less than 3 meters.
    • spl: Success weighted by Path Length, success weighted by traversal efficiency.
    • dtw: Dynamic Time Warping (DTW) cost, the divergence between the guide and follower paths.
    • ndtw: Normalized DTW cost.
    • sdtw: Success-weighted normalized DTW cost.
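The DTW-based metrics above can be sketched as follows. This is our minimal illustration, assuming some distance function dist(a, b) in meters between viewpoints (the released metrics use shortest-path distances in the scan's navigation graph), with nDTW computed as exp(-DTW / (|R| * threshold)):

```python
import math

def dtw(follower, reference, dist):
    """Dynamic Time Warping cost between two viewpoint sequences."""
    n, m = len(follower), len(reference)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = dist(follower[i - 1], reference[j - 1]) + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def ndtw(follower, reference, dist, threshold=3.0):
    """Normalized DTW: exp(-DTW / (|R| * threshold))."""
    return math.exp(-dtw(follower, reference, dist)
                    / (len(reference) * threshold))

# Toy example with 1-D "viewpoints" and absolute-difference distance.
d = lambda a, b: abs(a - b)
print(ndtw([0, 1, 2], [0, 1, 2], d))  # identical paths -> 1.0
```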

Sample entry:

{'demonstration_id': 26,
 'instruction_id': 26,
 'annotator_id': 43,
 'path': ['d38a4c31821c48ac9082d896e628c128',
  '1d6a100cf3d34326936ef7d0a50840d9',
  'd8eb4eab2d3442e1a3a7a74fc810be22',
  '87998608d4844fcfaca266bd5aba6516',
  '5248918af65645a28a65f59d3424598a',
  'e0b2917ecb6d4e31846957451348f80a'],
 'metrics': {'ne': 0,
  'sr': 1.0,
  'spl': 0.9479355350139731,
  'dtw': 1.5461700564297585,
  'ndtw': 0.9020566069282614,
  'sdtw': 0.9020566069282614}}

Follower annotations can be downloaded from these links:

Extended Dataset

The extended RxR dataset is substantially larger and comprises many small files, so we recommend installing the gsutil tool to copy it in parallel. Downloading individual files via URL is also an option.

Downloading Pose Traces

Guide and follower annotations are paired with their corresponding pose traces: a sequence of snapshots that capture the annotator's virtual camera pose and field-of-view. The naming convention for these files is {instruction_id:06}_guide_pose_trace.npz for guide annotations and {demonstration_id:06}_follower_pose_trace.npz for follower annotations.

Data schema:

{'pano': (np.str, [n, 1]),
 'time': (np.float32, [n, 1]),
 'audio_time': (np.float32, [n, 1]),
 'extrinsic_matrix': (np.float32, [n, 16]),
 'intrinsic_matrix': (np.float32, [n, 16]),
 'image_mask': (np.bool, [k, 128, 256]),
 'text_masks': (np.bool, [k, m]),
 'feature_weights': (np.float32, [k, 36])}

Where n is the number of snapshots, k is the number of panoramic viewpoints in the associated path, and m is the number of BERT subword tokens in the tokenized instruction.

Field descriptions:

  • pano: Panoramic viewpoint of the snapshot.
  • time: Timestamp of the snapshot in seconds.
  • audio_time: Timestamp corresponding to the follower's progress listening to the guide's instruction recording. This is only included in follower pose traces.
  • extrinsic_matrix: The extrinsic parameters, or pose, of the annotator's camera.
  • intrinsic_matrix: The intrinsic parameters, or projection matrix, of the annotator's camera.
  • image_mask: Mask indicating the pixels observed in the panorama. This mask is in equirectangular format, with heading angle 0 being the center of the image.
  • text_masks: Mask indicating the utterances that have been spoken or heard by the guide or follower, respectively, at this panoramic viewpoint.
  • feature_weights: An image_mask in feature space, corresponding to the typical setting in which 36 image features are generated at 12 heading and 3 elevation increments.
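A pose trace archive can be loaded directly with numpy. The sketch below is ours: the loader simply materializes the arrays, and camera_positions assumes a world-to-camera extrinsic convention (if the matrices are camera-to-world, the translation column is the position directly); the synthetic one-snapshot trace stands in for a real {instruction_id:06}_guide_pose_trace.npz file:

```python
import io
import numpy as np

def load_pose_trace(path_or_file):
    """Load a pose trace .npz archive into a plain dict of arrays."""
    with np.load(path_or_file) as data:
        return {key: data[key] for key in data.files}

def camera_positions(extrinsic_matrix):
    """World-space camera positions from [n, 16] row-major 4x4 extrinsics.

    Assumes world-to-camera matrices, so the position is -R^T t.
    """
    mats = np.asarray(extrinsic_matrix, dtype=np.float32).reshape(-1, 4, 4)
    R, t = mats[:, :3, :3], mats[:, :3, 3]
    return -np.einsum("nij,ni->nj", R, t)

# Synthetic one-snapshot trace: identity rotation, translation (1, 2, 3).
ext = np.eye(4, dtype=np.float32)
ext[:3, 3] = [1.0, 2.0, 3.0]
buf = io.BytesIO()
np.savez(buf, extrinsic_matrix=ext.reshape(1, 16),
         time=np.array([[0.0]], dtype=np.float32))
buf.seek(0)
trace = load_pose_trace(buf)
pos = camera_positions(trace["extrinsic_matrix"])  # world position (-1, -2, -3)
```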