Room-Across-Room (RxR) Dataset
Room-Across-Room (RxR) is a large-scale, multilingual dataset for Vision-and-Language Navigation (VLN) in Matterport3D environments. It contains 126k navigation instructions in English, Hindi and Telugu, paired with 126k navigation following demonstrations, with dense spatiotemporal alignments between the text and the visual perceptions of the annotators. In contrast to related datasets such as Room-to-Room (R2R), RxR is 10x larger, has longer and more variable paths, and includes fine-grained visual groundings that relate each word to pixels/surfaces in the environment.

RxR is released as gzipped JSON Lines and numpy archives, and has four components: guide annotations, follower annotations, pose traces, and text features. The guide annotations alone are akin to R2R and sufficient to run the standard VLN setup.
Reference
The RxR dataset is described in Room-Across-Room: Multilingual Vision-and-Language Navigation with Dense Spatiotemporal Grounding.
Bibtex:
@inproceedings{rxr,
title={{Room-Across-Room}: Multilingual Vision-and-Language Navigation with Dense Spatiotemporal Grounding},
author={Alexander Ku and Peter Anderson and Roma Patel and Eugene Ie and Jason Baldridge},
booktitle={Conference on Empirical Methods for Natural Language Processing (EMNLP)},
year={2020}
}
Dataset Download
To download the full 161GB dataset (guide annotations, follower annotations, pose traces, text features, grounded landmarks, and model-generated instructions), install the gsutil tool and run:
gsutil -m cp -R gs://rxr-data .
With the -m flag, gsutil downloads in parallel using multiple threads and processes. The thread and process counts are determined by the parallel_thread_count and parallel_process_count values set in the boto configuration file. You may want to experiment with these values for the fastest download.
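These values live in the [GSUtil] section of the boto configuration file (typically ~/.boto). As an illustrative starting point only, not a recommendation from the dataset authors:

[GSUtil]
parallel_process_count = 4
parallel_thread_count = 10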
Alternatively, see the instructions below for downloading separate components of the dataset.
Downloading Guide Annotations
Each JSON Lines entry contains a guide annotation for a path in the environment.
Data schema:
{'split': str,
'instruction_id': int,
'annotator_id': int,
'language': str,
'path_id': int,
'scan': str,
'path': Sequence[str],
'heading': float,
'instruction': str,
'timed_instruction': Sequence[Mapping[str, Union[str, float]]],
'edit_distance': float}
Field descriptions:
- split: The annotation split: train, val_seen, val_unseen, test_standard.
- instruction_id: Uniquely identifies the guide annotation.
- annotator_id: Uniquely identifies the guide annotator.
- language: The IETF BCP 47 language tag: en-IN, en-US, hi-IN, te-IN.
- path_id: Uniquely identifies a path sampled from the Matterport3D environment.
- scan: Uniquely identifies a scan in the Matterport3D environment.
- path: A sequence of panoramic viewpoints along the path.
- heading: The initial heading in radians. Following R2R, the heading angle is zero facing the y-axis with z-up, and increases by turning right.
- instruction: The navigation instruction.
- timed_instruction: A sequence of time-aligned words in the instruction. Note that a small number of words are missing the start_time and end_time fields.
  - word: The aligned utterance.
  - start_time: The start of the time span, w.r.t. the recording.
  - end_time: The end of the time span, w.r.t. the recording.
- edit_distance: Edit distance between the manually transcribed instruction and the automatic transcript generated by the Google Cloud Speech-to-Text API.
Sample entry:
{'path_id': 11,
'split': 'val_seen',
'scan': '2n8kARJN3HM',
'heading': 3.105381634905035,
'path': ['d38a4c31821c48ac9082d896e628c128',
'1d6a100cf3d34326936ef7d0a50840d9',
'87998608d4844fcfaca266bd5aba6516',
'5248918af65645a28a65f59d3424598a',
'e0b2917ecb6d4e31846957451348f80a'],
'instruction_id': 26,
'annotator_id': 19,
'language': 'en-IN',
'instruction': 'Okay, now you are in a room facing towards two bathtubs, one on the right side ...',
'timed_instruction': [{'start_time': 0.4, 'word': 'Okay,', 'end_time': 1.0}, ...],
'edit_distance': 0.11137440758293839}
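As a minimal sketch of reading these files, the following Python iterates a gzipped guide annotation file. The local filename rxr_train_guide.jsonl.gz is an assumed download location, and the word-timing guard reflects the note above that a few words lack start_time/end_time:

import gzip
import json

def read_guide_annotations(path):
    # Each line of the gzipped file is one JSON guide annotation.
    with gzip.open(path, 'rt', encoding='utf-8') as f:
        for line in f:
            yield json.loads(line)

for entry in read_guide_annotations('rxr_train_guide.jsonl.gz'):
    if entry['language'].startswith('en'):
        # Some words are missing start_time/end_time, so guard the access.
        timed = [(w['word'], w['start_time'], w['end_time'])
                 for w in entry['timed_instruction']
                 if 'start_time' in w and 'end_time' in w]
        print(entry['instruction_id'], entry['scan'], len(timed))
        break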
Guide annotations can be downloaded from these links:
- rxr_train_guide (72.1M)
- rxr_val_seen_guide (12.9M)
- rxr_val_unseen_guide (12M)
- rxr_test_standard (1.9M) - used for the RxR competition
- rxr_test_challenge (1.9M) - used for the RxR-Habitat competition
Downloading Follower Annotations
Each JSON Lines entry contains a follower annotation for an instruction in the guide annotations.
Data schema:
{'demonstration_id': int,
'instruction_id': int,
'annotator_id': int,
'path': Sequence[str],
'metrics': {'ne': float,
'sr': float,
'spl': float,
'dtw': float,
'ndtw': float,
'sdtw': float}}
Field descriptions:
- demonstration_id: Uniquely identifies the follower annotation.
- instruction_id: Uniquely identifies the guide annotation being followed.
- annotator_id: Uniquely identifies the follower annotator.
- path: A sequence of panoramic viewpoints along the path traversed by the follower (which may differ from the guide path).
- metrics: Evaluation metrics for the follower path measured against the guide path:
  - ne: Navigation Error, the shortest-path distance between the final viewpoints of the two paths.
  - sr: Success Rate, whether ne is less than 3 meters.
  - spl: Success weighted by Path Length, success weighted by traversal efficiency.
  - dtw: Dynamic Time Warping (DTW) cost, the divergence between guide and follower paths.
  - ndtw: Normalized DTW cost.
  - sdtw: Success-weighted normalized DTW cost.
Sample entry:
{'demonstration_id': 26,
'instruction_id': 26,
'annotator_id': 43,
'path': ['d38a4c31821c48ac9082d896e628c128',
'1d6a100cf3d34326936ef7d0a50840d9',
'd8eb4eab2d3442e1a3a7a74fc810be22',
'87998608d4844fcfaca266bd5aba6516',
'5248918af65645a28a65f59d3424598a',
'e0b2917ecb6d4e31846957451348f80a'],
'metrics': {'ne': 0,
'sr': 1.0,
'spl': 0.9479355350139731,
'dtw': 1.5461700564297585,
'ndtw': 0.9020566069282614,
'sdtw': 0.9020566069282614}}
Follower annotations can be downloaded from these links:
Extended Dataset
The extended RxR dataset is substantially larger and comprises many small files. We therefore recommend installing the gsutil tool to copy the dataset in parallel. Individual download via URL is also an option.
Downloading Pose Traces
Guide and follower annotations are paired with their corresponding pose traces:
a sequence of snapshots that capture the annotator's virtual camera pose and
field-of-view. The naming convention for these files is
{instruction_id:06}_guide_pose_trace.npz for guide annotations and
{demonstration_id:06}_follower_pose_trace.npz for follower annotations.
Data schema:
{'pano': (np.str, [n, 1]),
'time': (np.float32, [n, 1]),
'audio_time': (np.float32, [n, 1]),
'extrinsic_matrix': (np.float32, [n, 16]),
'intrinsic_matrix': (np.float32, [n, 16]),
'image_mask': (np.bool, [k, 128, 256]),
'text_masks': (np.bool, [k, m]),
'feature_weights': (np.float32, [k, 36])}
Here n is the number of snapshots, k is the number of panoramic viewpoints in the associated path, and m is the number of BERT subword tokens in the tokenized instruction.
Field descriptions:
- pano: Panoramic viewpoint of the snapshot.
- time: Timestamp of the snapshot in seconds.
- audio_time: Timestamp corresponding to the follower's progress listening to the guide's instruction recording. This is only included in follower pose traces.
- extrinsic_matrix: The extrinsic parameters, or pose, of the annotator's camera.
- intrinsic_matrix: The intrinsic parameters, or projection matrix, of the annotator's camera.
- image_mask: Mask indicating the pixels observed in the panorama. This mask is in equirectangular format, with heading angle 0 being the center of the image.
- text_masks: Mask indicating the utterances that have been spoken or heard by the guide or follower, respectively, at this panoramic viewpoint.
- feature_weights: An image_mask in feature space, corresponding to the typical setting in which 36 image features are generated at 12 heading and 3 elevation increments.
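A minimal loading sketch, assuming the files sit in a local pose_traces/ directory (an assumption; substitute wherever you copied them). Reshaping the flattened [n, 16] matrices to [n, 4, 4] is an inference from the shapes above:

import numpy as np

instruction_id = 26  # from a guide annotation, e.g. the sample entry above
trace = np.load(f'pose_traces/{instruction_id:06}_guide_pose_trace.npz')

extrinsics = trace['extrinsic_matrix'].reshape(-1, 4, 4)  # one 4x4 pose per snapshot
print('snapshots:', extrinsics.shape[0])
print('viewpoints:', np.unique(trace['pano']))

# Fraction of the first panorama the annotator actually observed.
print('coverage of first panorama:', trace['image_mask'][0].mean())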
