Alfred

ALFRED - A Benchmark for Interpreting Grounded Instructions for Everyday Tasks

Mohit Shridhar, Jesse Thomason, Daniel Gordon, Yonatan Bisk,
Winson Han, Roozbeh Mottaghi, Luke Zettlemoyer, Dieter Fox
CVPR 2020

ALFRED (Action Learning From Realistic Environments and Directives) is a new benchmark for learning a mapping from natural language instructions and egocentric vision to sequences of actions for household tasks. We include long, compositional rollouts with non-reversible state changes, among other phenomena, to shrink the gap between research benchmarks and real-world applications.

For the latest updates, see: askforalfred.com

What's more? Check out ALFWorld: interactive TextWorld environments for ALFRED scenes!

Quickstart

Clone repo:

$ git clone https://github.com/askforalfred/alfred.git alfred
$ export ALFRED_ROOT=$(pwd)/alfred

Install requirements:

$ virtualenv -p $(which python3) --system-site-packages alfred_env # or whichever package manager you prefer
$ source alfred_env/bin/activate

$ cd $ALFRED_ROOT
$ pip install --upgrade pip
$ pip install -r requirements.txt

Download trajectory JSONs and ResNet features (~17GB):

$ cd $ALFRED_ROOT/data
$ sh download_data.sh json_feat
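
Since the download is large, it can help to confirm there is enough free disk space first. The following is a minimal sketch, not part of the ALFRED repo; the function names and the 20% safety margin are assumptions made for illustration, and only the ~17GB figure comes from the text above.

```python
import shutil

REQUIRED_GB = 17  # approximate size of the json_feat download quoted above


def free_gb(path="."):
    """Return the free space, in GiB, on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free / 1024**3


def has_room(path=".", required_gb=REQUIRED_GB, headroom=1.2):
    """True if free space covers the download plus a margin (assumed 20%)."""
    return free_gb(path) >= required_gb * headroom


if __name__ == "__main__":
    print(f"free: {free_gb():.1f} GiB, enough for json_feat: {has_room()}")
```

Run it from $ALFRED_ROOT/data (or pass that path) before starting the download.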

Train models:

$ cd $ALFRED_ROOT
$ python models/train/train_seq2seq.py --data data/json_feat_2.1.0 --model seq2seq_im_mask --dout exp/model:{model},name:pm_and_subgoals_01 --splits data/splits/oct21.json --gpu --batch 8 --pm_aux_loss_wt 0.1 --subgoal_aux_loss_wt 0.1
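
The `--dout` value above contains a `{model}` placeholder. A plausible reading, stated here as an assumption rather than a description of the training script, is that the placeholder is filled from the `--model` argument with Python string formatting. The helper name `resolve_dout` is hypothetical.

```python
def resolve_dout(template, **args):
    """Fill `{name}` placeholders in an output-directory template."""
    return template.format(**args)


if __name__ == "__main__":
    dout = resolve_dout(
        "exp/model:{model},name:pm_and_subgoals_01",
        model="seq2seq_im_mask",
    )
    print(dout)  # exp/model:seq2seq_im_mask,name:pm_and_subgoals_01
```

Under that assumption, checkpoints for the command above would land in exp/model:seq2seq_im_mask,name:pm_and_subgoals_01.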

More Info

Dataset: Downloading full dataset, Folder structure, JSON structure.
Models: Training and Evaluation, File structure, Pre-trained models.
Data Generation: Generation, Replay Checks, Data Augmentation (high-res, depth, segmentation masks etc).
Errata: Updated numbers for Goto subgoal evaluation.
THOR 2.1.0 Docs: Deprecated documentation from the AI2-THOR 2.1.0 release.
FAQ: Frequently Asked Questions.

SOTA Models

Open-source models that outperform the Seq2Seq baselines from ALFRED:

LLM-Planner: Few-Shot Grounded Planning for Embodied Agents with Large Language Models
Chan Hee Song, Jiaman Wu, Clayton Washington, Brian M. Sadler, Wei-Lun Chao, Yu Su
<a href="https://arxiv.org/pdf/2212.04088">Paper</a>, <a href="https://github.com/OSU-NLP-Group/LLM-Planner">Code</a>

Context-Aware Planning and Environment-Aware Memory for Instruction Following Embodied Agents
Byeonghwi Kim, Jinyeon Kim, Yuyeong Kim, Cheolhong Min, Jonghyun Choi
<a href="https://arxiv.org/pdf/2308.07241.pdf">Paper</a>, <a href="https://github.com/snumprlab/capeam">Code</a>

Multi-Level Compositional Reasoning for Interactive Instruction Following
Suvaansh Bhambri*, Byeonghwi Kim*, Jonghyun Choi
<a href="https://arxiv.org/pdf/2308.09387.pdf">Paper</a>, <a href="https://github.com/yonseivnl/mcr-agent">Code</a>

Agent with the Big Picture: Perceiving Surroundings for Interactive Instruction Following
Byeonghwi Kim, Suvaansh Bhambri, Kunal Pratap Singh, Roozbeh Mottaghi, Jonghyun Choi
<a href="https://embodied-ai.org/papers/Agent-with-the-Big-Picture.pdf">Paper</a>, <a href="https://github.com/snumprlab/abp">Code</a>

FILM: Following Instructions in Language with Modular Methods
So Yeon Min, Devendra Singh Chaplot, Pradeep Ravikumar, Yonatan Bisk, Ruslan Salakhutdinov
<a href="https://arxiv.org/pdf/2110.07342.pdf">Paper</a>, <a href="https://github.com/soyeonm/FILM">Code</a>

A Persistent Spatial Semantic Representation for High-level Natural Language Instruction Execution
Valts Blukis, Chris Paxton, Dieter Fox, Animesh Garg, Yoav Artzi
<a href="https://arxiv.org/pdf/2107.05612.pdf">Paper</a>, <a href="https://github.com/valtsblukis/hlsm">Code</a>

Hierarchical Task Learning from Language Instructions with Unified Transformers and Self-Monitoring
Yichi Zhang, Joyce Chai
<a href="https://aclanthology.org/2021.findings-acl.368/">Paper</a>, <a href="https://github.com/594zyc/HiTUT">Code</a>

Episodic Transformer for Vision-and-Language Navigation
Alexander Pashevich, Cordelia Schmid, Chen Sun
<a href="https://arxiv.org/pdf/2105.06453.pdf">Paper</a>, <a href="https://github.com/alexpashevich/E.T.">Code</a>

MOCA: A Modular Object-Centric Approach for Interactive Instruction Following
Kunal Pratap Singh*, Suvaansh Bhambri*, Byeonghwi Kim*, Roozbeh Mottaghi, Jonghyun Choi
<a href="https://arxiv.org/abs/2012.03208">Paper</a>, <a href="https://github.com/gistvision/moca">Code</a>

Embodied BERT: A Transformer Model for Embodied, Language-guided Visual Task Completion
Alessandro Suglia, Qiaozi Gao, Jesse Thomason, Govind Thattai, Gaurav Sukhatme
<a href="https://arxiv.org/abs/2108.04927">Paper</a>, <a href="https://github.com/amazon-research/embert">Code</a>

Contact Mohit to add your model here.

Prerequisites

Python 3
PyTorch 1.1.0
Torchvision 0.3.0
AI2THOR 2.1.0

See requirements.txt for all prerequisites
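
To compare an environment against the pinned versions listed above, a small check like the following can help. It is a sketch, not part of the repo: `PINNED`, `installed_version`, and `check_pins` are illustrative names, and the distribution names `torch`, `torchvision`, and `ai2thor` are assumed to be the names these packages are installed under. It needs Python 3.8 or later for `importlib.metadata`.

```python
from importlib import metadata

# Versions taken from the prerequisites list; distribution names are assumed.
PINNED = {"torch": "1.1.0", "torchvision": "0.3.0", "ai2thor": "2.1.0"}


def installed_version(dist_name):
    """Return the installed version string, or None if not installed."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def check_pins(pinned=PINNED, lookup=installed_version):
    """Return {name: (wanted, found)} for every missing or mismatched package."""
    mismatches = {}
    for name, wanted in pinned.items():
        found = lookup(name)
        # A local build tag such as "1.1.0+cu100" still counts as a match.
        if found is None or found.split("+")[0] != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    for name, (wanted, found) in check_pins().items():
        print(f"{name}: wanted {wanted}, found {found}")
```

An empty result means all three pins match; requirements.txt remains the full list of dependencies.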

Hardware

Tested on:

GPU - GTX 1080 Ti (11GB)
CPU - Intel Xeon (Quad Core)
RAM - 16GB
OS - Ubuntu 16.04

Leaderboard

⚠️ Update (Apr 2025): As of April 2025, the Ai2 leaderboard has been deprecated. Please see the instructions below on email submissions.

Run your model on test seen and unseen sets, and create an action-sequence dump of your agent:

$ cd $ALFRED_ROOT
$ python models/eval/leaderboard.py --model_path <model_path>/model.pth --model models.model.seq2seq_im_mask --data data/json_feat_2.1.0 --gpu --num_threads 5

This will create a JSON file, e.g. task_results_20191218_081448_662435.json, inside the <model_path> folder. Email this file to askforalfred@googlegroups.com, preferably as a storage link on a platform such as Google Drive or Dropbox.
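
If the script has been run several times, the model folder may hold more than one dump. The sketch below picks the most recent one and confirms it parses as JSON before you send it. It assumes only the file-name pattern shown in the example above (task_results_ followed by a timestamp); the helper names are hypothetical and nothing is assumed about the file's contents.

```python
import json
from pathlib import Path


def latest_results_file(model_dir):
    """Return the newest task_results_*.json in model_dir, or None.

    The timestamp in the name is zero-padded, so sorting by name
    also sorts by time.
    """
    candidates = sorted(Path(model_dir).glob("task_results_*.json"))
    return candidates[-1] if candidates else None


def load_results(path):
    """Parse the dump; raises json.JSONDecodeError if the file is corrupt."""
    with open(path) as f:
        return json.load(f)


if __name__ == "__main__":
    newest = latest_results_file(".")
    if newest is None:
        print("no task_results_*.json found")
    else:
        load_results(newest)
        print(f"ready to submit: {newest}")
```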

The results will be available at askforalfred.com/leaderboard/leaderboard.html.

Rules:

You are only allowed to use RGB and language instructions (goal & step-by-step) as input for your agents. You cannot use additional depth, mask, metadata info etc. from the simulator on Test Seen and Test Unseen scenes. However, during training you are allowed to use additional info for auxiliary losses etc.
During evaluation, agents are restricted to max_steps=1000 and max_fails=10. Do not change these settings in the leaderboard script; such modifications will not be reflected on the evaluation server.
Do not spam the leaderboard with repeated submissions (under different email accounts) in order to optimize on the test set. Fine-tuning should be done only on the validation set, NOT on the leaderboard test set.
Pick a legible model name for the submission. Just "baseline" is not very descriptive.
All submissions must be attempts to solve the ALFRED dataset.
Answer the following questions in the description:
  a. Did you use additional sensory information from THOR as input (e.g. depth, segmentation masks, class masks, panoramic images) at test time? If so, please report it.
  b. Did you use the alignments between step-by-step instructions and expert action sequences for training or testing? (No by default; the instructions are serialized into a single sentence.)
Share who you are: provide a team name and affiliation.
(Optional) Share how you solved it: if possible, share information about how the task was solved. Link an academic paper or code repository if public.
Only submit your own work: you may evaluate any model on the validation set, but must only submit your own work for evaluation against the test set.
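
To make the max_steps and max_fails limits concrete, here is an illustrative episode loop that stops when either budget is exhausted. It is a sketch under stated assumptions, not the leaderboard script: `run_episode` and `step_fn` are hypothetical names, and the exact point at which the real evaluator ends an episode (for example, at or after the tenth failed action) may differ.

```python
MAX_STEPS = 1000  # limits quoted in the rules above
MAX_FAILS = 10


def run_episode(step_fn, max_steps=MAX_STEPS, max_fails=MAX_FAILS):
    """Run one episode under step and failure budgets.

    step_fn(t) is a hypothetical callback that executes the agent's
    action at step t and returns (action_succeeded, done).
    Returns (steps_taken, failed_actions, done).
    """
    steps = fails = 0
    done = False
    while steps < max_steps and fails < max_fails and not done:
        ok, done = step_fn(steps)
        steps += 1
        if not ok:
            fails += 1
    return steps, fails, done


if __name__ == "__main__":
    # An agent whose every action fails is cut off by the failure budget.
    print(run_episode(lambda t: (False, False)))  # (10, 10, False)
```

Iterating against a loop like this on the validation set keeps local results comparable to runs made under the fixed limits.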

Submissions:

Only one submission is allowed every 7 days. All submissions will be made public. Please do not create anonymous emails for multiple submissions. Use the val set to iterate on your agent.

Docker Setup

Install Docker and NVIDIA Docker.

Modify docker_build.py and docker_run.py to your needs.

Build

Build the image:

$ python scripts/docker_build.py

Run (Local)

For local machines:

$ python scripts/docker_run.py
 
  source ~/alfred_env/bin/activate
  cd $ALFRED_ROOT

Run (Headless)

For headless VMs and Cloud-Instances:

$ python scripts/docker_run.py --headless 

  # inside docker
  tmux new -s startx  # start a new tmux session

  # start nvidia-xconfig
  sudo nvidia-xconfig -a --use-display-device=None --virtual=1280x1024

  # start X server on DISPLAY 0
  # single X server should be sufficient for multiple instances of THOR
  sudo python ~/alfred/scripts/startx.py 0  # if this throws errors e.g "(EE) Server terminated with error (1)" or "(EE) already running ..." try a display > 0

  # detach from tmux shell
  # Ctrl+b then d

  # source env
  source ~/alfred_env/bin/activate
  
  # set DISPLAY varia
