LaNMP

LaNMP is a mobile manipulation robot dataset comprising Natural Language, Navigation, Manipulation, and Perception (LaNMP) data. The dataset is collected in both simulated and real-world environments. The environments are multi-room, ensuring the tasks are long-horizon in nature. The tasks are pick-and-place tasks described to a robot by humans in natural language. The trajectories, collected from robots via human teleoperation, contain LaNMP data at every time step. There are 524 simulated and 50 real trajectories, for a total of 574 trajectories.

Dataset Link

https://www.dropbox.com/scl/fo/c1q9s420pzu1285t1wcud/AGMDPvgD5R1ilUFId0i94KE?rlkey=7lwmxnjagi7k9kgimd4v7fwaq&dl=0

Data Card Author(s)

  • Name, Team: Ahmed Jaafar (Owner)

Authorship

Publishers

Publishing Organization(s)

Brown University, Rutgers University, University of Pennsylvania

Industry Type(s)

  • Academic - Tech

Contact Detail(s)

Author(s)

  • Ahmed Jaafar, Brown University
  • Shreyas Sundara Raman, Brown University
  • Yichen Wei, Brown University
  • Sofia Juliani, Rutgers University
  • Anneke Wernerfelt, University of Pennsylvania
  • Ifrah Idrees, Brown University
  • Jason Xinyu Liu, Brown University
  • Stefanie Tellex, Associate Professor, Brown University

Funding Sources

Institution(s)

  • Office of Naval Research (ONR)
  • National Science Foundation (NSF)
  • Amazon Robotics

Funding or Grant Summary(ies)

This work is supported by ONR under grant award numbers N00014-22-1-2592 and N00014-23-1-2794, NSF under grant award number CNS-2150184, and with support from Amazon Robotics.

Dataset Overview

Data Subject(s)

  • Data about places and objects
  • Synthetically generated data
  • Data about systems or products and their behaviors
  • Others (Language data provided by humans, robot movement and visual data)

Dataset Snapshot

| Category | Data |
| --- | --- |
| Size of Dataset | 288,400 MB |
| Number of Instances | 574 |
| Human Labels | 574 |
| Capabilities | 4 |
| Avg. Trajectory Length | 247 |
| Number of environments | 8 |
| Number of rooms | 30 |
| Number of actions | 12 |
| Number of robots | 2 |

Above: The numbers combine the simulated and real datasets. "Capabilities" refers to the high-level aspects/modalities this dataset covers: natural language, navigation, manipulation, and perception. "Human Labels" refers to the natural language commands of robot tasks provided by humans. "Number of actions" refers to the high-level discrete actions, which apply to simulation only.

Additional Notes: The robots used are mobile manipulators. The simulated robot is from ManipulaTHOR, and the real robot is a quadruped with an arm, a Boston Dynamics Spot.

Content Description

Every data point in simulation (a trajectory time step) contains: the natural language command, egocentric RGB-D, instance segmentations, bounding boxes, robot body pose, robot end-effector pose, and grasped object poses.

Every data point in the real world (a trajectory time step) contains, at a high level: the natural language command, left and right egocentric (fisheye) RGB-D, gripper RGB-D, gripper instance segmentations, robot body pose, robot arm pose, feet positions, joint angles, robot body velocity, robot arm velocity, gripper open percentage, and an object-held boolean.

Descriptive Statistics

| Statistic | Simulation Trajectories | Real Trajectories |
| --- | --- | --- |
| Count | 524 | 50 |
| Mean | 172 | 323 |
| Std | 71 | 187 |
| Min | 52 | 123 |
| Max | 594 | 733 |

Above: The mean, std, min, and max refer to trajectory lengths.

Sensitivity of Data

Sensitivity Type(s)

  • User Content
  • Anonymous Data
  • Others (Robot movement and visual data)

Risk Type(s)

  • No Known Risks

Dataset Version and Maintenance

Maintenance Status

Actively Maintained - No new versions will be made available, but this dataset will be actively maintained, including but not limited to updates to the data.

Version Details

Current Version: 1.0

Last Updated: 06/2024

Release Date: 06/2024

Maintenance Plan

Ahmed Jaafar will be maintaining this dataset and resolving dataset issues brought up by the community.

Example of Data Points

Primary Data Modality

  • Multimodal (Natural Language, Vision, Navigation, Manipulation)

Data Fields

| Simulation | Value | Description |
| --- | --- | --- |
| Natural Language | "Go pick up the apple and put it on the couch." | The command the human tells the robot for completing a certain task |
| Scene | "FloorPlan_Train8_1" | The simulation environment in AI2THOR |
| Sim time | 0.19645 | The simulation time |
| Wall clock time | 14:49:37 | The real-world time |
| Body state | [4.0, 6.2, 7.5, 226] | The global state of the robot, [x, y, z, yaw] |
| End-effector state | [2.59, 0.89, -4.17, -1.94, -1.27, 1.94] | The global state of the robot's end-effector, [x, y, z, roll, pitch, yaw] |
| Hand sphere radius | 0.059 | The radius of the hand grasp field |
| Held objects | [Apple] | A list of objects currently held by the robot |
| Held object state | [4.4, 2.3, 5.1] | The global state of the currently held objects, [x, y, z] |
| Bounding boxes | {"keys": [Apple], "values": [418, 42, 23, 321]} | The objects detected with bounding boxes and the coordinates of those boxes |
| RGB | ./rgb_0.npy | The path to the RGB .npy egocentric image of the time step |
| Depth | ./depth_0.npy | The path to the depth .npy egocentric image of the time step |
| Instance segmentations | ./inst_seg_0.npy | The path to the instance segmentation .npy egocentric image of the time step |
| Real-world | Value | Description |
| --- | --- | --- |
| Natural Language | "Go pick up the apple and put it on the couch." | The command the human tells the robot for completing a certain task |
| Scene | "FloorPlan_Train8_1" | The simulation environment in AI2THOR |
| Wall clock time | 14:49:37 | The real-world time |
| Body state | [4.0, 6.2, 7.5] | The global Euclidean state of the robot, [x, y, z] |
| Body state quaternion | [0.04, 0, 0, 0.99] | The global quaternion state of the robot body, [w, x, y, z] |
| Body orientation | [0, 0.17, 3.05] | The global rotation of the robot body, [roll, pitch, yaw] |
| Body linear velocity | [0, 0.5, 0.1] | The linear velocity of the robot body, [x, y, z] |
| Body angular velocity | [0, 0.5, 0.1] | The angular velocity of the robot body, [x, y, z] |
| Arm state | [0.5, 0, 0.26] | The robot arm state relative to the body, [x, y, z] |
| Arm quaternion state | [0.99, 0, 0.7, 0.008] | The quaternion robot arm state relative to the body, [w, x, y, z] |
| Arm state global | [1.9, 0.5, 0] | The global robot arm state, [x, y, z] |
| Arm quaternion state global | [0.04, 0, 0, 0.99] | The global quaternion robot arm state, [w, x, y, z] |
| Arm linear velocity | [0.2, 0.04, 0] | The linear velocity of the robot arm, [x, y, z] |
| Arm angular velocity | [0.1, 0.4, 0.008] | The angular velocity of the robot arm, [x, y, z] |
| Arm stowed | 1 | Boolean indicating whether the arm is stowed |
| Gripper open | 0.512 | The percentage of how open the gripper is |
| Object held | 1 | Boolean indicating whether an object is currently held by the gripper |
| Feet state | [0.32, 0.17, 0], ... | The state of the four quadruped feet relative to the body, [x, y, z] |
| Feet state global | [-0.21, 0.05, 0], ... | The global state of the four quadruped feet |
| Joint angles | {fl.hx: -0.05, fl.hy: 0.79, fl.kn: -1.57, ...} | The angles of all the quadruped's joints |
| Joint velocities | {fl.hx: 0.004, fl.hy: 0.01, fl.kn: 0.57, ...} | The velocities of all the quadruped's joints |
| Left RGB | ./left_fisheye_image_0.npy | The path of the left-eye RGB egocentric image, which captures the right side of the view |
| Right RGB | ./right_fisheye_image_0.npy | The path of the right-eye RGB egocentric image, which captures the left side of the view |
| Left depth | ./left_fisheye_depth_0.npy | The path of the left-eye depth egocentric image, which captures the right side of the view |
| Right depth | ./right_fisheye_depth_0.npy | The path of the right-eye depth egocentric image, which captures the left side of the view |
| Left instance segmentations | ./left_fisheye_image_instance_seg_0.npy | The path of the left-eye instance segmentation egocentric image, which captures the right side of the view |
| Right instance segmentations | ./right_fisheye_image_instance_seg_0.npy | The path of the right-eye instance segmentation egocentric image, which captures the left side of the view |
| Gripper RGB | ./gripper_image_0.npy | The path of the gripper RGB image |
| Gripper depth | ./gripper_depth_0.npy | The path of the gripper depth image |
| Gripper instance segmentations | ./gripper_image_instance_seg_0.npy | The path of the gripper instance segmentation image |

Typical Data Point

Simulation:

{
    "nl_command": "Go to the table and pick up the salt and place it in the white bin in the living room.",
    "scene": "FloorPlan_Train8_1",
    "steps": [
        {
            "sim_time": 0.1852477639913559,
            "wall-clock_time": "15:10:47.900",
            "action": "Initialize",
            "state_body": [3.0, 0.9009992480278015, -4.5, 269.9995422363281],
            "state_ee": [2.5999975204467773, 0.8979992270469666, -4.171003341674805, -1.9440563492718068e-07, -1.2731799533306385, 1.9440386333307377e-07],
            "hand_sphere_radius": 0.05999999865889549,
            "held_objs": [],
            "held_objs_state": {},
            "inst_det2D": {
                "keys": [
                    "Wall_4|0.98|1.298|-2.63",
                    "RemoteControl|+01.15|+00.48|-04.24"
                ],
                "values": [
                    [418, 43, 1139, 220], [315, 0, 417, 113], ...
                ]
            },
            "rgb": "./rgb_0.npy",
            "depth": "./depth_0.npy",
            "inst_seg": "./inst_seg_0.npy"
        }
    ]
}
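
The sketch below shows one way to read a simulation trajectory in this format. The folder name `sim_trajectory_0/` and the file name `trajectory.json` are hypothetical placeholders (the dataset's actual on-disk names may differ); the per-step image paths are assumed to be relative to the trajectory folder, as in the example above.

```python
# Minimal sketch of reading one simulation trajectory. "sim_trajectory_0/" and
# "trajectory.json" are hypothetical placeholders for the actual on-disk layout.
import json
from pathlib import Path

import numpy as np

traj_dir = Path("sim_trajectory_0")
with open(traj_dir / "trajectory.json") as f:
    traj = json.load(f)

print(traj["nl_command"], "|", traj["scene"])

for step in traj["steps"]:
    body = step["state_body"]                   # [x, y, z, yaw]
    ee = step["state_ee"]                       # [x, y, z, roll, pitch, yaw]
    rgb = np.load(traj_dir / step["rgb"])       # egocentric RGB frame
    depth = np.load(traj_dir / step["depth"])   # egocentric depth frame
    boxes = dict(zip(step["inst_det2D"]["keys"],
                     step["inst_det2D"]["values"]))  # object ID -> bounding-box coordinates
```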

Real-world:

{
  "language_command": "Go pick up Hershey's syrup in the room with the big window and bring it to the room with the other Spot.",
  "scene_name": "",
  "wall_clock_time": "12:50:10.923",
  "left_fisheye_rgb": "./Trajectories/trajectories/data_3/folder_0.zip/left_fisheye_image_0.npy",
  "left_fisheye_depth": "./Trajectories/trajectories/data_3/folder_0.zip/left_fisheye_depth_0.npy",
  "right_fisheye_rgb": "./Trajectories/trajectories/data_3/folder_0.zip/right_fisheye_image_0.npy",
  "right_fisheye_depth": "./Trajectories/trajectories/data_3/folder_0.zip/right_fisheye_depth_0.npy",
  "gripper_rgb": "./Trajectories/trajectories/data_3/folder_0.zip/gripper_image_0.npy",
  "gripper_depth": "./Trajectories/trajectories/data_3/folder_0.zip/gripper_depth_0.npy",
  "left_fisheye_instance_seg": "./Trajectories/trajectories/data_3/folder_0.zip/left_fisheye_image_instance_seg_0.npy",
  "right_fisheye_instance_seg": "./Trajectories/trajectories/data_3/folder_0.zip/right_fisheye_image_instance_seg_0.npy",
  "gripper_fisheye_instance_seg": "./Trajectories/trajectories/data_3/folder_0.zip/gripper_image_instance_seg_0.npy",
  "body_state": {"x": 1.7732375781707208, "y": -0.2649551302417769, "z": 0.04729541059536978},
  "body_quaternion": {"w": 0.11121513326494507, "x": 0.00003060940357089109, "y": 0.0006936040684443222, "z": 0.9937961119411372},
  "body_orientation": {"r": 0.0017760928400286857, "p": 0.016947586302323542, "y": 2.919693676695565},
  "body_linear_velocity": {"x": 0.0007985030885781894, "y": 0.0007107887103978708, "z": -0.00001997174236456424},
  "body_angular_velocity": {"x": -0.002894917543479851, "y": -0.0017834609980581554, "z": 0.00032649917985633773},
  "arm_state_rel_body": {"x": 0.5536401271820068, "y": 0.0001991107128560543, "z": 0.2607555091381073},
  "arm_quaternion_rel_body": {"w": 0.9999642968177795, "x": 0.00019104218517895788, "y": 0.008427758701145649, "z": 0.008427758701145649},
  "arm_orientation_rel_body": {"x": 0.0003903917486135314, "y": 0.016855526363847233, "z":0.0009807885066525242},
  "arm_state_global": {"x": 1.233305266138133, "y": 0.0001991107128560543, "z": 0.2607555091381073},
  "arm_quaternion_global": {"w": 0.11071797661404018, "x": -0.0083232786094425, "y": 0.0018207155823512953, "z": 0.9938152930378756},
  "arm_orientation_global": {"x": 0.0017760928400286857, "y": 0.016947586302323542, "z": 2.919693676695565},
  "arm_linear_velocity": {"x": -0.00015927483240388228, "y": 0.00006229256340773636, "z": -0.003934306244239418},
  "arm_angular_velocity": {"x": 0.02912604479413378, "y": -0.012041083915871545, "z": 0.009199674753842119},
  "arm_stowed": 1,
  "gripper_open_percentage": 0.521618127822876,
  "object_held": 0,
  "feet_state_rel_body": [
    {"x": 0.32068437337875366, "y": 0.17303785681724548, "z": -0.5148577690124512},
    {"x": 0.32222312688827515, "y": -0.17367061972618103, "z": -0.5163648128509521},
    ...
  ],
  "feet_state_global": [
    {"x": -0.35111223090819643, "y": -0.0985760241189894, "z": -0.5146475087953596},
    {"x": -0.27597323368156573, "y": 0.239893453842677, "z": -0.5166350285289446},
    ...
  ],
  "all_joint_angles": {"fl.hx": 0.013755097053945065, "fl.hy": 0.7961212992668152, "fl.kn": -1.5724135637283325, ...},
  "all_joint_velocities": {"fl.hx": -0.007001522462815046, "fl.hy": 0.0006701984675601125, "fl.kn": 0.00015050712681841105, ...}
}
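
As a companion sketch for the real-world format, the snippet below parses one Spot data point and converts the body quaternion to a yaw angle. The file name `real_step_0.json` is hypothetical, and the quaternion-to-yaw conversion is a standard formula included only to illustrate how the pose fields relate; the result should roughly match the yaw stored in `body_orientation`.

```python
# Minimal sketch for one real-world (Spot) data point, assuming it is stored as a JSON
# object shaped like the example above. "real_step_0.json" is a hypothetical file name.
import json
import math

import numpy as np

with open("real_step_0.json") as f:
    step = json.load(f)

def quat_to_yaw(q: dict) -> float:
    """Yaw (rotation about z) from a {'w','x','y','z'} quaternion dict (standard conversion)."""
    w, x, y, z = q["w"], q["x"], q["y"], q["z"]
    return math.atan2(2.0 * (w * z + x * y), 1.0 - 2.0 * (y * y + z * z))

body_xyz = (step["body_state"]["x"], step["body_state"]["y"], step["body_state"]["z"])
yaw = quat_to_yaw(step["body_quaternion"])   # should agree with step["body_orientation"]["y"]

gripper_rgb = np.load(step["gripper_rgb"])   # image paths are stored relative to the dataset root
holding_object = bool(step["object_held"])
```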

Motivations & Intentions

Motivations

Purpose(s)

  • Research

Domain(s) of Application

Robotics, Imitation Learning, Behavior Cloning, Reinforcement Learning, Machine Learning

Motivating Factor(s)

There have been recent advances in robotic mobile manipulation; however, the field as a whole still lags behind. We believe one reason is a lack of useful and difficult benchmarks for mobile manipulation models. In particular, there were no benchmarks providing data for long-horizon, room-to-room pick-and-place tasks that combine natural language, navigation, manipulation, and perception in both simulation and the real world, including a quadruped.

Intended Use

Dataset Use(s)

  • Safe for research use

Suitable Use Case(s)

Suitable Use Case: Training and testing behavior cloning models.

Suitable Use Case: Learning reward functions via inverse reinforcement learning.

Suitable Use Case: Robot skill learning.

Suitable Use Case: Providing in-context examples for robot planning.

Research and Problem Space(s)

This dataset is intended to serve as a benchmark addressing the gap in integrating natural language, navigation, manipulation, and perception for pick-and-place mobile manipulation tasks that span room-to-room and floor-to-floor in both simulated and real environments. Mobile manipulation is lagging behind overall, and we believe one of the reasons is a lack of difficult, comprehensive benchmarks that models in development can be tested against. LaNMP aims to fill this gap.

Citation Guidelines

Guidelines & Steps: As simple as referencing the BibTeX below.

BibTeX:

Coming soon!

Access

Access

Access Type

  • External - Open Access

Documentation Link(s)

Provenance

Collection

Method(s) Used

  • Crowdsourced - Paid
  • Crowdsourced - Volunteer
  • Survey, forms, or polls
  • Others (keyboard teleoperation, tablet/joystick-controller teleoperation)

Methodology Detail(s)

Collection Type

Source: Prolific.

Platform: Prolific, a crowdsourcing platform for researchers to collect data.

Is this source considered sensitive or high-risk? No

Dates of Collection: [03 2024 - 04 2024]

Primary modality of collection data:

  • Text Data

Update Frequency for collected data:

  • Static

Additional Notes: Used to collect the natural language commands. Crowdsourced humans explore the simulated environments and come up with commands for tasks the robot can do in those environments.

Collection Type

Source: Human teleoperation

Platform: AI2THOR simulator

Is this source considered sensitive or high-risk? No

Dates of Collection: [03 2024 - 04 2024]

Primary modality of collection data:

  • Multimodal (Navigation, Manipulation, Vision)

Update Frequency for collected data:

  • Static

Additional Notes: Humans teleoperate a simulated robot via keyboard to collect the robot trajectory data.

Collection Type

Source: Human speech

Platform: N/A

Is this source considered sensitive or high-risk? No

Dates of Collection: [05 2024]

Primary modality of collection data:

  • Text Data

Update Frequency for collected data:

  • Static

Additional Notes: Used to collect the natural language commands. Humans explore the real-world environments and come up with commands for tasks the robot can do in those environments.

Collection Type

Source: Human teleoperation

Platform: Boston Dynamics Spot

Is this source considered sensitive or high-risk? No

Dates of Collection: [05 2024]

Primary modality of collection data:

  • Multimodal (Navigation, Manipulation, Vision)

Update Frequency for collected data:

  • Static

Additional Notes: A human teleoperates a real quadruped robot via a tablet/joystick controller to collect the robot trajectory data.

Collection Cadence

Static: Data was collected once from single or multiple sources.

Data Processing

Collection Method or Source

Description: Natural language commands

Methods employed: Other humans manually corrected grammatical mistakes in the collected natural language commands, and deleted commands that the robot could not execute or that did not match the desired research goal.

Tools or libraries: N/A

Collection Method or Source

Description: Robot trajectories

Methods employed: Other humans manually deleted incomplete trajectories.

Tools or libraries: N/A

Collection Criteria

Data Selection

  • Natural language commands: The selection criteria included commands that describe a pick-and-place task, where the robot picks up an object and places it somewhere else, and that require the robot to go from room to room.
  • Trajectories: The selection criteria included trajectories that execute the commands in the most efficient manner, minimize robot lag, and do not collide with objects in the environment.

Relationship to Source

Benefit and Value(s)

  • Combines natural language, navigation, manipulation, and perception robot data
  • Mobile manipulation pick-and-place tasks that span rooms, and in some cases floors, making them long-horizon
  • Utilizes a quadruped, which can handle terrain that other robots cannot, such as stairs, enabling cross-floor tasks
  • Diverse environments and objects

Limitation(s) and Trade-Off(s)

  • Only pick-and-place tasks
  • No ground-truth goal position of the target object
  • Size

Human and Other Sensitive Attributes

Sensitive Human Attribute(s)

  • Language

Intentionality

Intentionally Collected Attributes

Human attributes were labeled or collected as a part of the dataset creation process.

| Field Name | Description |
| --- | --- |
| nl_command | Natural language commands given by humans telling the robot what task to do in the simulator |
| language_command | Natural language commands given by humans telling the robot what task to do in the real world |

Unintentionally Collected Attributes

Human attributes were not explicitly collected as a part of the dataset creation process but can be inferred using additional methods.

N/A

Rationale

We wanted to capture a natural distribution of the commands humans would give a household robot to complete long-horizon mobile manipulation tasks. Rather than automatically generating the commands with tools such as LLMs, we wanted to capture what humans really want done in households by assistant robots, so we had humans provide the commands. Since the ultimate goal is to one day have assistive robots in homes and workplaces, capturing the commands humans would eventually give them is crucial for the research and development needed to reach that goal.

Source(s)

  • Human Attribute: Prolific.com
  • Human Attribute: In-person humans

Extended Use

Use with Other Data

Safety Level

  • Safe to use with other data

Best Practices

  • Make sure the datasets are both in the same format
  • Do not mix at the time step level, only at the trajectory level; e.g., another dataset's trajectory Y can come after LaNMP trajectory X, but X's and Y's time steps should not be interleaved

Forking & Sampling

Safety Level

  • Safe to fork and/or sample

Acceptable Sampling Method(s)

  • Cluster Sampling
  • Haphazard Sampling
  • Multi-stage sampling
  • Random Sampling
  • Stratified Sampling
  • Systematic Sampling
  • Weighted Sampling

Best Practice(s)

Do not sample at the time step level, only at the trajectory level; e.g., sample trajectories 4-15 but not the time steps within those trajectories (see the sketch below).
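
A minimal sketch of trajectory-level sampling and splitting, assuming each simulation trajectory is saved as its own JSON file under a hypothetical sim_trajectories/ folder; the folder name, file layout, and split code are illustrative (the 85/15 ratio matches the splits reported later in this card).

```python
# Minimal sketch of trajectory-level sampling and an 85/15 trajectory-level split.
# "sim_trajectories/" and the one-JSON-file-per-trajectory layout are hypothetical.
import random
from pathlib import Path

traj_files = sorted(Path("sim_trajectories").glob("*.json"))  # one file per trajectory (assumed)

random.seed(0)

# Sample whole trajectories; never subsample or interleave time steps across trajectories.
sampled = random.sample(traj_files, k=min(50, len(traj_files)))

# Trajectory-level 85/15 train/test split (the ratio used for the simulation data).
shuffled = traj_files[:]
random.shuffle(shuffled)
cut = int(0.85 * len(shuffled))
train_trajs, test_trajs = shuffled[:cut], shuffled[cut:]
```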

Use in ML or AI Systems

Dataset Use(s)

  • Training
  • Testing
  • Validation
  • Fine Tuning

Notable Feature(s)

Exploration Demo: Google Colab notebook

Distribution(s)

| Set | Number of data points |
| --- | --- |
| Train | 446 |
| Test | 78 |

Above: We do not tune hyperparameters, so we only use train and test splits (85% and 15%, respectively). This applies to the simulation data only.

Additional Notes: This split was only used during the task generalization experiment. More details in the paper.

Split Statistics

| Statistic | Train | Test |
| --- | --- | --- |
| Count | 446 | 78 |

Above: Train and test split counts for the simulation data (85% and 15%, respectively).

Transformations

Synopsis

Transformation(s) Applied

  • Other (Fixing grammatical mistakes in the natural language commands)

Field(s) Transformed

Transformation Type

| Field Name | Description |
| --- | --- |
| nl_command | Natural language commands given by humans telling the robot what task to do in the simulator |
| language_command | Natural language commands given by humans telling the robot what task to do in the real world |

Additional Notes: Grammatical mistakes in the commands were fixed, and trajectories with incomplete commands were deleted.

Library(ies) and Method(s) Used

Transformation Type

Method: Manually fixing grammatically incorrect natural language commands and injecting the corrected versions into their respective trajectories to replace the previously saved incorrect commands, and deleting trajectories with incomplete commands, e.g., "Pick up the blue".

Transformation Results: Trajectories with the fixed commands, and fewer trajectories overall due to the deletion of those with incomplete commands.

Annotations & Labeling

Annotation Workforce Type

  • Human Annotations (Expert)
  • Human Annotations (Non-Expert)
  • Human Annotations (Employees)
  • Human Annotations (Crowdsourcing)

Annotation Characteristic(s)

| Expert | Number |
| --- | --- |
| Number of unique annotations | 50 |
| Total number of annotations | 50 |
| Average annotations per example | 1 |
| Number of annotators | 1 |
| Number of annotators per example | 1 |

Above: The real-world robot trajectory execution (teleoperation) data collection done by one of the authors.

| Non-Expert | Number |
| --- | --- |
| Number of unique annotations | 50 |
| Total number of annotations | 50 |
| Average annotations per example | 1 |
| Number of annotators | 7 |
| Number of annotators per example | 1 |

Above: Humans that gave natural language commands of tasks for the real-world robot to execute.

| Employees | Number |
| --- | --- |
| Number of unique annotations | 524 |
| Total number of annotations | 524 |
| Average annotations per example | 1 |
| Number of annotators | 15 |
| Number of annotators per example | 1 |

Above: Humans that executed the trajectories in the simulator.

| Crowdsourcing | Number |
| --- | --- |
| Number of unique annotations | 524 |
| Total number of annotations | 524 |
| Average annotations per example | 1 |
| Number of annotators | 41 |
| Number of annotators per example | 1 |

Above: Humans that gave natural language commands of tasks for the simulated robot to execute.

Annotation Description(s)

Expert

Description: The real-world robot trajectory execution (teleoperation) data collection done by one of the authors.

Link: N/A

Platforms, tools, or libraries:

  • Boston Dynamics Spot

Non-Expert

Description: Humans that gave natural language commands of tasks for the real-world robot to execute.

Link: N/A

Platforms, tools, or libraries:

  • N/A

Employees

Description: Humans that executed the trajectories in the simulator.

Link: https://ai2thor.allenai.org/

Platforms, tools, or libraries:

  • AI2THOR

Crowdsourcing

Description: Humans that gave natural language commands of tasks for the simulated robot to execute.

Link: https://www.prolific.com/

Platforms, tools, or libraries:

  • Prolific

Human Annotators

Annotator Description(s)

Expert Real-Robot Trajectory Collection

Task type: The real-world robot trajectory execution (teleoperation) data collection done by one of the authors

Number of unique annotators: 1

Expertise of annotators: Expert

Description of annotators: An author.

Language distribution of annotators: English

Geographic distribution of annotators: United States

Annotation platforms: Boston Dynamics Spot

Non-Expert Real-Robot Command Collection

Task type: Humans that gave natural language commands of tasks for the real-world robot to execute

Number of unique annotators: 7

Expertise of annotators: Non-Expert

Description of annotators: Students

Language distribution of annotators: English

Geographic distribution of annotators: United States

Annotation platforms: N/A

Employed Simulator Command Collection

Task type: Humans that executed the trajectories in the simulator

Number of unique annotators: 7

Expertise of annotators: Non-Expert

Description of annotators: General adults

Language distribution of annotators: English

Geographic distribution of annotators: United States and United Kingdom

Annotation platforms: Prolific.com

Language(s)

  • English [100%]

Above: All the natural language commands.

Sampling Methods

Method(s) Used

  • Unsampled

Known Applications & Benchmarks

ML Application(s)

Classification, Regression, Supervised Learning, Imitation Learning

Evaluation Result(s)

RT-1

Model Card: On page 21 of the paper.

ALFRED Seq2Seq

Model Card: No card available. Please refer to the GitHub repo instead.

Evaluation Results

| Model | SR | Length | Grasp SR | RMSE vs. GT | Weighted $\Delta_\text{xyz}$ | CLIP EMA Score | End Goal Dist | CE Loss |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **Cross-Scene** | | | | | | | | |
| ALFRED Seq2Seq | 0.0 | 655.09 ± 450.52 | 0.0 | 3.11 ± 0.63 | 0.0026 ± 0.0035 | 0.1614 ± 0.0120 | 12.42 ± 5.44 | 286.77 ± 20.31 |
| RT-1 | 0.0 | 205.03 ± 27.36 | 0.0 | 9.50 ± 0.27 | 1.3423 ± 0.1133 | 0.1521 ± 0.0065 | 12.56 ± 6.67 | 80.98 ± 4.68 |
| **Task Generalization** | | | | | | | | |
| ALFRED Seq2Seq | 0.0 | 501.60 ± 578.62 | 0.0 | 3.01 ± 1.18 | 0.0008 ± 0.0014 | 0.1681 ± 0.0327 | 12.83 ± 11.12 | 286.66 ± 398.80 |
| RT-1 | 0.0 | 199.56 ± 106.11 | 0.0 | 9.74 ± 1.67 | 1.3980 ± 0.5834 | 0.1488 ± 0.0243 | 12.40 ± 12.20 | 82.61 ± 1.81 |
| Ground Truth | 1.0 | 171.69 ± 70.80 | 1.0 | --- | 0.5576 ± 0.1751 | 0.2067 ± 0.0311 | --- | --- |

Additional Notes: These results are from the simulation data only.

Evaluation Process(es)

Metrics used:

  • Task Success (GTR): a binary value measuring whether an agent achieves the goal/completes the task specified in the command.
  • Distance From Goal (GTR): the spatial distance between the agent's final position after executing a learned trajectory and the designated gold goal state.

    $$d = \frac{1}{2}\left(\sqrt{x_{\text{gt\_body},n}^2 - x_{\text{eval\_body},n}^2} + \sqrt{x_{\text{gt\_ee},n}^2 - x_{\text{eval\_ee},n}^2}\right)$$

  • Grasp Success Rate (GTR): the efficacy of the agent's attempts to grasp objects in the scene; specifically, the percentage of attempts that result in successful object acquisition.
  • Average RMSE (GTR): the average root-mean-square error of the agent's body and end-effector coordinates between the generated trajectory and the ground truth. It reports a weighted average between body and end-effector errors normalized across the maximum length of both trajectories.

    $$\text{RMSE} = \sum_{i=0}^{n} \frac{1}{2}\left(\sqrt{x_{\text{gt\_body},i}^2 - x_{\text{eval\_body},i}^2} + \sqrt{x_{\text{gt\_ee},i}^2 - x_{\text{eval\_ee},i}^2}\right)$$

  • Average Number of Steps (GTR): the total number of actions an agent takes. It serves to evaluate a model's ability to replicate efficient human navigation.
  • Mean and Standard Deviation in State Differences (GTI): the standard deviation in positional differences between successive time steps in a trajectory. It assesses the control smoothness exhibited by the agent in order to compare learned trajectories against the fluidity and naturalness of the ground-truth trajectories.

    $$\Delta = \sum_{i=1}^{n} \frac{1}{2}\left(\sqrt{x_{\text{eval\_body},i}^2 - x_{\text{eval\_body},i-1}^2} + \sqrt{x_{\text{eval\_ee},i}^2 - x_{\text{eval\_ee},i-1}^2}\right)$$

  • CLIP Embedding Reward (GTI): the exponential moving average (EMA) of CLIP text-image correlation scores over all steps of a trajectory. Natural language task specifications can be ambiguous and difficult to formulate into a structured goal condition. Inspired by previous work using CLIP for RL rewards, we propose this metric to capture complex semantic correlations between the trajectory and the task specification, i.e., understanding, reasoning about, and grounding a task in the CLIP embedding space. This provides a measure of the agent's task comprehension and execution fidelity (see the sketch after this list).

    $$\text{EMA}_i = \alpha\,\text{EMA}_{i-1} + (1-\alpha)\,r_i, \quad \text{where } r_i := \text{CLIP}(\text{task}, \text{img}_i)$$
    

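A minimal sketch of the CLIP Embedding Reward computation, assuming a placeholder clip_score(task, image) function standing in for any CLIP text-image similarity implementation; the smoothing factor alpha and the EMA initialization are assumptions, as their exact values are not specified in this card.

```python
# Minimal sketch of the CLIP Embedding Reward (GTI): an exponential moving average of
# per-step CLIP text-image scores. `clip_score` is a hypothetical stand-in for any CLIP
# similarity function; `alpha` and the EMA initialization are assumptions.
from typing import Callable, Sequence

import numpy as np

def clip_ema_reward(task: str,
                    frames: Sequence[np.ndarray],
                    clip_score: Callable[[str, np.ndarray], float],
                    alpha: float = 0.9) -> float:
    """Return EMA_n, where EMA_i = alpha * EMA_{i-1} + (1 - alpha) * r_i and r_i = CLIP(task, img_i)."""
    ema = 0.0
    for i, frame in enumerate(frames):
        r_i = clip_score(task, frame)                      # r_i := CLIP(task, img_i)
        ema = r_i if i == 0 else alpha * ema + (1.0 - alpha) * r_i
    return ema
```
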
Additional Notes: For robust evaluation, we consider two categories of metrics for the cross-scene and task generalization experiments: "ground truth relative" (GTR) metrics that compare against trajectories in LaNMP as standards, and "ground truth independent" (GTI) metrics that evaluate a trajectory (ground-truth or generated) on task understanding or smoothness.

Description(s) and Statistic(s)

RT-1

Model Card: On page 21 of the paper.

Model Description: Robotics Transformer 1 (RT-1) is a model designed for generalizing across large-scale, multi-task datasets with real-time inference capabilities. RT-1 leverages a Transformer architecture to process images and natural language instructions and generate discretized actions for mobile manipulation. RT-1 is trained on a diverse dataset of approximately 130K episodes across more than 700 tasks collected using 13 robots. This enables RT-1 to learn through behavior cloning (BC) from human demonstrations annotated with detailed instructions.

  • Model Size: 35M (params)

ALFRED Seq2Seq

Model Card: No card available. Please refer to the GitHub repo instead.

Model Description: The ALFRED paper introduces a Sequence-to-Sequence model leveraging a CNN-LSTM architecture with an attention mechanism for task execution. It encodes visual inputs via ResNet-18 and processes language through a bidirectional LSTM. A decoder leverages these multimodal inputs along with historical action data to iteratively predict subsequent actions and generate pixelwise interaction masks, enhancing precise object manipulation capabilities within the given environment.

  • Model Size: 35M (params)

Expected Performance and Known Caveats

Expected Performance: We expected RT-1 to perform better than ALFRED Seq2Seq because it is more recent and more advanced. We expected both models to perform poorly, especially on the Task Success metric.

Known Caveats: The model architectures had to be modified to work with LaNMP. RT-1 had to be pretrained by us instead of using the provided pretrained checkpoint. There were some simulator issues during real-time evaluation.