Robotic Skill Acquisition via Instruction Augmentation
with Vision-Language Models


In recent years, much progress has been made in learning robotic manipulation policies that follow natural language instructions. Such methods typically learn from corpora of robot-language data that were either collected with specific tasks in mind or expensively relabeled by humans with rich language descriptions in hindsight. Recently, large-scale pretrained vision-language models (VLMs) like CLIP or ViLD have been applied to robotics for learning representations and scene descriptors. Can these pretrained models serve as automatic labelers for robot data, effectively importing Internet-scale knowledge into existing datasets to make them useful even for tasks that are not reflected in their ground truth annotations? For example, if the original annotations contained simple task descriptions such as "pick up the apple", a pretrained VLM-based labeler could significantly expand the number of semantic concepts available in the data and introduce spatial concepts such as "the apple on the right side of the table" or alternative phrasings such as "the red colored fruit". To accomplish this, we introduce Data-driven Instruction Augmentation for Language-conditioned control (DIAL): we utilize semi-supervised language labels leveraging the semantic understanding of CLIP to propagate knowledge onto large datasets of unlabeled demonstration data and then train language-conditioned policies on the augmented datasets. This method enables cheaper acquisition of useful language descriptions compared to expensive human labels, allowing for more efficient label coverage of large-scale datasets. We apply DIAL to a challenging real-world robotic manipulation domain where 96.5% of the 80,000 demonstrations do not contain crowd-sourced language annotations. DIAL enables imitation learning policies to acquire new capabilities and generalize to 60 novel instructions unseen in the original dataset.


DIAL consists of three stages: (1) finetuning a VLM’s vision and language representations on a small offline dataset of trajectories with crowd-sourced episode-level natural language descriptions, (2) generating alternative instructions for a larger offline dataset of trajectories with the VLM, and (3) learning a language-conditioned policy via behavior cloning on this augmented offline data.
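
The data flow from stage (2) into stage (3) can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the `Episode` type, the `relabel` callable standing in for the finetuned VLM, and the choice to keep an episode's original instruction next to the newly proposed ones are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Episode:
    """Hypothetical container for one demonstration trajectory."""
    episode_id: int
    instruction: Optional[str]  # None when the episode has no language label


def augment_dataset(
    episodes: List[Episode],
    relabel: Callable[[Episode], List[str]],
) -> List[Tuple[int, str]]:
    """Pair every episode with its original and proposed instructions.

    `relabel` stands in for the finetuned VLM of stage 1 (assumed interface).
    The returned (episode_id, instruction) pairs are the kind of data a
    behavior-cloning learner would consume in stage 3.
    """
    pairs: List[Tuple[int, str]] = []
    for ep in episodes:
        instructions: List[str] = []
        if ep.instruction is not None:
            instructions.append(ep.instruction)
        for text in relabel(ep):
            if text not in instructions:  # skip duplicates, keep order
                instructions.append(text)
        pairs.extend((ep.episode_id, text) for text in instructions)
    return pairs


# Toy stand-in for the VLM labeler: always proposes the same two instructions.
def toy_relabel(ep: Episode) -> List[str]:
    return ["pick up the apple", "pick up the red colored fruit"]


demo = [Episode(0, "pick up the apple"), Episode(1, None)]
augmented = augment_dataset(demo, toy_relabel)
```

In this toy run, the labeled episode gains one extra phrasing and the unlabeled episode receives both proposed instructions, so the policy would see four training pairs instead of one.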

After finetuning CLIP on the portion of the training dataset that contains crowd-sourced language instructions, we can automatically label the rest of the dataset without any additional human effort. In our setting, we finetune CLIP on crowd-sourced human labels for 2,800 demonstrations out of the total training dataset of 80,000 teleoperated demonstrations.

CLIP predicts language instructions by scoring candidate instructions sourced from the original crowd-sourced dataset as well as LLM caption proposals. While many of these candidate instructions might be erroneous or irrelevant, CLIP is often able to predict instruction labels describing skills or concepts not present in the original label.
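
As a rough sketch of what scoring candidates could look like, the snippet below ranks candidate instructions by cosine similarity between an episode embedding and each candidate's text embedding. The three-dimensional vectors are made-up toy values, not CLIP outputs, and the use of plain cosine similarity over a single episode embedding is an assumption for illustration.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_candidates(
    episode_embedding: Sequence[float],
    candidate_embeddings: Dict[str, Sequence[float]],
) -> List[Tuple[str, float]]:
    """Return (instruction, score) pairs sorted from best to worst match."""
    scored = [
        (text, cosine_similarity(episode_embedding, emb))
        for text, emb in candidate_embeddings.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy embeddings (hypothetical values chosen only for the example).
episode = [0.9, 0.1, 0.4]
candidates = {
    "pick up the apple": [1.0, 0.0, 0.3],
    "pick up the apple on the right side of the table": [0.8, 0.2, 0.5],
    "open the drawer": [0.0, 1.0, 0.0],
}
ranking = rank_candidates(episode, candidates)
```

Here the irrelevant candidate "open the drawer" ends up at the bottom of the ranking, which mirrors how erroneous proposals can be filtered out by their low scores.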


In our experiments, we investigate whether DIAL can improve policy performance on 60 novel evaluation tasks. We compare against different instruction augmentation methods, including methods that do not utilize visual context. To study how applicable DIAL is to various practical robot learning scenarios, we consider both the setting where we have a fully labeled trajectory dataset and the setting where we have only a partially labeled trajectory dataset that contains episodes with no corresponding language labels. We find that DIAL is able to successfully imbue offline datasets with additional semantic concepts not contained in the original instruction set, and demonstrate these capabilities on a large-scale real robotic manipulation setting.

We perform over 1,300 real-world robot policy evaluations on 60 novel instructions that we organize into three categories. "Spatial" tasks focus on instructions involving reasoning about spatial relationships, such as specifying an object’s initial position relative to other objects in the scene. "Rephrased" tasks are linguistic rephrasings of the original tasks prompted during teleoperation, such as referring to sodas and chips by their colors instead of their brand names. "Semantic" tasks describe skills not contained in the original dataset, such as moving objects away from all other objects, since the original dataset only contains trajectories of moving objects towards other objects. We find that DIAL outperforms instruction augmentation baselines across all three categories.

The source trajectory dataset we utilize consists of a 5,600-trajectory dataset (Dataset A) with crowd-sourced hindsight labels and a larger 80,000-trajectory dataset (Dataset B) that does not have any crowd-sourced instructions. Even though Dataset B does not contain hindsight labels, it does contain structured task information that was used to tell human demonstrators which task to collect (for example, “pick coke can”); we refer to these structured commands as foresight instructions. In the previous experiment, we used all available information during both instruction augmentation and policy training. However, does DIAL still improve policy performance in settings where the source dataset is only partially labeled? We study the setting where foresight instructions are not available as well as the setting where crowd-sourced labels are not available. We find that in both cases DIAL significantly increases performance on novel evaluation instructions. This experiment is motivated by the setting where large amounts of unstructured trajectory data are available but hindsight labels are expensive to collect, such as robot play data collection.

We also study the tradeoff between relabeled instruction accuracy, augmented dataset size, and downstream policy performance. Empirically, we find that the finetuned CLIP model we utilize becomes very uncertain after the initial few hindsight label predictions; we propose two instruction prediction methods, and find that the more conservative approach is able to strike a balance between producing rich instruction augmentations and still providing an accurate supervision signal for the control policy. For more details on this analysis and experimental results, please refer to the full paper.
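
The two prediction methods are not described on this page, so the following is only an assumed pair of selection rules that illustrates the tradeoff: a fixed top-k rule that always keeps k labels per episode, and a more conservative rule that keeps only labels above a score threshold. The function names, scores, and threshold value are hypothetical.

```python
from typing import List, Tuple

Scored = Tuple[str, float]  # (instruction, similarity score), best first


def select_top_k(ranked: List[Scored], k: int) -> List[str]:
    """Keep the k best-scoring instructions regardless of their scores."""
    return [text for text, _ in ranked[:k]]


def select_above_threshold(
    ranked: List[Scored], min_score: float, max_labels: int
) -> List[str]:
    """Keep at most max_labels instructions whose score reaches min_score."""
    kept = [text for text, score in ranked if score >= min_score]
    return kept[:max_labels]


# Hypothetical ranked scores for one episode, sorted from best to worst.
ranked = [
    ("pick up the apple", 0.92),
    ("pick up the red colored fruit", 0.88),
    ("pick up the apple on the right side of the table", 0.61),
    ("open the drawer", 0.15),
]

top3 = select_top_k(ranked, k=3)
conservative = select_above_threshold(ranked, min_score=0.8, max_labels=3)
```

With these toy scores, the top-k rule yields three labels including a lower-confidence one, while the threshold rule yields two: a smaller augmented dataset, but with labels that are more likely to be accurate.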



The authors would like to thank Kanishka Rao, Debidatta Dwibedi, Pete Florence, Yevgen Chebotar, Fei Xia, and Corey Lynch for valuable feedback and discussions. We would also like to thank Emily Perez, Dee M, Clayton Tan, Jaspiar Singh, Jornell Quiambao, and Noah Brown for navigating the ever-changing challenges of data collection and robot policy evaluation at scale. Additionally, Tom Small designed informative animations to visualize DIAL. Finally, we would like to thank the large team that built SayCan, upon which we develop DIAL.

The website template was borrowed from Jon Barron.