Watching videos of Minecraft helps AI learn the game
OpenAI has trained a neural network to play Minecraft using Video PreTraining (VPT) on a massive unlabeled video dataset of human Minecraft play, with only a small amount of labeled contractor data.
With a bit of fine-tuning, the AI research and deployment company is confident that its model can learn to craft diamond tools, a task that usually takes proficient humans over 20 minutes (24,000 actions). Its model uses the native human interface of keypresses and mouse movements, making it quite general, and represents a step towards general computer-using agents.
A spokesperson for the Microsoft-backed firm said: “The internet contains an enormous amount of publicly available videos that we can learn from. You can watch a person make a gorgeous presentation, a digital artist draw a beautiful sunset, and a Minecraft player build an intricate house. However, these videos only provide a record of what happened but not precisely how it was achieved, i.e. you will not know the exact sequence of mouse movements and keys pressed.
“If we would like to build large-scale foundation models in these domains as we’ve done in language with GPT, this lack of action labels poses a new challenge not present in the language domain, where ‘action labels’ are simply the next words in a sentence.”
To use the wealth of unlabeled video data available on the internet, OpenAI introduces a novel yet simple semi-supervised imitation learning method: Video PreTraining (VPT). The team begins by gathering a small dataset from contractors, recording not only their video but also the actions they took, which in this case are keypresses and mouse movements. With this data, the company trains an inverse dynamics model (IDM), which predicts the action being taken at each step in the video. Importantly, the IDM can use both past and future information to guess the action at each step.
The spokesperson added: “This task is much easier and thus requires far less data than the behavioral cloning task of predicting actions given past video frames only, which requires inferring what the person wants to do and how to accomplish it. We can then use the trained IDM to label a much larger dataset of online videos and learn to act via behavioral cloning.”
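The three stages described above (train an IDM on a small labeled set, use it to label a much larger unlabeled set, then learn a policy by behavioral cloning) can be illustrated with a toy sketch. This is not OpenAI's implementation: the one-dimensional world, the names `play`, `train_idm`, `pseudo_label`, `behavioral_cloning` and `rollout`, and the lookup-table "models" are all hypothetical stand-ins for the neural networks and Minecraft video the article describes.

```python
from collections import Counter, defaultdict

# Hypothetical toy world: a "frame" is an integer position on a line.
MOVES = {"left": -1, "stay": 0, "right": 1}
GOAL = 5  # position the simulated human players walk toward


def play(start, steps):
    """Simulate one human player; returns (frames, actions)."""
    frames, actions = [start], []
    for _ in range(steps):
        pos = frames[-1]
        action = "right" if pos < GOAL else "left" if pos > GOAL else "stay"
        actions.append(action)
        frames.append(pos + MOVES[action])
    return frames, actions


def train_idm(labeled):
    """Inverse dynamics model: (frame before, frame after) -> action.

    Unlike the policy, it is allowed to look at the next frame.
    """
    votes = defaultdict(Counter)
    for frames, actions in labeled:
        for t, action in enumerate(actions):
            votes[frames[t + 1] - frames[t]][action] += 1
    table = {delta: c.most_common(1)[0][0] for delta, c in votes.items()}
    return lambda before, after: table[after - before]


def pseudo_label(idm, videos):
    """Use the IDM to attach predicted actions to unlabeled videos."""
    return [(f, [idm(f[t], f[t + 1]) for t in range(len(f) - 1)]) for f in videos]


def behavioral_cloning(labeled):
    """Policy: current frame only -> most common action seen at that frame."""
    votes = defaultdict(Counter)
    for frames, actions in labeled:
        for frame, action in zip(frames, actions):
            votes[frame][action] += 1
    table = {frame: c.most_common(1)[0][0] for frame, c in votes.items()}
    # Frames never seen in training fall back to "stay" (an arbitrary choice).
    return lambda frame: table.get(frame, "stay")


def rollout(policy, start, steps):
    """Run the learned policy in the toy world and return the final position."""
    pos = start
    for _ in range(steps):
        pos += MOVES[policy(pos)]
    return pos


# 1) Small labeled "contractor" set: video plus recorded actions.
contractor_data = [play(0, 8), play(10, 8)]
idm = train_idm(contractor_data)

# 2) Much larger unlabeled set: frames only, actions discarded.
web_videos = [play(start, 20)[0] for start in range(-10, 21)]
labeled_web = pseudo_label(idm, web_videos)

# 3) Behavioral cloning on the pseudo-labeled videos.
policy = behavioral_cloning(labeled_web)
final_position = rollout(policy, start=-7, steps=30)
```

In this sketch the IDM only has to learn which action explains the change between two adjacent frames, which two short labeled clips are enough to cover. The policy, by contrast, must choose an action from the current frame alone, and it gets the coverage it needs from the 31 pseudo-labeled videos rather than from the contractor data.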
VPT paves the way for agents to learn to act by watching the vast number of videos on the internet, according to OpenAI.
The spokesperson said: “Compared to generative video modeling or contrastive methods that would only yield representational priors, VPT offers the exciting possibility of directly learning large-scale behavioral priors in more domains than just language. While we only experiment in Minecraft, the game is very open-ended and the native human interface (mouse and keyboard) is very generic, so we believe our results bode well for other similar domains, e.g. computer usage.”