14 May 2020, Department of Computer Science

To understand the work she does, Angela Yao says to imagine a future where robot helpers are commonplace. Whether they’re workplace assistants, companions, or domestic helpers, robots need to be able to do one crucial thing, says the assistant professor from NUS Computing.

“If you want to have a safe and smooth interaction with people, the ability to predict what’s going to happen next is the most important thing,” says Yao, who specialises in computer vision and machine learning. It is an uphill task, one made all the more difficult by the problem computer scientists call zero-shot anticipation — when the robot has to anticipate a scenario it has never seen before.

Yao uses the example of a robot chef, whose job is to aid with the creation of tasty meals in the kitchen of the future. “Maybe my robot chef has been trained to make chocolate chip cookies, but I would like it to help me make oatmeal raisin cookies, which it isn’t necessarily trained on,” she says.

For a robot to make a never-seen-before dish — or in more general terms, to predict a never-seen-before activity — it has to be trained in the right way. “We don’t want to train for every single possible recipe that might exist,” says Yao. “Instead, we’d like to generalise the principles of cooking and then apply them in completely new situations.”

When video is great, but...
When deep learning methods are used in machine learning, training data is key. Quantity is important — the more information or training scenarios you provide, the better the machine learns — but so is quality. A robot chef, just like its human counterpart, may be able to brush up on its cooking skills by reading recipes, but learning is enhanced when these recipes are accompanied by mouth-watering pictures. Throw in videos as part of the training process and the results are amplified further.

“Multimodal is the richest form of data,” says Yao, referring to sources that include image plus text, or videos with narration. But working with videos in machine learning is challenging in a number of ways.

First, there aren’t as many recipes available in video form as there are written ones. It is also difficult to annotate data from video compared with text, says Yao. “When we receive steps in sentences, there’s a natural partition and it’s easy to tell where it ends by the full stop. But with video, it’s harder because it comes as a stream.”

“So a robot doesn’t know how to parse this down into individual chunks that are reasonable actions for planning,” she says. This creates problems especially when it is working on complex activities such as cooking, which involve a number of sequential steps.

To get around this, the researchers could annotate the videos themselves. But given the virtually unlimited number of recipes and cooking videos out there, the process would be too cumbersome and time-consuming.

From text to video
In 2018, Yao and her PhD student Fadime Sener from the University of Bonn began studying the problem of anticipation in videos. The trick, the pair realised, was to start off with text recipes first — to take information from the wealth of text corpora available online, and transfer it to the visual domain to predict the next steps of an instructional cooking video.

To do this, Yao and Sener built a two-stage hierarchical model. In the first stage, pre-training, they obtained information from Recipe1M, a repository of over a million recipes — one of the largest such datasets available online.

Each recipe was then run through a sentence encoder to parse the individual steps into feature vectors (numerical representations, similar to how GPS coordinates pinpoint a geographical location). Next, a recurrent neural network (RNN) was used to model the sequence of the recipe steps. Such algorithms are called “recurrent” because they “allow you to start off with the ingredients, encode it, then predict the next step,” explains Yao. “And then from that current state, encode it again, make another prediction, and so on.”
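The encode-then-predict loop Yao describes can be sketched in a few lines. Everything below is a toy illustration, not the paper's code: `encode` stands in for a learned sentence encoder, and `recurrent_step` for a trained RNN cell.

```python
import math

DIM = 8  # size of the toy feature vectors

def encode(sentence: str) -> list[float]:
    """Stand-in for a sentence encoder: hash words into a fixed-size vector."""
    vec = [0.0] * DIM
    for word in sentence.lower().split():
        vec[hash(word) % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def recurrent_step(state: list[float], x: list[float]) -> list[float]:
    """Stand-in for an RNN cell: fold the new step into the running state."""
    return [math.tanh(s + xi) for s, xi in zip(state, x)]

# Start from the ingredients, then fold in each observed step in turn;
# the state is what the model would use to predict the next step's vector.
state = encode("milk caramel sauce dark chocolate sea salt")
for step in ["bring the milk to a boil",
             "add the caramel sauce and stir"]:
    state = recurrent_step(state, encode(step))

print(len(state))  # the state stays a fixed-size vector: prints 8
```

However the steps accumulate, the state keeps the same fixed size, which is what lets the recurrence run for recipes of any length.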

For example, a recipe may call for the following ingredients: caramel, whipped cream, sea salt, caramel sauce, dark chocolate, and whole milk. Analysing this, the RNN recognises it as a recipe for salted caramel hot chocolate, and hence suggests the first step to be “In a saucepan, bring the milk to a boil.” This might be followed by: “Add the caramel sauce and stir until the chocolate is melted.”

Photograph of a saucepan with chocolate sauce.

“And then it’s going to predict the next feature vector, and the next and so on,” says Yao, explaining why the model is deemed hierarchical. “It’s similar to how Google autocomplete works when you’re typing in search terms.”

When the RNN is done, a sentence decoder converts the predicted recipe steps back into natural language, or human-interpretable sentences.
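The decoding idea can be pictured with a hypothetical sketch: map a predicted feature vector back to the closest human-readable step. (The actual model uses a learned neural decoder; the nearest-neighbour lookup and bag-of-words `embed` below are illustrative stand-ins.)

```python
# Fixed toy vocabulary for the bag-of-words embedding.
VOCAB = ["bring", "the", "milk", "to", "a", "boil",
         "add", "caramel", "sauce", "and", "stir",
         "serve", "topped", "with", "whipped", "cream"]

def embed(text: str) -> list[float]:
    """Toy feature vector: word counts over a fixed vocabulary."""
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]

def decode(predicted: list[float], candidates: list[str]) -> str:
    """Pick the candidate step whose embedding best matches the prediction."""
    return max(candidates,
               key=lambda s: sum(p * q for p, q in zip(predicted, embed(s))))

steps = ["bring the milk to a boil",
         "add the caramel sauce and stir",
         "serve topped with whipped cream"]

# A vector close to the second step's embedding decodes back to that step.
print(decode(embed("add the caramel sauce and stir"), steps))
# prints "add the caramel sauce and stir"
```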

The second stage of the model is what Yao calls the fine-tuning stage, where the model is applied to videos (a video encoder replaces the sentence encoder).
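One way to picture the two-stage design (names here are illustrative, not the authors' code): the recurrent core is shared across both stages, and only the front-end encoder is swapped — text during pre-training, video during fine-tuning.

```python
from typing import Callable, Sequence

Vector = list[float]

def run_core(inputs: Sequence, encoder: Callable[..., Vector],
             state: Vector) -> Vector:
    """Fold each encoded input into the running state (toy recurrence)."""
    for item in inputs:
        x = encoder(item)
        state = [0.5 * s + 0.5 * xi for s, xi in zip(state, x)]
    return state

def sentence_encoder(sentence: str) -> Vector:
    # Stand-in for a learned sentence encoder.
    return [float(len(sentence.split())), 1.0]

def video_encoder(num_frames: int) -> Vector:
    # Stand-in for a learned video-clip encoder.
    return [float(num_frames) / 100.0, -1.0]

# Stage 1: pre-train the core on text recipes.
state = run_core(["bring the milk to a boil", "add the sauce"],
                 sentence_encoder, [0.0, 0.0])
# Stage 2: fine-tune, reusing the same core on encoded video clips.
state = run_core([120, 240], video_encoder, state)
print(len(state))  # the shared state keeps the same shape: prints 2
```

Because both encoders emit vectors of the same shape, the knowledge the core accumulates from the large text corpus carries over directly to the much scarcer video data.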

Anticipating complex tasks
To test their model, the pair applied it to videos from YouCook2 and Tasty Videos, two online collections of approximately 2,000 and 4,000 recipes respectively. “We were testing for how well the predicted steps match to what is actually given in the recipe,” explains Yao.
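The idea of scoring how well a predicted step matches the recipe can be conveyed with a simple word-overlap measure. (Real captioning evaluations use standard metrics such as BLEU; the function below is only a toy stand-in.)

```python
def overlap_score(predicted: str, reference: str) -> float:
    """Fraction of the reference step's words that appear in the prediction."""
    pred = set(predicted.lower().split())
    ref = set(reference.lower().split())
    return len(pred & ref) / len(ref) if ref else 0.0

score = overlap_score("add the caramel sauce and stir",
                      "stir in the caramel sauce")
print(round(score, 2))  # 4 of the 5 reference words appear: prints 0.8
```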

The results were very encouraging, she says. “We matched or outperformed state-of-the-art video captioning methods that were trained in a supervised way,” says Yao. Despite having never seen a particular recipe before, the model was able to predict the next steps, performing as well as algorithms that had been trained directly on the test recipes (hence “supervised”) and that produce captions after, rather than before, seeing a portion of the video.

While the hierarchical model was prone to making some wrong predictions, it was able to correct itself after receiving more visual context — something that was “a pleasant surprise,” says Yao.

The model offers many advantages over conventional approaches to zero-shot anticipation for videos. “Instead of getting word-by-word predictions, we can now get sentences, which offers more coherence and conveys much more information,” says Yao.

It also minimises the need for excessive labelling of video data and makes it easier to predict multiple steps in complex tasks from visual data.

Yao is now working to include more information, such as recipe titles, into the model, which she says can also be applied to scenarios beyond cooking. “It can be used in anything that involves multiple steps, like doing chores or assembling IKEA furniture,” she says. “We’re the first to do this type of zero-shot action anticipation for multi-step activities, so it’s really exciting.”


Paper:
Zero-Shot Anticipation for Instructional Activities