Always one step ahead: Robo-Chef predicts steps of recipes it’s never seen before

14 May 2020

Angela Yao

Dean's Chair Assistant Professor

Computer Science

To understand the work she does, Angela Yao says to imagine a future where robot helpers are commonplace. Whether they’re workplace assistants, companions, or domestic helpers, robots need to be able to do one crucial thing, says the assistant professor from NUS Computing.

“If you want to have a safe and smooth interaction with people, the ability to predict what’s going to happen next is the most important thing,” says Yao, who specialises in computer vision and machine learning. It is an uphill task, one made all the more difficult by the problem computer scientists call zero-shot anticipation — when the robot has to anticipate a scenario it has never seen before.

Yao uses the example of a robot chef, whose job is to aid with the creation of tasty meals in the kitchen of the future. “Maybe my robot chef has been trained to make chocolate chip cookies, but I would like it to help me make oatmeal raisin cookies, which it isn’t necessarily trained on,” she says.

For a robot to make a never-seen-before dish — or in more general terms, to predict a never-seen-before activity — it has to be trained in the right way. “We don’t want to train for every single possible recipe that might exist,” says Yao. “Instead, we’d like to generalise the principles of cooking and then apply them in completely new situations.”

When video is great, but…

When deep learning methods are used in machine learning, training data is key. Quantity is important — the more information or training scenarios you provide, the better the machine learns — but so is quality. A robot chef, just like its human counterpart, may be able to brush up on its cooking skills by reading recipes, but learning is enhanced when these recipes are accompanied by mouth-watering pictures. Throw in videos as part of the training process and the results are amplified further.

“Multimodal is the richest form of data,” says Yao, referring to sources that include image plus text, or videos with narration. But working with videos in machine learning is challenging in a number of ways.

First, there aren’t as many recipes available in video form as there are written ones. It is also difficult to annotate data from video compared with text, says Yao. “When we receive steps in sentences, there’s a natural partition and it’s easy to tell where it ends by the full stop. But with video, it’s harder because it comes as a stream.”

“So a robot doesn’t know how to parse this down into individual chunks that are reasonable actions for planning,” she says. This creates problems especially when it is working on complex activities such as cooking, which involve a number of sequential steps.

To get around this, the researchers could annotate the videos themselves. But given the virtually unlimited number of recipes and cooking videos out there, the process would be too cumbersome and time-consuming.

From text to video

In 2018, Yao and her PhD student Fadime Sener from the University of Bonn began studying the problem of anticipation in videos. The trick, the pair realised, was to start off with text recipes first — to take information from the wealth of text corpora available online, and transfer it to the visual domain to predict the next steps of an instructional cooking video.

To do this, Yao and Sener built a two-stage hierarchical model. In the first pre-training phase, they obtained information from Recipe1M, a repository comprising over a million recipes — one of the largest such datasets available online.

Each recipe was then run through a sentence encoder, which converts each individual step into a feature vector (a list of numbers that represents the sentence, similar to how GPS coordinates indicate a geographical location). Next, a recurrent neural network (RNN) was used to model the sequence of the recipe steps. Such algorithms are called “recurrent” because they “allow you to start off with the ingredients, encode it, then predict the next step,” explains Yao. “And then from that current state, encode it again, make another prediction, and so on.”
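The encode-then-predict loop Yao describes can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the hashed bag-of-words `encode_sentence`, the `ToyRecipeRNN` class, the vector size `DIM`, and the random, untrained weights are all hypothetical stand-ins, not the researchers' model. The sketch shows the shape of the recurrence, not its accuracy.

```python
import math
import random

DIM = 8  # hypothetical feature-vector size, chosen only for illustration


def encode_sentence(sentence, dim=DIM):
    """Toy sentence encoder: hash each word into a fixed-length count vector."""
    vec = [0.0] * dim
    for word in sentence.lower().split():
        vec[sum(map(ord, word)) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def make_matrix(rows, cols, rng):
    """Random weights; a real model would learn these from recipe text."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class ToyRecipeRNN:
    """Minimal recurrent cell: the hidden state carries the recipe so far."""

    def __init__(self, dim=DIM, seed=0):
        rng = random.Random(seed)
        self.w_in = make_matrix(dim, dim, rng)
        self.w_state = make_matrix(dim, dim, rng)
        self.w_out = make_matrix(dim, dim, rng)
        self.state = [0.0] * dim

    def step(self, features):
        """Update the hidden state and return a predicted next-step vector."""
        from_input = matvec(self.w_in, features)
        from_state = matvec(self.w_state, self.state)
        self.state = [math.tanh(a + b) for a, b in zip(from_input, from_state)]
        return matvec(self.w_out, self.state)


def rollout(ingredients, n_steps):
    """Start from the ingredients, then feed each prediction back in."""
    rnn = ToyRecipeRNN()
    current = encode_sentence(ingredients)
    predictions = []
    for _ in range(n_steps):
        current = rnn.step(current)
        predictions.append(current)
    return predictions


predicted = rollout("caramel, whipped cream, sea salt, dark chocolate, whole milk", 3)
print(len(predicted), "predicted step vectors of length", len(predicted[0]))
```

Because the weights are untrained, the predicted vectors are meaningless in themselves. What carries over from the article is the loop: each prediction depends on the running state, and each prediction becomes the input for the next one.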

For example, a recipe may call for the following ingredients: caramel, whipped cream, sea salt, caramel sauce, dark chocolate, and whole milk. Analysing this, the RNN recognises it as a recipe for salted caramel hot chocolate, and hence suggests the first step to be “In a saucepan, bring the milk to a boil.” This might be followed by: “Add the caramel sauce and stir until the chocolate is melted.”

“And then it’s going to predict the next feature vector, and the next and so on,” says Yao, explaining why the model is deemed hierarchical. “It’s similar to how Google autocomplete works when you’re typing in search terms.”

When the RNN is done, a sentence decoder converts the predicted recipe steps back into natural language, or human-interpretable sentences.
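To give a rough sense of what turning a feature vector back into a sentence involves, the sketch below swaps in a much simpler technique than a learned decoder: nearest-neighbour lookup, which returns whichever candidate sentence has the most similar vector. The `embed` function, the 16-dimensional vectors, and the candidate list are assumptions made purely for illustration and are not taken from the paper.

```python
import math


def embed(sentence, dim=16):
    """Toy embedding (an assumption): hash words into a fixed-length count vector."""
    vec = [0.0] * dim
    for word in sentence.lower().split():
        word = word.strip(".,")
        vec[sum(map(ord, word)) % dim] += 1.0
    return vec


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def decode(vector, candidates):
    """Return the candidate sentence whose embedding is closest to `vector`."""
    return max(candidates, key=lambda s: cosine(vector, embed(s)))


candidate_steps = [
    "In a saucepan, bring the milk to a boil.",
    "Add the caramel sauce and stir until the chocolate is melted.",
    "Top with whipped cream and sea salt.",
]

# Stand-in for a vector predicted by the RNN: here, the embedding of a known step.
predicted_vector = embed(candidate_steps[1])
print(decode(predicted_vector, candidate_steps))
```

A retrieval step like this can only return sentences it already has on file, whereas a sentence decoder of the kind described above generates the words of a new sentence.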

The second stage of the model is what Yao calls the fine-tuning stage, where the model is applied to videos (a video encoder replaces the sentence encoder).

Anticipating complex tasks

To test their model, the pair applied it to videos from YouCook2 and Tasty Videos, online collections of approximately 2,000 and 4,000 recipes respectively. “We were testing for how well the predicted steps match to what is actually given in the recipe,” explains Yao.
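One simple way to put a number on how well a predicted step matches the real one is to count the words the two sentences share. The sketch below computes a word-overlap F1 score. It is an assumed, simplified stand-in for illustration, since the article does not say which metrics the researchers used, and `overlap_f1` is a hypothetical helper name.

```python
from collections import Counter


def tokens(sentence):
    """Lowercase words with surrounding punctuation removed."""
    stripped = (word.strip(".,!?").lower() for word in sentence.split())
    return [word for word in stripped if word]


def overlap_f1(predicted, reference):
    """F1 score over words shared by the predicted and reference steps."""
    pred_counts = Counter(tokens(predicted))
    ref_counts = Counter(tokens(reference))
    common = sum((pred_counts & ref_counts).values())
    if common == 0:
        return 0.0
    precision = common / sum(pred_counts.values())
    recall = common / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


reference = "In a saucepan, bring the milk to a boil."
print(overlap_f1("Bring the milk to a boil", reference))
print(overlap_f1("Whisk the eggs", reference))
```

A score of 1.0 means the two sentences contain exactly the same words, and 0.0 means they share none. Word overlap ignores word order and synonyms, so it is only a crude proxy for whether a predicted step is actually correct.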

The results were very encouraging, says Yao. “We matched or outperformed state-of-the-art video captioning methods that were trained in a supervised way.” Despite having never seen a particular recipe before, the model was able to predict the next steps. It performed as well as supervised algorithms, which had previously been trained on the test recipes and which provide captions after, rather than before, seeing a portion of the video.

While the hierarchical model was prone to making some wrong predictions, it was able to correct itself after receiving more visual context — something that was “a pleasant surprise,” says Yao.

The model offers several advantages over conventional approaches to zero-shot anticipation in videos. “Instead of getting word-by-word predictions, we can now get sentences, which offers more coherence and conveys much more information,” says Yao.

It also minimises the need for excessive labelling of video data and makes it easier to predict multiple steps in complex tasks from visual data.

Yao is now working to include more information, such as recipe titles, into the model, which she says can be applied to other non-cooking related scenarios. “It can be used in anything that involves multiple steps, like doing chores or assembling IKEA furniture,” she says. “We’re the first to do this type of zero-shot action anticipation for multi-step activities so it’s really exciting.”

Paper:
Zero-Shot Anticipation for Instructional Activities
