A child wearing a red ski-suit
The shiny, black robotic arm gleamed as it whirred into action and ‘waved’ at us, accompanied by Alexa’s robotic, yet (somehow) cheery, disembodied greeting, “Hello! My name is MICO.” Mohit Shridhar stretched his lanky frame across the counter to place plastic replicas of a few everyday objects—a red bowl, an apple, and a banana—on the white tablecloth in front of MICO. Then Shridhar instructed, “Alexa, tell MICO to pick up the apple.” The robotic arm contorted and whirred until it held its gripper over the apple. “Do you mean this?” Alexa asked. “Alexa, tell MICO to go ahead,” Shridhar confirmed. MICO obediently, albeit mechanically, lowered its gripper and picked up the apple.
Shridhar was giving verbal instructions to MICO through Alexa because MICO, the robotic arm, cannot ‘hear’ or ‘speak’. So, Amazon’s virtual assistant Alexa (via an Echo Dot) supplied the ‘ears’ and ‘voice’. An Xbox Kinect2 mounted above the counter served as the ‘eyes’. These devices were being employed to demonstrate the robot system that Shridhar designed: INGRESS. INGRESS functioned as the ‘brain’, making sense of the visual and auditory information picked up by Alexa and the Kinect2, and telling MICO to move accordingly. It is a robot system that follows human natural language instructions to pick and place everyday objects, for now, but essentially INGRESS could be integrated onto any robot. It was June 2018, and Shridhar, a recent Computer Engineering graduate and research assistant for Professor David Hsu, was preparing to demonstrate his work to the Minister for Education, Ong Ye Kung. In November the same year, he would also install INGRESS onto a mobile (i.e. roving) robot called the Fetch Mobile Manipulator to demonstrate its versatility to Emeritus Senior Minister Goh Chok Tong.
“I work on teaching robots how to connect human language to the real world (known in the field as ‘grounded language’), and one of the main problems in this area is trying to get robots to understand unconstrained natural language expressions,” Shridhar explained. Traditionally, researchers have tried to teach robots to understand human language and their environment using pre-defined sets of objects. These objects were labelled with either the object names (e.g. ‘Banana’, to identify an image of a banana), or a list of attributes that ‘define’ those objects (e.g. yellow, long = ‘Banana’). As you might imagine, neither of these solutions were very robust or practical. Not only would you have to supply endless lists of definitions and attributes to cover everything the robot could possibly encounter, but, as Shridhar puts it, “...if you deviated from this predefined list even just a bit—if you described a spotty, overripe banana—you would completely break the system.”
To illustrate how complex and challenging it is to develop a system like this, Shridhar drew on a familiar example: “When we were learning about objects as kids, our parents didn’t give us lists of attributes to combine. They just pointed at objects and said ‘That’s an orange’, ‘That’s a ball’. And while they did this, we learnt abstract concepts and made abstract connections in our head. For example, you can describe a red apple as a red object or a roundish fruit. We’re making all these connections in our heads and that’s what we’re trying to do in this project—we want to give robots the same ability to connect abstract concepts and rich visual descriptions of the object together, rather than single words for each object.”
So, how do we instill in robots this ability, that is seemingly effortless for us but the mechanisms for which are not yet completely understood or agreed upon (i.e. there are multiple theories about the cognitive neuroscience of visual object recognition)? One answer (at least so far) is to teach robots the same way we were taught— i.e. expose them to lots of examples object. Instead of painstakingly trying to teach a robot to identify an apple by defining the parameters of what makes an apple an apple, with machine learning, you can train it by showing it many different images of apples and it will learn patterns and develop its own parameters for identifying apples.
So, Shridhar used this machine learning based method to build a flexible system that can not only visually identify everyday objects in its environment, but also understand unconstrained, natural human language—such as the use of synonyms, object relationships and perspectives—used in reference to those objects. On top of all that, INGRESS can also generalise this ability to accurately identify new items that it has not ‘seen’ before.
Of course, the system does not always perform perfectly. As described, it identifies objects based on the numerous examples of other everyday objects it was exposed to during training. So, if it encounters an unusual looking object that does not quite look like anything it has previously seen, the results can be quite comical. When, during one of the demonstrations, Shridhar placed a Bob the Minion (from the cartoon ‘Despicable Me’) figurine among the items MICO was supposed to identify, MICO identified all the other items correctly, except Bob, which was apparently ‘a child wearing a red ski-suit’ even though Bob was in blue denim overalls. Chuckling, Shridhar offered his theory, “It’s because the background is white, so it kinda thinks it’s like snow. Maybe because it’s seen lots of pictures of people skiing. But there’s no red on the Minion. That’s kinda like it’s hallucinating things from biases in the training data.” Still, during testing, INGRESS correctly identified everyday objects 75% of the time.
In secondary school, Shridhar was interested in a wide range of subjects, from physics to filmmaking. His first significant experience with computing occurred about a year into his time there. His class was given assignment that required them to build an animated game using Flash Action Script. He decided that he wanted to replicate an online game that was really popular then, but quickly realised that he was being limited by the tool he had been given—the game characters’ complicated behaviours could not be reproduced with the programme’s Graphical User Interface (GUI).
So that was when Shridhar dived down the Google ‘rabbit-hole’, and began figuring out how to code in order to produce the more advanced effects that he was after. “I learnt how to do a lot of programming by just googling,” said Shridhar. The game that he built for that assignment was just the start—he would go on to build iPhone apps for things like the school’s music festival event guide. By the time he turned 15, he was a tutor for a programme called First Robotics, which is when Shridhar says he first developed his interest in robotics.
A few years later when it came time to apply to universities, Shridhar knew that he wanted to delve into robotics and stay in Asia to be near his family, so he googled NUS to find out what programmes would get him there.
Shridhar was enrolled in NUS’ common engineering programme when he joined the university in August 2013. He took mechanical engineering modules thinking that they were the path to robotics, but did not enjoy them. Ultimately, the introductory programming courses that he also took were what became his favourite part of his first year at NUS. So, when it came time for Shridhar to specialise in his second year, he switched to the multi-disciplinary Computer Engineering programme without hesitation.
“Help me Obi-Wan Kenobi. You’re my only hope.”
In his second year, as part of the Design Centric Programme, Shridhar worked on models for a 3D teleconferencing system—essentially like the holographic calls and messages on Star Wars. He and his group mate were presenting their work at a student project showcase when they were spotted by NUS alumnus, Peter Ho.
Impressed with their work, Ho, who is CEO and co-founder of boutique engineering firm Hope Teknik, offered them internships during the mid-year university break. For Shridhar, interning at a successful local engineering firm that builds specialised robots and vehicles for industrial and government use was a boon. There, he had his first practical experience with robotics, working on building autonomous ground vehicles (AGV) which are used to carry heavy and supplies equipment across floors in hospitals.
Over three months, Shridhar worked with two robotics engineers to work out how to get the AGV to understand where it was located in its physical space, and navigate its surroundings autonomously. He dabbled with Robotic Operating System (ROS) and simultaneous localisation and mapping (SLAM), among other things. His 3-month internship at Hope Teknik did not count for any university credit, but it was significant because his experience there led him to the decision to focus on robotics. “This is the intersection where computer science meets engineering—you get to write software and actually see its physical manifestation,” Shridhar explained.
After securing a spot in the NUS Overseas Colleges (NOC) programme, where selected candidates spend up to 12 months interning at high tech startups and taking entrepreneurship courses at prestigious partner universities, Shridhar spent 2015 in Stanford, California, USA. As part of the NOC Silicon Valley programme, he attended mobile computer vision and machine learning courses at Stanford University, and got to attend lectures by preeminent leaders in the field like Professor Andrew Ng. At the same time, Shridhar interned as a computer vision engineer at Meta, an augmented reality (AR) startup that was in Y Combinator’s accelerator programme at the time.
Meta’s product was an AR headset and the key thing that they had to figure out was how to program the device to know where it was in the environment at any given time. This is because when you are using the AR headset, any relevant virtual objects that you see should stay where they are relative to the movement of your head. For example, if your head (and the headgear attached to it) moved 6 cm left, then any stationary virtual objects you see should move 6 cm to your right, to maintain the effect of movement and relative distance away from said stationary virtual objects. The biggest problem they faced with this was having the corrective movements of the stationary virtual objects occur fast enough (relative to your head movement) so that their movements would be imperceptible to you—i.e. the latency issue. Once again, Shridhar worked with the SLAM algorithm to develop a solution to this problem. At the same time, he also worked with Meta’s computer graphics team on 3D reconstructions of people for teleconferencing, like his second year Design Centric Programme project.
By the end of the year, Shridhar was contemplating what he would do when he returned to NUS for his final year. “I knew I would be bored because I only had to complete three more modules,” Shridhar said. Since he had always wanted to do some work in robotics, he googled ‘NUS Robotics’ to see if he could find someone in NUS who had robotics projects that he could also work on during his final year. That is how he found Hsu, an expert in robotics and artificial intelligence at NUS Computing. After several follow up emails, in January 2016, Hsu introduced Shridhar to his (then) PhD student Lan Ziquan and the pair began working together on Lan’s PhD project.
Lan’s idea of developing a system to allow a drone to autonomously take photos of objects was right up Shridhar’s alley. His experience using SLAM while interning at Meta proved invaluable when Shridhar suggested they try a promising new SLAM algorithm called ORB-SLAM that he had come across while he was in Silicon Valley. This became one of the major turning points in the project, and later the autonomous drone photo taking system that they developed, XPOSE, went on to win the Robotics System Science (RSS) Best Systems Paper Award in July 2017.
While working with Lan on XPOSE, Shridhar also tinkered with a few of his own ideas just for fun—like trying to control a drone with voice commands instead of using its controller—but ultimately had to put them on hold to focus on his final exams in May. By the end of 2016, Shridhar had completed his bachelor’s degree in Computer Engineering, and won the NUS 30th Annual Faculty Innovation and Research Award (Merit) for his Final Year Project—that aforementioned Star-Wars-like (holographic), 3D teleconferencing system.
After that, Shridhar was free to return to his main interests, working on natural language processing and teaching robots to recognise objects. Joining Hsu’s lab as a research assistant in January 2017 allowed him to indulge in this freely, with the added perk of being able to bounce ideas off an international award winning robotics expert like Hsu.
Raising robust robots
Despite initially toying with the idea of a ‘voice-controlled-drone’, Shridhar eventually decided it unwise to try to shout instructions above the din of an airborne drone. He figured that his time would be better spent if he tested his ideas on a simpler machine instead. That is when he switched to working with a manipulator, a KINOVA MICO robotic arm that he simply called ‘MICO’, combined with a Kinect2 and Echo Dot for ‘sight’ and ‘sound’.
Shridhar wanted to design a system that would be flexible enough to enable robots to understand natural verbal human language. “I think it’s a very exciting challenge because sometime in the future, we’re going to be living with robots, and we’re going to have robotic butlers taking care of our everyday needs like cooking and cleaning,” he explained, “And, the most natural way of interacting with these robots is through verbal language.” “We will tell robots what to do rather than program them, so the robot must understand what we see, what we say, and connect them together,” Hsu added.
Most computers, and robots, however, are built in ways that allow them to perform tasks only when they receive instructions in consistent, specific forms that match a script—e.g. telling Siri to “Call Dad”, or a sequence of taps on your iPhone to play “Bohemian Rhapsody”, from the “Songs To End All Songs” playlist on the Spotify app. This prevailing standard for neat, linear ‘template-like’ computer instruction protocols is contrary to the nature of human language and communication, and is a major hurdle to overcome if you are trying to build a robot that will understand broad verbal instructions. Everyday human communication often varies a great deal, can be ambiguous, and can progress in many different ways that may not be straightforward. Shridhar’s paramount concern was that his system should adequately account for and accommodate the ‘unconstrained’, natural ways in which humans express themselves when they talk to others.
In addition to understanding what we say, to be truly useful, these robots need to have a visual understanding of their surroundings, and to be able to interact with objects and beings in their environments. Shridhar tackled these challenges methodically, developing and testing one solution at a time, one phase at a time. With each test, he would encounter new problems or derive new ideas that would spawn more systematic testing. Unsurprisingly, Shridhar went through numerous iterations of his system, cobbling together the framework for INGRESS with each successful outcome.
First, Shridhar used a captioning network that had already been trained to identify a diverse range of everyday objects and scenarios via machine learning, with a dataset of over 100,000 images and 4,300,000 expressions of nearly 80,000 object categories. But still, what if his system encountered something it had not seen before? How would it cope? He obsessed over, “What if...?” What if we needed to refer something by describing it in relation to other objects in the environment? So, when he was satisfied with his system’s ability to visually identify objects, he began teaching it to understand relationships between objects—e.g. ‘the cup in the middle’, in a row of three cups.
Next, Shridhar wondered, “What if we needed to refer to something by describing it in relation to our point of view, or the point of view of another?” Thus began several months of his training INGRESS to understand what it means when he instructs it to, “Pick up the bottle on your left.” And that is how Shridhar kept extending the abilities of his system to make it more and more powerful. The final “What if…?” that Shridhar addressed became the system’s failsafe: What if the system is uncertain? Whether this uncertainty is due to it being given ambiguous instructions, or a limitation in its ability to discern between similar looking objects, Shridhar programmed the system to ask for confirmation before performing an action: Do you mean this bottle?
Explaining the need for this critical feature, Shridhar said, “It’s not just about understanding what the user is saying, but it’s also at the same time communicating what it understands and what it doesn’t understand about the world. So in the case of ambiguity or uncertainty, the robot actually tells the user, ‘This is what I think’ when it asks the user ‘Do you mean this object?’, and it learns to communicate with the user. Without having to look at the code, by the way the robot asks questions and answers questions, you know what it knows and doesn’t know. Like with calling the Minion figurine ‘a boy in a red ski suit’, you know it hasn’t seen enough examples of a toy like that to correctly identify it.”
“The combination of these two things—the ability to understand unconstrained language and to ask questions when necessary, allows the robot to be flexible and robust in real world scenarios,” continued Shridhar. “Because of this ability to communicate with the user, we can place these robots in homes without having to worry about it breaking down, and we can be sure that the robot is self-reliant and that if it doesn’t understand something, it will ask the user—you don’t have to go to technical services.” Hsu remarked, “By connecting language with perceived physical objects and object relations, Mohit’s work took an important step towards our dreamed future.”
Even before Shridhar had completed all the extensions to his system, he and Hsu had the sense that they were onto something, so they decided to test the waters and see how it would fare against some external scrutiny. Shridhar wrote up his initial findings and submitted it as a workshop paper for RSS 2017 (the same conference where he and Lan won the Best Systems Paper Award for XPOSE) in July. Shridhar’s work-in-progress was validated when the paper was accepted and received positive feedback from experts in the sub-field of Spatial-Semantic Representations in Robotics. By February 2018, INGRESS had been fleshed out sufficiently to be submitted as a conference paper. Shridhar and Hsu’s paper, Interactive Visual Grounding of Referring Expressions for Human-Robot Interaction, was accepted with strong reviews and presented at RSS in June 2018. “Full of passion and creativity, Mohit is the first undergraduate student in our lab to chart a new research direction almost single-handedly. He is our pride,” said Hsu.
Now, Shridhar is undertaking a PhD in Computer Science at the University of Washington, Seattle, USA. He is still working on that robot butler.
by Toh Tien-Yi
Interactive Visual Grounding of Referring Expressions for Human-Robot Interaction