Confined to their homes during the circuit breaker period, Singapore’s Covid-19 lockdown, people began ordering certain products in earnest: fitness equipment, home office accessories, flour and other baking goods. If, like many, you were forced to turn to online shopping in recent months, you might have realised what a complex beast it can be.
The wealth of products available is staggering, and trying to find the precise item you are after is sometimes a frustrating task. For example, do you look for a new running top in the ‘Sportswear’ category of an online clothing store? Or would you find it under ‘Tops & T-Shirts’? Or perhaps it comes under the ‘Running’ section?
“E-commerce platforms categorise their products into a multi-level taxonomy tree with thousands of leaf categories,” explains Stanley Kok, an assistant professor from NUS Computing who specialises in the fields of AI, data science, and machine learning. Buyers begin by identifying which broad category their desired product falls under, narrowing their search using various sub-categories (called branches) until they eventually find what they are looking for (on a specific ‘leaf’ node).
To order a running top from the Royal Sporting House, a Singapore-based sports retailer, for example, you need to first select the ‘Women’s Clothing’ category on its website, followed by ‘T-Shirt & Tops’. Alternatively, you can find running tops under ‘Women’s Sports’ → ‘Running.’
“One problem with this system is that the number of classifiers grows exponentially with every step,” says Kok. “If you have a tree with four levels, you would need a minimum of 2⁴, or 16, classifiers. With 10 levels, that’s 2¹⁰.”
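The growth Kok describes can be checked with a few lines of arithmetic. The sketch below assumes a complete tree in which every category splits into the same number of sub-categories, and it counts one classifier per non-leaf node. Both are illustrative assumptions, and the function names are hypothetical rather than anything taken from Kok's work.

```python
def categories_at_level(branching: int, level: int) -> int:
    """Number of categories at a given depth of a complete tree (root is level 0)."""
    return branching ** level


def stepwise_classifiers(branching: int, levels: int) -> int:
    """One classifier per non-leaf node, assuming a complete tree `levels` deep."""
    return sum(branching ** depth for depth in range(levels))


if __name__ == "__main__":
    for levels in (4, 10):
        print(
            levels,
            categories_at_level(2, levels),
            stepwise_classifiers(2, levels),
        )
```

Under these assumptions a binary tree four levels deep has 16 categories at the bottom, and one 10 levels deep has 1,024. Real taxonomies branch far more widely, so the counts climb faster still.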
Given that product catalogues can contain millions of goods, with new ones added on a daily basis, taxonomy trees are forever expanding and can be many levels deep. “But the current classification system just doesn’t scale and it’s a huge maintenance cost for companies,” says Kok.
Additionally, such stepwise classification can be prone to errors. List something wrongly on an early ‘branch’ of the taxonomy tree, and the mistake gets propagated down to the lower levels, with the product eventually being listed on the wrong ‘leaf.’ The result: a frustrated potential buyer and a possible lost sale.
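A rough way to see how early mistakes compound: suppose, purely for illustration, that the classifier at each level picks the right branch with the same probability, independently of the other levels. The chance of reaching the right leaf is then that probability raised to the number of levels. The 95 per cent figure below is a made-up example, not a measured accuracy.

```python
def leaf_accuracy(per_level_accuracy: float, levels: int) -> float:
    """Chance of landing on the correct leaf when every level must be right.

    Assumes each level errs independently with the same probability,
    which is a simplification made for this sketch.
    """
    return per_level_accuracy ** levels


if __name__ == "__main__":
    for levels in (1, 4, 10):
        print(levels, round(leaf_accuracy(0.95, levels), 3))
```

With these assumed numbers, a classifier that is right 95 per cent of the time at each step places a product on the correct leaf about 81 per cent of the time in a four-level tree, and under 60 per cent of the time in a 10-level one.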
The need for accurate listings, however, goes beyond keeping customers happy. “Correctly categorising a new product into the taxonomy is fundamental to many business operations,” says Kok. “For example, it helps firms apply the right censorship policies, bill items using the correct tax rate, determine the appropriate shipping fees, and direct customers to the correct department when they call the company helpline.”
A weird language
Recognising these classification problems and wanting to help companies have accurate online product catalogues, Kok and his colleagues sought to create a new classification technique.
What they came up with was so outlandish that even Kok admits it sounded like “a hare-brained idea” at the time. It was mid-2018, and Kok had just joined the NUS faculty, fresh from a job in industry where he worked as a principal researcher at Japanese e-commerce site Rakuten, one of the largest such sites in the world.
“I noticed that because e-commerce companies operate globally, they have to translate their product titles into various languages,” he says. “And since many of them have already invested a large amount of time, energy, and money researching and improving their machine translation systems, they’re already pretty good.”
“So I thought: it’s good technology. Why don’t we just use this machine translation system to do product categorisation? Companies will get more return on investment too,” Kok says. He admits it was “not an obvious reach” because the two tasks — of language translation and product categorisation — seem inherently different on the surface. But, upon closer inspection, the two actually bear sufficient similarities for the idea to work.
“They both involve linear structures,” explains Kok. In conventional machine translation, one language is converted into another, using the notion that each sentence is a linear structure. For product categorisation, this structure takes the form of a path from the root to the leaves of a taxonomy tree.
“Effectively, we’re translating from English to a categorisation language,” says Kok. “It’s a very weird kind of language but still a language.”
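To make the “categorisation language” concrete, the sketch below treats each root-to-leaf path as a sentence whose words are category names. The tiny taxonomy reuses the Royal Sporting House categories mentioned earlier. The product title, the data layout, and the function names are hypothetical, and this is not the researchers' actual system, which is not described here in implementation detail.

```python
# Toy taxonomy: each key is a category, each value holds its sub-categories.
# An empty dict marks a leaf.
TAXONOMY = {
    "Women's Clothing": {"T-Shirt & Tops": {}},
    "Women's Sports": {"Running": {}},
}


def root_to_leaf_paths(tree: dict, prefix: tuple = ()) -> list:
    """List every root-to-leaf path as a tuple of category names."""
    paths = []
    for name, children in tree.items():
        path = prefix + (name,)
        if children:
            paths.extend(root_to_leaf_paths(children, path))
        else:
            paths.append(path)
    return paths


def is_existing_path(path: tuple, tree: dict) -> bool:
    """True if the path is already in the taxonomy, False if it is novel."""
    return tuple(path) in set(root_to_leaf_paths(tree))


# A translation-style training pair: source "sentence" is the product title,
# target "sentence" is the category path (title invented for illustration).
source_tokens = "women's running top".split()
target_tokens = ("Women's Sports", "Running")

if __name__ == "__main__":
    print(source_tokens, "->", target_tokens)
    print(is_existing_path(target_tokens, TAXONOMY))
    print(is_existing_path(("Women's Clothing", "Running"), TAXONOMY))
```

A translation model trained on such pairs emits category names one after another, so it can produce a path that already exists in the tree or one that does not, such as the second path checked above.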
Neural machine translation is similar to existing machine learning techniques in that both map text to vectors. The difference is that the former produces vectors made up of varied numbers, rather than simple strings of ones and zeros. It is a crucial distinction that allows the technology to recognise synonyms, and by extension similar products, even when two words are spelt very differently.
Take the words ‘river’ and ‘stream’ as examples, says Kok. If the vector representing them has three positions, neural machine translation might represent one word as ‘6.1, 3.2, 1.5’ and the other as ‘6.2, 3.5, 1.6’. The similarity of these numbers implies they are “very close together, and hence they could be translated as closely related things,” he says.
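The closeness Kok describes can be measured directly. The sketch below uses the two example vectors from his illustration and, for contrast, a ones-and-zeros encoding in which each word simply gets its own slot. The two-word vocabulary and the helper functions are assumptions made for this example.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Vectors quoted in the article's example.
river = (6.1, 3.2, 1.5)
stream = (6.2, 3.5, 1.6)

# Ones-and-zeros encoding over an assumed two-word vocabulary.
river_one_hot = (1, 0)
stream_one_hot = (0, 1)

if __name__ == "__main__":
    print(euclidean(river, stream), cosine(river, stream))
    print(euclidean(river_one_hot, stream_one_hot),
          cosine(river_one_hot, stream_one_hot))
```

The numeric vectors sit close together, with a cosine similarity above 0.99, while the ones-and-zeros vectors have a similarity of exactly zero: that encoding carries no hint that the two words are related.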
As a result, just as “there’s more than one way to convert and express an English sentence to Chinese,” machine translation technology can provide more than one description or categorisation for a product, says Kok. “It is able to create both existing root-to-leaf paths and novel paths, which is important because it’s very human to view a product in multiple ways.”
Additionally, machine translation provides the benefit of being resilient to the vagaries of language and errors in a product’s description or labelling. The technology, for example, would have no trouble recognising that a ‘Mix Pancake Waffle 24 oz, Pack of Six’ is the same as a ‘Packet of Six: Waffle Pancake Mix, 24 oz.’
Machine translation offers yet another advantage over traditional classification systems: improved accuracy. When Kok and his collaborators applied both approaches to two large real-world Rakuten datasets, which together comprised nearly two million product titles, they found that their new technique correctly categorised products (according to their existing leaves) more often than the conventional machine learning classifiers did.
“We were very surprised. We didn’t expect it to work in such a straightforward manner,” says Kok. For future work, he would like to conduct user experiments to study how convenient novel root-to-leaf paths are for customers and whether that convenience converts to actual sales. Additionally, he hopes to integrate pictures alongside text into the machine translation framework to help improve categorisation accuracy.