New Algorithm Revolutionises a Decades-Old Estimation Problem

27 March 2023
Associate Professor
Computer Science
SHARE THIS ARTICLE

When Covid-19 came barrelling through the world, it upended nearly every aspect of our lives, forcing us to live, work, and play in completely new ways. We became accustomed to things we previously held off as a last resort or long resisted — things like face masks, Zoom, and having our movements monitored.

The last was particularly interesting to observe from a computer science point of view, says Kuldeep Meel, an assistant professor at NUS Computing. During the pandemic, places such as malls, medical facilities, and offices in numerous countries began keeping track of their visitors for the purpose of contact tracing.

In theory, this sounded simple — staff would be stationed at entrances to collect personal details or have visitors check in with a purpose-built app. But the reality was more complicated, says Meel. “One way to do it was to keep track of everyone’s identification numbers, but the problem is you end up with a really big table of data.”

Academics refer to this as the problem of counting distinct elements. “It involves finding the number of unique items in a stream of data,” explains Meel, whose work focuses on artificial intelligence and automated reasoning. Pandemic aside, the ability to perform such counting is extremely useful, especially in the realm of the Internet. For instance, online businesses like to know which products most people view on their webpages, news outlets monitor which articles get the most reads, and social media platforms seek to understand how many unique people are interacting with and sharing a particular post.

“In short, the Distinct Elements problem is a crucial part of many of the technologies that we rely on in our daily lives,” says Meel. “The problem is so important that special algorithms have been developed to solve it.”

However, none of these approaches are ideal, despite researchers having studied this “fundamental challenge” for nearly four decades, he says. The main difficulty lies in figuring out how many visitors to a site are actually unique — for example, distinguishing between individuals who visit a particular mall only occasionally versus those who live close by and go twice a day.

“There were either very complicated algorithms whose analysis is often beyond graduate courses, or algorithms that aren’t really practical and don’t run really well,” says Meel.

But last September, he and his collaborators — Sourav Chakraborty, a theoretical computer scientist at the Indian Statistical Institute in Kolkata, and computing professor Vinodchandran Variyam at the University of Nebraska-Lincoln — announced they had come up with a straightforward, yet elegant, solution.

Insightful sampling

The idea came to the trio one warm evening in December 2019. Variyam was on sabbatical at NUS for a semester while Chakraborty, a long-term collaborator of Meel’s, was in town for the week. The latter had just finished giving a talk so the researchers decided to head to a German gastropub to unwind and have dinner. As they dined al fresco and enjoyed views of the sea, conversation naturally turned to their shared passion: mathematics and computer science.

“It was there that a kind of sudden realisation struck us. We almost came up with the complete idea there and then, and wrote it down on a napkin,” recalls Meel, laughing at the cliché.

Their algorithm works on a deceptively simple premise: to get an estimate of unique visitors to a particular place, one only needs to set a threshold for that estimate and correspondingly keep track of, or sample, a small number of people.

To highlight how this works in practice, Meel uses the mall as an example. Say a person manning the entrance jots down the ID numbers of the first 50 people who pass through. For each of these numbers, the person then tosses a fair coin — one that has an equal probability of landing on either heads or tails. If the coin lands on tails, the ID number is removed from the list. If it’s heads, it stays. Eventually, the person ends up with a whittled-down list of roughly 20 numbers. The probability (p) in this instance is ½.

When the list fills up again as more visitors come to the mall, the person repeats the same process: flipping coins and removing IDs that correspond to a tails outcome. The probability reduces to ¼. This is repeated again and again, until finally the estimate of the number of unique visitors to the mall is simply the number of entries on the list (i.e. 50) divided by the current p value.

The trio’s algorithm is beautiful for both its simplicity and effectiveness. “We were amazed that such a simple algorithm could work so well,” says Meel. “It provides rigorous theoretical guarantees while also performing well in practice.”

Additionally, the concept is easy to understand and can be taught to anyone, including undergraduate students, he adds. “This is a major advantage because it means that students will be able to learn about this important problem and the latest technique for solving it, even if they are not enrolled in advanced graduate-level courses. This will enable more students to gain exposure to this important topic.”

A leap forward

Following that inspired dinner, the researchers spent the next year ironing out the finer details of their new algorithm. In September 2022, they published their findings at the 30th Annual European Symposium on Algorithms (ESA) in Germany.

In that time, Meel and his co-authors also took their research one step further, this time applying it to cases where each element in the data stream was a set of numbers (such as ID numbers or IP addresses), rather than a single number. This is a significant advancement, he says, for it will enable researchers and practitioners to tackle a wider range of problems that involve estimating distinct elements.

“Previous algorithms were not able to handle this type of data, which limits their applicability to many real-world scenarios,” says Meel. “But our new algorithm is able to deal with this type of data flawlessly.”

“The algorithm has the potential to be a powerful tool with potential applications in spatial reasoning, databases, and probabilistic reasoning,” he says.

Paper: Distinct Elements in Streams: An Algorithm for the (Text) Book

Trending Posts