When Jungpil Hahn was appointed head of the Department of Information Systems and Analytics at NUS Computing in 2015, it changed his perspective on many things.
“I began to see the broader picture of the discipline as a whole, and began to think holistically about what we are teaching and what we are missing in the overall curriculum,” recalls Associate Professor Hahn. “That’s when I saw the urgency and extent of the problem.”
The problem he alludes to is one that pervades the industry. Hahn elaborates: “Everybody’s extremely excited about artificial intelligence. They say if we develop a machine learning model or deep learning model or neural network model, it’s all going to be awesome.” But the predictions such models make can be misleading, especially if the data used to build them is flawed.
In particular, Hahn — leader of the Garbage Can Lab research group at NUS, which studies how organisations make decisions — is interested in how imperfect models can affect companies and their business outcomes. He’s coined an umbrella term for the issue: last-mile data analytics.
“Organisations develop well-trained machine learning models and think they’re applying data analytics techniques properly, but things aren’t always as straightforward as they believe,” he explains. “So we look at some of the problems encountered in practice.”
The case of missing data
Through his work, Hahn and his lab are hoping to raise greater awareness of the issues surrounding last-mile data analytics. “We want companies to be cognisant of these kinds of problems and to think about how this influences the accuracy of the analytics they’re performing,” he says, “rather than just assuming their data is reliable and relevant.”
One reason why machine learning models are flawed has to do with missing data. At first glance, it’s a notion that seems counterintuitive in today’s world of big data, where the challenge more often centres on having too much data, not too little.
“People think that this big data revolution has naturally pushed out the problem of missing data,” says Hahn. Over the years, he’s spoken to numerous practitioners at various meetings and conferences, all of whom have told him a variation of the same thing: “We have so much data so we don’t worry about it. We just delete all the records with missing values and we’re still left with more than enough data to train our machine learning models.”
But missing data isn’t always innocuous, especially when values are missing for non-random reasons. A systematic process could have made the “missingness happen,” says Hahn. “Missing values are notoriously difficult because we don’t know if their value is zero or if we just haven’t been able to capture them.”
Failing to examine what causes data to go missing, and to apply appropriate mitigation measures in statistical analysis, leads to machine learning models that spew out inaccurate or biased predictions.
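A minimal sketch of why deleting records with missing values can mislead when the missingness is systematic. All figures here are invented for illustration and are not from Hahn’s research:

```python
import random

random.seed(0)

# Hypothetical example: customer incomes where low earners are less likely
# to report their income ('missing not at random'). Listwise deletion --
# dropping every record with a missing value -- then biases estimates.
incomes = [random.gauss(50_000, 15_000) for _ in range(10_000)]

# The probability of being reported rises with income, so deletion
# systematically removes low earners from the sample.
reported = [x for x in incomes if random.random() < min(1.0, x / 80_000)]

true_mean = sum(incomes) / len(incomes)
naive_mean = sum(reported) / len(reported)   # mean after listwise deletion

print(f"true mean income:            {true_mean:,.0f}")
print(f"mean after deleting missing: {naive_mean:,.0f}")
```

Because reporting here is tied to the value itself, having “more than enough data” left over does not help: the deleted records differ systematically from the kept ones, so the estimate stays biased no matter the sample size.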
To that end, Hahn and his team have developed a number of theoretical frameworks and mitigation techniques — which they discuss in this paper — to help practitioners deal with the challenge of missing data.
The downside to privacy
Machine learning models can also be flawed because the data they use is inaccurate. For companies that collect consumer data for marketing purposes, a common cause of this inaccuracy is PETs.
Privacy-enhancing technologies, or PETs for short, help protect an online user’s personal information. The techniques vary widely and include replacing email and IP addresses with random ones, anonymising user information and aggregating it with data from other users, and employing a cryptographic algorithm to add ‘statistical noise’ to a dataset.
“The use of PETs is only going to grow with all the news about privacy issues,” says Hahn. “Plus it’s kind of creepy how much social media services know about you.”
The proliferation of PETs is good news for consumers, but much less so for the companies who rely on their data to bolster sales and marketing efforts. “What PETs fundamentally do, from an analytics perspective, is deteriorate the quality and sanctity of the data that organisations collect about consumers’ browsing and purchasing behaviours,” says Hahn.
“If you run your analysis using that data, you’re going to get incorrect inferences,” he adds. For instance, a machine learning model might ask a consumer who’s just purchased a walking aid to take a look at the latest line of ski gear or show him an ad for an upcoming triathlon. Poor recommendations such as these can negatively impact a firm’s future sales.
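The effect Hahn describes can be illustrated with a toy simulation: adding Laplace-distributed noise to a variable (one common form of the ‘statistical noise’ PETs inject) attenuates the relationship a firm estimates from the data. The variables, slope, and noise scale below are invented for illustration:

```python
import math
import random

random.seed(1)

def laplace_noise(scale):
    """Laplace-distributed noise, a stand-in for the 'statistical noise'
    some PETs inject into user data (scale chosen for illustration)."""
    u = random.random() - 0.5
    return -scale * math.copysign(math.log(1 - 2 * abs(u)), u)

def ols_slope(xs, ys):
    """Ordinary least-squares slope of y on x."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

# Hypothetical clean data: minutes browsed vs. dollars spent, true slope 2.
n = 5_000
minutes = [random.uniform(0, 60) for _ in range(n)]
spend = [2 * m + random.gauss(0, 5) for m in minutes]
clean_slope = ols_slope(minutes, spend)

# A PET perturbs the browsing data before the firm ever sees it,
# pulling the estimated relationship toward zero (attenuation bias).
noisy_minutes = [m + laplace_noise(scale=20) for m in minutes]
noisy_slope = ols_slope(noisy_minutes, spend)

print(f"slope estimated from clean data: {clean_slope:.2f}")
print(f"slope estimated from noisy data: {noisy_slope:.2f}")
```

A model fitted on the noisy data underestimates how strongly browsing predicts spending, which is one route to the kind of mismatched recommendations Hahn describes.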
Hahn and his PhD student Dawei Chen discussed these implications in a paper published last December. As part of their research, the pair ran various simulations to test how different types of PETs and their adoption rates would impact data quality and how quickly the decline would happen.
They also explored possible mitigation measures firms could take to counter the impacts on prediction models, including what would happen if these data points were excluded altogether.
“Our main work is to raise awareness of these loopholes and to quantitatively document the extent of the impact this may have,” says Hahn. Eventually, he hopes to create software packages that firms can directly download and plug into their analytics tool to adjust their models accordingly.
To wait or not to wait?
Another last-mile issue Hahn has been studying is the question many data analysts puzzle over: when should I retrain my models?
Data environments are dynamic, producing new information over time that may alter the accuracy of a model’s initial predictions. For instance, a finance model might need to be recalibrated to consider the recession that’s just occurred. Or a natural language processing model that recognises speech and language might have to be updated to account for the 26 new Korean words recently added to the Oxford English Dictionary.
Detecting changes is one thing (they don’t always happen with a bang), but deciding when to implement them into a machine learning model is another. “Sometimes we don’t have enough data to reflect the changed environment until a sufficient amount of time has passed,” explains Hahn.
“If you retrain too soon, you have a less reliable model. But the longer you wait, the more inaccurate your predictions will be with the prior model,” he says. “So there’s an inherent cost-benefit tradeoff in terms of how long I should wait, whether historical data is relevant and if I should somehow leverage it in the retraining.”
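The tradeoff Hahn describes can be sketched in a toy simulation of a drifting environment. The slopes, noise level, and sample sizes below are invented for illustration and are not from the paper:

```python
import random

random.seed(2)

# The true relationship between an input and an outcome shifts ('drifts'),
# and we compare a stale model against models retrained on small versus
# large post-drift samples.
SLOPE_BEFORE, SLOPE_AFTER = 1.0, 3.0
NOISE_SD = 15.0

def sample(slope, n):
    """Draw n noisy (x, y) points from y = slope * x + noise."""
    pts = []
    for _ in range(n):
        x = random.uniform(1, 10)
        pts.append((x, slope * x + random.gauss(0, NOISE_SD)))
    return pts

def fit_slope(points):
    """Least-squares slope through the origin."""
    return sum(x * y for x, y in points) / sum(x * x for x, _ in points)

# A model trained on plenty of pre-drift data is precise but now wrong.
stale_model = fit_slope(sample(SLOPE_BEFORE, 1_000))
stale_err = abs(stale_model - SLOPE_AFTER)

# Average the retrained model's error over many trials to see the tradeoff.
trials = 500
early_err = sum(abs(fit_slope(sample(SLOPE_AFTER, 5)) - SLOPE_AFTER)
                for _ in range(trials)) / trials    # retrain too soon
later_err = sum(abs(fit_slope(sample(SLOPE_AFTER, 200)) - SLOPE_AFTER)
                for _ in range(trials)) / trials    # wait for more data

print(f"stale model error:            {stale_err:.2f}")
print(f"retrain on 5 points, error:   {early_err:.2f}")
print(f"retrain on 200 points, error: {later_err:.2f}")
```

Retraining immediately already beats keeping the stale model here, but the early retrain is far noisier than the later one; how long to wait depends on how costly that extra variance is relative to the bias of the outdated model.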
Hahn and his then-PhD student Peng Jiaxu, now an assistant professor at Beijing’s Central University of Finance and Economics, created a framework for exploring this tradeoff. Their paper, which was presented at the 2020 International Conference on Information Systems (ICIS), was awarded Best Paper in the Advances in Research Method category.
Today, Hahn continues to explore the issue of last-mile data analytics. In the end, he says, “we hope to take the guesswork out of analytics.”