Imagine that you’re a book publisher gathering feedback for a new novel that your firm has recently released. Sales figures are useful, but you’re keen to find out more about what people actually think of the book. So you gather Amazon-style reviews, asking respondents to rate it on a scale of one to five.
Your goal was to hit 1,000 reviews, but you only have 960. Do the forty missing reviews matter? Probably not, you think. After all, 960 is still a pretty big figure. So you happily crunch the numbers and post the new book’s average rating on the company’s website.
But hang on, not so fast, says Jungpil Hahn, a professor at NUS Computing’s Department of Information Systems and Analytics. “We tend to think that dropping a few observations here and there isn’t going to matter, given that there’s enough sample,” he says. “But it can matter.”
For instance, what if one of the reviews you omitted was from an eminent book reviewer of a leading newspaper? Or a thought leader such as Barack Obama? Or a celebrity like Oprah who has a famous book club? Their say on the new novel could hold significant sway and convert someone who’s sitting on the fence into a paying customer.
The bottom line, Hahn says, is that when values are missing from a dataset, things can get problematic. This is especially pertinent because we now live in an age of big data, where voluminous, complex datasets are compiled and information is extracted from them to guide decision making. Big data’s applications are widespread: companies use it to figure out what their customers want, humanitarians use it to predict and respond to natural disasters, and physicians use it to diagnose diseases.
“Having high-quality data is something we would all like,” says Hahn, whose research partly focuses on data science and business analytics. “But in the real world, this is almost never the case. You’re always dealing with data imperfections.”
An unseen problem
Missing values are one such imperfection. But it is a somewhat neglected issue, says Hahn, one that requires a massive rethink on the part of data scientists. He has spent the past few years trying to raise awareness of the problem, and recently published a paper on the topic, together with NUS Computing colleague Huang Ke-Wei and former PhD student Peng Jiaxu, now an assistant professor at Beijing’s Central University of Finance and Economics.
The paper, which provides a comprehensive look at the problem of missing values and how to handle them, was published this year in the journal Information Systems Research.
“With big data, because we have so much of it, we are disillusioned to think that we can simply ignore the missingness without really thinking deeply through it,” explains Hahn. “But that’s very dangerous because you can get incorrect inferences if you approach your analytics in such a way.”
The result? Imbalanced customer reviews for one, says Huang. Or datasets that inaccurately predict a firm’s financials, or surveys that poorly reflect a company’s IT resource requirements (such as how many workers are needed, what the costs are, and so on) — one of the most common types of datasets you find in information systems work, he says.
However, most companies don’t currently tackle the problem of missing values head-on. Few data scientists are aware of the issue, much less know how to deal with it. “There is a standard operating procedure (SOP) of conducting research in information systems,” explains Huang. “But somehow in that SOP, the handling of missing values is overlooked.”
“They don’t even report it, and we can only guess that they just delete the data row or arbitrarily put in some values using subjective assessment,” he adds. “But those measures only really work under restrictive mathematical assumptions.”
Even in academia, where researchers “should be very rigorous in reporting how they use their data,” people don’t disclose enough information about how they handle missing values, he says.
“I think the main reason is because researchers don’t really know about it,” says Huang. “Which is why we’ve tried to increase awareness around the issue.”
Awareness as the first step
To that end, the team has detailed in their paper the three main categories of missing data: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR).
It’s the last two that are problematic, says Hahn. “There’s no way to statistically prove whether it’s type one, two, or three, but you need to think about what it’s most likely to be and really think about the nature of why the values are missing.”
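The distinction matters in practice. A small simulation sketches why: with hypothetical one-to-five star ratings and made-up missingness probabilities (none of which come from the paper), MCAR leaves the average roughly intact, while MNAR, here standing in for dissatisfied readers who never bother to post a review, pulls the observed average away from the truth.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population of 1,000 one-to-five star ratings.
ratings = rng.integers(1, 6, size=1000).astype(float)
true_mean = ratings.mean()

# MCAR: every rating has the same 20% chance of going missing.
mcar_mask = rng.random(1000) < 0.2
mcar_mean = ratings[~mcar_mask].mean()

# MNAR: low ratings are far more likely to go unreported,
# e.g. unhappy readers who never write a review at all.
p_missing = np.where(ratings <= 2, 0.6, 0.05)
mnar_mask = rng.random(1000) < p_missing
mnar_mean = ratings[~mnar_mask].mean()

print(f"true mean: {true_mean:.2f}")
print(f"MCAR mean: {mcar_mean:.2f}")   # stays close to the truth
print(f"MNAR mean: {mnar_mean:.2f}")   # biased upward
```

Under MCAR the observed rows are still a random sample, so the average barely moves; under MNAR the missingness itself carries information, and simply ignoring the gaps inflates the rating.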
Only after this identification can you apply the appropriate remedy, he says. The ‘solutions’ vary, and the team offers a handful of suggestions in their paper, including estimation-based approaches such as maximum likelihood and multiple imputation methods.
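As a rough sketch of the multiple imputation idea (the variables, data-generating model, and missingness mechanism below are all invented for illustration, not taken from the paper), one can fill each gap several times with plausible draws from a model fitted on the observed rows, then pool the resulting estimates:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Hypothetical survey: x is fully observed, y (say, an IT cost figure)
# goes missing more often when x is high -- a MAR mechanism, since
# missingness depends only on the observed x.
x = rng.normal(0, 1, n)
y = 2.0 + 1.5 * x + rng.normal(0, 1, n)
missing = rng.random(n) < 1 / (1 + np.exp(-x))
y_obs = np.where(missing, np.nan, y)

# Fit a regression of y on x using the complete cases.
obs = ~np.isnan(y_obs)
X = np.column_stack([np.ones(obs.sum()), x[obs]])
beta, *_ = np.linalg.lstsq(X, y_obs[obs], rcond=None)
resid_sd = np.std(y_obs[obs] - X @ beta)

# Multiple imputation: build m completed datasets by drawing each
# missing y from the fitted model plus noise, estimate the mean of y
# in each, then pool the estimates.
m = 20
estimates = []
for _ in range(m):
    y_fill = y_obs.copy()
    miss = np.isnan(y_fill)
    y_fill[miss] = (beta[0] + beta[1] * x[miss]
                    + rng.normal(0, resid_sd, miss.sum()))
    estimates.append(y_fill.mean())

pooled = np.mean(estimates)
naive = np.nanmean(y_obs)   # complete-case mean, biased under MAR

print(f"full-data mean:     {y.mean():.2f}")
print(f"complete-case mean: {naive:.2f}")
print(f"pooled MI mean:     {pooled:.2f}")
```

Because the missingness here depends only on the observed x, a model that conditions on x can recover an essentially unbiased estimate, whereas simply dropping the incomplete rows cannot; this is the kind of tradeoff between extra effort and reduced bias that Hahn describes.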
Hahn admits that these solutions can be complex and time-consuming to use. “They reduce bias but it’s also very costly, so there’s a tradeoff,” he says. “It’s not a panacea for missing values.”
His team provides the theory and mechanisms for dealing with missing values, but data scientists still have to manually apply these to their own datasets. In the future, Hahn hopes to be able to collaborate with software engineers to create a “usable piece of software” that will allow practitioners to “plug and play” their solutions.
For now, he strongly urges data scientists to be transparent and upfront about disclosing the gaps in their datasets. And even if they don’t use his team’s solutions, “just being aware of missing values and knowing the potential biases that can arise is helpful,” says Hahn. “It gives you more nuanced information.”