Datasets that are “dirty” - that is, containing wrong or incomplete information - aren’t a new problem. Think of an old-school phone book or voters’ list - how much of the information in it would still be accurate and relevant a year after compilation? Today’s much bigger data streams can’t avoid the same difficulty, notes Vaclav Vincalek, CEO of PCIS, but machine-learning tools can help with the onerous task of ongoing cleanup.
Read Vaclav Vincalek on how to corral rogue data.
“‘It’s a serious challenge. In organizations, it’s usually 40-50% of the effort that goes into these kinds of manual tasks around machine learning,’ said Vaclav Vincalek, CEO of Vancouver-based PCIS.
“‘Realistically, it’s not going to get better, as organizations will keep getting more and more data.’
To counter that, part of your machine learning project has to be data quality assurance. IT leaders need to know that if you get a new data set, it complies with your requirements. The problem is not just about how to get clean data, but how to correlate data from different sources.”
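To make the idea of checking a new data set against your requirements a little more concrete, here is a minimal sketch in Python. The field names and the rules in `REQUIRED_FIELDS` are hypothetical, invented for illustration rather than taken from the article; a real project would define its own requirements and would probably use a dedicated validation tool.

```python
from typing import Any

# Hypothetical requirements: field name -> expected type.
# These are illustrative assumptions, not rules from the article.
REQUIRED_FIELDS = {"name": str, "phone": str, "updated_year": int}


def find_problems(record: dict[str, Any]) -> list[str]:
    """Return a list of human-readable problems found in one record."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        value = record.get(field)
        if value is None or value == "":
            # Incomplete information: the field is absent or empty.
            problems.append(f"missing {field}")
        elif not isinstance(value, expected_type):
            # Wrong information: the value has an unexpected type.
            problems.append(f"{field} should be {expected_type.__name__}")
    return problems


def split_clean_and_dirty(records: list[dict[str, Any]]):
    """Separate records that pass every check from those that do not."""
    clean, dirty = [], []
    for record in records:
        problems = find_problems(record)
        if problems:
            dirty.append((record, problems))
        else:
            clean.append(record)
    return clean, dirty
```

Running `split_clean_and_dirty` on each incoming batch gives you a list of records that meet the stated requirements and a separate list of rejects with the reasons attached, which is a starting point for the ongoing cleanup rather than a complete answer to it. It does nothing about the harder part Vincalek mentions, correlating data from different sources.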
It’s interesting, albeit frustrating, to realize this is fundamentally an irreducible problem - at least for companies that don’t recognize from the outset what they need to do. As the data scientist quoted in the article notes, a useful dataset cannot be “a sanitized version of reality.” And reality, to paraphrase the old saying about history, is just one damn datum after another.