In a recent Forbes article, Adrian Bridgwater uses a laundry analogy to examine the challenges of “dirty data.” Likening data cleansing to washing clothes, he argues that data is routinely dirtied by application data silos, cumbersome manual workflows, and information that exists out of context, all of which mean it has to go through the “enterprise laundry.”
As TIBCO’s Global CTO Nelson Petracek told Bridgwater, “Data… goes through numerous cycles—and let’s say wash cycles to continue the analogy. First, just like stains, it is often easier to deal with dirty data closer to its source, or right after it is created.” Petracek then goes on to explain in more depth the best ways to deal with dirty data, which we’ve broken down for you below:
Pretreat Your Data Stains at the Edge
Petracek underscores that dealing with dirty data at the source where it is generated will become increasingly important as Internet of Things (IoT) devices and remote data collection grow more popular. Rather than wait until the dirty data has made its way back to the organization, companies should clean it up right after it is generated, before it has the chance to affect downstream storage systems or applications. Companies can use edge analytics, artificial intelligence (AI) and machine learning (ML) applied at the edge, and data transformation and filtering to cleanse and enrich data before any other component interacts with it.
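To make “transformation and filtering at the edge” a little more concrete, here is a minimal Python sketch of what pretreating readings on a device might look like. Everything in it is an assumption made for illustration and does not come from the Forbes piece or from any TIBCO product: the record fields (`temp`, `unit`), the plausible temperature range, and the function names are all hypothetical.

```python
from typing import Iterable, Iterator, Optional

# Hypothetical plausible range for a temperature sensor, in Celsius.
VALID_RANGE_C = (-40.0, 85.0)


def clean_reading(raw: dict, device_id: str) -> Optional[dict]:
    """Return a cleaned, enriched reading, or None if it should be dropped."""
    try:
        value = float(raw["temp"])
    except (KeyError, TypeError, ValueError):
        return None  # filter: missing or non-numeric value

    # Transform: normalize Fahrenheit readings to Celsius.
    if str(raw.get("unit", "C")).upper() == "F":
        value = (value - 32.0) * 5.0 / 9.0

    # Filter: drop values outside the plausible range (this also rejects NaN).
    low, high = VALID_RANGE_C
    if not (low <= value <= high):
        return None

    # Enrich: attach the device identifier so the reading keeps its context.
    return {"device_id": device_id, "temp_c": round(value, 2)}


def clean_at_edge(readings: Iterable[dict], device_id: str) -> Iterator[dict]:
    """Yield only the readings that survive cleaning, one at a time."""
    for raw in readings:
        cleaned = clean_reading(raw, device_id)
        if cleaned is not None:
            yield cleaned
```

Because `clean_at_edge` is a generator, only clean, contextualized records would ever leave the device in this sketch, so downstream storage and applications never see the malformed ones.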
The Speed Factor
Speed also shapes how organizations approach data cleansing, because the rate at which data is produced constrains where and how it can be cleaned. Addressing dirty data at the network level, for example, requires the ability to process the information at high speeds. Streaming data from an IoT device and batch data stored in a data lake each call for their own approach, which is why it’s so important to consider both the speed factor and the location where dirty data will be addressed. If an approach is applied at the wrong speed or location, data quality is unlikely to improve, and there is a good chance that performance or some other aspect of the system will suffer. As Petracek succinctly states, “Put the washing machine on ‘high’ with a bunch of running shoes in it, and see what happens.”
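One way to see why streaming and batch data need different treatment is to apply the same cleansing rule in both settings. The Python sketch below uses duplicate removal as a stand-in rule. That choice is an assumption for illustration, not something taken from the article, and the `id` field, the function names, and the `window` parameter are all hypothetical.

```python
from collections import OrderedDict
from typing import Iterable, Iterator, List


def dedupe_batch(records: Iterable[dict]) -> List[dict]:
    """Batch: the whole dataset is available, so keep the first record per id."""
    first_seen = {}
    for record in records:
        first_seen.setdefault(record["id"], record)
    return list(first_seen.values())


def dedupe_stream(records: Iterable[dict], window: int = 1000) -> Iterator[dict]:
    """Streaming: remember only the most recent `window` ids to bound memory."""
    recent = OrderedDict()
    for record in records:
        key = record["id"]
        if key in recent:
            recent.move_to_end(key)  # refresh the id's position; skip the duplicate
            continue
        recent[key] = None
        if len(recent) > window:
            recent.popitem(last=False)  # forget the least recently seen id
        yield record
```

The batch version can afford to remember every id it has seen, while the streaming version trades completeness for bounded memory and low latency: with a window that is too small, duplicates that arrive far apart slip through. That is the kind of speed-and-location trade-off the paragraph above describes.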
Consider Data’s Characteristics
Data may be derived from raw data sources, enriched with other data sources, or merged with other datasets. These and other factors affect data quality, so it’s critical to consider the characteristics of data from both a technical and a business perspective. One approach suggested by Petracek is to think about who else in the organization has used the data to make business decisions, and what the outcomes of those decisions were.
For more on the above and other considerations for eliminating dirty data and giving it that fresh and clean feeling, check out the Forbes piece here.