Tired of Data Quality Discussions?

Five steps to avoid these discussions.

I have been in tons of meetings where data and the results of some analysis were presented. Most of these meetings have one thing in common: the data quality is challenged, and most of the meeting time is spent discussing potential data quality issues. The number one follow-up from the meeting is to verify the open questions, and then we start all over again. Sound familiar?

Photo by Luke Chesser on Unsplash

It can be different. There are meetings where these discussions don’t take place, or where they start but are immediately taken care of. I have seen and been involved in a few. And there was ONE difference between these two types of meetings that I have seen over and over again. In the meetings that got stuck, the person presenting the data was not on top of their data, was not anticipating questions, and was not thinking a step further.

In fact, many of these data quality discussions are not actually about data quality but about a lack of understanding of what the data means: for example, its hierarchy or structure, and the interpretation of the metrics. It is very easy when you don’t understand something to blame the data quality, but usually, the issue lies somewhere else.

Let’s assume you are working on some exploratory data analysis to get started with AI. The key to success is to really understand the data you are working with. If the quality is not up to standard, bring it up to standard or find a way to work with the data nonetheless. Be proactive and it will go a long way.

1. Start small

The key here, as with so many things, is to start small. If you are looking at a handful of features, you can actually dig into what these features mean. If you start with hundreds, it will be more difficult. As an example, let’s look at the number of products per customer, which is clearly a small starting point.

2. Make sure you understand your data

Because you started small, you are able to dig deep. Do your correlation plots, look at the frequencies, and read the documentation on these features.
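To make this concrete, here is a minimal sketch of a frequency check for the products-per-customer example. It assumes pandas and a made-up table with one row per customer and product; the column names `customer_id` and `product` are hypothetical and will differ in your own data.

```python
import pandas as pd

# Hypothetical data: one row per customer/product pair.
orders = pd.DataFrame({
    "customer_id": ["C1", "C1", "C2", "C3", "C3", "C3"],
    "product": ["A", "B", "A", "A", "B", "C"],
})

# Number of distinct products per customer.
products_per_customer = orders.groupby("customer_id")["product"].nunique()

# Frequency table: how many customers hold 1, 2, 3, ... products.
frequency = products_per_customer.value_counts().sort_index()

# Summary statistics and frequencies to eyeball before going further.
print(products_per_customer.describe())
print(frequency)
```

Even a table this small raises the questions that matter: what counts as a customer, and what counts as a product?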

Because you started small, you are able to dig deep and truly understand the data

In our example, we basically have two features to look at, and both have a large potential for discussion. I once spent about three months defining what is meant by “customer”, an especially difficult question when working in a B2B environment. Depending on the company you work in, there may be different levels of products in use, each of which can be of interest to a different type of role. A product manager can be interested in a different hierarchy than the head of sales of a region.

3. Verify the data quality

There may already be standard ways in which the data quality is checked, and you should understand them and be able to explain them. I recommend going a step beyond the usual checks. Check for inconsistencies from a business perspective. Are most of your customers’ jobs listed as “Accountant”? Think again: it may simply be the top selection of the drop-down list. Another typical quality issue is inconsistency between systems. Be sure you know these inconsistencies, what drives them, and their implications.
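Both checks can be sketched in a few lines. The example below assumes pandas and two made-up extracts, `crm` and `billing`; the table names, column names, and the 50% threshold are illustrative assumptions, not a standard.

```python
import pandas as pd

# Hypothetical extracts from two systems.
crm = pd.DataFrame({
    "customer_id": ["C1", "C2", "C3", "C4"],
    "job_title": ["Accountant", "Accountant", "Accountant", "Engineer"],
})
billing = pd.DataFrame({"customer_id": ["C2", "C3", "C4", "C5"]})


def dominant_share(series):
    """Return the most frequent value and its share of all non-null rows."""
    shares = series.value_counts(normalize=True)
    return shares.index[0], float(shares.iloc[0])


# Check 1: is one category suspiciously dominant (possible drop-down default)?
top_value, share = dominant_share(crm["job_title"])
if share > 0.5:  # illustrative threshold, tune it to your own data
    print(f"'{top_value}' covers {share:.0%} of rows; is it a default value?")

# Check 2: which customers exist in only one of the two systems?
comparison = crm.merge(billing, on="customer_id", how="outer", indicator=True)
mismatches = comparison[comparison["_merge"] != "both"]
print(mismatches[["customer_id", "_merge"]])
```

The code only flags candidates. Whether a dominant value is a real pattern or a default, and why a customer is missing from one system, still has to be answered from the business side.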

4. Anticipate the issues

You can anticipate quite a few questions and issues. What are the questions you typically get? What KPIs have been reported to your audience? What discussions have taken place in the past? Which words are used in the daily discussions? The last one, for example, should give you a good sense of the product split your audience is looking at (spoiler alert: it may well be none of the splits in your data). Make sure you understand the different levels, why they are used, and how.

Anticipating the issues will allow you to divert from the data quality discussion

In my example, there were different product hierarchies (from different systems) that were used by different audiences. I built both hierarchies into my dashboard and was able to explain the overlap and the differences between the two.
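One way to get a grip on the overlap between two hierarchies is a simple cross-tabulation. This is a minimal sketch assuming pandas and a made-up product table in which each product carries a group from each system; `sales_group` and `finance_group` are hypothetical names.

```python
import pandas as pd

# Hypothetical product master with one grouping per source system.
products = pd.DataFrame({
    "product": ["P1", "P2", "P3", "P4"],
    "sales_group": ["Hardware", "Hardware", "Services", "Services"],
    "finance_group": ["Devices", "Devices", "Devices", "Consulting"],
})

# Count products for every combination of the two hierarchies.
overlap = pd.crosstab(products["sales_group"], products["finance_group"])
print(overlap)

# Sales groups that map to more than one finance group are the ones
# most likely to trigger a "the numbers don't match" discussion.
groups_hit = overlap.gt(0).sum(axis=1)
print(groups_hit[groups_hit > 1])
```

In this toy data, “Services” is split across two finance groups, which is exactly the kind of difference you want to be able to explain before the meeting.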

Find out which systems your audience is using and what data they typically see. Then have an upfront discussion with someone you trust, going through the data and results to take out all possible flaws.

5. Know the issues and work around them

Once you know the issues that are there, it’s time to work around them. One way is to tackle the issue at the source. It may not be your job, but it is potentially critical for a follow-up project where these features are going to be used.

If you are still in the exploratory phase, you could instead make the issues and your assumptions explicit. The key is that you are able to explain them and their implications, so that you gain the trust of your audience.

Are you thinking that this is a lot of work? Well, think again. Once this is sorted, you can actually do your job: start creating actionable insights and take action.
