Data Profiling vs Data Quality Assessment - Let's Explain The Difference
A common problem I see in data management circles is the confusion around what is meant by data profiling as opposed to data quality assessment.
Some people tend to use the terms interchangeably and it’s easy to see why...
When we first plug a data profiling tool into our data sources it can generate a huge amount of insight into the quality of our data. We assume these early investigations are assessments of the data because they give us stats and measures.
Considering what the word assessment actually means (an appraisal or judgement made against agreed criteria) helps us understand where many people go wrong:
A lot of people use data profiling as the start and end point of their data quality assessment, and as a result they cannot determine whether the profiling results are:
- Weighted in a balanced and correct way
- Significant to the business
- Reflective of the true extent of a particular issue
The problem is we’re missing a few key stages so let’s extend our discussion with a more comprehensive workflow.
Step 1: Data Profiling (a.k.a Data Quality Requirements Discovery)
In this phase we are using data profiling software to begin the process of discovery, but we're not doing an assessment just yet.
Data profiling helps to find data quality rules and requirements that will support a more thorough data quality assessment in a later step.
Data profiling finds data quality rules that help create a data quality assessment #dataquality
For example, data profiling can help us to discover value frequencies, formats and patterns that lead us to believe that a particular attribute is a product code.
Using data profiling alone we can find some perceived defects and outliers, but when it comes to assessing the quality of the product code we will fall short until we have created more rigorous definitions of quality that may span multiple attributes, entities or even systems.
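To make this concrete, here is a minimal sketch of the kind of profiling pass described above: counting value frequencies and reducing values to format "masks" so that dominant patterns (and outliers) stand out. The sample values and the mask convention are invented for illustration.

```python
import re
from collections import Counter

# Hypothetical sample of values pulled from an attribute we suspect is a product code
values = ["AB-1042", "AB-1043", "XJ-2210", "ab-1044", "N/A", "", "AB1045"]

def pattern_mask(value: str) -> str:
    """Reduce a value to its shape: letters become 'A', digits become '9'."""
    return re.sub(r"\d", "9", re.sub(r"[A-Za-z]", "A", value))

# Value frequencies: how often does each distinct value occur?
frequencies = Counter(values)

# Pattern frequencies: which formats dominate, and which look like outliers?
patterns = Counter(pattern_mask(v) for v in values)

print(patterns.most_common())
```

The dominant mask here is a clue, not a rule: it still needs business confirmation before it can become part of an assessment.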
We end up with a whole range of additional questions to ask, based on those initial clues that our data profiling tool helped us to uncover:
- Viability: Does the code have a viable business function or is it redundant?
- Relativity: Is the quality of the code determined by other attributes, for example manufacturer code or some other combination of attribute values?
- Expansion: Can (and should) we decompose the code to extract more information that will help us validate the quality of its value?
With your very first data profiling activity you've started a process of data quality requirements gathering, not data quality assessment. The assessment comes later, when all the requirements are encapsulated as executable data quality rules that can provide a much more comprehensive measure of data quality.
Step 2: Data Quality Requirements Creation
Armed with our data profiling insights we can now start to define some data quality rules that our data must adhere to.
Why must we do this?
Because we need a means of comparing the quality of our data against an approved set of criteria. Data profiling results alone simply publish stats; there is no approval rating or contextual validation at all.
For example, at a previous assignment I discovered major issues with location information across a wide range of inside-plant equipment at a utilities organisation. According to the profiling results the figure looked bleak: 40% of equipment had a missing location value, the classic 'completeness' dimension.
However, this profiling figure gave us no means of true data quality assessment because:
- A huge proportion of that equipment was actually retired or assigned to spares
- A great deal of equipment belonged to other partners and was therefore out of scope
- Some equipment was actually mastered in another system so depending on the equipment type it was important to gather location data from another source
As you can see, the data profiling function can help us uncover these rules and requirements but data profiling alone cannot give us an accurate assessment.
Instead, we must define and build the rules elsewhere.
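As a sketch of what "building the rules elsewhere" might look like, the utilities example above could be encoded as an executable rule. Every field name and status value below is hypothetical; the point is that the rule carries the business context the raw profiling stat lacked.

```python
def rule_location_required(record: dict) -> bool:
    """Location is only required for active, in-scope equipment mastered here.

    Field names (status, owner, master_system, location) are invented for
    illustration; a real rule would use the organisation's own model.
    """
    if record["status"] in ("RETIRED", "SPARE"):
        return True   # retired/spare kit is out of scope: no location needed
    if record["owner"] != "US":
        return True   # partner-owned equipment is out of scope
    if record["master_system"] != "THIS_SYSTEM":
        return True   # location is mastered in another source system
    return bool(record["location"])  # in scope: location must be populated

# An active, in-scope record with no location genuinely fails the rule
record = {"status": "ACTIVE", "owner": "US",
          "master_system": "THIS_SYSTEM", "location": ""}
print(rule_location_required(record))  # prints False
```

Notice that a retired asset with a blank location now passes, which is exactly the contextual validation a bare "40% missing" figure cannot express.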
Step 3: Data Quality Assessment
Okay, we've profiled our data, discovered a wide-ranging set of data quality requirements or rules, and now we need to put our rules to the test.
We assess the data against our rule base and record the passes and fails to create a true assessment of data quality.
(Taking a purist stance, the only way to make a true assessment of data quality is to validate against the real-world source of the data, but this is impractical in most cases.)
So in our earlier example we would assess the location of our equipment against a far more stringent set of rules than profiling alone would give us. We may still use profiling functions to validate the format, length, code values and substring values against our data quality requirements, but the goal is to determine whether each value passes or fails against an approved set of criteria.
Using this approach we can build a much clearer picture of data quality "health".
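The pass/fail recording described above can be sketched as a small assessment loop: run every rule over every record and report a pass rate per rule. The rule and the sample records are hypothetical, echoing the location example from Step 2.

```python
def assess(records, rules):
    """Return per-rule pass rates: an assessment, not just profiling stats."""
    results = {}
    for name, rule in rules.items():
        passes = sum(1 for record in records if rule(record))
        results[name] = passes / len(records)
    return results

# Assumed rule: in-scope equipment must have a location (field names invented)
rules = {
    "location_populated": lambda r: (not r["in_scope"]) or bool(r["location"]),
}
records = [
    {"in_scope": True,  "location": "SITE-01"},
    {"in_scope": True,  "location": ""},        # genuine defect
    {"in_scope": False, "location": ""},        # retired/partner kit: passes
    {"in_scope": True,  "location": "SITE-07"},
]
print(assess(records, rules))  # {'location_populated': 0.75}
```

The resulting 75% is a defensible quality metric because the scoping logic is built into the rule, unlike the raw "40% missing" profiling stat.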
Many companies instantly panic when they first run data profiling software on their data and it highlights vast numbers of defects. However, if they understand the bigger picture and move through the profiling, requirements gathering and data quality assessment phases, they get a far more balanced and objective view of how good or bad their data really is.
What do you think? Do you feel data profiling and data quality assessment are the same or different terms? I welcome your views on this topic.