Arkady Maydanchik provides practical tips for managing domain data quality rules for attributes
Data Quality Rules Tutorial 1 of 4: Attribute Domain Constraints
by Arkady Maydanchik
In this tutorial, Arkady Maydanchik provides an overview of the 5 categories of data quality rules as covered in his Data Quality Assessment book. He discusses in greater detail the role of Attribute Domain constraints.
Imagine that you are appointed to be a home-plate umpire in a major league baseball game.
Of course, you cannot do it without knowing the rules. The official rulebook of major league baseball contains 124 rules, many with numerous sub-rules. If you miss just a couple of rules, you may inadvertently influence the outcome of the game – the one play that you call erroneously could be decisive. If you do not know 10% of the rules, you can easily cause a riot. Also, complicated rules are no less important than easy ones, so learning all but 10% of the most complex rules still leaves you 10% short of the target.
Data quality rules play the same role in data quality assessment as the rules of baseball in refereeing a major league game. They determine the outcome!
Unfortunately, identifying data quality rules is more difficult than learning rules of baseball because there is no official rulebook that is the same for all databases. In every project, we have to discover the rules anew.
Also, some rules are easy to find, while others require lots of digging; some rules are easy to understand and implement, while others necessitate writing rather complex programs. But, as with baseball, all rules are equally important. Omitting a few complex and obscure data quality rules can (and most of the time will!) jeopardize the entire effort.
Data quality rules fall into five broad categories.
- Attribute domain constraints restrict allowed values of individual data attributes. They are the most basic of all data quality rules.
- Relational integrity rules are derived from the relational data models and enforce identity and referential integrity of the data.
- Rules for historical data include timeline constraints and value patterns for time-dependent value stacks and event histories.
- Rules for state-dependent objects place constraint on the lifecycle of objects described by so-called state-transition models.
- General dependency rules describe complex attribute relationships, including constraints on redundant, derived, partially dependent, and correlated attributes.
Attribute Domain Constraints
At the most atomic level, the data in any database consists of individual values of various attributes. Those values generally represent measurements of the characteristics of real world people, things, places, or events. For instance, height and weight are characteristics of people; latitude and longitude are characteristics of geographical locations on Earth; and room number and duration are characteristics of a "business meeting” event.
Now, real world objects cannot take any shape and form. We do not expect people to be 12 feet tall, or meetings to be held on the 215th floor. What this means is that attribute values of these objects cannot take any values but only certain reasonable ones. The data quality rules used to validate individual attribute values are commonly referred to as attribute domain constraints. These include optionality, format, valid value, precision, and granularity requirements.
Attribute domain constraints can be deduced from analysis of the meta data, such as data models, data dictionaries, and lookup tables. However, these meta data should be used with caution since they are often incorrect or incomplete. Data models typically reflect the data structure at the time of database design. Over time data models are rarely updated and quickly become obsolete, especially in the volatile area of attribute domains. Data dictionaries and lookup tables are also seldom up-to-date.
The cure to this common meta data malady is data profiling – a combination of techniques aimed at examining the data and understanding its actual content, rather than that described theoretically in the data models and data dictionaries. Specifically, attribute profiling examines the values of individual data attributes and produces three categories of meta data for each attribute: basic aggregate statistics, frequent values, and value distribution. Analysis of this information helps identify allowed values for an attribute.
These prevent attributes from taking Null, or missing, values. On the surface, these constraints appear to be the easiest to identify. In fact, Non-Null constraints are often enforced by relational databases. So what is the point to validate such constraints? The devil, as usual, is in the details.
First, Not-Null constraints are often turned off in the databases to allow for situations when attribute value is not available (though required!) but the record must be created. For the same reason, optionality is not always represented correctly in the meta data. Routinely, data models and data dictionaries reflect actual database configuration (i.e. Null value allowed), rather than the true data quality requirement.
More importantly, default values are often entered to circumvent the Not-Null constraints. Attribute is populated with such a default when actual value is not available. Database designers are often unaware of such default values, and data dictionaries rarely list them. Even business users who enter them might forget all default values they use or used in the past. Yet, default values are no different from missing values for all practical purposes.
How do we identify the default values? This is done by analysis of frequent values in the attribute profile. The default values are usually "strikingly inappropriate.” Frequent values that do not look "real” are likely candidates.
Format constraints define the expected form in which the attribute values are stored in the database field. Format constraints are most important when dealing with "legacy” databases. However, even modern databases are full of surprises. From time to time, numeric and date/time attributes are still stored in text fields.
Format constraints for numeric, date/time, and currency attributes are usually represented as a value mask, a la MMDDYYYY standing for 2-digit month followed by 2-digit day and 4-digit year for the date. Text attributes are most often made of a single word that has restriction on length, allowed set of characters, and mask. Text attributes made of multiple words often have format constraints in the form of a word pattern.
Valid Value Constraints
Valid value constraints limit permitted attribute values to a prescribed list or range. Unfortunately, valid value lists are often unavailable, incomplete, or incorrect. To identify valid values, we first need to collect counts of all actual values. These counts can then be analyzed, and actual values can be cross-referenced against the valid value list, if available. Values that are found in many records are probably valid, even if they are missing from the data dictionary. This typically happens when new values are added after the original database design and are not added to the documentation. Values that have low frequency are suspect.
For numeric and date/time attributes, the number of valid values is typically infinite (or at least too large to be enumerated in practice). However, even for these attributes certain values are not valid. For instance, annual compensation cannot be negative. Further, an employee date of birth can be neither too far in the past nor too close to present day. In these situations, the attribute domain constraints take form of a valid value range rather than a list.
Some numeric and especially date/time attributes have very complex domain constraints. Instead of a single range, the domain is defined as a set of ranges following a certain pattern. For example, year-end bonus payment date may only fall in December and January. Extensive analysis of value distributions is the only way to identify such domain constraints.
Precision constraints require all values of an attribute to have the same precision, granularity, and unit of measurement. Precision constraints can apply to both numeric and date/time attributes. For numeric values, they define the desired number of decimals. For the date/time attributes, precision can be defined as calendar month, day, hour, minute, or second. Data profiling can be used to calculate distribution of values for each precision level. Fluctuation of the distribution from random is a sign of a prevailing precision.
In this article we introduced the concept of data quality rules and discussed the first category of rules – attribute domain constraints. There are many practical challenges in identifying even these seemingly trivial rules. Comprehensive attribute profiling is the key to success. Without detailed understanding of attribute profiles, domain constraints will always be incomplete and incorrect. In the next article of the series we will discuss relational integrity rules.
For more than 20 years, Arkady Maydanchik has been a recognized leader and innovator in the fields of data quality and information integration. He is the author of the extremely successful publication "Data Quality Assessment".