Have you had the pleasure of filling out a form online recently? The level of enjoyment can vary with the person completing the survey and with its complexity. I must admit there have been plenty of times when I have pushed the “Submit” button only to get the same screen refreshed with a red or pink error message telling me to look through the form one more time due to an omission or an invalid data format. I must also admit that, even though I may be the only one interacting with the computer at the time, I get a little embarrassed that I missed something. Still, I can see why these data checkers are in place: they make the collection of a complete data set more likely.
A complete data set maximizes the value of the discoveries that can be made. It also minimizes the time it takes to investigate any data gaps that may be present. As a result, data quality tools (e.g., completeness checkers that make sure all required information has been provided and is in the right format) have become important resources for facilitating complete data sets.
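To make the idea concrete, here is a minimal sketch of the kind of check a completeness checker performs before letting a form submit. The field names and format rules are hypothetical, chosen to mirror a simple survey form, not any particular tool:

```python
import re

# Hypothetical required fields and format rules for a simple survey form.
REQUIRED_FIELDS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),  # basic email shape
    "date": re.compile(r"^\d{4}-\d{2}-\d{2}$"),          # YYYY-MM-DD
    "shoe_size": re.compile(r"^\d{1,2}(\.5)?$"),         # e.g. 9 or 10.5
}

def check_submission(form: dict) -> list[str]:
    """Return a list of error messages; an empty list means the form passes."""
    errors = []
    for field, pattern in REQUIRED_FIELDS.items():
        value = (form.get(field) or "").strip()
        if not value:
            errors.append(f"'{field}' is required.")
        elif not pattern.match(value):
            errors.append(f"'{field}' is not in the expected format.")
    return errors

print(check_submission({"email": "pat@example.com", "date": "2023-04-01"}))
# ["'shoe_size' is required."]
```

This is exactly the logic behind that red or pink error message: every required field must be present and match an expected format before the data is accepted.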
Although it is not critical to the KPI assessment process, it does help to know how the data you will be using was collected. For privacy purposes, it is critical that you know where the data came from and which elements are considered personally identifiable information. (A topic for a future blog post.) For the purposes of the KPI assessment process, knowing how the data was collected may be useful when looking at the level of completeness of a data set. Let’s look at the data set referenced in blog post #5, Data Review.
In Figure 1, there is a data set with two rows highlighted in yellow. Figure 2 outlines the data dictionary for the data set in Figure 1. The data set contains three columns and eight rows, with the first row holding the column headers. The data appears to be complete and in the correct format until the sixth row of data, which is missing both a date and a shoe size. The seventh row of data is missing a valid shoe size. These types of observations are important to note in the data dictionary. At this stage of the data retrieval process, the sixth and seventh rows will be set aside as the remainder of the data is examined. You are not deleting these rows at this point; you are simply setting them aside.
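Here is a short sketch of that “set aside, don’t delete” step using pandas. The values below are an illustrative stand-in for the Figure 1 data (not the actual figure), and the valid shoe-size range of 4 to 18 is an assumption made for this example:

```python
import pandas as pd

# Illustrative stand-in for the Figure 1 data set; row indices 5 and 6
# mirror the sixth and seventh data rows discussed above.
df = pd.DataFrame({
    "ID": [101, 102, 103, 104, 105, 106, 107],
    "Date": ["2023-04-01", "2023-04-01", "2023-04-02", "2023-04-02",
             "2023-04-03", None, "2023-04-03"],
    "ShoeSize": [9.0, 10.5, 8.0, 11.0, 9.5, None, 100.0],
})

# An assumed plausible range for shoe sizes in this sketch.
valid_size = df["ShoeSize"].between(4, 18)
complete = df[["Date", "ShoeSize"]].notna().all(axis=1)

set_aside = df[~(complete & valid_size)]  # rows six and seven land here
clean = df[complete & valid_size]         # the rest of the data moves on

print(set_aside)
```

Note that both data frames are kept: the problem rows are separated out for later investigation rather than dropped.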
Take notice of the column labeled “ID.” According to the data dictionary (Figure 2), the ID column contains a unique identifier, so it is important to check whether any value appears more than once in the data set. Looking through the data in Figure 1, all of the values in the ID column are unique. If a duplicate value were found, that too should be noted in the data dictionary.
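Verifying the uniqueness claim takes only a couple of lines. A minimal sketch, using the same hypothetical ID values as above:

```python
import pandas as pd

ids = pd.Series([101, 102, 103, 104, 105, 106, 107], name="ID")

# Any ID that appears more than once violates the data dictionary's
# "unique identifier" claim and should be noted there.
dupes = ids[ids.duplicated(keep=False)]
print("All ID values are unique." if dupes.empty
      else f"Duplicate IDs found:\n{dupes}")
```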
The data example in Figure 1 contains only seven rows of data; the small number is for ease of illustrating concepts. Most data sets will have many more rows. A key question that may be forming is: how would one scan a much larger data set? First, given that the data file is an extract, you should be able to view the file. Although the file may have 1,000 rows, it is important to take the time to review it (as was discussed in yesterday’s post). You may spot some inconsistencies by looking through the file, but given the number of rows, it is a good idea to use a data aggregation tool, such as a pivot table or pivot chart in your spreadsheet application, to quickly produce a summary of the data set. Calculated columns can be added to determine averages, ranges, standard deviations, etc. Of course, statistical analysis programs are also available (R, SAS, SPSS, STATA); however, we are not going to explore tools of that type in this KPI assessment process. The key question you want to be able to answer is: given the data set provided, how complete is it? Are there inconsistencies and/or gaps in the data?
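If you prefer a script to a pivot table, a few lines of pandas produce the same kind of summary. This sketch assumes the extract is a CSV file; the file name and column names are placeholders based on the Figure 1 example:

```python
import pandas as pd

# Hypothetical extract with ~1,000 rows; substitute your own file name.
df = pd.read_csv("survey_extract.csv")

# Completeness at a glance: how many values are missing per column?
print(df.isna().sum())

# Averages, ranges, standard deviations, etc. for the numeric columns.
print(df.describe())

# A pivot-table-style summary: response counts and shoe-size statistics
# per date (column names assumed from the Figure 1 example).
print(df.pivot_table(index="Date", values="ShoeSize",
                     aggfunc=["count", "mean", "std"]))
```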
Returning to the sixth and seventh rows of data from Figure 1, the data set contains both a data gap (row six) and an inconsistency (row seven). Blog #7 will address the issue of data gaps, given that when data gaps occur, a broader examination of data quality needs to be considered. In the case of the inconsistency, there is only one instance in the example data set. What caused it? Was it an entry error? Was it actually the response of the person surveyed? Did data get transposed from one field to another? Any of these is a possibility, with the first two being more probable than the others because they are linked to data entry error, which is very common. In this example, the seventh row would be eliminated from further analysis: there are too many open questions about why the data is inconsistent, and it represents only 1 of 7 rows (14%) of the data. However, if the same error occurred in a data set with 1,000 rows and 14% of the data set had inconsistencies (amounting to 140 cases), serious concerns about data quality would arise.
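This scale judgment can be rough-coded. The sketch below treats a handful of inconsistent rows as safe to set aside but flags a large absolute count for investigation; the thresholds of 10 rows and 5% are arbitrary placeholders, not standards, and your team would choose its own:

```python
def triage(n_inconsistent: int, n_total: int,
           max_count: int = 10, max_rate: float = 0.05) -> str:
    """Rough triage rule: a few inconsistent rows can be set aside,
    but a large absolute count warrants a broader quality review.
    Thresholds are placeholders for illustration only."""
    rate = n_inconsistent / n_total
    if n_inconsistent <= max_count and rate <= max_rate:
        return "set aside and proceed"
    if n_inconsistent <= max_count:
        return "set aside, but note the rate in the data dictionary"
    return "investigate: possible systematic error"

for bad, total in [(1, 7), (140, 1000)]:
    print(f"{bad}/{total} ({bad/total:.0%}): {triage(bad, total)}")
# 1/7 (14%): set aside, but note the rate in the data dictionary
# 140/1000 (14%): investigate: possible systematic error
```

Notice that the two cases share the same 14% rate; it is the 140-case absolute count that tips the second one into an investigation.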
If 140 cases were inconsistent, time would need to be spent determining whether there is a pattern to the inconsistency that could be caused by an algorithmic error or a consistent data collection protocol error. For example, suppose further investigation determined that all 140 inconsistent cases had a shoe size of 100 when the value was supposed to be 10, and the survey team explained that the “0” key on the keyboard kept sticking and added more zeros than should have been included. In that case, the cause was a consistent data collection protocol error.
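Here is a sketch of how one might hunt for that kind of pattern and, once the cause is confirmed, apply a documented correction. The values and the sticky-“0” fix come from the example above; the upper bound of 18 is an assumed limit for a plausible shoe size:

```python
import pandas as pd

# Illustrative values; in the example, every inconsistent case is 100.
sizes = pd.Series([9.0, 10.5, 100.0, 8.0, 100.0, 11.0, 100.0])

# Step 1: is there a pattern? If every out-of-range value is identical,
# a systematic cause (algorithm or protocol error) is likely.
out_of_range = sizes[sizes > 18]
print(out_of_range.value_counts())  # 100.0 appears every time

# Step 2: only after the cause is confirmed (the sticky "0" key),
# apply a documented correction: strip the extra trailing zero.
corrected = sizes.where(sizes <= 18, sizes / 10)
print(corrected.tolist())  # [9.0, 10.5, 10.0, 8.0, 10.0, 11.0, 10.0]
```

The order matters: the correction is applied only after the investigation explains the pattern, and the fix itself should be recorded in the data dictionary.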
An investigation of inconsistencies will lead to decisions about whether to:

- correct the data, when a consistent cause has been identified and documented (as with the sticky “0” key above);
- set the inconsistent rows aside and eliminate them from further analysis; or
- undertake a broader examination of the data set’s quality before using it.
Blog #6's question: Is the data complete?
Blog #7 sneak peek: Data Gaps