This post continues the examination of the need to conduct an initial quality check after using a calculation process for one or more Key Performance Indicators (KPIs) for the first time. Part 1 of this blog topic (the previous entry, post #9) presented an overview of the purpose of the quality check along with a specific focus on how to examine the business rules used. In part 2 we examine the fidelity of the algorithm used to calculate the KPI value as well as the point in time of the data used for that calculation.
Algorithms Correctly Respond to the KPI
In the eighth blog post of this series, Initial Run Through, the concepts of area and perimeter were referenced. These concepts illustrate the part of the quality checking process that examines whether the correct algorithms are used to calculate a particular KPI. The area of a shape is specific to certain of its characteristics. In general, two numbers are multiplied together: the length and the width. This is straightforward when finding the area of a parallelogram. Webster’s online dictionary defines a parallelogram as “a quadrilateral with opposite sides parallel and equal”[i]. The means of calculating the area of a trapezoid is different. Webster’s defines a trapezoid as “a quadrilateral having only two sides parallel”[ii]. While the trapezoid is also a quadrilateral, its characteristics are slightly different than a parallelogram’s, and thus the formula for calculating the area is also slightly different: ((base a + base b) / 2) × height. The point here is that the algorithm may appear to be calculating values with great precision; yet the team needs to be sure that the precision calculated is specific to the KPI being assessed.
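To make the distinction concrete, here is a minimal Python sketch of the two calculations; the function names and sample measurements are illustrative only, not part of any KPI tooling.

```python
def parallelogram_area(base, height):
    # Opposite sides parallel and equal: area is simply base times height.
    return base * height

def trapezoid_area(base_a, base_b, height):
    # Only two sides parallel: average the two parallel sides, then multiply by the height.
    return ((base_a + base_b) / 2) * height

# Both functions return values with equal "precision," but each is only correct
# for its own shape, just as an algorithm is only correct for its own KPI.
print(parallelogram_area(6, 4))   # 24
print(trapezoid_area(6, 10, 4))   # 32.0
```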
The Date Range is Reflective of the KPI Parameters and a Frozen File
“Ladies and Gentlemen, may I have your attention in the terminal. Generic Airlines Flight 1234 en route to Rochester, NY is in the final stages of boarding for an on-time departure. The doors will be closing in 10 minutes and will not reopen.” Sound familiar? This page over the PA system is most prevalent in two situations where people are rushing to make their connections:
At this point in the performance year, data collection is still underway for the fourth quarter. As we move forward, this reality is recognized, and we will work through how to make progress while data is still coming in for the last 25% of the performance period; however, by this time, all data for the first, second, and third quarters should be complete. Additional data should not still be in the process of being added for the first three quarters. If data is still being modified, then the team needs to deem the data set not ready. It is recommended that the strategies outlined in blog posts #4-7 be used to validate these three quarters of data once finalized.
The data set that was used to conduct the initial run through of data should be stamped with when it was extracted, from what system, and by what processes or people. All of this information should be included in the data dictionary that accompanies the data set. Additionally, as the calculation of a KPI becomes more granular, any subsets of data created should also carry timestamp information and be included in the data dictionaries of those subsets.
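As one illustration of what that provenance information might look like (the field names below are hypothetical, not a prescribed standard), the extract could be described alongside its data dictionary like this:

```python
# Hypothetical provenance record for a frozen extract; field names are illustrative.
# The same structure can be repeated for any subsets created later in the analysis.
extract_provenance = {
    "data_set_name": "vehicle_emissions_extract",
    "extracted_at": "2017-11-25T14:30:00",    # when the data set was frozen
    "source_system": "fleet_production_db",   # the system it was pulled from
    "extracted_by": "nightly_export_job",     # process or person responsible
    "notes": "Quarters 1-3 final; quarter 4 still in collection",
}
```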
An Unexpected Surprise
As the team continues to work with data sets, it is quite possible that an unforeseen glitch in the data may arise. Such a glitch may occur even after great care has been taken to check the completeness and quality of the data. A common glitch at this stage is the sudden discovery of a significant set of data that has not yet been entered into the application storing the data needed. As a result, the team is now facing the reality that the data set currently being used to assess the KPI(s) is incomplete. The team should discuss the feasibility of backtracking the analytic process to include this information. Alternatively, the team may need to speak with other data leaders about the proper course of action, which may include discontinuing the analysis for a specific KPI or proceeding with limited information. Moving forward with known limitations raises a value question once again: do the limitations that have surfaced outweigh the potential conclusions?
Blog #10 Question: Does your initial run through of data align with the intent of the KPI?
Blog #11 Preview: First Quarter Analysis
[i] Source: https://www.merriam-webster.com/dictionary/parallelogram retrieved 11/25/2017.
[ii] Source: https://www.merriam-webster.com/dictionary/trapezoid retrieved 11/25/2017.
After the first time that business rules and algorithms are applied to a data set, whether it be for a Key Performance Indicator (KPI) assessment process or for any analytic purpose, a quality review should occur. A data quality review specific to the KPI assessment process will examine:
After each team member has had the opportunity to generate the values associated with the KPI in question, the team should come together to deliberate on each other’s findings. The first question should focus on whether everyone generated the same value. If not, the values should be compared. Is there a large difference? Document the value generated by each team member, the order in which each team member completed the process, and the length of time it took each team member to calculate the value. Once this information has been collected, the team should meet to begin the data quality review process by examining the business rules used.
Business Rules Outlined Correctly
A great way to begin the discussion for each component of the quality review process would be for the team to compare notes and see whether the value for the KPI under discussion was the same for each member. If…
For every red car produced in the 2017 model year, the amount of carbon dioxide emitted will be 50% lower than the same red model of car manufactured in the 2016 model year.
The data set used to calculate the example KPI above contains multiple vehicle types and colors. The team should focus on how the subset of data was created. The example KPI is specific to two categorical variables: vehicle type (car) and color (red). As a result, only red cars should be selected from the data set for any calculations for this example KPI. As long as these two filters are the only ones applied, the order in which they are applied does not matter (see Figure 1).
Figure 1: Portion of Example Data Set Dealing with Vehicle Attributes.
It does not matter whether the data set is filtered by the color of the vehicle first and then by vehicle type (results shown in yellow in the column named “Color First”) or by vehicle type first and then by color (results shown in blue in the column named “Car First”). A label of YES means that the line of data is included.
A problem exists if the data set was used for another KPI and a different filter was applied but not removed, as shown in Figure 2. Here the table includes an additional column for the type of interior, either leather or cloth. If the filter on interior type is not removed from the data set prior to calculating the example KPI, a problem will occur. An illustration of this problem is highlighted in green in Figure 2: if the data set is first filtered by interior type (which has nothing to do with the example KPI), then the subset of data created to analyze the example KPI will not contain all the data needed. I encourage you to download the example table and experiment with these different filtering orders.
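A minimal pandas sketch of both situations follows; the column names and rows are hypothetical stand-ins for Figures 1 and 2, not the actual example table.

```python
import pandas as pd

# Hypothetical rows mirroring Figures 1 and 2; column names are illustrative.
vehicles = pd.DataFrame({
    "vehicle_type": ["car", "car", "truck", "car"],
    "color":        ["red", "blue", "red", "red"],
    "interior":     ["leather", "cloth", "cloth", "cloth"],
})

# The order of the two KPI filters does not matter: both produce the same subset.
color_first = vehicles[vehicles["color"] == "red"]
color_first = color_first[color_first["vehicle_type"] == "car"]

type_first = vehicles[vehicles["vehicle_type"] == "car"]
type_first = type_first[type_first["color"] == "red"]

assert color_first.equals(type_first)  # identical subsets (two red cars)

# A leftover filter from another KPI silently shrinks the subset.
leftover = vehicles[vehicles["interior"] == "leather"]  # unrelated to the example KPI
wrong_subset = leftover[(leftover["color"] == "red") & (leftover["vehicle_type"] == "car")]
print(len(wrong_subset), "vs", len(color_first))  # 1 vs 2: a qualifying row was lost
```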
Figure 2: Portion of Data Set Dealing with Vehicle Attributes including Distractors
In Part 2 of this Blog topic we examine the fidelity of the algorithm used and the point in time of the data.
Blog #9 Question: Are the business rules carrying out the KPI assessment process or are they a mere reference guide?
Blog #10 Sneak Peek: Quarter 1 Analysis
It was an attention-seeking statement: “Ladies and gentlemen, I have just found a wallet at the front of the cabin.” I looked up at the flight attendant who was speaking on the public-address system as if he were talking directly to me about my wallet. This strategy has been used by the lead flight attendant on several flights I have taken recently. Each time I react in the same way, by looking up, even though I know the strategy is used to get people’s attention for the flight safety presentation that is commencing. Why did I continue to have this small moment of panic and react the same way even though I knew it was a way of getting my attention? Because there was a possibility that it was true. I was not 100% sure. There is a feeling of panic when discovering that something may be missing, whether it be an item or a piece of information.
The same sense of panic results after reviewing a data set that is missing certain information. In essence, it is a data gap. In cities with mass transit rail lines (e.g., light rail, commuter rail, subway), warning signs remind passengers to be alert to the gap between the platform and the train car. Sometimes these gaps are minimal. At other times they can be as large as a foot. Regardless, the warning should be taken seriously at all times. Data gaps require the same level of care.
When a data gap is discovered it should be documented. An analyst’s mind will be the most objective about the data gap at the time of discovery. As soon as the analyst gets involved in the details of solving the question as to why the data gap occurred, the clarity of capturing descriptive details becomes clouded. At the time of discovery, the data gap documentation should include the name of the data set, the column and rows where the data is missing and any other details that may be relevant to the discovery (e.g. data values that contain out of place characters).
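As a hypothetical illustration (the field names and values below are placeholders, not a required template), a gap note captured at the moment of discovery might look like this:

```python
# Placeholder gap-documentation entry recorded at the time of discovery,
# before any investigation begins; every value here is illustrative.
data_gap_note = {
    "data_set_name": "shoe_survey_extract",
    "columns_affected": ["date", "shoe_size"],
    "rows_affected": [6, 7],
    "discovered_on": "2017-11-25",
    "discovered_by": "analyst initials",
    "details": "Row 6 missing date and shoe size; row 7 contains an out-of-place value",
}
```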
After initial discovery, concerns about the missing data should be discussed with your data team and a plan for investigating the data gap should be developed. The first step in the investigation is to check the data dictionary for any notes from colleagues who were involved in the initial capture of the data or the data extraction. If the data dictionary contains no explanation, the next step is to determine the data source. Was the data extracted by:
The process of identifying the cause of data gaps and deciding how to move forward is time consuming. When multiple people are involved, time will be spent researching past practice and reviewing metadata. This will involve waiting; however, it is important to figure out why the data gap exists. The process of discovery should not be abandoned because it is taking too long or is complex. Assessing a KPI based on flawed data will cause major issues down the road and should be avoided.
Blog #7 Question: Are you taking care to mind the data gap?
Blog #8 Sneak Peek: Initial Run Through
Have you had the pleasure of filling out a form online recently? The level of enjoyment can vary based on the person completing the form and its complexity. I must admit there have been plenty of times when I have pushed the “Submit” button only to get the same screen refreshed with a red or pink error message telling me to look through the form one more time due to an omission or an invalid data format. I also must admit that, even though I may be the only one interacting with the computer at the time, I get a little embarrassed that I missed something. I can see why these data checkers are in place, as they make the collection of a complete data set more likely.
A complete data set maximizes the value of the discoveries that can be made. It also minimizes the amount of time that it takes to investigate data gaps that may be present. As a result, data quality tools (e.g. completeness checkers that make sure that all required information has been provided and is in the right format) have become important resources to help facilitate complete data sets.
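As a rough sketch of what such a completeness checker does behind the scenes (the required fields and date format here are assumptions for illustration), consider:

```python
import re

# Minimal sketch of a completeness and format checker, in the spirit of an online
# form validator; the required fields and the date pattern are illustrative.
REQUIRED_FIELDS = ["id", "date", "shoe_size"]
DATE_PATTERN = re.compile(r"^\d{4}-\d{2}-\d{2}$")  # e.g. 2017-11-25

def validate_record(record):
    errors = []
    for field in REQUIRED_FIELDS:
        if not record.get(field):
            errors.append(f"{field} is missing")
    if record.get("date") and not DATE_PATTERN.match(record["date"]):
        errors.append("date is not in YYYY-MM-DD format")
    return errors  # an empty list means the record passes the check

print(validate_record({"id": "A1", "date": "2017-11-25", "shoe_size": "10"}))  # []
print(validate_record({"id": "A2", "date": "", "shoe_size": ""}))  # two missing fields
```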
Although it is not critical to the KPI assessment process, it does help to know how the data that will be used was collected. For privacy purposes, it is critical that you know where the data came from and which elements are considered personally identifiable information (a topic for a future blog post). For the purposes of the KPI assessment process, knowing how the data was collected may be useful when looking at the level of completeness of a data set. Let’s look at the data set referenced in the blog #5 post, Data Review.
In Figure 1, there is a data set with two rows highlighted in yellow. Figure 2 outlines the data dictionary for the data set in Figure 1. The data set contains three columns of data and eight rows, with the first row being the column headers. The data appears to be complete and in the correct format until the sixth row of data, which has no date or shoe size. The seventh row of data is missing a valid shoe size. These types of observations are important to note in the data dictionary. At this stage of the data retrieval process, the sixth and seventh rows of data will be set aside while the remainder of the data is examined. At this point you are not deleting these rows; you are simply setting them aside.
Take notice of the column labeled “ID.” According to the data dictionary (Figure 2), the ID column contains a unique identifier. It is important to verify that each value appears only once in the data set. Looking through the data in Figure 1, all of the values in the ID column are unique. If a duplicate value were found, this too should be noted in the data dictionary.
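A small pandas sketch of these two checks, setting incomplete rows aside rather than deleting them and confirming ID uniqueness, follows; the column names, values, and the valid shoe-size range are hypothetical stand-ins for Figure 1.

```python
import pandas as pd

# Hypothetical stand-in for the Figure 1 data set; values are illustrative.
df = pd.DataFrame({
    "ID":        ["A1", "A2", "A3", "A4", "A5", "A6", "A7"],
    "date":      ["2017-11-01", "2017-11-02", "2017-11-03", "2017-11-04",
                  "2017-11-05", None, "2017-11-07"],
    "shoe_size": [9, 10, 8, 11, 7, None, 100],
})

# Set aside (do not delete) rows that are incomplete or clearly out of range.
incomplete = df[df[["date", "shoe_size"]].isna().any(axis=1)]        # the "sixth row"
out_of_range = df[(df["shoe_size"] < 4) | (df["shoe_size"] > 20)]    # the "seventh row"
clean = df.drop(incomplete.index.union(out_of_range.index))

# Confirm the ID column really is unique; any duplicates belong in the data dictionary notes.
duplicate_ids = df[df["ID"].duplicated(keep=False)]
print(len(incomplete), len(out_of_range), len(duplicate_ids))  # 1 1 0
```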
The data example in Figure 1 contains only seven rows of data. This small number of rows is for ease of illustrating concepts; most data sets will have many more. A key question that may be forming is: how would one scan a much larger data set? First, given that the data file is an extract, you should be able to view the file. Although the file may have 1,000 rows, it is important to take the time to review it (as was discussed in yesterday’s post). You may spot some inconsistencies by looking through the file, but given the number of rows, it is a good idea to use a data aggregation tool such as a pivot table or pivot chart in your spreadsheet application to quickly provide a summary of the data set. Additional calculated columns can be added to determine averages, ranges, standard deviations, etc. Of course, there are also statistical analysis programs available (R, SAS, SPSS, Stata); however, we are not going to explore tools of that type in this KPI assessment process. The key question you want to be able to answer is: given the data set provided, how complete is it? Are there inconsistencies and/or gaps in the data?
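For a much larger extract, a few summary calls can stand in for the pivot table described above; this is only a sketch, and "shoe_survey.csv" is a hypothetical file name for your own extract.

```python
import pandas as pd

# Illustrative summary scan of a larger extract; the file name is hypothetical.
df = pd.read_csv("shoe_survey.csv")

print(df.describe(include="all"))   # counts, ranges, and averages expose outliers quickly
print(df.isna().sum())              # missing values per column
print(pd.pivot_table(df, index="shoe_size", values="ID", aggfunc="count"))  # frequency summary
```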
As we return to the sixth and seventh rows of data from Figure 1, there is a data gap (row six) and an inconsistency (row seven) present in the data set. Blog #7 will address the issue of data gaps, given that when data gaps occur, a broader examination of data quality needs to be considered. In the case of the inconsistency, there is only one instance in the example data set. What caused this inconsistency to arise? Was it an entry error? Was it actually the response of the person surveyed? Did data get transposed from one field to another? Any of these is a possibility, with the first two being more probable than the others as they are linked to data entry error, which is very common. In this example, the seventh row of data would be eliminated from further use in any analysis, given that there are too many questions about why the data is inconsistent and it represents only 1 out of 7 rows (14%) of the data. However, if this same error occurred in a data set with 1,000 rows and 14% of the data set had inconsistencies (amounting to 140 cases), serious concerns about data quality would arise.
If 140 cases were inconsistent, time would need to be spent determining whether there is a pattern to the data inconsistency that could be caused by an algorithmic error or a consistent data collection protocol error. For example, suppose further investigation determined that all 140 inconsistent cases had a value of 100 as the shoe size when the value was supposed to be 10, and the survey team explained that the “0” key on the keyboard kept sticking and added more zeros than should have been included. The cause would then be a consistent data collection protocol error.
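A short, hypothetical check of this kind of pattern might look like the following; the file name, column name, and valid range are all assumptions used for illustration.

```python
import pandas as pd

# Do the inconsistent cases share a single value, as in the sticking "0" key example?
df = pd.read_csv("shoe_survey.csv")                       # hypothetical file name
suspect = df[(df["shoe_size"] < 4) | (df["shoe_size"] > 20)]

print(suspect["shoe_size"].value_counts())                # e.g. 100 appearing 140 times
# If one value dominates (100 where 10 was intended), the cause is more likely a
# consistent collection error than random entry mistakes, and a documented
# correction can be discussed with the team.
```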
An investigation of inconsistencies will lead to decisions about whether to:
Blog #6's question: Is the data complete?
Blog #7 sneak peek: Data Gaps
Opening a data set for the first time is like opening a present: you get a feeling of excitement and anticipation. Today we are going to spend time reviewing data that has been extracted from one of the applications you inventoried as outlined in yesterday’s process, Identifying Data Sources. If the experience of opening a data set for the first time is new to you, don’t worry; we are going to work through this process together using the following steps:
Take in the view. Look at the columns and rows and get a feel for the construction of the file. Look at the very bottom of the screen. There should be a tab with the name of the data sheet that is currently open. Is there more than one tab? If so, a colleague must have programmed an extract with several tabs or organized the data into tabs after extraction. Take a moment to click through those tabs and see what information is contained on each.
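If the extract is a spreadsheet workbook, the same walk through the tabs can be done programmatically; this is a sketch only, and "extract.xlsx" is a hypothetical file name (reading .xlsx files with pandas also requires the openpyxl package).

```python
import pandas as pd

# Take in the view programmatically: list every tab in an extracted workbook.
sheets = pd.read_excel("extract.xlsx", sheet_name=None)   # dict of sheet name -> DataFrame

for name, sheet in sheets.items():
    print(name, sheet.shape)   # each tab's name and its (rows, columns)
```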
Blog #5’s question: Have you set aside time to review each data file and take in its “artistic” qualities?
Blog #6 Sneak Peek: Data Completeness
Progress has been made! We are moving to the next part of the KPI assessment process. As referenced in the second blog post, KPI Assessment Timeline, we are moving from the project plan to the data retrieval stage. I cannot tell you how many times I have been asked directly, or been present when others were asked, “How complicated is it to retrieve data?” When the technical answer begins to roll off the tongue, some seem to get bored. I really don’t know why, as it is a truly amazing process! To avoid getting too technical, let’s compare a data retrieval process to going grocery shopping. There are different types of “trips” to the grocery store. A data retrieval process is not like stopping to get a gallon of milk; it is more like stopping to get the necessary ingredients for a family reunion style Thanksgiving dinner. If you have been to a large family event, you can appreciate the amount of time and effort that goes into planning and preparing for numerous guests. If you have not had this pleasure, think about a big gathering that you participated in recently. Now think specifically about the family member or person who was in charge of the event. I would venture to say that they:
These characteristics are reflective of a great data retrieval team; they know:
Often, data retrieval is simplified by routinely extracting the same data. This routine keeps the location of the data continually familiar. However, routine is not best practice; documentation is the process that makes the location of data known. If your organization has not yet documented the location of all data attributes, the applications that store them, and the servers that house them, it is time to set aside time in the near future to complete this task. This documentation is organized in a manner similar to Figure 1: applications are stored on servers, and tables of data are associated with each application.
Having an inventory of each of these three aspects of information architecture is extremely important. Figure 1 only considers one server, but most organizations have multiple servers. (If everything you do is cloud based consider each cloud application its own application and the cloud as the server.) If your organization does not have a detailed mapping at this point in time, create a data map skeleton for the time being. An example of a data map skeleton is shown in Figure 2.
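As a rough sketch of what a data map skeleton can capture (every server, application, and table name below is a placeholder to be replaced with your own), consider:

```python
# Hypothetical data map skeleton in the spirit of Figure 2; all names are placeholders.
data_map = {
    "server_01": {
        "hr_application": ["employee_table", "training_table"],
        "survey_application": ["response_table", "respondent_table"],
    },
    # A cloud application is treated as its own application, with the cloud as the server.
    "cloud_server": {
        "cloud_survey_tool": ["export_table"],
    },
}

for server, applications in data_map.items():
    for application, tables in applications.items():
        print(server, ">", application, ">", ", ".join(tables))
```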
If documentation does exist, it is time to cross-check and verify that all information is up to date. Even if the documentation has a “last reviewed” date within the last couple of weeks or days, spend the time to verify. Today’s step is all about checking and inventorying what you have. Some of this data may not be used in the KPI assessment process, but in order to know what you are going to use, you need to be aware of all of the data attributes available.
Some organizations may keep an inventory of their data in a dedicated database. Even in this case, it is time to do a data audit to make sure that the system has captured everything you intended it to capture.
Other organizations may feel comfortable in the fact that they mapped out the data they would need to assess their KPIs at the time the KPIs were set. Having been consistent throughout is wonderful; still, take an inventory to verify. Additional data points may have been collected since the KPIs were put into place and may be a better fit than the data available at the time of the original mapping.
Check and recheck. That is today’s bottom line. Investing in an inventory of what you have will help prevent a missing piece from being discovered at a later time. A significant amount of time can be wasted when a critical piece is overlooked and discovered later. This wasted time affects not only performance but also patience, as frustration is certain to rise. Stay a step ahead by being thorough.
Blog #4’s Question: Do you know what data your organization has, where it is stored, and how to retrieve it?
Blog #5 sneak peek: Data Review