Back in Blog Post #1, we began the journey through what we labeled the KPI Assessment (KPIA) process for assessing Key Performance Indicators (KPIs). As the end of the annual performance period nears, an analysis team is like a fire battalion getting ready for the next emergency call. The analysis team will do whatever it can to prepare to complete an analysis as quickly as possible while preserving the highest levels of accuracy. The last blog article outlined some considerations for calculating the quarterly values for each annual KPI. The question remains: how can you assess fourth quarter results and generate the final KPI value before the fourth quarter ends? The answer is quite simple: you can't.
One of the wisest people I know says, "Focus on what you can, not what you can't." In that spirit, although it is not possible to calculate the fourth quarter and final KPI values at this time, you can project what they will be with a high degree of reliability.
Just before I began to write this blog article, I watched the returns come in for the special Senate election in Alabama. I wasn't watching commentary on a news station or listening to the radio; instead, I was watching real-time updates on the New York Times's Election 2017 returns website. The way the site used data to allow for comprehensive monitoring was awesome. The data displayed was not only representative of what was happening at the moment, but there were also various tools the observer could use to interpret the validity of the votes processed up to that moment. What you will notice when visiting the site is that there are more widgets for monitoring the validity of the data than for analyzing the data itself. These verification tools are what make the site especially impressive. I viewed the site for more than an hour (from about 9:30 PM to 10:45 PM EST). It was really interesting to watch the variation in the confidence of the data at any one moment. The site registered an update every minute whether new results had come in or not. Once 90% of the votes were counted, the data verification widgets showed little fluctuation, and shortly thereafter the race was called.
I bring up the New York Times election results site as a point of reference for calculating fourth quarter values before the end of the fourth quarter. Actually, given that the fourth quarter is not yet completed, you won't be calculating fourth quarter results; rather, you will be projecting the anticipated value. Projections are nothing to take lightly. They can be very dangerous if interpreted incorrectly and are subject to change. The main purpose of a projection is to give you an idea about the direction something is headed. Is the value more or less likely to end up at X? When making a projection as to where the fourth quarter may end up, the quality of the data set should be in better shape than real-time election results pouring in. Real-time data analysis has to factor in an acceptable level of uncertainty. Given how close the race was in the December 2017 Alabama special election, 90% of the votes needed to be in before teams of analysts were comfortable calling a winner while the remaining votes were tallied. Unlike a data set that is changing in real time, a data set that is still active but mostly collected should carry a lower level of uncertainty. With only 15 or so days until the end of the fourth quarter, out of a total of 92 possible days (84% of the quarter has been completed),[iii] there is sufficient data to make an informed projection for the fourth quarter.
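To make the idea concrete, here is a minimal sketch of one simple projection approach: a run-rate extrapolation that scales a partial-quarter total up to a full quarter. The function and the numbers are hypothetical, and a real projection should follow your documented methodology and past evidence rather than a bare linear scale.

```python
# A minimal run-rate projection sketch for an in-progress quarter.
# Assumes a count-based KPI and a steady daily rate; the function name
# and the example numbers are hypothetical.

def project_quarter_total(observed_total: float,
                          days_observed: int,
                          days_in_quarter: int = 92) -> float:
    """Scale a partial-quarter total up to a full-quarter estimate."""
    daily_rate = observed_total / days_observed
    return daily_rate * days_in_quarter

# Example: 308 qualifying events observed through day 77 (~84% of the quarter).
projected = project_quarter_total(observed_total=308, days_observed=77)
print(f"Projected Q4 total: {projected:.0f}")  # ~368
```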
Fourth Quarter Projections
Having sufficient data does not mean that a projection is always a good thing to use. A projection is not a guess. As defined by Merriam-Webster,[i] a projection (as related to statistical analysis) is "an estimate of future possibilities based on a current trend." A guess,[ii] by contrast, is defined as "to arrive at a correct conclusion about by conjecture, chance, or intuition." One is based on a series of facts from the past that estimate the future, whereas the other is based on conjecture and intuition, a true chance. Therefore, projections require methodology and past evidence. When projecting the values of the fourth quarter to best prepare timely evaluative analysis, there are four things to check for:
Please note that this can be a tricky situation. Concluding that enough time has passed also assumes that data has been collected along the way. If data has not been collected, then the validity of the projection will be lower because it will rely solely on past performance data.
Once the fourth quarter KPI values are calculated, and subsequently the year-end KPI, the team is ready to identify the limitations, deductions, completeness, and special circumstances. One limitation, which is also a special circumstance, is that these values are projections. As with quarters one through three, each team member should sign off on the projected value for the fourth quarter and for the year as a whole, signifying that the prescribed business rules were followed, the values were cross-referenced, and notes relevant to the KPI value(s) are recorded correctly.
With these projections in place, the team can meet to do one final data quality check. The steps involved, and the next phase, will be discussed in blog article #14.
Blog #13 Question: Is it ok to project?
Blog #14 Sneak Peek: The KPI Assessment wrap-up.
Resources as cited:
[i] https://www.merriam-webster.com/dictionary/projection, retrieved December 13, 2017.
[ii] https://www.merriam-webster.com/dictionary/guess, retrieved December 13, 2017.
[iii] Even when considering business days only, with 9 of the 63 business days in the fourth quarter remaining, about 85% of the fourth quarter has been completed. The one-percentage-point difference compared to all possible days is not significant.
Now that the first quarter Key Performance Indicators (KPIs) have been calculated and reviewed, and the purpose of the first quarter's results has been identified, it is time to move on to the remaining quarters. The second and third quarters should be analyzed in the same fashion as the first. The data for the first through third quarters should be considered frozen and ready for summative analysis. (Refer back to blog articles five through seven for guidance on determining the readiness of a "frozen" file.) The fourth quarter, however, will require a different process for review because it is still an active data set: this process begins in the middle of the fourth quarter and does not end until the thirty-day mark on the timeline outlined in Blog #1. Therefore, the fourth quarter's data is still being collected and verified.
After running the data for each KPI for the first quarter, there are four items the team should consider: limitations, deductions, completeness, and special circumstances.
Each of these considerations should be examined when completing the calculations for quarters two and three as well.
1. Limitations

Nothing is perfect. There is always some sort of caveat about any data point. At this stage of the process, the team should be looking for limitations related to the collection rather than the significance, or lack thereof, of the data used to determine the value. Let's look at an example to illustrate the difference. Suppose the temperature hit 76 degrees Fahrenheit in a city where the usual February temperature is 27 degrees Fahrenheit. What caused this unusual temperature record? If the answer lies in how the reading was collected (say, a miscalibrated sensor), that is a limitation to record now; what the reading means is a question for the evaluation process.
2. Deductions

Similar to the process outlined for limitations, the focus at this stage should be on the data collection process rather than the context, which will be examined during the KPI evaluation process. At this stage it is important to consider related data points that bolster the meaning of the KPI data point. Consider the following KPI:
By the end of the 2017 fiscal year, 85% of vacancies that arise will be filled with candidates who have a college diploma.
For this example KPI, let's say that the actual percentage for the second quarter was 90%. Additional deductions could include the percentage of:
3. Completeness

The team should once again review the data used to calculate the quarter-specific data point to make sure that it is reflective of the data set. The data quality checks referenced in KPI Assessment (KPIA) blog articles five through seven should be used to once again verify completeness.
4. Special Circumstances
A common (and quite effective) interview question is: "Is there anything else that I should know?" Often this question ends up being a safety net that catches information that is pertinent and useful but might not have been anticipated. The same question comes in useful when thinking about the special circumstances that should be recorded about each quarter's data. The data dictionary for the data sets used to calculate the KPI(s) is a great resource for exploring notations made during the data retrieval and analysis processes. These notations may be important to explain in greater detail as potential limitations or as notes that will help later on in the evaluation process. Items that are not already listed in the data dictionary can also be added. For example, based on notes from the data dictionary, there were some concerns about data quality during the first quarter. When the KPIs were calculated for each quarter, there was more variation in the first quarter data than in the second or third. Upon further investigation, the team recalled that the team member usually in charge of data collection for the KPI in question was on maternity leave during the first quarter. The person filling in to manage data collection in the first quarter did not have as much experience. The quality of the data was noticeably better once the team member returned to work. When the KPIE team constructs a narrative around the meaning and implications of the KPI values calculated during the assessment process, this type of special circumstance may become an important footnote to explain certain variations in the data.
Once the value has been determined for each KPI for the first, second, and third quarters, and the team has identified the limitations, deductions, completeness, and special circumstances, the team needs to come to consensus on the finality of the values calculated. For each quarter, each team member should sign off on the value calculated, signifying that the prescribed business rules were followed, the values were cross-referenced, and notes relevant to the KPI value(s) are recorded correctly.
Although the end of the fourth quarter has not yet occurred, there are steps to tentatively complete the KPIA process and commence with KPIE. In the next blog article, we will focus on the fourth quarter.
Blog #12 Question: Should there be limitations to quarterly data?
Blog #13 Sneak Peek: Fourth Quarter
“Shouldn’t we have already analyzed first quarter data?” Ideally, yes, as well as the second and third quarters. The best way to make use of a Key Performance Indicator (KPI) is to continually monitor its progress, thereby increasing the likelihood that the KPI will be met. Continual review of a KPI throughout the year allows for timely decisions to be made to improve performance in real time and increase the prospect that the annual goal is achieved.
The use of a KPI during the time of performance has a different purpose than when engaged in the Key Performance Indicator Assessment (KPIA) process. During KPIA, the focus shifts from the formative (monitoring) to the summative use of data, whereby the final values of KPIs are calculated and verified. Once final KPI values are calculated and reviewed, they are ready for use in the Key Performance Indicator Evaluation (KPIE) process, where an in-depth review takes place to provide narrative and insight regarding the summative KPI values.
The questions listed below are illustrative examples of those that arise when a process moves from the preparation phase to the action phase.
KPI Use Continuum
Determining where your organization is on the KPI use continuum is important in order to maximize the efforts that have already been invested in the KPI formative process.
As shown in Figure 1, organizations generally fall into one of four categories:
Category 1: The organization continually monitors KPIs
Category 2: The organization monitors KPIs quarterly
Category 3: The organization monitors KPIs annually
Category 4: The organization has KPI data but does not yet have a plan to analyze the KPI data.
Figure 1: KPI use continuum
An organization may fall between two of these categories, sitting closer to one than the other, with the exception of category 4. Category 4 represents an organization where no systematic process has yet been developed to use data to measure KPIs, although the data may exist.
Organizations in categories 1 and 2 already perform some sort of formative KPI process by the very nature of how frequently they review their KPIs. When it is time for the annual process, where summative information is determined, organizations in categories 1 and 2 have additional formative data that can be used to conduct a data quality check on the summative values that were calculated. Additionally, as will be discussed in the KPIE process, these values can add to the contextual narrative associated with the final numbers. Organizations in category 3 simply need to follow the process outlined; unlike categories 1 and 2, however, they will not have formative data points available for comparison, since they are just beginning to review their KPI data. A special blog within a blog will address organizations that identify with category 4 later on.
In addition to the final annual values calculated for each KPI, it is also important to calculate the value of each KPI for each quarter based on data that has been verified at the end of the year. Calculating quarter-specific values will provide important points of reference when establishing trends that occurred during the year.
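As an illustration of what quarter-specific values can look like in practice, here is a minimal sketch using pandas. The column names and data are hypothetical: one row per qualifying event, with a boolean indicating whether the event met the KPI target.

```python
import pandas as pd

# Hypothetical verified year-end data: one row per qualifying event.
df = pd.DataFrame({
    "event_date": pd.to_datetime(["2017-02-10", "2017-05-03", "2017-05-20",
                                  "2017-08-15", "2017-11-01", "2017-11-30"]),
    "met_target": [True, True, False, True, True, False],
})

# Group the verified data by calendar quarter and compute the KPI
# (percent of events meeting the target) for each quarter.
quarterly = (df.set_index("event_date")
               .resample("Q")["met_target"]
               .mean() * 100)
print(quarterly.round(1))
```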
Importance of First Quarter Data
Is the first quarter the baseline or the frontline? In many cases the emphasis of first quarter data is one or the other. When the first quarter is the baseline, KPI performance during that period becomes the comparison point for the remaining three quarters of the year. The first quarter is often used solely as a baseline value when a new initiative is launched at the beginning of the analytic period and tweaks to the process are made during those first three months. Alternatively, the first quarter may serve as a baseline when looking at a new initiative that starts in the second quarter of the year.
If the KPI is one that is routinely used, then the first quarter result may be the frontline value: a value that is expected to be sustained at similar levels for the rest of the year. There is less mystery with a frontline number; it is usually predicted from historical data and anticipated within a range of certainty. Baselines, by contrast, are often diagnostic and set a bar for growth or reduction expectations.
Even when the first quarter data is a frontline, it often becomes more of a baseline as time moves on, because it turns into a comparative data point. For example, consider that renewable energy consumption is a KPI for a company. The amount of solar energy the company can utilize is expected to remain consistent at X kilowatts of power a month because the company is in a sunny location year-round. The first quarter produced X kilowatts of power as expected, but the second quarter yielded a one-fifth decline in power consumption. The second quarter results were not expected. By the very nature of the unexpected second quarter result, the first quarter becomes a baseline value of what was expected versus what actually happened in the second quarter.
The team should make sure that the first quarter data is calculated and cross-checked. The team will also want to make note of the following, all of which are described in the next blog article (#12): limitations, deductions, completeness, and special circumstances.
Blog #11 Call to Action Question: Where is your organization on the KPI Use Continuum?
Blog #12 Sneak Peek: The Quarterly
This post continues the examination of the need to conduct an initial quality check after applying a calculation process to one or more Key Performance Indicators (KPIs) for the first time. In part 1 of this blog topic (see the previous entry, post #9), an overview of the purpose of the quality check was presented, along with a specific focus on how to examine the business rules used. In part 2, we examine the fidelity of the algorithm used to calculate the KPI value as well as the point in time of the data used for the calculation.
Algorithms Correctly Respond to the KPI
In the eighth blog post of this series, entitled Initial Run Through, the concepts of area and perimeter were referenced. These concepts illustrate the part of the quality checking process devoted to confirming that the correct algorithms are used to calculate a particular KPI. The area of a shape is specific to certain of its characteristics. In general, two numbers are going to be multiplied together: the base and the height. This is straightforward when finding the area of a parallelogram. Merriam-Webster's online dictionary defines a parallelogram as "a quadrilateral with opposite sides parallel and equal."[i] The means of calculating the area of a trapezoid is different. Merriam-Webster defines a trapezoid as "a quadrilateral having only two sides parallel."[ii] While the trapezoid is also a quadrilateral, its characteristics are slightly different than a parallelogram's, and thus the formula for calculating the area is also slightly different: ((base a + base b)/2) × height. The point here is that the algorithm may appear to be calculating values with great precision; yet the team needs to be sure that the precision calculated is specific to the KPI assessed.
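The distinction between the two formulas is easy to express in code. A minimal sketch:

```python
def parallelogram_area(base: float, height: float) -> float:
    # Area of a parallelogram: base times height.
    return base * height

def trapezoid_area(base_a: float, base_b: float, height: float) -> float:
    # Area of a trapezoid: average of the two parallel sides times height.
    return (base_a + base_b) / 2 * height

print(parallelogram_area(6, 4))   # 24
print(trapezoid_area(6, 10, 4))   # 32
```

Both functions multiply numbers with great precision; only one of them is correct for any given shape, which is exactly the check the team needs to make for each KPI.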
The Date Range is Reflective of the KPI Parameters and a Frozen File
"Ladies and gentlemen, may I have your attention in the terminal. Generic Airlines Flight 1234 en route to Rochester, NY is in the final stages of boarding for an on-time departure. The doors will be closing in 10 minutes and will not reopen." Sound familiar? This page over the PA system is most prevalent in two situations where people are rushing to make their connections:
At this point in the performance year, data collection is still under way for the fourth quarter. As we move forward, this reality is recognized, and we will work through how to make progress while data is still coming in for this last 25% of the performance period. By this time, however, all data for the first, second, and third quarters should be complete. Additional data should not be in the process of being added for the first three quarters. If data is still being modified for those quarters, then the team needs to deem the data set not ready. It is recommended that the strategies outlined in blog posts #4-7 be used to validate these three quarters of data once finalized.
The data set used to conduct the initial run through should be date-stamped with when it was extracted, from what system, and by what processes or people. All of this information should be included in the data dictionary that accompanies the data set. Additionally, as the calculation of a KPI may require more granular subsets, any subsets of data created should also carry timestamp information and be included in the data dictionaries of the subset(s) created.
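A minimal sketch of what such extract metadata might look like; the keys and values are illustrative, not a prescribed format:

```python
# Hypothetical extract metadata to record in the data dictionary.
extract_metadata = {
    "data_set": "employee_roster_extract",
    "extracted_on": "2017-12-15T09:30:00",
    "source_system": "HR application on server_01",
    "extracted_by": "J. Analyst (scheduled nightly job)",
    "parent_data_set": None,  # fill in for subsets created downstream
}
```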
An Unexpected Surprise
As the team continues to work with data sets, it is quite possible that an unforeseen glitch in the data may arise. Such a glitch may occur even after great care has been taken to check the completeness and quality of the data. A common glitch at this stage is the sudden discovery of a significant set of data that has not yet been entered into the application storing the data needed. As a result, the team now faces the reality that the data set currently being used to assess the KPI(s) is incomplete. The team should discuss the feasibility of backtracking the analytic process to include this information. Alternatively, the team may need to speak with other data leaders about proper courses of action, which may include discontinuing the analysis of a specific KPI or proceeding with limited information. Moving forward with known limitations raises a value question once again: do the limitations that have surfaced outweigh the potential conclusions?
Blog #10 Question: Does your initial run through of data align with the intent of the KPI?
Blog #11 Preview: First Quarter Analysis
[i] Source: https://www.merriam-webster.com/dictionary/parallelogram, retrieved 11/25/2017.
[ii] Source: https://www.merriam-webster.com/dictionary/trapezoid, retrieved 11/25/2017.
After the first time that business rules and algorithms are applied to a data set, whether it be for a Key Performance Indicator (KPI) assessment process or for any analytic purpose, a quality review should occur. A data quality review specific to the KPI assessment process will examine:
After each team member has had the opportunity to generate the values associated with the KPI in question, the team should come together to deliberate on each other's findings. The first question should focus on whether everyone generated the same value. If this is not the case, then the values should be compared. Is there a large difference? Document the value generated by each team member, the order in which each team member completed the process, and the length of time it took each team member to calculate the value. Once this information has been collected, the team should meet to begin the data quality review process by examining the business rules used.
Business Rules Outlined Correctly
A great way to begin the discussion for each component of the quality review process would be for the team to compare notes and see whether each member arrived at the same value for the KPI under discussion. If…
For every red car produced in the 2017 model year, the amount of carbon dioxide emitted will be 50% lower than for the same red model of car manufactured in the 2016 model year.
The data set used to calculate the example KPI above contains multiple vehicle types and colors. The team should focus on how the subset of data was created. The example KPI is specific to two categorical variables: vehicle type (car) and color (red). As a result, only red cars should be selected from the data set for any calculations for this example KPI. As long as these two filters are the only ones applied, the order in which they are applied should not matter (see Figure 1).
Figure 1: Portion of Example Data Set Dealing with Vehicle Attributes.
It does not matter whether the data set is filtered by the color of the vehicle first and then by vehicle type (results shown in yellow in the column named "Color First"). The same results occur when the filtering is done by vehicle type first and then by color (results shown in blue in the column named "Car First"). The label YES means that the line of data is included.
A problem arises if the data set was used for another KPI and a different filter was applied and not removed, as shown in Figure 2. Here the table includes an additional column referring to the type of interior, either leather or cloth. If the filter on interior type is not removed from the data set prior to calculating the example KPI, a problem will occur. An illustration of this problem is highlighted in green in Figure 2. If the data set is first filtered by interior type (which has nothing to do with the example KPI), then the subset of data created to analyze the example KPI will not contain all the data needed. I encourage you to download this example table and play around with these different filtering orders; a code sketch of the same experiment follows Figure 2.
Figure 2: Portion of Data Set Dealing with Vehicle Attributes including Distractors
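Here is a minimal sketch of that experiment, assuming pandas and hypothetical column names mirroring Figures 1 and 2:

```python
import pandas as pd

# Hypothetical slice of the vehicle data set from Figures 1 and 2.
df = pd.DataFrame({
    "vehicle_type": ["car", "car", "truck", "car", "car"],
    "color":        ["red", "blue", "red", "red", "red"],
    "interior":     ["leather", "cloth", "leather", "cloth", "leather"],
})

# Filter order does not matter when only the KPI's own filters are applied.
color_first = df[df["color"] == "red"]
color_first = color_first[color_first["vehicle_type"] == "car"]

type_first = df[df["vehicle_type"] == "car"]
type_first = type_first[type_first["color"] == "red"]

assert color_first.equals(type_first)  # same red cars either way

# The Figure 2 problem: a leftover filter from another KPI silently
# shrinks the data before the KPI's own filters are ever applied.
leftover = df[df["interior"] == "leather"]  # unrelated filter never removed
wrong_subset = leftover[(leftover["color"] == "red")
                        & (leftover["vehicle_type"] == "car")]
print(len(color_first), len(wrong_subset))  # 3 vs 2: rows were lost
```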
In Part 2 of this Blog topic we examine the fidelity of the algorithm used and the point in time of the data.
Blog #9 Question: Are the business rules carrying out the KPI assessment process or are they a mere reference guide?
Blog #10 Sneak Peek: Quarter 1 Analysis
The data sets are ready to go, and this blog post marks the beginning of the third phase of the KPI assessment process, the Test Run. Congratulations on finishing the first two phases (Project Plan and Data Retrieval). The first aspect of the test run is the initial run through. Here you will be using data from the various data sets you prepared during the data retrieval process. You will need to identify the data needed for each KPI. The initial run through will rely heavily on the use of business rules. A business rule is made up of formulas and procedures to complete a calculation. We have all been using business rules to solve things our entire lives. Remember the math problem about installing a fence?
Area and perimeter. Those are the first two formulas I remember learning in school. Applying a formula was empowering. Being able to take numbers and generate a new value almost seemed like magic. It was also quite satisfying to be given a word problem containing certain pieces of information and be prompted to find the solution. Key Performance Indicators (KPIs) are like word problems. A more accurate label would be "word opportunities" that, once explored, will lead to a higher level of insight. Word problems involving area and perimeter were based on formulaic concepts specific to different families of shapes. Each of these shape families had specific business rules for how the values of area and perimeter are determined. Likewise, each KPI will have one or more business rules that outline how the actual indicator is calculated.
At the time of KPI development, best practice would involve identifying the data needed to measure the KPI, the availability of that data, where the data will be housed, and a business rule that explains how to complete the calculation. If all of these elements are in place and you have discussed them with your team, you are ready to run your initial analysis procedure (and you can skip ahead to the last paragraph). It is very common for developing the KPIs, getting agreement on them, and documenting data collection protocols at the beginning of the year to be a heavy lift, leaving the analysis business rules for another time. In order to move forward in the analytic process, that time has come.
To begin the exploration of business rule construction, let's begin with an example KPI to work with. We will use our fictional company, Reusable Cup Coffee (see blog post #3, entitled Be Part of a Team). One of Reusable's KPIs for the 2017 performance year was the following:
Beginning in the second quarter of 2017, all stores will have trained all team members on how to make the Squared Latte, such that 90% of store staff achieve the employment level of barista associate.
A couple of notes about the example KPI:
Table 1: KPI Broken into parts with corresponding purpose to each part
Now that the KPI has been broken into parts and a purpose of each part is stated, we can add one more column to the table that provides a data source description for the applicable parts of the KPI.
Table 2: Table 1 with additional column, Data Source Description
When reviewing Table 2, the right most column labeled “Data Source Description” has a note specific to the corresponding part for the KPI. When developing the business rule, begin with the algorithm. For this KPI:
KPI Value (%) = (Number of Employees with "Barista Associate" as job title / Total Number of Store Employees) × 100
Simplified: KPI % = (B/A) × 100
The KPI target is met when the KPI Value (%) is greater than 89% (that is, when at least 90% of store staff hold the barista associate title).
After the algorithm has been reviewed, construct the data calculation set by implementing business rules. The data calculation set is the data that is necessary to answer the question. The following business rules will be used to create this calculation data set for our example KPI:
==Beginning of Business Rules
Let’s run through these business rules and try it out.
KPI % = 0.94 × 100 = 94%, which meets the target.
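For readers who prefer to see the algorithm run, here is a minimal sketch with hypothetical counts (the real values of A and B would come from the calculation data set built by the business rules):

```python
def kpi_percent(barista_associates: int, total_store_employees: int) -> float:
    """KPI Value (%) = (B / A) x 100."""
    return barista_associates / total_store_employees * 100

# Hypothetical counts: 47 barista associates out of 50 store employees.
value = kpi_percent(barista_associates=47, total_store_employees=50)
print(f"KPI % = {value:.0f}")     # 94
print("Target met:", value > 89)  # True: exceeds the 90% goal
```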
Once business rules have been developed for each KPI, you are all set to conduct the initial run through of calculating the KPI values. Don't worry if you get weird-looking results or encounter errors during the initial run through. These special cases will be discussed at greater length in the next blog post.
Blog Post #8 Question: How did the Test Run go?
Blog Post #9 Preview: Pit Stop
It was an attention-seeking statement: "Ladies and gentlemen, I have just found a wallet at the front of the cabin." I looked up at the flight attendant speaking on the public-address system as if he were talking directly to me about my wallet. This strategy has been used by the lead flight attendant on several flights I have taken recently. Each time I react in the same way, by looking up, even though I know the strategy is used to get people's attention for the flight safety presentation that is commencing. Why did I continue to have this small moment of panic and react the same way even though I knew it was a way of getting my attention? Because there was a possibility that it was true. I was not 100% sure. I had a feeling of panic at discovering that something might be missing, whether an item or a piece of information.
The same sense of panic results from reviewing a data set that is missing certain information. In essence, it is a data gap. In cities with mass transit rail lines (e.g., light rail, commuter rail, subway), there are warning signs posted to remind passengers to be alert to the gap between the platform and the train car. Sometimes these gaps are minimal; at other times they can be as large as a foot. Regardless, the warning should be taken seriously at all times. Data gaps require the same level of care.
When a data gap is discovered, it should be documented. An analyst's mind will be most objective about the data gap at the time of discovery; as soon as the analyst gets involved in solving the question of why the data gap occurred, the clarity of capturing descriptive details becomes clouded. At the time of discovery, the data gap documentation should include the name of the data set, the column and rows where the data is missing, and any other details that may be relevant to the discovery (e.g., data values that contain out-of-place characters).
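A minimal sketch of what such a record might look like; the field names and values are illustrative, not a prescribed schema:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class DataGapRecord:
    """A data gap as documented at the moment of discovery."""
    data_set: str
    column: str
    rows: list          # row identifiers where data is missing
    discovered_on: date
    notes: str = ""

gap = DataGapRecord(
    data_set="shoe_survey_extract_2017",
    column="shoe_size",
    rows=[6],
    discovered_on=date(2017, 11, 25),
    notes="Row 6 is also missing a date; row 7 has an out-of-range value.",
)
```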
After initial discovery, concerns about the missing data should be discussed with your data team, and a plan for investigating the data gap should be developed. The first step in the investigation is to check the data dictionary for any notes from colleagues who were involved in the initial capture of the data or the data extraction. If no explanation is documented in the data dictionary, then the next step is to determine the data source. Was the data extracted by:
The process of identifying the cause of data gaps and deciding how to move forward is time consuming. When multiple people are involved, there will be time spent researching past practice and reviewing metadata. This will involve waiting; however, it is important to figure out why the data gap exists. The process of discovery should not be abandoned because it is taking too long or is complex. Assessing a KPI based on flawed data will cause major issues down the road and should be avoided.
Blog #7 Question: Are you taking care to mind the data gap?
Blog #8 Sneak Peek: Initial Run Through
Have you had the pleasure of filling out a form online recently? The level of enjoyment can vary based on the person completing the form and its complexity. I must admit there have been plenty of times when I have pushed the "Submit" button only to get the same screen refreshed with a red or pink error message telling me to look through the form one more time due to an omission or invalid data format. I also must admit that even though I may be the only one interacting with the computer at the time, I get a little embarrassed that I missed something. I can see why these data checkers are in place, as they make the collection of a complete data set more likely.
A complete data set maximizes the value of the discoveries that can be made. It also minimizes the amount of time that it takes to investigate data gaps that may be present. As a result, data quality tools (e.g. completeness checkers that make sure that all required information has been provided and is in the right format) have become important resources to help facilitate complete data sets.
Although it is not critical to the KPI assessment process, it does help to know how the data that will be used was collected. For privacy purposes, it is critical that you know where the data came from and what elements are considered personally identifiable information. (A topic for a future blog post.) For the purposes of the KPI assessment process, knowing how the data was collected may be useful when looking at the completeness of a data set. Let's look at the data set referenced in blog post #5, Data Review.
In Figure 1, there is a data set with two rows highlighted in yellow. Figure 2 outlines the data dictionary for the data set in Figure 1. The data set contains three columns and eight rows, with the first row being the column headers. The data appears to be complete and in the correct format until the sixth row of data, where there is no date or shoe size. The seventh row of data is missing a valid shoe size. These types of observations are important to note in the data dictionary. At this stage of the data retrieval process, the sixth and seventh rows of data will be set aside as the remainder of the data is examined. At this point you are not deleting these rows; you are simply setting them aside.
Take notice of the column labeled "ID." According to the data dictionary (Figure 2), the ID column contains a unique identifier. It is important to check whether any supposedly unique value occurs more than once in the data set. Looking through the data in Figure 1, all of the values in the ID column are unique. If a duplicate value were found, this too should be noted in the data dictionary.
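Both checks, completeness and uniqueness, are quick to automate. A minimal sketch, assuming pandas and hypothetical column names matching Figures 1 and 2 (the file name is a placeholder):

```python
import pandas as pd

df = pd.read_csv("shoe_survey_extract.csv")  # placeholder extract file

# Completeness: rows missing a required field are set aside, not deleted.
incomplete = df[df[["date", "shoe_size"]].isna().any(axis=1)]
complete = df.drop(incomplete.index)

# Uniqueness: the data dictionary says ID should never repeat.
duplicate_ids = df[df["ID"].duplicated(keep=False)]

print(f"Set aside {len(incomplete)} incomplete rows; "
      f"found {len(duplicate_ids)} rows with duplicate IDs.")
```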
The data example in Figure 1 is only seven rows of data; this small number is for ease of illustrating concepts. Most data sets will have many more rows. A key question that may be forming is how one would conduct a scan of a much larger data set. First, given that the data file is an extract, you should be able to view the file. Although the file may have 1,000 rows, it is important to take the time to review it (as was discussed in yesterday's post). You may spot some inconsistencies by looking through the file, but given the number of rows, it is a good idea to use a data aggregation tool, such as a pivot table or pivot chart in your spreadsheet application, to quickly provide a summary of the data set. Additional calculated columns can be added to determine averages, ranges, standard deviations, and so on. Of course, there are also statistical analysis programs available (R, SAS, SPSS, Stata); however, we are not going to explore tools of this type in this KPI Assessment process. The key question you want to be able to answer is: given the data set provided, how complete is it? Are there inconsistencies and/or gaps in the data?
Returning to the sixth and seventh rows of data from Figure 1, the data set contains a data gap (row six) and an inconsistency (row seven). Blog #7 will address the issue of data gaps, given that when data gaps occur, a broader examination of data quality needs to be considered. In the case of the inconsistency, there is only one instance in the example data set. What caused this inconsistency? Was it an entry error? Was it actually the response of the person surveyed? Did data get transposed from one field to another? Any of these is a possibility, with the first two being more probable than the others, as they are linked to data entry error, which is very common. In this example, the seventh row of data would be eliminated from further analysis, given that there are too many questions about why the data is inconsistent and it represents only 1 out of 7 rows (14%) of the data. However, if this same error occurred in a data set with 1,000 rows and 14% of the data set had inconsistencies (amounting to 140 cases), serious concerns about data quality would arise.
If 140 cases were inconsistent, time would need to be spent determining whether there is a pattern to the inconsistency that could be caused by an algorithmic error or a consistent data collection protocol error. For example, suppose further investigation determined that all 140 inconsistent cases had a shoe size value of 100 that was supposed to be 10, and the survey team explained that the "0" key on the keyboard kept sticking and added zeros that should not have been included. The cause would then be a consistent data collection protocol error.
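A minimal sketch of that pattern check, using the same hypothetical extract as above:

```python
import pandas as pd

df = pd.read_csv("shoe_survey_extract.csv")  # placeholder extract file

# Flag implausible shoe sizes and check whether they share one pattern.
inconsistent = df[df["shoe_size"] > 25]
if (inconsistent["shoe_size"] == 100).all():
    # Every bad value is exactly 100, consistent with the reported
    # sticking-zero protocol error, so a single correction applies.
    df.loc[df["shoe_size"] == 100, "shoe_size"] = 10
```

If the bad values did not share a single pattern, the correction could not be applied wholesale, and each case would need individual review.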
An investigation of inconsistencies will lead to decisions about whether to:
Blog #6's question: Is the data complete?
Blog #7 sneak peek: Data Gaps
Opening a data set for the first time is like opening a present: you get a feeling of excitement and anticipation. Today we are going to spend time reviewing data that has been extracted from one of the applications you inventoried in yesterday's process, Identifying Data Sources. If this experience is new to you, don't worry, because we are going to work through it together using the following steps:
Take in the view. Look at the columns and rows and get a feel for the construction of the file. Look at the very bottom of the screen. There should be a tab with the name of the data sheet that is currently open. Is there more than one tab? If so, a colleague must have programmed an extract with several tabs or organized the data into tabs after extraction. Take a moment to click through those tabs and see what information is contained on each.
Blog #5’s question: Have you set aside time to review each data file and take in its “artistic” qualities?
Blog #6 Sneak Peek: Data Completeness
Progress has been made! We are moving to the next part of the KPI assessment process. As referenced in the second blog post, entitled KPI Assessment Timeline, we are moving from the project plan to the data retrieval stage. I cannot tell you how many times I have been asked directly, or been in the presence of others who were asked, "How complicated is it to retrieve data?" When the technical answer begins to roll off the tongue, some seem to get bored. I really don't know why, as it is a truly amazing process! To avoid getting too technical, let's think of a data retrieval process as a trip to the grocery store. There are different types of trips. A data retrieval process is not like stopping to get a gallon of milk; it is more like shopping for the ingredients for a family-reunion-style Thanksgiving dinner. If you have been to a large family event, you can appreciate the amount of time and effort that goes into planning and preparing for numerous guests. If you have not had this pleasure, think about a big gathering you participated in recently. Now think specifically about the person who was in charge of the event. I would venture to say that they:
These characteristics are reflective of a great data retrieval team. They know:
Often, data retrieval is simplified by routinely extracting the same data. This routine makes the location of the data continually familiar. However, routine is not best practice; documentation is the process that makes the location of data known. If your organization has not yet documented the location of all data attributes, the applications that store them, and the servers that house them, it is time to set aside time in the near future to complete this task. This documentation is organized in a manner similar to Figure 1: applications are stored on servers, and tables of data are associated with each application.
Having an inventory of each of these three aspects of information architecture is extremely important. Figure 1 considers only one server, but most organizations have multiple servers. (If everything you do is cloud-based, consider each cloud application its own application and the cloud as the server.) If your organization does not have a detailed mapping at this point in time, create a data map skeleton for the time being. An example of a data map skeleton is shown in Figure 2.
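A data map skeleton can be as simple as a nested structure. A minimal sketch, with placeholder server, application, and table names:

```python
# Hypothetical data map skeleton: server -> application -> tables.
data_map = {
    "server_01": {
        "hr_application": ["employees", "job_titles", "training_records"],
        "sales_application": ["transactions", "stores"],
    },
    "cloud": {  # treat each cloud app as its own application
        "survey_tool": ["responses"],
    },
}

# Walk the map to print an inventory of what lives where.
for server, apps in data_map.items():
    for app, tables in apps.items():
        print(server, "->", app, "->", ", ".join(tables))
```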
If documentation does exist, it is time to cross-check and verify that all information is up to date. Even if the documentation has a "last reviewed" date within the past couple of weeks or days, spend the time to verify. Today's step is all about checking and inventorying what you have. Some of this data may not be used in the KPI assessment process, but in order to know what you will be using, you need to be aware of all of the data attributes available.
Some organizations may have an inventory of data that they store in a database of its own. Even in this case, it is time to do a data audit to make sure that the system has captured all that you intended it to capture.
Other organizations may feel comfortable because they mapped out the data needed to assess their KPIs from the time the KPIs were set. Having been consistent throughout is wonderful; still, take an inventory to verify. Other data points may have been collected since the KPIs were put into place that are a better fit than the data available at the time of the original mapping.
Check and recheck. That is today's bottom line. Investing in an inventory of what you have helps prevent a missing piece from being discovered at a later time. A significant amount of time can be wasted when a critical piece is overlooked and discovered later. This waste of time affects not only performance but also patience, as frustration is certain to rise. Stay a step ahead by being thorough.
Blog #4’s Question: Do you know what data your organization has, where it is stored, and how to retrieve it?
Blog #5 sneak peek: Data Review