It was an attention-seeking statement: “Ladies and gentlemen, I have just found a wallet at the front of the cabin.” I looked up at the flight attendant speaking on the public-address system as if he were talking directly to me about my wallet. Lead flight attendants have used this strategy on several flights I have taken recently. Each time I react the same way, by looking up, even though I know the strategy is meant to draw passengers’ attention to the flight safety presentation that is about to begin. Why do I continue to have this small moment of panic even though I know it is a way of getting my attention? Because there is a possibility that it is true. I am not 100% sure. That feeling of panic comes with discovering that something may be missing, whether it is an item or a piece of information.
The same sense of panic results from reviewing a data set that is missing certain information. In essence, it is a data gap. In cities with mass transit rail lines (e.g. light rail, commuter rail, subway), warning signs remind passengers to be alert to the gap between the platform and the train car. Sometimes these gaps are minimal. At other times they can be as large as a foot. Regardless, the warning should be taken seriously at all times. Data gaps require the same level of care.
When a data gap is discovered, it should be documented right away. An analyst is most objective about the data gap at the time of discovery. Once the analyst gets absorbed in working out why the gap occurred, it becomes harder to capture descriptive details clearly. At the time of discovery, the data gap documentation should include the name of the data set, the columns and rows where data is missing, and any other details that may be relevant to the discovery (e.g. data values that contain out-of-place characters).
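For readers who like to see this in code, here is a minimal sketch of what capturing a data gap at the time of discovery might look like. Everything in it is an assumption made for illustration: the function name `find_data_gaps`, the list of values treated as missing, the file name `sales_2023.csv`, and the sample rows are all hypothetical, and your own data set will likely need different rules.

```python
from datetime import date

# Values treated as "missing". This list is an assumption; adjust it to your data set.
MISSING_MARKERS = {"", "NA", "N/A", "null", "None"}

def find_data_gaps(dataset_name, rows):
    """Return one record per missing cell: data set name, row number, and column."""
    gaps = []
    for row_number, row in enumerate(rows, start=1):
        for column, value in row.items():
            if value is None or str(value).strip() in MISSING_MARKERS:
                gaps.append({
                    "dataset": dataset_name,
                    "row": row_number,
                    "column": column,
                    "found_on": date.today().isoformat(),
                })
    return gaps

# Hypothetical sample data, for illustration only.
sample_rows = [
    {"region": "North", "monthly_sales": "1200"},
    {"region": "South", "monthly_sales": ""},
    {"region": None, "monthly_sales": "950"},
]

gap_log = find_data_gaps("sales_2023.csv", sample_rows)
for gap in gap_log:
    print(gap)
```

The point of the sketch is simply that the log is written before any investigation starts, so the description of the gap is not colored by later theories about its cause.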
After the initial discovery, concerns about the missing data should be discussed with your data team, and a plan for investigating the data gap should be developed. The first step in the investigation is to check the data dictionary for notes from colleagues who were involved in the initial capture or extraction of the data. If the data dictionary contains no explanation, the next step is to determine the data source. Was the data extracted by:
Identifying the cause of a data gap and deciding how to move forward is time-consuming. When multiple people are involved, time will be spent researching past practice and reviewing metadata. This will involve waiting; however, it is important to figure out why the data gap exists. The process of discovery should not be abandoned because it is complex and taking too long. Assessing a KPI based on flawed data will cause major issues down the road and should be avoided.
Blog #7 Question: Are you taking care to mind the data gap?
Blog #8 Sneak Peek: Initial Run Through