ETL: Classify do not Filter out data

Wed, Feb 26, 2014

The first step when designing a fact table using Ralph Kimball’s dimensional modelling method is to decide on the grain of the table. Another way of describing the grain is the level of detail to store in your fact table. The accepted rule is that you should always store the data at the most detailed level available.

The second rule you should follow is load all available data into your fact table. Instead of filtering out some data, store as much of the data as possible and classify the data. In my experience the main reason for not loading all available data is due to incomplete data warehouse specifications. How many times have you listened to the project sponsor/manager say something like, “this data warehouse project is to only look at perishable foods sales, we will not be looking at anything else”. I say load all the data and create a dimension to separate out the perishable products from the rest. It will only be a matter of time before someone wants to see the bigger picture, how do perishable’s relate to the rest of the business?

The reasons for my suggestion are similar to the reasons for why you should store the data in the most detailed format. These reason include:

  • You will be able to can answer questions you may not have anticipated without having to make changes to your ETL process to load missing data
  • Hardware is cheaper than developer/consultant time used up in making the changes to the data
  • Your users will thank you as they wont have to wait for you to load the missing data

Apart from the project scope not being well defined, another reason data is left out of the data warehouse is due to data quality issues. Quality issues arise namely because:

  • Lack of validation in the source system
  • Data migration from one system to another for reasons including
  • - Natural evolution of software
  • - Merger or aquisation took place and all data had to be migrated to one platform

Even though the data may not be accurate this data is still part of the record of what happened in the company. I still recommend including such data and creating a data quality dimensional table to describe the quality of the data. Below is a sample data quality dimension table.

quality_id quality_name quality_description date_from
0 Accurate Accurate data from source systems 2014/02/01
1 Accurate Data from merger, may have some issues 2014/03/01
2 Not Accurate Data from acquisition in 2010, not all data was mapped correctly 2010/01/01

Remember not everyone who participated in the data migration is going to be around in a year’s time and you never know when someone if going to require the data.

If you think you might need some data even though the user didn’t say they need it now, you probably will need it and it is best to go ahead and load into your data warehouse.


Related Posts

Visualising SQL Statements
6 Simple SQL SELECT statement performance tips