
Data warehouse – Loading data into a dimensional model – High Level – simplified.

This is a big topic and I know that there are several ways of doing this.
I am going to discuss how I like to do it.

We split the environment into Stage and ODS/DWA tables on the actual database platform. This can be any database, e.g. Oracle, SQL Server or even Teradata.
We use two separate databases; I explain why later in this post.
Also make it a standard to create the tables in schemas, e.g. USA.Customer_Flat and USA.Customer_STG, so that if the business grows you can support multiple countries (or regions, depending on your business). I list the positives of splitting by schema later in this post.

Stage contains two sets of tables for each file (source text file etc.) coming in.
Let’s say the file name is Customer; then you will have Customer_Flat and Customer_STG. The Flat table consists only of string fields with no type definitions, while the STG table has typed columns.
We apply type-check rules on the Flat table and either update each record to the correct type or mark it as an error. In some cases we even do MDM (Master Data Management) lookups and update the columns. MDM is important when you work across countries, businesses, operational systems etc. Work with it from day one in your project; later it will be more complex and it will cost you more.
We will discuss the definition of MDM later on the blog; for now, see it as lookup values.
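
To make the type check concrete, here is a minimal Python sketch of how a rule set could be applied to one Flat row. The column names, the CUSTOMER_RULES table and the check_flat_row helper are my own assumptions for illustration; your rules will depend on your source files.

```python
from datetime import date
from decimal import Decimal, InvalidOperation

# Hypothetical converters: each takes the raw string and returns a typed value.
def to_int(value):
    return int(value.strip())

def to_date(value):
    return date.fromisoformat(value.strip())

def to_decimal(value):
    return Decimal(value.strip())

# Hypothetical rule set for Customer_Flat: column name -> converter.
CUSTOMER_RULES = {
    "customer_id": to_int,
    "birth_date": to_date,
    "credit_limit": to_decimal,
}

def check_flat_row(row, rules):
    """Return (typed_row, errors). Columns without a rule stay as strings."""
    typed, errors = dict(row), []
    for column, convert in rules.items():
        try:
            typed[column] = convert(row[column])
        except (KeyError, ValueError, InvalidOperation, AttributeError) as exc:
            # Keep the original string and record which column failed.
            errors.append(f"{column}: {exc.__class__.__name__}")
    return typed, errors

good, good_errors = check_flat_row(
    {"customer_id": " 42", "name": "Smith",
     "birth_date": "1980-02-29", "credit_limit": "1500.50"},
    CUSTOMER_RULES,
)
bad, bad_errors = check_flat_row(
    {"customer_id": "4x2", "name": "Jones",
     "birth_date": "1981-02-29", "credit_limit": "n/a"},
    CUSTOMER_RULES,
)
```

A row with an empty error list is ready for the STG table; a row with errors stays on the Flat table, marked as an error for follow-up.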

The next step is to insert the corrected values into the STG table. This is a simple insert/select (select into) from the Flat table, because all types are now fixed. (Create one automated script at database level for this, driven by the table definitions, so that you never have to write it by hand again. I will blog how to do this later.)
Finally, run the other rules and check for data changes on the STG table, preparing it for the next insert into the fact or dimension tables on the ODS/DWA.
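
As a rough idea of what such a generated script could look like, here is a small Python sketch that builds the insert/select from a table definition. It uses SQLite only so that it runs anywhere; the is_error flag, the STG_COLUMNS definition and the build_flat_to_stg_sql helper are assumptions for this example, not the script I use.

```python
import sqlite3

# Hypothetical definition of Customer_STG: column name -> SQL type.
STG_COLUMNS = {
    "BatchID": "INTEGER",
    "customer_id": "INTEGER",
    "surname": "TEXT",
    "credit_limit": "REAL",
}

def build_flat_to_stg_sql(name, columns):
    """Generate the INSERT ... SELECT that copies clean rows from <name>_Flat to <name>_STG."""
    col_list = ", ".join(columns)
    casts = ", ".join(f"CAST({col} AS {sql_type})" for col, sql_type in columns.items())
    return (
        f"INSERT INTO {name}_STG ({col_list}) "
        f"SELECT {casts} FROM {name}_Flat WHERE is_error = 0"
    )

conn = sqlite3.connect(":memory:")
# The Flat table holds strings only, plus an error flag set by the type checks.
conn.execute(
    "CREATE TABLE Customer_Flat (BatchID TEXT, customer_id TEXT, surname TEXT, "
    "credit_limit TEXT, is_error INTEGER DEFAULT 0)"
)
conn.execute(
    "CREATE TABLE Customer_STG (BatchID INTEGER, customer_id INTEGER, surname TEXT, "
    "credit_limit REAL)"
)
conn.executemany(
    "INSERT INTO Customer_Flat VALUES (?, ?, ?, ?, ?)",
    [("7", "42", "Smith", "1500.50", 0), ("7", "4x2", "Jones", "n/a", 1)],
)

# Only the row that passed the type checks is copied across, now with real types.
conn.execute(build_flat_to_stg_sql("Customer", STG_COLUMNS))
stg_rows = conn.execute(
    "SELECT BatchID, customer_id, surname, credit_limit FROM Customer_STG"
).fetchall()
```

Because the statement is built from the table definition, adding a new file only means adding a new definition, not writing a new script.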

Note :
All data is tagged with a BatchID so that we can track it back to the source files.
I will describe the BatchID methods in a later blog and how that works for me.

Dim and Fact load method

Now we have the data in a structured database with typed columns, ready for insertion into the DWA/ODS.
First we populate all the dimension tables (Type 1, 2, 3 etc.), then the fact table, by selecting the relevant key from each dimension and inserting it into the fact table. It’s a simple formula: always work from the outside in, meaning the dimensions first and then the fact table. See picture.
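
The outside-in order can be sketched in a few lines of Python with SQLite. The table names (DimCustomer, FactTransaction, Transaction_STG) and columns are made up for the example, and SQLite's auto-assigned INTEGER PRIMARY KEY stands in for whatever surrogate key generator you use.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Customer_STG (customer_id INTEGER, surname TEXT, region TEXT);
CREATE TABLE Transaction_STG (customer_id INTEGER, amount REAL);
CREATE TABLE DimCustomer (
    customer_key INTEGER PRIMARY KEY,   -- surrogate key
    customer_id  INTEGER UNIQUE,        -- business key from the source
    surname TEXT, region TEXT);
CREATE TABLE FactTransaction (customer_key INTEGER, amount REAL);
INSERT INTO Customer_STG VALUES (42, 'Smith', 'North'), (43, 'Jones', 'South');
INSERT INTO Transaction_STG VALUES (42, 100.0), (42, 25.5), (43, 10.0);
""")

# Step 1 (outside): load the dimension first, adding only customers not yet present.
conn.execute("""
INSERT INTO DimCustomer (customer_id, surname, region)
SELECT s.customer_id, s.surname, s.region
FROM Customer_STG s
WHERE NOT EXISTS (SELECT 1 FROM DimCustomer d WHERE d.customer_id = s.customer_id)
ORDER BY s.customer_id
""")

# Step 2 (inside): load the fact, swapping the business key for the dim key.
conn.execute("""
INSERT INTO FactTransaction (customer_key, amount)
SELECT d.customer_key, t.amount
FROM Transaction_STG t
JOIN DimCustomer d ON d.customer_id = t.customer_id
""")

fact_rows = conn.execute(
    "SELECT customer_key, amount FROM FactTransaction ORDER BY customer_key, amount"
).fetchall()
```

The fact table ends up holding only the dimension key and the measure; the descriptive columns stay on the dimension.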

Inner_Outer Tables

In some cases you will be unable to get the actual dim key. Then use a default (e.g. -1) and insert the fact record anyway. We do not want to leave records out of the fact table because of missing dimension values, and you can always update the fact later with the correct key. This often happens when an MDM lookup value is missing because a business process is still waiting to define it for IT. Do not lose the data: set a default and inform the business of the missing values as soon as possible. Create BI alert reports for this.
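
Here is a small Python/SQLite sketch of the default-key idea. Keeping the business key (customer_id) on the fact row so that it can be corrected later is an assumption of this example, as are the table names and the 'Unknown' dimension row.

```python
import sqlite3

UNKNOWN_KEY = -1  # default dim key for fact rows whose dimension value is missing

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DimCustomer (customer_key INTEGER PRIMARY KEY, customer_id INTEGER, surname TEXT);
CREATE TABLE Transaction_STG (customer_id INTEGER, amount REAL);
CREATE TABLE FactTransaction (customer_key INTEGER, customer_id INTEGER, amount REAL);
INSERT INTO DimCustomer VALUES (-1, NULL, 'Unknown'), (1, 42, 'Smith');
INSERT INTO Transaction_STG VALUES (42, 100.0), (99, 10.0);
""")

# LEFT JOIN keeps every staged row; COALESCE falls back to the default key.
conn.execute("""
INSERT INTO FactTransaction (customer_key, customer_id, amount)
SELECT COALESCE(d.customer_key, ?), t.customer_id, t.amount
FROM Transaction_STG t
LEFT JOIN DimCustomer d ON d.customer_id = t.customer_id
""", (UNKNOWN_KEY,))

# This query is the basis for an alert report to the business.
missing = conn.execute(
    "SELECT customer_id FROM FactTransaction WHERE customer_key = ?", (UNKNOWN_KEY,)
).fetchall()

# Later the missing customer is defined, and the fact rows are repointed.
conn.execute("INSERT INTO DimCustomer VALUES (2, 99, 'Jones')")
conn.execute("""
UPDATE FactTransaction
SET customer_key = (SELECT d.customer_key FROM DimCustomer d
                    WHERE d.customer_id = FactTransaction.customer_id)
WHERE customer_key = ?
  AND EXISTS (SELECT 1 FROM DimCustomer d
              WHERE d.customer_id = FactTransaction.customer_id)
""", (UNKNOWN_KEY,))
still_missing = conn.execute(
    "SELECT COUNT(*) FROM FactTransaction WHERE customer_key = ?", (UNKNOWN_KEY,)
).fetchone()[0]
```

No fact row is dropped at load time, and the same query that feeds the alert report tells you when everything has been fixed.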

Also remember to use your own key generator when you create the dim surrogate keys while populating data.
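
What a key generator looks like depends on your platform. As one possible reading, the Python sketch below hands out sequential surrogate keys per business key; the SurrogateKeyGenerator class and the "USA-42" style business keys are hypothetical names for this example only.

```python
import itertools

class SurrogateKeyGenerator:
    """Hands out sequential surrogate keys, one per business key (illustrative only)."""

    def __init__(self, last_used_key=0):
        # Continue after the highest key already present in the dimension.
        self._counter = itertools.count(last_used_key + 1)
        self._keys = {}

    def key_for(self, business_key):
        # The same business key always maps to the same surrogate key.
        if business_key not in self._keys:
            self._keys[business_key] = next(self._counter)
        return self._keys[business_key]

gen = SurrogateKeyGenerator(last_used_key=100)
first = gen.key_for("USA-42")
again = gen.key_for("USA-42")
second = gen.key_for("USA-43")
```

The point is that the warehouse, not the source system, controls the dimension keys.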

Why two (2) separate databases for the staging and the actual ODS/DWA?

  • Data can be loaded and processed (load steps) on stage without impacting the ODS database.
  • Data can be left on stage and processed to ODS later, when ODS load is low, with minimum impact to users. Remember that the ODS/DWA belongs to business, not IT.
  • Database backups can be done separately, again avoiding impact.
  • You can store the databases on different servers/drives for speed, redundancy etc. (a technical consideration).
  • In some cases you have multiple countries (businesses) and data is missing for one of them. If the business requirement is that nothing is loaded to ODS until all data is in stage, the split lets you hold the load; otherwise you can process each country separately.
  • One set of ETL standards can be implemented on stage and a different set on the ODS/DWA.
  • I have listed a few but can think of more; your turn.

Why split your staging tables into schemas?

  • Some databases allow you to place each partition on a separate drive (gaining speed on disk reads).
  • Easily identify the country/region for which you load data simply by looking at the table name.
  • More control over your data loads and technical architecture.
  • In some databases you can even back up a single schema.
  • You may have your own list, so share it with us.

I know there are negatives to splitting both the stage database and the schemas; that can be another discussion later. Write them in the comment section, then I will list them in this blog article and we can start that blog post.

{Views and opinions on this Blog does not reflect current/past employers view(s).}

Data Warehouse – Fact and Dimension Tables in a dimensional model

Fact and Dimension Tables (Ralph Kimball) in a dimensional model
OK, now that we know that we get logical and physical models in a data warehouse, let’s go into some of the content that goes into the actual tables that reside on the data warehouse.
I will discuss fact and dimension tables at a high level, as I know you can find hundreds of books and articles on the web describing them in much more detail. One of the books I used for this summary was written by Ralph Kimball (The Data Warehouse Toolkit).
Remember that the tables are not platform specific, i.e. they will not run only on Oracle, SQL Server or Teradata.
It makes a lot of sense to attend Kimball’s master class on this subject.

Dimension (Dim) Table

A dim table contains the textual description of an entity (subject area).
OK, saying that does not really make sense now, does it?

Let’s do this by example, taking banking as our industry.
A typical dimension will be a Customer with his surname, first name, region, country and address, that is, no factual data about the customer.
The customer dim table will be linked to a fact table by a key (I will chat about keys in a later blog).
A dimension table can be used to group data. For example, we can group all the customers by region and then link them to a fact table to get accounts by region.
This table should contain the best description possible for every short description. You can store the short description next to it, but the more descriptive the data, the better it is to use and the more the user will like you for it.
Build one dim (e.g. Customer) and re-use the key in all fact tables. Avoid having the same dim table with different attributes all over the database; if you need one more description, add it to the existing table rather than creating another table.

Thus a dim table has the following attributes:

  • Textual descriptive information on the subject.
  • Can be used to group data (GROUP BY).
  • Links to fact tables with keys.
  • Can have more than one attribute of the subject; in fact, up to a few hundred attributes can exist.
  • Data can be used for reporting labels, e.g. region description.
  • Descriptive data can be sourced from the master data repository (MDM).
  • A lengthy description must be provided, if possible, for any short description (key); source it from MDM if possible.
  • The table can define a hierarchy within itself (self-referencing table; I will provide a sample later on the blog) or hold a flat hierarchy in columns, e.g. Region, Country.
  • Highly denormalized (later blog).
  • Companion table to a fact table.
  • Only one row per business key, i.e. one customer cannot have two names.
  • Dimension data can change as follows (I will go into more detail on the blog):
    1. Type One – Simply overwrite the old value(s).
    2. Type Two – Add a new row containing the new value.
    3. Type Three – Add a new attribute to the existing row.
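
The three change types can be sketched on plain Python dictionaries. The valid_from, valid_to and is_current columns, the previous_<column> naming and the way the new key is picked are assumptions of this sketch; real implementations differ by platform and ETL tool.

```python
from datetime import date

def apply_type_one(rows, business_key, changes):
    """Type One: overwrite the old value(s) on the existing row(s)."""
    for row in rows:
        if row["customer_id"] == business_key:
            row.update(changes)

def apply_type_two(rows, business_key, changes, effective):
    """Type Two: close the current row and add a new row with the new value(s)."""
    current = next(
        r for r in rows if r["customer_id"] == business_key and r["is_current"]
    )
    current["is_current"] = False
    current["valid_to"] = effective
    new_row = {
        **current,
        **changes,
        "customer_key": max(r["customer_key"] for r in rows) + 1,  # new surrogate key
        "valid_from": effective,
        "valid_to": None,
        "is_current": True,
    }
    rows.append(new_row)
    return new_row

def apply_type_three(rows, business_key, column, new_value):
    """Type Three: keep the old value in a new attribute on the same row."""
    for row in rows:
        if row["customer_id"] == business_key:
            row[f"previous_{column}"] = row[column]
            row[column] = new_value

dim = [{
    "customer_key": 1, "customer_id": 42, "surname": "Smith", "region": "North",
    "valid_from": date(2020, 1, 1), "valid_to": None, "is_current": True,
}]
apply_type_one(dim, 42, {"surname": "Smyth"})                 # spelling correction
apply_type_two(dim, 42, {"region": "South"}, date(2024, 6, 1))  # customer moved

dim3 = [{"customer_key": 1, "customer_id": 42, "region": "North"}]
apply_type_three(dim3, 42, "region", "South")
```

Type One loses history, Type Two keeps full history at the cost of extra rows, and Type Three keeps only the previous value.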

Fact Tables:

A fact table typically consists of only numerical data that you can join to the dim table by key.
Typically this table is deep and narrow (not lots of columns) with millions of rows.
Let’s go back to the customer in the banking industry. We would now like to see all transactions for the customer, so we create a fact table with transactions in it.
This transaction table will simply have a transaction amount (AMT) and a customer key.
Joining the customer dim to this table will show us how many transactions there are per customer and what the values of those transactions were. Even better, we can now see it by region thanks to the data in the dimension table.
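
A quick Python/SQLite sketch of that join, with made-up table names and sample rows, could look like this:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DimCustomer (customer_key INTEGER PRIMARY KEY, surname TEXT, region TEXT);
CREATE TABLE FactTransaction (customer_key INTEGER, transaction_amt REAL);
INSERT INTO DimCustomer VALUES (1, 'Smith', 'North'), (2, 'Jones', 'South'), (3, 'Brown', 'North');
INSERT INTO FactTransaction VALUES (1, 100.0), (1, 25.5), (2, 10.0), (3, 4.5);
""")

# The fact supplies the numbers; the dimension supplies the grouping label.
by_region = conn.execute("""
SELECT d.region, COUNT(*) AS transactions, SUM(f.transaction_amt) AS total_amt
FROM FactTransaction f
JOIN DimCustomer d ON d.customer_key = f.customer_key
GROUP BY d.region
ORDER BY d.region
""").fetchall()
```

Swapping d.region for d.surname in the query gives the same figures per customer, without touching the fact table.
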
Thus a fact table has the following attributes:

  • Typically only numerical data.
  • Narrow, with thousands to millions of records (deep).
  • Sums and calculations can be performed on the attributes (additive, non-additive, semi-additive).
  • Must have at least one link to a dim table in order to do MIS.
  • It is in the centre of a star schema.
  • The grain is important for each fact table (define it before you create the table).
  • Types of fact table: transactional, periodic snapshot, accumulating snapshot (later blog on this).

Sample of both.

Logical and physical model sample

References:
Kimball, Ralph. The Data Warehouse Lifecycle Toolkit, Second Edition. Wiley Publishing, Inc.
http://en.wikipedia.org/wiki/Dimension_table
