
Data warehouse – Loading data into a dimensional model – High Level – simplified.


This is a big topic and I know that there are several ways of doing this.
I am going to discuss how I like to do it.

We split the environment into Stage and ODS/DWA tables on the actual database platform. This can be any database, e.g. Oracle, SQL Server or even Teradata.
Use two separate databases; I explain why later in this post.
Also make it a standard to create the tables within schemas, e.g. USA.Customer_Flat and USA.Customer_STG, so that you can support multiple countries (or regions, depending on your business) if the business grows. I list the advantages of splitting by schema later in this post.

Stage contains two tables for each incoming file (source text file etc.).
Let’s say the file name is Customer. You will then have Customer_Flat and Customer_STG, where the Flat table consists only of string fields with no type definitions and the STG table has typed columns.
We apply type-check rules on the Flat table and either update the record to the correct type or mark it as an error. In some cases we also do MDM (Master Data Management) lookups and update the columns. MDM is important when you work across countries, businesses, operational systems etc. Work with it from day one in your project; later it will be more complex and it will cost you more.
We will discuss the definition of MDM later on the blog; for now, see it as lookup values.

The next step is to insert the corrected values into the STG table. This is an insert ... select from the Flat table, because all types are now fixed. (Create one automated script at database level for this, driven by the table definitions, so that you never have to write it by hand again. I will blog about how to do this later.)
Finally, run the remaining rules and data-change checks on the STG table to prepare it for the next insert into the fact or dimension tables in the ODS/DWA.
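
The Flat-to-STG step above can be sketched roughly as follows. This is a minimal illustration, not the actual load script: it uses an in-memory SQLite database as a stand-in for the real platform, does the type checks in a Python loop rather than in set-based SQL, and the column names (customer_id, name, birth_date, is_error), the YYYY-MM-DD date format and the sample rows are all assumptions made for the example.

```python
import sqlite3
from datetime import datetime

def to_int(text):
    """Return text as an int, or None if it is not a valid integer."""
    try:
        return int(text.strip())
    except (ValueError, AttributeError):
        return None

def to_date(text):
    """Return an ISO date string, or None if text is not a valid YYYY-MM-DD date."""
    try:
        return datetime.strptime(text.strip(), "%Y-%m-%d").date().isoformat()
    except (ValueError, AttributeError):
        return None

def load_stage(conn, batch_id):
    """Type-check Customer_Flat rows, flag bad ones, copy good ones to Customer_STG."""
    rows = conn.execute(
        "SELECT rowid, customer_id, name, birth_date "
        "FROM Customer_Flat WHERE batch_id = ?",
        (batch_id,),
    ).fetchall()
    for rowid, customer_id, name, birth_date in rows:
        cid, bdate = to_int(customer_id), to_date(birth_date)
        if cid is None or bdate is None:
            # Keep the bad record on the Flat table, marked as an error.
            conn.execute("UPDATE Customer_Flat SET is_error = 1 WHERE rowid = ?", (rowid,))
        else:
            conn.execute(
                "INSERT INTO Customer_STG (customer_id, name, birth_date, batch_id) "
                "VALUES (?, ?, ?, ?)",
                (cid, name.strip(), bdate, batch_id),
            )
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customer_Flat (customer_id TEXT, name TEXT, birth_date TEXT,
                                batch_id INTEGER, is_error INTEGER DEFAULT 0);
    CREATE TABLE Customer_STG  (customer_id INTEGER, name TEXT, birth_date TEXT,
                                batch_id INTEGER);
""")
conn.executemany(
    "INSERT INTO Customer_Flat (customer_id, name, birth_date, batch_id) VALUES (?, ?, ?, ?)",
    [("1001", " Alice ", "1980-02-29", 1),   # valid row
     ("10x2", "Bob", "1975-07-01", 1),       # customer_id is not an integer
     ("1003", "Carol", "1990-13-01", 1)],    # month 13 is not a valid date
)
load_stage(conn, batch_id=1)
```

Only the first sample row reaches Customer_STG; the other two stay on Customer_Flat with is_error set, so nothing is lost and every row still carries its BatchID.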

Note:
All data is tagged with a BatchID so that we can trace it back to the source files.
I will describe the BatchID methods, and how they work for me, in a later blog post.

Dim and Fact load Method

Now we have the data in a structured, typed database, ready for insertion into the DWA/ODS.
First we populate all the dimension tables (handled as type 1, 2, 3 etc. slowly changing dimensions),
then the fact table, by selecting the relevant key from each dimension and inserting it into the fact table. It’s a simple formula: always work from the outside in, meaning the dimensions first and then the fact table. See picture.

Inner_Outer Tables


In some cases you are unable to get the actual dimension key. Then use a default (e.g. -1) and still insert the fact record. We do not want to leave records out of the fact table because of missing dimension values, and you can always update the fact later with the correct key. This often happens when an MDM lookup value is missing because a business process is still waiting to define it for IT. Do not lose the data: set a default and inform the business of the missing values as soon as possible. Create BI alert reports for this.
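
The "dimensions first, then the fact, with a default key" rule above could look something like this. It is only a sketch: dimensions are plain Python dicts mapping a natural key to a surrogate key, the keys come from a simple counter rather than a proper key generator, and all names and sample values (UNKNOWN_KEY, load_dimension, load_fact, product code P9) are made up for illustration.

```python
UNKNOWN_KEY = -1  # default surrogate key used when the dimension value is missing

def load_dimension(dim, natural_keys, next_key):
    """Add unseen natural keys to dim (natural key -> surrogate key).

    Returns the next free surrogate key.
    """
    for nk in natural_keys:
        if nk not in dim:
            dim[nk] = next_key
            next_key += 1
    return next_key

def load_fact(stg_rows, customer_dim, product_dim):
    """Swap natural keys for surrogate keys and build the fact rows.

    Rows with a missing dimension value are still loaded, using UNKNOWN_KEY,
    and are also returned separately so they can feed an alert report.
    """
    facts, missing = [], []
    for row in stg_rows:
        ckey = customer_dim.get(row["customer_id"], UNKNOWN_KEY)
        pkey = product_dim.get(row["product_code"], UNKNOWN_KEY)
        if UNKNOWN_KEY in (ckey, pkey):
            missing.append(row)
        facts.append({"customer_key": ckey, "product_key": pkey,
                      "amount": row["amount"], "batch_id": row["batch_id"]})
    return facts, missing

# Step 1: dimensions (outer tables) first.
customer_dim, product_dim = {}, {}
load_dimension(customer_dim, [1001, 1002], next_key=1)
load_dimension(product_dim, ["P1"], next_key=1)

# Step 2: then the fact (inner table). Product P9 has no dimension row yet.
stg_sales = [
    {"customer_id": 1001, "product_code": "P1", "amount": 50.0, "batch_id": 7},
    {"customer_id": 1002, "product_code": "P9", "amount": 20.0, "batch_id": 7},
]
facts, missing = load_fact(stg_sales, customer_dim, product_dim)
```

Both sales end up in facts; the second one carries product_key -1 and also appears in missing, which is the list you would report to the business.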

Also remember to use your own key generator to create the dimension surrogate keys when populating data.

Why two (2) separate databases for the staging and the actual ODS/DWA?

  • Data can be loaded and processed (Load steps) on stage without impacting the ODS database.
  • Data can be left on stage and processed into the ODS later, when the load on the ODS is minimal, with minimum impact on users. Remember that the ODS/DWA belongs to the business, not IT.
  • Database backups can be done separately, again avoiding impact.
  • You can store the databases on different servers/drives for speed, redundancy etc. (a technical consideration).
  • In some cases you have multiple countries (businesses) and data is missing for one of them. If the business requires it, loads must then be held back from the ODS until all data is in stage. A separate stage database helps with that, and you can also process each country separately, depending on the requirements.
  • One set of ETL standards can be implemented on the stage and a different set on the ODS/DWA.
  • I have listed a few and can think of more, but now it is your turn.
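
The "hold the ODS load until every country is in stage" point from the list above can be expressed as a small gate check. This is a hypothetical sketch: the function name and the country codes are invented, and a real check would more likely query the stage database per BatchID.

```python
def ready_for_ods(required_countries, staged_countries):
    """Return (ok, missing): ok is True only when every required country has staged data."""
    missing = sorted(set(required_countries) - set(staged_countries))
    return (not missing, missing)

# ZAF has not delivered its files yet, so the ODS load is held back.
ok, missing = ready_for_ods(["USA", "ZAF", "GBR"], ["USA", "GBR"])
```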

Why split your staging tables into schemas?

  • Some databases allow you to place each partition on a separate drive (gaining speed on disk reads).
  • You can easily identify the country/region for which you load data simply by looking at the table name.
  • More control over your data loads and technical architecture.
  • In some databases you can even back up a single schema.
  • You may have your own list, so share it with us.

I know there are negatives to both the separate stage database and the split by schema; that can be another discussion later. Write them in the comment section and I will list them in this article, and we can start that blog post.

{Views and opinions on this blog do not reflect the view(s) of current/past employers.}

Data Warehouse – Surrogate keys and Foreign Keys in a Dimensional Data Model

Dictionary: (http://www.agiledata.org/essays/keys.html)

  • Key. A key is one or more data attributes that uniquely identify an entity.  In a physical database a key would be formed of one or more table columns whose value(s) uniquely identifies a row within a relational table.
  • Composite key.  A key that is composed of two or more attributes.
  • Natural key.  A key that is formed of attributes that already exist in the real world.  For example, U.S. citizens are issued a Social Security Number (SSN)  that is unique to them (this isn’t guaranteed to be true, but it’s pretty darn close in practice).  SSN could be used as a natural key, assuming privacy laws allow it, for a Person entity (assuming the scope of your organization is limited to the U.S.).
  • Surrogate key.  A key with no business meaning.
  • Candidate key.  An entity type in a logical data model will have zero or more candidate keys, also referred to simply as unique identifiers (note: some people don’t believe in identifying candidate keys in LDMs, so there’s no hard and fast rules).  For example, if we only interact with American citizens then SSN is one candidate key for the Person entity type and the combination of name and phone number (assuming the combination is unique) is potentially a second candidate key.  Both of these keys are called candidate keys because they are candidates to be chosen as the primary key, an alternate key  or perhaps not even a key at all within a physical data model.
  • Primary key.  The preferred key for an entity type.
  • Alternate key. Also known as a secondary key, is another unique identifier of a row within a table.
  • Foreign key. One or more attributes in an entity type that represents a key, either primary or secondary, in another entity type.

Surrogate and foreign keys are used to link data between fact and dimension tables, and in some cases between one dimension and another.

Surrogate Keys:

You will find quite a few synonyms for a surrogate key: meaningless keys, integer keys, non-natural keys, artificial keys, synthetic keys, link keys etc.
I propose creating a process to generate these keys during load. These keys are numeric in type with absolutely no meaning. Keep these keys small in order to optimise the retrieval of records between tables.
Create your own algorithm (e.g. based on a GUID from a system) that always creates unique keys across platforms.
What do I mean by more than one platform? Examine the following:
Your company is now global, with a data warehouse in the US and one in Africa, both using the same data model. Now you are asked: can we see customers globally?
You need to combine the data into one global warehouse. This will not work if the same surrogate key values exist in both systems, so ensure they are unique between systems. Using the auto number on a table works for one database, but not when merging more than one database, unless you specified a seed on each database from day one. But what would that seed be?
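
One possible way to get keys that stay unique across warehouses while still being plain integers is to reserve the high bits of the key for a warehouse ID and use the low bits for a local sequence. This is only an illustrative sketch, not a recommendation from the text above: the class name, the 40-bit split and the warehouse IDs are all assumptions. A real generator would also have to persist its last-used sequence between loads. A GUID (for example Python's uuid.uuid4()) is the other obvious route, at the cost of a much larger key.

```python
import itertools

SEQUENCE_BITS = 40  # assumption: 2**40 (about 1.1 trillion) keys per warehouse

class SurrogateKeyGenerator:
    """Hands out integer keys that cannot collide across warehouses."""

    def __init__(self, warehouse_id, start=1):
        # Keep the combined key inside a signed 64-bit integer.
        if not 0 <= warehouse_id < 2 ** (63 - SEQUENCE_BITS):
            raise ValueError("warehouse_id out of range")
        self.warehouse_id = warehouse_id
        self._sequence = itertools.count(start)

    def next_key(self):
        """Return the next key: warehouse ID in the high bits, sequence in the low bits."""
        seq = next(self._sequence)
        if seq >= 2 ** SEQUENCE_BITS:
            raise OverflowError("sequence exhausted for this warehouse")
        return (self.warehouse_id << SEQUENCE_BITS) | seq

us = SurrogateKeyGenerator(warehouse_id=1)
africa = SurrogateKeyGenerator(warehouse_id=2)
us_keys = [us.next_key() for _ in range(3)]
africa_keys = [africa.next_key() for _ in range(3)]
```

Because the warehouse ID is part of every key, the two key ranges never overlap, and shifting a key right by SEQUENCE_BITS tells you which warehouse created it.
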
As per Ralph Kimball on surrogate keys: “One of the primary benefits of surrogate keys is that they buffer the data warehouse environment from operational changes.” So what is he saying? Imagine you have used the product code as the key and the operational system re-uses product code 1. What do you now do with the old data for that code?
So do not use business-bound, soft-coded values (like a product code or CIF number) as a key; this will become a major flaw in your design.

Value of surrogate keys:

  • They enable ETL updates to handle slowly changing dimensions (separate blog entry).
  • They bind the tables together in the dimensional model.
  • The surrogate key can also be the primary key (unique key) on the table.
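
To make the product code example concrete, here is a small sketch of a product dimension where the operational system re-uses product code 1. The product descriptions, dates and function names are invented for illustration, and the dimension is just a Python list rather than a real table.

```python
from datetime import date

product_dim = []  # rows: surrogate key, natural product code, description, validity dates

def add_product(product_key, product_code, description, valid_from):
    """Close the current row for this product code (if any) and add a new row."""
    for row in product_dim:
        if row["product_code"] == product_code and row["valid_to"] is None:
            row["valid_to"] = valid_from
    product_dim.append({"product_key": product_key, "product_code": product_code,
                        "description": description,
                        "valid_from": valid_from, "valid_to": None})

def describe(fact_row):
    """Join a fact row to its dimension row through the surrogate key."""
    dim_row = next(r for r in product_dim if r["product_key"] == fact_row["product_key"])
    return dim_row["description"]

add_product(1, "1", "Garden hose", date(2010, 1, 1))
sales_fact = [{"product_key": 1, "amount": 30.0}]       # old sale points at key 1

# The operational system now re-uses product code "1" for a different product.
add_product(2, "1", "Phone charger", date(2014, 6, 1))
sales_fact.append({"product_key": 2, "amount": 12.0})   # new sale points at key 2
```

The old sale still resolves to the garden hose and the new one to the phone charger, even though both carry product code 1 in the source system. Had the product code itself been the key, the two could not be told apart.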

More reading: http://en.wikipedia.org/wiki/Surrogate_key

Foreign Key

A foreign key is a key stored in the fact and/or the dimension table that links to the foreign table. For example, the Customer Key is stored as a foreign key in the Transaction Fact table so that you can join the fact table to the customer dimension.

Primary Keys (AKA Unique Key)

The primary key is generated on a table and is unique within that table only; it can also be the surrogate key.
Normally this is enforced as a constraint in your database to ensure uniqueness.
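
The primary key and foreign key behaviour described above can be tried out with a small in-memory SQLite database. This is a sketch under assumptions: the table and column names (Customer_Dim, Transaction_Fact, customer_key) are illustrative, and SQLite stands in for whatever platform you actually use. Note that SQLite only enforces foreign keys once PRAGMA foreign_keys is switched on.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript("""
    CREATE TABLE Customer_Dim (
        customer_key INTEGER PRIMARY KEY,   -- surrogate key doubling as primary key
        customer_id  TEXT NOT NULL          -- natural key from the source system
    );
    CREATE TABLE Transaction_Fact (
        transaction_id INTEGER PRIMARY KEY,
        customer_key   INTEGER NOT NULL REFERENCES Customer_Dim (customer_key),
        amount         REAL NOT NULL
    );
""")

conn.execute("INSERT INTO Customer_Dim VALUES (-1, 'UNKNOWN')")  # default row for missing values
conn.execute("INSERT INTO Customer_Dim VALUES (1, 'C1001')")
conn.execute("INSERT INTO Transaction_Fact VALUES (1, 1, 50.0)")

def try_insert(sql):
    """Run an insert and report whether the database accepted it."""
    try:
        conn.execute(sql)
        return True
    except sqlite3.IntegrityError:
        return False

# Second row with customer_key 1: rejected by the primary key constraint.
duplicate_ok = try_insert("INSERT INTO Customer_Dim VALUES (1, 'C2002')")
# Fact row pointing at a customer_key that does not exist: rejected by the foreign key.
orphan_ok = try_insert("INSERT INTO Transaction_Fact VALUES (2, 99, 10.0)")
# Fact row using the default key -1: accepted, because the default row exists.
default_ok = try_insert("INSERT INTO Transaction_Fact VALUES (3, -1, 10.0)")
```

One design point this shows: if you enforce foreign keys and also use a default key such as -1 for missing dimension values, the dimension needs an actual row for that default key.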

More reading: http://en.wikipedia.org/wiki/Unique_key
