What is Big Data?

Several definitions exist; I have included some links and comments from several web sources.


“Big data is being generated by everything around us at all times. Every digital process and social media exchange produces it. Systems, sensors and mobile devices transmit it. Big data is arriving from multiple sources at an alarming velocity, volume and variety. To extract meaningful value from big data, you need optimal processing power, analytics capabilities and skills”



“Big data is an evolving term that describes any voluminous amount of structured, semi-structured and unstructured data that has the potential to be mined for information”



“Big data is a term that describes the large volume of data – both structured and unstructured – that inundates a business on a day-to-day basis. But it’s not the amount of data that’s important. It’s what organizations do with the data that matters. Big data can be analyzed for insights that lead to better decisions and strategic business moves.”



“Big data is a collection of data from traditional and digital sources inside and outside your company that represents a source for ongoing discovery and analysis ….”



“Big data is a buzzword, or catch-phrase, meaning a massive volume of both structured and unstructured data that is so large it is difficult to process using traditional database and software techniques. In most enterprise scenarios the volume of data is too big or it moves too fast or it exceeds current processing capacity”

Berkeley School of Information

“Big data.” It seems like the phrase is everywhere. The term was added to the Oxford English Dictionary in 2013 and appeared in Merriam-Webster’s Collegiate Dictionary in 2014. Now, Gartner’s just-released 2014 Hype Cycle shows “big data” passing the “peak of inflated expectations” and moving on its way down into the “trough of disillusionment.” Big data is all the rage. But what does it actually mean?

A commonly repeated definition cites the three Vs: volume, velocity, and variety. But others argue that it’s not the size of data that counts, but the tools being used or the insights that can be drawn from a dataset.”


My opinion:

Big data is structured and unstructured data generated by machines, AI, humans, and systems – essentially any object that can generate any form of data/information – where the data has volume, velocity, and variety. This data can be stored in a data store using computer-based tools.







This is the start of my journey with Hadoop and most of the tools used to access, control, monitor, edit, modify and read data, etc. I will try to share this journey with you, but please note it will often be very technical. I will make assumptions based on my reading, and this is my view of how it works; if it is incorrect, post a comment and I will try to get you the correct information.

This is a big world using big data (no pun intended), and thus a confusing one.
I will try to simplify it for everyone reading this blog – including making it simple for myself; I like simplicity.

By no means am I an expert on this; I am just learning and sharing my learnings with you. You are welcome to follow my journey and make it your own.

My first source of information comes from hortonworks.com (http://hortonworks.com/). I am currently in the process of completing the online self-paced training they provide – it’s good training – go for it and enjoy it.
Hortonworks University Self-Paced Learning Library – http://hortonworks.com/training/self-paced-learning-library/
At this stage there are over 250 modules to complete, and I am aware that they will be growing this soon, introducing more modules on their platform, HDP. I will jump around, doing labs with you, installing the software, assisting with configuration, etc. Let’s see how it goes; as you can imagine, this is going to be a lot of work, so this journey will span a few months.

Also note: set up your own VM and test, play, and enjoy.

Just a note: I am not at all associated with Hortonworks, nor they with me. As indicated, they are my starting point to learn and develop big data solutions. Please note I only use them as a reference for information, and I will be using terms that they might use or own; I will also try to explain how their technology fits together as I understand it.

Let’s start as follows.







Data warehouse – Loading data into a dimensional model – High Level – simplified.


This is a big topic and I know that there are several ways of doing this.
I am going to discuss how I like to do it.

We split the environment into Stage and DWA/ODS tables on the actual database platform. This can be any database, e.g. Oracle, SQL Server, or even Teradata.
Two separate databases – I will tell you later why.
Also make it a standard to create the tables within schemas, e.g. USA.Customer_Flat and USA.Customer_STG, so that if the business grows you can support multiple countries (or even regions, depending on your business). I will list the positives of splitting by schema later.

Stage contains two sets of tables for each incoming file (source text file etc.).
Let’s say the file name is Customer; then you will have Customer_Flat and Customer_STG, where the Flat table consists only of string fields with no type definitions, while the STG table has proper type definitions.
We apply type-check rules on the Flat table and either update each record to the correct type or mark it as an error. In some cases we even do MDM (Master Data Management) lookups and update the columns. MDM is important when you work across countries, businesses, operational systems, etc.; work with it from day one in your project, because later it will be more complex and cost you more.
We will discuss the MDM definition later on the blog; for now, see it as lookup values.

The next step after this is to insert the correct values into the STG table. This is a select-into from the Flat table, because all types are now fixed. (Create one auto-generation script at database level for this, using the table definitions, so that you never have to do it by hand again; I will blog about how to do this later.)
Finally, run the remaining rules and data-change checks on the STG table, preparing it for the next insert into the ODS/DWA fact or dimension tables.
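
To make this concrete, here is a minimal sketch of the type-check and promote steps. I am assuming SQL Server syntax, and the CustomerCode, Surname, BirthDate and ErrorFlag columns are purely illustrative:

    -- Type-check the Flat table: flag rows whose BirthDate string cannot
    -- be converted to a date, instead of silently losing them.
    UPDATE USA.Customer_Flat
    SET    ErrorFlag = 1
    WHERE  TRY_CONVERT(date, BirthDate) IS NULL;

    -- Promote the clean rows into the typed STG table.
    INSERT INTO USA.Customer_STG (CustomerCode, Surname, BirthDate, BatchID)
    SELECT CustomerCode,
           Surname,
           CONVERT(date, BirthDate),
           BatchID
    FROM   USA.Customer_Flat
    WHERE  ErrorFlag = 0;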

Note:
All data is tagged with a BatchID so that we can track it back to the source files.
I will describe the BatchID method, and how it works for me, in a later blog.

Dim and Fact Load Method

Now we have the data in a structured database, with proper types, ready for insertion into the DWA/ODS.
First we populate all the dimension table(s) – these can be type 1/2/3 etc. – and then the fact table, by selecting the relevant key from each dimension and inserting it into the fact table. It’s a simple formula: always work from the outside inwards, meaning the dimensions first, then the fact table. See picture.

Inner_Outer Tables

In some cases, if you are unable to get the actual Dim key, use a default (e.g. -1) and insert the fact record anyway; we do not want to leave records out of the fact table due to missing dimension values, and you can always update the fact later with the correct key. This happens often when an MDM lookup value is missing because a business process has yet to define it for IT. Do not lose the data: set a default, inform the business of the missing values ASAP, and create BI alert reports for this.
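
A minimal sketch of that fact load, assuming hypothetical USA.Transaction_STG, USA.Customer_Dim and USA.Transaction_Fact tables:

    -- Load the fact from staging, resolving the customer surrogate key.
    -- Rows with no matching dimension member get the default key -1 so
    -- that no fact records are lost; they can be re-keyed later.
    INSERT INTO USA.Transaction_Fact (CustomerKey, TransactionAmt, BatchID)
    SELECT COALESCE(d.CustomerKey, -1),
           s.TransactionAmt,
           s.BatchID
    FROM   USA.Transaction_STG AS s
           LEFT JOIN USA.Customer_Dim AS d
                  ON d.CustomerCode = s.CustomerCode
                 AND d.Active = 1;   -- current dimension row only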

Also remember to use your own key generator when you create the dim surrogate keys during population.

Why two (2) separate databases for staging and the actual ODS/DWA?

  • Data can be loaded and processed (load steps) on stage without impacting the ODS database.
  • Data can be left on stage and processed into the ODS later, when ODS load is minimal, with minimum impact to users. Remember that the ODS/DWA belongs to the business, not IT.
  • Database backups can be done separately, again avoiding impact.
  • You can store the databases on different servers/drives for speed, redundancy, etc. – a technical consideration.
  • In some cases you have multiple countries (businesses); if data is missing for one, loads must be stopped and nothing loaded to the ODS until all data is in stage (per business requirement). This split helps with that, and you can process each separately depending on the requirements.
  • One ETL standard can be implemented on the stage and a different set on the ODS/DWA.
  • I could list a few more, but it’s your turn.

Splitting your staging into schemas for each table – why?

  • Some databases allow you to place each partition on a separate drive (gaining disk-read speed); see the partitioning sketch in the physical model post below.
  • You can easily identify the country/region for which you are loading data simply by looking at the table name.
  • More control over your data loads and technical architecture.
  • In some DBs you can even back up a schema on its own.
  • You may have your own list, so share it with us.

I know there are negatives to both the splitting of the stage and the schemas; that can be another discussion later. Write them in the comments section, then I will list them in this blog article and we can start that blog post.

{Views and opinions on this blog do not reflect current/past employers’ view(s).}

Data Warehouse – Slowly changing dimensions (High Level) in a dimensional model

Slowly changing dimensions (High Level) on a Dimensional model

So why do we need changes on a dimension at all? Let’s take a real-world example: we have a dimension (table) with customer information, where one of the columns (fields) is named Customer Surname and another Marital Status. As we all know, people grow older and things change; some of us get married, so our marital status changes from single to having a partner. We surely need to see this change in the data, and in some cases the person even takes the surname of the partner.

We could simply have overwritten the data, but then how would we see how many people were single at any point in time, and what a person’s surname was before? That is one example where you would like to see the history of the dimension. There are also places where we simply update the existing record: it may be the age of the person, which does not need to be tracked in the dimension’s history (if we have the birth date), so we simply overwrite the value.
Then there is one more possibility that I know of (though I have not used it myself): from the Ralph Kimball books, a type 3, where an additional column displays the previous value, thus keeping the join to history as well as going forward.

OK, now the breakdown and properties of all the above-mentioned types:

SCD Types

  • Type 1
    Overwrite the existing attribute.
    No history.
    Update by business key.

    Day 1
    Key  Age  DOB
    1    12   1978/10/12

    Day 2
    Key  Age  DOB
    1    13   1978/10/12
  • Type 2
    Insert new records.
    Update the new record’s open date – my preference, to make tracking easy. (A SQL sketch of the close-and-insert mechanics follows these type descriptions.)
    NB: The open-date indicator is a system column, not related to the data’s data; you may also use a version number.

    Key  Surname         M Status  Close date
    1    Small           Single    01/01/2005
    2    Small           Married   01/02/2012
    3    Van den Heever  Married   01/03/2012
  • Type 3
    Allows us to associate existing fact table data with a different view of the data, like an alternative hierarchy, depending on how you select the report.

    Product  Description  Current Department  History Department
    1        Games        Children            Children
    1        Games        Adult               Children
    1        Games        Retired             Adult
  • Type 4
    This is simply keeping history in a different table, with a date stamp for when the record was inserted into the history table.
  • Type 6/Hybrid
    Insert a new record and update fields on the existing record before closing it.
    Update the existing record’s active indicator – my preference, to make tracking easy. NB: The active indicator is a system column, not related to the data’s data. Active can also be called Current Record.
    Update the existing record’s close date – my preference, to make tracking easy. Insert a default date far into the future when there is no date (it makes MIS simpler). NB: The close date is a system column, not related to the data’s data.
    The inserted new record must carry the closed record’s ID (key) – my preference, to make tracking easy. NB: The closed record’s ID (key) is a system column, not related to the data’s data.

    Key  Surname         M Status  Open date   Close date  Active  Prev Key
    1    Small           Single    01/01/2005  01/02/2012  False   -1
    2    Small           Married   01/02/2012  01/03/2012  False   1
    3    Van den Heever  Married   01/03/2012  01/12/2999  True    2
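
For the type 2 pattern above, a minimal close-and-insert sketch (SQL Server syntax; the table and column names are illustrative assumptions, and the surrogate key is assumed to come from an identity or sequence):

    DECLARE @LoadDate date = '2012-02-01';   -- sample load date

    -- Close the current row for customers whose tracked attributes changed.
    UPDATE d
    SET    d.CloseDate = @LoadDate,
           d.Active    = 0
    FROM   USA.Customer_Dim AS d
           JOIN USA.Customer_STG AS s
             ON s.CustomerCode = d.CustomerCode
    WHERE  d.Active = 1
      AND (s.Surname <> d.Surname OR s.MaritalStatus <> d.MaritalStatus);

    -- Insert the new version: any customer without an open row (changed
    -- or brand new) gets a fresh record with a far-future close date.
    INSERT INTO USA.Customer_Dim
           (CustomerCode, Surname, MaritalStatus, OpenDate, CloseDate, Active)
    SELECT s.CustomerCode, s.Surname, s.MaritalStatus,
           @LoadDate, '2999-12-01', 1
    FROM   USA.Customer_STG AS s
    WHERE  NOT EXISTS (SELECT 1
                       FROM   USA.Customer_Dim AS d
                       WHERE  d.CustomerCode = s.CustomerCode
                         AND  d.Active = 1);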

These are all slowly changing attributes; when dealing with rapidly changing attributes, we need to consider other alternatives – I will go into detail on those in later blog posts.
As you can see, you can mix different types on one table, but I would suggest keeping it simple: your goal at the end of the day is to ensure that the end user can use the table and that it is all fast!

Fast for reads and inserts: do not overdo the SCD types or you will entangle yourself in a spaghetti bowl once you need to make changes to any ETL process.

For more details, have a look at The Data Warehouse Toolkit from the Kimball Group.

For more types and more advanced stuff, stay tuned and read up on the web.


Remember, Google is your friend when searching. Make an informed decision when you create your model; do not say, for example, that everything must be SCD type 2 – this can be detrimental to your end goal.

Did you notice we are missing a type 5?

A funny thing: we asked Ralph why he called them type 1, 2, 3 … changes. His answer: “We could not think of a name and the industry started to use it” lol …

{Views and opinions on this blog do not reflect current/past employers’ view(s).}

Data Warehouse – Surrogate keys and Foreign Keys in a Dimensional Data Model

Dictionary: (http://www.agiledata.org/essays/keys.html)

  • Key. A key is one or more data attributes that uniquely identify an entity.  In a physical database a key would be formed of one or more table columns whose value(s) uniquely identifies a row within a relational table.
  • Composite key.  A key that is composed of two or more attributes.
  • Natural key.  A key that is formed of attributes that already exist in the real world.  For example, U.S. citizens are issued a Social Security Number (SSN)  that is unique to them (this isn’t guaranteed to be true, but it’s pretty darn close in practice).  SSN could be used as a natural key, assuming privacy laws allow it, for a Person entity (assuming the scope of your organization is limited to the U.S.).
  • Surrogate key.  A key with no business meaning.
  • Candidate key.  An entity type in a logical data model will have zero or more candidate keys, also referred to simply as unique identifiers (note: some people don’t believe in identifying candidate keys in LDMs, so there’s no hard and fast rules).  For example, if we only interact with American citizens then SSN is one candidate key for the Person entity type and the combination of name and phone number (assuming the combination is unique) is potentially a second candidate key.  Both of these keys are called candidate keys because they are candidates to be chosen as the primary key, an alternate key  or perhaps not even a key at all within a physical data model.
  • Primary key.  The preferred key for an entity type.
  • Alternate key. Also known as a secondary key, is another unique identifier of a row within a table.
  • Foreign key. One or more attributes in an entity type that represents a key, either primary or secondary, in another entity type.

Surrogate and foreign keys are used to link data between fact and dimension tables, and in some cases between dimension and dimension.

Surrogate Keys:

You will find quite a few synonyms for a surrogate key: meaningless keys, integer keys, non-natural keys, artificial keys, synthetic keys, link keys, etc.
I propose creating a process to generate these keys during load. These keys are numeric in type with absolutely no meaning. Keep these keys small in order to optimise the retrieval of records between tables.
Create your own algorithm (e.g. using a GUID from a system) that always creates unique keys across platforms.
What do I mean by “across more than one platform”? Examine the following:
your company is now global, with a data warehouse in the US and one in Africa sharing the same data model, and now you are asked: we need to see customers globally.
You need to combine the data into one global warehouse. This will not work if your surrogate keys overlap, so ensure they are unique between systems. Using the auto-number on a table will only work for one database, not when merging more than one database, unless you specified a seed on each database from day one – but what would that seed be? (A sketch of one seeding approach follows the list below.)
As per Ralph Kimball, on surrogate keys: “One of the primary benefits of surrogate keys is that they buffer the data warehouse environment for operational changes.” OK, so what is he saying? Imagine you had used the product code as the key and the operational system re-uses product code 1: what do you now do with the rest of the old data?
So do not use business-bound, soft-coded values (like a product code or CIF number) as a key; this will become a major flaw in your design.
The value of surrogate keys:

  • They enable ETL updates for slowly changing dimensions (separate blog entry).
  • They bind tables together in the dimensional model.
  • This key can also be the primary key (unique key) on the table.
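
As a minimal sketch of the seeding idea (SQL Server 2012+ syntax; the names and ranges are illustrative assumptions), each regional database can draw from its own non-overlapping range so that keys never collide on a merge:

    CREATE SEQUENCE dbo.CustomerKeySeq
        AS bigint
        START WITH 1000000000   -- e.g. the Africa instance; the US starts at 1
        INCREMENT BY 1;

    -- Use the sequence when inserting new dimension members.
    INSERT INTO USA.Customer_Dim (CustomerKey, CustomerCode, Surname)
    SELECT NEXT VALUE FOR dbo.CustomerKeySeq, s.CustomerCode, s.Surname
    FROM   USA.Customer_STG AS s;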

More reading: http://en.wikipedia.org/wiki/Surrogate_key

Foreign Key

This is a key stored in the fact and/or the dimension that links to the foreign table; for example, you have the customer key as a foreign key in the transaction fact table in order to join the fact table to the customer.
Primary Keys (AKA Unique Keys)

A key generated on a table that is unique within that table; this can also be the surrogate key.
Normally this is a constraint on your database, in order to ensure uniqueness.
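
A minimal sketch of both constraints (SQL Server syntax, hypothetical names):

    -- Primary key: the surrogate key uniquely identifies each dim row.
    ALTER TABLE USA.Customer_Dim
        ADD CONSTRAINT PK_Customer_Dim PRIMARY KEY (CustomerKey);

    -- Foreign key: each fact row must point at an existing dim row.
    ALTER TABLE USA.Transaction_Fact
        ADD CONSTRAINT FK_Fact_Customer
        FOREIGN KEY (CustomerKey) REFERENCES USA.Customer_Dim (CustomerKey);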

More reading: http://en.wikipedia.org/wiki/Unique_key

{Views and opinions on this blog do not reflect current/past employers’ view(s).}

Data Warehouse – Fact and Dimension Tables in a dimensional model

Fact and Dimension Tables (Ralph Kimball) in a dimensional model
OK, now that we know we get logical and physical models in a data warehouse, let’s go into some of the content that goes into the actual tables residing in the data warehouse.
I will discuss Fact and dimension tables on a high level as I know you can find hundreds of books and articles on the web describing it in much more detail. One of the books I used for this summary is a book written by Ralph Kimball (The Data Warehouse Toolkit).
Remember that the tables are not platform-specific, i.e. they are not tied to only Oracle, SQL Server, or Teradata.
It makes a lot of sense to attend his master class on this subject.

Dimension (Dim) Table

A dim table contains the textual description of an entity (subject area).
OK, saying that does not really make sense now, does it?

Let’s do this by example, taking the banking industry as our industry.
A typical dimension would be a customer, with surname, first name, region, country and address – that is, no factual data about the customer.
The customer dim table will be linked to a fact table by a key (I will chat about keys in a later blog).
A dimension table can be used to group data; for example, we can group all the customers by region and then link them to a fact table to get accounts by region.
This table should contain the best description possible for any attribute; you can store the short description next to it, but the more descriptive the data, the better for use, and the more the users will like you for it.
Build one dim (e.g. Customer) and re-use its key in all fact tables; avoid having the same dim table with different attributes all over the database. If you need one more description, add it to the existing table; do not create another table.

Thus a Dim table consists of the following attributes

  • Textual, descriptive information on the subject.
  • Can be used to group data.
  • Links to fact tables with keys.
  • Can have more than one attribute of the subject; in fact, up to a few hundred attributes can exist.
  • Data can be used for reporting labels, e.g. a region description.
  • Descriptive data can be sourced from the master data repository (MDM).
  • A lengthy description should be provided, where possible, for any short description (key); source it from MDM if possible.
  • The table can define a hierarchy within itself (a self-referencing table; a quick sketch follows this list) or hold a flat hierarchy with columns such as Region and Country.
  • Highly denormalised (later blog).
  • Companion table to a fact table.
  • Only one row per business key, i.e. one customer cannot have two names.
  • Dimension data can change as follows (will go into more detail on the blog):
    1. Type One – Simply overwrite the old value(s).
    2. Type Two – Add a new row containing the new value.
    3. Type Three – Add a new attribute to the existing row.
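
As promised in the hierarchy bullet above, here is a quick sketch of a self-referencing dimension table (hypothetical names, SQL Server syntax):

    -- Each region points at its parent region through ParentRegionKey
    -- (NULL marks the top of the tree).
    CREATE TABLE USA.Region_Dim
    (
        RegionKey       int          NOT NULL PRIMARY KEY,
        RegionName      varchar(100) NOT NULL,
        ParentRegionKey int          NULL
            REFERENCES USA.Region_Dim (RegionKey)
    );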

Fact Tables:

A fact table typically consists of only numerical data that you can join to the dim tables by key.
Typically this table is deep and narrow (not lots of columns), with millions of rows.
Let’s go back to the customer in the banking industry: we would now like to see all transactions for the customer, so we create a fact table with transactions in it.
This transaction table will simply have a transaction amount and a customer key.
Joining the customer dim to this table will show us how many transactions each customer made and what the values of those transactions were; even better, we can now see it by region, thanks to the data in the dimension table.
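
A minimal sketch of that star join, with hypothetical table and column names:

    -- Transactions per region: the fact carries the measures, the
    -- dimension carries the grouping labels.
    SELECT d.Region,
           COUNT(*)              AS TransactionCount,
           SUM(f.TransactionAmt) AS TotalAmt
    FROM   USA.Transaction_Fact AS f
           JOIN USA.Customer_Dim AS d
             ON d.CustomerKey = f.CustomerKey
    GROUP BY d.Region;
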
Thus a Fact table consists of the following attributes

  • Typically only numerical data.
  • Narrow, with millions of rows (deep).
  • Sums and calculations can be performed on the attributes (additive, non-additive, semi-additive).
  • Must have at least one link to a dim table in order to do MIS.
  • It is at the centre of a star schema.
  • The grain is important for each fact table (define it before you create the table).
  • Types of fact tables: transactional, periodic snapshots, accumulating snapshots (later blog on this).

Sample of both.

Logical and physical model sample

Kimball, Ralph. The Data Warehouse Lifecycle Toolkit, Second Edition. Wiley Publishing, Inc.

{Views and opinions on this blog do not reflect current/past employers’ view(s).}

Data Warehouse – Physical (PDM) and Logical Model (LDM) – Why?

Data Warehousing – Why a physical (PDM) and a logical model (LDM)?

I have asked myself this question a few times; that is when I realised both are needed, for several reasons.

1) Logical Model
Reason for having it:

  • Shows a non-technical person what is in a database, with attributes that have normal names, e.g. FirstName rather than FNAME. That said, do not show types to the user; there is no need, as they will see those on the physical model.
  • Shows relationships between information (tables) in a logical manner, e.g. “a company can have multiple staff members” rather than “one Co_Company to many EMP_Staff”.
  • Shows how information is shared.
  • A building block for implementing the physical structures.
  • Assists in impact analysis if changes are to be made to the database.
  • Prompts the business user to think a bit more before signing off the technical solution, rather than simply playing the blame game later. Let’s not do that. I know it seems like a waste of time and money to do the logical model first. Yes, I am a technical person first and a “documenter” second, and it is “easy” to do a physical model in minutes with all the new database tools like SQL Server; it’s click-click and you are done these days, not?

Now imagine the user asks for a change on one table: what is the impact? And it is hard to explain even a simple physical data model to a non-technical person without the logical model. I have tried a few times, and they always seem to know what is going on – until we implement the solution, and then we start all over again.
So avoid this: do the logical model first, and answer the questions that need answering on paper, with a logical model. Remember, without the business there would be no IT or work to be done, so the business need comes first. The simpler the data format, the better the results for the business.

See more:

2) Physical Model

Reason for having it:

  • Not to state the obvious, but with no physical model there is no data; it is where you store it all.
  • To optimise the actual data use, e.g. store one table’s data across several hard drives for fast retrieval (a partitioning sketch follows this list).
  • To split the actual subject areas into several tables with joins, in order to make every table as effective and speedy as possible.
  • All fields are data-type specified and optimised for the selected database platform. You need to take really good care of this, or it will bite you later with regard to storage, speed, etc.
  • Re-doing it will take time and a lot of money, depending on the usage in your business.
  • Surrogate and lookup keys split data and relationships into tables (will go into this in a future blog).
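
A minimal sketch of the drive-splitting bullet above (SQL Server syntax; the partition boundary, the FG_Drive1/FG_Drive2 filegroups and the columns are illustrative assumptions, and the filegroups must already exist on separate drives):

    CREATE PARTITION FUNCTION pf_ByYear (int)
        AS RANGE RIGHT FOR VALUES (2013);        -- split point

    CREATE PARTITION SCHEME ps_ByYear
        AS PARTITION pf_ByYear TO (FG_Drive1, FG_Drive2);

    -- Rows land on different drives depending on TransactionYear.
    CREATE TABLE USA.Transaction_Fact_P
    (
        TransactionYear int NOT NULL,
        CustomerKey     int NOT NULL,
        TransactionAmt  money
    ) ON ps_ByYear (TransactionYear);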

This is my favourite part: getting down and dirty in the code, creating the physical data model and loading it so that it loads as fast as possible, and so that the business users can get to the data and use it in its simplest format, a format that must be understandable to them; otherwise, what is the use of the database?
I will blog on how to load data into a star schema etc. a bit later.
Remember, the model must be created in a way that takes the database platform into consideration: use the best techniques for that platform and forget what you know of other platforms. They differ, or we would have had only one platform.

See more:

{Views and opinions on this blog do not reflect current/past employers’ view(s).}

C# – Christian van den Heever’s C# Development High-Level History (Intro).

Generally, I have been coding in C# for the last couple of years on an on-and-off basis. As you can imagine, being employed full-time as a data warehouse lead leaves me limited time during the day – but great coders are born during those hours (after hours) … not?

Though I must say, every opportunity given to me at my current employment is great, and I have coded the following for them.
C# Windows services for a file monitor project: simply two services that take control of data file loading, transporting, etc., making life easy for all ETL processing in the division.
An MDM web portal, using Infragistics web controls and a SQL back-end, to maintain master data for the Africa division of the bank: simple pages controlling data entry via the intranet.
A data take-on portal for monthly stats that are pulled through to Cognos, driving business revenue with nice reports.
Some of these jobs are simple, others a bit more complex with threading and online grids with bound datasets, etc., but I had a great time creating them.
With limited exposure to C# during the day, I have done a lot of after-hours work, even on a temporary basis for customers, over the last decade or so.
Some exciting stuff was delivered using C# – not ground-breaking for a semi-senior developer.
Reply-device integration in C# and SQL Server 2008 for AGM voting, with a full suite of reports; a few top companies in ZA use it. Again, see za.linkedin.com/in/managingitforbusiness and look at the slide presentations on the project. I have sold the source code to an events company.
This was a nice example of API integration with C#, using remote base-station controls.
Windows socket server development with Android devices talking to the socket; some nice networking code here (see my LinkedIn profile for code: za.linkedin.com/in/managingitforbusiness). It used a direct TCP connection to communicate data from the Android device down to a Windows exe and back.
XNA: game development on Xbox with the XNA Creators Club. I started on a poker game – what a challenge, as the AI for poker is difficult. The game was semi-completed, then life happened and I needed to focus on real money-driven projects; we also had to move on, as we were only two people on the project, a graphics designer and myself.
But now I know XNA a bit, with C#.

While I was employed at Fl Media, I had a few months where I was fully occupied with development: CMS system integration with C# and ASP.NET for Biztools.com and HRtools.com, and even some Flash game integration into SQL Server.
Using Ektron with a smile.
It was really nice to focus on development, but then life happened: I was blessed with a second child and could not travel a lot – thus back to the bank with security and career growth. But one never knows when I will get a great opportunity again to do some real development work.

I am going to grow this blog further with more technical content in this C# section.
You are most welcome to help me here: comment on and pick apart the code that I submit. I will love and respect you for it.

Christian van den Heever

{Views and opinions on this blog do not reflect current/past employers’ view(s).}



  • Koos van Wyk
  • Dalene Alberts
  • Saroj Govender
  • Peter Venter
  • Frank Columbo

Contact Details will be provided on request.

CV-Other Details

Other details.

  • Awarded top 70 Management within the Banking group 2006.
  • Awarded Service excellence within the group for project(s) delivered.
  • Promoted from ITD08 to ITD09.