24 May 2010

ETL Design Consideration - Multiple Steps/Stages/Area Required for Different Purposes

Normally we'll need multiple steps/stages/area which can be flat files, dataset or RDBMS to allow us to achieve the followings -
  1. Reusability
  2. Restartablility
  3. Performance
  4. Quality Assurance
  5. Reconciliation
  6. Data Integrity
  7. Change Detection
RDBMS is typically the recommended method of storage of all the above steps/stages/area as the End Stage while datasets are preferred for Intermediate Stage.

The name maybe different but the functions/purposes/needs should be the same -
  1. Staging Area - This is the place where we load the source files into RDBMS. RDBMS is used to allow us to reduce the impact on ETL Jobs when source files change. With this RDBMS, we just need to change the Loading ETL jobs which is simple as it is typically 1 to 1 simple load. It also facilitate in reconciliation, quality assurance and testing as we can make queries in RDBMS compare to flat files.
  2. Working Area - This is the place where we will stored the X-Ref table which is used to standardized all the codes and also all the Intermediate Tables/datasets that are used for ETL to produce the final tables for Holding Area. Source Change Detection will be done here if the Source cannot provide only records that has been change.
  3. Holding Area - This is the final step before moving the data into SOR. This area have the tables totally the same as SOR and normally we will keep only 1 day of data. This area allow us to do the follow quality assurance, reconciliation, data integrity and change detection. It also provide better performance and flexibility for loading as the Holding Area may not need all the indexes that SOR required and it will also prevent locking of tables which is a requirements for LOAD that's not good for SOR if the loading happened in office hours.
  4. SOR - This is where our Industry Models will be. Due to our highly normalize nature of our Industry Model it maybe difficult for user to use and most of the time even for data modeler to understand as our model come with hundreds of entities and thousands of attributes. We will need to build the necessary Views to help simplify the model to allow better usage.
  5. Data Marts - This is normally optional for high performance RDMS like DB2 with DPF or our main competitors Teradata. We will use Views, MQT, Cube Services, MDC etc. to reduce the requirements to build Physical Data Marts. The requirements of Physical Data Marts will also increase efforts and availability of the system in term of ETL.

No comments: