Saturday, June 25, 2011

Kettle – Best Practices

 

During my ETL development work, there have been instances when I had to make a DDL change to the Data Mart tables, sometimes for an enhancement and sometimes for a bug fix. This means fixing the corresponding Kettle transformation as well, and for that you need to follow this simple yet useful practice.

• Clear the database cache whenever you perform a DDL operation on lookup or target tables; otherwise Spoon might mark the transformation as invalid on validation.

Whenever we introduce a table in Kettle, it keeps a cached copy of the table metadata in the repository to speed up viewing and editing of transformations. It doesn’t refresh this metadata from the database every time we open a transformation, since that is an expensive operation. For this reason we have to clear the cache manually whenever we change the underlying database tables.

• To do this, click the Target (or Dimension Lookup) step, click SQL, and then hit Clear Cache. It will ask whether to clear the cache only for this table or for the entire database.

[Screenshot: clearing the database cache from the step’s SQL dialog in Spoon]

Tuesday, June 21, 2011

Kettle – Best Practices

In tune with the last post, I’ll discuss a few more ETL best practices that I follow with Kettle.

• Enable caching for lookups on dimension tables.

This improves the performance of your transformation for small and medium-sized dimension tables, but you need to monitor the logs to check whether it is actually beneficial. In some cases it can add unnecessary load on the ETL server.

It’s always a dilemma for an ETL developer whether to cache dimension tables when doing lookups. The right approach is usually to collect performance data for the ETL runs where caching is implemented. This data usually includes the time taken:

  • To fetch the dimension table into memory
  • To filter this data
  • To update the cache if the underlying lookup table data changes

Another good practice, which I learned from Informatica, is to place these cache files on faster drives.


Friday, June 10, 2011

Pentaho Data Integration : Powerful and Cost Effective ETL Solution

 

I would like to start this post with a little info about Pentaho Data Integration (PDI):

  • Pentaho Data Integration (PDI), initially called Kettle, is one of the most successful open source ETL tools available in the BI space. Initially Pentaho released only the community version of Kettle, but when it later became popular they released a commercial version called PDI. Kettle was first conceived about four years ago by Matt Casters, who needed a platform-independent ETL tool for his work as a BI consultant.
  • The Pentaho BI suite comprises several open source projects: reporting based on JFreeReport (converted more than four years ago and now known as Pentaho Reporting); ETL based on Kettle; OLAP based on Mondrian (with a GUI based on JPivot and a more interactive component recently licensed from Lucidera); and advanced analytics based on Weka.
  • Pentaho recently announced plans to provide integration with Hadoop, which can potentially address some of the scalability issues of open source BI tools.

Kettle Architecture

I researched a bit to find a visual representation of Kettle’s overall architecture and, no surprise, I found one on Matt Casters’s blog.

[Diagram: Kettle architecture overview, from Matt Casters’s blog]

Kettle is built with the Java programming language. It consists of four distinct applications:

Spoon: Spoon is a GUI tool for modeling the flow of data from input steps to output steps. One such model is also called a transformation.
Pan: Pan is a command-line tool that executes transformations modeled with Spoon (see the example invocations below).
Chef: Chef is a graphically oriented end-user tool used to model jobs. Jobs consist of job entries such as transformations, FTP downloads, etc. that are placed in a flow of control.
Kitchen: Kitchen is a command-line tool used to execute jobs created with Chef.
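
To make this concrete, here is a rough sketch of how Pan and Kitchen might be invoked on Linux. The file paths and logging level below are made up for illustration, and the script names differ on Windows (Pan.bat, Kitchen.bat):

    # Run a transformation (.ktr) saved by Spoon
    ./pan.sh -file=/opt/etl/load_sales.ktr -level=Basic

    # Run a job (.kjb) saved by Chef
    ./kitchen.sh -file=/opt/etl/nightly_load.kjb -level=Basic

This is what makes the tools easy to schedule: the models are built graphically, then executed unattended from cron or any other scheduler.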

A few key points about the architecture:

Model based: Spoon and Chef are used to create models, which are XML based and can be interpreted by the command-line tools Pan and Kitchen.

Repository based: Kettle allows the model to be saved either in a database repository or as XML documents. You cannot mix the two methods (files and repository) in the same project.

Kettle – Best Practices

There are a few things I’ve learned from my experience with Kettle which I feel can be termed best practices.

• Use variables instead of hard-coded strings when writing SQL in input steps (such as Table Input). You can set these variables in an externalized file, KETTLE.PROPERTIES, in the Kettle home directory. Do remember to select the “Replace variables in script” option (see the sketch below).

• This allows you to change the business rules through configuration, without touching the ETL code.

[Screenshot: Table Input step with “Replace variables in script” selected]
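
As a minimal sketch of this (the variable name, table, and columns below are hypothetical, invented purely for illustration), you define a property in KETTLE.PROPERTIES and reference it from the SQL of a Table Input step that has “Replace variables in script” checked:

    # KETTLE.PROPERTIES (in the Kettle home directory)
    SALES_START_DATE=2011-01-01

    -- SQL in the Table Input step ("Replace variables in script" checked)
    SELECT order_id, customer_id, amount
    FROM fact_sales
    WHERE order_date >= '${SALES_START_DATE}'

When the business rule changes (for example, the start of the load window), you edit the properties file instead of the transformation itself.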

I’ll continue with more best practices in the next post.