Friday, June 10, 2011

Pentaho Data Integration : Powerful and Cost Effective ETL Solution

 

I would like to start this post with a little info about Pentaho Data Integration (PDI):

  • Pentaho Data Integration (PDI), initially called Kettle, is one of the most successful open source ETL tools available in the BI space. Pentaho initially released only a Community edition of Kettle; later, when it became popular, they released a Commercial edition called PDI. Kettle was first conceived about four years ago by Matt Casters, who needed a platform-independent ETL tool for his work as a BI consultant.
  • The Pentaho BI suite is built on several open source projects: reporting based on JFreeReport (brought into the suite more than four years ago and now known as Pentaho Reporting); ETL based on Kettle; OLAP based on Mondrian (with a GUI based on JPivot and a more interactive component recently licensed from LucidEra); and advanced analytics based on Weka.
  • Pentaho recently announced plans to provide integration with Hadoop, which can potentially address some of the scalability issues of open source BI tools.

Kettle Architecture

I researched a bit to find a visual representation of Kettle's overall architecture and, no surprise, I found one on Matt Casters's blog.

[Image: Kettle overall architecture diagram, from Matt Casters's blog]

Kettle is built with the Java programming language. It consists of four distinct applications:

Spoon: Spoon is a GUI tool to model the flow of data from input steps to output steps. Such a model is called a transformation.
Pan: Pan is a command line tool that executes transformations modeled with Spoon (see the Java sketch after this list).
Chef: Chef is a graphically oriented end-user tool used to model jobs. Jobs consist of job entries such as transformations, FTP downloads, etc., placed in a flow of control.
Kitchen: Kitchen is a command line tool used to execute jobs created with Chef.
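
For readers who like to see this in code: below is a minimal sketch of running a transformation from Java, which is essentially what Pan does when you point it at a .ktr file. It assumes the PDI 4.x libraries (kettle-core, kettle-engine and their dependencies) are on the classpath; the file path is hypothetical and the calls are from the org.pentaho.di API as I know it.

    import org.pentaho.di.core.KettleEnvironment;
    import org.pentaho.di.trans.Trans;
    import org.pentaho.di.trans.TransMeta;

    public class RunTransformation {
        public static void main(String[] args) throws Exception {
            // Initialise the Kettle environment (plugins, kettle.properties, ...)
            KettleEnvironment.init();

            // Load the XML model (.ktr) that was designed in Spoon (hypothetical path)
            TransMeta transMeta = new TransMeta("/path/to/my_transformation.ktr");

            // Execute it and wait for it to finish, just like Pan would
            Trans trans = new Trans(transMeta);
            trans.execute(null);
            trans.waitUntilFinished();

            if (trans.getErrors() > 0) {
                throw new RuntimeException("Transformation finished with errors.");
            }
        }
    }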

A few key points about the architecture:

Model based: Spoon and Chef are used to create models, which are XML based and can be interpreted by the command line tools Pan and Kitchen.

Repository based: Kettle allows a model to be saved either in a database repository or as XML documents on the file system. You cannot mix the two methods (files and repository) in the same project.
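
Because file-based models are plain XML, you can inspect them with ordinary XML tooling. Here is a small, self-contained Java sketch (no Kettle libraries needed) that lists the steps in a .ktr file; the path is hypothetical, and it assumes the usual layout in which each <step> element carries <name> and <type> children.

    import java.io.File;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.Element;
    import org.w3c.dom.NodeList;

    public class ListTransformationSteps {
        public static void main(String[] args) throws Exception {
            // Parse the transformation model; a .ktr file is plain XML (hypothetical path)
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new File("/path/to/my_transformation.ktr"));

            // Each <step> element describes one step of the transformation
            NodeList steps = doc.getElementsByTagName("step");
            for (int i = 0; i < steps.getLength(); i++) {
                Element step = (Element) steps.item(i);
                String name = step.getElementsByTagName("name").item(0).getTextContent();
                String type = step.getElementsByTagName("type").item(0).getTextContent();
                System.out.println(name + " (" + type + ")");
            }
        }
    }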

Kettle – Best Practices

There are a few things I've learned from my experience with Kettle which I feel can be termed best practices.

• Use variables instead of hardcoded strings in the SQL of input steps such as Table Input. You can set these variables in an externalized "kettle.properties" file in the Kettle home directory. Do remember to select the "Replace variables in script" option (a sketch of how the substitution works follows below).

• This allows you to change the business rules through configuration, without the need to change the ETL code.

[Screenshot: SQL input step using variables, with the "Replace variables in script" option selected]
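
To make the mechanics concrete, here is a minimal, self-contained Java sketch of what the "Replace variables in script" option effectively does: load name/value pairs from a kettle.properties-style file and replace ${...} placeholders in the SQL before it is executed. This only illustrates the substitution behaviour, it is not Kettle's actual implementation; the file path, variable names and SQL are hypothetical.

    import java.io.FileInputStream;
    import java.util.Properties;

    public class VariableSubstitutionSketch {
        public static void main(String[] args) throws Exception {
            // Hypothetical kettle.properties-style file containing, for example:
            //   SALES_SCHEMA=dwh
            //   MIN_ORDER_DATE=2011-01-01
            Properties props = new Properties();
            props.load(new FileInputStream("/home/etl/.kettle/kettle.properties"));

            // SQL as typed into the step, with ${...} placeholders instead of hardcoded values
            String sql = "SELECT * FROM ${SALES_SCHEMA}.orders "
                       + "WHERE order_date >= '${MIN_ORDER_DATE}'";

            // Replace each ${NAME} placeholder with the corresponding property value
            for (String name : props.stringPropertyNames()) {
                sql = sql.replace("${" + name + "}", props.getProperty(name));
            }

            System.out.println(sql); // the statement that would actually be executed
        }
    }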

I'll continue with more best practices in the next post.
