Monday, November 22, 2010

BI with Pentaho Part-III

In this post I will discuss my experience with Kettle Output transformations. Like the Input transformations, Kettle offers plenty of Output transformations as well. Most of them are similar to their Input counterparts, so I’ve listed the ones which are different.

1. Delete: Allows you to delete data from a database table based on values in the stream.

2. Excel Output: Allows you to write to a specific sheet in an Excel file. It also gives you the option to use a template.

3. Insert / Update: Another useful transformation, ideally suited for data warehouse loads. You select the columns used to look up existing rows and the corresponding columns which will be updated on a match or inserted when there is none.

4. SQL File Output: Generates SQL INSERT statements, with the option to generate CREATE TABLE or TRUNCATE TABLE statements alongside.

5. Synchronize after merge: As per the Kettle wiki, this transformation should be used in conjunction with the Merge Rows transformation. It commits the changes which Merge Rows marks as changed, new, deleted and so on. To my understanding it is similar to Informatica’s Update Strategy.

6. Update: Similar to the Insert / Update transformation, but for the specific case where you only want to update records and skip the inserts.

I’ll cover more transformations in the next posts.

Thursday, November 18, 2010

BI with Pentaho Part-II

In this post I’ll write about my experiments with ETL transformations in Kettle. For the POC, I needed to load flat files and XML files, and then perform a staging-to-fact load as well. I’ll start with the basics of the Kettle UI and then discuss how I achieved the ETL mentioned above.

When you open Spoon you get a Tip dialog box, which can sometimes give you a good tip, but not always. Usually it’s the regular stuff that you can figure out yourself.

[Image: Spoon tip dialog]

The next screen is the Welcome screen, which has links to all the documentation, samples and Kettle-related blogs. The main screen is divided into two parts, a navigator and the workspace. The navigator has two separate tabs, Design and View. Design lists all the available transformations, and View displays the transformation you are currently working on.

Different Transformations

Kettle provides loads of transformations for your day-to-day ETL needs. All the transformations are divided into categories depending upon their usage. I don’t find the categorization very useful, as it can be a real pain in the neck for unusually placed transformations; for example, Dimension lookup sits in the Data Warehouse category instead of Lookup. Maybe this dislike is because of my Informatica mindset.

Another issue with Kettle is that there is no standard documentation for the transformations, although I shouldn’t expect that from an open source tool; maybe that’s how Pentaho earns its dollars, by providing support and training for these transformations.

List of Kettle Transformations

I created a list of Kettle transformations while doing research for my project, which I think might help others with a similar need.

Input Transformations

1. Access Input: To get data from Access database files.

2. CSV File Input: To read delimited data from CSV files.

3. Data Grid: To insert constant data into the flow, e.g. when you want to populate a dimension table with static master data.

4. De-serialize from file: To read data from a Cube file, usually created by Kettle with the "Serialize to file" output transformation.

5. ESRI Shapefile Reader: A custom plugin developed to read shapefiles created with ESRI software, a format used to distribute geographical information.

6. Excel Input: To get data from Excel files. It’s a useful and capable transformation for reading complex Excel files as well; I found it better than Informatica’s Excel reader.

7. Fixed file input: Another variant of the CSV file input, where the data file has fixed-width fields.

8. Get data from XML: A really useful transformation. It lets you get data from an XML file depending upon the level, or XPath node, you select. You can read multiple files with regular expression support.

9. Get System Info: Another useful transformation, which gives you system information (memory), software versions, and date-related values such as the first or last day of the week.

10. Mondrian Input: To get data from a Mondrian cube.

11. OLAP Input: Another plugin to read OLAP data.

12. Property Input: To read key-value pair data.

13. RSS Input: A really unique transformation. It lets you get data from RSS feeds. You can also specify the time frame from which you want your feeds to be refreshed.

14. Salesforce Input: Similar to what other ETL tools provide for working with Salesforce data. You just need to provide the Salesforce URL, credentials and the module from which you want to fetch the data.

15. SAP Input: Similar to the Salesforce transformation, but it needs extra configuration. You may need to download a few extra jars from the Pentaho SVN.

16. Table input: The simplest and most useful input transformation.

17. Text file input: To get data from fixed-width data files. You can add multiple files or just give a regular expression for the file names.

There are a few other transformations as well, which are either very application-specific or for which I couldn’t find a practical use in my experience. These are mostly plug-ins developed by ETL developers for very specific applications.

Generate random value

Generate Rows

Get File Names

Get Files Rows Count

Get SubFolder names

LDAP Input

LDIF Input

Load file content in memory

Palo Cells Input

Palo Dimension Input

S3 CSV Input

XBase input

You can find more such plug-ins on the Kettle wiki.

A few common and distinctive features of the Input transformations are worth mentioning:

Lazy Conversion is an optimization which Kettle offers in certain transformations (CSV File Input and Fixed file input) for cases where it doesn’t make sense to unnecessarily change data types within the ETL. The standard scenario is where the source and the target are of the same type, such as text files or the same database schema. It also allows data to be moved around in binary form.

Regular Expression: All of the flat files which I used as input had similar names. So instead of adding them one by one, I gave the regular expression “req_uid*.*” and the directory where all my files were located.

[Image: file selection using a regular expression]

Add Filename to Result: This can be useful in cases where we need to filter data based upon the filename. For example, we may get similar files from different regions that need to be processed differently. All we need to do is name them differently, and we can then add our logic based upon the filename field in the data.

In the next post I’ll discuss Output transformations.

Tuesday, November 16, 2010

BI with Pentaho Part-I

A couple of months back I needed to do a POC (Proof of Concept) for our BI group with open source technology. After some quick analysis we decided to use the Pentaho BI suite. Kettle, Pentaho’s ETL tool, was the leading open source ETL solution, plus we had an experienced resource for Kettle as well. On the BI side, we had a few Java resources with BI exposure (with BIRT), so we decided to try something new and finalized Pentaho for BI as well.

ETL with Kettle

I have previously worked with Informatica for ETL, so I had the mindset of comparing every other ETL tool with Informatica. But I knew that Kettle is open source, so I didn’t raise my expectations too high. I downloaded the software from SourceForge, although you can also try Pentaho’s paid version, the Enterprise Edition, which comes with a 30-day trial. I’ll discuss the features of the Enterprise Edition later along with the BI discussion. I couldn’t find any difference in the ETL part between the open source and Enterprise editions, except that they call the paid version “Pentaho Data Integration”.

So after all the setup I started exploring Kettle. I unzipped the software and searched for a setup.exe or MSI file to start the installation, but was surprised to find just a few scripts (bat files) at the root. Secondly, you don’t have to download a different version for Linux; a similar set of scripts (in sh format) is provided for Linux as well. I really liked that. So plus one to the score for Kettle.

After that I looked for which script to use to start the ETL tool. There were three batch files, Spoon, Kitchen and Pan; strange names, even funny. Then I searched the net and found the Kettle wiki. It’s a really useful resource for getting familiar with Kettle, with some basic transformation examples as well. So now I knew which tool to use for what:

Kitchen: the tool used to execute jobs from a terminal window. Roughly a limited version of what the Informatica Workflow Manager does.
Pan: allows you to execute transformations from a terminal window.
Spoon: the graphical tool with which you design and test transformations. Similar to the Informatica Mapping Designer.

It works like this: create a transformation with Spoon, test it and save it in the repository. Use Pan to run a single transformation from the command prompt. Kitchen is used to execute a job, which is created to run a set of inter-dependent transformations.
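Just to make it concrete, this is roughly how the two command-line tools are invoked on Linux (pan.bat and kitchen.bat work the same way on Windows). The file names, paths and repository details below are only placeholders from my POC, so treat this as a sketch and check the wiki for the full list of options:

    # run a single transformation saved as a .ktr file
    ./pan.sh -file=/home/etl/load_staging.ktr -level=Basic

    # run a job (.kjb) that chains a set of inter-dependent transformations
    ./kitchen.sh -file=/home/etl/nightly_load.kjb -level=Basic

    # the same job, but picked up from a database repository instead of a file
    ./kitchen.sh -rep=MyRepo -user=admin -pass=admin -dir=/ -job=nightly_load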

There are a few more scripts present alongside the ones mentioned above, like Encr and Carte.

Carte: Carte is a simple web server that allows you to execute transformations and jobs remotely. I really liked this utility. It is similar to what Informatica provides with the Workflow Monitor or the web Admin Console, although not as robust or versatile as the Workflow Monitor. It gives you a basic level of user security as well.
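Starting Carte is just a matter of giving it a host name and a port to listen on; the address and port below are only what I used while testing:

    # start a Carte server on this machine, listening on port 8080
    ./carte.sh 127.0.0.1 8080

Once it is up, you can point a browser at http://127.0.0.1:8080 to see its status pages and monitor the transformations and jobs it is running.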

Encr: This is another useful tool for enhanced security, which lets you encrypt your passwords. Carte expects you to store its passwords in an external file, which I found very strange from a security perspective, but Encr allows you to obfuscate the password before storing it in that file.
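A minimal sketch of how I used it; the password below is made up, and the name of the password file Carte reads (pwd/kettle.pwd in my copy) may differ between versions, so verify it against your own installation:

    # obfuscate the Carte password instead of keeping it in clear text
    ./encr.sh -carte MySecretPassword

    # there is also a -kettle mode for passwords stored in Kettle's own files
    ./encr.sh -kettle MySecretPassword

    # paste the string that Encr prints into Carte's password file
    # (pwd/kettle.pwd in my copy), replacing the plain-text password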

So after this brief look at the Kettle utilities, I started Spoon, but nothing happened; a command prompt appeared and vanished. Then I opened a command prompt myself and started Spoon.bat. This time it gave me a proper error message saying that I needed to set the Java home on my machine. Kettle requires the Sun Java Runtime Environment (JRE) version 1.5 or newer, which you can get from the Java site or from your local IT department. You need to set the environment variable JAVA_HOME correctly.
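For anyone hitting the same issue, this is all it took on my machine; the JRE path below is just an example from my setup, and on Windows the equivalent is a "set JAVA_HOME=..." before running Spoon.bat:

    # point JAVA_HOME at an installed JRE (1.5 or newer), then launch Spoon
    export JAVA_HOME=/usr/lib/jvm/java-6-sun
    ./spoon.sh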

After this I was able to start Spoon. Now you have to decide how you want to store the Kettle metadata, i.e. the repository. Kettle gives you two options: one is file based and the other, as usual, is a database. The trade-offs are obvious: if you’re working in a distributed environment, use the database option, otherwise the file-based one. The database option also allows user-level security.

Database Repository:

[Image: database repository setup]

File based Repository:

[Image: file-based repository setup]

In the next post I’ll discuss my experience with the various transformations I used with Kettle.