Thursday, November 18, 2010

BI with Pentaho Part-II

In this post I’ll write about my experiments with ETL transformations in Kettle. For the POC, I needed to load flat files and XML files, and then perform a staging-to-fact load as well. I’ll start with the basics of the Kettle UI and then discuss how I achieved the ETL mentioned above.

So when you open Spoon you get a nice Tip dialog box, which can sometimes give you a good tip, but not always. Usually it’s the regular stuff, which you can figure out yourself.

[Screenshot: Spoon’s tip-of-the-day dialog]

The next screen is the Welcome screen, which has links to all the documentation, samples and Kettle-related blogs. The main screen is divided into two parts, a Navigator and the Workspace. The Navigator has two separate tabs, Design and View. Design lists all the available transformations, and View displays the transformation you are currently working with.

Different Transformations

Kettle provides loads of transformations for your day-to-day ETL needs. All the transformations have been divided into various categories depending upon their usage. I don’t find the categorization all that useful, though; it can be a real pain in the neck for some transformations, like Dimension lookup/update, which sits under Data Warehouse instead of Lookup. Maybe this dislike comes from my Informatica mindset.

Another issue with Kettle is that there is no standard documentation for the transformations, although I shouldn’t expect that from an open source tool; maybe that’s how Pentaho earns its dollars, by providing support and training for these transformations.

List of Kettle Transformations

I created a list of Kettle transformations while doing research for my project, which I think might help others with a similar need. (A short sketch of running one of these transformations from Java follows the list below.)

Input Transformations

1. Access Input: To get data from Access database files.

2. CSV File Input: To read delimited data from CSV files.

3. Data Grid: To insert constant data into the flow, e.g. when you want to populate a dimension table with static master data.

4. De-serialize from file: To read data from a cube file, usually created by Kettle with the "Serialize to file" output transformation.

5. ESRI Shapefile Reader: A custom plugin developed to read shape files created with ESRI software, which are used to distribute geographical information.

6. Excel Input: To get data from Excel files. It’s a useful and capable transformation that can read complex Excel sheets as well. I found it better than Informatica’s Excel reader transformation.

7. Fixed file input: Another variant of the CSV file input, for data files with fixed-width fields.

8. Get data from XML: A really useful transformation. It lets you get data from an XML file, depending upon the level or XPath node you select. You can read multiple files with regular expression support.

9. Get System Info: Another useful transformation, which lets you get system info (memory), software versions and date-related values like the first or last day of the week.

10. Mondrian Input: To get data from a Mondrian cube.

11. OLAP Input: Another plugin to read OLAP data.

12. Property Input: To read key-value pair type of data.

13. RSS Input: A really unique transformation. It lets you get data from RSS feeds. You can also specify the time frame from which you want your feeds to be refreshed.

14. Salesforce Input: Similar to what other ETL tools provide for working with Salesforce data. You just need to provide the Salesforce URL, credentials and the module from which you want to fetch the data.

15. SAP Input: Similar to the Salesforce transformation, but it needs extra configuration. You may need to download a few extra jars from the Pentaho SVN.

16. Table input: The simplest and most useful input transformation.

17. Text file input: To get data from flat text files with fixed or delimited fields. You can add multiple files or just give a regular expression for the file name.
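As an aside, you don’t have to run these from Spoon; a transformation saved as a .ktr file can also be executed from Java code. Below is only a minimal sketch, assuming the PDI 4.x API jars (kettle-core, kettle-engine and their dependencies) are on the classpath; load_flat_files.ktr is a made-up file name.

```java
import org.pentaho.di.core.KettleEnvironment;
import org.pentaho.di.core.exception.KettleException;
import org.pentaho.di.trans.Trans;
import org.pentaho.di.trans.TransMeta;

// Minimal sketch: run a transformation saved from Spoon as a .ktr file.
public class RunTransformation {
    public static void main(String[] args) throws KettleException {
        KettleEnvironment.init();                              // load the built-in steps/plugins
        TransMeta meta = new TransMeta("load_flat_files.ktr"); // hypothetical file name
        Trans trans = new Trans(meta);
        trans.execute(null);                                   // no command-line arguments
        trans.waitUntilFinished();
        if (trans.getErrors() > 0) {
            System.err.println("Transformation finished with errors.");
        }
    }
}
```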

There are a few other transformations as well, which are either very application-specific or for which I couldn’t find a practical use in my experience. These are mostly plug-ins developed by ETL developers for very specific applications:

Generate random value

Generate Rows

Get File Names

Get Files Rows Count

Get SubFolder names

LDAP Input

LDIF Input

Load file content in memory

Palo Cells Input

Palo Dimension Input

S3 CSV Input

XBase input

You can find more such plug-ins at Kettle Wiki.

A few common and distinctive features of the input transformations are worth mentioning:

Lazy Conversion is an algorithm Kettle uses in certain transformations (CSV File Input and Fixed file input) where it doesn’t make sense to unnecessarily change the data types within the ETL. The standard scenario is where the source and the target are of the same type, like text files or the same database schema. It also allows data to be moved around in binary form.
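Conceptually it works like deferred parsing. The toy sketch below is just my illustration of the idea, not Kettle’s actual code: keep each field as the raw bytes read from the file and only pay the conversion cost if a typed value is actually requested.

```java
// Toy illustration of the lazy-conversion idea (not Kettle's implementation):
// hold the raw bytes from the file and convert only on demand.
class LazyField {
    private final byte[] raw;   // bytes exactly as they appeared in the source file
    private Long parsed;        // typed value, computed at most once

    LazyField(byte[] raw) { this.raw = raw; }

    // Pass-through: a step that just copies the field never triggers a conversion.
    byte[] rawBytes() { return raw; }

    // The parsing cost is paid only here, and only the first time.
    long asLong() {
        if (parsed == null) {
            parsed = Long.parseLong(new String(raw).trim());
        }
        return parsed;
    }
}

public class LazyDemo {
    public static void main(String[] args) {
        LazyField f = new LazyField("42".getBytes());
        System.out.println(f.asLong()); // conversion happens here, not at read time
    }
}
```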

Regular Expression: All of my input flat files had similar names, so instead of adding them one by one, I gave the regular expression “req_uid*.*” and the directory where all my files were located.

[Screenshot: file selection configured with the directory and the regular expression “req_uid*.*”]
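Kettle evaluates these filename masks as Java regular expressions, so you can sanity-check a mask with a few lines of Java before pointing Spoon at a directory. The file names below are made-up examples.

```java
import java.util.regex.Pattern;

// Quick check of a Kettle filename mask outside Spoon.
public class MaskCheck {
    public static void main(String[] args) {
        Pattern mask = Pattern.compile("req_uid*.*");   // the mask used above
        String[] names = { "req_uid_001.txt", "req_uid_west.csv", "summary.txt" };
        for (String name : names) {
            // prints true, true, false
            System.out.println(name + " -> " + mask.matcher(name).matches());
        }
    }
}
```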

Add Filename to Result: This can be useful in cases where we need to filter data based upon the filename. For example, if we get similar files from different regions and need to process them differently, all we need to do is name them differently, and then we can add our logic based upon the filename field. A made-up sketch of such routing logic follows.
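Here is a toy sketch of that routing idea, with hypothetical region-coded file names. In Kettle itself this would typically be a Filter rows or Switch / Case step on the filename field rather than code.

```java
// Toy sketch of routing rows on the filename field made available downstream.
// The file-name patterns are made up for illustration.
public class RegionRouter {
    static String regionOf(String filename) {
        if (filename.contains("_emea_")) return "EMEA";
        if (filename.contains("_apac_")) return "APAC";
        return "DEFAULT";
    }

    public static void main(String[] args) {
        System.out.println(regionOf("sales_emea_2010.csv")); // EMEA
        System.out.println(regionOf("sales_apac_2010.csv")); // APAC
    }
}
```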

In the next post I’ll discuss Output transformations.

1 comment:

  1. Don’t forget all those steps are documented on the Kettle wiki too.
