In this post I’ll write about my experiments with ETL transformations in Kettle. For the POC, I needed to load flat files and XML files, and then perform a Staging-to-Fact load as well. I’ll start with the basics of the Kettle UI and then discuss how I achieved the ETL mentioned above.
When you open Spoon you get a nice Tip dialog box, which can sometimes give you a good tip, but not always. Usually it’s the regular stuff, which you can figure out yourself.
The next screen is the Welcome screen, which has links to all the documentation, samples and Kettle-related blogs. The main screen is divided into two parts: a Navigator and the Workspace. The Navigator has two separate tabs, Design and View. Design lists all the available transformations, and View displays the transformation you are currently working with.
Different Transformations
Kettle provides loads of transformations for your day-to-day ETL needs. All the transformations are divided into various categories depending upon their usage. I don’t find the categorization all that useful, though, as it can be a real pain in the neck for some unusually placed transformations: Dimension Lookup, for example, sits under Data Warehouse instead of Lookup. Maybe this dislike is because of my Informatica mindset.
Another issue with Kettle is that there is no standard documentation for the transformations, although I shouldn’t expect that from an open source tool; maybe that’s how Pentaho earns its dollars, by providing support and training for these transformations.
List of Kettle Transformations
I created a list of Kettle transformations while doing research for my project, which I think might help others who have a similar need.
Input Transformations
| S No. | Transformation | Usage |
| --- | --- | --- |
| 1 | Access Input | To get data from Access database files. |
| 2 | CSV File Input | To read delimited data from CSV files. |
| 3 | Data Grid | To insert constant data into the flow, e.g. when you want to populate a dimension table with static master data. |
| 4 | De-serialize from file | To read data from a Cube file, usually created by Kettle with the "Serialize to file" output transformation. |
| 5 | ESRI Shapefile Reader | A custom plugin developed to read shape files created via ESRI, a format used to distribute geographical information. |
| 6 | Excel Input | To get data from Excel files. It’s a useful and capable transformation that reads complex Excel sheets as well. I found it better than Informatica’s Excel reader transformation. |
| 7 | Fixed file input | Another variant of CSV input, for files whose data fields have fixed widths. |
| 8 | Get data from XML | A really useful transformation. It lets you get data from an XML file, depending upon the level or XPath node you select, and you can read multiple files with regular expression support (see the sketch after this table). |
| 9 | Get System Info | Another useful transformation, which lets you get system information (memory), software versions, and date-related values such as the first and last day of the week. |
| 10 | Mondrian Input | To get data from a Mondrian cube. |
| 11 | OLAP Input | Another plugin to read OLAP data. |
| 12 | Property Input | To read key-value pair type of data. |
| 13 | RSS Input | A really unique transformation. It lets you get data from RSS feeds, and you can also specify the time frame from which you want your feeds refreshed. |
| 14 | Salesforce Input | Similar to what other ETL tools provide for working with Salesforce data. You just need to provide the Salesforce URL, credentials and the module from which you want to fetch the data. |
| 15 | SAP Input | Similar to the Salesforce transformation, but it needs extra configuration. You may need to download a few extra jars from the Pentaho SVN. |
| 16 | Table input | The simplest and most useful input transformation. |
| 17 | Text file input | To get data from flat files with fixed data fields. You can add multiple files or just give a regular expression for the file name. |
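To make entry 8 concrete, here is a minimal plain-Java sketch of what "Get data from XML" does conceptually: a loop XPath selects the repeating node, every match becomes one row, and field XPaths are evaluated relative to it. The file orders.xml and the paths /orders/order, id and amount are hypothetical examples, not anything the step prescribes.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class XmlRowSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical source file; the real step is configured in the Spoon GUI
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse("orders.xml");

        XPath xpath = XPathFactory.newInstance().newXPath();

        // The "loop" XPath: each matching node becomes one output row
        NodeList rows = (NodeList) xpath.evaluate("/orders/order", doc, XPathConstants.NODESET);

        for (int i = 0; i < rows.getLength(); i++) {
            Node row = rows.item(i);
            // Field XPaths are relative to the loop node
            String id = xpath.evaluate("id", row);
            String amount = xpath.evaluate("amount", row);
            System.out.println(id + "\t" + amount);
        }
    }
}
```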
There are a few other transformations as well, which are either very application-specific or for which I couldn’t find a practical use in my experience. These are mostly plug-ins developed by ETL developers for very specific applications:
Generate random value, Generate Rows, Get File Names, Get Files Rows Count, Get SubFolder names, LDAP Input, LDIF Input, Load file content in memory, Palo Cells Input, Palo Dimension Input, S3 CSV Input and XBase input.
You can find more such plug-ins at Kettle Wiki.
A few common and distinctive features of the input transformations are worth mentioning:
Lazy Conversion is an algorithm that Kettle uses in certain transformations (CSV File Input and Fixed file input) where it doesn’t make sense to unnecessarily change data types within the ETL. The standard scenario is where the source and the target are of the same type, such as text files on both ends or the same database schema. It also facilitates moving the data in binary mode.
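As a rough illustration of the idea (a toy sketch, not Kettle’s internal row API), lazy conversion amounts to carrying the raw bytes through the stream untouched and parsing a typed value only if some downstream step actually asks for it:

```java
import java.nio.charset.StandardCharsets;

// Toy sketch of the lazy-conversion idea, not Kettle's actual implementation
class LazyField {
    private final byte[] raw;  // bytes exactly as read from the source file
    private Long parsed;       // typed value, cached once parsed on demand

    LazyField(byte[] raw) { this.raw = raw; }

    // Straight binary pass-through: source and target have the same type,
    // so no conversion ever happens
    byte[] rawBytes() { return raw; }

    // Conversion happens only here, the first time a step needs the number
    long asLong() {
        if (parsed == null) {
            parsed = Long.parseLong(new String(raw, StandardCharsets.US_ASCII).trim());
        }
        return parsed;
    }
}
```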
Regular Expression: All of my input flat files had similar names, so instead of adding them one by one, I gave the regular expression “req_uid*.*” along with the directory where all my files were located.
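In plain Java, the step’s behaviour boils down to the two inputs I gave it, a directory plus a file-name regex. The folder /data/incoming and the tightened pattern below are made-up stand-ins for my actual setup:

```java
import java.io.File;
import java.util.regex.Pattern;

public class InputFileScan {
    public static void main(String[] args) {
        // Hypothetical input directory
        File dir = new File("/data/incoming");
        // A tightened variant of the "req_uid*.*" pattern from the post
        Pattern name = Pattern.compile("req_uid.*\\.txt");

        File[] files = dir.listFiles();
        if (files == null) return;  // directory missing or unreadable
        for (File f : files) {
            if (name.matcher(f.getName()).matches()) {
                System.out.println("would read: " + f.getAbsolutePath());
            }
        }
    }
}
```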
Add Filename to Result: This can be useful in cases where we need to filter data based upon the filename. For example, if we get similar files from different regions and need to process them differently, all we need to do is name the files differently, and we can then add our logic based upon the filename field.
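To sketch that routing idea outside Kettle (the naming convention sales_emea.csv / sales_apac.csv is an invented example), once each row carries its source filename, a downstream filter can branch on it like this:

```java
public class FilenameRouter {
    // Hypothetical naming convention: sales_emea.csv, sales_apac.csv, ...
    static String regionOf(String filename) {
        if (filename.contains("_emea")) return "EMEA";
        if (filename.contains("_apac")) return "APAC";
        return "OTHER";
    }

    public static void main(String[] args) {
        // With the filename added to each row, region-specific logic keys off it
        System.out.println(regionOf("sales_emea.csv"));  // EMEA
        System.out.println(regionOf("sales_apac.csv"));  // APAC
    }
}
```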
In the next post I’ll discuss Output transformations.