Thursday, November 17, 2011

How to Install CTools on Windows

Last week I read a post by Slawomir Chodnicki on “Creating Dashboards with CDE”. Thanks to him for a really useful post.

I wanted to install CTools in my Pentaho environment. The issue was that I was on Windows, and the CTools installer is made for Unix/Linux environments.

Pedro Alves has published a blog post on how to install it on Unix, and Yogaraj Khanal has used Cygwin to run it on Windows.

I am not a big fan of Cygwin, and due to IT restrictions it would have been a lengthy process to get it installed on my machine, so I tried to install CDF/CDE/CDA without the script. It’s not that difficult, actually: you just need to follow the steps performed by the shell script “ctools-installer.sh”.

Steps:

1. Get the CDF/CDE/CDA files from the locations mentioned in the ‘downloadCDF/CDA/CDE’ functions.

2. Put them in a temporary directory.

3. Remove any existing CDF/CDE/CDA files.

4. Unzip them as described in the ‘installCDF/CDE/CDA’ functions.

5. Copy them to their respective locations.

Things to remember:

1. You first have to unzip the dist.zip files. This will give you another set of zips and jars.

List of files:

CDA:
cda-samples-TRUNK-20111013.zip
cda-TRUNK-20111013.zip
cda-TRUNK-SNAPSHOT.jar

CDE (Dashboard Editor):
pentaho-cdf-dd-solution-TRUNK-20111028.zip
pentaho-cdf-dd-TRUNK-20111028.zip
pentaho-cdf-dd-TRUNK-SNAPSHOT.jar

CDF (Dashboard Framework):
pentaho-cdf-TRUNK-SNAPSHOT.jar
pentaho-cdf-TRUNK-SNAPSHOT.zip

2. Put the jars into the lib directory of the Tomcat installation for the Pentaho BI Server. In my case it was “tomcat\webapps\pentaho\WEB-INF\lib”.

3. The zip files which you get from step 1 contain two sets of directory structures: one holds the configuration files for CDF/CDE/CDA and the other holds the samples.

4. The samples need to be placed under ‘/biserver/pentaho-solutions’ and the configuration files under ‘/biserver/pentaho-solutions/system’.

It took me 20 minutes at most to set this up. I started the BI server and lived “happily ever after”.

It’s no big deal, but it might save you some time and an unnecessary Cygwin setup.

Friday, November 4, 2011

Kettle Trials: Using Modified JavaScript Value to Generate Surrogate Key

 

In the last post I explained how I implemented surrogate keys using SSIS for dimension table updates.

I tried to do the same thing with Kettle, and I was not surprised at how easy it was: Kettle gives you so many options to do a task.

There are many ways to implement it. I’ve implemented it using the “Add Sequence” and “Modified JavaScript Value” steps.

1. The logic is to first get the last used surrogate value. This can be done in two ways: either with a database sequence (if the DB supports it) or with Kettle’s own sequence generator. In this example I’ve used the DB sequence, as I wanted this example to be similar to the SSIS exercise I did, although the latter should be the better option as it will have performance benefits.

2. Next, pass this value to a Kettle variable using the “Set Variables” step.

3. Save this transformation; it will act as a configuration transformation for the main transformation.

4. Create a new transformation which will do the actual loading.

5. Get the staging data using an input step (I’ve used a flat-file input to keep things simple).

6. Then I used the “Modified JavaScript Value” step.

In this step, first get the variable value which we set in step 2, then increment it for each row, and we’re done.

7. Now just pass the output of the above step to an output step (a DB table or a flat file).

An alternative could have been to pick the max value from the primary key column of the table and assign it to the variable, but that can be a performance issue with larger tables.

Issues which I faced:

How do you use variables in the “Modified JavaScript Value” step? There is a nice method, “getVariable”, available under the “Special Functions” category, with which we can reference the variables we are using in the transformation. This is the thing I like most about Kettle: it is an ETL tool with a programmer’s mindset rather than a mere drag-and-drop tool.
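For illustration, here is a minimal sketch of what the step’s script can look like, assuming the configuration transformation has set a variable named MAX_ID holding the last used key (the variable and field names are illustrative, not the exact ones from my transformation):

// Fetch the last used key set by the configuration transformation;
// the second argument of getVariable is a default value.
var maxId = str2num(getVariable("MAX_ID", "0"), "#");

// The script runs once per row but keeps its scope between rows,
// so this counter survives from row to row.
var counter = (typeof counter == "undefined") ? 0 : counter;
counter = counter + 1;

// New surrogate key for the current row; add "surrogate_key" to the
// step's output fields grid so downstream steps can see it.
var surrogate_key = maxId + counter;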

Another issue was with formatting. In the JavaScript I had converted a few variables from string to number and vice versa without passing the optional format parameter.

So, a thing to note: always pass the format mask explicitly, otherwise Kettle will format the values according to its own logic.

I was expecting “1234” and I got “1234.00” or “000000001234”.
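The unmasked conversions looked something like this (a reconstruction, not my original script):

str2num(MAX_ID)        // no mask: Kettle guesses the number format
num2str(strVarValue)   // no mask: comes back as "1234.00" or zero-padded

With an explicit “#” mask the same calls give a plain “1234”: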

str2num(MAX_ID, "#")       // string -> number

num2str(strVarValue, "#")  // number -> string


I kept getting the error “Sequence Not Found” when I validated the transformation, although the sequence was there. I tried the SQL editor available inside Kettle, and even it verified that the sequence existed. It was really annoying. I restarted the database and cleared the DB’s cache, but nothing worked.

The actual issue was that I had created my schema in a DB with multiple schemas in it, and while creating the connection in Kettle I had filtered on it (attached image).

But it seems Kettle and DB2 aren’t so friendly, and that’s where the error came from.

The solution is to create an OS user with the same name as the schema, and then create the objects in the schema while connected under this new user’s credentials.

Job:

[screenshot]

Setup Variable Transformation:

[screenshot]

Set Variable and Add Sequence Steps:

[screenshot] [screenshot]

Main Transformation:

[screenshot]

Modified JavaScript Step:

[screenshot]

Wednesday, November 2, 2011

SSIS Trials: Implementing Surrogate Key in Dimensions

Yesterday I finally decided to try my hand at Microsoft’s ETL tool, SSIS. I have worked with various ETL tools, and this was the one I had never used. It was part of a PoC (proof of concept) I had to do. The objective was to read a text file and load it into a DB, a fairly standard dimension-loading procedure: extract the attributes from the flat file, transform them, and load them into the DB.

The interesting part was the creation of the surrogate (primary) key. This is a really good way to gauge the capabilities of your ETL tool.

Informatica gives you many options: you can use the “Sequence Generator” transformation, a global variable, or a database sequence (if your DB supports it).

Pentaho Kettle lets you use variables and database sequences to deal with this situation.

In SSIS you can do it with the help of an “Execute SQL Task” and a “Script Component” transformation. The logic is simple:

1. Pick data from a source (a flat file in my case).

2. Get the max value of the surrogate key from the database table.

3. Pass it to a solution variable.

4. Use a VB script which assigns it to a script variable that gets incremented for each row of data.

5. Concatenate the result set using a Derived Column transformation.

I googled for the script (to increment the variable for each row) and found this blog really useful.
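The gist of that script is simply: seed once from the variable filled by the Execute SQL Task, then increment per row. Here is a sketch of that logic in JavaScript for consistency with my Kettle post (the actual SSIS Script Component is VB.NET; the names below are made up):

// Seed once with the max key fetched by the Execute SQL Task,
// then hand out maxId + n for the n-th row.
function makeKeyGenerator(maxId) {
  var counter = 0;
  return function (row) {
    counter = counter + 1;
    row.SurrogateKey = maxId + counter; // the new output column
  };
}

var nextKey = makeKeyGenerator(1234); // 1234 = current max from the DB
nextKey({ Name: "A" }); // gets SurrogateKey 1235
nextKey({ Name: "B" }); // gets SurrogateKey 1236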

There were a few hiccups which were actually quite frustrating, and it took me almost 2 hours to finally solve them.

Firstly, you should always match the data type of the variable with the return type of the database result set.

Suppose your query returns a SUM and the variable is of type String; you might get this error:

[Execute SQL Task] Error: An error occurred while assigning a value to variable "Max_ID": "The type of the value being assigned to variable "User::Max_ID" differs from the current variable type. Variables may not change type during execution. Variable types are strict, except for variables of type Object. ".

The next thing to take care of is the result set mapping in the Execute SQL Task editor. When you’re using an ODBC data source, you should use 1 as the Result Name, otherwise you might get this error:

“[Execute SQL Task] Error: An error occurred while assigning a value to variable "Max_Seq_ID": "Result column index 0 is not valid.".”

Then I tried changing it to the column name which I had used in the query, and it gave this error (I had connected to a DB2 database with ODBC):

[Execute SQL Task] Error: An error occurred while assigning a value to variable "Max_Seq_ID": "Value does not fall within the expected range.".

At one point, even with the Result Name set to 1 on the ODBC connection, I got:

“[Execute SQL Task] Error: An error occurred while assigning a value to variable "Max_Seq_ID": "[IBM][CLI Driver] CLI0122E Program type out of range. SQLSTATE=HY003".”

Eventually I found out that we need to use “1” as the Result Name with ODBC and “0” with ADO.NET.

And if you use the alias name from the query as the Result Name with ADO.NET:

[Execute SQL Task] Error: An error occurred while assigning a value to variable "Max_Seq_ID": "Result binding by name "dasdas" is not supported for this connection type. ".


So many errors for one simple solution. I don’t want to sound critical: Informatica is easier when it comes to transformation usability, but that ease is reflected in its price.

Anyway, once the Execute SQL Task was configured properly, I added a Data Flow Task to the solution and then added a Script Component, with output going to a Derived Column and then to a flat file (you can use a database table as well). In the Script Component editor, make the following changes:


1. Add the variable we used in the Execute SQL Task to ReadOnlyVariables.

2. Add an output column in the “Inputs and Outputs” section with the correct data type and encoding.

3. Design the script with proper type casting between variables and columns.

4. Set the Pre-compile flag to False, or you might get this error:

“The script component is configured to pre-compile the script, but binary code is not found. Please visit the IDE in Script Component Editor by clicking Design Script button to cause binary code to be generated.”

I used a Derived Column transformation to combine the base columns from the flat file with the surrogate key generated by the Script Component.
