Thursday, November 17, 2011

How to Install CTools on Windows

Last week I read a post by Slawomir Chodnicki on “Creating Dashboards with CDE”. Thanks to him for a really useful post.

I wanted to install the CTools in my Pentaho environment, but the issue was that I was on Windows and the CTools installer is written for Unix/Linux environments.

Pedro Alves published a blog post on how to install it on Unix, and Yogaraj Khanal has used Cygwin to run the installer on Windows.

I am not a big fan of Cygwin, and due to IT restrictions it would have been a lengthy process to get it installed on my machine, so I tried to install CDF/CDE/CDA without the script. It’s not that difficult, actually: you just need to follow the steps laid out in the shell script “ctools-installer.sh”.

Steps:

1. Get the CDF/CDE/CDA files from the locations mentioned under the ‘downloadCDF/CDA/CDE’ functions.

2. Put them under a temporary directory.

3. Remove any existing CDE/CDF/CDA files.

4. Unzip them as per the details given under the ‘installCDF/CDE/CDA’ functions.

5. Copy them to their respective locations.

Things to remember:

1. You have to unzip the dist.zip files first. This will give you another set of zip and jar files.

List of files:

cda-samples-TRUNK-20111013.zip
cda-TRUNK-20111013.zip
cda-TRUNK-SNAPSHOT.jar                                   (CDA)

pentaho-cdf-dd-solution-TRUNK-20111028.zip
pentaho-cdf-dd-TRUNK-20111028.zip
pentaho-cdf-dd-TRUNK-SNAPSHOT.jar                        (CDE, the Dashboard Editor)

pentaho-cdf-TRUNK-SNAPSHOT.jar
pentaho-cdf-TRUNK-SNAPSHOT.zip                           (CDF)

2. Put the jars into the lib directory of the Tomcat installation for the Pentaho BI Server. In my case it was “tomcat\webapps\pentaho\WEB-INF\lib”.

3. The zip files which you get from step 1 contain two sets of directory structures: one holds the plugin/configuration files for CDF/CDE/CDA and the other holds the samples.

4. The samples need to be placed under ‘/biserver/pentaho-solutions’ and the plugin files under ‘/biserver/pentaho-solutions/system’ (see the sketch below).
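
For reference, here is a rough batch sketch of the whole procedure. The temporary directory, BI Server path and 7-Zip location are my assumptions, so adjust them and the archive names to match what you actually downloaded:

rem Sketch only - paths, tool locations and archive names are assumptions
set TMPDIR=C:\ctools-temp
set BISERVER=C:\biserver-ce
set SEVENZIP="C:\Program Files\7-Zip\7z.exe"

rem 1. Copy the jars into the Pentaho web application
copy "%TMPDIR%\*.jar" "%BISERVER%\tomcat\webapps\pentaho\WEB-INF\lib"

rem 2. Remove any existing CDF/CDE/CDA plugin folders
rmdir /S /Q "%BISERVER%\pentaho-solutions\system\pentaho-cdf"
rmdir /S /Q "%BISERVER%\pentaho-solutions\system\pentaho-cdf-dd"
rmdir /S /Q "%BISERVER%\pentaho-solutions\system\cda"

rem 3. Unzip the plugin zips into pentaho-solutions\system ...
%SEVENZIP% x "%TMPDIR%\pentaho-cdf-TRUNK-SNAPSHOT.zip" -o"%BISERVER%\pentaho-solutions\system"
%SEVENZIP% x "%TMPDIR%\pentaho-cdf-dd-TRUNK-20111028.zip" -o"%BISERVER%\pentaho-solutions\system"
%SEVENZIP% x "%TMPDIR%\cda-TRUNK-20111013.zip" -o"%BISERVER%\pentaho-solutions\system"

rem 4. ... and the sample zips into pentaho-solutions
%SEVENZIP% x "%TMPDIR%\pentaho-cdf-dd-solution-TRUNK-20111028.zip" -o"%BISERVER%\pentaho-solutions"
%SEVENZIP% x "%TMPDIR%\cda-samples-TRUNK-20111013.zip" -o"%BISERVER%\pentaho-solutions"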

It took me 20 minutes at most to set this up. I started the BI Server and lived “happily ever after”.

It’s no big deal, but it might save you some time and an unnecessary Cygwin setup.

Friday, November 4, 2011

Kettle Trials: Using Modified JavaScript Value to Generate Surrogate Key

 

In the last post I explained how I implemented a surrogate key using SSIS for dimension table updates.

I tried to do the same thing with Kettle and, unsurprisingly, it was very easy. Kettle gives you so many options for doing a task.

There are many ways to implement it; I’ve done it using the “Add Sequence” and “Modified JavaScript” steps.

1. The logic is to first get the last used surrogate key value. This can be done in two ways: either we use a database sequence (if the DB supports it) or we use Kettle’s own sequence generator. In this example I’ve used the DB sequence, as I wanted it to be similar to the SSIS exercise I did earlier, although the latter should be the better option as it has performance benefits.

2. Next, pass this value to a Kettle variable using the “Set Variables” step.

3. Save this transformation; it will act as a configuration transformation for the main transformation.

4. Create a new transformation which will do the actual loading.

5. Get the staging data using an input step (I’ve used a flat-file input to keep things simple).

6. Then I used the “Modified JavaScript” step.

In this step, first get the variable value which we set in step 2, then increment it for each row, and we’re done (see the sketch after these steps).

7. Now just pass the output of the above step to an output step (a DB table or a flat file).
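
For reference, here is a rough sketch of the script involved, assuming the configuration transformation stored the last used key in a variable named MAX_ID and that a separate start script initialises the counter (the names are mine, not a fixed convention):

// Start script (add a second script tab and mark it as the start script) - runs once
var counter = str2num(getVariable("MAX_ID", "0"), "#");

// Transform script - runs for every row
counter = counter + 1;
var surrogate_key = counter;  // expose surrogate_key as a new output field of the step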

Another alternative could have been to pick the max value from the primary key column of the table and assign it to the variable, but that can become a performance issue with larger tables.

Issues which I faced:

How do you use variables in the “Modified JavaScript” step? There is a nice method, “getVariable”, available under the “Special Functions” category, with which we can reference the variables used in the transformation. This is the thing I like most about Kettle: it is an ETL tool with a programmer’s mindset rather than a mere drag-and-drop tool.

Another issue was with formatting. In the JavaScript I had converted a few variables from string to number and vice versa without passing the optional formatting parameter.

So, a thing to note: always pass the format explicitly, otherwise Kettle will format the values as per its own logic.

So I was expecting “1234” and I got “1234.00” or “000000001234”.

str2num(MAX_ID, "#")        // string -> number, "#" means a plain integer, no decimals

num2str(strVarValue, "#")   // number -> string, no zero padding or decimals

clip_image001clip_image002

I kept getting the error “Sequence Not Found” when I validated the transformation, although the sequence was there. I tried the SQL editor available inside Kettle, and even it confirmed that the sequence existed. It was really annoying. I restarted the database and cleared the DB cache, but nothing worked.

The actual issue was that I had created my schema in a database containing multiple schemas, and while creating the connection in Kettle I had filtered on it (see the attached image).

But it seems Kettle and DB2 aren’t so friendly in this regard, hence the error.

The solution is to create an OS user with the same name as the schema and then create the objects in the schema using a connection under this new user’s credentials.

Job:

clip_image003

Setup Variable Transformation:

clip_image004

Set Variable and Add Sequence Step:

clip_image005clip_image006

Main Transformation:

clip_image007

Modified JavaScript Step

clip_image008

Wednesday, November 2, 2011

SSIS Trials : Implementing Surrogate Key in Dimensions

Yesterday I finally decided to try my hand at Microsoft’s ETL tool, SSIS. I have worked with various ETL tools, and this was the one I had never used. It was part of a PoC (proof of concept) I had to do. The objective was to read a text file and load it into a database. This was a fairly standard dimension-loading procedure, so I just had to extract the attributes from the flat file, transform them and then load them into the DB.

The interesting part was the creation of the surrogate key, or primary key. This is a really good way to gauge the capability of your ETL tool.

Informatica gives you many options: you can use the “Sequence Generator” transformation, a global variable, or a database sequence (if your DB supports it).

Pentaho Kettle allows you to use variables and database sequences to deal with this situation.

In SSIS you can do it with the help of an “Execute SQL Task” and a “Script Component” transformation. I used a simple logic:

1. Pick data from a source (a flat file in my case).

2. Get the max value of the surrogate key from the database table.

3. Pass it to a package variable.

4. Use a VB script (Script Component) to assign it to a script variable which is incremented for each row of data (a rough sketch follows after these steps).

5. Combine the result with the base columns using a Derived Column transformation.
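
For reference, a minimal sketch of what the Script Component code can look like, assuming a read-only variable named MaxID and an output column named SurrogateKey (both names are illustrative, and the generated wrapper classes may differ slightly between SSIS versions):

' Sketch only - not the exact code from my package
Public Class ScriptMain
    Inherits UserComponent

    Private nextKey As Integer

    Public Overrides Sub PreExecute()
        MyBase.PreExecute()
        ' Start from the last key fetched by the Execute SQL Task
        nextKey = CInt(Me.Variables.MaxID)
    End Sub

    Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)
        ' Increment once per incoming row and expose it as the new surrogate key
        nextKey += 1
        Row.SurrogateKey = nextKey
    End Sub
End Class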

I googled for a script (to increment the variable for each row) and found this blog post really useful.

There were a few hiccups which were actually quite frustrating, and it took me almost two hours to finally solve them.

Firstly, you should always match the data type of the variable with the return type of the database result set.

Suppose your query returns a SUM and the variable is of type String; you might get this error:

[Execute SQL Task] Error: An error occurred while assigning a value to variable "Max_ID": "The type of the value being assigned to variable "User::Max_ID" differs from the current variable type. Variables may not change type during execution. Variable types are strict, except for variables of type Object. ".

The next thing to take care of is the result set mapping in the Execute SQL Task editor. When you’re using an ODBC data source, you should set the result set name to 1, otherwise you might get this error:

“[Execute SQL Task] Error: An error occurred while assigning a value to variable "Max_Seq_ID": "Result column index 0 is not valid.".”

Then I tried changing it to the column name which I had used in the query, and it gave this error (I had connected to a DB2 database over ODBC):

[Execute SQL Task] Error: An error occurred while assigning a value to variable "Max_Seq_ID": "Value does not fall within the expected range.".

If you change the result set name to 1 with an ODBC connection, you’ll get:

“[Execute SQL Task] Error: An error occurred while assigning a value to variable "Max_Seq_ID": "[IBM][CLI Driver] CLI0122E Program type out of range. SQLSTATE=HY003".”

Then I found out that we need to use “1” as the result set name when using ODBC and “0” when using ADO.NET.

If you use the alias name from the query as the result set name with ADO.NET:

[Execute SQL Task] Error: An error occurred while assigning a value to variable "Max_Seq_ID": "Result binding by name "dasdas" is not supported for this connection type. ".

image013 image003

So many errors for one simple solution. I don’t want to sound critical: Informatica is easier when it comes to transformation usability, but it also comes at a price to match.

Anyway, once the Execute SQL Task was configured properly, I added a Data Flow Task to the package and then added a Script Component whose output goes to a Derived Column and then to a flat file (you can use a database table as well). In the Script Component editor, make the following changes:

image013

1. Add the variable we used in the Execute SQL Task to ReadOnlyVariables.

2. Add an output column under the “Inputs and Outputs” section with the correct data type and encoding.

3. Write the script with proper type casting between variables and columns.

4. Set the Pre-Compile flag to False, or you might get this error:

“The script component is configured to pre-compile the script, but binary code is not found. Please visit the IDE in Script Component Editor by clicking Design Script button to cause binary code to be generated.”

I used a Derived Column transformation to combine the base columns from the flat file with the surrogate key generated by the Script Component.

image007

 

Thursday, September 29, 2011

How to Synchronize your data to Backup server


Last year I realized the importance of data backups when one of my colleagues lost all of his critical data due to a hard disk failure and, as usual, he didn’t have any backup. So I thought it’s better to be safe than sorry. It then occurred to me that my IT department had already provided me with quota on the backup servers, which I had never utilized. IT provided a batch file (as most of the machines in my organization are on Windows), which goes like this:

xcopy c:\data\*.* h:\c-data /c /e /h /y
xcopy D:\data\*.* h:\d-data /c /e /h /y
@echo Backup is complete
@pause

I somehow didn’t like this logic and modified it to suit my needs.

echo Close Outlook
@pause
echo remove stale data
rem "echo Y" answers rmdir's "Are you sure (Y/N)?" prompt
echo Y | rmdir Old\c-data /S
echo Y | rmdir Old\d-data /S
@pause

echo Move Last backup to Old location
move c-data Old\
move d-data Old\
mkdir c-data
mkdir d-data

echo Start backup
rem "echo D" tells xcopy the target is a directory; output goes to backup.log
echo D | xcopy D:\Dev H:\d-data /c /e /h /y  >backup.log
echo D | xcopy "c:\Documents and Settings\manish.mishra\My Documents" "H:\c-data\My Documents" /c /E /h /y >>backup.log
echo D | xcopy "c:\Documents and Settings\manish.mishra\Desktop" H:\c-data\Desktop /c /e /h /y >>backup.log
@echo Backup is complete
@pause

I used this for a few weeks, but it still was not an optimal method because it copied the entire data set regardless of whether it had changed or not. It should be incremental.

So I googled and found a useful tool from Microsoft for this purpose.

It’s called “SyncToy”. I downloaded and installed it. It’s pretty simple to use and has a very intuitive GUI; I didn’t even bother to read all the FAQs and help manuals.

1. So first you need to “Create New Folder Pair”.

clip_image001

2. Next, select the folder you want backed up as the “Left Folder”.

clip_image001[4]

3. Then select the backup server location as the “Right Folder”.

4. Next comes the interesting part: SyncToy gives you an option to select what kind of synchronization you would like to perform.

clip_image002clip_image003 clip_image004

a. Synchronize: this option checks both the left and right folders for adds, updates, deletes and renames, and synchronizes them both ways. It can be useful if you want to use SyncToy as a development repository when you’re working on the same project from multiple machines.

b. Echo: this option only propagates changes from left to right, not vice versa. (I chose this one, as I don’t need to synchronize the backup server back to my local machine.)

c. Contribute: this one is similar to Echo, with the only difference being deletes: files which are deleted on the left are not deleted on the right.

You can change these options later as well.

5. Give this synchronization pair a meaningful name so that you can identify it later.

clip_image005

6. After this, the pair will be listed on the home screen of the tool, and you can select which pair you want to run.

7. Selecting the pair gives you other options as well, which you don’t get at the time of creating the pair.

clip_image006

8. You can also click on the Preview button, which shows a report of all the changes that will be made.

clip_image007

9. Click on Run and you’re done.

Monday, August 15, 2011

Prompts with Multi-Select and Conditional Cascading

Another issue with Cognos:
Today I learnt a new caveat with Cognos Report Studio. I had a business case where I needed to create two prompts, one cascading onto the other.

But the trick was that the cascading had to be conditional. For example, for a particular value (let’s say ‘YYY’) of Prompt1, Prompt2 should display values; otherwise it should stay blank. And to make it more complicated, Prompt1 was multi-select.
I am no expert with Cognos, so I started with my usual instincts and implemented it as follows:
1. Created Prompt1 with the multi-select option.
2. Created Prompt2 without the cascade.
3. Added a filter to the query of Prompt2: ?filter1? like ‘YYY’

Being a database guy, I thought this should work, and to an extent it did. The issue was that when I selected ‘YYY’ as the first selection and the rest of the values after it, everything worked fine.


clip_image001

The problem arises when ‘YYY’ is not the first selection; then the second prompt remains blank.


clip_image002

I couldn’t understand what the reason was. I changed the filter to ?filter1? like ‘%YYY%’, but still no result.

Then I thought that Cognos might not treat LIKE the way databases do, so I tried CONTAINS, but still the same result.
Then I tried the IN function, and it works like a charm: ‘YYY’ IN ?filter1?
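
To summarise, the detail filter on Prompt2’s query ended up like this (prompt and value names as above):

-- worked only when 'YYY' was the first selection:
?filter1? like 'YYY'
-- works regardless of selection order:
'YYY' IN ?filter1?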

It’s nothing great, just a minor issue which might spoil your Friday evening.

Hope this helps.

I’ll try to find the exact difference between “LIKE” and “CONTAINS”.

Friday, August 5, 2011

How to reference Table data in Excel

 

Recently I had an annoying little issue in Excel. Every month I have to update a workbook, which requires updating a source table with raw data in one of the sheets. The rest of the sheets reference this table for various calculations.

I’ll use an example to describe the issue more clearly.

image

I have to enter the raw data in the table that starts at the 5th row. There is a formula in cell G3 which sums the 3rd column of the table. The issue is that the raw data may change every month, and hence the cell range will also change. So for the month of June the formula range might be SUM(I6:I43), but for the next month I would have to update it depending on the number of rows in the raw data.

So that is the problem, and below is the solution.

First, make Excel recognize the data as a table, i.e. just making the header bold won’t tell Excel “Hey, I’m a table and I have such-and-such columns with this many rows.”

To do this, select the header and the row next to it, then go to the Insert tab and click on Table.

image

That’s it. Now Excel will recognize your table.

Now comes the second part: how to reference this table in a formula.
Just type your regular formula, like “=SUM(”, then add the table name followed by “[”, like “=SUM(Rally[”.

As soon as you do this, Excel will show you all the columns in that table.

image

Just select the relevant column, close it with “])”, and that’s it. In the end your formula should look like:

=SUM(Rally[SP''s])

I hope this saves some precious time for my techno-geek community, who always want to do things in an optimized and efficient way.

Saturday, June 25, 2011

Kettle – Best Practices

 

During my ETL development work, there have been instances when I had to make a DDL change to the data mart tables, sometimes for an enhancement and sometimes because of bug fixes. This involves fixing the Kettle transformations as well. For this you need to follow a simple yet useful practice.

• Clear the database cache whenever you do a DDL operation on lookup or target tables, otherwise Spoon might mark the transformation as invalid on validation.

Whenever we introduce a table in Kettle, it keeps a cached copy of the table metadata in the repository to speed up viewing and editing of transformations. But it doesn’t refresh this metadata from the database every time we open a transformation, as that would be an expensive operation. For this reason we have to clear this cache manually whenever we make a change to the database tables.

• To do this, click on the target (or Dimension Lookup) step, then click on SQL, and then hit Clear Cache. It will ask whether to clear the cache only for this table or for the entire database.

image

Tuesday, June 21, 2011

Kettle – Best Practices

In tune with the last post, I’ll discuss more ETL best practices which I follow with Kettle.

• Enable caching for lookups on dimension tables.

This improves the performance of your transformation for small and medium-sized dimension tables, but you need to monitor the logs to check whether it is actually beneficial. In some cases it might add unnecessary load on the ETL server.

It’s always a dilemma for an ETL developer when it comes to caching dimension tables for lookups. Usually the right approach is to collect performance data for the ETL where caching is implemented. This data usually includes the time taken:

  • to fetch the dimension table into memory
  • to filter this data
  • to update the cache if the underlying lookup table data changes

Another good practice, which I learned from Informatica, is to place these cache files on faster drives.

image

Friday, June 10, 2011

Pentaho Data Integration : Powerful and Cost Effective ETL Solution

 

I would like to start this post with a little info about Pentaho Data Integration (PDI):

  • Pentaho Data Integration (PDI), initially called Kettle, is one of the most successful open source ETL tools available in the BI space. Initially Pentaho released only the community version of Kettle, but when it became popular they released the commercial version, called PDI. Kettle was first conceived about four years ago by Matt Casters, who needed a platform-independent ETL tool for his work as a BI consultant.
  • Pentaho includes most of the open source projects that comprise its BI suite: reporting based on JFreeReport (converted more than four years ago and now known as Pentaho Reporting); ETL based on Kettle; OLAP based on Mondrian (with a GUI based on JPivot and a more interactive component recently licensed from LucidEra); and advanced analytics based on Weka.
  • Pentaho recently announced plans to provide integration with Hadoop, which could potentially address some of the scalability concerns around open source BI tools.

Kettle Architecture

I researched a bit to find a visual representation of Kettle’s overall architecture and, no surprise, I found one on Matt Casters’s blog.

image

Kettle is built with the Java programming language. It consists of four distinct applications:

Spoon: a GUI tool to model the flow of data from input steps to output steps. One such model is also called a transformation.
Pan: a command-line tool that executes transformations modelled with Spoon.
Chef: a graphically oriented end-user tool used to model jobs. Jobs consist of job entries such as transformations, FTP downloads etc. that are placed in a flow of control.
Kitchen: a command-line tool used to execute jobs created with Chef (see the examples below).
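
For example, on Windows the two command-line tools can be invoked roughly like this (the file paths, repository name and credentials are purely illustrative):

rem Run a transformation saved as an XML file
Pan.bat /file:"C:\etl\load_customer_dim.ktr" /level:Basic

rem Run a job stored in a repository
Kitchen.bat /rep:"MyRepo" /user:admin /pass:admin /dir:/nightly /job:"load_dwh" /level:Basic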

A few key points about the architecture:

Model based: Spoon and Chef are used to create models, which are XML based and can be interpreted by the command-line tools, Pan and Kitchen.

Repository based: Kettle allows the models to be saved either in a database repository or as XML documents. You cannot mix the two methods (files and repository) in the same project.

Kettle – Best Practices

There are a few things which I’ve learned from my experience with Kettle that I feel can be termed best practices.

• Use variables instead of hard-coded strings in the SQL of input steps. You can set these variables in an externalized file, “kettle.properties”, in the Kettle home directory. Do remember to select the “Replace variables in script?” option.

• This allows you to adjust the business rules through configuration, without the need to change the ETL code (see the example below).
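
A minimal illustration, with an assumed variable name and query (both are just examples):

# kettle.properties (in the Kettle home directory)
MIN_ORDER_DATE = 2011-01-01

-- Table Input SQL, with "Replace variables in script?" checked
SELECT order_id, customer_id, order_date, amount
FROM   orders
WHERE  order_date >= '${MIN_ORDER_DATE}'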

image

I’ll continue with the best practices in the next post.

Saturday, April 30, 2011

How to make your Windows 7 computer act as a WiFi router

1. Go to Control Panel > Network and Sharing Center.
2. Click on “Set up a new connection or network”.
3. Select “Set up a wireless ad hoc (computer-to-computer) network”.
4. Enter the details (network name, security type and security key).

5. Done!!!
Your computer is now ready to act as a WiFi router.