Thursday, July 24, 2014

Learning MongoDB

I spent some time learning the basics of MongoDB. I started by searching the internet and found some very helpful links:
Try the MongoDB online (interactive tutorial): http://try.mongodb.org/
Little MongoDB book (PDF) : http://openmymind.net/mongodb.pdf

After going through the above links I gained some confidence, and I downloaded MongoDB from http://www.mongodb.org/downloads. Installing the database was very simple, and within minutes I had Mongo up and running on my MacBook.

Then I thought of creating a sample Spring application to learn Spring integration with Mongo. I followed the documentation at http://docs.spring.io/autorepo/docs/spring-data-mongodb/1.4.3.RELEASE/reference/html/index.html.

I created an application which fetches all users from the database. The code for the sample application is available at https://github.com/itsmanishagarwal/SpringMongoDb.
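For reference, a minimal sketch of the fetch-all-users part with Spring Data MongoDB might look like this (the class and collection names here are illustrative, not necessarily what the GitHub project uses):

import java.util.List;
import org.springframework.data.annotation.Id;
import org.springframework.data.mongodb.core.MongoTemplate;
import org.springframework.data.mongodb.core.mapping.Document;

// User.java - maps to a "users" collection in MongoDB
@Document(collection = "users")
public class User {
    @Id
    private String id;
    private String name;
    // getters and setters omitted for brevity
}

// UserDao.java - fetches every user document via MongoTemplate
public class UserDao {

    private final MongoTemplate mongoTemplate;

    public UserDao(MongoTemplate mongoTemplate) {
        this.mongoTemplate = mongoTemplate;
    }

    public List<User> getAllUsers() {
        return mongoTemplate.findAll(User.class);
    }
}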

Anybody can download the code and use it as per their needs.

~Manish

Why should we allow only single session per user?

Some time back, Mishra, Mittal and I were discussing some technical issues, and Mishra mentioned that he had a requirement to allow only one session per user. He implemented it without any problem, but then we started thinking about the possible reasons why people want to implement this requirement. After some thinking, discussing and googling we came up with the reasons below:
  1. Security. If we allow only one session per user, then on creation of a second session we can alert the user that there is already an active session and offer a way to kill the previous one. If the user is unaware of the previous session, this serves as a warning that his/her credentials may have been compromised, and the user can change them.
  2. Licensing. Some products are priced per number of users using the product, so preventing multiple sessions per user prevents misuse of the license.
  3. Product implementation. This is very specific to the product's requirements. If the application maintains some kind of per-user working state, then multiple sessions can mess it up.
 I will update this post if I find more reasons.

~Manish

Wednesday, June 11, 2014

Unable to add any directory into the watched directory chain in virgo tomcat server 3.0.3

Recently I had a requirement to add a directory to the watched directory chain in Virgo Tomcat Server (VTS 3.0.3). The requirement was to check whether certain jars are present in a particular directory and, if they are, add that directory programmatically (using a shell script) to the watched directory chain.
So, I created a shell script which looks like below:

if [ -f /usr/local/vts/external_jars/util* ] ; then
echo "Found the util jar under /usr/local/vts/external_jars"
sed -i '
/usr.watchDirectory=repository\/usr/ a\
external_jars.type=watched \
external_jars.watchDirectory=external_jars
' $VTS_HOME/config/org.eclipse.virgo.repository.properties
sed -i "s/usr,/usr,external_jars,/" $VTS_HOME/config/org.eclipse.virgo.repository.properties
fi
When I ran the script it worked fine and the configuration file got updated, but when I restarted my Virgo server it did not pick up my "external_jars" directory in the chain of watched directories.
I compared the changes with the other watched directories but could not find any clue.

Then, after struggling a lot, I realized that there was an extra space after the value of the property "external_jars.type". As soon as I removed the extra space, it worked.

Corrected script is:

if [ -f /usr/local/vts/external_jars/util* ] ; then
echo "Found the util jar under /usr/local/vts/external_jars"
sed -i '
/usr.watchDirectory=repository\/usr/ a\
external_jars.type=watched\
external_jars.watchDirectory=external_jars
' $VTS_HOME/config/org.eclipse.virgo.repository.properties
sed -i "s/usr,/usr,external_jars,/" $VTS_HOME/config/org.eclipse.virgo.repository.properties
fi
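After the corrected script has run, the two lines that the first sed appends should appear in org.eclipse.virgo.repository.properties exactly as below, with no trailing whitespace (the chain property that the second sed edits also gains an external_jars entry):

external_jars.type=watched
external_jars.watchDirectory=external_jars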

Just thought of sharing this info, so I blogged it.

Saturday, March 29, 2014

How to quickly setup ActiveMQ broker, publisher and subscriber

Last week I was working on an issue related to ActiveMQ messaging. During my debugging, the most painful part was having to start the entire application and then execute the test scenario just to test some functionality or feature in ActiveMQ. After spending some time I realized that, to speed up my debugging and analysis, I had to create a separate application/program which I could start and stop quickly after making changes. So, I created two programs:
ActiveMQPublisherTest: This program starts the ActiveMQ broker and then pushes messages into a queue.
ActiveMQSubscriberTest: This program listens to the ActiveMQ broker started by "ActiveMQPublisherTest" and receives the messages published by it.
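For a sense of what such a pair involves, here is a minimal standalone JMS sketch with an embedded broker (this is not the code from the repository below, which is wired up through XML configuration):

import javax.jms.Connection;
import javax.jms.MessageConsumer;
import javax.jms.MessageProducer;
import javax.jms.Queue;
import javax.jms.Session;
import javax.jms.TextMessage;
import org.apache.activemq.ActiveMQConnectionFactory;
import org.apache.activemq.broker.BrokerService;

public class ActiveMQQuickTest {
    public static void main(String[] args) throws Exception {
        // Start an embedded broker (no external installation needed).
        BrokerService broker = new BrokerService();
        broker.setPersistent(false);
        broker.addConnector("tcp://localhost:61616");
        broker.start();

        ActiveMQConnectionFactory factory =
                new ActiveMQConnectionFactory("tcp://localhost:61616");
        Connection connection = factory.createConnection();
        connection.start();
        Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
        Queue queue = session.createQueue("TEST.QUEUE");

        // Publisher side: push a message into the queue.
        MessageProducer producer = session.createProducer(queue);
        producer.send(session.createTextMessage("hello from publisher"));

        // Subscriber side: receive the message back.
        MessageConsumer consumer = session.createConsumer(queue);
        TextMessage received = (TextMessage) consumer.receive(5000);
        System.out.println("Received: " + received.getText());

        connection.close();
        broker.stop();
    }
}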

I have published the entire source code on GitHub at: https://github.com/itsmanishagarwal/ActiveMQTest

To use these programs you just need to change the XML files to point to your database.

Any suggestions or feedback is welcome.

~Manish

Thursday, March 20, 2014

How to verify if a file which belongs to an RPM is modified?

Recently I struggled to find a way to verify whether a file which belongs to an RPM has been modified. After searching a bit on Google I found that there is an option in the "rpm" tool to verify all the files, but there is no direct way to find out whether a particular file is modified. So, I decided to create a function which can help me do that.


function isFileModified {
  FILE=$1
  if rpm -Vf $FILE | grep $FILE >/dev/null 2>&1 && rpm -Vf $FILE | grep $FILE \
| awk -F" " '{print $1}' | grep -e ".*5.*" >/dev/null 2>&1; then
    return 0
  else
    return 1
  fi
}

Explanation:
rpm -Vf $FILE : Lists the verification failures for the package that owns $FILE, i.e. the files in that package which have been modified.
grep $FILE : Checks whether the file in question is in that list of modified files.
awk -F" " '{print $1}' : Extracts only the verification-status column for the file.
grep -e ".*5.*" : Checks whether the MD5 digest flag ("5") is set, i.e. the file's content has changed.

So, the function returns 0 if the file's MD5 digest has changed since it was installed by the RPM; otherwise it returns 1.
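A quick usage sketch (the path here is just an example):

if isFileModified /etc/ssh/sshd_config; then
  echo "sshd_config has been modified since its RPM was installed"
else
  echo "sshd_config is unchanged"
fi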

Thanks,
Agry

Tuesday, October 2, 2012

Override the properties in wro.properties

Sometimes there is a need to override some properties in the local environment to facilitate faster development. If you have wired up wro4j using Spring, it is very easy to override any property in the wro.properties file.

Just replace the wroProperties bean in applicationContext.xml with the code below and Spring will look for the property file in additional locations.
    <bean id="wroProperties"
        class="org.springframework.beans.factory.config.PropertiesFactoryBean">
        <property name="ignoreResourceNotFound" value="true"></property>
        <property name="locations">
            <list>
                <value>file:${catalina.home}/conf/wro.properties</value>
                <value>file:${catalina.home}/wro.properties</value>
                <value>file:${user.home}/wro.properties</value>
            </list>
        </property>
    </bean>

The property "<property name="ignoreResourceNotFound" value="true"></property>" ensures that bean creation will not fail even if a wro.properties file is missing. A property file at a location listed later overrides the ones listed before it; that is, if there is a property file in the user home, it will override the property files at the other locations.

In my development setup I have placed a wro.properties in my user home and set the managerFactoryClassName property to my custom class, which disables minimization.
(To disable minimization check my blog at: http://msquare-tech.blogspot.in/2012/10/disable-minimizing-resources-when-using.html)

~Manish


Disable minimizing the resources when using wro4j


After implementing wro4j in my application the performance of the pages improved, but there was one problem: because all the resources are now minified, it became difficult to debug JavaScript issues from Firebug.
I resolved the issue by following the steps below:

1. Extend the "DefaultGroupExtractor" class and override only one method:

public class CustomDefaultGroupExtractor extends DefaultGroupExtractor {

  /*
   * Never minimize the resources.
   */
  @Override
  public boolean isMinimized(HttpServletRequest request) {
    return false;
  }
}
2. Extend the "BaseWroManagerFactory" class and return the group extractor created in step 1:

public class CustomWroManagerFactory extends BaseWroManagerFactory {

  /*
   * Return the custom extractor created in step 1.
   */
  @Override
  protected GroupExtractor newGroupExtractor() {
    return new CustomDefaultGroupExtractor();
  }
}
3. In the wro.properties file add the manager factory class as below:
managerFactoryClassName=com.vmops.web.optimizer.CustomWroManagerFactory

4. Restart the server. Now you will see no resources are minimized.

~Manish

Wednesday, June 27, 2012

Implement wro4j in five steps.


WRO4J (Web Resource Optimizer for Java) is an awesome open source resource optimizer. I recently implemented it in my application, so here I am providing the steps I followed to implement it and the issues I faced.

Tools used: Maven

1. Add the Maven dependency for wro4j in your pom.xml as follows:
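For example (the version shown is just an illustration; use the wro4j release you are on):

<dependency>
    <groupId>ro.isdc.wro4j</groupId>
    <artifactId>wro4j-core</artifactId>
    <version>1.4.7</version>
</dependency>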

2. Add a filter in web.xml as follows:
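For example, mapping wro4j's WroFilter to /wro/* (the filter name itself is arbitrary):

<filter>
    <filter-name>WebResourceOptimizer</filter-name>
    <filter-class>ro.isdc.wro.http.WroFilter</filter-class>
</filter>
<filter-mapping>
    <filter-name>WebResourceOptimizer</filter-name>
    <url-pattern>/wro/*</url-pattern>
</filter-mapping>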

3. Under WEB-INF create a file named wro.xml with content as follows:
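For example, a single group named all (the resource paths are placeholders; point them at your own js and css files):

<groups xmlns="http://www.isdc.ro/wro">
    <group name="all">
        <js>/scripts/*.js</js>
        <css>/styles/*.css</css>
    </group>
</groups>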

This will build a combined js and a combined css at runtime from everything under the group all, and serve them as all.js and all.css respectively.

4. Under the same WEB-INF folder create another file, wro.properties, with content as follows:
debug=true
disableCache=true
gzipResources=true
jmxEnabled=false
preProcessors=semicolonAppender
postProcessors=jsMin,cssCompressor,cssMin

5. Open any new or existing JSP page and add a js call as follows:
and for css add the call as:
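For example (the /wro prefix assumes the filter mapping from step 2 and an application deployed at the root context):

<script type="text/javascript" src="/wro/all.js"></script>
<link rel="stylesheet" type="text/css" href="/wro/all.css" />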
That's it!! Now start your server and open your page.

I faced one issue while implementing wro4j, and it was due to a dependency of the wro4j jars. This version of wro4j requires commons-io 2.1, but because of some other dependency an older version of commons-io got loaded. I did not get any error, but I was getting empty results when calling /wro/all.js and /wro/all.css. So, be careful.

~Manish


Thursday, November 17, 2011

How to Install CTools on Windows

Last week I read a post from ‘Slawomir Chodnicki’ on “Creating Dashboards with CDE”. Thanks to him for a really useful post.

I wanted to install the CTools on my Pentaho environment. But the issue was that I was on Windows and the CTools installer is made for Unix/Linux environment.

Pedro Alves published a blog on How to install it on Unix and Yogaraj Khanal has used Cygwin to use it on Windows.

I am not a big fan of Cygwin, and due to IT restrictions it was a lengthy process to get Cygwin installed on my machine, so I tried to install CDF/CDE/CDA without a script. It's not that difficult, actually. You just need to follow the steps mentioned in the shell script "ctools-installer.sh".

Steps:

1. Get the CDF/CDE/CDA files from location mentioned under ‘downloadCDF/CDA/CDE’ functions.

2. Put them under a temporary directory.

3. Remove any existing CDE/CDF/CDA files.

4. Unzip them as per details given under function ‘installCDF/CDE/CDA’.

5. Copy them to respective location.

Things to remember:

1. You have to unzip the dist.zip files first. This will give you another set of zips and jars.

List of files:

CDA:
cda-samples-TRUNK-20111013.zip
cda-TRUNK-20111013.zip
cda-TRUNK-SNAPSHOT.jar

CDE (Dashboard Editor):
pentaho-cdf-dd-solution-TRUNK-20111028.zip
pentaho-cdf-dd-TRUNK-20111028.zip
pentaho-cdf-dd-TRUNK-SNAPSHOT.jar

CDF:
pentaho-cdf-TRUNK-SNAPSHOT.jar
pentaho-cdf-TRUNK-SNAPSHOT.zip

2. Put the jars into the lib directory under the Tomcat installation for the Pentaho BI Server. In my case it was "tomcat\webapps\pentaho\WEB-INF\lib".

3. The zip files which you get from step 1 contain two sets of directory structures: one is the configuration files for CDF/CDE/CDA and the other is the samples.

4. The samples need to be placed under '/biserver/pentaho-solutions' and the configuration files under '/biserver/pentaho-solutions/system'.

It took me 20 minutes max to set this up. I started the BI server and lived "happily ever after".

It's no big deal, but it might save you some time and an unnecessary Cygwin setup.

Friday, November 4, 2011

Kettle Trials: Using Modified JavaScript Value to Generate Surrogate Key

 

In the last Post I explained how I implemented Surrogate Key using SSIS, for Dimension Table Updates.

I tried to do the same thing with Kettle and was not surprised by how easy it was. Kettle gives you so many options to do a task.

There are many ways to implement it. I've implemented it using the "Add Sequence" and "Modified JavaScript" steps.

1. The logic is to first get the last used surrogate value. This can be done in two ways: either use a database sequence (if the DB supports it) or use Kettle's own sequence generator. In this example I've used the DB sequence, as I wanted this example to be similar to the SSIS exercise I did earlier, although the latter should be the better option as it has performance benefits.

2. Next, pass this value to a Kettle variable using the "Set Variable" step.

3. Save this transformation, which will act as a configuration transformation for the main transformation.

4. Create a New transformation which will do the Actual Loading.

5. Get the staging data using an input step (I've used a flat file input to keep things simple).

6. Then I used “Modified JavaScript” Step.

In this step, first get the variable value which we set in step 2, then increment it for each row, and we're done (a short sketch of this script follows after these steps).

7. Now just pass the output of the above step to an output step (a DB table or a flat file).
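A rough sketch of the script inside the Modified JavaScript step from step 6 (the variable name MAX_ID and the sequence field name seq are assumptions based on the setup described above; the new surrogate_key variable also has to be added to the step's output fields grid):

// MAX_ID comes from the configuration transformation's "Set Variable" step,
// seq comes from the "Add Sequence" step earlier in the stream.
var maxId = str2num(getVariable("MAX_ID", "0"), "#");
var surrogate_key = maxId + seq;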

Another alternative could have been to pick the max value from the primary key column of the table and assign it to the variable, but that can be a performance issue with larger tables.

Issues which I faced:

How to use variables in the "Modified JavaScript" step? There is a nice method, "getVariable", available under the "Special Functions" category, with which we can reference the variable we are using in the transformation. This is the thing I like most about Kettle: it is an ETL tool with a programmer's mindset rather than a mere drag-and-drop tool.

Another issue was with formatting. In the JavaScript I had converted a few variables from String to Number and vice versa without passing the optional formatting parameter.

A thing to note: always use the formatter explicitly, otherwise Kettle will format the variables as per its own logic.

So I was expecting "1234" and I got "1234.00" or "000000001234". Passing an explicit format mask fixes this:

str2num(MAX_ID, "#")

num2str(strVarValue, "#")


I kept getting the error "Sequence Not Found" when I validated the transformation, although the sequence was there. I tried the SQL editor available inside Kettle and even it verified that the sequence existed. It was really annoying. I restarted the database and cleared the DB cache, but nothing worked.

The actual issue was that I had created my schema in a database with multiple schemas in it, and while creating the connection in Kettle I had filtered on it.

But it seems Kettle and DB2 aren’t so friendly and that’s why the Error.

The solution to this issue is to create an OS user with the same name as the schema and then create the objects in the schema using a connection under this new user's credentials.

(Screenshots not reproduced here: the job, the setup-variable transformation, the Set Variable and Add Sequence steps, the main transformation, and the Modified JavaScript step.)

Wednesday, November 2, 2011

SSIS Trials : Implementing Surrogate Key in Dimensions

Yesterday I finally decided to try my hand at Microsoft's ETL tool, SSIS. I have worked with various ETL tools and this was the one I had never used. It was part of a PoC (Proof of Concept) I had to do. The objective was to read a text file and load it into a DB, a fairly standard dimension loading procedure. So I just had to extract the attributes from the flat file, transform them and then load them into the DB.

The interesting part was the creation of the surrogate key, or primary key. This is really a good way to analyze the capability of your ETL tool.

Informatica gives you many options: you can use the "Sequence Generator" transformation, a global variable, or a database sequence (if your DB supports it).

Pentaho-Kettle ETL allows you to use Variables and “Database Sequence” to deal with this situation.

In SSIS you can do it with the help of an "Execute SQL Task" and a "Script Component" transformation. The logic is simple:

1. Pick data from a Source (flat file in my case).

2. Get Max value of the Surrogate key from Database Table.

3. Pass it to a Solution Variable.

4. Use a VB script, which assigns it to a script variable that is incremented for each row of data.

5. Concatenate the Resultset using Derived Column Transformation.

I googled for the Script (to increase the variable For Each Row) and found this Blog really useful.

There were a few hiccups which were actually quite frustrating and took me almost 2 hours to finally solve.

Firstly, you should always match the datatype of the variable with the return type of the database resultset.

Suppose your query returns a SUM and the variable is of type String; you might get this error:

[Execute SQL Task] Error: An error occurred while assigning a value to variable "Max_ID": "The type of the value being assigned to variable "User::Max_ID" differs from the current variable type. Variables may not change type during execution. Variable types are strict, except for variables of type Object. ".

The next thing to take care of is the Resultset mapping in the Execute SQL Task editor. When you're using an ODBC data source, you should use 1 as the Result Set name, otherwise you might get this error:

“[Execute SQL Task] Error: An error occurred while assigning a value to variable "Max_Seq_ID": "Result column index 0 is not valid.".”

Then I tried changing it to the column name which I had used in the query, and then it gives this error (I connected to a DB2 database with ODBC):

[Execute SQL Task] Error: An error occurred while assigning a value to variable "Max_Seq_ID": "Value does not fall within the expected range.".

If you change Resultset name to 1 with ODBC connection, you’ll get:

“[Execute SQL Task] Error: An error occurred while assigning a value to variable "Max_Seq_ID": "[IBM][CLI Driver] CLI0122E Program type out of range. SQLSTATE=HY003".”

Then I found out that we need to use “1” when using ODBC and use “0” with ADO.NET, as Resultset Name.

If you use Alias name from the Query in Resultset name with ADO.NET:

[Execute SQL Task] Error: An error occurred while assigning a value to variable "Max_Seq_ID": "Result binding by name "dasdas" is not supported for this connection type. ".


So many errors for one simple solution. I don't want to sound critical, but Informatica is easy in regards to transformation usability; then again, so is its price.

Anyway, once the Execute SQL Task was configured properly, I added a Data Flow Task to the solution and then added a Script Component with output to a Derived Column and then to a flat file (you can use a database table as well). In the Script Component editor, make the following changes:


1. Add the variable we used in the Execute SQL Task to ReadOnlyVariables.

2. Add an output column in the "Inputs and Outputs" section with the correct datatype and encoding.

3. Design the script with proper type casting between variables and columns (a sketch is shown after these steps).

4. Set the Pre-Compile flag to False, or you might get this error:

“The script component is configured to pre-compile the script, but binary code is not found. Please visit the IDE in Script Component Editor by clicking Design Script button to cause binary code to be generated.”
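For reference, the per-row script from step 3 typically looks something like the sketch below (SSIS 2005-style VB.NET; the variable accessor MaxID and the output column SurrogateKey are assumptions and must match what you configured in steps 1 and 2):

Public Class ScriptMain
    Inherits UserComponent

    ' Holds the running surrogate key value across rows.
    Private nextKey As Integer

    Public Overrides Sub PreExecute()
        MyBase.PreExecute()
        ' Seed the counter once from the variable filled by the Execute SQL Task.
        nextKey = CInt(Me.Variables.MaxID)
    End Sub

    Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)
        ' Increment for every incoming row and emit it as the surrogate key.
        nextKey = nextKey + 1
        Row.SurrogateKey = nextKey
    End Sub
End Class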

I used a Derived Column transformation to combine the base columns from the flat file with the surrogate key generated by the Script Component.


 

Thursday, September 29, 2011

How to Synchronize your data to Backup server


Last year I realized the importance of data backup when one of my colleagues lost all of his critical data due to a hard disk failure and, as usual, he didn't have any backup. So I thought that it's better to be safe than sorry. It then occurred to me that my IT department had already provided me quota on the backup servers, which I had never utilized. IT provided a batch file (as most of the machines in my organization are on Windows), which goes like this:

xcopy c:\data\*.* h:\c-data /c /e /h /y
xcopy D:\data\*.* h:\d-data /c /e /h /y
@echo Backup is complete
@pause

I somehow didn’t like this logic and modified it as per my convenience.

echo Close Outlook
@pause
echo remove stale data
echo Y | rmdir Old\c-data /S
echo Y | rmdir Old\d-data /S
@pause

echo Move Last backup to Old location
move c-data Old\
move d-data Old\
mkdir c-data
mkdir d-data

echo Start backup
echo D | xcopy D:\Dev H:\d-data /c /e /h /y  >backup.log
echo D | xcopy "c:\Documents and Settings\manish.mishra\My Documents" "H:\c-data\My Documents" /c /E /h /y >>backup.log
echo D | xcopy "c:\Documents and Settings\manish.mishra\Desktop" H:\c-data\Desktop /c /e /h /y >>backup.log
@echo Backup is complete
@pause

I used this for a few weeks, but it was still not an optimal method because it copied the entire data set irrespective of whether it had changed or not. It should be incremental.

So I Googled it and found a useful tool from Microsoft for this purpose.

It’s called “SyncToy”. I downloaded and installed it. It’s pretty simple to use and has a very intuitive GUI. I didn’t bother to read all the FAQ’s and Help manuals.

1. So first you need to “Create New Folder Pair”


2. Next, select the folder you want to back up as the "Left Folder".


3. Then select Backup Server location as “Right Folder”

4. Next comes the interesting part: "SyncToy" gives you an option to select what kind of synchronization you'd like to perform.


a. Synchronize: This option checks both the left and right folders for adds/updates/deletes/renames and synchronizes both of them. This option can be useful if you want to use SyncToy as a development repository, when you're working on the same project from multiple machines.

b. Echo: This option only propagates changes from left to right, not vice versa. (I chose this one, as I don't need to synchronize the backup server back to local.)

c. Contribute: This one is similar to Echo, with the only difference being deletes: files which are deleted on the left won't be deleted on the right.

You can change these options later as well.

5. Give this Synchronization pair a Meaningful name so that you can identify it later.


6. After this you’ll get this Pair listed in the Home Screen of the Tool and you can select which Pair you want to Run.

7. Selecting the pair will give you other options as well, which you don’t get at the time of Creating the Pair.


8. You can also click on Preview button which will show you a report displaying all the modified content.


9. Click on Run and you’re done.

Monday, August 15, 2011

Prompts with Multi-Select and Conditional Cascading

Another issue with Cognos:
Today I learnt a new caveat of Cognos Report Studio. I had a business case where I needed to create two prompts, one cascading onto the other.

But the trick was that the cascading had to be conditional. For example: for a particular value (let's say 'YYY') of Prompt1, Prompt2 should display values, otherwise it should remain blank. And to make it more complicated, Prompt1 was multi-select.
I am no expert with Cognos so I started with my usual instincts and implemented the situation as:
1. Created Prompt1 with Multi-Select option.
2. Created Prompt2 without the cascade.
3. Add a filter to the Query of Prompt2: ?filter1? like ‘YYY’

Being a database guy, I thought that this should work, and to an extent it did. The issue was that it worked fine only when I selected 'YYY' as the first selection, with the rest of the values after it.



The problem arises when 'YYY' is not the first selection; then the second prompt remains blank.



I couldn't understand what the reason was. I changed the filter to ?filter1? like '%YYY%' but still no result.

Then I thought that Cognos might not treat LIKE as databases do, so I tried CONTAINS, but still the same.
Then I tried the IN operator and it works like a charm: 'YYY' IN ?filter1?

It’s nothing great but just a minor issue which might spoil your Friday evenings.

Hope this helps.

I'll try to find the exact difference between "LIKE" and "CONTAINS".

Friday, August 5, 2011

How to reference Table data in Excel

 

Recently I had an annoying little issue in Excel. Every month I have to update a workbook, which requires updating a source table with raw data in one of the sheets. The rest of the sheets reference this table for various calculations.

I’ll use an example to describe the issue more clearly.


I have to enter the raw data in the table defined at the 5th row. There is a formula at cell G3 which requires a summation of the 3rd column of the table. The issue is that the raw data may change every month, and hence the cell range will also change. So, say, for the month of June the formula range might be SUM(I6:I43), but for the next month I have to update it depending on the number of rows in the raw data.

So this is the problem and below is the solution for it.

First, make Excel recognize the data as a Table, i.e. just making the header bold won't tell Excel "Hey, I'm a table and I have these columns with this many rows."

To do this, select the header and the row next to it, then select the Insert tab and click on Table.


That’s it. Now Excel will recognize your table.

Now comes the second part: how to reference this table in a formula.
For this, just type your regular formula, like "=SUM(", and then add the table name followed by "[", like "=SUM(Rally[".

As soon as you do this, Excel will show you all the columns in that table.


Just select the relevant column and close it with "])", and that's it. In the end your formula should look like:

=SUM(Rally[SP''s])

I hope it’ll save some precious time of my techno geek community who always wants to do stuff in an optimized and efficient way.

Saturday, June 25, 2011

Kettle – Best Practices

 

During my ETL development work, there have been instances when I had to make a DDL change in the data mart tables, sometimes for an enhancement and sometimes because of bug fixes. This involves fixing the Kettle transformation as well. For this you need to follow a simple yet useful practice.

• Clear the database cache whenever you do a DDL operation on lookup or target tables; otherwise Spoon might mark the transformation as invalid on validation.

Whenever we introduce a table in Kettle, it keeps a cached version of the table metadata in the repository so as to optimize transformation viewing/editing. But it doesn't refresh this metadata from the database every time we open the transformation, as that is an expensive operation. For this reason we have to clear this cache manually whenever we make a change to the database tables.

• To do this you need to click on the Target (or Dimension Lookup) step, then click on SQL and hit Clear Cache. It'll ask whether to clear the cache only for this table or for the entire database.


Tuesday, June 21, 2011

Kettle – Best Practices

In tune with the last post, I’ll discuss more about ETL Best practices which I follow with Kettle.

• Enable cache for lookup on Dimension tables.

This improves the performance of your transformation for small/medium-sized dimension tables, but you need to monitor the logs to check whether it is actually beneficial. In some cases it might add unnecessary load on the ETL server.

It's always a dilemma for an ETL developer when it comes to caching dimension tables while doing lookups. Usually the right approach is to collect performance data for the ETL where caching is implemented. This data usually includes the time taken:

  • To fetch Dimension table into memory
  • To filter this data
  • To update the Cache if underlying lookup table data gets changed.

Another good practice, which I learned from Informatica, is to place these cache files on faster drives.


Friday, June 10, 2011

Pentaho Data Integration : Powerful and Cost Effective ETL Solution

 

I would like to start this post with little info about Pentaho Data Integration(PDI):

  • Pentaho Data Integration (PDI), initially called Kettle, is one of the most successful open source ETL tools available in the BI space. Initially Pentaho released only the Community version of Kettle, but later, when it became popular, they released the commercial version called PDI. Kettle was first conceived about four years ago by Matt Casters, who needed a platform-independent ETL tool for his work as a BI consultant.
  • Pentaho includes most of the open source projects that comprise its BI suite: Reporting based on JFreeReports (converted more than four years ago and now known as Pentaho Reporting); ETL based on Kettle; OLAP based on Mondrian (with a GUI based on JPivot and a more interactive component that it recently licensed from Lucidera); and advanced analytics based on Weka.
  • Pentaho recently announced plans to provide integration with Hadoop, which can potentially address some of the scalability issues of open source BI tools.

Kettle Architecture

I researched a bit to get a visual representation of Kettle's overall architecture and, no surprise, I found one on Matt Casters's blog.


Kettle is built with the Java programming language. It consists of four distinct applications:

Spoon: Spoon is a GUI tool to model the flow of data from input step to output step. One such model is also called a transformation.
Pan: Pan is a command line tool that executes transformations modeled with Spoon.
Chef: Chef is a graphically oriented end-user tool used to model jobs. Jobs consist of job entries such as transformations, FTP downloads etc. that are placed in a flow of control.
Kitchen: Kitchen is a command line tool used to execute jobs created with Chef.

Few Key points about the Architecture:

Model based: Spoon and Chef are used to create models, which are XML based and can be interpreted by the command line tools Pan and Kitchen.

Repository based: Kettle allows the model to be saved either in a database repository or as XML documents. You cannot mix the two methods (files and repository) in the same project.

Kettle – Best Practices

There are a few things which I've learned from my experience with Kettle, which I feel can be termed best practices.

• Use variables instead of hardcoded strings in the Input SQL transformation. You can set these variables in an externalized file, kettle.properties, in the Kettle home directory. Do remember to select the "Replace variables in script" option.

•This allows you to configure the business rules without the need to change the ETL code.
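A tiny illustration (the variable and table names here are made up): in kettle.properties you might define

STAGING_SCHEMA=stg
MIN_ORDER_DATE=2011-01-01

and then, in the Table Input step's SQL (with "Replace variables in script" checked), refer to them as

SELECT * FROM ${STAGING_SCHEMA}.orders WHERE order_date >= '${MIN_ORDER_DATE}'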


I’ll continue on the Best Practices in the next post.

Saturday, April 30, 2011

How to make your Windows 7 computer as WiFi Router

1. Go to Control Panel > Network and Sharing Center.
2. Click on "Set up a new connection or network".
3. Select the option "Set up a wireless ad hoc (computer-to-computer) network".
4. Fill in the details (network name and security key).

5. Done!!!
Your computer is now ready to act as a WiFi router.


Thursday, December 9, 2010

BI with Pentaho Part-IV

In this post I'll discuss "How to Create Mappings/Transformations with Kettle". I needed to load a set of flat files into the database staging area for my staging ETL. I created the mapping for this using various Kettle transformations (I don't know why Kettle doesn't call a set of related transformations a Mapping, like Informatica does).

I used the Text file Input transformation to read the tab-separated data. Then, using various filter transformations, I extracted the relevant data and loaded it into a database staging table.


Drag a Text file Trans from the Input category in the Explorer tab to the mapping area. Configure the Text file Trans to get the source data. Configuration includes the file directory location, a regular expression for the file name, the data separator, field formatting, etc.


The next step is to add a filter Trans to select the relevant data from the flat file. The Filter Rows Trans is listed under the Utility category in the Explorer tab.

Now, to join two Trans, like the Text file Trans to Filter Rows, you need a Hop. Just select the Trans from which the data flow will begin, press the middle mouse button and drag to the Trans where the data should finally go.

Lastly, after a number of filter Trans, we get the desired staging data, which we need to insert into the database table. For this we drag a Table Output Trans from the Output category in the Explorer tab. You need to configure the Trans to connect it to the respective table in the database, which requires the database connection information and the table name. You also need to check whether your Kettle version has the relevant database drivers installed under "$Kettle_Root_Dir\data-integration\libext\JDBC\"; if not, you won't be able to connect to the database, and you need to copy the respective drivers to that location. For example, I had to copy "ojdbc14.jar" for an Oracle connection.


After this you need to verify the mapping which you've created. You can do this by pressing "F11" or using the toolbar button. Once you're done with verification, you can test the mapping execution from the toolbar: you can either Play the Trans or Preview it.


In the Execution Results console you can view all the Trans used in the mapping and their statistics. Kettle gives you the facility to debug the mapping as well; for debugging you need to use the Configure option. This functionality again puts Kettle on par with Informatica. You can select the logging level as well. The thing I liked most was the Performance Graph functionality, which can be really beneficial while working with large volumes of data. You can log all the session details in a database repository as well. The only thing lacking here is metadata reporting, which Kettle doesn't provide.

You can view all the execution-related metrics, like rows read, updated, deleted or inserted for each step, in the "Step Metrics" tab under the Execution Results tab, which you can select from the toolbar.


In the Next post I’ll discuss on how to Create and Execute Kettle Jobs.