In this post I’ll discuss “How to Create Mappings/Transformations with Kettle”. I needed to load a set of flat files into a database staging area for my staging ETL. I created the mapping for this using various Kettle transformations (I don’t know why Kettle doesn’t call a set of related transformations a Mapping, like Informatica does).
I used the Text file Input transformation to read the tab-separated data. Then, using various filter transformations, I extracted the relevant data and loaded it into a database staging table.
Drag a Text file Input Trans from the Input category in the Explorer tab onto the mapping area. Configure the Text file Trans to read the source data. Configuration includes the file directory location, a regular expression for the file name, the data separator, field formatting, etc.
The next step is to add a filter Trans to select the relevant data from the flat file. The Filter Rows Trans is listed under the Utility category in the Explorer tab.
Now, to join two transformations, such as the Text file Trans to the Filter Rows Trans, you need a Hop. Just select the Trans from which the data flow will begin, middle-click with your mouse and drag to the Trans where the data should finally go.
Lastly, after a number of filter Trans we get the desired staging data, which we need to insert into the database table. For this we drag a Table Output Trans from the Output category in the Explorer tab. You need to configure this Trans to connect it to the respective table in the database, which requires the database connection information and the table name. You should check whether your Kettle installation has the relevant database drivers under “$Kettle_Root_Dir\data-integration\libext\JDBC\”; if not, you won’t be able to connect to the database and will need to copy the respective drivers to that location. For example, I had to copy “ojdbc14.jar” for the Oracle connection.
After this you need to verify the mapping which you’ve created. You can do this by pressing F11 or using the toolbar button. Once you’re done with the verification, you can test the mapping execution from the toolbar: you can either Play the Trans or Preview it.
In this console you can view all the Trans used in the mapping and their statistics. Kettle gives you the facility to debug the mapping as well; for debugging you use the Configure option. This functionality again compares well with Informatica. You can select the logging level too. The thing I liked most was the performance graph functionality, which can be really beneficial when working with large volumes of data. You can also log all the session details in a database repository. The only thing lacking here is metadata reporting, which Kettle doesn’t provide.
You can view all the execution-related metrics, like rows read, updated, deleted or inserted for each step, in the “Step Metrics” tab under the Execution Results pane, which you can open from the toolbar.
In the next post I’ll discuss how to create and execute Kettle Jobs.
In this post I will discuss my experience with Kettle Output Transformations. Like the Input Transformations, Kettle has plenty of Output Transformations as well. Most of them are similar to their Input counterparts; I’ve listed the ones which are different below.
1. Delete: Allows you to delete data from a database table depending upon stream values.
2. Excel Output: Allows you to write to a specific sheet in an Excel file. It also gives you the option to use a template.
3. Insert / Update: Another useful transformation, ideally suited for data warehouse applications. You can select the columns on which you want to look up values, and the corresponding columns which will be updated on a match or inserted otherwise.
4. SQL File Output: Generates SQL INSERT statements, with the option to create or truncate the table alongside.
5. Synchronize after merge: As per the Kettle wiki, this transformation should be used in conjunction with the Merge Rows transformation. It commits the changes which the Merge Rows transformation marks as Changed, New, Deleted, etc. As per my understanding it’s similar to Informatica’s Update Strategy.
6. Update: Similar to the Insert / Update transformation, but for the specific case where you just want to update records and skip the inserts.
In this post I’ll write about my experiments with ETL transformations in Kettle. For the POC, I needed to load flat files and XML files, and then do the staging-to-fact load as well. I’ll start with the basics of the Kettle UI and then discuss how I achieved the above-mentioned ETL.
When you open Spoon you get a nice Tip dialog box, which can sometimes give you a good tip, but not always; usually it’s the regular stuff, which you can figure out yourself.
The next screen is the Welcome screen, which has links to all the documentation, samples and Kettle-related blogs. The main screen is divided into two parts, a navigator and the workspace. The navigator has two separate tabs, Design and View. Design lists all the transformations and View displays the transformation you are currently working with.
Different Transformations
Kettle provides loads of transformations for your day-to-day ETL needs. All the transformations have been divided into various categories depending upon their usage. I don’t always find this grouping useful, as it can be a real pain in the neck for some unusually placed transformations; for example, the dimension lookup sits in the Data Warehouse category instead of Lookup. Maybe this dislike is because of my Informatica mindset.
Another issue with Kettle is that there is no standard documentation for the transformations, although I shouldn’t expect this from an open source tool; maybe that’s how Pentaho earns its dollars, by providing support and training for these transformations.
List of Kettle Transformations
I created a list of Kettle transformations while doing research for my project, which I think might help others who have a similar need.
Input Transformations
1. Access Input: To get data from Access database files.
2. CSV File Input: To read delimited data from CSV files.
3. Data Grid: To insert constant data into the flow, e.g. when you want to populate a dimension table with static master data.
4. De-serialize from file: To read data from a Cube file, usually created by Kettle with the "Serialize to file" output transformation.
5. ESRI Shapefile Reader: A custom plugin developed to read shape files created with ESRI tools, which are used to distribute geographical information.
6. Excel Input: Gets data from Excel files. It’s a useful and capable transformation for reading complex Excel files as well. I found it better than Informatica’s Excel reader transformation.
7. Fixed file input: Another variant of the CSV file input where the data file has fixed-width fields.
8. Get data from XML: A really useful transformation. It lets you get data from an XML file, depending upon the level or XPath node you select. You can read multiple files with regular expression support.
9. Get System Info: Another useful transformation which lets you get system information (memory), software versions, and date-related values such as the first or last day of the week.
10. Mondrian Input: To get data from a Mondrian cube.
11. OLAP Input: Another plugin to read OLAP data.
12. Property Input: To read key-value pair type data.
13. RSS Input: A really unique transformation. It lets you get data from RSS feeds. You can specify the time frame from which you want your feeds to be refreshed as well.
14. Salesforce Input: Similar to what other ETL tools provide for working with Salesforce data. You just need to provide the Salesforce link, credentials and the module from which you want to fetch the data.
15. SAP Input: Similar to the Salesforce transformation, but it needs extra configuration. You may need to download a few extra jars from the Pentaho SVN.
16. Table input: The simplest and most useful input transformation.
17. Text file input: To get data from flat files with fixed or delimited fields. You can add multiple files or just give a regular expression for the file name.
There are a few other transformations as well, which are either very application specific or for which I couldn’t find a practical use in my experience. These are mostly plugins developed by ETL developers for very specific applications.
A few common and distinctive features of the input transformations are worth mentioning:
Lazy Conversion is an option which Kettle offers on certain transformations (CSV and Fixed Input) where it doesn’t make sense to unnecessarily convert data types within the ETL. The standard scenario is where the source and the target are of the same type, like text files or the same database schema. It also facilitates binary mode for data movement.
Regular Expression: All of my input flat files had similar names, so instead of adding them one by one, I gave the regular expression “req_uid*.*” and the directory where all my files were located.
Add Filename to Result: This can be useful in cases where we need to filter data based upon the filename. For example, we get similar files from different regions and need to process them differently. All we need to do is name them differently, and we can then add our logic based upon the data fields.
In the next post I’ll discuss Output Transformations.
A couple of months back I needed to do a POC (Proof of Concept) for our BI group with open source technology. After some quick analysis we decided to use the Pentaho BI suite. Kettle, Pentaho’s ETL tool, was the leading open source ETL solution, plus we had an experienced resource for Kettle as well. Regarding BI, we had a few Java resources who had BI exposure (with BIRT), so we decided to try something new and finalized Pentaho for BI as well.
ETL with Kettle
I have previously worked on Informatica for ETL, so I had the mindset of comparing every other ETL tool with Informatica. But I knew that Kettle is open source, so I didn’t raise my expectations too high. I downloaded the software from SourceForge, although you can also try the paid Pentaho version called Enterprise Edition, which comes with a 30-day trial. I’ll discuss more about the features of the Enterprise Edition later along with the BI discussion. I couldn’t find any difference in the ETL part between the open source and Enterprise editions, except that they call the paid version “Pentaho Data Integration”.
So after all the setup I started exploring Kettle. I unzipped the software and searched for a setup.exe or MSI file to start the installation, but I was surprised to find just a few scripts (bat files) at the root. Secondly, you don’t have to download a different version for Linux; a similar set of scripts (in sh format) is provided for Linux as well. I really liked that feature, so plus one to the score for Kettle.
After that I tried to figure out which script I should use to start the ETL tool. There were three batch files, Spoon, Kitchen and Pan; strange names, even funny. Then I searched the net and found this wiki. It’s a really useful link for getting yourself familiarized with Kettle, with some basic transformation examples as well. So now I knew which tool to use for what:
Kitchen: the tool used to execute Jobs from a terminal window. Limited functionality compared to Informatica Workflow Manager.
Pan: allows you to execute Transformations from a terminal window.
Spoon: the graphical tool with which you design various Transformations and test them. Similar to Informatica Mapping Designer.
It works like this: create a Transformation with Spoon, test it and save it in the repository. Use Pan to run a single Transformation from the command prompt. Kitchen is used to execute a Job, which is created to run a set of inter-dependent Transformations.
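As a rough sketch of how the two command-line tools are typically invoked (the file names and paths below are just examples, and the option syntax can differ slightly between Kettle versions):
pan.bat /file:C:\etl\load_staging.ktr /level:Basic
kitchen.bat /file:C:\etl\nightly_load.kjb /level:Basic
The /file option points to a saved .ktr transformation or .kjb job, and /level controls the logging detail.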
There are a few more scripts present alongside the above-mentioned ones, like Encr and Carte.
Carte: Carte is a simple web server that allows you to execute transformations and jobs remotely. I really liked this utility. It is similar to Informatica’s Workflow Monitor or Web Admin Console, although not as robust or versatile as Workflow Monitor. It gives you a basic level of user security as well.
Encr: This is another useful tool for enhanced security; it lets you encrypt your passwords. Carte allows you to store passwords in an external file, which I found very strange from a security perspective, but then I used Encr, which allows you to encrypt the password before storing it in the file.
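For example (the password here is obviously a placeholder; check the Encr help output of your version for the exact options), obfuscating a password for the Carte password file looks something like:
encr.bat -carte MySecretPassword
The printed string can then be pasted into the external password file instead of the clear-text password.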
So after this brief tour of the Kettle utilities, I started Spoon, but nothing happened; a command prompt appeared and vanished. Then I opened a command prompt myself and started Spoon.bat. This time it gave me a proper error message saying that I needed to set the Java home on my machine. Kettle requires the Sun Java Runtime Environment (JRE) version 1.5 or newer, which you can get from the Java site or from your local IT department. You need to set the environment variable JAVA_HOME correctly.
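For instance (the install paths below are just examples; point the variable at whatever JRE you actually have):
set JAVA_HOME=C:\Program Files\Java\jre1.6.0        (Windows)
export JAVA_HOME=/usr/lib/jvm/java-6-sun            (Linux)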
After this I was able to start Spoon. Now you have to decide what kind of method you want to use to save the Kettle metadata, i.e. the repository. Kettle allows you two ways to do this: one is file based and the other, as usual, a database. The benefits are obvious: if you’re working in a distributed environment, use the database solution, otherwise the file-based one. Secondly, the database solution allows user security as well.
Database Repository:
File based Repository:
In the next post I’ll discuss my experience with various Transformations which I used with Kettle.
There are various pragma directives available in Oracle PL/SQL. I read about them recently and thought of listing the useful ones on my blog. Out of these four, I’ve used just two in real-life cases: EXCEPTION_INIT and AUTONOMOUS_TRANSACTION.
RESTRICT_REFERENCES Pragma
To be callable from SQL statements, a stored function must obey certain purity rules, which control side-effects. The fewer side-effects a function has, the better it can be optimized within a query, particularly when the PARALLEL_ENABLE or DETERMINISTIC hints are used. The same rules that apply to the function itself also apply to any functions or procedures that it calls.
If any SQL statement inside the function body violates a rule, you get an error at run time (when the statement is parsed). To check for violations of the rules at compile time, you can use the compiler directive PRAGMA RESTRICT_REFERENCES. This pragma asserts that a function does not read and/or write database tables and/or package variables. Functions that do any of these read or write operations are difficult to optimize, because any call might produce different results or encounter errors.
DEFAULT: Specifies that the pragma applies to all subprograms in the package spec or object type spec. You can still declare the pragma for individual subprograms. Such pragmas override the default pragma.
function_name: A user-defined function or procedure.
PRAGMA: Signifies that the statement is a compiler directive. Pragmas are processed at compile time, not at run time. They do not affect the meaning of a program; they convey information to the compiler.
RNDS: Asserts that the subprogram reads no database state (does not query database tables).
RNPS: Asserts that the subprogram reads no package state (does not reference the values of packaged variables)
TRUST: Asserts that the subprogram can be trusted not to violate one or more rules. This value is needed for functions written in C or Java that are called from PL/SQL, since PL/SQL cannot verify them at run time.
WNDS: Asserts that the subprogram writes no database state (does not modify tables).
WNPS: Asserts that the subprogram writes no package state (does not change the values of packaged variables).
You can declare the pragma RESTRICT_REFERENCES only in a package spec or object type spec. You can specify up to four constraints (RNDS, RNPS, WNDS, and WNPS) in any order. To call a function from parallel queries, you must specify all four constraints. No constraint implies another. Typically, this pragma is specified for functions. If a function calls procedures, then specify the pragma for those procedures as well.
When you specify TRUST, the function body is not checked for violations of the constraints listed in the pragma. The function is trusted not to violate them. Skipping these checks can improve performance.
If you specify DEFAULT instead of a subprogram name, the pragma applies to all subprograms in the package spec or object type spec (including the system-defined constructor for object types). You can still declare the pragma for individual subprograms, overriding the default pragma.
A RESTRICT_REFERENCES pragma can apply to only one subprogram declaration. A pragma that references the name of overloaded subprograms always applies to the most recent subprogram declaration.
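A minimal sketch of how the pragma is typically declared in a package spec (the package and function names here are made up purely for illustration):
CREATE OR REPLACE PACKAGE finance_pkg AS
  -- A pure function: asserts it writes no database state and no package state
  FUNCTION credit_ok (p_balance NUMBER) RETURN VARCHAR2;
  PRAGMA RESTRICT_REFERENCES (credit_ok, WNDS, WNPS);
END finance_pkg;
/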
AUTONOMOUS_TRANSACTION Pragma
The AUTONOMOUS_TRANSACTION pragma changes the way a subprogram works within a transaction. A subprogram marked with this pragma can do SQL operations and commit or roll back those operations, without committing or rolling back the data in the main transaction.
Syntax
PRAGMA AUTONOMOUS_TRANSACTION;
Usage Notes
You can apply this pragma to:
■ Top-level (not nested) anonymous PL/SQL blocks
■ Local, standalone, and packaged functions and procedures
■ Methods of a SQL object type
■ Database triggers
You cannot apply this pragma to an entire package or an entire object type. Instead, you can apply the pragma to each packaged subprogram or object method. You can code the pragma anywhere in the declarative section; for readability, code the pragma at the top of the section.
Once started, an autonomous transaction is fully independent. It shares no locks, resources, or commit-dependencies with the main transaction. You can log events, increment retry counters, and so on, even if the main transaction rolls back. Unlike regular triggers, autonomous triggers can contain transaction control statements such as COMMIT and ROLLBACK, and can issue DDL statements (such as CREATE and DROP) through the EXECUTE IMMEDIATE statement.
Changes made by an autonomous transaction become visible to other transactions when the autonomous transaction commits. The changes also become visible to the main transaction when it resumes, but only if its isolation level is set to READ COMMITTED (the default). If you set the isolation level of the main transaction to SERIALIZABLE, changes made by its autonomous transactions are not visible to the main transaction when it resumes.
In the main transaction, rolling back to a savepoint located before the call to the autonomous subprogram does not roll back the autonomous transaction. Remember, autonomous transactions are fully independent of the main transaction. If an autonomous transaction attempts to access a resource held by the main transaction (which cannot resume until the autonomous routine exits), a deadlock can occur. Oracle raises an exception in the autonomous transaction, which is rolled back if the exception goes unhandled.
If you try to exit an active autonomous transaction without committing or rolling back, Oracle raises an exception. If the exception goes unhandled, or if the transaction ends because of some other unhandled exception, the transaction is rolled back.
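A small sketch of the classic use case, an error-logging procedure that commits its own insert even if the caller later rolls back (the table and procedure names are made up for illustration):
CREATE OR REPLACE PROCEDURE log_error (p_msg VARCHAR2) IS
  PRAGMA AUTONOMOUS_TRANSACTION;
BEGIN
  INSERT INTO error_log (logged_on, message) VALUES (SYSDATE, p_msg);
  COMMIT;  -- commits only the autonomous transaction, not the caller's work
END log_error;
/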
EXCEPTION_INIT Pragma
The pragma EXCEPTION_INIT associates an exception name with an Oracle error number. You can intercept any ORA- error and write a specific handler for it instead of using the OTHERS handler.
Example
DECLARE
deadlock_detected EXCEPTION;
PRAGMA EXCEPTION_INIT(deadlock_detected, -60);
BEGIN
NULL; -- Some operation that causes an ORA-00060 error
EXCEPTION
WHEN deadlock_detected THEN
NULL; -- handle the error
END;
/
Keyword and Parameter Description
error_number: Any valid Oracle error number. These are the same error numbers (always negative) returned by the function SQLCODE.
exception_name: A user-defined exception declared within the current scope.
Usage Notes
You can use EXCEPTION_INIT in the declarative part of any PL/SQL block, subprogram, or package. The pragma must appear in the same declarative part as its associated exception, somewhere after the exception declaration. Be sure to assign only one exception name to an error number.
SERIALLY_REUSABLE Pragma
The pragma SERIALLY_REUSABLE indicates that the package state is needed only for the duration of one call to the server. An example could be an OCI call to the database or a stored procedure call through a database link. After this call, the storage for the package variables can be reused, reducing the memory overhead for long-running sessions.
This pragma means that every individual call to the code is treated as a new call. That means that no package states or variables are retained.
Syntax
PRAGMA SERIALLY_REUSABLE;
Here's an extreme example of how you might want to use it. Let's say you have a process that takes 2 hours to run. The first thing the process does is call a procedure that requires 500M of memory, but only runs for 30 seconds. Using normal PL/SQL, that memory might stay allocated for 2 hours until the whole process completes. Using this pragma, the memory can be released as soon as the procedure completes, allowing it to be reused while the next 2 hours of processing is completed.
Usage Notes
This pragma is appropriate for packages that declare large temporary work areas that are used once and not needed during subsequent database calls in the same session. You can mark a bodiless package as serially reusable. If a package has a spec and body, you must mark both.
You cannot mark only the body. The global memory for serially reusable packages is pooled in the System Global Area (SGA), not allocated to individual users in the User Global Area (UGA). That way, the package work area can be reused. When the call to the server ends, the memory is returned to the pool. Each time the package is reused, its public variables are initialized to their default values or to NULL.
Serially reusable packages cannot be accessed from database triggers or other PL/SQL subprograms that are called from SQL statements. If you try, Oracle generates an error.
Pragma SERIALLY_REUSABLE is responsible for removing a package’s temporary area (for example, global variables) after each call, and allows (re)compiling a new version of that package without causing an exception such as "ORA-04061: existing state of package has been invalidated". Similar behavior can be achieved by omitting global public (and private) variables from the specification and body of a package. The downside of this pragma is that no package function can be used in a SQL statement.
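A minimal sketch of a serially reusable package; both the spec and the body carry the pragma, and the names are made up for illustration:
CREATE OR REPLACE PACKAGE scratch_pkg IS
  PRAGMA SERIALLY_REUSABLE;
  g_counter NUMBER := 0;  -- re-initialized to its default on every new server call
  PROCEDURE bump;
END scratch_pkg;
/
CREATE OR REPLACE PACKAGE BODY scratch_pkg IS
  PRAGMA SERIALLY_REUSABLE;
  PROCEDURE bump IS
  BEGIN
    g_counter := g_counter + 1;
  END bump;
END scratch_pkg;
/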
1. Identify the VO to be extended. This is the first thing that you have to determine.
Suppose you want to add an address field to the following page: go to the "About this Page" link and find out the VO associated with that region.
In our case we will extend EmployeeFullVO.
2. Create a new workspace and a new project.
Now go to JDeveloper and create a new workspace and a new project.
Add the server.xml of the package where the VO to be extended is present to your project. This will add the complete package to your new project.
Add any other dependent packages if required.
3. In the new project, create a new Business Component with the default package as xxx.oracle.apps..... (xxx being your identifier).
In our case we have taken xxx as xtendEmp.
4. Import the Business Component in which the VO to be extended is present, and import any other dependent components if required.
5. Create a new VO in the newly created package, extending the base VO.
Create a new VO with the additional address description field along with all the existing fields in the base VO.
This new VO should extend the base VO.
6. Click on the .jpx file and add the substitution.
In our case, click on newEmp.jpx and add the substitution.
7. Import the substitution by running the JPXImporter command. While running this command, make sure that oracle.jrad.tools.xml.importer.JPXImporter is present in the classpath.
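The typical invocation looks something like the following (the .jpx path, credentials and connect string are placeholders; adjust them for your own instance):
java oracle.jrad.tools.xml.importer.JPXImporter /home/xtendEmp/newEmp.jpx -username apps -password <apps_password> -dbconnection "(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=<host>)(PORT=<port>))(CONNECT_DATA=(SID=<sid>)))"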
8. Upload the .java and .xml files to the instance.
Either upload the .java and .xml files to the instance directly, or create a jar file and add it as the first element in the classpath.
In our case we created a jar file and placed it in the classpath.
9. Bounce the Apache server.
10. Create a new item on the page through personalization and attach the newly added field to it.
Now go back to the screen where we wanted to add the Address Description field.
Click on the Personalize link, and click on the Create Item icon under the desired region.
Create the new Address item. Provide all properties: ID -- Address
Data Type -- Varchar2
Prompt -- Address Description
CSS Style -- OraDataText
View Attribute -- Description (Be careful, it is case sensitive. It should be exactly the same as you created it in the new VO)
View Instance -- EmployeeFullVO1 (VO instance name of base VO)
Steps: Workflow Builder Related Changes
1. Create a workflow with a process like the one below.
You can see three functions. Each denotes a page.
Note: Each function should have an attribute with internal name OAPAGE and value similar to "page=/oracle/apps/ak/employee/webui/EmployeePG". Invoking OANavigation.getNextPage will return this page.
Example:
OAF Related Changes
2. When the user clicks the Update link (as shown in the above image), we start the workflow and navigate the user to the next page using the following code.
---------------------------------------------------------
String nextPage = null;
String wfkey = null;
String address_Id = pageContext.getParameter("addressId");
String address_Name = pageContext.getParameter("addressName");
HashMap params = new HashMap();
params.put("addressId", address_Id);
params.put("addressName", address_Name);
if ("update".equals(pageContext.getParameter(EVENT_PARAM))) // check if update event is fired
{
if(!OANavigation.isWorkflowInProgress(pageContext)) // starting a new workflow only if there is no wf associated with existing pageContext.
{
wfkey = am.invokeMethod("getNewWFId").toString();
OANavigation.createProcess(pageContext,"MANADDUP","UPDATE",wfkey); // creating new workflow process
OANavigation.startProcess(pageContext,"MANADDUP","UPDATE",wfkey);
}
nextPage = OANavigation.getNextPage(pageContext,"MANADDUP",wfkey,"UPDATE",false); // this will give the value of next page which we have added in function attributes
pageContext.setForwardURL(pageContext.getApplicationJSP()+"?"+nextPage,
null,
OAWebBeanConstants.KEEP_MENU_CONTEXT,
null,
params,
true, // Retain AM
OAWebBeanConstants.ADD_BREAD_CRUMB_NO,
// Do not display breadcrumbs
OAException.ERROR);
}
The above code goes in the CO of the page which holds the Update button.
3. In the CO of subsequent pages we can get the next page by providing the result of the current function.
---------------------------------------------------------
if ("next".equals(pageContext.getParameter(EVENT_PARAM)))
{
nextPage = OANavigation.getNextPage(pageContext,"NEXT"); // we are passing the result as NEXT
pageContext.setForwardURL(pageContext.getApplicationJSP()+"?"+nextPage,
null,
OAWebBeanConstants.KEEP_MENU_CONTEXT,
null,
null,
true, // Retain AM
OAWebBeanConstants.ADD_BREAD_CRUMB_NO, // Do not display breadcrumbs
OAException.ERROR);
}
---------------------------------------------------------
Similarly, if you pass the result as PREVIOUS then you will get the value of nextPage as the previous page.
4. In the CO of the last page, use the same code to move the WF ahead, but do not use the value of nextPage as it will be null.
---------------------------------------------------------
nextPage = OANavigation.getNextPage(pageContext,"SUBMIT"); // this will end the WF; the value of nextPage will be null
---------------------------------------------------------
Step 1:
Open the CO of any region in the page that needs to be refreshed.
Step 2:
In the CO write the following code to put a javascript on the page.
String onLoadFunction = "\n function onLoadFunction (refTime){ " +
"\n setTimeout(\"location.reload(true);\",refTime);" +
"\n } " ;
pageContext.putJavaScriptFunction("onLoadFunction", onLoadFunction);
Step 3:
Now, in the same CO use following code to get the root web bean of the page
OABodyBean rtBodyBean = (OABodyBean) pageContext.getRootWebBean();
Do not forget to import oracle.apps.fnd.framework.webui.beans.OABodyBean;
Step 4:
Now, call setOnload on the body bean to invoke the JavaScript function created above every time the body of the page is loaded.
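A one-line sketch of that call (the 30000 millisecond refresh interval is just an example value; pass whatever interval your page needs):
rtBodyBean.setOnload("onLoadFunction(30000)");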