Thursday 20 December 2012

Joining Excel worksheets having different schema using Talend Open Studio.



In the last post here we discussed, How to read Excel spreadsheet with multiple worksheets having same schema. Today I am going to create a job in which we are going to join data from two different worksheets in a single Excel document.

Below is the screenshot of the Input excel file having two worksheets, one for employee and other for department. Employee worksheet contains columns employee_id, employee_name and employee_dept_id. Department worksheet contains columns department_id and department_name.



Objective of this Job should be to create a output which contains department name corresponding to the employee department id.

1. Let's now create two metadata components from different worksheets in the same Excel spreadsheet. Follow the same steps that we have used previously here, but on step 2 of the wizard, ensure that the only one worksheet is checked, rather than All sheets. i.e. When creating a metadata for employee worksheet then select employee sheet in Step two of wizard. similarly while creating a metadata for department worksheet then select department sheet in step two of wizard.

When you are done with creating two metadata configurations for employee and department, then it will appear in the Repository pane under the metadata section as seen in the screenshot below:


2. Now drag both the metadata configurations (excel_department_schema and excel_employee_schema) to the Job designer pane.

3. From Palette pane, drag tMap and tLogRow components to Job designer.

4. Right click employee excel input and select Row > Main and drag it to tMap. Similarly right click department excel input and select Row > Main and drag it to tMap. Similarly connect tMap to tLogRow as shown in screenshot below.


You'll notice that, even though we selected Row > Main in both cases, the resulting connector is slightly different. The first connector we made in step 4 is noted as Main, while the second is called Lookup. The order in which we connect the components to the tMap is important. Join the primary data stream first, and one or more lookup components subsequently.

If you join connections in the wrong order or wish to change which is the Main data flow and which is the Lookup data flow, you can do this by right-clicking on one of the connectors and selecting Set this connection as Lookup or Set this connection as Main.

5. Double click tMap component  to open the Map Editor. Currently, there are not fields in the output link of tMap component as shown in the screenshot below:

6. Drag fields employee_id and employee_name from row1 and department_name from row2 links to output link.

Now we need to do is join the two input rows. department id is the common field and the key that joins the two sets of data. To join row1 and row2, simply click on employee_dept_id in row1 and drag it to the expression box of department_id in row2.

The component draws a line between the two department id fields to show the connection.

7. Now Run the Job and check the output. In the output stream, We can see that Department name has been populated corresponding to dept id from the Lookup file.



You guys can also test this job by changing the Join model to Inner to see what happens to records where the match in both the files is not found.

2 comments:

  1. This comment has been removed by the author.

    ReplyDelete
  2. I was trying to Join XML file and a CSV file tJoin but the job was failing with an error failed to convert String to Integer but I neither used string for join condition nor as an output.

    Are you aware of the error?
    Thx

    ReplyDelete