Monday 4 February 2013

Using tJoin component to Output Main and Reject data


In the previous post here, I have shown you guys how to perform lookup operation in Talend using tJoin component. In that post we had only select main output data from the join output. But in today's post I will show you, How to fetch main output data as well as reject data where the match for the mentioned keys are not found.

I am again going to use the similar example using the tJoin component. tJoin joins two tables by doing an exact match on several columns. It compares columns from the main flow with reference columns from the lookup flow and outputs the main flow data and or the rejected data. This component helps you ensure the data quality of any source data against a reference data source. 

To demonstrate tJoin, I am going to create a Talend Job to perform an exact match on department ID column between the two files and output the main (matched) and Reject data.

Lets look at the Input Files:

Main Employee File - This is our main Input.


Reference (Lookup) File - We will use this file for reference against the main file and fetch department name by using deptID field.


Let’s create a new job and following steps mentioned below;

1.    First create the metadata of the main input delimited employee file by following the steps  mentioned in post “Generating Metadata for CVS (Delimited) using Talend Open Studio”. Metadata will be available under the Metadata section in the Repository pane. Metadata should look like following:


2.    Drag the metadata created in the Step 1 to the Job designer and select component tFileInputDelimited component.

3.    Drag component tFileInputDelimited to the job designer from Palette pane. Double click the component the edit the properties.

4.    Enter the path of the Lookup delimited file in the File name/Stream text box. Also provide other details like Row Separator ("\n") and Field Separators (";") as per specifications of the Lookup files.

5.    Click on the Edit Schema button (highlighted in blue in above screenshot) and enter schema details of the lookup file as below:

 
6. Now, its time to add tJoin and two tLogRow components from the Palette pane to the job designer. tJoin component will help to join (perform lookup) two data stream defined above and First tLogRow will be used to display the matched data and other will be used to display the reject data from the main stream where the match from lookup file is not found. 

7. Right-click on the main Input delimited file component, select Row > Main, and drop the connector onto the tJoin component. 

8.Right-click on the Lookup file component, select Row > Main, and again drop this onto the tJoin component. You'll notice that, even though we selected Row > Main in both cases, the resulting connector is slightly different. The first connector we made in step 7 is noted as Main, while the second is called Lookup. The order in which we connect the components to the tJoin is important. Join the primary data stream first, and one or more lookup components subsequently. 

If you join connections in the wrong order or wish to change which is the Main data flow and which is the Lookup data flow, you can do this by right-clicking on one of the connectors and selecting Set this connection as Lookup or Set this connection as Main.

9. Open the component properties of tJoin component and Click on Edit schema button.
 

Notice that there is no column in the schema of tJoin_1 section. We will add empID and empName column from tFileInputDelimited_1 section from left window.


Similarly, add deptName column from tFileInputDelimited_1 section from left window to tJoin_1 window.


10. Now we are done with schema of the tJoin component. In the component properties apply following settings:
i.              Click Checkbox “Include lookup columns in the output”. This will allow us to fetch the value of column from the reference (Lookup) file as well. (Of-course for the matching records)
ii.             In the “Column mapping” section, Click on the (+) button to add a name of the column and its mapping which will be fetched from lookup file. In our example, we are fetching deptName from the lookup file. Hence provide the mapping for it.
iii.            In the “Key Definition” section, Click on the (+) button to add a name of the column and its mapping which will be used to join both the file (main and the reference/lookup files).
iv.           Check the check-box “Inner Join (with output rejects)”, Since our aim is to catch the rejects where the match is not found in different stream.

  
11. Now, Right-click on the tJoin component, select Row > Main, and drop the connector onto the tLogRow_1 component. This tLogRow will be used to display the matched data in the output stream.
 
12. Similarly, Right-click on the tJoin component, select Row > Inner Join Rejects, and drop the connector onto the tLogRow_2 component. This will be used to display the reject data from the main stream where the match from lookup file is not found.

13. Now, Our job is complete. It’s time to run the Job and check the output.





In the output stream, we can see that Department name has been populated corresponding to dept id from the Lookup file. Also notice that in the output of tLogRow_1 only those records have been fetched where the match has been found in the lookup file. Similarly in the second tLogRow_2 output all those records has been displayed where match from the reference/lookup file has not been found.



You may also like to read..

4 comments:

  1. Thanks Vikram!

    One thing you missed out, we need to select the 'Mode' for the tLogRow components as 'Table (print values in cells of tables)' so as to see the output in a tabular form.

    ReplyDelete
  2. I am getting nulls for joining column if we add joining column in the output, please help me on the same

    ReplyDelete
  3. I have 2 source files 1 Have CUSTOMER_NAME,CITY_ID,CITY,CUSTOMER_ID :COLUMNS and another one have
    CUSTOMER_ID,STATE....I used as tjoin based on CUSTOMER_ID and I want to CUSTOMER_ID also in the output columns but getting nulls for CUSTOMER_ID columns :.-------------+-------+----+-----+-----------.
    | tLogRow_1 |
    |=------------+-------+----+-----+----------=|
    |CUSTMORE_NAME|CITY_ID|CITY|STATE|CUSTOMER_ID|
    |=------------+-------+----+-----+----------=|
    |CHANNA |13 |GNT |TN |null |
    |KOTIREDDY |13 |GNT |KA |null |
    |HARESH |15 |SAP |DE |null |
    |HASINI |15 |SAP |GN |null |
    '-------------+-------+----+-----+-----------'

    ReplyDelete
    Replies
    1. Please check the following two points mentioned above ---
      i. Click Checkbox “Include lookup columns in the output”. This will allow us to fetch the value of column from the reference (Lookup) file as well. (Of-course for the matching records)
      ii. In the “Column mapping” section, Click on the (+) button to add a name of the column and its mapping which will be fetched from lookup file. In our example, we are fetching deptName from the lookup file. Hence provide the mapping for it.

      Delete