Friday 14 December 2012

Transform XML file to CSV using Talend Open Studio


I have worked with several enterprise ETL (Extract Transform and Load) tools like Abinitio and Datastage. Recently I have started learning open source tool named Talend. Talend is leading open source tool which is mainly used in following areas:

1. Data Integration.
2. Data Quality
3. Master Data Management
4. Business Process Management

You can find more information on  www.talend.com.

Today I have created a job to read XML file having employee data and transform it to CSV.













Lets start with creating metadata for Input XML file. We have Input file having employees data. Root tag of the XML file is Employees and for every record it contains repeating Employee child tags. 

Input XML File:

Follow the steps mentioned below to create Metadata for the above input XML file:
  1. Right click the "File XML" under the metadata tab in the Repository pane. Click "Create File XML" option.
  2. Enter the name as xmltocsv_metadata in the dialog box and click Next.
  3. In the Step2 select InputXML radio box and click Next.
  4. In Step 3, browse the sample input XML file (shown above) and select the Encoding as UTF-8.Now you can see the relation of the XML tags in the schema viewer.
  5. Drag Employee tag to XPath loop expression and  Drag all the required tags to the "Feilds to Extract" section. Click on Refresh Preview to see the data in Tabular format. (refer screenshot below)
  6. Click Next and Modify the Column defination as per requirements and click Finish.
Now we have successfully created metadata for the Employees XML input file. Drag the xmltocsv_metadata we have created to Job designer. Talend will provide the list of components we can use with the required metadata. Select tFileInputXML component. 
Drag tFileOutputDelimited component to the Job designer. Right Click tFileInputXML component and select Row--> Main and drag the link to tFileOutputDelimited component.

Now we have completed out Job. Click on Run and Check the output. For your reference, I have provided the output as below:
  

Output CSV File:
101,Vikram,Arora,M,10,45000
102,Poonam,Arora,F,11,55000
103,Puneet,Arora,M,10,35000
104,Mayank,Arora,M,12,15000
105,Tirath,Arora,M,12,60000
104,Sunita,Arora,F,10,65000


We can make this Job little more complex by adding some transformation mapping rules. As an excercise, You guys can make the Talend Job based on following mapping transformations. You will have to use tMap component to apply these transformations. Please let me know, if you face any issues in developing the Job.


Mapping Requirements



Sr. No.
Input Field
Transformation
CSV Output field


1
id
direct mapping
id


2
first_name
first_name + " " + last_name.
name


3
last_name


4
sex
If input is "M" then "Male" else if input value is "F" then "Female"
sex


5
department
must be left paded with 0 and should be displayed as an attribute for output XML
department


6
salary
direct mapping
salary






You may also like to read..

3 comments:

  1. this is a nice example .. but for xml files with multi occuring elements will it convert to tabular format correctly? i tried with another example.. probably.. we need to split multi occuring elements as separate schemas and then need to join.. please share your thoughts on that.. also.. in tMap components i can only see Inner join and Left outer join. What about full outer join? can we achieve it?

    ReplyDelete
  2. Yes, for multi occuring elements we can display in tabular format correctly. Pleave visitn my blog post

    http://vikramtakkar.blogspot.in/2013/02/read-xml-having-multiple-nested-loops.html.

    However, if the tags are repeating at the same hierarchy then yes, it needs to be handled as separate schema and needs to be joined if required as part of the Job design. I am also going to write an article on that as well.

    Regarding your query on tMap-
    Yes, in tMap there is option of left outer Join and Inner Join only.

    However, full outer join is mostly done when both the inputs have same format. we can achieve the same using tUnite component. Purpose of the tMap/tJoin is basically to perform the transformations and lookup.

    Let me know, if you have any other queries.

    ReplyDelete
  3. Hi vikram ji,
    I have tried to load XML data into my Oracle table.data got loaded but the issue is data's are loaded like below:
    Columns in table are : first-name,lastname,location
    And i tried to load XML datas thoZe are [person_first_name]Ajith [/person_first_name]
    Output :
    In my table Data stored as
    [person_first_name]Ajith [/person_first_name]

    But I need only the name: Ajith


    Please help me out .

    ReplyDelete