Tuesday 3 September 2013

Handling Huge XML files in Talend

In my last post here, I described the difference between the two generation modes +Talend provides for parsing XML files in components such as tFileInputXML and tAdvancedFileOutputXML. Today I am going to demonstrate the performance difference between the two with two simple examples: the first one writes a huge XML file, and the second one reads and retrieves data from that huge XML file.

Let's get started with the first example: building a simply structured, huge XML file.

My objective is to create an XML file with the following structure:

<EMPLOYEES>
   <EMPLOYEE>
      <EMP_ID>1</EMP_ID>
      <EMP_FIRST_NAME>Thomas</EMP_FIRST_NAME>
      <EMP_LAST_NAME>Eisenhower</EMP_LAST_NAME>
      <EMP_DEPT_ID>7</EMP_DEPT_ID>
      <EMP_SALARY>73199</EMP_SALARY>
   </EMPLOYEE>
   <EMPLOYEE>
      <EMP_ID>2</EMP_ID>
      <EMP_FIRST_NAME>Harry</EMP_FIRST_NAME>
      <EMP_LAST_NAME>Cleveland</EMP_LAST_NAME>
      <EMP_DEPT_ID>5</EMP_DEPT_ID>
      <EMP_SALARY>51486</EMP_SALARY>
   </EMPLOYEE>
</EMPLOYEES>  

The above is just a sample XML with only 2 EMPLOYEE nodes; I am going to generate 10 million of them. Here is the design of the Talend Job: it uses tRowGenerator to generate 10 million records and then converts them to XML. Those who are new and not aware of the functionality of the tRowGenerator component can click here to understand it with an example. Also, if you want to understand the basics of generating metadata for an XML file, click here to know more.
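If you want to reproduce the input data outside of tRowGenerator, the rows it produces here can be sketched in plain Java as below. The name lists and the ranges for EMP_DEPT_ID and EMP_SALARY are only assumptions for illustration; substitute whatever expressions you configure in tRowGenerator.

import java.util.Random;

// Sketch of the kind of row data tRowGenerator emits in this job.
// The field order matches the schema used above; the name lists and value
// ranges (department 1-10, salary 30000-119999) are assumed, not taken
// from the actual job.
public class EmployeeRowSketch {
    private static final String[] FIRST = {"Thomas", "Harry", "Alice", "Maria"};
    private static final String[] LAST  = {"Eisenhower", "Cleveland", "Smith", "Lopez"};
    private static final Random RND = new Random();

    public static Object[] nextRow(int empId) {
        return new Object[] {
            empId,                              // EMP_ID
            FIRST[RND.nextInt(FIRST.length)],   // EMP_FIRST_NAME
            LAST[RND.nextInt(LAST.length)],     // EMP_LAST_NAME
            1 + RND.nextInt(10),                // EMP_DEPT_ID
            30000 + RND.nextInt(90000)          // EMP_SALARY
        };
    }
}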



Screenshot of the Map editor of the tAdvancedFileOutputXML component.

Now let's get to the point. First, set the Generation mode to "Slow and memory-consuming (Dom4J)" before running the above job to generate the huge XML file with 10 million EMPLOYEE nodes.


You can see from the screenshot of the Job execution below that the XML is being generated at a rate of around 300 records per second. 577 seconds have already passed and only about 155k EMPLOYEE nodes have been generated; at that rate the full 10 million nodes would take roughly nine hours, so I had to kill the job.



Now let's change the Generation mode to "Fast with low memory consumption (SAX)".


You can see from the screenshot of the Job execution below that the output XML with 10 million EMPLOYEE nodes has been generated in only 194 seconds (over 50,000 records per second), which is far faster than using the Dom4J parser.



Hence it is always preferable to use the "Fast with low memory consumption (SAX)" generation mode when writing a huge XML file, as it writes node by node and does not hold the entire XML in memory.
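If you are curious what this streaming approach looks like outside of Talend, here is a minimal sketch in plain Java using the JDK's StAX writer (javax.xml.stream). This only illustrates the node-by-node idea with placeholder values; it is not the code Talend generates.

import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamWriter;
import java.io.BufferedOutputStream;
import java.io.FileOutputStream;

// Conceptual sketch of streaming XML generation: each EMPLOYEE element is
// written and then forgotten, so memory use stays flat no matter how many
// nodes are produced.
public class StreamingEmployeeWriter {
    public static void main(String[] args) throws Exception {
        XMLStreamWriter w = XMLOutputFactory.newInstance()
                .createXMLStreamWriter(new BufferedOutputStream(
                        new FileOutputStream("employees.xml")), "UTF-8");
        w.writeStartDocument("UTF-8", "1.0");
        w.writeStartElement("EMPLOYEES");
        for (int id = 1; id <= 10000000; id++) {
            w.writeStartElement("EMPLOYEE");
            writeField(w, "EMP_ID", String.valueOf(id));
            writeField(w, "EMP_FIRST_NAME", "Thomas");    // placeholder values
            writeField(w, "EMP_LAST_NAME", "Eisenhower");
            writeField(w, "EMP_DEPT_ID", "7");
            writeField(w, "EMP_SALARY", "73199");
            w.writeEndElement();                          // </EMPLOYEE>
        }
        w.writeEndElement();                              // </EMPLOYEES>
        w.writeEndDocument();
        w.close();
    }

    private static void writeField(XMLStreamWriter w, String tag, String value)
            throws Exception {
        w.writeStartElement(tag);
        w.writeCharacters(value);
        w.writeEndElement();
    }
}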

Now let's look at the performance while reading and retrieving information from a huge XML file.

Let's create metadata for the file generated above. Click here to know more about generating metadata for an XML file.

Snapshot of Step 4 of the metadata generation:

I am going to use tFileInputXML to read the XML file generated by the above Talend Job. Let's look at the Job design below, which reads and retrieves information from the huge XML file generated above.

Let's first parse with the Generation mode set to "Slow and memory-consuming (Dom4J)" in the Advanced settings of the tFileInputXML component, and run the job to check the performance.

After roughly 15 minutes of waiting, the job failed with a java.lang.OutOfMemoryError: Java heap space.

As we discussed earlier, the Dom4J generation mode loads the entire source XML into memory before parsing, and the in-memory DOM tree typically takes several times the on-disk size of the file. Since the file contains 10 million EMPLOYEE nodes (around 2 GB on disk), the Job failed with a heap space error.
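To see why that mode runs out of memory, here is a conceptual sketch of a DOM-style read using the JDK's built-in DOM parser. The Dom4J mode behaves the same way in principle: the whole document tree must be built before anything can be read.

import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import java.io.File;

// Conceptual sketch only: a DOM-style parse builds the whole document tree
// in memory before any record can be read. On a ~2 GB file this single
// call is where the heap is exhausted.
public class DomStyleRead {
    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new File("employees.xml"));   // entire tree built here
        System.out.println("EMPLOYEE nodes: "
                + doc.getElementsByTagName("EMPLOYEE").getLength());
    }
}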




Now let's change the Generation mode to "Fast with low memory consumption (SAX)" in the Advanced settings of the tFileInputXML component and run the job again to check the performance.

From the above screenshot it is clear that tFileInputXML is now able to successfully parse the XML with 10 million records (roughly 2 GB of data) in 274 seconds, which is far better than using the Dom4J parser; in fact, with the Dom4J parser the job failed to read the file at all.

Hence it is always preferable to read huge XML files with the SAX parser, i.e. the "Fast with low memory consumption (SAX)" generation mode. Click here for more differences between the Dom4J and SAX parsers.
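Outside of Talend, the same streaming read can be sketched with the JDK's SAX parser. Again, this is only an illustration of the node-by-node idea; it simply counts EMPLOYEE nodes instead of mapping them to a schema the way tFileInputXML does.

import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;
import java.io.File;

// Sketch of a SAX read: the handler reacts to one element at a time, so
// even a 2 GB file never has to fit into the heap. Here it just counts
// EMPLOYEE nodes and captures the current EMP_ID text.
public class StreamingEmployeeReader {
    public static void main(String[] args) throws Exception {
        DefaultHandler handler = new DefaultHandler() {
            long employees = 0;
            StringBuilder text = new StringBuilder();

            @Override
            public void startElement(String uri, String local, String qName,
                                      Attributes atts) {
                text.setLength(0);
                if ("EMPLOYEE".equals(qName)) employees++;
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                if ("EMP_ID".equals(qName)) {
                    String empId = text.toString();   // current record's id
                }
            }

            @Override
            public void endDocument() {
                System.out.println("EMPLOYEE nodes read: " + employees);
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new File("employees.xml"), handler);
    }
}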

Let me know if you have used any other cool method of writing or retrieving data from huge XML files.

This article is written by +Vikram Takkar and published on www.vikramtakkar.com; please let me know if you see this article on any other website or blog.
