Tuesday 3 September 2013

Difference between Dom4J and SAX parser in Talend

For the past couple of days, I have been trying to figure out the difference between the two Generation Modes (under Advanced settings of tFileInputXML):

1. Slow & Memory Consuming (Dom4J).
2. Fast with low Memory Consumption (SAX).

I am listing the differences between the two below. Thanks to +raulier laurent and +leo acevedo.

Slow & Memory Consuming (Dom4J) Parser:

1. The Dom4J parser reads the entire XML file and builds it in memory as a tree before any of it can be queried.
2. It uses an object-based model for parsing XML.
3. High memory usage, because the whole document is held in memory.
4. We can insert or delete nodes.
5. We can traverse the document in any direction.
6. With Dom4J we can use all XPath expressions.
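The points above can be illustrated with a small sketch. Note the assumptions: it uses the DOM and XPath APIs that ship with the JDK as a stand-in for Dom4J (so it runs without extra libraries), it is not the code Talend generates, and the class name, method names and sample XML are all hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Illustrative only: JDK DOM used as a stand-in for Dom4J.
class DomTreeDemo {
    // Hypothetical sample data, not taken from a real Talend job.
    static final String SAMPLE_XML =
        "<orders><order id=\"1\"/><order id=\"2\"/><order id=\"3\"/></orders>";

    // Parses the whole document into an in-memory tree (object-based model).
    static Document parse(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    // Uses the XPath last() function, which needs the full tree to be available.
    static String lastOrderId(Document doc) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate("/orders/order[last()]/@id", doc);
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate XPath", e);
        }
    }

    // Traverses backwards: from the last child to its previous sibling.
    static String idBeforeLast(Document doc) {
        Node last = doc.getDocumentElement().getLastChild();
        return ((Element) last.getPreviousSibling()).getAttribute("id");
    }

    // Deletes the first <order> node and returns how many children remain.
    static int removeFirstOrder(Document doc) {
        Element root = doc.getDocumentElement();
        root.removeChild(root.getFirstChild());
        return root.getChildNodes().getLength();
    }

    public static void main(String[] args) {
        Document doc = parse(SAMPLE_XML);
        System.out.println("last id: " + lastOrderId(doc));
        System.out.println("id before last: " + idBeforeLast(doc));
        System.out.println("orders left after delete: " + removeFirstOrder(doc));
    }
}
```

Because the whole tree is in memory, all three operations are simple calls, but the memory cost grows with the size of the file.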

Fast with low Memory Consumption (SAX) Parser:

1. SAX parses the XML file node by node.
2. It uses an event-based model for parsing XML.
3. Low memory usage, as it does not load the entire file into memory but reads it node by node.
4. We can't insert or delete nodes.
5. Traversal is top to bottom only.
6. With the SAX parser we can use only basic XPath expressions; expressions such as last() are not supported.
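The event-based model described above can be sketched with the SAX API that ships with the JDK. This is a minimal illustration under assumptions, not the code Talend generates; the class name, method name and sample XML are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Illustrative only: a plain JDK SAX handler, not Talend's generated code.
class SaxStreamDemo {
    // Hypothetical sample data, not taken from a real Talend job.
    static final String SAMPLE_XML =
        "<orders><order id=\"1\"/><order id=\"2\"/><order id=\"3\"/></orders>";

    // Collects the id of every <order> element as the parser reports it.
    static List<String> collectOrderIds(String xml) {
        final List<String> ids = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attributes) {
                // Called once per opening tag, in document order (top to bottom).
                if ("order".equals(qName)) {
                    ids.add(attributes.getValue("id"));
                }
            }
        };
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
        return ids;
    }

    public static void main(String[] args) {
        List<String> ids = collectOrderIds(SAMPLE_XML);
        // The parser cannot look ahead or go back, so "the last order"
        // has to be tracked by our own code rather than by an expression.
        System.out.println("ids in document order: " + ids);
        System.out.println("last id: " + ids.get(ids.size() - 1));
    }
}
```

The parser itself keeps no tree: once an event has been delivered, the node is gone unless the handler stores it. That is why memory usage stays low, and also why position-dependent expressions such as last() are not available in this mode.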

To summarize: if we have a huge XML file to read, we should use the Fast with low Memory Consumption (SAX) generation mode in the Advanced settings. However, if we need complex XPath expressions, we can use Dom4J.

In the next article I will try to demonstrate, with an example, the usage and performance of both generation modes and how to handle huge XML files in Talend.

Let me know if you have anything to add to this.

This article was written by +Vikram Takkar and published on www.vikramtakkar.com. Please let me know if you see this article on any other website or blog.