Wednesday 19 December 2012

Reading Excel Files using Talend Open Studio

Spreadsheets are very important business tool in today’s world. Excel sheets are tool of choice for storing and manipulating business data. In this post, we will look at a way to read data from excel sheets having multiple worksheets. Talend has built in component which allows us to read multiple worksheets with in single file.

Let’s look at the Input excel file having products information in two worksheets “computer” and “dresses”. In this example file each sheet has same schema/columns. Each sheet has a product_id field and a product_name field. Our task is to extract data from each sheet and read it in single flow.

1. First step is to create a metadata for the input excel sheet. Right-click on File Excel in the Metadata section of the Repository window and select Create file Excel.

2. In step 1 of the wizard, enter the Name field to provide name to the metadata. You can also enter additional information in the Purpose and Description boxes. There are other configuration options for Version, Status, and Path, but these are not mandatory, so we will leave them for now. Click on Next.

3. In step 2, Locate the file for which metadata needs to be created. The wizard will show the available sheets in the Excel file and a preview of the data for each sheet.

Check All the required Sheets in the Set sheets parameters pane to configure all the sheets to be read by the job. In the lower right-hand pane, we need to select a sheet that will serve as a schema guide for the wizard. Click on Next.

4. In step 3, check the check box “Set heading row as column names”. Note that the heading row from the second sheet is showing the preview as data, as shown in the next screenshot. Don't worry about this for now; we'll return to it when we configure the job.

5. In step 4, enter the name to provide schema name and update the data type, null ability and length as per requirements and click on Finish.

6. Now we have finished creating metadata for excel sheet. Drag the metadata from the Repository pane under the Metadata section (as shown in the screenshot below) to Job designer pane.

Click on the Component tab below the Job Designer and check the check box named Affect each sheet (header & footer). This forces the header configuration to apply to each sheet and resolves the issue we noted earlier, that the header data was appearing in the data output.

7.  Drag the component tLogRow to Job designer pane and Right-click on the Excel component, select Row > Main, and connect it to the tLogRow component.

8. Run the job and check the output in the output stream. Rows from both the worksheets should be displayed in the output.

This article is written by Vikram Takkar  and published on, please let me know, if you see this article on any other website/blog.

1 comment: