Monday 19 December 2016

3. Hadoop Ecosystem | Hadoop Tutorial

Hadoop Ecosystem | Hadoop Tutorial

Hadoop is a framework which comprised of set of tools and technologies. They combine together to make a Eco System. Different tools can be used at different parts of projects based on its implementations and features. Hadoop ecosystem includes both official Apache open source projects and a couple of commercial tools and solutions. Some of the open source examples include Hive, Pig, Oozie and Sqoop etc. In this article we will cover open source tools which are part of Eco system. Lets start with HDFS.

Hadoop Ecosystem | Hadoop Tutorial

Hadoop Distributed File System (HDFS)
HDFS is a storage component of Hadoop and a technology to store the data in distributed manner in order to process faster.  HDFS splits the incoming file into small chunks and are stored on blocks of size 128 MB. HDFS runs on top of local file system, it is a logically created over physical storage present on the various machines in the Hadoop cluster. 

HDFS is designed to tolerate high component failure rate by replicating the files over multiple machines. Further, it is designed only to handle large files as the larger the file the less it has to seeks the data into various machines. It deals with Streaming or sequential data rather than random access, sequential data means few seeks( Hadoop only seeks to the beginning block and read sequentially). 

Any data which needs to be processed by Hadoop should be present in HDFS. Hence we need to ingest data into HDFS from local/remote machines before processing it.

Map Reduce
Map reduce engine is the job execution framework of hadoop. It is based on the papers published by Google. It consist of a MAP functions which processes and organizes the incoming data into a key-value pair. The Mapped data is provided to  a Reduce function which further processes and aggregates the incoming data according to the task requirement and provides the final processed output.

This works the same way as - Suppose you have a book containing 10 chapters and you have to write the summary of the book. So, for this task you appointed 10 people(called Mappers) to read each chapter. When the Mappers are done reading each chapter a new person is appointed(called Reducer) to hear the content of every chapter and write the final summary of the book. This way the task is accomplished in a efficient manner.  Hadoop provides flexibility in writing the Map Reduce code in numerous languages like Java, Python, Scala, Ruby,etc.

Hive is a tool in the Hadoop environment which was contributed by Facebook. It is a data warehouse infrastructure built on top of Hadoop. It was built in 2007, the basic idea behind hive was to implement the prevalent database concepts to the unstructured world of hadoop.  Hive provides a boon to the developers having a hard time the Object Oriented Programming languages by implementing simple sql-like queries(known as HQL - Hive Query Language) to run the Map reduce job. Hive stores data in the form of database and tables, the data is queried and ultimately a Map reduce job is run in the background to fetch the data.

Pig is a tool developed at Yahoo, which was later contributed to the Apache foundation. It is used for creating data-flows for Extract, Transform and Load(ETL), processing, aggregating, and analyzing large data sets. Pig programs accomplish huge tasks, but they are easy to write and maintain. It uses its its own high level language, known as Pig Latin, to process the data. Pig Latin scripts are further internally translated into Map Reduce jobs to produce output.

Hadoop core functionality is to perform only batch processing, and data will be accessed only in a sequential manner. That means one has to search the entire data set even for the simplest of jobs. 

HBase is a distributed column-oriented database built on top of the Hadoop file system. Hbase is modeled over Google’s Big Table designed to provide quick random access to huge amounts of structured data. It is an open source, distributed, versioned, column-oriented, No-SQL / Non-relational database management system that runs on the top of Hadoop. It allows the Hadoop ecosystem to provides random real-time read/write access to data in the Hadoop File System. We can store the data in HDFS either directly or through HBase. Data consumer reads/accesses the data in HDFS randomly using HBase. 

HBase sits on top of the Hadoop File System and provides read and write access.  Data storage unit  in HBase is column i.e. columns are stored sequentially in contrast to RDBMS where data storage unit is row and rows are stored sequentially.   It is well suited for sparse data sets, which are common in many big data use cases. Hbase extends the capability of hadoop by providing  transnational capability to hadoop, allowing users to update data records and allowing more than one value for a record(versions). Hadoop is designed for batch processing, but HBase also allows real-time processing. 

Sqoop (Sql +Hadoop = Sqoop)
Sqoop is a tool used for providing interactions between Hadoop and RDBMS such as MySQL, Oracle, TeraData, IBM-DB2, MSSQL. Sqoop internally uses MapReduce jobs to import/export data to/from the HDFS to RDBMS. Sqoop can import from entire table or allows the user to specify predicates to restrict data selection. It can also directly ingest data into a hive or Hbase table. 

Zookeeper is a centralized open-source server for maintaining and managing configuration information, naming conventions and synchronization for distributed cluster environment. As Hadoop follows a distributed model case of partial failure may occur between different machines, in such cases zookeeper allows highly reliable coordination. Zookeeper manages configuration across blocks. If you have dozens or hundreds of blocks, it becomes hard to keep configuration in sync across nodes and quickly make changes. 


Oozie is a workflow management system for Apache Hadoop, it maintains order in which the sequence of task need to executed. For instance we have 3 jobs and the output of one is the input of another, in such a case the sequence would be fed to Oozie, which uses a Java web application to maintain the flow.

Previous Articles:
1. Introduction to Hadoop
2. Advantages of Hadoop

Monday 12 December 2016

2. Advantages of Hadoop | Hadoop Tutorial

Advantages of Hadoop:

Cost : Hadoop being open-sourced and using commodity hardware provides a cost effective model, unlike traditional RDBMS, which requires expensive hardware and high-end processors. The problem with traditional relational database management systems is that it is extremely costly if you want to scale in order to process massive volumes of data. In order to reduce costs, many companies try to down-sample data and classify data which may not give them correct picture of their business. The raw data would be archived or deleted, as it would be too costly to save it in database.

Scalability : Hadoop provides a highly scalable model in which the data can be processed by spreading the data over multiple inexpensive machines, which can be increased or decreased as per the enterprise requirement. Unlike traditional relational database systems (RDBMS) that can’t scale to process large amounts of data, Hadoop enables businesses to run applications on thousands of nodes involving many thousands of terabytes of data. 

Flexibility : Hadoop is  suitable for processing all types of data sets - structured, semi-structured and unstructured. This means all sort of data whether text, images, videos, clicks,etc. Can be processed by hadoop making it highly flexible.Hadoop enables businesses to easily access new  and different types of data sources which can be both structured and unstructured. This means businesses can use Hadoop to derive valuable business insights from data sources such as social media, email conversations.  Hadoop can be used for a wide variety of purposes, such as log processing, recommendation systems, data warehousing, market campaign analysis and fraud detection. 

Speed : Hadoop follows a distributed file system model in which the file is broken into pieces and is processed parallel which provides a better performance and a relatively faster speed (as numerous tasks are running side by side) when compared to traditional DBMS. If you’re dealing with large volumes of unstructured data, Hadoop is able to efficiently process terabytes of data in just minutes, and petabytes in hours. 

Fault-tolerance :  Hadoop uses commodity hardware to store files, what enables this to compete high-end machines is the fault tolerance provided by Hadoop, in which the data is replicated on various machines and is read from one machine. If this machine goes down, the data can be read from the other machine, where the data is replicated.

Next Article: 3. Hadoop Ecosystem
Previous Article: 1. Introduction to Hadoop

Wednesday 7 December 2016

1. Introduction to Hadoop | Hadoop Tutorial

We are living in the information age, thus with the advent of the Digital revolution, we have a large number corporations to serve our needs, to make our lives easier and to take us to the next level of humanity. To achieve this task these corporations need a base on which they could achieve this task. This “base” is the data which they have generated over the course of time.

Data is ever generating. Each second piles of data are being generated. Let’s have a look at an example : Facebook has 1.79 billion active monthly users. These users are uploading images, videos, posting texts, comments, likes, etc. which is being done every second which accounts for all types of data - structured, semi-structured and unstructured. Now let’s say, if each user hits at least one like a day, it would account to 1.79 billion likes to handle in a day. But we all know that is surely not the case, the real figures are much more humongous than that. Likewise, did you ever pondered that the thousands of Apps on the Google play store having having millions of download with each user generating data, What happens to that data ?

This data is processed and analyzed by respective companies to add competitive advantage to their Corporation and come up with future solution for them to serve the customers better, learning from the scenarios of the present. For example, people comment their reviews about specific products on Amazon, these reviews are then processed by Amazon to understand the plight of the customer and provide better service in the future.

According to the prevailing Database management system, handling this amount of data generating at this pace is a challenge as it required high memory(RAM), scaling beyond a capacity often involved downtime and came with an upper limit. Also, it would not have been cost friendly. Further, RDBMS was unable to categorize unstructured data.

So the question arises, how to process these data sets ? The answer to this is HADOOP.

Hadoop is an open source framework that allows distributed processing of large data sets across clusters of commodity hardware.

Open Source : In contrast, to the traditional RDBMS systems which required purchasing a license, Hadoop is readily available without any cost, maintained by Apache foundation.

Distributed processing : Hadoop framework splits the data into chunks and processes them in parallel. This makes it time efficient in handling big data sets. e.g. If you have to write 10,000 pages. What would you prefer, hiring the world’s fastest writer or hiring 100 writers a day ? Definitely, the latter is much faster and cheaper.

Large Data Sets: Used to process big data sets with ease. All the processing happens on the data present in HDFS (Hadoop File System). Since the data is divided into different machines and processed in parallel, it allows to process massive data.

Commodity Hardware : Hadoop uses cheap and simple hardware rather than high enterprise computers that cost too much. “China mobile”, a Telecom Company based in China, which was generating 5-8 TB data of records daily. Hadoop enabled them to use 10 times data than their older system at ⅕ cost.

Next Article : 2. Advantages of Hadoop