top of page
Writer's pictureSandeep Raut

Hadoop simplified..10+ simple descriptions


Hadoop - Big data platform

Today we live in the age of Big data which helps tremendously in Digital Transformation

Data volumes have outgrown the storage & processing capabilities of a single machine and the different types of data formats required to be analyzed have increased tremendously.

This brings 2 fundamental challenges:

  • How to store and work with huge volumes & variety of data

  • How to analyze these vast data points & use them for competitive advantage.

Hadoop fills this gap by overcoming both challenges. Hadoop is based on research papers from Google & it was created by Doug Cutting, who named the framework after his son’s yellow stuffed toy elephant.

So What is Hadoop? It is a framework made up of:

  • HDFS – Hadoop distributed file system

  • Distributed computation tier using programming of MapReduce

  • Sits on the low-cost commodity servers connected together called Cluster

  • Consists of a Master Node or NameNode to control the processing

  • Data Nodes to store & process the data

  • JobTracker & TaskTracker to manage & monitor the jobs

Let us see why Hadoop has become so much popular now.

  • Over the last decade, all the data computations were done by increasing the computing power of a single machine by adding the no of processors & increasing the RAM but they had physical limitations.

  • As the data started growing beyond these capabilities, an alternative was required to handle these storage requirements for eBay (10 PB), Facebook (30 PB), Yahoo (170 PB), JPMC (150 PB), and increasing

  • With a typical 75 MB/Sec disk data transfer rate, it was impossible to process such humongous data

  • Scalability was limited by physical size & no or limited fault tolerance

  • Additionally, various formats of data are being added to the organizations for analysis, which is not possible with traditional databases

How does Hadoop address these challenges?

  • Data is split into small blocks of 64 or 128MB and stored on a minimum of 3 machines at a time to ensure data availability & reliability

  • Many machines connected in cluster work parallel for the faster crunching of data

  • If any machine fails, the work is assigned to another automatically

  • MapReduce breaks complex tasks into smaller chunks to be executed in parallel

The benefits of using Hadoop as a Big data platform are:

  • Cheap storage – commodity servers to decrease the cost per terabyte

  • Virtually unlimited scalability – new nodes can be added without any changes to existing data giving the ability to process any amount of data, so no archival is necessary

  • The speed of processing – tremendous parallel processing to reduce processing time

  • Flexibility – schema-less, can store any data format – structured & unstructured ( audio, video, texts, CSV, pdf, images, logs, clickstream data, social media)

  • Fault-tolerant – any node failure is covered by another node automatically

Later multiple products & components were added to Hadoop so it is now called an eco-system.

  • Hive – SQL-like interface

  • Pig – data management language like commercial tools AbInitio, Informatica

  • HBase – column-oriented database on top of HDFS

  • Flume – real-time data streaming such as credit card transactions, videos

  • Sqoop – SQL interface to RDBMS and HDFS

  • Zookeeper – a DBA management for Hadoop

And multiple such products are getting added all the time from various companies like Cloudera, Hortonworks, Yahoo, etc.

How some of the world leaders are using Hadoop:

  • Chevron collects large amounts of seismic data to find where they can get more oil resources

  • JPMC uses it for storing more than 150 PB of data, over 3.5 Billion user log-ins for Credit scoring and fraud detection

  • eBay using it for real-time analysis and search of 9 PB data with 97 million active buyers, over 200 million items for Cross-Sell

  • Nokia uses it to store data from phones, service logs to analyze how people interact with apps and usage patterns to address customer churn

  • Walmart uses it to analyze customer behavior of over 200 million customer visits in a week

  • UC Irvine Health hospitals have stored 9 million patient records over 22 years to build patient surveillance algorithms

  • Manufacturers are using it for warranty analytics

Hadoop may not replace the existing data warehouses but it is becoming the 1 choice for Big data platforms with the price/performance


bottom of page