Hadoop interview questions and answers

Hadoop Interview Questions

These Hadoop questions have been designed for various interviews, competitive exams and entrance tests. We have covered questions on both basic and advanced concepts, which will help you improve your skills and face interview questions on Hadoop with confidence.

Who are these Hadoop interview questions designed for?

All Hadoop Developers, Hadoop Testers, Hadoop Analysts, Hadoop Admins, Hadoop Experts etc. will find these questions extremely useful. Freshers, BCA, BE, BTech, MCA and college students wanting to make a career in Hadoop will also benefit greatly from these questions.

Hadoop interview questions topics

This section covers Hadoop topics like - Big Data, features of Hadoop, core components of Hadoop, Hadoop cluster, RDBMS vs. Hadoop, distributed cache in Hadoop, Speculative Execution in Hadoop, heartbeat in HDFS, RecordReader in Hadoop, JobTracker etc.

1. What is Big Data? Why Big Data?

Big Data refers to huge volumes of structured, semi-structured and unstructured data that keep growing exponentially with time. Such data cannot be stored and processed effectively within a given time frame using traditional approaches.

Examples of Big Data

  • Social media sites generate around 500 terabytes of new data every day.
  • The New York Stock Exchange generates about one terabyte of new trade data per day.
Types of Big Data

Structured - It refers to highly organized data that can be processed, stored, and retrieved in a fixed format. Dates, phone numbers, customer names, addresses and transaction records in a database are examples of structured data.

Unstructured - It refers to data with no predefined form or structure, so it cannot easily be organized or classified. Social media content such as video, audio and posts are examples of unstructured data.

Semi-structured - It refers to data that does not reside in a relational database but has some organizational properties that make it easier to segregate. XML data and tab-delimited files are examples of semi-structured data.


2. What are the challenges posed by Big Data?

Hadoop came into existence to deal with Big Data challenges.

Some of the challenges with Big Data are -

  • Storage of unstructured data such as documents, photos, audio, videos etc.
  • Combining data from disparate data sources and reconciling it to generate meaningful reports.
  • Computing and processing large amounts of data at high speed.
  • Securing the vast amount of data.
  • Keeping up with the rapid and continuous growth of the data itself, which is one of the most serious challenges of Big Data.

3. What is Apache Hadoop? What are the features of Hadoop?

Hadoop is an open source framework managed by the Apache Software Foundation. It emerged as a solution to the "Big Data" problems and is designed to store, process and analyze huge volumes of data efficiently. Hadoop can analyze data stored on different machines at different locations quickly and in a cost-effective way.

Features of Hadoop

  • Hadoop is open source. We can customize its code according to business requirements.
  • Hadoop is cost effective and requires low investment. It runs on a cluster of commodity hardware, which is not very expensive.
  • Hadoop provides a huge storage system and offers enormous computing power through the multiple nodes in the cluster.
  • Hadoop supports parallel processing of data across all nodes in the cluster, which makes data processing fast.
  • Hadoop is fault tolerant. If a node in the cluster fails, the work is automatically recovered because Hadoop maintains replicas of each data block on different nodes.
  • Hadoop is an extremely scalable platform. As the volume of data grows, new nodes can easily be added to the system.

4. What are the core components of Hadoop?

The Hadoop framework comprises three main components:

i.) HDFS(Hadoop Distributed File System) - Storage layer

  • It is the primary storage system of Hadoop.
  • It takes care of storing and managing data within the Hadoop cluster.
  • It maintains multiple copies of each data block, distributed across different nodes, for reliable and quick data access.
ii.) MapReduce - Batch processing engine

  • It is the data processing layer of Hadoop.
  • It processes large structured and unstructured data stored in HDFS.
  • It brings with it extreme parallel processing capabilities.
  • It works in two stages - the Map stage and the Reduce stage.
  • In Mapping Stage, a block of data is read and processed to produce key-value pairs as intermediate output.
  • The Reducer's job is to process the data that comes from the mapper to generate the output.
  • The output of the reducer is the final output, which is stored in HDFS (a minimal word count sketch follows this answer).
iii.) YARN - Resource Management Layer

YARN handles resource management and job scheduling for the applications running on the cluster.
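To make the Map and Reduce stages concrete, below is a minimal word count sketch using the Hadoop MapReduce Java API. It is illustrative only: the class name and the input/output paths passed on the command line are assumptions, and a real job would be packaged into a jar and submitted with the hadoop jar command.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map stage: read one line at a time and emit (word, 1) key-value pairs
    // as intermediate output.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce stage: sum the counts emitted for each word; the result is written to HDFS.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Such a job might be run with something like hadoop jar wordcount.jar WordCount /user/input /user/output, where the jar name and HDFS paths are hypothetical.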

5. What is a Hadoop cluster?

A Hadoop cluster is a set of connected, low-cost commodity computers designed for storing and analyzing huge amounts of data in a distributed computing environment.

A Hadoop cluster is resilient to failures. It maintains multiple copies of data on several nodes, so if one node fails, the data present on another node in the cluster can still be used for analysis.

Hadoop clusters have two types of machines/nodes - master nodes and slave nodes.

i.) Master Node

The master node maintains knowledge about the distributed file system and schedules the allocation of resources.

The master node runs three daemons - NameNode, Secondary NameNode and JobTracker.

  • The NameNode keeps track of all the information about files (i.e. it holds the metadata for HDFS).
  • The Secondary NameNode periodically checkpoints the NameNode's metadata by merging the edit log into the file system image; this copy can be used to help restore the NameNode.
  • The JobTracker monitors the parallel processing of data. It receives jobs from client applications, determines the location of the data from the NameNode, locates available TaskTracker nodes and submits the work to the chosen TaskTracker nodes. When the work is completed, the JobTracker updates its status to the client applications.
ii.) Slave Node

Slave nodes perform the job of storing the data and running computations. Each slave (worker) node runs both a DataNode and a TaskTracker service.

  • The DataNode manages the physical data stored on the node.
  • TaskTrackers communicate with the JobTracker to report the status of running tasks and notify it when a task fails.
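To illustrate how a client works with the master and slave nodes, here is a small, hedged sketch using the HDFS Java API: the client contacts the NameNode (via fs.defaultFS) for the file's metadata and block locations, and the file contents are then streamed from the DataNodes. The host name, port and file path below are hypothetical.

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // fs.defaultFS points the client at the NameNode (master node);
        // the file's bytes are then served by the DataNodes (slave nodes).
        conf.set("fs.defaultFS", "hdfs://namenode-host:8020");   // hypothetical host and port

        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/data/sample.txt");                // hypothetical HDFS path
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(fs.open(path)))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
        fs.close();
    }
}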

6. Compare RDBMS and Hadoop.

Both Hadoop and RDBMS store and process data.

There are some differences between Hadoop and RDBMS, which are as follows:

i.) RDBMS is typically licensed software; you have to pay to buy the complete software license. Hadoop is a free and open source software framework.

ii.) RDBMS is mostly used for OLTP (transactional) processing, whereas Hadoop is used for analytical and Big Data processing.

iii.) RDBMS is used only to manage structured and semi-structured data. Hadoop has the ability to process and store all varieties of data, whether structured, semi-structured or unstructured.

iv.) RDBMS is suited to small and medium-sized data sets. Hadoop is used for very large data sets.

v.) RDBMS provides vertical scalability, which means that as the data grows you have to upgrade the hardware of the same system. Hadoop provides horizontal scalability, which means you just add one or more nodes to the cluster.

vi.) RDBMS uses high-end servers. Hadoop uses commodity hardware.

vii.) RDBMS struggles to achieve high throughput when the data volume is very large. Hadoop offers higher throughput, which means batches of large data sets can be accessed and processed quickly.

7. What is distributed cache in Hadoop and what are its benefits?

Distributed Cache is a facility provided by the MapReduce framework to cache small to moderate-sized read-only files such as text files, zip files, jar files etc. Once a file is cached for a specific job, Hadoop makes it available on each node where a task of that MapReduce job is running. This improves task performance by saving the time and resources required for repeated input/output operations against HDFS. After a successful run of the job, the distributed cache files are deleted from those nodes.
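A hedged sketch of how the distributed cache is commonly used with the MapReduce Java API follows: the driver registers a small read-only file with Job.addCacheFile(), and each map task reads its localized copy in setup(). The file paths, class names and the stop-word filtering use case are assumptions made for illustration.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class StopWordFilterJob {

    public static class FilterMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
        private final Set<String> stopWords = new HashSet<>();

        // setup() runs once per map task; the cached file has already been
        // copied to the local disk of the node running this task.
        @Override
        protected void setup(Context context) throws IOException {
            if (context.getCacheFiles() != null && context.getCacheFiles().length > 0) {
                // "#stopwords.txt" in the cache URI creates a local symlink with that name.
                try (BufferedReader reader = new BufferedReader(new FileReader("stopwords.txt"))) {
                    String line;
                    while ((line = reader.readLine()) != null) {
                        stopWords.add(line.trim());
                    }
                }
            }
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty() && !stopWords.contains(token)) {
                    context.write(new Text(token), NullWritable.get());
                }
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "stop word filter");
        job.setJarByClass(StopWordFilterJob.class);
        job.setMapperClass(FilterMapper.class);
        job.setNumReduceTasks(0);                          // map-only job
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        // Ship a small read-only lookup file to every node that runs a task.
        job.addCacheFile(new URI("/shared/stopwords.txt#stopwords.txt"));  // hypothetical HDFS path
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}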

8. What is Speculative Execution in Hadoop?

Hadoop divides its tasks across many nodes in the cluster. When the Hadoop framework detects that a certain task is running more slowly than expected, it launches redundant copies of that task on other nodes in the cluster. The attempt that finishes first is accepted for further processing, and the remaining attempts are stopped and killed. This process is known as Speculative Execution in Hadoop.

The goal of speculative execution is to reduce a job's response time.
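Speculative execution is enabled by default. When duplicate task attempts are undesirable (for example, tasks that write to an external system), it can be disabled per job through configuration. The sketch below shows the idea using the MRv2 property names; the job setup itself is only outlined.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class NoSpeculationJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Speculative execution is on by default. Turn it off per job when
        // duplicate task attempts would be harmful or wasteful.
        conf.setBoolean("mapreduce.map.speculative", false);
        conf.setBoolean("mapreduce.reduce.speculative", false);

        Job job = Job.getInstance(conf, "job without speculation");
        // ... set the mapper, reducer, input and output paths as usual ...
    }
}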

9. What is a heartbeat in HDFS?

A heartbeat is a periodic signal sent from a DataNode to the NameNode and from a TaskTracker to the JobTracker to indicate that the sender is alive and working. If the NameNode or JobTracker stops receiving heartbeats from a DataNode or TaskTracker for a configured period of time, it concludes that the node has failed or is having problems, and its work is reassigned to other nodes.
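As a point of reference, the DataNode-to-NameNode heartbeat period is controlled by the dfs.heartbeat.interval property (3 seconds by default). The small, illustrative snippet below simply prints the configured value; the class name is an assumption.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hdfs.HdfsConfiguration;

public class HeartbeatIntervalCheck {
    public static void main(String[] args) {
        // HdfsConfiguration loads hdfs-default.xml and hdfs-site.xml.
        Configuration conf = new HdfsConfiguration();
        // dfs.heartbeat.interval is the DataNode-to-NameNode heartbeat period
        // (3 seconds by default). If heartbeats stop arriving for long enough,
        // the NameNode marks the DataNode as dead and re-replicates its blocks.
        System.out.println("dfs.heartbeat.interval = " + conf.get("dfs.heartbeat.interval", "3"));
    }
}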