Hadoop is an open-source framework used to store and process large amounts of data in a distributed computing environment. The framework is written in Java, and its MapReduce programming model allows large datasets to be processed in parallel across a cluster of machines. In this article, we will talk about what Hadoop is, so stay tuned!
What is Hadoop?
Hadoop is an open-source framework from Apache that is used to store, process, and analyze data that is enormous in volume. Hadoop is written in Java and is not OLAP (online analytical processing); it is used for batch/offline processing. Facebook, Yahoo, Google, Twitter, LinkedIn, and countless others use it. Moreover, it can be scaled up simply by adding nodes to the cluster.
Modules of Hadoop
- HDFS: Hadoop Distributed File System. Google published its GFS paper, and HDFS was developed based on it. It states that files will be broken into blocks and stored across the nodes of the distributed architecture.
- YARN: Yet Another Resource Negotiator is used for job scheduling and cluster resource management.
- MapReduce: This framework lets Java programs perform parallel computation on data using key-value pairs. The Map task converts input data into a data set that can be computed as key-value pairs. The Reduce task consumes the output of the Map task, and the reducer's output gives the desired result (see the word-count sketch after this list).
- Hadoop Common: These Java libraries are used to start Hadoop and are also used by the other Hadoop modules.
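To make the Map and Reduce tasks above concrete, here is a minimal word-count sketch using Hadoop's Java MapReduce API. The class names (WordCount, TokenizerMapper, IntSumReducer) are illustrative choices, not part of Hadoop itself.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map task: converts each input line into (word, 1) key-value pairs.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE); // emit (word, 1)
            }
        }
    }

    // Reduce task: consumes the map output and sums the counts for each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum)); // emit (word, total count)
        }
    }
}
```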
Hadoop Features
- The Hadoop architecture is a package of the file system, the MapReduce engine, and HDFS (Hadoop Distributed File System). The MapReduce engine can be MapReduce/MR1 or YARN/MR2.
- A Hadoop cluster consists of a single master and multiple slave nodes. The master node includes Job Tracker, Task Tracker, NameNode, and DataNode, whereas the slave node includes DataNode and TaskTracker.
Hadoop Distributed File System
The Hadoop Distributed File System (HDFS) is a distributed file system for Hadoop. It uses a master/slave architecture consisting of a single NameNode, which plays the role of master, and multiple DataNodes, which play the role of slaves.
Both NameNode and DataNode are capable enough to run on commodity hardware. HDFS is developed in Java, so any machine that supports the Java language can run the NameNode and DataNode software.
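Because HDFS is written in Java, clients commonly interact with it through the Java FileSystem API. Below is a minimal sketch of writing, renaming, and reading back a file; the NameNode address hdfs://localhost:9000 and the paths under /tmp are placeholder assumptions for a local test cluster.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsHello {
    public static void main(String[] args) throws Exception {
        // fs.defaultFS points at the NameNode; this address is a placeholder.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000");
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/tmp/hello.txt");

        // Write: the NameNode records the metadata, the DataNodes store the blocks.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("Hello, HDFS!".getBytes(StandardCharsets.UTF_8));
        }

        // A namespace operation (rename) handled by the NameNode.
        Path renamed = new Path("/tmp/hello-renamed.txt");
        fs.rename(file, renamed);

        // Read: the client fetches the blocks from the DataNodes that hold them.
        try (FSDataInputStream in = fs.open(renamed)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
        fs.close();
    }
}
```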
NameNode
- It is the single master server in the HDFS cluster.
- As it is a single node, it may become the cause of a single point of failure.
- It manages the file system namespace by executing operations such as opening, renaming, and closing files.
- It streamlines the architecture of the system.
DataNode
- The HDFS cluster comprises multiple data nodes. Each DataNode comprises various data blocks.
- These data blocks are used to store data.
- A DataNode serves read and write requests from the file system's clients.
- It performs block creation, deletion, and replication upon instruction from the NameNode.
Job Tracker
- The role of the Job Tracker is to accept MapReduce jobs from the client and process the data using the NameNode.
- In response, the NameNode provides metadata to the Job Tracker.
Task Tracker
- It works as a slave node for the Job Tracker.
- It receives the task and code from the Job Tracker and applies that code to the file. This process can also be called a Mapper.
MapReduce Layer
The MapReduce layer comes into existence when the client application submits a MapReduce job to the Job Tracker. In response, the Job Tracker sends the request to the appropriate Task Trackers. Sometimes a TaskTracker fails or times out; in that case, that part of the job is rescheduled.
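Here is a minimal sketch of the client side of this flow: a driver program that configures a job and submits it to the cluster, then waits for it to finish. It assumes the hypothetical WordCount mapper and reducer sketched earlier; the input and output paths are passed as command-line arguments.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");

        job.setJarByClass(WordCountDriver.class);
        // Reuses the hypothetical mapper/reducer classes sketched earlier.
        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setReducerClass(WordCount.IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input and output paths are placeholders; pass real HDFS paths as args.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Submits the job to the cluster (JobTracker in MRv1, ResourceManager in YARN)
        // and waits for it to complete.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```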
Advantages of Hadoop
- Fast: In HDFS, the data is distributed over the cluster and mapped, which helps with faster retrieval. Even the tools to process the data are often on the same servers, which reduces processing time. Hadoop can process terabytes of data in minutes and petabytes in hours.
- Scalable: The Hadoop cluster can be extended by adding nodes.
- Cost Effective: Hadoop is open source and uses commodity hardware to store data, so it is cost-effective compared to traditional relational database management systems.
- Resilient to failure: HDFS can replicate data over the network, so if one node goes down or some other network failure happens, Hadoop takes the other copy of the data and uses it. By default, data is replicated three times, but the replication factor is configurable (see the sketch after this list).
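As a small illustration of the configurable replication factor, the sketch below sets a default replication factor for new files from the client side and changes the replication factor of an existing file. The file path is a placeholder assumption; cluster-wide defaults are normally set by administrators in the HDFS configuration.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Default replication factor for files created through this client.
        conf.setInt("dfs.replication", 3);

        FileSystem fs = FileSystem.get(conf);

        // Change the replication factor of an existing file to 2;
        // the NameNode then instructs DataNodes to add or remove block copies.
        fs.setReplication(new Path("/tmp/hello.txt"), (short) 2);
        fs.close();
    }
}
```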
History of Hadoop
Hadoop was started by Doug Cutting and Mike Cafarella in 2002. Its origin was the Google File System paper published by Google.
Let's trace the history of Hadoop in the following steps:
- In 2002, Doug Cutting and Mike Cafarella started to work on Apache Nutch, an open-source web crawler software project.
- While working on Apache Nutch, they had to deal with big data. Storing that data would have cost a great deal, which threatened the viability of the project. This problem became one of the key reasons for the emergence of Hadoop.
- In 2003, Google introduced a file system known as GFS (Google File System). It is a proprietary distributed file system developed to provide efficient access to data.
- In 2004, Google released a white paper on Map Reduce. This system simplifies data processing in large clusters.
- In 2005, Doug Cutting and Mike Cafarella introduced a new file system, NDFS (Nutch Distributed File System). This file system also included MapReduce. In 2006, Doug Cutting joined Yahoo. Based on the Nutch project, he introduced a new project, Hadoop, with a file system known as HDFS (Hadoop Distributed File System). Hadoop's first version, 0.1.0, was released that year.
- Doug Cutting named his project Hadoop after his son's toy elephant.
- In 2007, Yahoo ran two clusters of 1000 machines.
- In 2008, Hadoop became the fastest system to sort one terabyte of data, doing so on a 900-node cluster in 209 seconds.
- In 2013, Hadoop 2.2 was released.
- In 2017, Hadoop 3.0 was released.