Why Data Science?
Published in Chong Ho Alex Yu, Data Mining and Exploration, 2022
As datasets grow larger, it becomes time-consuming to transfer the data from a database to a laptop or a server housing the analytical system. There are several viable solutions, such as Hadoop and MapReduce. Hadoop is an open-source software framework for storing data and running applications on linked servers. Rather than sending data back and forth, the program is executed on the server where the data are stored. MapReduce is a programming model in the Hadoop framework for accessing and analyzing big data. To put it another way, Hadoop is a distributed file system, whereas MapReduce is a system for in-database analytics that tolerates hardware faults through redundancy (Leskovec et al. 2020). Significantly, although Hadoop is still in use, its usage is declining and it may eventually become obsolete (Hemsoth 2021). In-memory analytics (IMA) is a similar concept: with IMA, the data are analyzed while held in the computer's random-access memory (RAM), rather than read from physical disks (e.g., SAS Viya).
Performance Evaluation of Components of the Hadoop Ecosystem
Published in Salah-ddine Krit, Mohamed Elhoseny, Valentina Emilia Balas, Rachid Benlamri, Marius M. Balas, Internet of Everything and Big Data, 2021
Nibareke Thérence, Laassiri Jalal, Lahrizi Sara
As organizations are flooded with massive amounts of raw data, the challenge is that traditional tools are poorly equipped to deal with the scale and complexity. That is where Hadoop comes in. Hadoop is well suited to meet many Big Data challenges, especially high volumes of data and data with a variety of structures [16, 17]. Hadoop is a framework for storing data on large clusters of affordable, readily available commodity hardware and for running applications against that data. A cluster is a group of interconnected computers (known as nodes) that can work together on the same problem. As mentioned, the current Apache Hadoop ecosystem consists of the Hadoop Kernel, MapReduce [18], HDFS, and a number of other components such as Apache Hive, Pig, and Flume.
A Classification Perspective of Big Data Mining
Published in Ibrahiem M. M. El Emary, Anna Brzozowska, Shaping the Future of ICT, 2017
Manal Abdullah, Nojod M. Alotaibi
Apache Hadoop (2016) is an open-source software framework that enables the distributed processing of large datasets across clusters of commodity hardware using simple programming models. There are two main components of Hadoop: the Hadoop distributed file system (HDFS) and MapReduce. HDFS is a distributed, scalable file system written in Java for the Hadoop framework. MapReduce is a programming paradigm that allows users to define two functions, map and reduce, to process large amounts of data in parallel. In MapReduce, the input is first converted into a large set of key–value pairs. Then, the map function is applied in parallel to every pair in the input dataset, producing a set of intermediate key–value pairs for each call. After that, the reduce function is called to merge together all intermediate values associated with the same key. Companies such as Facebook, Yahoo, Amazon, Baidu, AOL, and IBM use Hadoop on a daily basis. Hadoop has many advantages, including cost effectiveness, fault tolerance, flexibility, and scalability (Mirajkar, Bhujbal, and Deshmukh 2013). Hadoop has many other related software projects that use the MapReduce and HDFS framework, such as Apache Pig, Apache Hive, Apache Mahout, and Apache HBase (Apache Hadoop 2016).
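As a concrete illustration of the map and reduce functions described above, the following is a minimal word-count sketch against the Hadoop MapReduce Java API; the class name, job name, and input/output paths are illustrative placeholders, not taken from the cited works.

// Minimal word-count sketch using the Hadoop MapReduce Java API.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: emit an intermediate (word, 1) pair for every token in the input line.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce: merge all intermediate values associated with the same key (word).
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // optional local aggregation before the shuffle
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g., an HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g., an HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}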
Towards data fusion-based big data analytics for intrusion detection
Published in Journal of Information and Telecommunication, 2023
Hadoop is an open-source framework for writing and running distributed applications that process large amounts of data. Distributed computing is a wide and varied field, but the key distinctions of Hadoop are that it is (Hadoop Architecture, 2020):
Accessible: Hadoop runs on large clusters of commodity machines or cloud computing services such as Amazon's Elastic Compute Cloud (EC2).
Robust: Because it is intended to run on commodity hardware, Hadoop is architected with the assumption of frequent hardware malfunctions. It can gracefully handle most such failures.
Scalable: Hadoop scales linearly to handle larger data by adding more nodes to the cluster.
Simple: Hadoop allows users to quickly write efficient parallel code.
Hadoop uses a programming model called MapReduce for parallelization, scalability, and fault tolerance.
Modeling and Analysis of Hadoop MapReduce Systems for Big Data Using Petri Nets
Published in Applied Artificial Intelligence, 2021
Dai-Lun Chiang, Sheng-Kuan Wang, Yu-Ying Wang, Yi-Nan Lin, Tsang-Yen Hsieh, Cheng-Ying Yang, Victor R. L. Shen, Hung-Wei Ho
Hadoop is a popular programming framework used for the setup of cloud computing systems. The MapReduce framework forms the core of the Hadoop program for parallel computing. The Map function sorts datasets into <key, value> pairs that are then distributed to various nodes for parallel computing. The Reduce function collects the sorted datasets and yields the results. Because Hadoop is an open-source program, the system developer can rewrite how the <key, value> pairs are generated, how the datasets are sorted and sequenced, and how the MapReduce framework collects and sequences the results. Doing so requires that the system developer have a comprehensive understanding of the MapReduce framework. In the absence of customization by the system developer, Hadoop uses its default settings. However, the defaults can produce results that the system developer did not anticipate. This case underlines the importance of developing guidelines to help developers construct the systems they envisage.
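As a hedged sketch of the customization point described above: a developer can replace Hadoop's default routing of intermediate <key, value> pairs to reducers by supplying a custom Partitioner; the class name and routing rule below are illustrative assumptions, not the authors' method.

// Sketch of overriding Hadoop's default partitioning of intermediate <key, value> pairs.
// Without such a class, Hadoop falls back on its default HashPartitioner, which may not
// distribute keys across reducers the way the system developer expects.
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    if (numPartitions == 0 || key.getLength() == 0) {
      return 0;
    }
    // Illustrative rule: route keys to reducers by their first character.
    char first = Character.toLowerCase(key.toString().charAt(0));
    return (first % numPartitions + numPartitions) % numPartitions;
  }
}

// Registering the custom class on the job replaces the default behaviour:
//   job.setPartitionerClass(FirstLetterPartitioner.class);
//   job.setSortComparatorClass(Text.Comparator.class); // custom sort order, if needed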
Programming models and systems for Big Data analysis
Published in International Journal of Parallel, Emergent and Distributed Systems, 2019
Loris Belcastro, Fabrizio Marozzo, Domenico Talia
The Hadoop project is not only about the MapReduce programming model (Hadoop MapReduce module), as it includes other modules such as:
Hadoop Distributed File System (HDFS): a distributed file system providing fault tolerance with automatic recovery, portability across heterogeneous commodity hardware and operating systems, high-throughput access, and data reliability.
Hadoop YARN: a framework for cluster resource management and job scheduling.
Hadoop Common: common utilities that support the other Hadoop modules.
In particular, thanks to the introduction of YARN in 2013, Hadoop turned from a batch processing solution into a platform for running a large variety of data applications, such as streaming, in-memory, and graph analysis. As a result, Hadoop has become a reference for several other programming systems, such as: Storm and Flink for streaming data analysis; Giraph and Hama for graph analysis; Pig and Hive for querying large datasets; Oozie for managing Hadoop jobs; and Ambari for provisioning, managing, and monitoring Apache Hadoop clusters. An overview of the Hadoop software stack is shown in Figure 2.
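To make the HDFS module above concrete, here is a minimal sketch that writes and reads a file through the Hadoop FileSystem Java API; the namenode URI and file path are placeholder assumptions.

// Sketch of basic HDFS access through the Hadoop FileSystem API.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Connect to the cluster's namenode (placeholder URI); HDFS replicates file blocks
    // across datanodes, which is where its fault tolerance comes from.
    FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
    Path file = new Path("/user/demo/example.txt"); // placeholder path

    // Write a small file into the distributed file system.
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.write("hello hadoop\n".getBytes(StandardCharsets.UTF_8));
    }

    // Read it back with streaming access.
    try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line);
      }
    }
    fs.close();
  }
}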