Recent Trends of IoT and Big Data in Research Problem-Solving
Published in Shivani Agarwal, Sandhya Makkar, Duc-Tan Tran, Privacy Vulnerabilities and Data Security Challenges in the IoT, 2020
Pham Thi Viet Huong, Tran Anh Vu
Apache Hadoop is a well-known tool for handling large data sets. Many companies currently use Apache Hadoop in their business, including SwiftKey [10], Nokia [11], and Alacer [12], to name but a few. Apache Hadoop comprises the Hadoop kernel, MapReduce, and the Hadoop Distributed File System (HDFS). Hadoop uses the MapReduce programming model, which applies a divide-and-conquer approach to process large amounts of data. In Hadoop, master nodes and worker nodes work together: the master node divides the work into tasks and distributes them among the worker nodes. When the worker nodes finish their tasks, they return their partial results to the master node, which combines all outputs in the reduce step. Hadoop has many advantages; for example, it can process distributed data, perform tasks independently, and tolerate partial failures easily [13]. However, Hadoop still has some disadvantages, such as a restrictive programming model, a single master node, and node distribution and configuration that are not straightforward [13]. Despite these disadvantages, Hadoop remains a powerful software framework for solving Big Data problems.
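To make the master/worker division of labour concrete, the following is a minimal sketch of a Hadoop MapReduce job in Java (the classic word count). It is an illustrative example, not code from the cited chapter; input and output paths are supplied on the command line and are assumptions of the example.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map step: each worker tokenizes its input split and emits (word, 1) pairs.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce step: the framework groups pairs by word; the reducer sums the counts.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation on each worker
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. an HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // must not already exist
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The master node schedules the map tasks over the input splits and the reduce tasks over the grouped intermediate data; the combiner is an optional optimization that performs the reduce locally on each worker before the shuffle.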
Hadoop Framework: Big Data Management Platform for Internet of Things
Published in Lavanya Sharma, Pradeep K Garg, From Visual Surveillance to Internet of Things, 2019
Pallavi H. Bhimte, Pallavi Goel, Dileep Kumar Yadav, Dharmendra Kumar
Pig is a procedural data-flow platform with a high-level scripting language, Pig Latin, mainly used for programming with Apache Hadoop. It can be used by those who are not familiar with Java but are comfortable with SQL-like scripting. Pig provides a user-defined function (UDF) facility that invokes code written in several languages, such as JRuby, Jython, and Java. Pig operates on the client side of a cluster and supports the Avro file format. It can easily handle large amounts of data, is well suited to ETL data pipelines, and allows raw data to be reprocessed repeatedly for research. Data in Pig Latin can be loaded, stored, streamed, filtered, grouped, joined, combined, split, and sorted. A Pig program can be run in three different ways: as a script (a file containing Pig Latin commands); through Grunt, an interactive command interpreter that executes commands one by one; or embedded, in which the Pig program is executed as part of a Java program (a sketch of this embedded mode is given below). Pig is used at Dataium to sort and prepare data, at LinkedIn for the "People You May Know" feature, and at PayPal to analyze transactional data and attempt to prevent fraud.
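As an illustration of the embedded execution mode mentioned above, the sketch below drives a small Pig Latin pipeline from a Java program through the PigServer API. The input file name, field layout, and query are assumptions made for the example, not taken from the cited chapter.

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class EmbeddedPigExample {
  public static void main(String[] args) throws Exception {
    // LOCAL mode runs against the local file system; ExecType.MAPREDUCE would run on a Hadoop cluster.
    PigServer pig = new PigServer(ExecType.LOCAL);

    // Hypothetical input: tab-separated (user, page, dwell-time-in-seconds) records.
    pig.registerQuery("visits = LOAD 'visits.tsv' AS (user:chararray, page:chararray, dwell:int);");
    pig.registerQuery("long_visits = FILTER visits BY dwell > 60;");
    pig.registerQuery("by_page = GROUP long_visits BY page;");
    pig.registerQuery("counts = FOREACH by_page GENERATE group AS page, COUNT(long_visits) AS n;");

    // Materialize the result; equivalent to a STORE statement in a Pig Latin script.
    pig.store("counts", "long_visit_counts");
    pig.shutdown();
  }
}
```

The same pipeline could be saved as a .pig script file or typed line by line into the Grunt shell; the embedded form is convenient when the Pig workflow is only one step of a larger Java application.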
A Survey of Big Data and Computational Intelligence in Networking
Published in Yulei Wu, Fei Hu, Geyong Min, Albert Y. Zomaya, Big Data and Computational Intelligence in Networking, 2017
Yujia Zhu, Yulei Wu, Geyong Min, Albert Zomaya, Fei Hu
An emerging wave of Internet deployments, most notably the Internet of Things (IoT) [5] and the integration of IoT and fog computing [6], often depends on a distributed networking system to collect data from geographically distributed sources, such as sensors and data centers. For example, data collected within the IoT are geo-referenced and sparsely distributed, so Internet geographic information system (GIS)-based solutions are required to visualize them and cope with this challenge. Apache Hadoop [7] is well known for its distributed storage and processing of big data. The Hadoop Distributed File System (HDFS) [8] is the core of Hadoop's storage layer; it is a distributed file system that stores data on commodity machines and provides very high aggregate bandwidth across the cluster. HDFS was inspired by the Google File System (GFS) [9], one of the most widely adopted designs for distributed file systems.
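As a small illustration of how an application interacts with HDFS (not code from the cited survey), the snippet below writes and reads a file through Hadoop's FileSystem API; the NameNode address and file paths are placeholders.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Placeholder NameNode address; HDFS splits files into blocks and
    // replicates them across commodity DataNodes behind this single namespace.
    conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");
    FileSystem fs = FileSystem.get(conf);

    Path path = new Path("/sensors/readings.txt");

    // Write a sample record; the client streams blocks to DataNodes.
    try (FSDataOutputStream out = fs.create(path, true)) {
      out.write("sensor-42,2019-05-01T12:00:00Z,21.7\n".getBytes(StandardCharsets.UTF_8));
    }

    // Read it back; the NameNode resolves block locations, the data is served by DataNodes.
    try (FSDataInputStream in = fs.open(path);
         BufferedReader reader =
             new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
      System.out.println(reader.readLine());
    }

    fs.close();
  }
}
```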
Parallel and distributed clustering framework for big spatial data mining
Published in International Journal of Parallel, Emergent and Distributed Systems, 2019
Malika Bendechache, A-Kamel Tari, M-Tahar Kechadi
Apache Hadoop has become one of the most popular parallel and distributed processing models for big data. MapReduce, the heart of Apache Hadoop, is the programming paradigm that allows for massive scalability across hundreds or thousands of servers in a Hadoop cluster. It is capable of processing large-scale datasets by exploiting the parallelism among clusters of processing nodes. MapReduce gained popularity for its simplicity, flexibility, fault tolerance, and scalability soon after its introduction. Many data mining algorithms have been implemented in Hadoop MapReduce to improve and accelerate their performance [13–15]. In particular, many researchers have proposed MapReduce implementations for big data clustering [16–19]; these usually adapt existing clustering algorithms to MapReduce for specific applications.
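To show how such a clustering algorithm maps onto MapReduce, the sketch below expresses one k-means iteration as a map and a reduce function: the mapper assigns each 2-D point to its nearest centroid, and the reducer recomputes each centroid as the mean of its assigned points. This is an illustrative sketch rather than any of the cited implementations; it assumes the driver (omitted) passes the current centroids through the job configuration key "kmeans.centroids" and reruns the job until the centroids converge.

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// One k-means iteration expressed as a map and a reduce function.
public class KMeansIteration {

  // Mapper: emit (nearestCentroidId, point) for every input point formatted as "x,y".
  public static class AssignMapper extends Mapper<Object, Text, IntWritable, Text> {
    private double[][] centroids;   // current centroids, parsed in setup()

    @Override
    protected void setup(Context context) {
      // Centroids are assumed to be passed as "x1,y1;x2,y2;..." in the job configuration.
      String[] parts = context.getConfiguration().get("kmeans.centroids").split(";");
      centroids = new double[parts.length][2];
      for (int i = 0; i < parts.length; i++) {
        String[] xy = parts[i].split(",");
        centroids[i][0] = Double.parseDouble(xy[0]);
        centroids[i][1] = Double.parseDouble(xy[1]);
      }
    }

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] xy = value.toString().split(",");
      double x = Double.parseDouble(xy[0]), y = Double.parseDouble(xy[1]);
      int best = 0;
      double bestDist = Double.MAX_VALUE;
      for (int i = 0; i < centroids.length; i++) {
        double dx = x - centroids[i][0], dy = y - centroids[i][1];
        double d = dx * dx + dy * dy;          // squared Euclidean distance
        if (d < bestDist) { bestDist = d; best = i; }
      }
      context.write(new IntWritable(best), value);
    }
  }

  // Reducer: average all points assigned to a centroid to obtain its new position.
  public static class UpdateReducer extends Reducer<IntWritable, Text, IntWritable, Text> {
    @Override
    protected void reduce(IntWritable centroidId, Iterable<Text> points, Context context)
        throws IOException, InterruptedException {
      double sumX = 0, sumY = 0;
      long n = 0;
      for (Text p : points) {
        String[] xy = p.toString().split(",");
        sumX += Double.parseDouble(xy[0]);
        sumY += Double.parseDouble(xy[1]);
        n++;
      }
      context.write(centroidId, new Text((sumX / n) + "," + (sumY / n)));
    }
  }
}
```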
Schema on read modeling approach as a basis of big data analytics integration in EIS
Published in Enterprise Information Systems, 2018
Slađana Janković, Snežana Mladenović, Dušan Mladenović, Slavko Vesković, Draženko Glavić
Apache Hadoop is an open-source distributed software platform for storing and processing data. Central to the scalability of Apache Hadoop is the distributed processing framework known as MapReduce (Sridhar and Dharmaji 2013). According to research by Russom (2013), the main reason to integrate Hadoop into Business Intelligence or an Enterprise Data Warehouse is the expectation that Hadoop will enable Big Data analytics. The basic advantage of Hadoop is the possibility of using advanced non-OLAP (Online Analytic Processing) analytic methods, such as data mining, statistical analysis and complex SQL. However, in addition to the fact that it can be used as an analytical sandbox, Apache Hadoop includes many components useful for ETL. For example, Apache Sqoop is a tool for transferring data between Hadoop and relational databases. Once data are located in the Hadoop Distributed File System, they can be efficiently subjected to the ETL tasks of cleansing, normalizing, aligning, and aggregating for an EDW by employing the massive scalability of MapReduce (Intel Corporation 2013). In this way, the Apache Hadoop platform represents a powerful ETL tool enabling the integration of the results of Big Data analysis of structured and unstructured data in an EDW.
A MapReduce C4.5 Decision Tree Algorithm Based on Fuzzy Rule-Based System
Published in Fuzzy Information and Engineering, 2019
Fatima Es-sabery, Abdellatif Hair
Currently, big data is understood as the capability of extracting useful patterns or information from large-scale data [5]. Handling this huge quantity of data on a single computer node is inefficient in real time. To resolve this problem, a big data processing framework is deployed on a cluster of computers with a high-performance computing platform, and the data mining tasks are distributed across this cluster by running the high-level data-parallel framework Hadoop. Apache Hadoop is an open-source software framework that greatly facilitates writing distributed applications. It contains two components: the distributed file system HDFS and the MapReduce programming model.