Big Data Clustering
Published in Charu C. Aggarwal, Chandan K. Reddy, Data Clustering, 2018
Many of the modern disk-based algorithms use MapReduce [11], or its open-source counterpart Hadoop, to process disk-resident data. MapReduce is a programming framework for processing huge amounts of data in a massively parallel way. MapReduce has two major advantages: (a) the programmer is oblivious to the details of data distribution, replication, load balancing, etc., and (b) the programming concept is familiar, i.e., the concept of functional programming. Briefly, the programmer needs to provide only two functions, a map and a reduce. The typical framework is as follows [34]: (a) the map stage sequentially passes over the input file and outputs (key, value) pairs; (b) the shuffling stage groups all values by key; and (c) the reduce stage processes the values with the same key and outputs the final result. Optionally, combiners can be used to aggregate the outputs from local mappers. MapReduce has been used widely for large-scale data mining [29, 26].
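As a concrete illustration of the map, shuffle, and reduce stages described above, the following is a minimal in-memory sketch in Python using a word-count example; it is illustrative only and is not code from the cited chapter.

```python
from collections import defaultdict

def map_stage(line):
    # Map: pass over the input and emit (key, value) pairs, here (word, 1).
    for word in line.split():
        yield word, 1

def shuffle_stage(pairs):
    # Shuffle: group all emitted values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_stage(key, values):
    # Reduce: process all values sharing a key and emit the final result.
    return key, sum(values)

lines = ["big data clustering", "big data mining"]
pairs = [p for line in lines for p in map_stage(line)]
result = dict(reduce_stage(k, v) for k, v in shuffle_stage(pairs).items())
print(result)  # {'big': 2, 'data': 2, 'clustering': 1, 'mining': 1}
```

A combiner, when used, would apply the same summation locally on each mapper's output before shuffling, reducing the data moved between machines.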
Signal Processing in the Era of Biomedical Big Data
Published in Ervin Sejdić, Tiago H. Falk, Signal Processing and Machine Learning for Biomedical Big Data, 2018
Considering its speed, ease of use, and sophisticated analytics, Apache Spark has recently been replacing MapReduce. Contrary to MapReduce, Spark makes use of memory not only for computations but also for storage to achieve low latency on big data workloads. In addition, Spark provides a unified runtime that deals with multiple big data storage sources like Hadoop DFS, HBase, and Cassandra. Spark also provides ready-to-use high-level libraries for machine learning, graph processing, and real-time streaming. Such parallel/distributed processing tools have enabled, for example, novel visualization of neural connections [13,14,15], allowed for faster image retrieval and lung texture classification [16], enabled large-scale biometrics [17,18], and taken genomics to the next level [19,20].
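A minimal PySpark sketch of the in-memory caching and high-level machine learning library mentioned above is given below; it assumes a local Spark installation with the pyspark package, and the input file name and columns are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

# Illustrative sketch only: "signals.csv" with columns f1, f2 is a
# hypothetical input file, not data from the cited chapter.
spark = SparkSession.builder.appName("biomedical-demo").getOrCreate()

df = spark.read.csv("signals.csv", header=True, inferSchema=True)
features = VectorAssembler(inputCols=["f1", "f2"], outputCol="features").transform(df)
features.cache()  # keep the working set in memory across iterations

model = KMeans(k=3, featuresCol="features").fit(features)
model.transform(features).select("features", "prediction").show(5)

spark.stop()
```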
Remote Sensing Data Organization and Management in a Cloud Computing Environment
Published in Lizhe Wang, Jining Yan, Yan Ma, Cloud Computing in Remote Sensing, 2019
Hadoop [13] is an open-source implementation of Google’s MapReduce paradigm. As one of the top-level Apache projects, Hadoop has formed a Hadoop-based ecosystem [15], which has become the cornerstone technology of many big data and cloud applications [16]. Hadoop is composed of two main components: the Hadoop Distributed File System (HDFS) and the MapReduce paradigm. HDFS is the persistent underlying distributed file system, and is capable of storing terabytes and petabytes of data. The data stored in HDFS are divided into fixed-size (e.g. 128 MB) data chunks, each of which has multiple replicas (normally three replicas) across a number of commodity machines. The data stored in HDFS can be processed by the MapReduce paradigm.
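The following simplified Python sketch illustrates the block-and-replica layout described above (a file split into fixed-size chunks, each replicated on several machines). The node names and round-robin placement are illustrative assumptions; in real HDFS the NameNode decides placement.

```python
import itertools

BLOCK_SIZE = 128 * 1024 * 1024   # 128 MB block size, as in the example above
REPLICATION = 3                  # each block normally has three replicas
NODES = ["node-a", "node-b", "node-c", "node-d", "node-e"]  # hypothetical machines

def place_blocks(file_size_bytes):
    """Return a mapping block_index -> list of nodes holding a replica."""
    n_blocks = -(-file_size_bytes // BLOCK_SIZE)  # ceiling division
    ring = itertools.cycle(NODES)                 # naive round-robin placement
    return {b: [next(ring) for _ in range(REPLICATION)] for b in range(n_blocks)}

# A 1 TB file is split into 8192 blocks of 128 MB, each stored three times.
layout = place_blocks(1 * 1024**4)
print(len(layout), layout[0])
```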
Kernelized Spectral Clustering based Conditional MapReduce function with big data
Published in International Journal of Computers and Applications, 2021
The second step of the KSC-CMEMR technique is to eliminate the irrelevant data present in the clusters. The technique extends standard maximum entropy to identify the unknown values learned in each cluster. Irrelevant data increases the space complexity of big data analytics, so the KSC-CMEMR technique uses a Conditional maximum entropy MapReduce (CMEMR) function to overcome this issue. MapReduce is a programming model introduced for processing and generating large datasets on clusters of computers. MapReduce includes two parts: a function called ‘Map,’ which allows different points of the distributed group (i.e. the dataset) to distribute their work, and a function called ‘Reduce,’ which produces the final form of the output results. The CMEMR function is a training model used for processing big datasets. In the CMEMR function, the map procedure filters and sorts the irrelevant data in each cluster, whereas the reduce function performs a summary operation and provides the output results. The irrelevant data are removed to minimize the error during the clustering process.
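A generic sketch of the filter-then-summarize pattern described above is shown below: a map step drops records flagged as irrelevant and keys the rest by cluster identifier, and a reduce step summarizes each cluster. The record layout and relevance test are hypothetical; this is not the KSC-CMEMR entropy model itself.

```python
from collections import defaultdict

def map_filter(record):
    # Map: keep only relevant records, keyed by cluster id.
    cluster_id, value, relevant = record
    if relevant:
        yield cluster_id, value

def reduce_summarize(cluster_id, values):
    # Reduce: summarize the retained values of one cluster.
    return cluster_id, {"count": len(values), "mean": sum(values) / len(values)}

records = [(0, 2.5, True), (0, 9.0, False), (1, 4.0, True), (1, 6.0, True)]
grouped = defaultdict(list)
for rec in records:
    for key, val in map_filter(rec):
        grouped[key].append(val)

print(dict(reduce_summarize(k, v) for k, v in grouped.items()))
# {0: {'count': 1, 'mean': 2.5}, 1: {'count': 2, 'mean': 5.0}}
```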
Distributed outlier detection in hierarchically structured datasets with mixed attributes
Published in Quality Technology & Quantitative Management, 2020
We implement the outlier detection algorithm in a distributed fashion using the MapReduce programming model and the Hadoop infrastructure. MapReduce (Dean & Ghemawat, 2008) is a programming model and an associated implementation for processing and generating large datasets. Users design a MapReduce program through two functions: map and reduce. As shown in Figure 4, the users specify a map function that processes a key-value pair to generate a set of intermediate key-value pairs, and a reduce function that merges all of the intermediate values that are associated with the same intermediate key. Hadoop (Abouzeid, Bajda-Pawlikowski, Abadi, Silberschatz, & Rasin, 2009) is an open-source distributed infrastructure for the MapReduce implementation. It consists of two layers: a data storage layer called the Hadoop Distributed File System (HDFS) and a data processing layer (the MapReduce framework). In hybrid systems built on Hadoop, the advanced properties of MapReduce can be combined with the performance of parallel database systems.
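To make the map/reduce contract concrete, the following is a Hadoop Streaming style script in Python: the mapper reads input lines from stdin and emits intermediate key-value pairs, and the reducer merges the values associated with the same key. It is a generic counting sketch, not the authors' outlier-detection code.

```python
#!/usr/bin/env python3
# Generic Hadoop Streaming style mapper/reducer (key TAB value on stdin/stdout).
import sys

def mapper(stream):
    # Map: emit one intermediate (key, value) pair per token.
    for line in stream:
        for token in line.split():
            print(f"{token}\t1")

def reducer(stream):
    # Reduce: Hadoop delivers lines sorted by key, so values for the same
    # key arrive adjacently and can be merged with a running sum.
    current_key, total = None, 0
    for line in stream:
        key, value = line.rstrip("\n").split("\t")
        if key != current_key:
            if current_key is not None:
                print(f"{current_key}\t{total}")
            current_key, total = key, 0
        total += int(value)
    if current_key is not None:
        print(f"{current_key}\t{total}")

if __name__ == "__main__":
    # Invoke with "map" or "reduce" as the first argument, e.g. from the
    # streaming job's -mapper and -reducer options.
    mapper(sys.stdin) if sys.argv[1] == "map" else reducer(sys.stdin)
```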
Big Data Framework for Zero-Day Malware Detection
Published in Cybernetics and Systems, 2018
With this threat landscape, malware detection has become a big data problem. In recent times, big data analytics has received considerable attention from security researchers and practitioners. Big data analytics, machine learning, and other decision-making techniques, along with augmented human interfaces, aim to reduce the response time and increase the effectiveness in detecting zero-day malware threats. They can help in updating antimalware solutions in near real time to deal with new malware threats. Historical data have their own significance and can provide cyber intelligence to deal with future threats. Earlier, Apache Hadoop was the de facto standard for big data implementations. Presently, Apache Spark (Zaharia et al. 2010) has gained momentum due to its ease of use and better performance than Apache Hadoop. Apache Spark is one of the most active Apache projects in big data. It is rapidly becoming the first choice for processing large-scale data instead of Hadoop MapReduce. Several organizations are adopting Apache Spark for distributed computation, and it seems that Apache Spark is likely to replace MapReduce as a general-purpose data computation engine.