Apache Spark – Knowledge and References

Explore chapters and articles related to this topic

Role and Support of Image Processing in Big Data

Published in Ankur Dumka, Alaknanda Ashok, Parag Verma, Poonam Verma, Advanced Digital Image Processing and Its Applications in Big Data, 2020

Ankur Dumka, Alaknanda Ashok, Parag Verma, Poonam Verma

Hive is a batch-oriented processing model that supports a dialect of structured query language (SQL) and exhibits high latency. Apache Spark is a mini/micro-batch streaming model that uses Scala, Java, and Python for operation and integration. Apache Storm is another Apache project; it is a record-at-a-time processing model that can use any language for integration and operation and offers lower latency than Apache Spark. MapReduce is a parallel-processing model that uses languages such as Java, Ruby, Python, and C++ for its operation. Qubole is a stream-processing and ad hoc query-based processing model that supports languages such as Python, Scala, R, and Go. Flink is a batch and stream-processing model that supports languages such as Scala, Java, and Python for its operation.
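The difference between record-at-a-time processing (the Storm model) and micro-batch processing (the Spark streaming model) can be sketched in plain Python. This is a minimal illustration only: it uses no Spark or Storm API, and the function names `process_record_at_a_time` and `process_micro_batches`, the sample events, and the batch size are all hypothetical.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List


def process_record_at_a_time(records: Iterable[int],
                             handle: Callable[[int], int]) -> Iterator[int]:
    """Emit one result per record as soon as the record arrives."""
    for record in records:
        yield handle(record)


def process_micro_batches(records: Iterable[int],
                          handle_batch: Callable[[List[int]], int],
                          batch_size: int) -> Iterator[int]:
    """Buffer up to batch_size records, then emit one result per batch."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield handle_batch(batch)


events = [1, 2, 3, 4, 5, 6, 7]  # hypothetical input records
per_record = list(process_record_at_a_time(events, lambda x: x * 10))
per_batch = list(process_micro_batches(events, sum, batch_size=3))
print(per_record)  # [10, 20, 30, 40, 50, 60, 70]
print(per_batch)   # [6, 15, 7]
```

In the first function a result is available after every record, which is why record-at-a-time systems can reach lower latency; in the second, a result appears only once a batch has been filled (or the input ends), which trades latency for the ability to process records in groups.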

A Survey of Big Data and Computational Intelligence in Networking

Published in Yulei Wu, Fei Hu, Geyong Min, Albert Y. Zomaya, Big Data and Computational Intelligence in Networking, 2017

Yujia Zhu, Yulei Wu, Geyong Min, Albert Zomaya, Fei Hu

In light of the way networked big data are collected, it is natural to process these data in a distributed and parallel manner. Several well-known frameworks are available for distributed big data processing, e.g., Apache Hadoop [7], Apache Storm [10], Apache Spark [11], and Apache Flink [12]. Hadoop is the first major big data processing framework; it provides batch processing based on its MapReduce processing engine. Since it relies heavily on permanent storage, each task involves multiple read and write operations, so Hadoop is best suited to workloads where processing time is not a significant factor. In contrast to batch processing, stream processing systems compute over data as it enters the system, and thus can serve workloads with near real-time demands. Storm is the first major stream processing framework for big data analytics; it focuses on extremely low latency but does not provide a batch processing mode. Apache Spark provides a hybrid processing system: it is a batch processing framework with stream processing capabilities. Spark focuses on speeding up batch processing workloads by offering full in-memory computation and processing optimization, which makes it a good candidate for those with diverse processing workloads. Apache Flink offers a stream processing framework with support for traditional batch processing models. It treats batch processing as an extension of stream processing by reading a bounded data set off persistent storage.
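The idea of treating batch processing as stream processing over a bounded data set can be illustrated with a short, framework-free Python sketch. It does not use the Flink API; the operator name `running_total` and the sample data are assumptions made for illustration. The same streaming operator is applied first to a bounded input (the batch case) and then to an unbounded one.

```python
from itertools import count, islice
from typing import Iterable, Iterator


def running_total(stream: Iterable[int]) -> Iterator[int]:
    """Streaming operator: emit the running sum after each element."""
    total = 0
    for value in stream:
        total += value
        yield total


# Batch job: a bounded data set (stands in for a file on persistent storage).
bounded = [3, 1, 4, 1, 5]
batch_result = list(running_total(bounded))[-1]  # final value once input ends

# Streaming job: an unbounded source; only a prefix can ever be observed.
unbounded = count(start=1)  # 1, 2, 3, ...
first_results = list(islice(running_total(unbounded), 4))

print(batch_result)   # 14
print(first_results)  # [1, 3, 6, 10]
```

The operator itself is unchanged between the two runs; only the boundedness of the source differs, so the batch answer is simply the last value the stream produces before the input is exhausted.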

Modern Predictive Analytics and Big Data Systems Engineering

Published in Anna M. Doro-on, Handbook of Systems Engineering and Risk Management in Control Systems, Communication, Space Technology, Missile, Security and Defense Operations, 2023

Anna M. Doro-on

Apache Spark is an open-source, distributed processing system commonly used for big data workloads. It utilizes in-memory caching and optimized execution for fast performance, and it supports general batch processing, streaming analytics, machine learning, graph databases, and ad hoc queries (AWS 2018c). It is natively supported in Amazon EMR, and managed Apache Spark clusters can be created quickly and easily from the AWS Management Console, the AWS CLI, or the Amazon EMR API (AWS 2018c). With Amazon EMR, one can quickly provision hundreds or thousands of instances, automatically scale to match compute requirements, and shut the cluster down when the job is completed to avoid paying for idle capacity (AWS 2018d).

Storing, preprocessing and analyzing tweets: finding the suitable noSQL system

Published in International Journal of Computers and Applications, 2022

Souad Amghar, Safae Cherdal, Salma Mouline

There are many analysis tools, such as Hadoop [20], Apache Spark [21], and Apache Storm [22]. Hadoop is a software framework that provides large-scale distributed data analysis. Hadoop provides HDFS (Hadoop Distributed File System), which has a master-slave architecture that stores data and executes read and write instructions. Nevertheless, in some applications, we need to use other database systems instead of, or with, HDFS [20]. Apache Spark is a unified engine for distributed data processing. It provides APIs (Application Programming Interfaces) in many programming languages and also supports many tools, including structured data processing (Spark SQL), machine learning (MLlib), and graph processing (GraphX) [23]. Apache Storm is a stream processing system that can process unbounded streams of data very fast. Storm applications are called topologies. A Storm topology is a graph of tasks that process distributed streams of data [22].
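The notion of a topology as a graph of tasks can be sketched in plain Python. This is not the Storm API: the task functions `split_sentence` and `count_word`, the `edges` dictionary, and the `run_topology` helper are hypothetical names chosen for illustration, and everything runs in a single process rather than being distributed.

```python
from collections import Counter, deque
from typing import Callable, Dict, List

# Each task takes one item and returns zero or more items for downstream tasks.
Task = Callable[[object], List[object]]

word_counts: Counter = Counter()


def split_sentence(sentence: object) -> List[object]:
    """Intermediate task: break a sentence into lowercase words."""
    return list(str(sentence).lower().split())


def count_word(word: object) -> List[object]:
    """Terminal task: update the count and emit nothing downstream."""
    word_counts[str(word)] += 1
    return []


tasks: Dict[str, Task] = {"split": split_sentence, "count": count_word}
# Directed graph: source -> split -> count
edges: Dict[str, List[str]] = {"source": ["split"], "split": ["count"], "count": []}


def run_topology(source_items: List[object]) -> None:
    """Push each source item through the graph, one item at a time."""
    queue = deque(("source", item) for item in source_items)
    while queue:
        node, item = queue.popleft()
        for successor in edges[node]:
            for output in tasks[successor](item):
                queue.append((successor, output))


run_topology(["Storm processes streams", "Spark processes batches"])
print(word_counts["processes"])  # 2
```

Each item flows along the edges of the graph independently, which mirrors the record-at-a-time style of processing; in a real deployment the tasks would run in parallel on different machines.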

Distributed deep learning approach for intrusion detection system in industrial control systems based on big data technique and transfer learning

Published in Journal of Information and Telecommunication, 2023

Ahlem Abid, Farah Jemili, Ouajdi Korbaa

Apache Spark is a powerful, flexible, distributed data processing framework. It uses in-memory computation to run jobs, which makes it much faster than Apache Hadoop for many workloads. It is one of the most active open-source projects in the big data field. Spark provides support for a range of libraries, including the scalable machine learning library MLlib, which contains many machine learning algorithms, such as classification, clustering, and regression algorithms.