Modern Predictive Analytics and Big Data Systems Engineering
Published in Anna M. Doro-on, Handbook of Systems Engineering and Risk Management in Control Systems, Communication, Space Technology, Missile, Security and Defense Operations, 2023
Apache Spark is an open-source, distributed processing system commonly used for big data workloads that utilizes in-memory caching and optimized execution for fast performance, and it supports general batch processing, streaming analytics, machine learning, graph databases, and ad hoc queries (AWS 2018c). It is natively supported in Amazon EMR, and you can quickly and easily create managed Apache Spark clusters from the AWS Management Console, AWS CLI, or the Amazon EMR API (AWS 2018c). With Amazon EMR, one can quickly provision hundreds or thousands of instances, automatically scale to match compute requirements, and shut the cluster down when the job is completed (to avoid paying for idle capacity) (AWS 2018d).
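The provisioning workflow described above can be sketched against the Amazon EMR API using boto3. This is a minimal illustration, not the excerpt's own example: the region, EMR release label, instance types, IAM role names, and the S3 path of the Spark script are assumed placeholders.

# Sketch: provisioning a transient Spark cluster through the Amazon EMR API
# with boto3. Release label, instance types, role names, and the S3 script
# path are placeholders, not values from the excerpt.
import boto3

emr = boto3.client("emr", region_name="us-east-1")  # assumed region

response = emr.run_job_flow(
    Name="transient-spark-cluster",
    ReleaseLabel="emr-6.10.0",                      # assumed EMR release
    Applications=[{"Name": "Spark"}],               # install Spark on the cluster
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        # Terminate the cluster once the submitted steps finish,
        # so no idle capacity is billed.
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    Steps=[{
        "Name": "spark-job",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://my-bucket/jobs/etl.py"],  # placeholder script
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",              # default EMR roles assumed to exist
    ServiceRole="EMR_DefaultRole",
    VisibleToAllUsers=True,
)
print("Cluster started:", response["JobFlowId"])

Because the cluster carries a step and keeps no job flow alive after it, it provisions, runs the job, and shuts itself down, matching the pay-only-for-use pattern described above.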
Understanding Distributed Semantic Analysis with Spark Data Frames
Published in Nedunchezhian Raju, M. Rajalakshmi, Dinesh Goyal, S. Balamurugan, Ahmed A. Elngar, Bright Keswani, Empowering Artificial Intelligence Through Machine Learning, 2022
Richa Mathur, Devesh K. Bandil, Dhanesh Kumar Solanki
Spark is a framework for performing general data analytics on Hadoop-like distributed computing clusters; it provides in-memory computation for data processing and increases speed over MapReduce. It is a fast, general engine for large-scale data processing that supports multiple languages, is faster than Hadoop, and runs on top of a Hadoop cluster. Features of Spark include in-memory computation, real-time stream processing, and advanced analytics (with SQL queries, ML algorithms, graph algorithms, and streaming data). Spark works like a "library" that enables parallel computations via function calls, for example through the MLlib API. The main feature of Spark is the Resilient Distributed Dataset (RDD), which stores data in memory in a fault-tolerant (able to recover from failures) and parallel way.
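A minimal PySpark sketch of the RDD abstraction described above: data is partitioned across the cluster, cached in memory, and transformed in parallel. The data source and session settings are illustrative assumptions only.

# RDD sketch: partitioned, in-memory, parallel computation.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-demo").getOrCreate()
sc = spark.sparkContext

# Build an RDD from a local collection; in practice it would come from HDFS or S3.
numbers = sc.parallelize(range(1_000_000), numSlices=8)

# cache() keeps the partitions in memory after the first action,
# so later computations reuse the cached data instead of recomputing it.
squares = numbers.map(lambda x: x * x).cache()

print("count:", squares.count())   # first action materialises and caches the RDD
print("sum  :", squares.sum())     # served from the in-memory copy

spark.stop()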
Big Data Stream Processing
Published in Vivek Kale, Parallel Computing Architectures and APIs, 2019
Spark Streaming is an extension of the core Spark engine that enables this large-scale data processing engine to process live data streams. The role of Spark Streaming is very similar to that of the client adapter used by Yahoo! S4. Spark Streaming is currently developed by the Apache Software Foundation, which is responsible for the testing, updates, and release of each Spark version.
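The following is a hedged sketch of a classic Spark Streaming (DStream) job, in which a live text stream is processed by the core Spark engine in micro-batches. The socket host/port and the 5-second batch interval are placeholder assumptions, not details from the excerpt.

# Spark Streaming sketch: word counts over 5-second micro-batches.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="streaming-wordcount")
ssc = StreamingContext(sc, batchDuration=5)        # 5-second micro-batches

lines = ssc.socketTextStream("localhost", 9999)    # assumed socket source
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()                                    # print each batch's word counts

ssc.start()
ssc.awaitTermination()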
Efficient computation of comprehensive statistical information of large OWL datasets: a scalable approach
Published in Enterprise Information Systems, 2023
Heba Mohamed, Said Fathalla, Jens Lehmann, Hajira Jabeen
A variety of approaches have attempted to compute statistics about RDF datasets (Langegger and Wöss 2009; Auer et al. 2012; Sejdiu et al. 2018). Although these approaches are interesting, they do not allow for computing statistics over large-scale OWL datasets. To the best of our knowledge, previous work has failed to address statistical computations for OWL datasets. Most studies have only tended to focus on triple structure analysis rather than the axiom structure of the datasets. OWLStats is the first attempt to develop a distributed approach for providing comprehensive statistical information about large-scale OWL datasets. To achieve scalability, we have implemented our approach using Apache Spark, a distributed in-memory computing framework. Spark is horizontally scalable and can run on multiple machine clusters, i.e., the workload is spread across multiple machine memories. Due to its efficiency in handling large-scale datasets and scalability, Apache Spark has recently gained considerable attention. The primary abstraction that Spark provides is the Resilient Distributed Dataset (RDD). Additional advantages of using RDDs are in-memory computation, fault tolerance, distributed partitioning, and persistence.
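The sketch below is illustrative only and is not the OWLStats implementation; it shows the underlying pattern the excerpt relies on, namely spreading a large dataset across worker memory as an RDD and aggregating simple statistics in parallel. The input path and the line-based axiom-keyword heuristic are assumptions.

# Illustrative RDD-based statistics sketch (NOT the OWLStats code).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("owl-stats-sketch").getOrCreate()
sc = spark.sparkContext

# Each line is assumed to hold one axiom in OWL functional syntax, e.g.
# "SubClassOf(:Car :Vehicle)"; the keyword before "(" identifies the axiom type.
axioms = sc.textFile("hdfs:///data/ontology.ofn").persist()   # placeholder path

axiom_type_counts = (axioms
    .map(lambda line: line.strip())
    .filter(lambda line: "(" in line)
    .map(lambda line: (line.split("(", 1)[0], 1))
    .reduceByKey(lambda a, b: a + b))

for axiom_type, count in axiom_type_counts.collect():
    print(axiom_type, count)

spark.stop()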
Storing, preprocessing and analyzing tweets: finding the suitable noSQL system
Published in International Journal of Computers and Applications, 2022
Souad Amghar, Safae Cherdal, Salma Mouline
There are many analysis tools, such as Hadoop [20], Apache Spark [21], and Apache Storm [22]. Hadoop is a software framework that provides large-scale distributed data analysis. Hadoop provides HDFS (Hadoop Distributed File System), a master–slave architecture that stores data and executes read and write instructions. Nevertheless, in some applications, we need to use other database systems instead of, or with, HDFS [20]. Apache Spark is a unified engine for distributed data processing. It provides APIs (Application Programming Interfaces) in many programming languages and also supports many tools, including structured data processing (Spark SQL), machine learning (MLlib), and graph processing (GraphX) [23]. Apache Storm is a stream processing system that can process unbounded streams of data very fast. Storm applications are called topologies. A Storm topology is a graph of tasks that process distributed streams of data [22].
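A hedged sketch of the Spark SQL / DataFrame API mentioned above, applied to tweets stored as JSON. The file path and field names (lang, user.screen_name) are assumptions about a typical tweet schema, not details from the cited works.

# Spark SQL sketch over JSON tweets (schema and path assumed).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("tweet-analysis").getOrCreate()

tweets = spark.read.json("hdfs:///data/tweets/*.json")     # placeholder path

# Count tweets per language with the DataFrame API ...
tweets.groupBy("lang").agg(F.count("*").alias("n")).orderBy(F.desc("n")).show()

# ... or with plain SQL over a temporary view.
tweets.createOrReplaceTempView("tweets")
spark.sql("SELECT user.screen_name AS author, COUNT(*) AS n "
          "FROM tweets GROUP BY user.screen_name ORDER BY n DESC LIMIT 10").show()

spark.stop()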
An edge streaming data processing framework for autonomous driving
Published in Connection Science, 2021
Hang Zhao, LinBin Yao, ZhiXin Zeng, DongHua Li, JinLiang Xie, WeiLing Zhu, Jie Tang
At present, researchers have conducted many studies on streaming data processing frameworks established in the cloud data centre. Spark Streaming is one streaming data processing framework running on Spark, developed by the University of California, Berkeley (Zaharia et al., 2013). However, there is little research on such frameworks in the edge data centre. In this paper, based on sensor data generated by automated vehicles, we propose a streaming data processing framework with the following two advantages: (1) based on the gray model (GM), within the coverage scope of a certain edge node, we implement traffic flow monitoring and prediction for autonomous driving vehicles, so that the system can realise its flexibility by adjusting the resource utilisation strategy according to the variation of the data stream; (2) the fuzzy control method is adopted to dynamically adjust the batch interval of Spark Streaming according to changes in the data streams and system workload, which helps reduce end-to-end delay while satisfying the throughput requirement.
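The following standalone sketch illustrates only the general control idea behind the second point, not the paper's controller; note that stock Spark Streaming fixes the batch interval when the StreamingContext is created, so any runtime adjustment requires custom machinery. The membership breakpoints and adjustment factors are invented for illustration.

# Fuzzy-style batch-interval adjustment sketch (assumed values, not the paper's design).
def membership(load_ratio):
    """Simple memberships for the fuzzy sets 'low', 'ok', 'high' load."""
    low = max(0.0, min(1.0, (0.6 - load_ratio) / 0.3))
    high = max(0.0, min(1.0, (load_ratio - 0.8) / 0.3))
    ok = max(0.0, 1.0 - low - high)
    return low, ok, high

def adjust_interval(interval_s, processing_time_s):
    """Return a new batch interval based on the measured processing time."""
    load_ratio = processing_time_s / interval_s
    low, ok, high = membership(load_ratio)
    # Rule base: low load  -> shrink interval (lower latency),
    #            ok load   -> keep it,
    #            high load -> grow interval (protect throughput).
    change = (-0.2 * low) + (0.0 * ok) + (0.3 * high)
    return max(0.5, interval_s * (1.0 + change))

# Example: a 4 s interval whose batches take 3.8 s to process nearly overruns,
# so the controller lengthens the interval.
print(adjust_interval(4.0, 3.8))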