Advancing Smart and Resilient Cities with Big Spatial Disaster Data
Published in Amir H. Alavi, William G. Buttlar, Data Analytics for Smart Cities, 2018
The second transformation is the shift from batch processing toward streaming data (a synonym for real-time and near-real-time data) and interactive analysis. Batch processing is designed for “data at rest” and, as a result, tends to have “medium to high latency” (a response time ranging from seconds to a few hours). MapReduce is a typical framework for batch processing. The Apache Hadoop framework enables the execution of applications on large computer clusters through its implementation of the MapReduce computational paradigm. Because of this “medium to high latency,” stream processing emerged to satisfy fast-data needs. Wähner (2014) stated that stream processing is most suitable for processing streaming sensor data. Typical computing frameworks for stream processing include Apache Spark and Apache Storm. Another trend in big data analytics is interactive analysis. Interactive analysis, sometimes referred to as “human in the loop,” is a set of techniques that combines the computational power of machines with the perceptive and cognitive capabilities of humans in order to extract knowledge from large and complex datasets. The inclusion of human interaction can be more effective in dealing with unscheduled tasks and unpredictable disturbances, and it can overcome the bottlenecks of fully automated algorithms. Typical interactive analysis tools include Google’s Dremel and Apache Drill. A summary of the big data analytic tools is shown in Table 3.11.
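As a concrete illustration of the batch pattern, the following minimal Python sketch (ours, not from the chapter; the documents and word-count task are invented) mimics the MapReduce computation on a small in-memory dataset. Real deployments would run this on Hadoop or Spark rather than a single process.

# A minimal, hypothetical illustration of the MapReduce batch pattern in
# plain Python; the input "documents" stand in for data at rest.
from collections import defaultdict
from itertools import chain

def map_phase(document):
    # Emit (word, 1) pairs for every word in one input record.
    return [(word, 1) for word in document.split()]

def reduce_phase(pairs):
    # Sum the counts for each word across all mapper outputs.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

# Batch processing: the whole dataset is read before reducing begins,
# which is the source of the "medium to high latency" noted above.
documents = ["sensor reading ok", "sensor fault", "sensor reading ok"]
print(reduce_phase(chain.from_iterable(map_phase(d) for d in documents)))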
A Survey of Big Data and Computational Intelligence in Networking
Published in Yulei Wu, Fei Hu, Geyong Min, Albert Y. Zomaya, Big Data and Computational Intelligence in Networking, 2017
Yujia Zhu, Yulei Wu, Geyong Min, Albert Zomaya, Fei Hu
In light of the way networked big data are collected, it is straightforward to process these data in a distributed and parallel manner. Several well-known frameworks are available for distributed big data processing, e.g., Apache Hadoop [7], Apache Storm [10], Apache Spark [11], and Apache Flink [12]. Hadoop is the first major big data processing framework; it provides batch processing based on its MapReduce processing engine. Since it heavily leverages permanent storage, each task involves multiple rounds of reading and writing, so Hadoop is best suited to workloads where time is not a significant factor. In contrast to batch processing, stream processing systems compute over data as it enters the system and thus can serve processing with near-real-time demands. Storm is the first major stream processing framework for big data analytics; it focuses on extremely low latency but does not provide a batch processing mode. Apache Spark provides a hybrid processing system: it is a batch processing framework with stream processing capabilities. Spark focuses on speeding up batch workloads by offering full in-memory computation and processing optimization, which makes it a good candidate for diverse processing workloads. Apache Flink offers a stream processing framework with support for traditional batch processing models; it treats batch processing as an extension of stream processing by reading a bounded data set off persistent storage.
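To make Spark's hybrid character concrete, the PySpark sketch below (a hypothetical example, not from the chapter; the file name logs.txt and the localhost socket source are placeholders) applies one and the same word-count transformation first to a bounded batch input and then to an unbounded stream via Spark Structured Streaming.

# Hypothetical PySpark sketch of Spark's hybrid model: identical logic
# for bounded (batch) and unbounded (streaming) data.
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("hybrid-demo").getOrCreate()

def word_counts(df):
    # The same transformation works on batch and streaming DataFrames.
    words = df.select(explode(split(df.value, r"\s+")).alias("word"))
    return words.groupBy("word").count()

# Batch: a bounded data set read from persistent storage.
batch = word_counts(spark.read.text("logs.txt"))
batch.show()

# Streaming: the same computation over an unbounded socket source.
stream = word_counts(spark.readStream.format("socket")
                     .option("host", "localhost")
                     .option("port", 9999).load())
query = stream.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()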
The Art of In-Memory Computing for Big Data Processing
Published in Kuan-Ching Li, Hai Jiang, Albert Y. Zomaya, Big Data Management and Processing, 2017
Mihaela-Andreea Vasile, Florin Pop
The stream processing pattern refers to processing input data without storing it completely: online machine learning, real-time analytics, processing log streams, or streams of different events [20]. Traditional batch systems can be enhanced toward micro-batch processing, but there is also a need for native stream processing systems. MapReduce might be enhanced to group the incoming stream into small batches. The authors of [21] proposed a prototype of an online Hadoop suited for online/pipelined aggregations. For computations that require a single MapReduce job, the map and reduce phases are completely decoupled; the reduce step no longer pulls the map result, but rather the map worker pushes its output into the reduce phase. For multijob online computations, storing the intermediate reduce results in HDFS is skipped; instead, the result is pushed into the next map phase.
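The following Python sketch is a hypothetical illustration of the micro-batching idea only; the actual prototype of [21] modifies Hadoop internals. The record layout and the running-average aggregate are invented for the example.

# Hypothetical sketch: group an incoming stream into small batches and
# pipeline map output straight into the reducer, in the spirit of the
# online Hadoop prototype described above.
from collections import defaultdict
from itertools import islice

def micro_batches(stream, size=3):
    # Cut the unbounded stream into small bounded batches.
    it = iter(stream)
    while batch := list(islice(it, size)):
        yield batch

def mapper(record):
    yield (record["sensor"], record["value"])

def process(stream):
    running = defaultdict(list)
    for batch in micro_batches(stream):
        # Map output is pushed into the reducer as soon as it is
        # produced, rather than being written out and pulled later.
        for record in batch:
            for key, value in mapper(record):
                running[key].append(value)
        # Emit an online aggregate after each micro-batch.
        yield {k: sum(v) / len(v) for k, v in running.items()}

events = [{"sensor": "s1", "value": 1.0}, {"sensor": "s2", "value": 4.0},
          {"sensor": "s1", "value": 3.0}, {"sensor": "s1", "value": 5.0}]
for snapshot in process(events):
    print(snapshot)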
A review on big data real-time stream processing and its scheduling techniques
Published in International Journal of Parallel, Emergent and Distributed Systems, 2020
Nicoleta Tantalaki, Stavros Souravlas, Manos Roumeliotis
A stream processing system, or data stream management system (DSMS), is designed to handle data streams and manage continuous queries. It executes continuous queries that are not performed only once but run continuously until they are explicitly uninstalled. It produces results as long as new data arrives in the system, and the data are processed on the fly without the need to store them first; data are usually stored after processing. Stream processing systems differ from batch processing systems because of the requirement of real-time data processing. The term ‘real-time processing system’ refers to a system that responds within ‘real-world’ time deadlines: it guarantees that a certain process will be executed within a given period, perhaps a few seconds, depending on the quality-of-service constraints. The term ‘real-time’ is used somewhat loosely, and many systems adopt it simply to describe themselves as low-latency systems. Elaborate and agile systems have been proposed to meet these new demands.
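The behavior of a continuous query can be mimicked in a few lines of Python; the sketch below is a hypothetical illustration (the window size and readings are invented), not a real DSMS. Once started, it keeps producing results for as long as new data arrives.

# Hypothetical sketch of a continuous query: a sliding-window average
# over an unbounded stream of readings, emitting a result per tuple.
from collections import deque

def continuous_avg(stream, window=5):
    # The query stays active until the caller stops consuming it
    # (the analogue of explicitly uninstalling the query).
    recent = deque(maxlen=window)
    for value in stream:
        recent.append(value)             # process each tuple on the fly
        yield sum(recent) / len(recent)  # emit a result immediately

# Any (possibly infinite) iterable can stand in for the incoming stream.
readings = iter([10, 12, 11, 15, 14, 13, 20])
for result in continuous_avg(readings, window=3):
    print(f"window average: {result:.2f}")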
A Survey on Cloud Computing Applications in Smart Distribution Systems
Published in Electric Power Components and Systems, 2018
Jeovane V. de Sousa, Denis V. Coury, Ricardo A. S. Fernandes
Hadoop is an open-source framework, maintained by the Apache Software Foundation, that allows scalable distributed computing. As part of Hadoop, there is a distributed open-source NoSQL database called HBase, capable of handling huge datasets with billions of rows and millions of columns, which runs on top of the Hadoop Distributed File System (HDFS). The Hadoop framework supports the MapReduce programming paradigm for creating and executing applications and has been used by many large sites such as Amazon, Facebook, and Yahoo [81]. MapReduce is a programming model introduced by Google for large-scale data processing over computer clusters [82]. Hadoop/MapReduce is intended for cases where data processing is highly parallelizable, i.e., where each process executes independently of the others. In other situations, it would be necessary to analyze in detail whether this solution is suitable. For example, when data need to be processed quickly and/or continuously, stream processing, where data can be analyzed as it enters the data workflow, is a better solution. Depending on the situation, however, a hybrid solution could be more suitable. A comparison between different solutions for stream processing is presented in [71].
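The following hypothetical Python sketch shows the "highly parallelizable" property in miniature: each record is processed independently, so the work can be split across workers with no coordination between them. The smart-meter records and tariff are invented for the example.

# Hypothetical sketch of an embarrassingly parallel workload, the kind
# of processing Hadoop/MapReduce targets.
from multiprocessing import Pool

def process_meter_reading(reading):
    # Each record is transformed with no dependence on any other record,
    # which is exactly what makes MapReduce-style scaling straightforward.
    meter_id, kwh = reading
    return meter_id, kwh * 0.15  # e.g., convert consumption to cost

if __name__ == "__main__":
    readings = [("m1", 120.0), ("m2", 85.5), ("m3", 240.2), ("m4", 99.9)]
    with Pool(processes=4) as pool:
        for meter_id, cost in pool.map(process_meter_reading, readings):
            print(meter_id, round(cost, 2))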
Parallel and Distributed Powerset Generation Using Big Data Processing
Published in Applied Artificial Intelligence, 2019
Youssef M. Essa, Ahmed El-Mahalawy, Gamal Attiya, Ayman El-Sayed
The third model is a distributed computation model based on a big data stream processing platform. Spark is one of the most popular platforms used in stream processing (Cristian et al. 2016). These platforms allow stream processing code to run across distributed machines. The stream processing model is designed to generate the powerset on real-time streaming data, as shown in Algorithm 5. The main idea is to partition the powerset into subsets using the powerset theorem and to send each portion as an event to the stream processing platform. Each subset uses all machines in the cluster to compute its value, and at the same time the value of the subset is used to update the final powerset of the set S via the set-contents union operation.
Algorithm 5: Forming powersets based on stream processing.
Require: input set (S); a temporary array E of size i on each machine corresponding to the event.
Ensure: User-defined code will be executed for each element of E ∈ Pj(S) to find P(S).
1. Partition S into subsets S[m] and send each subset after reading
2.
3. if () // d is the number of elements in the subset
4. END if
5. END for
6. Start the stream processing program to process the event subset E[Si]
7. Distribute the subset across the distributed cluster
8. Each machine generates a list of possible elements
9. Combine the results from all executed tasks to find the sub-powerset using the union operation
10. Update the final powerset P(S) using the value of the current events
11. END stream processing
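The plain-Python sketch below is a loose, hypothetical rendering of the idea behind Algorithm 5 (no Spark cluster involved): partition the powerset by subset size, process each partition as an independent event, and merge the partial results by union into the final powerset P(S). The partitioning by size and the sequential event loop are our assumptions.

# Hypothetical sketch of event-based powerset generation. Each "event"
# carries all subsets of one size; events are independent, so in a real
# deployment each could be handled by a different machine in the cluster.
from itertools import combinations

def partition_events(S):
    # One event per subset size d = 0..|S|: all d-element subsets of S.
    for d in range(len(S) + 1):
        yield [frozenset(c) for c in combinations(S, d)]

def process_stream(S):
    powerset = set()
    for event in partition_events(S):
        # Partial results are order-independent, so a simple union
        # merges them into the final powerset.
        powerset |= set(event)
    return powerset

S = {1, 2, 3}
for subset in sorted(process_stream(S), key=lambda s: (len(s), sorted(s))):
    print(set(subset) if subset else "{}")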