Explore chapters and articles related to this topic
Big Data, Cloud, Semantic Web, and Social Network Technologies
Published in Bhavani Thuraisingham, Murat Kantarcioglu, Latifur Khan, Secure Data Science, 2022
Bhavani Thuraisingham, Murat Kantarcioglu, Latifur Khan
Apache Storm is an open-source distributed real-time computation system for processing massive amounts of data. Storm is essentially a framework for processing streaming data and performing real-time analytics. It can be integrated with HDFS and provides features such as scalability, reliability, and fault tolerance. The latest version of Storm supports streaming SQL, predictive modeling, and integration with systems such as Kafka. In summary, Storm is for real-time processing and Hadoop is for batch processing. More details on Storm can be found in [STOR].
Storing, preprocessing and analyzing tweets: finding the suitable noSQL system
Published in International Journal of Computers and Applications, 2022
Souad Amghar, Safae Cherdal, Salma Mouline
There are many analysis tools, such as Hadoop [20], Apache Spark [21], and Apache Storm [22]. Hadoop is a software framework that provides large-scale distributed data analysis. Hadoop provides HDFS (Hadoop Distributed File System), which has a master-slave architecture that stores data and executes read and write instructions. Nevertheless, in some applications, we need to use other database systems instead of, or together with, HDFS [20]. Apache Spark is a unified engine for distributed data processing. It provides APIs (Application Programming Interfaces) in many programming languages and also supports many tools, including structured data processing (Spark SQL), machine learning (MLlib), and graph processing (GraphX) [23]. Apache Storm is a stream processing system that can process unbounded streams of data very fast. Storm applications are called topologies. A Storm topology is a graph of tasks that process distributed streams of data [22].
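The idea of a topology as a graph of tasks can be illustrated with a minimal sketch in plain Python. This is not Storm's API: the `Topology` class, the task names, and the sample tweets are all hypothetical, and the sketch runs in a single process, whereas a real topology is distributed across a cluster.

```python
from collections import defaultdict


class Topology:
    """Toy stand-in for a topology: a directed graph of processing tasks.

    Hypothetical class for illustration only; it is not part of Apache Storm.
    """

    def __init__(self):
        self.tasks = {}                 # task name -> function(item) -> list of outputs
        self.edges = defaultdict(list)  # task name -> names of downstream tasks

    def add_task(self, name, fn, upstream=()):
        self.tasks[name] = fn
        for parent in upstream:
            self.edges[parent].append(name)

    def emit(self, name, item, sink):
        """Run `item` through task `name` and forward each output downstream."""
        for out in self.tasks[name](item):
            children = self.edges[name]
            if not children:
                sink.append((name, out))  # leaf task: collect the result
            for child in children:
                self.emit(child, out, sink)


def split_words(tweet):
    # First task: break a tweet into lowercase words.
    return tweet.lower().split()


def keep_hashtags(word):
    # Second task: pass through hashtags, drop everything else.
    return [word] if word.startswith("#") else []


topology = Topology()
topology.add_task("split", split_words)
topology.add_task("hashtags", keep_hashtags, upstream=["split"])

results = []
for tweet in ["Learning #Storm today", "#storm and #Spark"]:
    topology.emit("split", tweet, results)

print(results)
```

Each tweet enters at the `split` task and its words flow along the edge to `hashtags`; only the outputs of the leaf task are collected.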
Programming models and systems for Big Data analysis
Published in International Journal of Parallel, Emergent and Distributed Systems, 2019
Loris Belcastro, Fabrizio Marozzo, Domenico Talia
Apache Storm is an open-source system for real-time stream processing of large volumes of data. Storm is designed to ensure a high degree of scalability and fault tolerance, together with high-speed data processing and low-latency response times.
A review on big data real-time stream processing and its scheduling techniques
Published in International Journal of Parallel, Emergent and Distributed Systems, 2020
Nicoleta Tantalaki, Stavros Souravlas, Manos Roumeliotis
The Stream-Dataflow Approach, where an application is viewed as a dataflow graph with operators and data dependencies between them (sometimes referred to as the operator-based approach). A task encapsulates the logic of a predefined operator such as filter, window, aggregate, or join, or even a routine with user-specified logic. A data stream between two operators represents an infinite sequence of data produced by a task, which is available for further consumption. Data is delivered and consumed in arbitrary order across parallel tasks, and as a result, there is no coarse-grained unit for transactional processing. Everything is automatically pipelined.
The Micro-Batch Approach, which offers a way to process data streams on batch processing systems. With micro-batching, we can treat a streaming computation as a sequence of transformations on bounded sets by discretising a distributed data stream into batches and then scheduling these batches sequentially in a cluster of worker nodes.
The progress of real-time stream processing systems in the cloud has been relatively slow, but several solutions are now available. Apache Storm [30] is a real-time distributed computing technology for processing streaming messages on a continuous basis. Individual logical processing units are connected like a pipeline to express a series of transformations and expose opportunities for concurrent processing. Heron is a processing engine for streaming and real-time data at scale that was developed at Twitter as a replacement for Apache Storm. Spark Streaming [29] is another solution that makes it easy to build streaming applications using the micro-batching approach. The idea behind it is to process data in the same fashion as batch processing while keeping the batch sizes very small. Apache Samza [32] is a distributed stream-processing framework that provides a simple API, comparable to MapReduce.
There are also commercial solutions such as Amazon Kinesis [38], a fully managed service for real-time processing of streaming data at massive scale, and IBM InfoSphere Streams [39]. Research and development in the area of stream processing is ongoing and is becoming increasingly important in the aforementioned context of IoT. In the following section, we outline the issues and requirements that stream processing systems have to meet to excel at real-time stream processing applications. Our work relies mostly on the analysis of Stonebraker et al. [4]. We also present the mechanisms used in the big data era to address these challenges.
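The difference between the stream-dataflow and micro-batch approaches described above can be sketched in plain Python. This is only an illustration under simplifying assumptions: the source `sensor_stream`, the operators `filter_op` and `map_op`, and the helper `micro_batches` are hypothetical names, the stream is finite, everything runs in one process, and batches are cut by record count, whereas real micro-batch systems typically cut them by time interval.

```python
from itertools import islice


def sensor_stream():
    """Hypothetical source; a real stream would be unbounded."""
    for value in [3, 8, 1, 9, 4, 7, 2]:
        yield value


# Stream-dataflow style: operators are chained and each record flows
# through the whole pipeline as soon as it arrives.
def filter_op(stream, predicate):
    for record in stream:
        if predicate(record):
            yield record


def map_op(stream, fn):
    for record in stream:
        yield fn(record)


dataflow_out = list(
    map_op(filter_op(sensor_stream(), lambda v: v > 2), lambda v: v * 10)
)


# Micro-batch style: the stream is discretised into small bounded batches,
# and each batch is processed as an ordinary batch job, one after another.
def micro_batches(stream, batch_size):
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def process_batch(batch):
    # Same filter and map logic as above, followed by a per-batch aggregate.
    return sum(v * 10 for v in batch if v > 2)


batches = list(micro_batches(sensor_stream(), batch_size=3))
batch_out = [process_batch(batch) for batch in batches]

print(dataflow_out)
print(batches)
print(batch_out)
```

In the first half there is no unit larger than a single record; in the second, each batch is a bounded set, which is what lets a batch engine schedule and aggregate it as a whole.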