A Survey of Big Data and Computational Intelligence in Networking
Published in Yulei Wu, Fei Hu, Geyong Min, Albert Y. Zomaya, Big Data and Computational Intelligence in Networking, 2017
Yujia Zhu, Yulei Wu, Geyong Min, Albert Zomaya, Fei Hu
Given the way networked big data are collected, it is natural to process these data in a distributed and parallel manner. Several well-known frameworks are available for distributed big data processing, e.g., Apache Hadoop [7], Apache Storm [10], Apache Spark [11], and Apache Flink [12]. Hadoop is the first major big data processing framework; it provides batch processing based on its MapReduce processing engine. Since it relies heavily on permanent storage, each task involves multiple rounds of reading from and writing to disk, so Hadoop suits workloads where processing time is not a critical constraint. In contrast to batch processing, stream processing systems compute over data as they enter the system, and can thus serve workloads with near-real-time demands. Storm is the first major stream processing framework for big data analytics; it focuses on extremely low latency but does not provide a batch processing mode. Apache Spark provides a hybrid processing system: it is a batch processing framework with stream processing capabilities. Spark focuses on speeding up batch processing workloads by offering full in-memory computation and processing optimization, which makes it a good candidate for diverse processing workloads. Apache Flink offers a stream processing framework with support for traditional batch processing models; it treats batch processing as an extension of stream processing, reading a bounded data set off persistent storage.
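The MapReduce model behind Hadoop's batch engine can be illustrated without Hadoop itself. The following is a minimal pure-Python sketch of the three phases (map, shuffle, reduce) applied to a word count; the function names are illustrative, not part of any Hadoop API:

```python
from collections import defaultdict

def map_phase(documents):
    # Map: emit a (word, 1) pair for every word in every document.
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle(pairs):
    # Shuffle: group the intermediate values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate the grouped values for each key.
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data processing", "stream processing", "big data"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts)  # {'big': 2, 'data': 2, 'processing': 2, 'stream': 1}
```

In real Hadoop each phase runs on many machines, with the intermediate pairs written to disk between phases, which is exactly the read/write overhead the passage refers to.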
Data Lakes: A Panacea for Big Data Problems, Cyber Safety Issues, and Enterprise Security
Published in Mohiuddin Ahmed, Nour Moustafa, Abu Barkat, Paul Haskell-Dowland, Next-Generation Enterprise Security and Governance, 2022
A. N. M. Bazlur Rashid, Mohiuddin Ahmed, Abu Barkat Ullah
Nowadays, stream processing frameworks such as Apache Spark or Apache Flink are used for real-time data loading. The data required by analytics systems are transformed on the fly at query time. A data lake may also include semantic databases, a conceptual model, and a layer of context that defines the relationships among data. A data lake can eventually contain all data types from SQL and NoSQL databases and combine OLTP with OLAP; here, SQL databases store structured data, while NoSQL databases store semi-structured and unstructured data [6,16]. The merits and demerits of the associated tools and technologies (HDFS, MapReduce, Apache Spark, Apache Flink, SQL, NoSQL, OLTP, and OLAP) are listed in Table 6.2 [28–30].
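The "transformed on the fly during query time" pattern is often called schema-on-read: raw records of any shape are ingested as-is, and a schema is imposed only when a query runs. A minimal sketch in plain Python (the record contents and function name are hypothetical):

```python
import json

# Data lake: raw records are stored untouched, whatever their shape.
raw_lake = [
    '{"user": "alice", "amount": 30}',                   # structured
    '{"user": "bob", "note": "refund", "amount": 12}',   # extra field
    '{"user": "carol"}',                                 # missing field
]

def query_total_amount(lake):
    # Schema-on-read: each record is interpreted at query time,
    # tolerating missing or extra fields instead of rejecting them at load.
    total = 0
    for line in lake:
        record = json.loads(line)
        total += record.get("amount", 0)
    return total

print(query_total_amount(raw_lake))  # 42
```

A traditional warehouse would instead validate and transform records at load time (schema-on-write), rejecting the third record above.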
Big Data Stream Processing
Published in Vivek Kale, Parallel Computing Architectures and APIs, 2019
Apache Flink is a highly scalable, high-performance processing engine that can handle low-latency stream processing as well as batch analytics. Flink is a relatively new project that originated as a joint effort of several German and Swedish universities under the name “Stratosphere.” The project changed its name to Flink (meaning “agile” or “swift” in German) when it entered incubation as an Apache project in 2014. Flink became a top-level Apache project later that year and now has an international team of collaborators.
A review on big data real-time stream processing and its scheduling techniques
Published in International Journal of Parallel, Emergent and Distributed Systems, 2020
Nicoleta Tantalaki, Stavros Souravlas, Manos Roumeliotis
Flink's basic data abstraction for stream processing is called DataStream. Flink executes arbitrary dataflow programs in a data-parallel and pipelined manner, which yields low latency. Apache Flink's dataflow programming model provides event-at-a-time processing [61]. Tuples can be collected in buffers with an adjustable timeout before they are sent to the next operator, turning the knob between throughput and latency. Flink performs at large scale, running on thousands of nodes with very good throughput and latency characteristics according to existing benchmarks. For stateful computations, it ensures exactly-once semantics. Apache Flink includes a lightweight fault tolerance mechanism based on distributed checkpoints. Its algorithm builds on a technique introduced by Chandy and Lamport [62]: it periodically draws consistent snapshots of the current state of the distributed system, without missing information and without recording duplicates, and stores these snapshots to durable storage. In case of failure, the latest snapshot is restored, and the stream source is rewound to the point when the snapshot was taken and replayed [23]. Flink is currently a unique option in the processing framework world, but it is still a young project and there has not been much research into its scaling limitations. Like Spark, it is a declarative system, providing higher-level abstractions to users. The DAG is implied by the ordering of the transformations, and the engine can reorder transformations if needed.
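The checkpoint-and-replay recovery described above can be illustrated with a toy single-operator simulation (plain Python, not Flink's actual implementation): a snapshot captures the operator state together with the source offset, so after a failure the state is restored and the source is rewound and replayed without counting any event twice.

```python
class ReplayableSource:
    """A source that can be rewound to an earlier offset (e.g., a durable log)."""
    def __init__(self, events):
        self.events = events
        self.offset = 0

    def read(self):
        event = self.events[self.offset]
        self.offset += 1
        return event

    def rewind(self, offset):
        self.offset = offset

source = ReplayableSource([10, 20, 30, 40, 50])
running_sum = 0  # the operator's state: a running sum of events

# Process two events, then draw a consistent snapshot: state + source offset.
for _ in range(2):
    running_sum += source.read()
snapshot = (running_sum, source.offset)

# Process two more events, then a failure wipes the in-memory state.
for _ in range(2):
    running_sum += source.read()
running_sum = None  # crash: operator state is lost

# Recovery: restore the latest snapshot, rewind the source, and replay.
running_sum, saved_offset = snapshot
source.rewind(saved_offset)
while source.offset < len(source.events):
    running_sum += source.read()

print(running_sum)  # 150: every event contributes exactly once despite the failure
```

Pairing the state with the source offset in a single snapshot is what makes the result exactly-once: events after the snapshot are replayed, while events before it are never reprocessed.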
Programming models and systems for Big Data analysis
Published in International Journal of Parallel, Emergent and Distributed Systems, 2019
Loris Belcastro, Fabrizio Marozzo, Domenico Talia
Apache Flink is an open source stream processing system for large volumes of data. Flink allows programmers to implement distributed, high-performing, and highly available data streaming applications. It provides a streaming dataflow paradigm that processes events one at a time, rather than as a series of batches of events, on both finite and infinite datasets. The programming paradigm is quite simple and is based on three abstractions:
Data source: the incoming data that Flink processes in parallel.
Transformation: the processing entity, where incoming data is modified.
Data sink: the output of a Flink task.
The core of Flink is a distributed streaming dataflow runtime, which is an alternative to that provided by Hadoop MapReduce. Despite having its own runtime, however, Flink can work on a cluster or Cloud infrastructure managed by YARN and access data on HDFS. Flink provides programmers with a series of APIs: the DataStream API and DataSet API for transformations on data streams and datasets, respectively; the Table API for relational stream and batch processing; and the Streaming SQL API for enabling SQL queries on streaming and batch tables.
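The three abstractions can be sketched with plain Python generators; this is an illustration of the source → transformation → sink pipeline shape, not Flink's DataStream API:

```python
def data_source():
    # Data source: the incoming events (here, a small bounded stream).
    yield from ["flink", "spark", "flink", "storm"]

def transformation(stream):
    # Transformation: modify each incoming event, one event at a time.
    for event in stream:
        yield event.upper()

def data_sink(stream, out):
    # Data sink: where the task's output ends up.
    for event in stream:
        out.append(event)

results = []
data_sink(transformation(data_source()), results)
print(results)  # ['FLINK', 'SPARK', 'FLINK', 'STORM']
```

Because generators are pulled one element at a time, each event flows through the whole pipeline before the next is read, mirroring the event-at-a-time (rather than batch-of-events) processing the passage describes.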