Big Data Stream Processing
Published in Vivek Kale, Parallel Computing Architectures and APIs, 2019
Spark extends its predecessors (such as MapReduce) with in-memory processing. An RDD lets developers materialize the dataset at any point in a processing pipeline in memory across the cluster, so that later steps operating on the same dataset need not recompute it or reload it from disk. This capability opens up use cases that earlier distributed processing engines could not approach. Spark is well suited to highly iterative algorithms that require multiple passes over a dataset, as well as to reactive applications that respond quickly to user queries by scanning large in-memory datasets.
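A minimal Scala sketch may make the caching behavior concrete. The details here (local master, the file path data/values.txt, the statistics computed) are illustrative assumptions, not from the chapter; the point is that cache() pins the parsed RDD in memory, so the second and third actions skip the disk read and re-parsing:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CacheSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("cache-sketch").setMaster("local[*]"))

    // Hypothetical input file: one numeric value per line.
    val values = sc.textFile("data/values.txt")
      .map(_.toDouble)
      .cache() // materialize this point of the pipeline in memory

    // Multiple passes over the same dataset: the first action computes
    // and caches the RDD; the later actions read it back from memory.
    val n        = values.count()
    val mean     = values.sum() / n
    val variance = values.map(v => math.pow(v - mean, 2)).sum() / n

    println(s"n=$n mean=$mean variance=$variance")
    sc.stop()
  }
}
```

When the dataset may exceed cluster memory, persist(StorageLevel.MEMORY_AND_DISK) trades the same programming model for graceful spill-to-disk behavior.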
Experiences with big data: Accounts from a data scientist’s perspective
Published in Quality Engineering, 2020
Murat Kulahci, Flavia Dalia Frumosu, Abdul Rauf Khan, Georg Ørnskov Rønsch, Max Peter Spooner
With the increased accumulation of production data, one of the biggest challenges has become allocating enough computational resources to process it. Although new technologies, such as parallel computing and quantum computing, have revolutionized the field, memory capacity is still limited. Most well-known data analytics methods work on the principle of in-memory processing. Computing frameworks such as Hadoop and Spark (Zaharia et al. 2010) enable distributed computation over large data streams (in Spark’s case, largely in memory) and provide solutions to the problems posed by continuous streams of data (Agneeswaran 2014). In terms of data storage, there is currently a transition towards NoSQL (“non-SQL” or “non-relational”) databases (Leavitt 2010) as opposed to traditional structured relational databases. One of the key advantages of NoSQL databases is that they can handle large volumes of unstructured data efficiently.
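To illustrate the last point, the following Scala sketch contrasts the two storage models. The field names and records are hypothetical production data, and the Document type is a stand-in for the schema-less documents a NoSQL store would hold; the point is that heterogeneous records coexist without a schema migration:

```scala
object SchemaSketch {
  // Relational-style record: a fixed schema; every row has the same columns.
  case class SensorRow(machineId: Int, timestamp: Long, temperature: Double)

  // NoSQL-style document: a schema-less key-value structure, so records
  // from different sensors may carry entirely different fields.
  type Document = Map[String, Any]

  def main(args: Array[String]): Unit = {
    val docs: Seq[Document] = Seq(
      Map("machineId" -> 1, "timestamp" -> 1577836800L, "temperature" -> 71.3),
      Map("machineId" -> 2, "timestamp" -> 1577836801L,
          "vibration" -> Seq(0.12, 0.09, 0.15), "operator" -> "shift-A")
    )

    // Queries tolerate missing fields instead of failing on a schema mismatch.
    val hot = docs.filter(d =>
      d.get("temperature").exists(_.asInstanceOf[Double] > 70.0))
    println(hot)
  }
}
```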