Role and Support of Image Processing in Big Data
Published in Ankur Dumka, Alaknanda Ashok, Parag Verma, Poonam Verma, Advanced Digital Image Processing and Its Applications in Big Data, 2020
Ankur Dumka, Alaknanda Ashok, Parag Verma, Poonam Verma
Apache HBase (Hadoop database) is a column-oriented data model that provides zero downtime during node failure and thus good redundancy; it handles concurrency by means of optimistic concurrency control. CouchDB is a document-oriented data model that likewise relies on optimistic concurrency control and also provides secondary indexes. MongoDB is also a document-oriented data model and offers nearly the same features as CouchDB. Apache Cassandra is a column-oriented data model that provides zero downtime on node failure, and hence good redundancy, as well as concurrency. Apache Ignite is a multi-model database that provides nearly all of these features, including zero downtime on node failure, concurrency, and secondary indexes, and is therefore widely used. Oracle NoSQL Database is a key-value data model that provides concurrency and secondary indexes.
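The contrast between the column-oriented and document-oriented models above can be illustrated with a minimal Python sketch. It assumes local Cassandra and MongoDB instances and the cassandra-driver and pymongo client libraries; the keyspace, table, collection, and field names are illustrative, not taken from the chapter.

```python
# Minimal sketch contrasting a column-oriented and a document-oriented store.
# Assumes local Cassandra and MongoDB instances; all names are illustrative.
from cassandra.cluster import Cluster   # pip install cassandra-driver
from pymongo import MongoClient         # pip install pymongo

# Column-oriented (Cassandra): rows live in typed columns of a declared table.
session = Cluster(["127.0.0.1"]).connect()
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS demo.images (
        image_id uuid PRIMARY KEY, label text, width int, height int)
""")

# Document-oriented (MongoDB, and CouchDB in spirit): each record is a
# schema-free document, so fields can vary from record to record.
mongo = MongoClient("mongodb://localhost:27017")
mongo.demo.images.insert_one(
    {"label": "sample", "width": 640, "height": 480, "tags": ["raw", "camera-1"]}
)
```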
A Microfluidics-Driven Cloud Service: Genomic Association Studies
Published in Mohamed Ibrahim, Krishnendu Chakrabarty, Optimization of Trustworthy Biomolecular Quantitative Analysis Using Cyber-Physical Microfluidic Platforms, 2020
Mohamed Ibrahim, Krishnendu Chakrabarty
Apache Cassandra is a masterless NoSQL online database architecture with no single point of failure (i.e., it is fault tolerant). Apache Spark is a centrally coordinated scheme designed to handle large amounts of data by processing them simultaneously at scale. For example, to develop scalable regression models for CanLib, the Spark Machine Learning Library (MLlib) can be deployed and used. In BioCyBig, we tightly integrate Spark and Cassandra, which gives us the capability to use Spark to analyze the data stored in Cassandra; this data is generated online by the individual microfluidic platforms, which may be geographically distributed. This integration provides horizontal scaling, fault tolerance, operational-level reporting, and an analytics-friendly environment, all in one package. To achieve this integration, it is imperative to study how to extract biochemical outcomes from Cassandra and incrementally move the updates to Spark in real time.
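One common way to wire Spark to Cassandra is through the DataStax Spark Cassandra Connector. The sketch below shows the general shape of such a read path in PySpark; the connector coordinates, host, keyspace, table, and column names are assumptions for illustration and are not taken from the BioCyBig description.

```python
# Minimal sketch of reading Cassandra data into Spark for analysis.
# Assumes the DataStax Spark Cassandra Connector is available; the package
# version, keyspace ("biocybig"), table ("assay_results"), and column names
# are illustrative.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("cassandra-to-spark")
    .config("spark.jars.packages",
            "com.datastax.spark:spark-cassandra-connector_2.12:3.4.1")
    .config("spark.cassandra.connection.host", "127.0.0.1")
    .getOrCreate()
)

# Pull the biochemical outcomes written online by the microfluidic platforms.
outcomes = (
    spark.read.format("org.apache.spark.sql.cassandra")
    .options(keyspace="biocybig", table="assay_results")
    .load()
)

# Example analysis step: aggregate per-platform readings before fitting
# MLlib regression models on the resulting DataFrame.
outcomes.groupBy("platform_id").avg("signal").show()
```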
Big Data Computing
Published in Vivek Kale, Parallel Computing Architectures and APIs, 2019
Apache Cassandra is a distributed data store for managing large amounts of structured data spread across many commodity servers. The system is designed to avoid a single point of failure and to offer a highly reliable service. Cassandra was initially developed at Facebook and is now a top-level Apache project. In its early years, Facebook used a leading commercial database solution for its internal architecture in conjunction with Hadoop. Eventually, the tsunami of users led the company to start thinking in terms of unlimited scalability and to focus on availability and distribution. The nature of the data and of its producers and consumers did not mandate strict consistency, but it did require unlimited availability and scalable performance.
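The availability-over-consistency trade-off described above shows up concretely in Cassandra's replication and tunable consistency settings. The following is a minimal sketch using the Python cassandra-driver; the keyspace, table, and replication factor are illustrative assumptions.

```python
# Minimal sketch of how Cassandra favours availability over strict consistency.
# Assumes a local cluster and the cassandra-driver package; the keyspace
# "feed", table "feed.messages", and replication factor are illustrative.
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

session = Cluster(["127.0.0.1"]).connect()

# Replicating each row to three nodes removes any single point of failure.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS feed
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
""")

# A read at consistency level ONE returns as soon as any single replica
# answers, trading strict consistency for availability and low latency.
stmt = SimpleStatement("SELECT * FROM feed.messages LIMIT 10",
                       consistency_level=ConsistencyLevel.ONE)
rows = session.execute(stmt)  # assumes the table already exists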
Machine Learning Techniques and Big Data Analysis for Internet of Things Applications: A Review Study
Published in Cybernetics and Systems, 2022
Fei Wang, Hongxia Wang, Omid Ranjbar Dehghan
Techniques such as Apache HBase, Apache Cassandra, Apache Flink, Apache Storm, Apache Spark, and Apache Hadoop can be used to process data classified as big data (Kotenko, Saenko, and Branitskiy 2018). The IoT and big data are so intertwined that billions of Internet-connected objects will generate large amounts of data. However, this in itself will not drive another industrial revolution, change everyday digital life, or provide an early warning system to save the planet. Moreover, existing big data techniques alone lack the required large-scale processing capability, making efficient big data analysis difficult (Martis et al. 2018). In this context, combining machine learning with big data techniques to enhance the analysis of IoT device data has been introduced. In recent years, machine learning techniques have become widely used owing to features such as ensemble unsupervised training with faster processing (Rezaeipanah, Mojarad, and Fakhari 2022). Big data analysis with machine learning techniques includes classification, clustering, association rule mining, and regression, as shown in Figure 2. In most existing research, machine learning and big data techniques are applied to IoT data analysis separately.
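Two of the four technique families named above, classification and clustering, can be illustrated with a short Python sketch on synthetic IoT-style sensor readings. The feature layout, labels, and model choices are illustrative assumptions and are not drawn from the cited studies.

```python
# Minimal sketch: classification and clustering on synthetic IoT-style data.
# Feature names, labels, and model choices are illustrative only.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))           # e.g. temperature, humidity, vibration
y = (X[:, 0] + X[:, 2] > 0).astype(int)  # e.g. "device needs maintenance"

# Classification: predict a labelled outcome from the sensor features.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
print("classification accuracy:", clf.score(X_te, y_te))

# Clustering: group unlabelled readings into operating regimes.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print("cluster sizes:", np.bincount(labels))
```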
Big Data and Social Science: Data Science Methods and Tools for Research and Practice
Published in Technometrics, 2021
Chapter 4, “Databases,” by I. Foster and P. Heus, presents different approaches to storing data in ways that facilitate rapid, scalable, and reliable exploration and analysis and that are convenient for use from any software, particularly SAS, Stata, SPSS, or R. It describes relational DBMSs and Structured Query Language (SQL), optimizing databases and cleaning data, and embedding queries in Python. For extremely large databases, alternative technologies have been developed, termed “no SQL” or “not only SQL” and commonly referred to as NoSQL approaches. For example, there are NoSQL DBMSs with a simple key-value structure, such as Redis, Amazon Dynamo, Apache Cassandra, and Project Voldemort. Spatial databases with socioeconomic data associated with jobs in cities and states are also discussed.
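The two access styles the chapter contrasts, SQL embedded in Python and a simple key-value NoSQL store, look roughly as follows. The sketch uses Python's built-in sqlite3 module and the redis-py client, and assumes a local Redis server; the table, column, and key names are illustrative.

```python
# Minimal sketch: an embedded SQL query versus a key-value lookup.
# Assumes a local Redis server and the redis-py package; names are illustrative.
import sqlite3
import redis

# Relational: declare a schema and query it with SQL embedded in Python.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (city TEXT, state TEXT, count INTEGER)")
conn.execute("INSERT INTO jobs VALUES ('Chicago', 'IL', 1250)")
for row in conn.execute("SELECT city, count FROM jobs WHERE state = ?", ("IL",)):
    print(row)

# Key-value NoSQL (Redis): no schema, just keys mapped to values.
r = redis.Redis(host="localhost", port=6379)
r.set("jobs:Chicago", 1250)
print(r.get("jobs:Chicago"))
```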
API deployment for big data management towards sustainable energy prosumption in smart cities-a layered architecture perspective
Published in International Journal of Sustainable Energy, 2020
Bokolo Anthony Jnr, Sobah Abbas Petersen, Dirk Ahlers, John Krogstie
Thus, the HBase NoSQL database and CouchDB are utilised to store the processed energy data for energy market trading forecasting and prediction analytics, to be used by prosumers and energy service providers in making decisions regarding energy trading in energy districts. Also, CouchDB is used to store energy data that can be converted into actionable information in the application layer, providing interoperable open energy data to prosumers, energy service providers, and stakeholders via a RESTful API to support energy trading in energy districts. Likewise, Apache Cassandra, a NoSQL database, is utilised as a datastore to elastically organise linked meta-data in a system. It has linear scalability and proven fault tolerance, making it a viable database for organising critical data in RDF format (Khan, Kiani, and Soomro 2014). Cassandra’s data model employs a column-based schema and robust built-in caching, which help prosumers and energy service providers attain the quality of service needed when exploiting energy data to achieve positive energy blocks in energy districts.
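As a rough illustration of this storage pattern, the sketch below reads an energy document over CouchDB's RESTful HTTP API and writes a processed reading into a Cassandra column-based table. The host names, database, keyspace, table, and field names are illustrative assumptions, not the deployment described in the article.

```python
# Minimal sketch: fetch an energy document from CouchDB over HTTP and persist
# a processed reading in Cassandra. All names and endpoints are illustrative.
import requests
from cassandra.cluster import Cluster

# CouchDB exposes every document over plain HTTP, which is what the RESTful
# API layer passes on to prosumers and energy service providers.
doc = requests.get("http://localhost:5984/energy_data/meter-42").json()

# Cassandra keeps the processed readings in a column-based schema.
session = Cluster(["127.0.0.1"]).connect()
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS energy
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS energy.readings (
        meter_id text, ts timestamp, kwh double,
        PRIMARY KEY (meter_id, ts))
""")
session.execute(
    "INSERT INTO energy.readings (meter_id, ts, kwh) "
    "VALUES (%s, toTimestamp(now()), %s)",
    ("meter-42", float(doc.get("kwh", 0.0))),
)
```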