Role and Support of Image Processing in Big Data
Published in Ankur Dumka, Alaknanda Ashok, Parag Verma, Poonam Verma, Advanced Digital Image Processing and Its Applications in Big Data, 2020
Ankur Dumka, Alaknanda Ashok, Parag Verma, Poonam Verma
Apache HBase (the Hadoop database) is a column-oriented data model that provides zero downtime during node failure and thus good redundancy; it handles concurrency by means of optimistic concurrency control. CouchDB is a document-oriented data model that also uses optimistic concurrency and additionally provides secondary indexes. MongoDB is likewise a document-oriented data model, with nearly the same features as CouchDB. Apache Cassandra is a column-oriented data model that provides zero downtime on node failure, and hence good redundancy, along with concurrency support. Apache Ignite is a multi-model database that provides nearly all of these features (zero downtime on node failure, concurrency, and secondary indexes) and is therefore widely used. Oracle NoSQL Database is a key-value data model that provides concurrency and secondary indexes.
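The optimistic concurrency mentioned for several of these stores can be sketched with a version check on each write: a write succeeds only if no other writer has touched the key since it was read. The minimal in-memory class below is an illustration of the scheme, not any of these databases' real APIs; the names `VersionedStore`, `read`, and `write` are invented for the example.

```python
# Conceptual sketch of optimistic concurrency control (the scheme the
# excerpt attributes to HBase and CouchDB), using a simple in-memory
# store; class and method names here are illustrative, not a real API.

class VersionedStore:
    """Each key carries a version; a write succeeds only if the caller
    still holds the current version, otherwise it is rejected."""

    def __init__(self):
        self._data = {}  # key -> (value, version)

    def read(self, key):
        value, version = self._data.get(key, (None, 0))
        return value, version

    def write(self, key, value, expected_version):
        _, current = self._data.get(key, (None, 0))
        if current != expected_version:
            return False  # conflict: another writer got in between
        self._data[key] = (value, current + 1)
        return True

store = VersionedStore()
_, v = store.read("row1")
assert store.write("row1", "a", v)      # first writer succeeds
assert not store.write("row1", "b", v)  # stale version is rejected
```

No locks are held between the read and the write; conflicts are detected at commit time, which is what makes the approach "optimistic".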
Modern Predictive Analytics and Big Data Systems Engineering
Published in Anna M. Doro-on, Handbook of Systems Engineering and Risk Management in Control Systems, Communication, Space Technology, Missile, Security and Defense Operations, 2023
Apache HBase, https://hbase.apache.org/, is the Hadoop database: a non-relational (NoSQL), distributed, scalable big data store (ASF 2018f). It is an open-source NoSQL database that provides real-time read/write access to large data sets (Hortonworks 2018d). Note, though, that HBase is not a column-oriented database in the typical RDBMS sense, but utilizes an on-disk column storage format (George 2011). This is also where most of the similarities end: although HBase stores data on disk in a column-oriented format, it is distinctly different from traditional columnar databases. Whereas columnar databases excel at providing real-time analytical access to data, HBase excels at providing key-based access to a specific cell of data, or to a sequential range of cells (George 2011). Apache HBase provides random, real-time access to your data in Hadoop. It was created for hosting very large tables, making it a great choice for storing multistructured or sparse data (Hortonworks 2018d). Users can query HBase for a particular point in time, making “flashback” queries possible (Hortonworks 2018d).
These characteristics make HBase a great choice for storing semi-structured data such as log data, and then serving that data very quickly to users or applications integrated with HBase (Hortonworks 2018d). HBase features include (ASF 2018f): (1) linear and modular scalability; (2) strictly consistent reads and writes; (3) automatic and configurable sharding of tables; (4) automatic failover support between RegionServers; (5) convenient base classes for backing Hadoop MapReduce jobs with Apache HBase tables; (6) an easy-to-use Java API for client access; (7) block cache and Bloom filters for real-time queries; (8) query predicate push-down via server-side filters; (9) a Thrift gateway and a RESTful Web service that supports XML, Protobuf, and binary data encoding options; (10) an extensible JRuby-based (JIRB) shell; and (11) support for exporting metrics via the Hadoop metrics subsystem to files or Ganglia, or via JMX.
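Feature (7) above, the Bloom filter, is a compact probabilistic set that lets a read skip store files that definitely do not contain a requested row. The minimal implementation below is a conceptual sketch of that data structure; the bit-array size and hash construction are illustrative choices, not HBase's actual tuning.

```python
# Minimal Bloom filter sketch, illustrating the structure HBase uses to
# skip files that cannot contain a requested row key. A lookup can give
# a false positive, but never a false negative.
import hashlib

class BloomFilter:
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a plain int used as a bit array

    def _positions(self, key):
        # derive num_hashes bit positions from salted SHA-256 digests
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present"
        return all((self.bits >> pos) & 1 for pos in self._positions(key))

bf = BloomFilter()
bf.add("row-42")
assert bf.might_contain("row-42")  # an added key is never reported absent
```

Because a negative answer is always correct, a store can consult the filter in memory and avoid a disk seek for most rows it does not hold, which is what makes the structure useful for real-time queries.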
Big Data Analytics in Healthcare Data Processing
Published in Punit Gupta, Dinesh Kumar Saini, Rohit Verma, Healthcare Solutions Using Machine Learning and Informatics, 2023
Tanveer Ahmed, Rishav Singh, Ritika Singh
HBase is based on the Hadoop Distributed File System (HDFS). Apache HBase is a multidimensional distributed database system that is part of the Hadoop ecosystem. It can store very large amounts of data, from the terabyte (TB) to the petabyte (PB) level [24].
Machine Learning Techniques and Big Data Analysis for Internet of Things Applications: A Review Study
Published in Cybernetics and Systems, 2022
Fei Wang, Hongxia Wang, Omid Ranjbar Dehghan
Techniques such as Apache HBase, Apache Cassandra, Apache Flink, Apache Storm, Apache Spark and Apache Hadoop can be used to process data classified as big data (Kotenko, Saenko, and Branitskiy 2018). The IoT and big data are so intertwined that billions of Internet-connected objects will generate large amounts of data. On its own, however, this will not amount to another industrial revolution, change digital everyday life, or provide an early warning system to save the planet. Moreover, existing big data techniques alone lack large-scale processing capability, which makes efficient big data analysis difficult (Martis et al. 2018). In this context, combining machine learning with big data techniques to enhance the analysis of IoT device data has been introduced. In recent years, machine learning techniques have become widely used owing to features such as unsupervised ensemble training with faster processing (Rezaeipanah, Mojarad, and Fakhari 2022). Big data analysis by machine learning techniques includes classification, clustering, association rule mining, and regression, as shown in Figure 2. In most existing research, machine learning and big data techniques focus separately on IoT data analysis.
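One of the technique families named above, clustering, can be illustrated on the kind of numeric readings IoT sensors emit. The plain one-dimensional k-means loop and the toy data below are an illustration of the general technique, not an algorithm or dataset from the reviewed work.

```python
# Toy illustration of clustering (one of the ML techniques listed above)
# on 1-D sensor readings. Data, initial centers, and the plain k-means
# loop are illustrative, not taken from the paper.

def kmeans_1d(values, centers, iterations=10):
    """Cluster 1-D readings around the given initial centers."""
    for _ in range(iterations):
        # assignment step: attach each value to its nearest center
        groups = {c: [] for c in centers}
        for v in values:
            nearest = min(centers, key=lambda c: abs(v - c))
            groups[nearest].append(v)
        # update step: move each center to the mean of its group
        centers = [sum(g) / len(g) if g else c for c, g in groups.items()]
    return sorted(centers)

# e.g. temperature readings that plausibly come from two rooms
readings = [20.1, 20.4, 19.8, 35.0, 34.6, 35.3]
centers = kmeans_1d(readings, centers=[0.0, 50.0])
# the two centers converge near the two groups of readings
```

The same assignment/update structure generalizes to higher-dimensional feature vectors, which is how such techniques are applied to richer IoT data.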
Distributed image retrieval with colour and keypoint features
Published in Journal of Information and Telecommunication, 2019
Michał Ła̧giewka, Marcin Korytkowski, Rafal Scherer
Existing CBIR systems are generally not designed to work in a database environment. The presented system for content-based image retrieval can work in a relational database environment. The system scales well, meaning that the number of slave machines can be increased. To optimize transactions between the Master and the SQL Server, the two can be merged into one virtual machine to minimize network operations (with Hadoop migrated to the Windows environment, owing to the MS SQL Servers used in the experiments). The proposed approach demonstrates several advantages, partly inherited from the original method presented in Korytkowski et al. (2016) and Korytkowski (2017). The indexing method is relatively accurate in terms of visual object classification. The training phase is relatively fast, and the image classification stage is very fast. Expanding the system's knowledge is efficient, as adding new visual classes requires only the generation of new fuzzy rules, whereas an approach such as bag-of-features would require generating a new dictionary and re-training the classifiers. The system is highly scalable, and its performance depends only on the hardware resources. The accuracy of the image retrieval is similar to that reported in Korytkowski et al. (2016) and Korytkowski (2017), as the image indexing procedure is the same. The goal of the solution presented in the paper is speed and scalability. It is hard to compare its speed with the authors' previous solution, as speed is very hardware-dependent; generally, it is faster than the non-distributed version, and adding more hardware and slave machines makes it faster still. Further changes could be made to the processing of the initial input image by incorporating features similar to Lagiewka et al. (2016), so that the proposed system could recognize objects with a given colour parameter.
Such a feature could reduce the subset of compared images, which means faster processing on a smaller set of objects matching the colour requirements. Storing additional data such as texture, colour, or the approximate size of objects (relative to relevant objects in other stored images and to background objects in the same image) can reduce processing time, because the dataset then contains only objects referenced by SQL query results. The proposed solution is semi-parallel, because only the file system has been distributed. Our work is mostly aimed at eliminating a bottleneck of image retrieval systems designed as single-entity solutions. The list sorting process could be parallelized further, but this was not within the scope of the presented work. The performance of the database part of the solution can also be increased through the use of an SQL server cluster, where the process of generating the index in the form of rules can be parallelized and spread across several servers. There is also the possibility of exchanging the relational database engine for a distributed database, e.g. Apache HBase or MapR-DB.
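The colour pre-filtering idea described above can be sketched as a cheap histogram comparison that shrinks the candidate set before any expensive keypoint matching. The 3-bin histograms, file names, and distance threshold below are invented for illustration; the paper's actual features and rules are not reproduced here.

```python
# Sketch of the colour pre-filter idea: compare colour histograms first,
# so only colour-compatible images reach the expensive matching stage.
# Histograms, names, and the threshold are illustrative.

def histogram_distance(h1, h2):
    """L1 distance between two normalized colour histograms."""
    return sum(abs(a - b) for a, b in zip(h1, h2))

def prefilter_by_colour(query_hist, database, threshold=0.5):
    """Keep only images whose colour histogram is close to the query's."""
    return [name for name, hist in database.items()
            if histogram_distance(query_hist, hist) <= threshold]

# toy 3-bin (R, G, B) histograms for a few stored images
database = {
    "sunset.jpg": [0.7, 0.2, 0.1],
    "forest.jpg": [0.1, 0.8, 0.1],
    "ocean.jpg":  [0.1, 0.2, 0.7],
}
query = [0.65, 0.25, 0.10]  # a reddish query image

candidates = prefilter_by_colour(query, database)
# only the colour-compatible images go on to keypoint matching
```

In the distributed setting the same filter maps naturally onto an SQL predicate, so the database returns only the reduced candidate set to the matching stage.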