Database querying using SQL
Published in Texts in Statistical Science, 2017
Benjamin S. Baumer, Daniel T. Kaplan, Nicholas J. Horton
However, the flights data frame can become very large. Going back to 1987, there are more than 169 million individual flights, each comprising a separate row in this table. These data occupy nearly 20 gigabytes as CSVs, making them problematic to store in a personal computer's memory. Instead, we write these data to disk and use a querying language to access only those rows that interest us. In this case, we configured dplyr to access the flights data on a MySQL server. The src_scidb() function from the mdsr package provides a connection to the airlines database that lives on a remote MySQL server and stores it as the object db. The tbl() function from dplyr maps the flights (carriers) table in that airlines database to an object in R, in this case also called flights (carriers).
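A minimal sketch of this workflow, assuming the mdsr package's src_scidb() helper and a remote table named flights; the filter values shown are illustrative only:

    library(mdsr)
    library(dplyr)

    # Connect to the remote airlines database (a MySQL server)
    db <- src_scidb("airlines")

    # Map the remote flights table to an R object; no data are pulled yet
    flights <- tbl(db, "flights")

    # dplyr translates this pipeline to SQL and runs it on the server,
    # so only the matching rows ever reach R's memory
    flights %>%
      filter(year == 2013, origin == "BDL") %>%
      select(year, month, day, dest, arr_delay) %>%
      collect()

Only the final collect() call brings results into local memory; everything before it stays on the server as a lazily evaluated query.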
Knowledge Discovery and Information Analysis from Remote Sensing Big Data
Published in Lizhe Wang, Jining Yan, Yan Ma, Cloud Computing in Remote Sensing, 2019
Lizhe Wang, Jining Yan, Yan Ma
Owing to the natural raster structure of Earth observation images, a time-series imagery set can easily be transformed into a multidimensional array. For example, a 3D array can represent data with spatial and temporal dimensions. This data model suits parallel processing, because a large array can be partitioned into chunks for distributed storage and computation. In addition, the multidimensional array model enables spatiotemporally auto-correlated data analysis, so researchers need not concern themselves with the organization of discrete image files.

Much research has therefore focused on developing new analysis tools that process large RS data based on the multidimensional array model. For example, Gamara et al. [251] tested the performance of spatiotemporal analysis algorithms on the array database architecture SciDB [252], demonstrating the efficiency of spatiotemporal analysis built on the multidimensional array model, while Assis et al. [237] built a parallel RS data analysis system on the MapReduce framework of Hadoop [253], describing the 3D array with key/value pairs. Although these tools have significantly improved the computational performance of RS data analysis, they also have some deficiencies. First, many of them focus only on analyzing RS raster imagery located by geographic coordinates and do not support spatial features, limiting their ability to use geographic objects in analysis applications. Second, some of these tools require analysts to fit their algorithms into specialized environments, such as the Hadoop MapReduce framework [254]; this is unfriendly to researchers who want to focus solely on their analysis application.
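To make the chunking idea concrete, here is a small R sketch (the array dimensions and chunk size are illustrative, not taken from the cited systems): a 3D space-time array is split into regular spatial chunks, and any cell can equivalently be flattened into a key/value pair, as in the MapReduce-style encoding described above.

    # A toy 3D array: 100 x 100 pixels observed over 12 time steps
    cube <- array(runif(100 * 100 * 12), dim = c(100, 100, 12))

    # Partition the spatial plane into 50 x 50 chunks; each chunk is an
    # independent unit that could be stored and processed on a different node
    chunk_size <- 50
    chunks <- list()
    for (i in seq(1, dim(cube)[1], by = chunk_size)) {
      for (j in seq(1, dim(cube)[2], by = chunk_size)) {
        key <- paste(i, j, sep = "_")
        chunks[[key]] <- cube[i:(i + chunk_size - 1),
                              j:(j + chunk_size - 1), , drop = FALSE]
      }
    }

    # Key/value view of a single cell, as in a MapReduce-style encoding:
    # key = the (x, y, t) coordinates, value = the pixel measurement
    cell_key   <- c(x = 10, y = 20, t = 3)
    cell_value <- cube[10, 20, 3]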
Integrating memory-mapping and N-dimensional hash function for fast and efficient grid-based climate data query
Published in Annals of GIS, 2021
Mengchao Xu, Liang Zhao, Ruixin Yang, Jingchao Yang, Dexuan Sha, Chaowei Yang
To compare the performance of LotDB with other databases, three popular databases were chosen: PostgreSQL 9.3 (relational database), MongoDB 4.0.4 (NoSQL database), and SciDB 18.1 (array database). PostgreSQL is an open-source object-relational database management system; it has been developed for roughly 30 years and has maintained very stable performance across different domains. MongoDB is a document-oriented NoSQL database system that follows a schema-free design and is one of the most popular databases for modern applications. SciDB, as mentioned in the previous section, is a high-performance array database designed specifically for storing and querying scientific datasets. All of these databases were installed in standalone mode on individual servers with the same hardware configuration: Intel Xeon CPU X5660 @ 2.80 GHz (24 cores) with 24 GB of RAM and a 7200 rpm HDD, running CentOS 6.6 or Ubuntu 14.04. Data were uploaded to each database, and pre-processing was done for databases that do not support the NetCDF format. The databases were evaluated on the following aspects: (1) data uploading and pre-processing time, (2) data storage consumption, and (3) spatiotemporal query run time. Different spatiotemporal queries were designed to evaluate the performance of the selected databases (Table 1) for the year 2017, with a raw data size of 3.45 GB. In addition to the 3.45 GB dataset, raw data sizes of 10 MB, 100 MB, 1 GB, and 10 GB were used to evaluate storage consumption across the databases. The number of grid points is the estimated number of array cells to be retrieved by the corresponding query. Query run time refers to the elapsed real (wall-clock) time in this experiment.
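As an illustration of the kind of spatiotemporal query being timed, the R sketch below runs a bounding-box plus time-range selection against a PostgreSQL instance and measures wall-clock time; the connection details and the table and column names (climate_grid, lat, lon, obs_time, value) are hypothetical placeholders, not the paper's actual schema:

    library(DBI)

    # Connect to the PostgreSQL server (connection details are placeholders)
    con <- dbConnect(RPostgres::Postgres(),
                     host = "benchmark-server", dbname = "climate",
                     user = "tester", password = "secret")

    # A typical spatiotemporal range query: a lat/lon bounding box
    # restricted to one month of 2017
    sql <- "
      SELECT lat, lon, obs_time, value
      FROM climate_grid
      WHERE lat BETWEEN 30 AND 45
        AND lon BETWEEN -110 AND -90
        AND obs_time >= '2017-07-01' AND obs_time < '2017-08-01'
    "

    # Elapsed (wall-clock) time is the metric reported in the experiment
    elapsed <- system.time(result <- dbGetQuery(con, sql))["elapsed"]

    dbDisconnect(con)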