Feature Selection and Evaluation
Published in Guozhu Dong, Huan Liu, Feature Engineering for Machine Learning and Data Analytics, 2018
Given a large-scale data set containing a huge number of samples and features, the scalability of a feature selection algorithm becomes extremely important. However, most existing feature selection algorithms are designed for data sets whose sizes are under several gigabytes. To improve the scalability of existing feature selection methods to large-scale data sets, feature selection is often performed in a distributed manner. It has been shown in [11] that any operation fitting the Statistical Query model can be computed in parallel based on data partitioning. Studies have also shown that when the data size is large enough, parallelization based on data partitioning can yield linear speedup as computing resources increase [11]. Problem 1: When the data are located in a central database, how do we implement distributed feature selection?
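A minimal sketch of this idea, not the chapter's actual algorithm: each partition contributes sufficient statistics (sums, which fit the Statistical Query model), and a final reduce step combines them to score features by correlation with the label. The partition count, scoring criterion, and all names here are illustrative assumptions.

```python
import numpy as np
from multiprocessing import Pool

def partial_stats(args):
    """Map step: sufficient statistics for one data partition."""
    X, y = args
    return (len(y), X.sum(0), (X ** 2).sum(0),
            y.sum(), (y ** 2).sum(), (X * y[:, None]).sum(0))

def correlation_scores(partitions):
    """Reduce step: merge partial sums into per-feature correlations."""
    with Pool() as pool:
        stats = pool.map(partial_stats, partitions)
    n = sum(s[0] for s in stats)
    sx = sum(s[1] for s in stats); sxx = sum(s[2] for s in stats)
    sy = sum(s[3] for s in stats); syy = sum(s[4] for s in stats)
    sxy = sum(s[5] for s in stats)
    cov = sxy / n - (sx / n) * (sy / n)
    var_x = sxx / n - (sx / n) ** 2
    var_y = syy / n - (sy / n) ** 2
    return np.abs(cov / np.sqrt(var_x * var_y + 1e-12))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X, y = rng.normal(size=(10_000, 50)), rng.normal(size=10_000)
    parts = [(X[i::4], y[i::4]) for i in range(4)]   # 4 data partitions
    top10 = np.argsort(correlation_scores(parts))[::-1][:10]
```

Because each per-partition statistic is just a sum over samples, the merged scores are identical to those computed on the undivided data set; only the wall-clock time changes.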
Scalability
Published in Vivek Kale, Digital Transformation of Enterprise Architecture, 2019
A parallel computer is a set of processors that are able to share the load, i.e., to work cooperatively to solve a computational problem. Parallel processing is the simultaneous execution of program instructions that have been allocated across multiple processors, with the objective of running a program in less time; it makes it possible to handle larger data sets in reasonable time or to speed up complex operations, and it therefore represents a key to tackling the big data problem. Parallelization implies that the processing load, or work, is split and distributed across a number of processors, or processing nodes.
Big Data Analytics and Machine Learning for Industry 4.0: An Overview
Published in G. Rajesh, X. Mercilin Raajini, Hien Dang, Industry 4.0 Interoperability, Analytics, Security, and Case Studies, 2021
Nguyen Tuan Thanh Le, Manh Linh Pham
Parallelization improves computation time by dividing big problems into smaller instances, distributing the smaller tasks across multiple threads, and then performing them simultaneously. This strategy decreases computation time rather than the total amount of work, because multiple tasks are performed simultaneously rather than sequentially [10], as the sketch below illustrates.
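An illustrative sketch of that distinction, with an arbitrary toy workload: one big sum is divided into smaller instances that run simultaneously, so wall-clock time drops roughly with the worker count while the total CPU work stays the same.

```python
from concurrent.futures import ProcessPoolExecutor

def subtask(bounds):
    """One smaller instance of the big problem: a partial sum."""
    lo, hi = bounds
    return sum(i * i for i in range(lo, hi))

if __name__ == "__main__":
    n, workers = 10_000_000, 4
    step = n // workers
    chunks = [(i, min(i + step, n)) for i in range(0, n, step)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        total = sum(pool.map(subtask, chunks))  # tasks run in parallel
    print(total)
```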
Interactive Visual Exploration of Big Relational Datasets
Published in International Journal of Human–Computer Interaction, 2023
Katerina Vitsaxaki, Stavroula Ntoa, George Margetis, Nicolas Spyratos
The function ansQ is also shown in Figure 4. We view the ordered triple Q = (b, q, sum) as an analytic query, the function ansQ as the answer to Q, and the computations described above as the query evaluation process. The function b that appears first in the triple (b, q, sum) and is used in the grouping step is called the grouping function; the function q that appears second in the triple is called the measuring function; and the function sum that appears third in the triple is called the reduction operation or the aggregate operation. Actually, the triple (b, q, sum) should be regarded as the specification of an analysis task to be carried out over the data set Inv#. It should be clear from the above evaluation of the query answer that the task of evaluating Q can be easily parallelized. Indeed, if, for each i, we treat the evaluation of ansQ(Bri) as a sub-task, then we can assign the sub-tasks to a number of processors, each processor receiving one or more sub-tasks. Each processor then executes its own sub-task(s) independently of all other processors, and the results from all processors, put together, constitute the answer to the query (note that this kind of parallelization lies at the basis of MapReduce).
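A minimal sketch of this evaluation scheme, assuming toy invoice records that stand in for the data set Inv#: each sub-task groups its own slice of the data by b, applies the measuring function q, and reduces with sum; merging the per-group partial results yields ansQ, exactly in the map/reduce spirit described above.

```python
from collections import defaultdict
from multiprocessing import Pool

def eval_subtask(args):
    """One sub-task: group its slice by b, measure with q, reduce with sum."""
    records, b, q = args
    partial = defaultdict(float)
    for rec in records:
        partial[b(rec)] += q(rec)
    return dict(partial)

def ansQ(records, b, q, n_tasks=4):
    slices = [records[i::n_tasks] for i in range(n_tasks)]
    with Pool(n_tasks) as pool:
        partials = pool.map(eval_subtask, [(s, b, q) for s in slices])
    answer = defaultdict(float)
    for p in partials:                      # merge sub-task results
        for group, value in p.items():
            answer[group] += value
    return dict(answer)

def store(rec):                             # grouping function b (assumption)
    return rec[0]

def amount(rec):                            # measuring function q (assumption)
    return rec[1]

if __name__ == "__main__":
    invoices = [("paris", 10.0), ("lyon", 4.5), ("paris", 2.0), ("lyon", 1.0)]
    print(ansQ(invoices, store, amount))    # {'paris': 12.0, 'lyon': 5.5}
```

The merge step works because sum is associative and commutative, which is precisely what lets the sub-tasks run independently on separate processors.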
Monitoring travel patterns in German city regions with the help of mobile phone network data
Published in International Journal of Digital Earth, 2021
Stefan Fina, Jigeeshu Joshi, Dirk Wittowsky
Computing routes between 8,222 data points (German postcodes) requires intensive processing. Even though not all possible combinations are practical (it is unlikely that people travel long distances on a daily basis, and remote places are less frequently travelled to), route calculation nonetheless raised significant performance considerations. Parallelization helped improve processing performance. The program code snippet featured in Appendix A.1 documents the procedure used to compute the network distances between startzone and endzone. The function GetDistances given in Appendix A.2 was used to calculate the network distance between the features in the origin and destination layers. This function uses a network dataset built from the OSM street network and methods available in the ArcPy network analyst module (ESRI 2019).
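A hedged sketch of how such an origin-destination workload can be parallelized; the actual routing code is in the appendices and is not reproduced here. The function network_distance below is a hypothetical stand-in for the ArcPy-based GetDistances routine, mocked so that the parallel driver itself is runnable.

```python
from concurrent.futures import ProcessPoolExecutor
from itertools import product

def network_distance(pair):
    """Placeholder for the ArcPy-based GetDistances call (assumption)."""
    startzone, endzone = pair
    # The real procedure would solve a route on the OSM network dataset
    # via the ArcPy network analyst module; a dummy value is returned here.
    return startzone, endzone, 0.0

if __name__ == "__main__":
    zones = [f"PLZ{i:05d}" for i in range(1, 51)]   # sample of postcodes
    pairs = [(s, e) for s, e in product(zones, zones) if s != e]
    with ProcessPoolExecutor() as pool:
        distances = list(pool.map(network_distance, pairs, chunksize=100))
```

Since each origin-destination pair is routed independently, the pair list can simply be chunked across worker processes.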
Next generation of GIS: must be easy
Published in Annals of GIS, 2021
A-Xing Zhu, Fang-He Zhao, Peng Liang, Cheng-Zhi Qin
Another aim of the modelling environment is to improve the computation efficiency of the constructed workflows. Researchers have developed different parallelization strategies to divide the computation into different sessions that are processed in parallel to improve computation efficiency (Healey et al. 1997; Hawick, Coddington, and James 2003; Zhao et al. 2016). The parallelization strategies are designed at two different levels. The first level is data division, which divides geospatial data into parts to be loaded and processed in parallel. The data division strategy varies as the data structure changes (Shook et al. 2016; Qin, Zhan, and Zhu 2014; Liu et al. 2014). The second is the division of computation, which divides the computation process into parts to be carried out in parallel. The computation process is divided according to algorithm characteristics (Qin and Zhan 2012; Liu et al. 2016). The modelling environment also provides users with access to high-performance computing resources such as cluster computing and grid computing (Huang et al. 2011; Lecca et al. 2011; Hussain et al. 2013; Kim and Tsou 2013; Yang et al. 2011). For example, the Google Earth Engine (Padarian, Minasny, and McBratney 2015; Gorelick et al. 2017) uses cloud computing to allow users to conduct geo-computation with high-performance computing resources over the internet, without the need to set up a local high-performance computing infrastructure.
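A minimal sketch of the data-division strategy, under illustrative assumptions (the raster, the block size, and the per-cell test below are not taken from the cited studies): a geospatial grid is split into row blocks that are processed in parallel, and the per-block results are stitched back together.

```python
import numpy as np
from multiprocessing import Pool

def process_block(block):
    """Per-block computation: flag cells above 2000 m (a cell-local test)."""
    return block > 2000.0

if __name__ == "__main__":
    dem = np.random.default_rng(1).uniform(0, 3000, size=(4096, 4096))
    blocks = np.array_split(dem, 8, axis=0)   # divide the grid into 8 row blocks
    with Pool() as pool:
        out = np.vstack(pool.map(process_block, blocks))  # stitch results
```

A cell-local operation divides cleanly along block boundaries; neighbourhood operations would additionally need halo rows shared between adjacent blocks, which is one reason the division strategy varies with the data structure and the algorithm.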