Feature Selection for Clustering: A Review
Published in Charu C. Aggarwal, Chandan K. Reddy, Data Clustering, 2018
Salem Alelyani, Jiliang Tang, Huan Liu
Streaming data are continuous, rapidly arriving data records. Formally, a data stream is a set of multidimensional records X1, …, Xn, … arriving at time stamps T1, …, Tn, …, where each record Xi is m-dimensional. These samples usually have very high dimensionality and arrive at high speed, so they require scalable and efficient algorithms. Moreover, the underlying cluster structure changes over time, so the algorithm must capture this change and keep the selected feature set up to date. In this section, we introduce what we believe to be the required characteristics of a good algorithm for streaming data:
Adaptivity: The algorithm should be able to adjust feature weights, or even reselect the set of features, so that it can handle data drift, also known as dataset shift.
Single scan: The algorithm should be able to cluster the incoming stream in one scan, since a second scan is usually impossible or at least costly.
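To make these two requirements concrete, the following is a minimal sketch of single-scan, adaptive feature selection over a stream. It is not the chapter's algorithm: the variance-based relevance score, the exponential decay factor, and the top-k reselection rule are illustrative assumptions only.

```python
import numpy as np

def stream_feature_selection(stream, k, decay=0.99):
    """Single-scan sketch: keep exponentially decayed per-feature statistics
    and reselect the top-k features as each record arrives.

    `stream` yields m-dimensional NumPy arrays; `decay` controls how quickly
    old records are forgotten (adaptivity to drift). The dispersion-based
    relevance score is an assumption for illustration, not a method from
    the chapter."""
    mean = var = None
    for x in stream:                      # one pass over the stream
        if mean is None:
            mean = x.astype(float)
            var = np.zeros_like(mean)
        else:
            # exponentially decayed running mean and dispersion per feature
            delta = x - mean
            mean = decay * mean + (1 - decay) * x
            var = decay * var + (1 - decay) * delta ** 2
        # reselect the k currently most dispersed features
        yield np.argsort(var)[-k:]

# usage: simulate a 10-dimensional stream and watch the selected features
rng = np.random.default_rng(0)
records = (rng.normal(scale=np.linspace(1.0, 2.0, 10)) for _ in range(1000))
for i, feats in enumerate(stream_feature_selection(records, k=3)):
    if i % 250 == 0:
        print(i, sorted(feats.tolist()))
```

Because each record is touched exactly once and only per-feature summaries are stored, the sketch respects the single-scan constraint; the decay factor is what lets the selected set track drift.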
Bicriteria Task Scheduling and Resource Allocation for Streaming Big Data Processing in Geo-Distributed Clouds
Published in Yulei Wu, Fei Hu, Geyong Min, Albert Y. Zomaya, Big Data and Computational Intelligence in Networking, 2017
Deze Zeng, Chengyu Hu, Guo Ren, Lin Gu
With the advent of big data, many novel programming frameworks have been proposed. The most well known is MapReduce, advocated by Google. The design goal of MapReduce is to provide a simple and powerful interface that enables automatic parallelization and distribution of large-scale computations on large clusters of commodity PCs [19]. Because it can exploit large-scale computation resources, MapReduce is well suited to, and has already been widely used for, processing large volumes of data in clouds, i.e., one-step batch processing. However, a notorious disadvantage of MapReduce is its inefficiency for multistep tasks, which require a great deal of I/O processing. Unlike batch data, streaming data cannot simply be split and processed in parallel as in the MapReduce model. In batch data processing, the source data are stored in a local file system or database, whereas in streaming data processing the data flow into the computation unit at very high speed and must be processed immediately. Given the task sequences and interdependencies, if we continue to use the "process-write" model of batch processing, the I/O communication cost will be extremely high due to the large number of intermediate I/O operations.
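The contrast between the "process-write" model and in-memory streaming can be shown on a toy two-step job. The sketch below is illustrative only: the word-count task, file handling, and function names are assumptions, not part of the chapter's framework.

```python
import json, os, tempfile
from collections import Counter

RECORDS = ["a b a", "b c", "a c c"]  # toy input; real jobs read from distributed storage

def batch_two_step(records):
    """'Process-write' style: each step materializes its output to disk
    before the next step starts, which is the source of the intermediate
    I/O cost discussed above."""
    step1_path = os.path.join(tempfile.mkdtemp(), "tokens.json")
    with open(step1_path, "w") as f:          # step 1: tokenize, then write
        json.dump([r.split() for r in records], f)
    with open(step1_path) as f:               # step 2: re-read, then count
        tokens = json.load(f)
    return Counter(t for row in tokens for t in row)

def streaming_two_step(records):
    """Streaming style: records flow through both steps in memory,
    one at a time, with no intermediate files."""
    counts = Counter()
    for r in records:                          # single pass, no re-read
        counts.update(r.split())
    return counts

print(batch_two_step(RECORDS) == streaming_two_step(RECORDS))  # True
```

Both versions compute the same result; the difference is that the batch version pays a write-then-read round trip between steps, which is exactly what multistep MapReduce jobs amplify at scale.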
Software and Technology Standards as Tools
Published in Jim Goodell, Janet Kolodner, Learning Engineering Toolkit, 2023
Jim Goodell, Andrew J. Hampton, Richard Tong, Sae Schatz
Cloud computing facilitates the on-demand availability of computer resources. So, instead of buying an entire hardware server, which you might only use for eight hours a day, you can acquire cloud-based resources as you need them. Cloud environments scale to meet growing or shrinking needs. Commercial cloud providers (such as Amazon Web Services, Google Cloud, or Microsoft Azure) also offer many tools and features, like built-in analytics engines and cybersecurity. Such services facilitate more rapid software development and implementation, and they minimize risk. Cloud computing also better supports streaming data architectures, which often require scalability and significant computing power.
An edge streaming data processing framework for autonomous driving
Published in Connection Science, 2021
Hang Zhao, LinBin Yao, ZhiXin Zeng, DongHua Li, JinLiang Xie, WeiLing Zhu, Jie Tang
At present, researchers have conducted many studies on streaming data processing frameworks deployed in cloud data centres. Spark Streaming, developed at the University of California, Berkeley, is one such framework running on Spark (Zaharia et al., 2013). However, there has been little research on streaming data processing in the edge data centre. In this paper, based on sensor data generated by automated vehicles, we propose a streaming data processing framework with the following two advantages:
Based on the gray model (GM), within the coverage scope of a given edge node, we monitor and predict the traffic flow of autonomous driving vehicles, so that the system can flexibly adjust its resource utilisation strategy according to variations in the data stream (see the sketch after this list).
A fuzzy control method is adopted to dynamically adjust the batch interval of Spark Streaming according to changes in the data streams and system workload, which helps reduce end-to-end delay while still satisfying the throughput requirement.
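For the first advantage, a gray-model forecaster predicts near-future traffic flow. As a hedged illustration, here is a minimal GM(1,1) forecaster in Python; the paper only says "gray model (GM)", so the choice of the standard GM(1,1) variant and the vehicle-count history below are assumptions.

```python
import numpy as np

def gm11_forecast(x0, steps=1):
    """Textbook GM(1,1) grey forecasting: fit on a short positive sequence
    x0 and predict `steps` future values. Offered as an illustration; the
    paper's exact model and traffic-flow features are not specified here."""
    x0 = np.asarray(x0, dtype=float)
    n = len(x0)
    x1 = np.cumsum(x0)                              # accumulated (1-AGO) series
    z1 = 0.5 * (x1[1:] + x1[:-1])                   # background values
    B = np.column_stack((-z1, np.ones(n - 1)))
    Y = x0[1:]
    a, b = np.linalg.lstsq(B, Y, rcond=None)[0]     # developing/control coefficients

    def x1_hat(k):                                  # fitted accumulated value at step k
        return (x0[0] - b / a) * np.exp(-a * k) + b / a

    # restore to the original series by first-order differencing
    return np.array([x1_hat(k) - x1_hat(k - 1) for k in range(n, n + steps)])

# usage: predict the next two traffic-flow counts from a short history
history = [72, 78, 85, 93, 101]                     # hypothetical vehicle counts
print(gm11_forecast(history, steps=2))
```

A forecast like this gives the edge node a short look-ahead on load, which is what allows the resource utilisation strategy to be adjusted before the data stream actually changes.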
Prediction of stock values changes using sentiment analysis of stock news headlines
Published in Journal of Information and Telecommunication, 2021
Streaming data prove to be a rich source for data analysis because the data are collected in real time. The major characteristics of such data, accessibility and availability, support timely analysis and prediction. Das et al. (2018) present an analysis aimed at financial decisions such as stock market prediction, predicting the potential prices of a company's stock using Twitter data.