Cloud computing for big data
Published in Jun Deng, Lei Xing, Big Data in Radiation Oncology, 2019
The advent of Cloud computing has opened new avenues for performing massively distributed computation using large clusters of temporary VMs allocated on demand. This new paradigm has immense appeal for scientific computation due to its cost effectiveness and ease of development, especially compared with specialty solutions such as GPU clusters (Pratx and Xing 2011) and computer grids. Scientific computation can now leverage computational frameworks that were developed by Internet companies to handle big data tasks such as analytics, web indexing, data mining, and machine learning. These tools include programming frameworks such as MapReduce, Hadoop, and Spark, as well as NoSQL databases such as Cassandra and MongoDB. They are highly accessible, easy to program, and, in most cases, open source. This is in contrast to earlier tools such as the Message Passing Interface (MPI), which were tremendously difficult to scale to big data. In this section, three of the most popular big data–distributed Cloud computing frameworks are described in detail.
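The MapReduce model mentioned above can be illustrated by a minimal, single-process word-count sketch. All function and variable names here are ours, and a real deployment would run the map and reduce phases in parallel across many cluster nodes rather than in one Python process:

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield word.lower(), 1

def shuffle(pairs):
    """Shuffle: group intermediate values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data in the cloud", "cloud computing for big data"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts["big"], counts["cloud"])  # 2 2
```

The appeal of the model is that the map and reduce functions contain no distribution logic at all; the framework handles partitioning, shuffling, and fault tolerance.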
Mobile Augmented Reality to Enable Intelligent Mall Shopping by Network Data
Published in Yulei Wu, Fei Hu, Geyong Min, Albert Y. Zomaya, Big Data and Computational Intelligence in Networking, 2017
Second, we rely on Web data to provide the retailer reviews. Once we identify the mobile user’s location and orientation in the mall, we can infer which retailer they are currently looking at based on the mall’s floor plan. To assist intelligent shopping, we try to provide more information, in particular reviews, about the retail shops to the mobile users. In practice, a retailer’s reviews are often spread all over the Web. We try to aggregate the reviews for a retailer from different sources on the Web, so as to consolidate a more complete view of that retailer and avoid potential opinion bias. Because the Web has been indexed by search engines, our system makes use of the search engine interface to query and crawl the reviews from the Web. The problem therefore becomes how to find the right queries to ask the search engine, so as to find as many reviews as possible for a given retailer. As shown in Figure 16.2, after a retailer “Salon A” is recognized by using the wireless network data, we try to find its reviews. In general, we do not know in advance what kinds of queries are good, so we first collect some Web pages off-line for retailers in the same domain (e.g., “Salon B” and “Salon C” in this case). Among these off-line Web pages for “Salon B” and “Salon C,” there can be review pages as well as irrelevant pages such as yellow pages and advertisement pages. Based on these Web pages, we can analyze what kinds of keywords and phrases are useful for retrieving reviews. As shown in Figure 16.2, the keywords and the Web pages are connected into a network. From the network, we can observe that both “Salon B” and “Salon C” have their review pages linked to some stylists, while their irrelevant pages are rarely linked to any stylist. We can then conclude that “stylist” is likely a good query term for finding salon reviews. Finally, for our target “Salon A,” we use “stylist” to construct queries to search for reviews.
We remark that by using a search engine to retrieve the retailer reviews our system is able to utilize the search engine’s Web indexing capability. More details are provided in Section 16.3.
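One simple way to realize the keyword selection described above is to score each keyword in the off-line collection by how many review pages it appears in versus how many irrelevant pages. The sketch below uses invented page labels and keyword sets for illustration; the chapter's actual network analysis over Figure 16.2 may use a different scoring scheme:

```python
from collections import defaultdict

# Hypothetical labeled off-line pages for same-domain retailers:
# (label, keywords found on the page). Contents are illustrative.
pages = [
    ("review", {"stylist", "haircut", "service"}),   # Salon B review page
    ("review", {"stylist", "color", "friendly"}),    # Salon C review page
    ("irrelevant", {"address", "phone", "hours"}),   # yellow-pages entry
    ("irrelevant", {"discount", "coupon"}),          # advertisement page
]

def score_keywords(pages):
    """Score each keyword: +1 for each review page containing it,
    -1 for each irrelevant page containing it."""
    score = defaultdict(int)
    for label, keywords in pages:
        for kw in keywords:
            score[kw] += 1 if label == "review" else -1
    return score

scores = score_keywords(pages)
best = max(scores, key=scores.get)
print(best)  # stylist
query = f'"Salon A" {best}'  # query submitted to the search engine
```

Here “stylist” wins because it links to both review pages and to no irrelevant page, matching the observation drawn from the keyword–page network.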
Discovering the relationship of disasters from big scholar and social media news datasets
Published in International Journal of Digital Earth, 2019
Liang Zheng, Fei Wang, Xiaocui Zheng, Binbin Liu
Many of the subclass disaster types seldom occur in reality or are seldom discussed in scholarly papers and news articles. Including all of the more than 300 possible disaster-type subclasses in the search would greatly complicate data processing, and, more importantly, not all of them are relevant to the subsequent disaster chain research. Therefore, first and foremost, all of the disaster type subclasses are searched separately through Baidu Scholar and the number of search results for each keyword is tallied. A Web crawler is used to collect the data. A Web crawler is an Internet bot that systematically browses the World Wide Web, normally for the purpose of web indexing, following certain defined rules. It can be a program or script developed in different computer languages (Guo 2017); the Python programming language is adopted in our research. Some of the results are selected and listed in Table 2.
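The tallying step can be sketched as follows. The regular expression and page structure below are assumptions for illustration; Baidu Scholar's actual result pages have their own markup, and a real crawler would issue rate-limited HTTP requests rather than read canned strings:

```python
import re
from collections import OrderedDict

def parse_result_count(html):
    """Extract the reported number of search results from a results page.
    The pattern is illustrative; a real crawler must match the
    engine's actual markup."""
    match = re.search(r"([\d,]+)\s+results", html)
    return int(match.group(1).replace(",", "")) if match else 0

def tally(keywords, fetch):
    """Search each disaster-subclass keyword and tally its result count.
    `fetch` maps a keyword to the HTML of its results page; in a live
    crawler it would perform the HTTP request."""
    return OrderedDict((kw, parse_result_count(fetch(kw))) for kw in keywords)

# Offline demo with canned pages in place of live requests.
fake_pages = {
    "flood": "<div>About 1,240 results</div>",
    "mudslide": "<div>About 87 results</div>",
}
counts = tally(["flood", "mudslide"], fake_pages.get)
print(counts)  # OrderedDict([('flood', 1240), ('mudslide', 87)])
```

Sorting the resulting tallies then identifies which subclasses are frequent enough to keep for the disaster chain analysis.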