Speaking Naturally: Text and Natural Language Processing
Published in Jesús Rogel-Salazar, Advanced Data Science and Analytics with Python, 2020
In an ideal scenario, data can be obtained through appropriate Application Programming Interfaces (APIs) that accept requests, for example via the HTTP protocol. In these cases you can use, for instance, the Requests module in Python. While the efforts of the open data movement are slowly but surely making data accessible to all, it is often the case that we still need to obtain it in a more indirect way, such as scraping the contents of a webpage: we could manually copy and paste the data, but that is about as interesting as watching paint dry. Instead, we automate the process and do this work programmatically. There are some great modules, such as Scrapy or Beautiful Soup, for this task. In this case we will obtain some data with the help of Beautiful Soup. Web scraping extracts, or “scrapes”, data from a web page.
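To make the workflow concrete, here is a minimal sketch of fetching a page with Requests and parsing it with Beautiful Soup. The URL and the assumption that the data of interest sits in an HTML table are placeholders for illustration, not details from the text.

```python
# Minimal scraping sketch: fetch a page with Requests, parse it with
# Beautiful Soup. The URL and the table layout are hypothetical.
import requests
from bs4 import BeautifulSoup

url = "https://example.com/data-table"  # placeholder URL
response = requests.get(url, timeout=10)
response.raise_for_status()  # fail early on HTTP errors

soup = BeautifulSoup(response.text, "html.parser")

# Collect every row of the first table on the page as a list of cell texts
rows = []
table = soup.find("table")
if table is not None:
    for tr in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all(["td", "th"])]
        if cells:
            rows.append(cells)

print(rows[:5])  # inspect the first few scraped rows
```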
An analysis of open source operating systems based on complex networks theory
Published in Jimmy C.M. Kao, Wen-Pei Sung, Civil, Architecture and Environmental Engineering, 2017
Denghui Zhang, Zhengxu Zhao, Yiqi Zhou, Yang Guo
The data collection architecture is shown in Figure 1. Scrapy, a Python-based crawler module, drives the data flow. Scrapy can be adopted to extract information from a site such as DistroWatch, which does not provide an API or any other programmatic access mechanism. The Spiders schedule the first URL to crawl based on the CrawlerRules. The Scheduler sorts URL requests into a queue and then sends them to the Downloader. Once a webpage is downloaded completely, the Downloader sends the response content to the Spiders, which pass the response to an HTML Filter for further processing; at the same time, the Spiders return new requests to the Scheduler. The HTML Filter is the key component of the architecture: it extracts the distribution dependencies from each webpage. Because the network is asynchronous, the dependencies are first saved in an intermediate file for each distribution webpage. After all dependencies have been collected, they are converted into a GraphML file that models distributions as nodes and the dependencies among them as edges for the subsequent complex network analysis. The process repeats until there are no more requests from the Scheduler.
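A sketch of what the crawler component of this architecture might look like is given below. Scrapy's CrawlSpider and Rule mechanism plays the role of the CrawlerRules, and the parse callback stands in for the HTML Filter; the DistroWatch URL pattern and the XPath for dependency information are illustrative guesses, not selectors taken from the paper.

```python
# Sketch of the crawl loop described above, assuming Scrapy's standard
# Scheduler/Downloader/Spider pipeline. URL pattern and XPath are guesses.
import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class DistroSpider(CrawlSpider):
    name = "distrowatch"
    allowed_domains = ["distrowatch.com"]
    start_urls = ["https://distrowatch.com/"]  # first URL to crawl

    # Each matching link is queued by the Scheduler, fetched by the
    # Downloader, and handed to parse_distribution (the "HTML Filter").
    rules = (
        Rule(
            LinkExtractor(allow=r"table\.php\?distribution="),  # hypothetical pattern
            callback="parse_distribution",
            follow=True,
        ),
    )

    def parse_distribution(self, response):
        # Extract the distribution name and its "Based on" entries as
        # dependency edges; the markup assumptions are hypothetical.
        name = response.css("h1::text").get()
        based_on = response.xpath(
            "//td[contains(text(), 'Based on')]/following-sibling::td//a/text()"
        ).getall()
        # One record per page; Scrapy's feed export writes these to an
        # intermediate file for the later GraphML conversion step.
        yield {"distribution": name, "depends_on": based_on}


if __name__ == "__main__":
    process = CrawlerProcess(
        settings={"FEEDS": {"deps.jsonl": {"format": "jsonlines"}}}
    )
    process.crawl(DistroSpider)
    process.start()
```

Once all intermediate records are collected, a library such as networkx can build the node-and-edge graph and write it out with `networkx.write_graphml` for the complex network analysis.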
PPP project procurement model selection in China: does it matter?
Published in Construction Management and Economics, 2020
Wang Pu, Fei Xu, Ruoxun Chen, Rui Cunha Marques
The data on PPP projects in China were drawn from the PPP Projects Library on the official website of the China PPP Centre of the MoF, using a web crawler built with Scrapy 1.5. As of 30 December 2017, 4881 PPP projects were in the procurement and implementation stages. After deleting the observations with incomplete information, 4736 projects remained. Among those projects, the most commonly adopted procurement model was OT. CD ranked second with 1456 projects, while 141, 110 and 40 projects were procured by SSP, CN and IT, respectively. Because the proportion of projects adopting the latter three procurement models was too small (6%), these models were merged into an ‘Other’ category for simplification; thus, 291 projects were procured by ‘Other’. Overall, the procurement models were divided into three categories: OT, CD and ‘Other’.
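A minimal sketch of the cleaning and category-merging step might look as follows, assuming the scraped projects land in a pandas DataFrame with a `procurement_model` column; the column name and toy records are illustrative, not from the paper.

```python
# Sketch of dropping incomplete records and merging rare procurement
# models into 'Other'. Column name and sample data are hypothetical.
import pandas as pd

df = pd.DataFrame({
    "procurement_model": ["OT", "CD", "SSP", "CN", "IT", "OT", None],
    "project_name": [f"project_{i}" for i in range(7)],
})

# Delete observations with incomplete information
df = df.dropna()

# Merge the rarely used models (SSP, CN, IT) into a single 'Other' category
rare_models = {"SSP", "CN", "IT"}
df["procurement_model"] = df["procurement_model"].where(
    ~df["procurement_model"].isin(rare_models), "Other"
)

print(df["procurement_model"].value_counts())
```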