Working with Web Data and APIs
Published in Ian Foster, Rayid Ghani, Ron S. Jarmin, Frauke Kreuter, Julia Lane, Big Data and Social Science, 2020
Web scraping involves writing code to download and process web pages programmatically. We need to look at the website, identify how to obtain the information we want from it, and then write code to do so. Many websites deliberately make this difficult to prevent easy access to their underlying data, and some explicitly prohibit this type of activity in their terms of use. Another challenge when scraping data from websites is that their structure changes often, requiring researchers to continually update their code. This caveat applies to the code in this chapter as well: although it accurately captures the data from a website at the time of writing, it may not remain valid as the structure and content of the website change.
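To make the download-and-process pattern concrete, here is a minimal sketch in Python using the requests and BeautifulSoup libraries. The URL and the CSS selector are placeholders, and, as the excerpt warns, any real selector will break whenever the site's structure changes.

```python
import requests
from bs4 import BeautifulSoup

url = "https://example.org/listings"  # placeholder URL, not a real data source
response = requests.get(url, timeout=10)
response.raise_for_status()  # fail loudly if the server rejects the request

# Parse the downloaded HTML and pull out the pieces we care about.
soup = BeautifulSoup(response.text, "html.parser")
# "h2.listing-title" is a hypothetical selector; a real one must be found
# by inspecting the target page.
titles = [tag.get_text(strip=True) for tag in soup.select("h2.listing-title")]
print(titles)
```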
Application programming interfaces and web data for social research
Published in Uwe Engel, Anabel Quan-Haase, Sunny Xun Liu, Lars Lyberg, Handbook of Computational Social Science, Volume 2, 2021
One way to understand APIs is to consider how collecting data using APIs differs from, and resembles, web scraping. Web scraping ordinarily means accessing an HTML document and extracting the relevant pieces of information from it. Ignoring the details, we can think of web scraping as a two-step process. In the first step, we send a request to a server for a particular resource, frequently an HTML document. In the second, and often considerably more cumbersome, step we extract the relevant information from that document.
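A short sketch of the contrast the excerpt draws, with both URLs and the JSON field name assumed for illustration: scraping requires the extraction step, while an API hands back structured data directly.

```python
import requests
from bs4 import BeautifulSoup

# Web scraping, step 1: request the resource (an HTML document).
html = requests.get("https://example.org/article", timeout=10).text
# Step 2, often the more cumbersome part: locate the relevant information
# inside the markup (this assumes the page has an <h1> headline).
soup = BeautifulSoup(html, "html.parser")
headline = soup.find("h1").get_text(strip=True)

# API access: the server returns structured data (here JSON), so the
# extraction-heavy second step largely disappears.
record = requests.get("https://api.example.org/articles/1", timeout=10).json()
headline_from_api = record["title"]  # hypothetical field name
```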
Web scraping
Published in Rafael A. Irizarry, Introduction to Data Science, 2019
Web scraping, or web harvesting, is the term we use to describe the process of extracting data from a website. We can do this because the information a browser uses to render a webpage is received as a text file from a server. The text is code written in Hypertext Markup Language (HTML). Every browser provides a way to show the HTML source code for a page, though the method differs from browser to browser. On Chrome, you can use Control-U on a PC and command+alt+U on a Mac. You will then see the raw HTML source of the page.
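The same text the browser receives can be fetched programmatically. A minimal sketch in Python (the language used by the other excerpts on this page), assuming a placeholder URL:

```python
import requests

# Fetch a page and print the first lines of its HTML source, much as the
# browser's "view source" command does.
response = requests.get("https://example.org", timeout=10)
for line in response.text.splitlines()[:10]:
    print(line)
```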
Extraction and linking of motivation, specification and structure of inventions for early design use
Published in Journal of Engineering Design, 2023
Pingfei Jiang, Mark Atherton, Salvatore Sorce
To get started, initial searches using Google Patents are required to obtain a list of patents to be analysed. The results are then exported as a spreadsheet using the ‘Download’ feature in Google Patents, which contains information such as the title, date, assignee and link of each patent. The spreadsheet is then imported into Python, which allows web scraping of the patent contents via the links. The Python module BeautifulSoup with an HTML parser is used to carry out the web scraping. The scraped contents are stored in Python, ready for the next step.
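A hedged sketch of that workflow: the file name, the column name, and the tag used to locate the description are assumptions about the Google Patents export and page markup, which must be confirmed by inspecting an actual export and patent page.

```python
import pandas as pd
import requests
from bs4 import BeautifulSoup

# skiprows and the "result link" column reflect one observed export
# format; both may differ in practice.
patents = pd.read_csv("gp-search-results.csv", skiprows=1)

contents = {}
for link in patents["result link"]:
    html = requests.get(link, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    # Hypothetical extraction target: the section holding the patent
    # description on the patent page.
    section = soup.find("section", itemprop="description")
    contents[link] = section.get_text(" ", strip=True) if section else ""
```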
A Systematic Review of Data Analytics Job Requirements and Online-Courses
Published in Journal of Computer Information Systems, 2022
Mohamad Almgerbi, Andrea De Mauro, Adham Kahlawi, Valentina Poggioni
Web scraping provides tools for extracting and using relevant data and information from the Internet.44 Multiple studies have adopted web scraping techniques to build a corpus of documents for subsequent topic modeling.6,16,37,38,40–42,45,46 To extract a large number of data points related to online courses and online job offers, we built a custom web scraper in Python, using the Beautiful Soup library, to extract the title and description of every job position and course available on the selected websites. To restrict extraction to data relevant to this study, we used six keywords that collectively cover the field of Big Data Analytics, namely: “big data”, “data science”, “business intelligence”, “data mining”, “machine learning” and “data analytics”. These keywords were used to extract the data for both job offers and online courses.

Using the web scraper, we extracted 14,495 online job ads published in the six-month period from March 2019 to August 2019 and 3,636 online courses published in the same period. Using the pandas DataFrame.drop_duplicates method, we identified and deleted duplicate job ads and courses; we then removed entries unrelated to our study with an algorithm that checks whether the title or description of each record contains at least one of the six keywords used to extract the data. Through these progressive cleaning steps, summarized in Table 3, we finally obtained a dataset containing 9,067 online job ads and 764 online courses.

To the best of our knowledge, this is the largest dataset containing both job offers and course descriptions. Other works have proposed similar datasets containing only data extracted from job posts,6,16,37,38,40–42 but these datasets contain at most 3,000 job posts. The only study that also collected a dataset of courses40 is very limited in both size and geographical scope.
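A sketch of the two cleaning steps described above, deduplication and keyword filtering, under the assumption that the scraped records sit in CSV files with "title" and "description" columns; the file and column names are illustrative, not the study's own.

```python
import pandas as pd

# The six keywords used to delimit the field of Big Data Analytics.
KEYWORDS = ["big data", "data science", "business intelligence",
            "data mining", "machine learning", "data analytics"]

def clean(df: pd.DataFrame) -> pd.DataFrame:
    # Step 1: drop exact duplicates on title and description.
    df = df.drop_duplicates(subset=["title", "description"])
    # Step 2: keep only records whose title or description mentions at
    # least one keyword.
    text = (df["title"].fillna("") + " " + df["description"].fillna("")).str.lower()
    mask = text.apply(lambda t: any(kw in t for kw in KEYWORDS))
    return df[mask]

jobs = clean(pd.read_csv("job_ads.csv"))           # hypothetical input files
courses = clean(pd.read_csv("online_courses.csv"))
```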