Scraping – Knowledge and References

Data science ethics

Published in Benjamin S. Baumer, Daniel T. Kaplan, Nicholas J. Horton, Modern Data Science with R, 2021

Benjamin S. Baumer, Daniel T. Kaplan, Nicholas J. Horton

Problem 1 (Easy): A researcher is interested in the relationship of weather to sentiment (positivity or negativity of posts) on Twitter. They want to scrape data from https://www.wunderground.com and join that to Tweets in that geographic area at a particular time. One complication is that Weather Underground limits the number of data points that can be downloaded for free using their API (application program interface). The researcher sets up six free accounts to allow them to collect the data they want in a shorter time-frame. What ethical guidelines are violated by this approach to data scraping?

Professional Ethics

Published in Benjamin S. Baumer, Daniel T. Kaplan, Nicholas J. Horton, Texts in Statistical Science, 2017

Benjamin S. Baumer, Daniel T. Kaplan, Nicholas J. Horton

A researcher is interested in the relationship of weather to sentiment on Twitter. They want to scrape data from www.wunderground.com and join that to Tweets in that geographic area at a particular time. One complication is that Weather Underground limits the number of data points that can be downloaded for free using their API (application program interface). The researcher sets up six free accounts to allow them to collect the data they want in a shorter time-frame. What ethical guidelines are violated by this approach to data scraping?

Application programming interfaces and web data for social research

Published in Uwe Engel, Anabel Quan-Haase, Sunny Xun Liu, Lars Lyberg, Handbook of Computational Social Science, Volume 2, 2021

Dominic Nyhuis

One way to understand APIs is to consider how collecting data using APIs is different from and similar to web scraping. Web scraping ordinarily means accessing an HTML document and extracting the relevant pieces of information from the document. Ignoring the details, we can think of web scraping as a two-step process. In a first step, we send a query to a server to request a particular resource, frequently an HTML document. In a second and often considerably more cumbersome step, we extract the relevant information from the HTML document.
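
The two steps described above can be sketched with Python's standard library. This is only an illustration, not the chapter's own code: the function names, the choice of `<h2>` elements as the "relevant information", the User-Agent string, and the sample document are all assumptions made for the example.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Step 1: request a resource from a server and return the HTML as text."""
    # The User-Agent value is a placeholder chosen for this sketch.
    request = Request(url, headers={"User-Agent": "example-research-bot/0.1"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


class HeadingExtractor(HTMLParser):
    """Step 2: collect the text of every <h2> element in a document."""

    def __init__(self) -> None:
        super().__init__()
        self._in_heading = False
        self._buffer = []
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._in_heading = True
            self._buffer = []

    def handle_data(self, data):
        # Text inside nested tags (e.g. <em>) still arrives here.
        if self._in_heading:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "h2" and self._in_heading:
            self.headings.append("".join(self._buffer).strip())
            self._in_heading = False


def extract_headings(html: str) -> list[str]:
    parser = HeadingExtractor()
    parser.feed(html)
    parser.close()
    return parser.headings


# A small inline document stands in for a server response,
# so the extraction step can be run without network access.
SAMPLE_HTML = """
<html><body>
  <h1>Site name</h1>
  <h2>First <em>article</em></h2>
  <p>Some text.</p>
  <h2>Second article</h2>
</body></html>
"""

headings = extract_headings(SAMPLE_HTML)
print(headings)
```

With a live page, the two steps would be chained as `extract_headings(fetch_html(url))`. Most of the code belongs to the second step, which matches the passage's remark that extraction is usually the more cumbersome part.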

The Bundling of Business Intelligence and Analytics

Published in Journal of Computer Information Systems, 2023

Kashif Saeed, Anna Sidorova, Akash Vasanthan

We collected data on 1,000 job postings from Indeed.com. Indeed.com is considered the number one job site in the world, with over 250 million unique visitors every month and 175 million resumes [44]. Approximately 10 jobs are added per second to Indeed.com globally [44]. Indeed.com does not provide a feature for downloading job data; however, the data can be scraped from the site. We used BeautifulSoup, a Python library, for data scraping. We ran two distinct searches over the full text of the job postings, collecting 500 results each: one for the keyword “business intelligence,” and the other for “analytics” or “predictive analytics.” The scraped data included the job title, job description, location, and salary (if available). Scraping publicly available data is considered ethical and legal for research purposes [45].
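
As a rough illustration of this kind of extraction, the sketch below parses job title, description, location, and optional salary with BeautifulSoup. It is not the authors' scraper: the HTML snippet, the CSS class names (`job`, `title`, `location`, `salary`, `description`), and the helper names are hypothetical and do not reflect Indeed.com's actual markup.

```python
from bs4 import BeautifulSoup

# Hypothetical search-results markup, invented for this example.
SAMPLE_RESULTS_HTML = """
<div class="job">
  <h2 class="title">Business Intelligence Analyst</h2>
  <span class="location">Denton, TX</span>
  <span class="salary">$70,000 a year</span>
  <p class="description">Build dashboards and reports.</p>
</div>
<div class="job">
  <h2 class="title">Analytics Manager</h2>
  <span class="location">Remote</span>
  <p class="description">Lead a predictive analytics team.</p>
</div>
"""


def text_or_none(parent, css_class):
    """Return the stripped text of the first matching child, or None if absent."""
    node = parent.find(class_=css_class)
    return node.get_text(strip=True) if node is not None else None


def parse_jobs(html):
    """Turn a results page into a list of dicts, one per job posting."""
    soup = BeautifulSoup(html, "html.parser")
    jobs = []
    for card in soup.find_all("div", class_="job"):
        jobs.append(
            {
                "title": text_or_none(card, "title"),
                "description": text_or_none(card, "description"),
                "location": text_or_none(card, "location"),
                # Salary is optional, so a missing element becomes None.
                "salary": text_or_none(card, "salary"),
            }
        )
    return jobs


jobs = parse_jobs(SAMPLE_RESULTS_HTML)
for job in jobs:
    print(job)
```

Returning `None` for a missing salary keeps every record the same shape, which simplifies later analysis of postings where salary is not listed.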

A Systematic Review of Data Analytics Job Requirements and Online-Courses

Published in Journal of Computer Information Systems, 2022

Mohamad Almgerbi, Andrea De Mauro, Adham Kahlawi, Valentina Poggioni

Web scraping provides tools for extracting and using relevant data and information from the Internet [44]. Multiple studies adopted web scraping techniques to build a corpus of documents for subsequent topic modeling [6, 16, 37, 38, 40–42, 45, 46]. To extract a large number of data points related to online courses and online job offers, we built a custom web scraper using Python's Beautiful Soup library that extracts the title and description of every job position and course available on the selected websites. To extract only the data related to this study, we used six keywords which collectively cover the field of Big Data Analytics, namely: “big data”, “data science”, “business intelligence”, “data mining”, “machine learning” and “data analytics”. These keywords were used to extract the data for both job offers and online courses. Using the web scraper, we were able to extract 14,495 online job ads published in a six-month period from March 2019 to August 2019 and 3,636 online courses that were published in the same period. Using the pandas DataFrame.drop_duplicates method, we identified and deleted the duplicate job ads and courses. We then removed the entries not related to our study, using an algorithm that checks whether one or more of the six keywords appear in the title and the description of each entry. Through these progressive cleaning steps, summarized in Table 3, we finally obtained a dataset containing 9,067 online job ads and 764 online courses. To the best of our knowledge, this is the biggest dataset containing both job offers and course descriptions. Other works proposed similar datasets containing only data extracted from job posts [6, 16, 37, 38, 40–42], but these datasets contain data from 3,000 job posts at most. The only work that also collected a dataset for courses is [40], but that dataset is very limited in both size and geographical scope.
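
The two cleaning steps described above, deduplication followed by keyword filtering, might look roughly like the following in pandas. This is a sketch under stated assumptions rather than the authors' code: the column names, the toy rows, and the decision to match a keyword in either the title or the description (case-insensitively) are all choices made for the example.

```python
import re

import pandas as pd

# The six keywords listed in the passage.
KEYWORDS = [
    "big data",
    "data science",
    "business intelligence",
    "data mining",
    "machine learning",
    "data analytics",
]

# Toy data invented for illustration; column names are assumptions.
ads = pd.DataFrame(
    {
        "title": [
            "Data Science Lead",
            "Data Science Lead",
            "Office Manager",
            "BI Developer",
        ],
        "description": [
            "Own machine learning projects.",
            "Own machine learning projects.",
            "Manage the front desk.",
            "Business intelligence reporting.",
        ],
    }
)

# Step 1: drop rows whose title and description are both identical.
deduplicated = ads.drop_duplicates(subset=["title", "description"])

# Step 2: keep rows where at least one keyword appears in the combined text.
pattern = "|".join(re.escape(keyword) for keyword in KEYWORDS)
combined_text = deduplicated["title"] + " " + deduplicated["description"]
is_relevant = combined_text.str.contains(pattern, case=False, regex=True)
cleaned = deduplicated[is_relevant].reset_index(drop=True)

print(cleaned)
```

Escaping each keyword with `re.escape` before joining them guards against keywords that contain regex metacharacters, even though none of the six listed here do.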