Tidy data and iteration
Published in Texts in Statistical Science, 2017
Benjamin S. Baumer, Daniel T. Kaplan, Nicholas J. Horton
The procedure for reading data in one of these formats varies depending on the format. For Excel or Google Sheets data, it is sometimes easiest to use the application software to export the data as a CSV file. There are also R packages for reading directly from each (readxl and googlesheets, respectively), which are useful if the spreadsheet is updated frequently. For the technical software package formats, the foreign R package provides useful reading and writing functions. For relational databases, even if they reside on a remote server, several useful R packages allow you to connect to them directly, most notably dplyr and DBI. CSV and HTML <table> formats are frequently encountered sources for data scraping. The next subsections give a bit more detail about how to read them into R.
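The excerpt describes the R workflow; as a rough analogue in Python (pandas, not the R packages the chapter names), the same two scraping-relevant cases, CSV files and HTML <table> elements, can be read as follows. This is a minimal sketch; the file name and URL are hypothetical.

import pandas as pd

# Read a local CSV file into a DataFrame (analogue of R's read.csv).
df = pd.read_csv("survey.csv")  # "survey.csv" is a placeholder file name

# Read every HTML <table> on a page into a list of DataFrames;
# requires an HTML parser such as lxml or html5lib to be installed.
tables = pd.read_html("https://example.com/stats.html")  # placeholder URL
first_table = tables[0]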
Published in Uwe Engel, Anabel Quan-Haase, Sunny Xun Liu, Lars Lyberg, Handbook of Computational Social Science, Volume 2, 2021
Stefan Bosse, Lena Dahlhaus, Uwe Engel
The use of internet platforms may require adherence to their terms of service, which may be a condition of visiting the website at all. If such terms of service become binding between the website operator and a scraper (e.g., because the scraper has visited the website or logged into it and consented to the terms in a legally binding way) and if they forbid data scraping, the researcher is well advised to seek explicit permission before scraping. Alternatively, if a digital platform grants access via an API, this may be a reasonable alternative to scraping the web content.
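To illustrate the API route the authors recommend, here is a minimal sketch using Python's requests library against a hypothetical platform endpoint; the URL, token, and parameters are placeholders, not any real platform's API, and rate limits and scopes would come from the platform's documentation.

import requests

# Hypothetical REST endpoint and access token; the real values, parameters,
# and usage limits are defined by the platform's API documentation and terms.
API_URL = "https://api.example.com/v1/posts"
TOKEN = "YOUR_ACCESS_TOKEN"

resp = requests.get(
    API_URL,
    headers={"Authorization": f"Bearer {TOKEN}"},
    params={"query": "data scraping", "limit": 100},
    timeout=30,
)
resp.raise_for_status()  # fail loudly on 4xx/5xx responses
records = resp.json()    # structured records instead of scraped HTML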
The Bundling of Business Intelligence and Analytics
Published in Journal of Computer Information Systems, 2023
Kashif Saeed, Anna Sidorova, Akash Vasanthan
We collected data on 1,000 jobs from Indeed.com. Indeed.com is considered the number one job site in the world, with over 250 million unique visitors every month and 175 million resumes [44]. Approximately 10 jobs are added per second to Indeed.com globally [44]. Indeed.com does not provide a job data download feature; however, the data can be scraped from the site. We used BeautifulSoup, a Python library, for data scraping. Two distinct searches over the full text of job postings, of 500 results each, were run: one for the keyword “business intelligence,” and the other for “analytics” or “predictive analytics.” The scraped data included the job title, job description, location, and salary (if available). Scraping publicly available data is considered ethical and legal for research purposes [45].
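The paper names BeautifulSoup but does not reproduce its code; a minimal sketch of that scraping pattern follows. The URL parameters and CSS class names are hypothetical placeholders, since Indeed's actual markup is not given in the excerpt and changes over time.

import requests
from bs4 import BeautifulSoup

# Placeholder search URL and parameters for illustration only.
url = "https://www.indeed.com/jobs"
resp = requests.get(
    url,
    params={"q": "business intelligence", "start": 0},
    headers={"User-Agent": "Mozilla/5.0"},
    timeout=30,
)
soup = BeautifulSoup(resp.text, "html.parser")

jobs = []
# "job_card", "title", etc. are invented class names; a real scraper would
# inspect the page source to find the actual selectors.
for card in soup.find_all("div", class_="job_card"):
    jobs.append({
        "title": card.find("h2", class_="title").get_text(strip=True),
        "location": card.find("span", class_="location").get_text(strip=True),
        "description": card.find("div", class_="summary").get_text(strip=True),
    })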
GreedyBigVis – A greedy approach for preparing large datasets to multidimensional visualization
Published in International Journal of Computers and Applications, 2022
Moustafa Sadek Kahil, Abdelkrim Bouramoul, Makhlouf Derdour
Table 4 shows the comparison between the existing works and the proposed approach according to the listed factors. The abbreviations used are listed in the points below.

Techniques (Techs): BA: Binned Aggregation, M: Map, BN: Bayesian Network, HM: Hitmaps, BSP: Binned Scatter Plots, PH: Pivot Hierarchy, LH: Linked Histograms, PC: Parallel Coordinates, DA: Data Aggregation, MR: MapReduce, OWL: Ontology (OWL), S: SPARQL, 3M: 3D Map, SL: Skyline, SF: Suitability Function, LR: Learning to Rank (a machine learning model), DT: Decision Tree, BT: Bitset Tree, WAH: Word-Aligned Hybrid, C: Charts, TD: Tree Diagrams, GD: Graph Diagrams, DS: Data Scraping, WC: WordCloud, P: Probabilities, GA: Greedy Algorithm, MDA: Model-Driven Architecture, ST: Set Theory, CC: Canopy Clustering, RR: Regression Residuals.

Data types (DataT): T: Text, Im: Image, Vd: Video, Sp: Spatial, Cat: Categorical, Tmp: Temporal, Nm: Numerical.
The ASHRAE Great Energy Predictor III competition: Overview and results
Published in Science and Technology for the Built Environment, 2020
Clayton Miller, Pandarasamy Arjunan, Anjukan Kathirgamanathan, Chun Fu, Jonathan Roth, June Young Park, Chris Balbach, Krishnan Gowri, Zoltan Nagy, Anthony D. Fontanini, Jeff Haberl
We also examined the general categories of the analysis notebooks by manually assigning tags to each one. The largest group related to the preprocessing of data, with almost half of the notebooks dedicated to this topic; preprocessing emerged as a critical aspect of the performance of the winning models. Prediction models themselves accounted for a third of the notebooks, with the rest split between exploratory data analysis (EDA) (16.1%) and data scraping (3.86%). Figure 13 analyzes the self-assigned tags that users gave their notebooks upon posting them. These labels include descriptors such as whether a notebook is suitable for beginners or as starter code, as well as technical descriptors such as deep learning, ensembling, and feature engineering.