Focused crawler

A focused crawler is a web crawler designed to find and analyze web pages that are relevant to a particular topic or subject. It continuously monitors selected web pages for changes and uses a special selection policy to restrict its analysis to websites that are relevant to the topic. The two main components of a focused crawler are the full page text and the link structure of the web page.

From: Intelligent Technologies for Web Applications [2019], Building Sensor Networks [2017], An ontology learning based approach for focused web crawling using combined normalized pointwise mutual information and Resnik algorithm [2022]

Real-Time Search in the Sensor Internet

Published in Ioanis Nikolaidis, Krzysztof Iniewski, Building Sensor Networks, 2017

Richard Mietz, Kay Römer

Multimedia search engines use crawling techniques similar to those of general search engines, but they concentrate on embedded multimedia files rather than on text content. The selection policy of an image crawler controls which extracted images are kept for the subsequent indexing process. Images that are too small or completely transparent are often discarded because they usually serve only the design of a website. Topical search engines use focused crawlers with a special selection policy that restricts the analysis to websites relevant to a specific topic. Some follow only URLs where the link description tag contains topic-matching keywords or where the website itself is about the topic, while others download all websites, analyze them, and discard the irrelevant ones.
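The two selection policies described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation of any particular search engine: the `ImageInfo` record, the `MIN_SIDE` threshold of 32 pixels, and the function names `keep_image` and `should_follow` are all assumptions introduced for the example, and the keyword test is a plain word match on the link description.

```python
from dataclasses import dataclass


@dataclass
class ImageInfo:
    """Hypothetical summary of an extracted image (not from the source)."""
    width: int
    height: int
    max_alpha: int  # highest alpha value in the image; 0 means fully transparent


# Assumed threshold: images with a side shorter than this are treated as
# design elements (spacers, icons). The value is illustrative only.
MIN_SIDE = 32


def keep_image(img: ImageInfo, min_side: int = MIN_SIDE) -> bool:
    """Image crawler policy: discard tiny or completely transparent images."""
    if img.width < min_side or img.height < min_side:
        return False
    if img.max_alpha == 0:
        return False
    return True


def should_follow(link_description: str, topic_keywords: set) -> bool:
    """Focused crawler policy: follow a URL only if its link description
    contains at least one topic keyword (case-insensitive word match)."""
    words = set(link_description.lower().split())
    return bool(words & topic_keywords)


if __name__ == "__main__":
    print(keep_image(ImageInfo(200, 150, 255)))
    print(keep_image(ImageInfo(1, 1, 255)))
    print(should_follow("Wireless sensor networks tutorial", {"sensor"}))
```

A crawler that instead downloads every page and filters afterwards would apply a similar relevance test to the full page text rather than to the link description.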

An ontology learning based approach for focused web crawling using combined normalized pointwise mutual information and Resnik algorithm

Published in International Journal of Computers and Applications, 2022

P. R. Joe Dhanith, B. Surendiran

A web crawler is a software bot that navigates the web, modeled as a directed graph, using the breadth-first search (BFS) algorithm until all pages have been obtained or no storage space remains. Because of the dynamic growth of the internet, it is not possible to retrieve all web pages. To overcome this problem, the focused web crawler [1] was proposed for seeking, acquiring, indexing, and maintaining websites on a particular set of subjects that represents a comparatively small section of the Web. Compared with general-purpose crawlers, focused crawlers substantially reduce time and space requirements and better satisfy user needs. The full page text and the link structure of the web page are the two main components of focused crawlers.
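As a rough illustration of how page text and link structure work together, the sketch below runs a breadth-first crawl over a small in-memory directed graph and expands only pages whose text matches the topic. It is a toy model under stated assumptions, not the method of the cited article: the `GRAPH` and `PAGES` dictionaries, the keyword-based `is_relevant` test, and the `max_pages` storage budget are all hypothetical stand-ins for real fetching, relevance scoring, and storage limits.

```python
from collections import deque

# Hypothetical toy web: link structure (directed graph) and page text.
GRAPH = {"a": ["b", "c"], "b": ["d"], "c": ["e"], "d": [], "e": []}
PAGES = {
    "a": "focused crawler overview",
    "b": "crawler frontier design",
    "c": "cooking recipes",
    "d": "topical crawler evaluation",
    "e": "crawler for recipes",
}


def is_relevant(text, keywords):
    """Assumed relevance test: the page text contains any topic keyword."""
    return bool(set(text.lower().split()) & keywords)


def focused_bfs(graph, pages, seed, keywords, max_pages):
    """Breadth-first crawl that stores and expands only relevant pages,
    stopping when the frontier is empty or the storage budget is used up."""
    queue = deque([seed])
    seen = {seed}
    stored = []
    while queue and len(stored) < max_pages:
        url = queue.popleft()
        if not is_relevant(pages.get(url, ""), keywords):
            continue  # irrelevant page: neither stored nor expanded
        stored.append(url)
        for target in graph.get(url, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return stored


if __name__ == "__main__":
    print(focused_bfs(GRAPH, PAGES, "a", {"crawler"}, max_pages=10))
```

In this toy run, page "e" is never visited even though its text mentions the topic, because it is reachable only through the irrelevant page "c". That is the trade-off of pruning by relevance: less time and storage, at the cost of possibly missing relevant pages behind irrelevant ones.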
