Domain-Specific Journal Recommendation Using a Feed Forward Neural Network
Published in Himansu Das, Jitendra Kumar Rout, Suresh Chandra Moharana, Nilanjan Dey, Applied Intelligent Decision Making in Machine Learning, 2020
Meschenmoser et al. define web scraping as an automated technique to extract and retrieve targeted web data at scale (Meschenmoser, Meuschke, Hotz, and Gipp 2016). A variety of tools and interfaces for building personalized scrapers exist, as well as customizable, well-equipped scraping frameworks. Glez-Peña et al. and Haddaway present ample summaries of frameworks and tools for various extraction tasks, including DataToolBar, Helium Scraper, Screen Scraper, and FMiner (Glez-Peña, Lourenço, López Fernández, Reboiro-Jato, and Fdez-Riverola 2013; Haddaway 2015). However, very few scrapers exist for mining scientific records and bibliographic data. Smith-Unna et al. propose the ContentMine framework, which allows building personalized tools and other data mining elements (Smith-Unna and Murray-Rust 2014). Tang et al. propose the Aminer framework, which collects and integrates heterogeneous social network data from many web sources for researchers (Tang, Zhang, Yao, Li, Zhang, and Su 2008). However, that framework makes no provision for personalized content mining.
DW-PathSim: a distributed computing model for topic-driven weighted meta-path-based similarity measure in a large-scale content-based heterogeneous information network*
Published in Journal of Information and Telecommunication, 2019
In this paper, we use the real-world DBLP bibliographic network as the main dataset for all experiments. The DBLP bibliographic network contains over 2M authors, 4.1M papers, and over 5.3K venues/journals. For the text content of published papers in DBLP, we use the abstract text dataset provided by Aminer. The Aminer corpus contains over 1.5M abstracts of published DBLP papers. From these 1.5M abstract documents, we randomly selected 100K papers as the main corpus for training the LDA topic model. Using LDA, we extracted 10 latent topics and 20 keywords per topic from that corpus. To evaluate the accuracy of outputs, we used the nDCG (normalized Discounted Cumulative Gain) metric (Järvelin & Kekäläinen, 2002) to compute the accuracy rate. Each output result is manually judged and assigned a relevance score from 0 to 3 (higher is better): 0 means the object is not relevant to the target object in the user's query, 1 means it is quite relevant, 2 means it is closely relevant, and 3 means it is very/highly relevant.
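The graded 0-3 judgments described above feed directly into nDCG. A minimal sketch of the computation, using the common log2-discounted formulation of DCG (the function names and the example score list are illustrative, not taken from the paper):

```python
import math

def dcg(relevances):
    # Discounted Cumulative Gain: each graded relevance score is
    # discounted by log2(rank + 1), so early ranks count more.
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))

def ndcg(relevances):
    # Normalize by the DCG of the ideal (descending) ordering,
    # yielding a value in [0, 1].
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Hypothetical judged relevance of the top-5 returned objects
# on the paper's 0-3 scale (3 = highly relevant).
scores = [3, 2, 3, 0, 1]
print(ndcg(scores))
```

A perfectly ordered result list (relevances already descending) scores 1.0; misplacing a highly relevant object toward the tail lowers the score.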