Explore chapters and articles related to this topic
A methodology to detect and extract tables from born-digital PDF documents using deep learning
Published in Sheela Evangeline, M.R. Rajkumar, Saritha G. Parambath, Recent Advances in Materials, Mechanics and Management, 2019
C. Shichin, A.C. Vinay Chandran, V.S. Unnikrishnan
Table extraction is the task of detecting and decomposing table information in a document [2]. The task has attracted the attention of researchers because tables are among the most widely used elements for presenting data, and their contents should be extracted for reuse. While humans can easily recognize and comprehend tables, computers cannot, because tables lack common identifying characteristics. A good number of research efforts have addressed table detection so far. However, most research on table detection has concentrated on image-based documents. Although the PDF format is becoming increasingly important, far fewer prior works consider PDF documents [5]. Optical character recognition (OCR) is a comprehensive method that can be used to extract information from a selected area in a document. Deep learning techniques could be applied to extract information from documents with increased accuracy.
Machine extraction of polymer data from tables using XML versions of scientific articles
Published in Science and Technology of Advanced Materials: Methods, 2021
Hiroyuki Oka, Atsushi Yoshizawa, Hiroyuki Shindo, Yuji Matsumoto, Masashi Ishii
Polymer data in scientific articles are described in text, figures, or tables. We first studied the extraction of polymer data from tables, because the major components of polymer data are numerical values such as glass transition temperature, and these values are frequently condensed in tables as numerals. In addition, tabular forms manage the data systematically, which is very convenient for creating machine extraction algorithms. Although text and figures also carry physical values, processes such as relation extraction and image processing are not easy. Nevertheless, we have also been researching these objects, because articles do not always include tables, and because it will be necessary to relate table data to information in text or figures to perform informatics. Among these objects, tables are the most convenient starting point for developing a data extraction system. However, extracting tabular forms as plain text from PDF files by machine is not easy, because the character positions in PDFs are lost in the process of conversion into plain text. As a result, research on machine extraction of data from tables has not been intensive. ChemDataExtractor [14], an automated chemical data extraction system described in the literature, is the only system researched for this purpose thus far. However, this system does not cover complicated tables such as multi-column, multi-row, and merged tables, because extracting such tables is considerably more difficult than extracting simple tables. Complicated tables are frequently used in scientific articles, which further hinders machine extraction of scientific data from tables. One way to solve these problems is to use XML. Although PDF has been the typical electronic format for scientific articles, XML versions have recently become available. XML is convenient for machine information extraction, because the XML tags manage the content systematically.
By referencing the XML tags, the tabular forms of even the most complicated tables can be accurately extracted in plain text.
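As a minimal sketch of how tags can resolve merged cells, the following Python function expands a tagged table into a rectangular grid of plain-text cells. It is not the authors' implementation. The element and attribute names (`table`, `tr`, `th`, `td`, `rowspan`, `colspan`) are assumptions borrowed from HTML-style table markup; publisher XML may use a different table model, in which case the names would need to be adapted. The function name `table_to_grid` and the sample values are hypothetical and purely illustrative.

```python
import xml.etree.ElementTree as ET


def table_to_grid(table_xml):
    """Expand a tagged table with merged cells into a rectangular grid.

    Assumes HTML-style markup (tr/th/td with rowspan/colspan); this is an
    illustrative assumption, not a statement about any publisher's schema.
    Nested tables are not handled.
    """
    root = ET.fromstring(table_xml.strip())
    occupied = {}  # maps (row, column) -> cell text
    n_rows = 0
    n_cols = 0
    for r, row in enumerate(root.iter("tr")):
        c = 0
        for cell in row:
            if cell.tag not in ("td", "th"):
                continue
            # Skip positions already filled by a cell spanning down
            # from an earlier row.
            while (r, c) in occupied:
                c += 1
            text = "".join(cell.itertext()).strip()
            rowspan = int(cell.get("rowspan", "1"))
            colspan = int(cell.get("colspan", "1"))
            # Copy the text into every position the merged cell covers.
            for dr in range(rowspan):
                for dc in range(colspan):
                    occupied[(r + dr, c + dc)] = text
            c += colspan
            n_rows = max(n_rows, r + rowspan)
            n_cols = max(n_cols, c)
    return [
        [occupied.get((i, j), "") for j in range(n_cols)]
        for i in range(n_rows)
    ]


# Hypothetical table with a two-row header and merged cells.
EXAMPLE = """
<table>
  <tr><th rowspan="2">Polymer</th><th colspan="2">Tg</th></tr>
  <tr><th>K</th><th>degC</th></tr>
  <tr><td>Sample A</td><td>373</td><td>100</td></tr>
</table>
"""

for line in table_to_grid(EXAMPLE):
    print("\t".join(line))
```

Repeating the text of a merged cell in every position it covers keeps each data value aligned with its full header path (here, "Tg" over both unit columns), which is the information that is lost when a PDF table is flattened to plain text.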