A methodology to detect and extract tables from born-digital PDF documents using deep learning
Published in Sheela Evangeline, M.R. Rajkumar, Saritha G. Parambath, Recent Advances in Materials, Mechanics and Management, 2019
C. Shichin, A.C. Vinay Chandran, V.S. Unnikrishnan
For detecting tables with only horizontal lines and tables with no line separation, text information extracted from the original PDF documents is used. Information about each character is extracted from the PDF document using the Apache PDFBox library; it includes the X and Y coordinates, spacing, width, and height of each character. Tables usually have text arranged in cells. After the text in a cell there is a gap along the X coordinate until the first character of the next column, and similarly, after each row there is a change along the Y coordinate. These ‘jumps’ in coordinates can be detected and used to distinguish cells. A sliding window that scans the document image is employed, guided by the coordinates extracted from the PDF. The sliding window detects the ‘jumps’, and lines separating rows and columns are drawn in the image accordingly to reconstruct the table structure.
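The ‘jump’ detection can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes character boxes are already available as `(char, x, y, width)` tuples (e.g. extracted via PDFBox's TextPosition objects), and the gap thresholds are arbitrary illustrative values.

```python
# Sketch of coordinate-"jump" detection for cell boundaries.
# Assumes character boxes (char, x, y, width) have already been extracted
# from the PDF (e.g. via Apache PDFBox); names and thresholds are illustrative.

def find_column_gaps(chars, gap_threshold=10.0):
    """Return X positions where a horizontal 'jump' suggests a column boundary."""
    row = sorted(chars, key=lambda c: c[1])       # sort one row's chars by X
    gaps = []
    for prev, curr in zip(row, row[1:]):
        prev_end = prev[1] + prev[3]              # right edge of previous char
        if curr[1] - prev_end > gap_threshold:
            gaps.append((prev_end + curr[1]) / 2)  # midpoint of the gap
    return gaps

def find_row_gaps(chars, gap_threshold=5.0):
    """Return Y positions where a vertical 'jump' suggests a row boundary."""
    ys = sorted({round(c[2], 1) for c in chars})
    return [(a + b) / 2 for a, b in zip(ys, ys[1:]) if b - a > gap_threshold]

# Example: two cells ("ab" and "cd") in one row, separated by a wide gap
row = [("a", 0, 0, 5), ("b", 5, 0, 5), ("c", 40, 0, 5), ("d", 45, 0, 5)]
print(find_column_gaps(row))  # -> [25.0], the midpoint of the gap
```

The returned gap positions correspond to the X (or Y) coordinates at which the separating lines would be drawn onto the document image.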
Parsing AUC Result-Figures in Machine Learning Specific Scholarly Documents for Semantically-enriched Summarization
Published in Applied Artificial Intelligence, 2022
Iqra Safder, Hafsa Batool, Raheem Sarwar, Farooq Zaman, Naif Radi Aljohani, Raheel Nawaz, Mohamed Gaber, Saeed-Ul Hassan
A full-text document is composed of different sections, such as Abstract, Introduction, Literature Review, Methodology, Experiments and Results, Conclusion, and References, organized hierarchically. Result-figures are most likely to occur in the Experiments and Results section, so we need to segment that section and ignore the rest of the paper. We therefore extracted the plain text from the PDF document using the PDFBox library and applied a document segmentation mechanism to divide the document into its standard sections.
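The excerpt does not detail the segmentation mechanism itself; one common approach is to split the extracted plain text at lines matching standard section headings. The sketch below follows that assumption (the heading list and regex are illustrative, not the authors' exact rules):

```python
import re

# Hypothetical section segmenter: splits plain text (e.g. from PDFBox) at
# lines that consist of a standard section heading, optionally numbered.
HEADINGS = ["abstract", "introduction", "literature review", "methodology",
            "experiments and results", "conclusion", "references"]
HEADING_RE = re.compile(
    r"^\s*(?:\d+\.?\s*)?(" + "|".join(re.escape(h) for h in HEADINGS) + r")\s*$",
    re.IGNORECASE | re.MULTILINE,
)

def segment_sections(text):
    """Map each detected section heading to the text that follows it."""
    matches = list(HEADING_RE.finditer(text))
    sections = {}
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        sections[m.group(1).lower()] = text[m.end():end].strip()
    return sections

paper = "Abstract\nWe study figures.\nExperiments and Results\nAUC was 0.9.\nConclusion\nDone."
print(segment_sections(paper)["experiments and results"])  # -> AUC was 0.9.
```

Once the text is segmented this way, only the ``experiments and results`` entry needs to be passed to the figure-parsing pipeline.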
Machine extraction of polymer data from tables using XML versions of scientific articles
Published in Science and Technology of Advanced Materials: Methods, 2021
Hiroyuki Oka, Atsushi Yoshizawa, Hiroyuki Shindo, Yuji Matsumoto, Masashi Ishii
In this study, the training data for DNN learning of polymer names were prepared using a rule-based algorithm, as follows. The algorithms for IUPAC names and their abbreviations were created from their character patterns: IUPAC names usually begin with ‘poly’, while their abbreviations frequently include ‘P’. Because parentheses and curly and square brackets frequently follow immediately after ‘poly’ in IUPAC names, this regularity was also used. Besides these two types of names, algorithms were created, based on their regularities, for the sample labels described in parentheses or square brackets immediately after polymer full names (usually IUPAC, common, or trade names), and for copolymers and blends, which are named by joining homopolymer names with slashes or hyphens. Additionally, an algorithm for analyzing ‘copolymer of A and B’ and ‘A and B copolymer’ was created. Typical polymer names listed in polymer books, such as ‘SBR’ (styrene butadiene rubber) and ‘cellulose’, and other frequently used polymer names were registered and identified by string matching. Stop words were also registered to avoid incorrectly annotating words other than polymer names that include ‘poly’ or ‘P’. Because polymer names occasionally include modifiers, such as ‘isotactic’ and ‘doped’, an algorithm for identifying such modifiers was created by registering them or using regular expressions. Using the rule-based algorithm, polymer names in 737 polymer articles published between 2000 and 2008 were annotated. The algorithm was applied to tokenized text prepared from the PDF files of the polymer articles, using PDFBox [17] for plain-text conversion and Stanford Core NLP [18] for tokenization. Sequence labeling was then applied to the annotation output using the labels B, I, E, S, and O.
S stands for a single-token polymer name; B, I, and E represent the beginning, intermediate, and end tokens in a polymer name; and O stands for a token unrelated to the polymer name. The output files consisted of two tab-separated columns for tokens and labels with a blank line after a period for sentence splitting.
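The labeling scheme and output format described above can be sketched as follows. The regex is only a toy stand-in for the ‘begins with poly / abbreviation includes P’ rules (the real rule set also covers brackets, copolymer joins, stop words, and modifiers), and the multi-token span in the example is supplied by hand:

```python
import re

# Toy stand-in for part of the rule-based detector: single tokens beginning
# with 'poly' or looking like abbreviations such as 'PS' or 'PMMA'.
POLY_RE = re.compile(r"^(poly[\w()\[\]{}-]*|P[A-Z]{1,5})$")

def bieso_labels(tokens, spans):
    """Label tokens with B/I/E/S/O given (start, end) polymer-name spans (inclusive)."""
    labels = ["O"] * len(tokens)
    for start, end in spans:
        if start == end:
            labels[start] = "S"                   # single-token polymer name
        else:
            labels[start] = "B"                   # beginning token
            labels[end] = "E"                     # end token
            for i in range(start + 1, end):
                labels[i] = "I"                   # intermediate tokens
    return labels

def to_column_format(sentence_tokens, sentence_spans):
    """Two tab-separated columns (token, label), blank line between sentences."""
    lines = []
    for tokens, spans in zip(sentence_tokens, sentence_spans):
        lines += [f"{t}\t{l}" for t, l in zip(tokens, bieso_labels(tokens, spans))]
        lines.append("")                          # sentence boundary
    return "\n".join(lines)

tokens = ["The", "styrene", "butadiene", "rubber", "and", "PS", "samples", "."]
# Multi-token span given by hand; the single-token span comes from the toy regex.
spans = [(1, 3)] + [(i, i) for i, t in enumerate(tokens) if POLY_RE.match(t)]
print(bieso_labels(tokens, spans))
# -> ['O', 'B', 'I', 'E', 'O', 'S', 'O', 'O']
```

`to_column_format` then yields the two-column, tab-separated training file with a blank line after each sentence, as described above.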