Natural Language Processing
Published in Subasish Das, Artificial Intelligence in Highway Safety, 2023
The goal of syntactic parsing is to determine whether an input sentence belongs to a given language, or to assign a structure to the input text. Assigning that structure requires a grammar of the language. Since it is generally not possible to define rules that produce a parse for every sentence, statistical and machine learning parsers are very important. Complete parsing is a complicated problem because ambiguities often arise. In many situations it is enough to identify only the unambiguous parts of a text. These parts are known as chunks, and they are found using a chunker, or shallow parser. Shallow parsing (chunking) is thus the process of finding non-overlapping groups of words in a text that have a clear structure. Figure 50 illustrates the steps of NLP analysis, and Figure 51 shows examples of stemming and lemmatization.
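As a brief illustration of the stemming/lemmatization distinction mentioned above, here is a minimal sketch using NLTK's PorterStemmer and WordNetLemmatizer (the example words are illustrative, not the book's Figure 51):

```python
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download('wordnet')  # lexicon required by the lemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ['studies', 'crashes', 'driving']:
    print(word, '->', stemmer.stem(word), '/', lemmatizer.lemmatize(word, pos='v'))

# Stemming strips suffixes heuristically ('studies' -> 'studi'), while
# lemmatization returns a dictionary form ('studies' -> 'study').
```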
Natural Language Processing Associated with Expert Systems
Published in Jay Liebowitz, The Handbook of Applied Expert Systems, 2019
Faced with this situation, and confronted with the exigency of having at its disposal “robust” tools to meet the increasing need for NLP applications, the NLP community is making more and more use of “shallow parsing” techniques; we have already encountered, in subsection 3.5.1, a first form of shallow parsing, consisting of the use of an underspecified form of representation for the output of the syntactic analyzers. This form of parsing is normally used when the application requires the analysis of large text corpora: in this case, in fact, because of the problems outlined in the previous paragraph, the use of traditional, “deep” syntactic parsers can be both arduous and cumbersome. In general, shallow parsing is characterized by the fact that its output is not the usual phrase-structure tree, but a (much) less detailed form of syntactic analysis in which, normally, only some phrasal constituents are recognized. These can consist, e.g., of noun phrases — without, however, any indication of their internal structure or of their function in the sentence — or of the main verb accompanied by its direct arguments. Inspired by the success of stochastic methods like HMMs in the field of speech understanding, the builders of modern shallow parsers often make use of probabilistic methods that are tested and refined, e.g., on the basis of reference corpora composed of sets of manually bracketed sentences.
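For a concrete picture of such underspecified output, here is a minimal sketch of a regular-expression chunker in NLTK (the single grammar rule is illustrative): it marks noun phrases as flat, non-overlapping chunks, with no internal structure and no grammatical function.

```python
import nltk

# One chunk rule: an optional determiner, any adjectives,
# and one or more nouns form a flat NP chunk.
grammar = "NP: {<DT>?<JJ>*<NN.*>+}"
chunker = nltk.RegexpParser(grammar)

sent = [('the', 'DT'), ('little', 'JJ'), ('dog', 'NN'),
        ('barked', 'VBD'), ('at', 'IN'), ('the', 'DT'), ('cat', 'NN')]
print(chunker.parse(sent))
# (S (NP the/DT little/JJ dog/NN) barked/VBD at/IN (NP the/DT cat/NN))
```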
Natural Language Processing for Information Retrieval
Published in Anuradha D. Thakare, Shilpa Laddha, Ambika Pawar, Hybrid Intelligent Systems for Information Retrieval, 2023
Shallow parsing, also called chunking, is an NLP technique that analyzes the structure of a sentence by dividing it into tokens/words and then grouping them into phrases. Let us consider the conll2000 corpus from the NLTK library to train the chunking model. An annotated sentence is shown in the following.
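A minimal sketch of loading these annotations (the conll2000 corpus and chunked_sents API are standard NLTK; the printed tree is the corpus's first training sentence, abbreviated here):

```python
import nltk
from nltk.corpus import conll2000

nltk.download('conll2000')  # fetch the annotated corpus if not present

# Each sentence carries POS tags plus IOB chunk labels;
# chunk_types=['NP'] keeps only the noun-phrase chunks.
train_sents = conll2000.chunked_sents('train.txt', chunk_types=['NP'])
print(train_sents[0])
# e.g. (S (NP Confidence/NN) in/IN (NP the/DT pound/NN) is/VBZ widely/RB ...)
```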
A reasoning model for geo-referencing named and unnamed spatial entities in natural language place descriptions
Published in Spatial Cognition & Computation, 2022
Madiha Yousaf, Diedrich Wolter
This paper presents an approach for automated interpretation of natural language place descriptions using a reasoning method that can supplement the information obtained from natural language processing. The paper’s focus is to analyze the design and contribution of spatial-ontological reasoning for geo-referencing places from natural language input. We apply information extraction to obtain relational expressions from the input text, aiming for an over-generalization that avoids wrong parser commitments. A reasoning stage follows, which starts with an abductive contextualization step that generates hypothetical is-a relations and unifications, using information available in the input context. A deductive inference stage then propagates relational expressions and rejects inconsistent interpretations. Whenever a constrained query can be generated that is expected to lead to few matches in the OSM database, the database is consulted, possibly enabling further geo-references. Finally, the interpretation establishing the highest number of geo-references is chosen. Our system is implemented in a constraint-based programming approach that does not require a specific order of steps to be followed, unlike graph-based approaches such as that of Vasardani et al. (2018). We present an evaluation of our implemented system that reveals the effectiveness of reasoning, in particular the ability to resolve and interpret unnamed entities. Our reasoning-empowered approach, which employs only the freely available Stanford CoreNLP part-of-speech tagger, can outperform existing state-of-the-art tagging systems because it can exploit more context from the input. Our approach does not require a correct parse of the input sentence but can derive information from shallow parsing alone. Despite the variety of open problems faced in understanding spatial language, the proposed method is already able to interpret composite phrases like “post office near the train station in Bamberg” correctly by geo-referencing “post office”, “train station”, and the city of Bamberg; we regard this as a potentially useful advancement.
Extraction and linking of motivation, specification and structure of inventions for early design use
Published in Journal of Engineering Design, 2023
Pingfei Jiang, Mark Atherton, Salvatore Sorce
Then a Part-Of-Speech (POS) tagger is built using scikit-learn in Python. The Penn Treebank Corpus from NLTK is used to train the POS tagger: 80% of the dataset is used for training and 20% for testing. Features of the tokenised word, including the previous word, the next word, and the 1- to 3-letter prefixes and suffixes of the word, are taken into consideration during training. DecisionTreeClassifier from scikit-learn is used with 20k samples. Evaluated against the testing dataset, the trained POS tagger achieved an accuracy of 90.8%. This tagger is then used to carry out noun phrase lemmatisation, consolidating noun phrases with similar expressions such as ‘battery’ and ‘batteries’. This is accomplished by converting POS tags to WordNet tags and then applying the WordNet Lemmatiser. The reason for lemmatising noun phrases only, while keeping other phrases unchanged, is to maintain the accuracy of the parsing performed later.

A classifier-based chunker is then built using the CoNLL 2000 corpus, with 85% of the dataset used for training and 15% for testing. The trained chunker achieved 93.1% accuracy in identifying IOB (Inside, Outside, Beginning) tags and an F-measure of 89.2%. The trained chunker works well in identifying nouns and verbs with the labels ‘NP’ and ‘VP’ respectively; however, Structures that are expressed in a more complex form might be missed, as well as Specifications that include prepositions, e.g. ‘to define’ and ‘created by’. As a result, an additional step of shallow parsing using regular expressions is performed to capture the more intricate forms of Structure and Specification. For instance, ‘DP’, referring to a design parameter, can exist in the form <NP><PP><NP><NP>*, and ‘VERB’, referring to a verb, can exist in the form <PP>?<VP>+<RB>*<PP>*. This results in another tree consisting of larger chunks. Figure 2 shows an example comparison of chunks using only the classifier-based chunker (top) and with the additional step of shallow parsing (bottom) for a bladeless fan patent independent claim, US9249810B2. ‘the air flow’, ‘from’, and ‘the base’ are identified as one design parameter, ‘the air flow from the base’, while ‘for’ and ‘receiving’ are identified as one verb, ‘for receiving’.
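A minimal sketch of this extra shallow-parsing step, using NLTK's cascaded RegexpParser: the NP/PP/VP rule bodies below are illustrative assumptions, not the authors' exact rules; only the DP and VERB patterns follow the forms quoted above.

```python
import nltk

# Cascaded chunk grammar: the later DP and VERB rules group the chunks
# produced by the earlier NP/PP/VP rules into larger chunks.
grammar = r"""
  NP:   {<DT|JJ|NN.*>+}          # flat noun phrase (illustrative)
  PP:   {<IN|TO>}                # preposition (illustrative)
  VP:   {<VB.*>+}                # verb group (illustrative)
  DP:   {<NP><PP><NP>+}          # design parameter, i.e. <NP><PP><NP><NP>*
  VERB: {<PP>?<VP>+<RB>*<PP>*}   # verb pattern, as quoted above
"""
parser = nltk.RegexpParser(grammar)

# 'the air flow from the base' should come out as a single DP chunk:
# (S (DP (NP the/DT air/NN flow/NN) (PP from/IN) (NP the/DT base/NN)))
sent = [('the', 'DT'), ('air', 'NN'), ('flow', 'NN'),
        ('from', 'IN'), ('the', 'DT'), ('base', 'NN')]
print(parser.parse(sent))
```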