Text Analysis
Published in Ian Foster, Rayid Ghani, Ron S. Jarmin, Frauke Kreuter, Julia Lane, Big Data and Social Science, 2020
Evgeny Klochikhin, Jordan Boyd-Graber
Text corpora, sets of similar documents (each set is called a corpus), can be very helpful. For example, the Brown University Standard Corpus of Present-Day American English, or simply the Brown Corpus (Francis and Kučera, 1979), is a collection of processed documents from works published in the United States in 1961. The Brown Corpus represents a historical milestone: it was a machine-readable collection of 1 million words across 15 balanced genres, with each word tagged with its part of speech (e.g., noun, verb, preposition). The British National Corpus (University of Oxford, 2006) repeated that process for British English at a larger scale. The Penn Treebank (Marcus et al., 1993) provides additional information: in addition to part-of-speech annotation, it provides syntactic annotation. For example, what is the object of the sentence "The man bought the hat"? These standard corpora serve as training data for the classifiers and machine learning techniques used to automatically analyze text (Halevy et al., 2009).
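As an illustration of what these corpora contain, the sketch below loads the part-of-speech-tagged Brown Corpus and the syntactically annotated Penn Treebank sample that ship with NLTK. The choice of NLTK and its bundled 10% treebank sample is an assumption made for illustration; the chapter itself does not prescribe a toolkit.

```python
# A minimal sketch of inspecting the Brown Corpus and the Penn Treebank
# sample bundled with NLTK (both must be downloaded once via nltk.download).
import nltk
nltk.download("brown", quiet=True)
nltk.download("treebank", quiet=True)

from nltk.corpus import brown, treebank

# Brown Corpus: every word carries a part-of-speech tag.
print(brown.tagged_words()[:5])
# e.g. [('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ...]

# Penn Treebank (10% sample): full syntactic parse trees, not just tags.
tree = treebank.parsed_sents()[0]
tree.pretty_print()  # draws the constituency structure as ASCII art
```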
Natural Language Understanding
Published in Richard E. Neapolitan, Xia Jiang, Artificial Intelligence, 2018
Richard E. Neapolitan, Xia Jiang
The most straightforward way to learn the probabilities for a PCFG is to learn them from a treebank, which is a collection of correct parse trees. For example, the well-known Penn Treebank [Marcus et al., 1993] contains 3,000,000 words, along with their parts of speech and the parse trees containing the words. It was developed through expert annotation combined with automation.
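As a sketch of how this estimation works in practice, the snippet below reads gold parse trees from NLTK's bundled Penn Treebank sample (an assumption for illustration; the full treebank is licensed separately) and induces a PCFG whose rule probabilities are relative frequencies among productions sharing the same left-hand side.

```python
# A minimal sketch of learning PCFG rule probabilities from a treebank.
import nltk
nltk.download("treebank", quiet=True)

from nltk import Nonterminal, induce_pcfg
from nltk.corpus import treebank

# Collect every production (grammar rule) used in the gold parse trees.
productions = []
for tree in treebank.parsed_sents():
    productions += tree.productions()

# Maximum-likelihood estimation: P(A -> b) = count(A -> b) / count(A).
grammar = induce_pcfg(Nonterminal("S"), productions)

# Each rule carries its estimated probability, e.g.  NP -> DT NN [0.1398]
print(grammar.productions(lhs=Nonterminal("NP"))[:5])
```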
Exploring zero-shot and joint training cross-lingual strategies for aspect-based sentiment analysis based on contextualized multilingual language models
Published in Journal of Information and Telecommunication, 2023
Dang Van Thin, Hung Quoc Ngo, Duong Ngoc Hao, Ngan Luu-Thuy Nguyen
The joint training scenario trains a single model on multiple languages, exploiting the fact that many languages share morphological, phonological, and syntactic phenomena (Ammar et al., 2016; Bender, 2011; Mulcaire et al., 2018). As a result, training on multiple languages can improve model performance on related languages. Ammar et al. (2016) found that a model trained on multilingual treebanks outperformed monolingual training for parsing tasks; however, they employed a traditional deep learning model (LSTM) combined with static multilingual word embeddings rather than contextual word representations. Mulcaire et al. (2018) applied the same idea by combining training data across languages for semantic role-labelling tasks, and their experiments showed that joint learning can outperform training on monolingual data. Aharoni et al. (2019) presented extensive experiments in multilingual neural machine translation, training many languages in a single model and demonstrating that multilingual joint learning is beneficial across NLP tasks. The authors employed the XLM-R language model as the baseline. More recently, multilingual pre-trained language models have brought substantial benefits to low-resource languages. Given the abundance of pre-trained language models and languages, how to choose among them to improve performance on a specific language is an interesting open problem.
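As a concrete illustration of this joint-training setup, the sketch below pools sentiment examples from several languages and fine-tunes a single XLM-R classifier on the mixture. The three-class label set, learning rate, and tiny inline dataset are illustrative assumptions rather than the authors' configuration; the Hugging Face transformers classes used are standard.

```python
# A minimal sketch of joint multilingual fine-tuning with XLM-R.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=3  # negative / neutral / positive (assumed)
)

# Pool training examples across languages into one training stream.
multilingual_data = [
    ("The battery life is excellent.", 2),   # English
    ("La batterie est décevante.", 0),       # French
    ("Pin dùng rất tốt.", 2),                # Vietnamese
]

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for text, label in multilingual_data:
    batch = tokenizer(text, return_tensors="pt", truncation=True)
    loss = model(**batch, labels=torch.tensor([label])).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```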
MS-TR: A Morphologically enriched sentiment Treebank and recursive deep models for compositional semantics in Turkish
Published in Cogent Engineering, 2021
Sultan Zeybek, Ebubekir Koç, Aydın Seçer
In this work, we seek to address three main issues for Turkish SA: (i) to investigate the effectiveness of recursive compositional models for Turkish sentiment analysis; (ii) to construct a Turkish Sentiment Treebank (a hierarchical representation of sentences, i.e., fully labelled parse trees) that captures semantic compositionality within a given sentence; and (iii) to address the lack of sentiment analysis resources by providing a resource that can also be used with other recursive deep models in future studies.
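To make "fully labelled parse trees" concrete, the sketch below builds one in the bracketed style popularized by the Stanford Sentiment Treebank, where every constituent (not just the sentence) carries its own sentiment label. The example sentence, the 0-4 label scheme, and the assumption that MS-TR's annotation looks like this are illustrative; the treebank's actual labels may differ.

```python
# A minimal sketch of a fully labelled sentiment parse tree.
from nltk import Tree

# (label (label word) ...): each bracketed constituent is annotated.
labelled = Tree.fromstring(
    "(3 (2 (2 The) (2 film)) (4 (3 (2 was) (3 surprisingly)) (4 good)))"
)

# The sentence-level sentiment is the root label.
print("root sentiment:", labelled.label())

# Inner nodes expose compositional labels for every phrase.
for subtree in labelled.subtrees():
    print(subtree.label(), " ".join(subtree.leaves()))
```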