Author Yarowsky, David ♦ Ngai, Grace ♦ Wicentowski, Richard
Source CiteSeerX
Content type Text
File Format PDF
Language English
Subject Domain (in DDC) Computer science, information & general works ♦ Data processing & computer science
Subject Keyword Simple Direct Annotation Projection ♦ Second Language ♦ Robust Projection ♦ Noise-robust Tagger ♦ Text Analysis ♦ Direct Annotation Projection ♦ French Achieves ♦ Language-specific Knowledge ♦ Bilingual Text Corpus ♦ Stand-alone Monolingual Part-of-speech Tagger ♦ Core Part-of-speech ♦ Induced Stand-alone Part-of-speech Tagger ♦ Named-entity Tagger ♦ Multi-lingual Text Analysis Tool ♦ Part-of-speech Tagging ♦ Case Study ♦ Complete French Verbal System ♦ Lemmatizer Training Procedure ♦ Noun Phrase Brac ♦ Morphological Analyzer ♦ Aligned Corpus ♦ Word Alignment ♦ Noun-phrase Bracketer ♦ Arbitrary Foreign Language ♦ Optimal Alignment ♦ Text Analysis Tool ♦ Induced Morphological Analyzer Achieves ♦ Incomplete Initial Projection ♦ Lemmatization Accuracy ♦ Raw Text ♦ Hand-annotated Training Data ♦ Base Noun-phrase Bracketers ♦ Accurate System
Description This paper describes a system and set of algorithms for automatically inducing stand-alone monolingual part-of-speech taggers, base noun-phrase bracketers, named-entity taggers and morphological analyzers for an arbitrary foreign language. Case studies include French, Chinese, Czech and Spanish. Existing text analysis tools for English are applied to bilingual text corpora and their output projected onto the second language via statistically derived word alignments. Simple direct annotation projection is quite noisy, however, even with optimal alignments. Thus this paper presents noise-robust tagger, bracketer and lemmatizer training procedures capable of accurate system bootstrapping from noisy and incomplete initial projections. Performance of the induced stand-alone part-of-speech tagger applied to French achieves 96% core part-of-speech (POS) tag accuracy, and the corresponding induced noun-phrase bracketer exceeds 91% F-measure. The induced morphological analyzer achieves over 99% lemmatization accuracy on the complete French verbal system. This achievement is particularly noteworthy in that it required absolutely no hand-annotated training data in the given language, and virtually no language-specific knowledge or resources beyond raw text. Performance also significantly exceeds that obtained by direct annotation projection. Keywords multilingual, text analysis, part-of-speech tagging, noun phrase brac-
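As an illustration of the "simple direct annotation projection" step the description mentions, the following is a minimal sketch, not the authors' implementation. It assumes the English side is already tagged and that word alignments are given as (source index, target index) pairs. The function name `project_tags`, the `UNK` placeholder and the toy sentence pair are hypothetical. The noise-robust training procedures that the paper builds on top of this baseline are not shown.

```python
from collections import Counter

def project_tags(src_tags, alignment, tgt_len, unaligned="UNK"):
    """Copy source-side tags onto target tokens through word alignments.

    src_tags:  list of tags for the source (e.g. English) tokens
    alignment: iterable of (src_index, tgt_index) pairs
    tgt_len:   number of tokens in the target sentence
    Target tokens with no alignment receive the `unaligned` placeholder.
    """
    # One vote counter per target token: a target word aligned to several
    # source words takes the most frequent projected tag.
    votes = [Counter() for _ in range(tgt_len)]
    for s, t in alignment:
        votes[t][src_tags[s]] += 1
    return [v.most_common(1)[0][0] if v else unaligned for v in votes]

# Toy example (hypothetical data): "the red car" -> "la voiture rouge"
english_tags = ["DT", "JJ", "NN"]
alignment = [(0, 0), (1, 2), (2, 1)]
print(project_tags(english_tags, alignment, tgt_len=3))
```

Unaligned or misaligned target words are exactly where such a baseline breaks down, which is consistent with the description's remark that direct projection is noisy even with optimal alignments.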
Educational Role Student ♦ Teacher
Age Range above 22 years
Educational Use Research
Education Level UG and PG ♦ Career/Technical Study
Learning Resource Type Article
Publisher Date 2001-01-01
Publisher Institution In HLT '01: Proceedings of the first international conference on Human language technology research. Association for Computational Linguistics