II. Design Principles:
Because no large-scale shallow-annotated Chinese Treebank is currently available, the design of the PolyU Treebank draws on two important fully annotated Chinese Treebanks: the Penn Chinese Treebank and the Sinica Treebank. The Penn Chinese Treebank follows the Government and Binding framework and contains more than 500,000 Chinese words, annotated mainly by hand under a strict quality assurance process [Xue et al. 2002]. The Sinica Treebank was developed by Academia Sinica, Taiwan. Phrase bracketing and annotation were carried out using a head-driven chart parser guided by Information-based Case Grammar (ICG), followed by manual post-editing. The Sinica Treebank contains 39,000 parsed trees and 329,000 words [Chen et al. 1999; Chen et al. 2003]. A natural way to obtain a shallow Treebank is to extract shallow structures from a fully annotated Treebank. Unfortunately, the Penn Chinese Treebank and the Sinica Treebank are annotated using different grammar frameworks and different word segmentation/POS tagging strategies, which makes them unsuitable for our annotation.
To ensure the quality of the PolyU Treebank and its broad acceptance, its design and construction follow four basic principles:
Principle 1: High resource-sharing capability
The PolyU Treebank has been designed to provide a general purpose Treebank for use by as wide a range of different applications as possible. This calls for the selection of an effective and well-accepted grammatical framework for representing syntactical information as well as for a well-accepted word segmentation/POS tagging scheme.
We have chosen to use the Phrase-Standard Grammar (PSG), proposed by Peking University. PSG is widely accepted by Chinese NLP researchers. In the PSG framework, phrases rather than words are treated as the basic Chinese syntactical unit. The rationale is that while an individual word is flexible and may have different part-of-speech (POS) tags representing its different functions in sentences, a phrase is made up of a number of words normally driven by a headword and consequently has a stable internal structure and order. Using this framework, syntactical analysis should be performed in a cascaded fashion, and a linear character string may finally be syntactically analyzed to form a cascaded tree.
Because Chinese has no orthographic device for delimiting words, text must be segmented into words before POS tagging can be conducted. We chose the segmented and tagged People’s Daily corpus annotated by Peking University. This corpus, which contains articles from the 1998 People’s Daily, was segmented and tagged in accordance with the PSG framework. The claimed accuracies of word segmentation and POS tagging are 99.9% and 99.5%, respectively [Yu et al. 2001]. Using this widely used and accurate resource has significantly reduced the cost of annotation and ensures the maximum sharing of our output.
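To make the role of the pre-segmented, pre-tagged input concrete, the following is a minimal sketch of reading one line of such a corpus. It assumes, purely for illustration, that each token is written as word/TAG and that tokens are separated by whitespace; the function name parse_tagged_line and the sample tokens are hypothetical and are not taken from the corpus documentation.

```python
from typing import List, Tuple


def parse_tagged_line(line: str) -> List[Tuple[str, str]]:
    """Split a line of 'word/TAG' tokens into (word, tag) pairs.

    Assumption: tokens are whitespace-separated and the POS tag follows
    the last '/' in each token. This layout is illustrative only.
    """
    pairs = []
    for token in line.split():
        if "/" not in token:
            raise ValueError(f"token without a POS tag: {token!r}")
        # rsplit keeps any '/' that belongs to the word itself
        word, tag = token.rsplit("/", 1)
        pairs.append((word, tag))
    return pairs


# Illustrative input: three segmented words with hypothetical tags.
sample = "迈向/v 充满/v 希望/n"
words = parse_tagged_line(sample)
```

Because the words and tags are taken as given, the phrase annotation only has to add brackets and labels on top of this token sequence.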
Principle 2: Low structural complexity
The second design principle is that the PolyU Treebank should not be structurally too complex; its annotation framework should be clear and simple and its syntactic and functional information should be labeled according to commonly used and widely accepted standards.
To ensure that our shallow annotation satisfies the syntactic information requirements of typical language applications, we chose to focus on the annotation of phrases and the identification of headwords while ignoring sentence-level syntax. More specifically, we wanted to identify three types of information: (1) base-phrases, that is, non-nesting phrases with at least one headword; (2) maximal phrases, that is, phrases that mark the boundary of our scope of examination, enclosing the base-phrases and headwords, and that therefore should not extend beyond the subject, predicate, complement clause, embedded clause or other syntactic components of a sentence; and (3) mid-phrases, which are the intermediate nesting phrases between the base-phrases and the maximal phrases, where such phrases exist. Since we do not intend to provide full parsing information, the nesting of mid-phrases is limited: to keep structural complexity low, we allow at most three levels of nested brackets. In other words, there can be at most one level of mid-phrases.
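The three-level limit can be checked mechanically. The sketch below assumes, for illustration only, that phrases are delimited by square brackets in the annotated text; the bracket notation, the function names and the constant MAX_LEVELS are hypothetical and do not describe the actual PolyU Treebank file format.

```python
MAX_LEVELS = 3  # maximal phrase, at most one mid-phrase, base-phrase


def max_bracket_depth(annotation: str) -> int:
    """Return the deepest level of '[' ... ']' nesting in the string."""
    depth = 0
    deepest = 0
    for ch in annotation:
        if ch == "[":
            depth += 1
            deepest = max(deepest, depth)
        elif ch == "]":
            depth -= 1
            if depth < 0:
                raise ValueError("unbalanced brackets: ']' without '['")
    if depth != 0:
        raise ValueError("unbalanced brackets: missing ']'")
    return deepest


def within_nesting_limit(annotation: str) -> bool:
    """True if the annotation nests no deeper than MAX_LEVELS brackets."""
    return max_bracket_depth(annotation) <= MAX_LEVELS
```

Under these assumptions, a string such as "[[[a b]NP c]NP d]VP" has depth three and passes, while any fourth level of bracketing would be flagged for revision.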
Principle 3: Sufficient and effective syntactic information
The third design principle is to provide syntactic information at a low level of complexity that is sufficient for and effective in a wide variety of NLP applications. Earlier work on Chinese shallow annotation covered only non-nesting base-phrases [Sun 2001]. However, base-phrase annotation alone is not adequate for many applications. Our annotation permits three levels of nesting, and this has a number of advantages. First, the maximal phrases indicate the essential syntactic elements of a sentence, such as subject and predicate, and the availability of this information makes it possible in many applications to refine the search context window. Secondly, the base-phrases are the simplest and most stable structures in the sentence and are thus regarded as the smallest syntactic units. Lastly, the nested mid-phrases can help to describe distant modifier relations within the scope of a maximal phrase and are thus useful for certain applications.
The PolyU Treebank provides not only adequate syntactic information but also some semantic information. To achieve this, each phrase is given a syntactic label and, in some cases, an additional label providing semantic information. For example, “国家航空和宇宙航行局” (NASA) is a noun phrase and is assigned the label NP. Semantically, it is a noun phrase that names an organization, so it is also given the additional label NT. The fact that the PolyU Treebank is a ‘Not-So-Shallow’ Treebank makes it substantially different from, and more useful than, shallow Treebanks that annotate base-phrases only. The information it provides will help language applications resolve ambiguities. Finally, we should point out that in our Treebank the headword of a base-phrase is also annotated.
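One way to picture the information attached to a phrase is as a small record holding a syntactic label, an optional semantic label and a pointer to the headword. The sketch below is an illustration under stated assumptions, not the Treebank's storage format: the class and field names are hypothetical, and the segmentation, POS tags and head choice shown for the NASA example are invented for the example.

```python
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass
class Word:
    text: str
    pos: str  # POS tag supplied by the segmented and tagged corpus


@dataclass
class Phrase:
    label: str                               # syntactic label, e.g. "NP"
    children: List[Union["Phrase", Word]]
    semantic_label: Optional[str] = None     # e.g. "NT" for an organization
    head_index: Optional[int] = None         # position of the head in children

    def surface(self) -> str:
        """Concatenate the words covered by this phrase."""
        return "".join(
            c.text if isinstance(c, Word) else c.surface()
            for c in self.children
        )

    def head(self) -> Optional[Word]:
        """Return the headword, descending into a nested head phrase."""
        if self.head_index is None:
            return None
        child = self.children[self.head_index]
        return child if isinstance(child, Word) else child.head()


# Hypothetical segmentation and tags, for illustration only.
nasa = Phrase(
    label="NP",
    semantic_label="NT",
    children=[
        Word("国家", "n"), Word("航空", "n"), Word("和", "c"),
        Word("宇宙", "n"), Word("航行", "vn"), Word("局", "n"),
    ],
    head_index=5,
)
```

Keeping the semantic label optional mirrors the text: every phrase has a syntactic label, but only some carry the additional semantic one.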
Principle 4: Large quantities of data annotated with great accuracy
The sizes of existing Chinese Treebanks range from 100,000 to 500,000 words. This is an acceptable size for full parsing [Leech and Garside 1996] but not enough for lexical-level analysis. Given resource constraints and with reference to work on the English language, our goal is to create a Treebank of one million words. A Treebank of this size would support the design and training of a shallow parser and could be used directly in the collocation extraction and named entity identification work currently conducted in this research group.
A well-developed Treebank must be very accurately annotated. To reduce annotation errors, we have designed clear and simple annotation guidelines. To avoid the inaccuracies arising from automatic parsing, our annotation is conducted manually, while post-annotation error and consistency checking is supported by tools that we developed ourselves. Finally, to reduce human error, some texts are double- and triple-annotated and then compared. This allows certain mistakes to be easily identified and removed.
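The comparison of double-annotated texts can be sketched as a set comparison over labeled spans. This is a minimal illustration and not a description of the checking tools mentioned above: it assumes each annotation has been reduced to a set of (start, end, label) triples over word positions, and the function name and the agreement score (twice the number of shared spans divided by the total number of spans) are choices made for the example.

```python
from typing import Set, Tuple

Span = Tuple[int, int, str]  # (start word index, end word index, phrase label)


def compare_annotations(a: Set[Span], b: Set[Span]):
    """Return the spans both annotators agree on, the spans unique to
    each annotator, and a simple agreement score between 0 and 1."""
    agreed = a & b
    only_a = a - b
    only_b = b - a
    total = len(a) + len(b)
    # Two empty annotations are treated as full agreement.
    agreement = 2 * len(agreed) / total if total else 1.0
    return agreed, only_a, only_b, agreement


# Hypothetical annotations of the same sentence by two annotators.
first = {(0, 3, "NP"), (3, 5, "VP")}
second = {(0, 3, "NP"), (3, 6, "VP")}
agreed, only_first, only_second, score = compare_annotations(first, second)
```

The spans that appear in only one annotation are the candidates for manual review, which is how disagreements between annotators surface likely mistakes.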