A treebank can be defined as a syntactically processed corpus. It is a language resource annotated with linguistic information at the word, phrase, clause, and sentence levels, so that the annotations form a bank of linguistic trees. Treebanks have been built for many languages: for English, examples include the Penn Treebank [Marcus 1993] and the treebank of ICE-GB [Wallis 2003]; for Chinese, they include the Penn Chinese Treebank [Xia et al. 2000; Xue et al. 2002] and the Sinica Treebank [Chen et al. 1999; Chen et al. 2003].
Most reported Chinese Treebanks, including the Penn Chinese Treebank and the Sinica Treebank, are based on full parsing, in which a complete syntactic analysis is performed: determining the syntactic categories of words, locating chunks that can be nested, finding relations between phrases, and resolving attachment ambiguities. The output of full parsing is therefore a set of complete syntactic trees. Due to the complexity of natural languages, automatic full parsing is still quite challenging. An alternative is to adopt a divide-and-conquer strategy, i.e., to divide full parsing into several independent sub-tasks, each of which can be solved relatively easily. One of these sub-tasks is shallow (or partial) parsing. The purpose of shallow parsing is to identify the local syntactic structures that are relatively simple and easy to recognize, while ignoring the complicated analysis of how these phrases combine syntactically to construct a sentence. Its output therefore identifies only the local structures in a sentence, and these local structures form the sub-trees of a full syntactic tree. Because it does not require complex and ambiguous attachment analysis, shallow parsing can recognize some local structures at a much lower cost and with much higher accuracy. For these reasons, shallow parsing has gained more research attention in recent years and has been applied in many NLP applications. However, the lack of a large-scale Chinese shallow Treebank to serve as training and testing data is currently an impediment to research in this area. This has motivated us to construct a Chinese shallow Treebank for Chinese natural language processing applications. This Treebank, referred to as the PolyU Treebank, is named after the university where it was developed.
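The contrast between full and shallow parsing can be illustrated with a minimal sketch. It assumes a toy English sentence, a nested-tuple tree encoding, and generic labels such as NP and VP, none of which come from the PolyU Treebank. The hypothetical function `shallow_chunks` keeps only the innermost local structures of a full tree and discards the attachment decisions above them.

```python
# Toy full parse as nested tuples: (label, child, child, ...).
# A leaf is a (POS tag, word) pair whose second element is a string.
# The sentence, labels, and encoding are illustrative assumptions only.
FULL_TREE = (
    "S",
    ("NP", ("DT", "the"), ("NN", "parser")),
    ("VP",
        ("VBZ", "finds"),
        ("NP", ("JJ", "local"), ("NNS", "structures")),
        ("PP", ("IN", "in"), ("NP", ("NNS", "sentences")))),
)


def is_leaf(node):
    """A leaf pairs a POS tag with a word string."""
    return len(node) == 2 and isinstance(node[1], str)


def shallow_chunks(node):
    """Collect subtrees whose children are all words.

    Everything above these local structures, including where the PP
    attaches, is dropped; that is the analysis shallow parsing ignores.
    """
    if is_leaf(node):
        return []
    children = node[1:]
    if all(is_leaf(child) for child in children):
        return [(node[0], [child[1] for child in children])]
    chunks = []
    for child in children:
        chunks.extend(shallow_chunks(child))
    return chunks


for label, words in shallow_chunks(FULL_TREE):
    print(label, " ".join(words))
```

In this sketch the words "finds" and "in" fall outside every chunk, and the three recovered NPs are exactly sub-trees of the full tree, which mirrors the relationship between shallow and full parsing described above.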
One problem with shallow parsing is that, unlike full parsing, it seeks to identify only certain local structures in a sentence, yet there is at present no widely accepted standard for the scope and depth of these local structures, and reported works differ in what they annotate as local structures [Dalemans 1999; Sun 2001; Li et al. 2003]. In this work, we therefore first discuss the objectives of shallow parsing, based on our own needs and those of some typical NLP research, in order to define the scope of shallow parsing. In accordance with this defined scope, we then construct the PolyU Treebank by manually annotating the shallow syntactic structures in a selected corpus.
The scope and depth of shallow annotation should be determined by the requirements of the applications that use the Treebank. Taking into consideration typical NLP research conducted at the authors’ research institution, such as Chinese collocation extraction, terminology extraction, and the acquisition of descriptions of terminologies, we restrict our scope of examination to shallow syntactic structures no larger than the maximal phrases that play the role of subject, predicate, complement clause, or another syntactic component in a sentence. Within this scope, our focus is to identify the base-phrases, which are the minimum syntactic units in a maximal phrase. We also identify the nested phrases that lie between base-phrases and maximal phrases, which we call mid-phrases. Each identified phrase is given a mandatory syntactic label and an optional semantic label, and its header is also identified. An important feature of our Treebank is that the identified phrases are augmented with semantic information. This kind of information is useful in many areas of NLP research but is difficult to identify automatically, and some of it is not annotated in other existing Treebanks.
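One way to picture this annotation scheme is as a small record type. The sketch below is an assumption made for illustration, not the PolyU Treebank's actual storage format: the class `Phrase`, the functions `level_of` and `annotate_levels`, the English example phrase, and the semantic value "TOPIC" are all hypothetical. A phrase with no nested phrase is treated as a base-phrase, the outermost phrase as a maximal phrase, and anything in between as a mid-phrase.

```python
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass
class Phrase:
    label: str                             # mandatory syntactic label
    children: List[Union["Phrase", str]]   # nested phrases or plain words
    header: str                            # identified header of the phrase
    semantic: Optional[str] = None         # optional semantic label


def level_of(phrase, is_top):
    """Classify a phrase as base, mid, or maximal (sketch only)."""
    has_nested = any(isinstance(c, Phrase) for c in phrase.children)
    if not has_nested:
        # A top-level phrase without nesting is reported as "base" here.
        return "base"
    return "maximal" if is_top else "mid"


def annotate_levels(phrase, is_top=True):
    """Walk a maximal phrase top-down, listing (label, level, header)."""
    rows = [(phrase.label, level_of(phrase, is_top), phrase.header)]
    for child in phrase.children:
        if isinstance(child, Phrase):
            rows.extend(annotate_levels(child, is_top=False))
    return rows


# Hypothetical example: "the rapid development of the national economy".
example = Phrase(
    label="NP",
    header="development",
    semantic="TOPIC",  # invented value, not taken from any real tag set
    children=[
        Phrase("NP", ["the", "rapid", "development"], header="development"),
        Phrase("PP", ["of",
                      Phrase("NP", ["the", "national", "economy"],
                             header="economy")],
               header="of"),
    ],
)

for row in annotate_levels(example):
    print(row)
```

Keeping the semantic label optional and the syntactic label and header required reflects the mandatory and optional fields named in the text; the level is derived from nesting rather than stored, which is only one possible design.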
To guide the syntactic annotation, we use the Phrase-Standard Grammar (PSG) proposed by Peking University [Yu et al. 1998]. There are two reasons for this choice. First, the PSG framework is widely accepted in mainland China. Second, to reduce the cost of annotation and to ensure the maximum sharing of our output, we perform the shallow syntactic annotation on the popular segmented and tagged People’s Daily corpus, whose annotation is itself guided by PSG [Yu et al. 2001].
The construction of the Treebank, which has taken more than 15 months, includes the design of guidelines, the drafting of annotation specifications, the annotation itself, and quality assurance checking. The annotated one-million-word shallow Treebank has an accuracy of over 98.8% on phrase bracketing and over 98% on phrase labeling. Such a large-scale Treebank can be used to support a variety of NLP research. To date, it has been used to train and test a shallow parser that is currently under development [Lu et al. 2003]. Our systems for the extraction of Chinese collocations and terminologies and for their summarization, as well as our information retrieval work, have also benefited from the PolyU Treebank. We are optimizing the Treebank and making it available to other researchers as a public resource.
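This section does not state how the bracketing and labeling accuracies are computed. As a hedged sketch of one plausible reading, the code below treats each phrase as a hypothetical `(start, end, label)` span, counts a bracketing as correct when its boundaries match a checked reference annotation, and measures labeling only over correctly bracketed phrases. The function names and the toy spans are assumptions made for illustration and do not reproduce the reported figures.

```python
def bracketing_accuracy(annotated, reference):
    """Share of annotated phrases whose boundaries match the reference."""
    if not annotated:
        return 0.0
    ref_spans = {(start, end) for start, end, _ in reference}
    hits = sum((start, end) in ref_spans for start, end, _ in annotated)
    return hits / len(annotated)


def labeling_accuracy(annotated, reference):
    """Share of correctly bracketed phrases that also carry the right label."""
    ref_labels = {(start, end): label for start, end, label in reference}
    matched = [(s, e, lab) for s, e, lab in annotated if (s, e) in ref_labels]
    if not matched:
        return 0.0
    correct = sum(ref_labels[(s, e)] == lab for s, e, lab in matched)
    return correct / len(matched)


# Toy spans over token positions (illustrative values only).
reference = [(0, 2, "NP"), (2, 3, "VP"), (3, 5, "NP"), (5, 7, "PP")]
annotated = [(0, 2, "NP"), (2, 3, "VP"), (3, 5, "AP"), (5, 6, "PP")]

print(bracketing_accuracy(annotated, reference))
print(labeling_accuracy(annotated, reference))
```

Separating the two measures means a boundary error is not also counted as a labeling error, which is one reason a treebank can report distinct figures for bracketing and labeling.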