Chinese Computing Lab

PolyU Treebank

 




IV. Implementation of the PolyU Treebank

4-1 Corpus data preparation

The People’s Daily corpus, developed by Peking University, consists of more than 13,000 articles totaling five million words. Since the PolyU Treebank requires only one million words, we carried out a data selection process. To avoid duplicating short-lived events and topics, our selection treats each day’s news as a single unit: from the six months of data in the entire collection, we picked six random days in each month as the raw Treebank data.
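The day-level sampling described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `select_days`, the fixed random seed, and the placeholder date range are hypothetical and are not taken from the project.

```python
import random
from collections import defaultdict
from datetime import date, timedelta


def select_days(available_days, days_per_month=6, seed=0):
    """Pick `days_per_month` random days from each calendar month.

    Each day is treated as one indivisible unit, so all of the news
    from a selected day would enter the raw data together.
    """
    rng = random.Random(seed)  # fixed seed (assumption) for repeatability
    by_month = defaultdict(list)
    for day in available_days:
        by_month[(day.year, day.month)].append(day)

    selected = []
    for month in sorted(by_month):
        days = sorted(by_month[month])
        k = min(days_per_month, len(days))  # tolerate short months
        selected.extend(sorted(rng.sample(days, k)))
    return selected


# Placeholder calendar: 181 consecutive days covering six whole months.
start = date(1998, 1, 1)
all_days = [start + timedelta(days=n) for n in range(181)]
chosen = select_days(all_days)
print(len(chosen))  # 6 days x 6 months = 36
```

Sampling whole days rather than individual articles keeps each day's coverage intact while still spreading the selection across the six months.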

4-2 Word Segmentation and Part-of-Speech Tagging

The word segmentation and POS tagging of the People’s Daily corpus are guided by the PSG grammar and “The Grammatical Knowledge-base of Contemporary Chinese” [Yu et al. 1998]. The specification lists a total of 43 POS tags. Peking University reports that the accuracy of word segmentation and POS tagging is higher than 99.9% and 99.5%, respectively [Yu et al. 2001].

In this project, we use the PKU POS tagging results directly, making only some notational changes. These changes ensure consistent labeling in our system, where lowercase is used for word-level tags and uppercase for phrase-level labels.
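As a rough illustration of the case convention, the sketch below splits a `word/tag` token and checks whether a label is word-level (lowercase) or phrase-level (uppercase). The helper names are hypothetical, and the token format is inferred from the examples on this page rather than from a published specification.

```python
def parse_token(token):
    """Split a 'word/tag' token such as '中国/ns' into (word, tag)."""
    word, sep, tag = token.rpartition("/")
    if not sep or not word or not tag:
        raise ValueError(f"not a word/tag token: {token!r}")
    return word, tag


def is_word_level_tag(tag):
    """Word-level (POS) tags are written in lowercase, e.g. 'ns', 'vn'."""
    return tag.islower()


def is_phrase_label(label):
    """Phrase-level labels are written in uppercase, e.g. 'NP', 'NP-PZ'."""
    return label.isupper()


sentence = "中国/ns 旅游年/n 是/v 一/m 次/q 国家级/b 的/u 宣传/vn 促销/vn 活动/vn"
pairs = [parse_token(t) for t in sentence.split()]
print(all(is_word_level_tag(tag) for _, tag in pairs))  # True
```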

4-3 Phrase Bracketing and Annotation

Identification of Maximal-phrase:

A maximal phrase contains at least one base-phrase and has a syntactic role in the sentence. In the following example sentence,
   中国/ns 旅游年/n 是/v 一/m 次/q 国家级/b 的/u 宣传/vn 促销/vn 活动/vn (e.g.1)
   China Tourism Year is a national-level promotion and marketing activity

This sentence has an S-V-O structure: 中国/ns 旅游年/n is the subject, 是/v is the predicate, and 一/m 次/q 国家级/b 的/u 宣传/vn 促销/vn 活动/vn is the object. There are thus three syntactic components in this sentence, of which two are annotated as separate maximal phrases: [中国/ns 旅游年/n]NP (China Tourism Year) and [一/m 次/q 国家级/b 的/u 宣传/vn 促销/vn 活动/vn]NP (a national-level promotion and marketing activity). Note that 是/v is also considered a maximal phrase because it acts as the predicate. However, since it consists of only one lexical word and is structurally unambiguous, it is not bracketed by default. Admittedly, 是/v and 一/m 次/q 国家级/b 的/u 宣传/vn 促销/vn 活动/vn could be combined into a VP, but we regard this kind of bracketing as more relevant to showing how phrases are used to construct a sentence. That is, it would take us into the realm of full parsing, which is not our objective here. We therefore bracket them as separate phrases. The resulting maximal-phrase annotation is:
   [中国/ns 旅游年/n]NP 是/v [一/m 次/q 国家级/b 的/u 宣传/vn 促销/vn 活动/vn]NP-PZ

Let us see another example,
   富裕/v 起来/v 的/u 当地/a 农民/n 自发/d 地/u 组织/v 了/u 多个/a 业余/a 乐团/n (e.g.2)
   (The rich farmers took the initiative to organize several amateur bands)

We may separate this sentence into three components: 富裕/v 起来/v 的/u 当地/a 农民/n is the subject, 自发/d 地/u 组织/v 了/u acts as the predicate, and 多个/a 业余/a 乐团/n is the object. Thus, this sentence is annotated with three maximal phrases, bracketed and labeled as [富裕/v 起来/v 的/u 当地/a 农民/n#]NP [自发/d 地/u 组织/v# 了/u]VP-ZZ [多个/a 业余/a 乐团/n]NP-PZ.

Most syntactic labels can be used for maximal phrases, except AP (adjective phrase), DP (adverb phrase) and QP (quantifier phrase). Meanwhile, NP-NT, NT-NS and NP-NZ may only be used to label maximal phrases. These kinds of phrases do not normally contain nested components or headwords.

Base-phrases Identification:

Base-phrases are identified only within an already-identified maximal phrase, either nested inside it or overlapping with it. Normally a base-phrase contains two to four words, with one lexical word as its headword.

Taking the maximal phrase [一/m 次/q 国家级/b 的/u 宣传/vn 促销/vn 活动/vn]NP-PZ in e.g.1 as an example, [一/m 次/q]QP (a) and [宣传/vn 促销/vn 活动/vn#]NP-PZ (promotion and marketing activity) are base-phrases in this maximal phrase. Thus, the sentence is annotated as [中国/ns 旅游年/n]NP 是/v [[一/m 次/q]QP 国家级/b 的/u [宣传/vn 促销/vn 活动/vn]NP-PZ]NP-PZ.

As it happens, [中国/ns 旅游年/n]NP and 是/v are also base-phrases but, because they overlap with maximal phrases, they are not further bracketed. Our annotation principle here is that if a base-phrase overlaps with a maximal phrase, it will not be bracketed twice.
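To make the bracketed notation concrete, here is a minimal sketch of a reader that turns an annotated sentence into a nested structure. The function names, the dictionary layout, and the regular expression are illustrative assumptions based on the examples on this page, not the project's actual tooling.

```python
import re

# "[" opens a phrase, "]LABEL" closes it, anything else is a word/tag token.
TOKEN_RE = re.compile(
    r"(?P<open>\[)|(?P<close>\][A-Z]*(?:-[A-Z]+)*)|(?P<word>[^\s\[\]]+)"
)


def parse_annotation(text):
    """Parse a bracketed annotation into nested dicts and token strings."""
    root = {"label": "ROOT", "children": []}
    stack = [root]
    for match in TOKEN_RE.finditer(text):
        kind = match.lastgroup
        if kind == "open":
            node = {"label": None, "children": []}
            stack[-1]["children"].append(node)
            stack.append(node)
        elif kind == "close":
            if len(stack) == 1:
                raise ValueError("closing bracket without an opening one")
            stack.pop()["label"] = match.group()[1:] or None
        else:
            stack[-1]["children"].append(match.group())
    if len(stack) != 1:
        raise ValueError("unclosed bracket")
    return root["children"]


def phrase_labels(items):
    """List phrase labels in the order their opening brackets appear."""
    labels = []
    for item in items:
        if isinstance(item, dict):
            labels.append(item["label"])
            labels.extend(phrase_labels(item["children"]))
    return labels


eg1 = ("[中国/ns 旅游年/n]NP 是/v [[一/m 次/q]QP 国家级/b 的/u "
       "[宣传/vn 促销/vn 活动/vn]NP-PZ]NP-PZ")
tree = parse_annotation(eg1)
print(phrase_labels(tree))  # ['NP', 'NP-PZ', 'QP', 'NP-PZ']
```

The unbracketed 是/v appears as a plain token at the top level, which mirrors the rule that a single-word phrase is not bracketed.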

It should be pointed out that the identification of base-phrases is the most fundamental and most important goal of Treebank annotation. The identification of maximal phrases can be thought of as parsing a clause top-down. The identification of base-phrases, by contrast, is a bottom-up process whose object is to identify the most basic units within a maximal phrase.

Mid-Phrase Identification:

Because other syntactic structures may sometimes exist between base-phrases and maximal phrases, it is useful to identify one more level of syntactic structure within a maximal phrase: the mid-phrase. This step begins with the examination of the base-phrases. Thus, e.g.1 is further annotated as [中国/ns 旅游年/n]NP 是/v [[一/m 次/q]QP [国家级/b 的/u [宣传/vn 促销/vn 活动/vn]NP-PZ]NP-PZ]NP-PZ, where the additional annotation (underlined on the original page) is the mid-phrase bracket [国家级/b 的/u [宣传/vn 促销/vn 活动/vn]NP-PZ]NP-PZ.

As we have limited our nesting to three levels, any further nested phrases are ignored. The following sentence shows the result of our annotation with three levels of nesting:
   [目前/t [企业/n 发展/vn]NP [值得/v 注意/v 的/u [[几/m 个/q]QP 问题/n]NP-PZ]NP]NP
   (Several issues which are worthy of consideration in the development of current enterprise)

A full annotation would identify four levels of nesting, as shown below. Our system does not add the extra level of bracketing (underlined on the original page; here, the bracket that groups [企业/n 发展/vn]NP with the NP that follows it), as it is beyond our limit of three levels.
   [目前/t [[企业/n 发展/vn]NP [值得/v 注意/v 的/u [[几/m 个/q]QP 问题/n]NP-PZ]NP]NP]NP
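The nesting limit can be checked mechanically. The sketch below counts bracket depth and reports the number of levels nested inside the outermost phrase. Reading "levels of nesting" as bracket depth minus one is an assumption, chosen because it gives three and four for the two annotations above; the function names are hypothetical.

```python
def max_bracket_depth(annotation):
    """Return the deepest level of '[' ... ']' nesting in the string."""
    depth = 0
    deepest = 0
    for ch in annotation:
        if ch == "[":
            depth += 1
            deepest = max(deepest, depth)
        elif ch == "]":
            depth -= 1
            if depth < 0:
                raise ValueError("closing bracket without an opening one")
    if depth != 0:
        raise ValueError("unclosed bracket")
    return deepest


def nesting_levels(annotation):
    """Levels nested inside the outermost phrase (assumed definition)."""
    return max(max_bracket_depth(annotation) - 1, 0)


limited = ("[目前/t [企业/n 发展/vn]NP [值得/v 注意/v 的/u "
           "[[几/m 个/q]QP 问题/n]NP-PZ]NP]NP")
full = ("[目前/t [[企业/n 发展/vn]NP [值得/v 注意/v 的/u "
        "[[几/m 个/q]QP 问题/n]NP-PZ]NP]NP]NP")
print(nesting_levels(limited), nesting_levels(full))  # 3 4
```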

Annotation of Headword

In our system, a ‘#’ tag is appended to a word to indicate that it is a headword. A headword must be a lexical word (sometimes also called a content word) rather than a function word. In most cases, a headword occupies a fixed position in a base-phrase. For example, the headword of a noun phrase is normally the last noun in the phrase. This is considered the default position, which needs no explicit annotation. Consider the clause [美国/ns 科学家/n]NP [绘制/v 出/v]VP-SBU (The American scientists drafted). Here [绘制/v 出/v] (drafted) is a verb phrase whose headword, 绘制/v, is not in the default position for a verb phrase headword. Thus, the clause is further annotated as [美国/ns 科学家/n]NP [绘制/v# 出/v]VP-SBU. Note that 科学家/n is also a headword, in [美国/ns 科学家/n] (The American scientists), but since it is in the default position (for the noun phrase NP, the default grammatical structure is that the last noun in the phrase is the headword and the other components are modifiers taking the PZ label), no explicit annotation is needed.
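The headword convention can be sketched as a small lookup: an explicit ‘#’ wins, and otherwise a noun phrase falls back to its last noun. Two points are assumptions made for illustration only: a "noun" is taken to be any tag starting with `n`, and no default is applied to other phrase types because the text above states the default only for noun phrases. The function name `find_headword` is hypothetical.

```python
def find_headword(tokens, label):
    """Return the headword of a flat phrase, or None if it cannot be told.

    tokens: 'word/tag' strings, with '#' appended to an explicit headword.
    label:  the phrase label, e.g. 'NP', 'NP-PZ' or 'VP-SBU'.
    """
    parsed = [token.rpartition("/")[::2] for token in tokens]  # (word, tag)

    # 1. An explicit '#' marker always identifies the headword.
    for word, tag in parsed:
        if tag.endswith("#"):
            return word

    # 2. Default for noun phrases: the last noun in the phrase.
    #    Treating tags that start with 'n' as nouns is an assumption.
    if label.split("-")[0] == "NP":
        for word, tag in reversed(parsed):
            if tag.startswith("n"):
                return word

    # 3. Defaults for other phrase types are not modeled in this sketch.
    return None


print(find_headword(["绘制/v#", "出/v"], "VP-SBU"))  # 绘制
print(find_headword(["美国/ns", "科学家/n"], "NP"))  # 科学家
```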

 


 

Last modified on Thu, 11 May 2006 11:54:22 +0800
THE HONG KONG POLYTECHNIC UNIVERSITY