Chinese Computing Lab

PolyU Treebank

 


 

 

 



III. Annotation Guideline

The establishment of the annotation guidelines is the first step in Treebank development. To ensure quality output, the guidelines must follow the design principles and must be clear, unambiguous, easy to understand and easy to follow. The PolyU Treebank guidelines include definitions of (1) syntactical phrase categories, (2) the categories of semantic information, and (3) the different phrase levels, including maximal phrase, mid-phrase and base-phrase. Because the PolyU Treebank is based on a segmented and POS-tagged corpus, the part-of-speech tags in the chosen corpus are used (with only minor modifications for annotation consistency). Appendix 1 provides a complete list and explanations of the POS tags. These tags will be used throughout the examples in this paper.

The symbols ‘[’ and ‘]’ are used to indicate the left and right boundaries of phrases. The right bracket is followed by labels in the form [Phrase]SS-FF, where SS is a mandatory syntactic label such as NP (noun phrase) or AP (adjective phrase), and FF is an optional label indicating internal semantic information such as BL (parallel). For example, a noun phrase with parallel components will be annotated as [荣誉/n 与/c 尊严/n]NP-BL (honor and dignity).
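The bracket notation above is regular enough to be read mechanically. The following is a minimal sketch, not part of the PolyU tooling, of how a single flat (non-nested) phrase could be split into its word/POS tokens, its SS label and its optional FF label. The names PHRASE_RE and parse_flat_phrase are hypothetical, and the sketch assumes that labels consist of uppercase letters only.

```python
import re

# Pattern for one flat phrase: [tokens]SS or [tokens]SS-FF.
# SS is the mandatory syntactic label, FF the optional semantic label.
# Assumption: both labels are written in uppercase ASCII letters.
PHRASE_RE = re.compile(r"\[([^\[\]]+)\]([A-Z]+)(?:-([A-Z]+))?")


def parse_flat_phrase(text):
    """Return (tokens, syntactic_label, semantic_label) for one flat phrase."""
    match = PHRASE_RE.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"not a flat bracketed phrase: {text!r}")
    body, syntactic, semantic = match.groups()
    # Each token has the form word/POS; split on the last slash only.
    tokens = [tuple(token.rsplit("/", 1)) for token in body.split()]
    return tokens, syntactic, semantic


tokens, ss, ff = parse_flat_phrase("[荣誉/n 与/c 尊严/n]NP-BL")
print(tokens, ss, ff)
```

When the FF label is absent, as in [早上/t 8时/t]TP, the third element of the result is None.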

 

3-1 Defining the syntactical phrase categories

The first level of information for describing phrases is that of the syntactical phrase category. With reference to the works of Penn Chinese Treebank and Sinica Treebank, the guideline defines a total of eight syntactical phrase categories:

NP — Noun phrase. An NP is headed by a noun, and the header is normally the last noun in the phrase. e.g. [市场/n 经济/n#]NP (market economy)

TP — Time phrase. A TP consists of continuous time words and is used to indicate a time. e.g. [早上/t 8时/t]TP (8:00 in the morning)

FP — Position phrase. An FP is headed by a position word, f, and is used to indicate position information. e.g. [内蒙古/ns 东北部/f#]FP (north-east of Inner Mongolia)

VP — Verb phrase. A VP is a phrase headed by a predicate and containing no subject. e.g. [顺利/a 启动/v#]VP-ZZ (successfully start), and [分析/v# 问题/n]VP-SBI (analyze the problem)

AP — Adjective phrase. The header of an AP is an adjective and the whole phrase acts as an adjective in the sentence. e.g. [公正/a 合理/a#]AP (fair and reasonable)

DP — Adverb phrase. The header of a DP is an adverb, and the whole phrase plays the adverbial role in a sentence. e.g. [已/d 不再/d#]DP (no longer)

PP — Preposition phrase. A PP is a phrase that begins with a preposition. e.g. [在/p 贵州/ns 农村/n]PP (in the countryside of Guizhou Province)

QP — Quantifier phrase. A QP consists of a number and a quantifier. The quantifier acts as the header. Normally, a QP is used as the modifier of an NP or a VP. e.g. [数千/m 名/q#]QP 士兵/n (several thousand soldiers)

SV — Subject-verb phrase. An SV consists of a noun or an NP as subject, and a verb or VP as predicate. e.g. [[规模/n 收益/n]NP 递增/v]SV (scale income increases)

IC — Embedded clause. An IC is used to mark the boundaries of an embedded clause in the sentence. e.g. [如何/r 多/a 方面/n 开辟/v 就业/vn 渠道/n]IC 是/v (how to provide more job opportunities is)

 

3-2 Defining semantic information categories

The PolyU Treebank is unique in being annotated with semantic labels. The annotation of the FF labels is not mandatory. Only those phrases with pre-defined semantic phrase categories are labeled.

Semantic information is very useful for some language applications. For example, 山东/ns 烟台/ns 市/n (Yantai City, Shandong Province) and 烟台/ns 大学/n (Yantai University) are both noun phrases, but the first is the name of a place and the second that of an organization. The semantic information labels NS (name of a place) and NT (name of an organization) allow these two NPs to be distinguished, which is highly useful in named entity extraction and in automatic summarization. The additional semantic labels can be considered a natural byproduct of manual annotation: annotators need to go through the mental process of identifying this information anyway, so we simply make it available so that the knowledge used during annotation is not wasted.

The semantic categories are listed below, grouped by phrase type.

Semantic information categories for Noun Phrases

NT — Name of an organization. e.g. [烟台/ns 大学/n]NP-NT (Yantai University)

NS — Name of a place. e.g. [江苏省/ns 铜山县/ns]NP-NS (Tongshan County, Jiangsu Province)

NR — Name of a person. e.g. [胡/nr 锦涛/nr]NP-NR (Hu Jintao)

NZ — Other proper noun phrase. e.g. [诺贝尔/nr 奖/n]NP-NZ (the Nobel Prize)

BL — Juxtaposition structure. A BL label indicates that the phrase is made up of two or more parallel components. e.g. [中国/ns 与/c 南非/ns]NP-BL (China and South Africa)

FZ — Appositive. An NP with the FZ label normally has two equivalent components. e.g. [[国家/n 主席/n]NP [江/nr 泽民/nr]NP-NR]NP-FZ (the president of China, Jiang Zemin)

PZ — Noun modifier. A PZ is the default semantic structure of an NP. e.g. [美丽/a 的/u 花/n#]NP-PZ (beautiful flower)

FS — Noun plurals. An FS indicates that the last word in a noun phrase is a suffix for noun plurals. e.g. [朋友/j# 们/k]NP-FS (friends)

DE — A DE construction is a special kind of NP structure in Chinese. It ends with ‘的’ (DE) and indicates the absence of the complement. e.g. 比/v [原先/d 预料/v 的/u]NP-DE 低/a (lower than originally expected)

SU — An SU construction is a special kind of NP structure in Chinese. The typical pattern is 所 (SUO)+VP+NP. e.g. [所/u 画/v 禽鸟/n#]NP-SU (the birds painted by)

 

Semantic information categories for Verb Phrases

SBI — Predicate and its object. A VP with the label SBI consists of a predicate and an object. e.g. [打/v# 篮球/n]VP-SBI 是/v 我/r 的/u 爱好/n (playing basketball is my hobby)

SBU — Complement. The label SBU indicates that the second part of the VP is a complement modifying the first part. e.g. [医治/v# 无效/v]VP-SBU (ineffectively treat)

ZZ — When a VP has the label ZZ, the verb is the header and other words are its modifiers. e.g. [[有效/ad 打击/v#]VP-ZZ 了/u 敌人/n]VP-SBI (effectively strike the enemy)

SD — Serial verb constructions. An SD indicates serial actions in a VP, where the last action is the main action. e.g. [[审核/v 发放/v]VP-SD 护照/n]VP-SBI (verify and issue the passport)

BA — A BA construction is a special kind of VP structure in Chinese. The typical pattern is 把 (BA)+NP1+VP. e.g. [把/p [扶贫/vn 开发/vn 工作/vn]NP-PZ 作为/v#]VP-BA (place the work of poverty reduction and social development as)

BEI — A BEI construction is a special kind of VP structure in Chinese. The typical patterns are 被 (BEI)+NP+VP and NP+被+VP. e.g. 商店/n [被/p [责令/v# 停业/vn]VP-SBI]VP-BEI (the shop was ordered to close)

 

Semantic information categories for Time Phrases

PO — A point-of-time indicator. The label PO indicates the TP carries point-of-time information. e.g. [7月/t 1日/t]TP-PO (July 1)

DU — A period-of-time indicator. A DU indicates a period of time. e.g. [今后/t 3/m 年/q]TP-DU (the next three years)

 

Semantic information categories for Prepositional Phrases

YY — Causation information. A YY label is used only to modify a PP to indicate that the PP carries causation information. e.g. [因/p 饿/a]PP-YY 死亡/v (die of hunger)

DX — Object information. The label DX is used to modify a PP to indicate object information. e.g. [向/p [受灾/vn 地区/n]NP]PP-DX (to the disaster area)

DD — Place information. It is the place indicator of a PP. e.g. [在/p 深圳/ns]PP-DD (in Shenzhen)

FS — Method information. A PP with an FS label signals method information. e.g. [通过/p [股票/n 上市/v]S]PP-FS (through the listing of stocks)

MD — Motivation information. A PP with an MD label signals motivation information. e.g. [为/p 动武/v]PP-MD [找/v 借口/n]VP-SBI (looking for an excuse for war)

GJ — Tool information. A GJ label indicates that a PP carries tool information. e.g. [用/p 公车/n]PP-GJ (using a public bus)

SJ — Time information. An SJ label indicates that a PP carries time information. e.g. [到/v 目前/t 为止/v]PP-SJ (up to now)

 

Other semantic information categories

DL — Motion quantifier. A phrase with a DL label quantifies the motion. e.g. [十/m 下/v 江南/ns]VP-DL (visit Jiangnan 10 times)

ML — Noun quantifier. ML is the default function of a QP. It indicates that the QP is the modifier of the following noun or NP. e.g. [五/m 个/q]QP-ML 苹果/n (five apples)

SL — Time quantifier. An SL label indicates that a QP carries time information. e.g. [30/m 年/q]QP-SL 里/f (within 30 years)
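The categories listed in 3-1 and 3-2 can be collected into a small lookup table for checking that a label such as NP-BL combines a known SS with an FF listed for it. This is only a sketch: the grouping of FF labels under each SS label is an assumption read off the headings and examples in this section, and the names SYNTACTIC, ALLOWED_FF and is_known_label are hypothetical.

```python
# Syntactic (SS) labels listed in section 3-1.
SYNTACTIC = {"NP", "TP", "FP", "VP", "AP", "DP", "PP", "QP", "SV", "IC"}

# Semantic (FF) labels grouped by the SS label they accompany in section 3-2.
# Assumption: DL goes with VP and ML/SL with QP, as in the examples given.
ALLOWED_FF = {
    "NP": {"NT", "NS", "NR", "NZ", "BL", "FZ", "PZ", "FS", "DE", "SU"},
    "VP": {"SBI", "SBU", "ZZ", "SD", "BA", "BEI", "DL"},
    "TP": {"PO", "DU"},
    "PP": {"YY", "DX", "DD", "FS", "MD", "GJ", "SJ"},
    "QP": {"ML", "SL"},
}


def is_known_label(label):
    """Check an 'SS' or 'SS-FF' label against the tables above."""
    ss, _, ff = label.partition("-")
    if ss not in SYNTACTIC:
        return False
    if not ff:
        return True  # FF is optional
    return ff in ALLOWED_FF.get(ss, set())


print(is_known_label("NP-BL"), is_known_label("TP-NS"))
```

Note that FS appears under both NP (noun plurals) and PP (method information), so an FF label can only be interpreted together with its SS label.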

 

3-3 Phrase bracketing

Phrases in the PolyU Treebank are classified into three levels: maximal phrase, mid-phrase and base-phrase. The syntactical analysis and annotation of the PolyU Treebank begins with the identification of maximal phrases which define the scope of examination for bracketing.

A maximal phrase is a phrase that plays the part of a distinct syntactic component of a sentence and is realized by the maximum span it can cover without overlapping other components. Maximal phrases form the backbone of a sentence. The identification of maximal phrases is one of the most difficult steps in the whole project, because annotators have to analyze the sentence syntactically and understand its syntactic components even though these components will not be labeled. The objective of identifying maximal phrases is to separate a sentence into several syntactic components for examination. After maximal phrases are identified, the base-phrases can then be identified within the scope of examination, that is, within each maximal phrase.

A base-phrase is defined as a minimum non-nesting phrase with a stable internal structure and an independent semantic role. Normally, a base-phrase has a lexical word as its headword. Essentially, a base-phrase must consist of continuous words and contain no nesting components. It never overlaps with other phrases and must be contained within a maximal phrase. Base-phrases normally conform to a number of typical patterns, such as [a+n]->NP and [a+a]->AP.
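The typical patterns mentioned above can be thought of as a mapping from a POS sequence to a phrase label. The sketch below contains only the two patterns quoted in the text; a real pattern table would be longer and is not given here. The names BASE_PATTERNS and guess_base_phrase are hypothetical, and the trailing '#' is assumed to be the headword mark seen in the examples.

```python
# POS sequence -> phrase label. Only the two patterns named in the text.
BASE_PATTERNS = {
    ("a", "n"): "NP",
    ("a", "a"): "AP",
}


def guess_base_phrase(tagged_words):
    """Map a list of 'word/POS' strings to a label, or None if no pattern fits."""
    # Take the POS after the last slash and drop a trailing '#' head mark.
    pos_sequence = tuple(
        word.rsplit("/", 1)[1].rstrip("#") for word in tagged_words
    )
    return BASE_PATTERNS.get(pos_sequence)


print(guess_base_phrase(["公正/a", "合理/a#"]))
```

A lookup of this kind could at most suggest candidates; the guidelines still require an annotator to judge the internal structure and semantic role.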

A mid-phrase is a nested phrase within a maximal phrase which has a base-phrase as its header. A mid-phrase may contain more than one base-phrase, but only one will be its header. A mid-phrase may have nested components but none of them may overlap.

Furthermore, the headword of each phrase will be annotated.
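Nested annotations can be read into a tree, which makes the structural part of the phrase levels easy to inspect. The following is a rough sketch under two assumptions: that a trailing '#' on a token marks the headword, as the examples in this section suggest, and that "no nested bracket" is used as a stand-in for the structural condition on base-phrases (the semantic conditions are not checked). The names parse_tree, is_base_phrase and headword are hypothetical.

```python
import re

# Three token kinds: '[', a closing ']' with its label, or a word/POS token.
TOKEN_RE = re.compile(r"\[|\]([A-Z]+(?:-[A-Z]+)?)|([^\s\[\]]+)")


def parse_tree(text):
    """Parse bracketed text into nested dicts; unbracketed words stay strings."""
    root = {"label": None, "children": []}
    stack = [root]
    for match in TOKEN_RE.finditer(text):
        if match.group(0) == "[":
            node = {"label": None, "children": []}
            stack[-1]["children"].append(node)
            stack.append(node)
        elif match.group(1) is not None:
            if len(stack) == 1:
                raise ValueError("unmatched ']'")
            stack.pop()["label"] = match.group(1)
        else:
            stack[-1]["children"].append(match.group(2))
    if len(stack) != 1:
        raise ValueError("unmatched '['")
    return root["children"]


def is_base_phrase(node):
    """True if the phrase contains no nested bracketed phrase."""
    return all(isinstance(child, str) for child in node["children"])


def headword(node):
    """Return the directly contained word marked with '#', if any."""
    for child in node["children"]:
        if isinstance(child, str) and child.endswith("#"):
            return child.rstrip("#")
    return None


outer = parse_tree("[[有效/ad 打击/v#]VP-ZZ 了/u 敌人/n]VP-SBI")[0]
inner = outer["children"][0]
print(outer["label"], inner["label"], headword(inner))
```

In this example the inner VP-ZZ has no nested phrase and carries the marked headword, while the outer VP-SBI contains it as a nested component.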

 


 

Last modified on Thu, 11 May 2006 11:54:22 +0800
THE HONG KONG POLYTECHNIC UNIVERSITY