Chinese Computing Lab

PolyU Treebank

 


VI. Current Progress and Future Work

As mentioned earlier, we are now in Stage 5 of the annotation. The resulting annotation covers 2,639 articles selected from the PKU People’s Daily corpus. These articles contain 1,022,761 segmented Chinese words, an average of around 387 words per article. There are 286,057 bracketed phrases in total, including nested phrases. A summary of the SS labels used is given in Table 1.

Label  NP      TP    FP    SV    VP     PP     DP   AP     QP     IC
Count  143063  5623  2652  5677  81648  27055  197  19580  11049  4198

Table 1. Statistics of annotated syntactical phrases
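The arithmetic behind these figures can be checked with a short sketch. The numbers below are copied from the text and from Table 1; the variable and function names are illustrative and not part of any PolyU Treebank tool.

```python
# Figures copied from the text and Table 1; all names here are illustrative.
ARTICLES = 2639
WORDS = 1022761

PHRASE_COUNTS = {
    "NP": 143063, "TP": 5623, "FP": 2652, "SV": 5677, "VP": 81648,
    "PP": 27055, "DP": 197, "AP": 19580, "QP": 11049, "IC": 4198,
}


def average_words_per_article(words, articles):
    """Mean number of segmented words per article."""
    return words / articles


def label_shares(counts):
    """Share of each label relative to the sum of the listed counts."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


if __name__ == "__main__":
    avg = average_words_per_article(WORDS, ARTICLES)
    print(f"words per article: {avg:.1f}")
    shares = label_shares(PHRASE_COUNTS)
    for label, share in sorted(shares.items(), key=lambda kv: -kv[1]):
        print(f"{label:<3} {share:6.1%}")
```

The shares are computed against the sum of the ten counts as listed, which is 300,742. That differs from the 286,057 bracketed phrases quoted in the text, and the page does not explain the difference. On the table's own total, NP and VP together account for roughly three quarters of the labeled phrases.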

Each phrase category has a default functional pattern. For a noun phrase (NP), for example, the default grammatical structure is that the last noun in the phrase is the headword and the other components are its modifiers, which corresponds to the PZ tag. The FF labels of a bracketed phrase are labeled explicitly only when the phrase does not fit the default pattern for its category. The statistics of the annotated FF tags are listed in Table 2.

Tag    BL     FZ    PZ  ZZ   SBI    SBU   SD    PO  DU   FJ   JY
Count  14264  6279  0   682  16290  9387  5246  0   939  276  2789

Tag    DL  ML   SL   YY   DX     DD   FS    MD    GJ   SJ    OT
Count  27  134  415  528  10645  669  1314  1124  106  3016  0

Tag    NT    NS    NR  NZ   DE    SU   XD   BA    BEI
Count  7813  1516  0   335  1248  124  194  1041  385

Table 2. Statistics of function and structure tags
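The default noun-phrase pattern described before Table 2 can be illustrated with a minimal sketch. It models only the single rule stated in the text (the last noun is the headword, everything else is a modifier) and makes two assumptions that do not come from this page: tokens are (word, POS) pairs, and noun tags begin with "n". The function names are hypothetical and the actual annotation guidelines are likely to involve further conditions.

```python
from typing import List, Optional, Tuple

Token = Tuple[str, str]  # (word, POS tag); noun tags are ASSUMED to start with "n"


def default_np_head(tokens: List[Token]) -> Optional[int]:
    """Return the index of the last noun in the phrase, or None if there is none.

    Under the default pattern this token is the headword and every other
    token is treated as a modifier (the PZ relation), so nothing needs to
    be written out explicitly.
    """
    for i in range(len(tokens) - 1, -1, -1):
        if tokens[i][1].startswith("n"):
            return i
    return None


def needs_explicit_ff(tokens: List[Token], annotated_head: int) -> bool:
    """True when the annotator's head differs from the default head.

    Hypothetical simplification: only the head position is compared.
    """
    return default_np_head(tokens) != annotated_head


if __name__ == "__main__":
    phrase = [("中国", "ns"), ("经济", "n")]
    print(default_np_head(phrase))        # index of the default headword
    print(needs_explicit_ff(phrase, 1))   # head matches the default
    print(needs_explicit_ff(phrase, 0))   # head differs, label explicitly
```

A phrase with no noun at all returns None from default_np_head, so under this sketch it would always require explicit labels.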

For the material annotated in duplicate by multiple annotators, the evaluation program reports a phrase annotation accuracy above 99.5% and a consistency between annotators above 99.8%. For the other annotated material, the quality evaluation program gives a preliminary phrase annotation accuracy above 98%. Further checking and evaluation are ongoing to ensure that the final overall accuracy reaches 99%.
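This page does not say how the evaluation program computes accuracy or consistency. As a rough sketch under that caveat, one common approach treats each annotation as a set of labeled brackets (start, end, label), measures consistency between two annotators as the overlap of their sets, and measures accuracy as the fraction of brackets that also appear in a checked reference. The representation and both function names are assumptions made for illustration.

```python
from typing import Set, Tuple

Bracket = Tuple[int, int, str]  # (start word index, end word index, phrase label)


def bracket_agreement(a: Set[Bracket], b: Set[Bracket]) -> float:
    """Dice overlap between two annotators' bracket sets (1.0 = identical)."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


def bracket_accuracy(annotated: Set[Bracket], reference: Set[Bracket]) -> float:
    """Fraction of annotated brackets that also occur in the reference."""
    if not annotated:
        return 0.0
    return len(annotated & reference) / len(annotated)


if __name__ == "__main__":
    annotator_1 = {(0, 2, "NP"), (2, 5, "VP"), (0, 5, "SV")}
    annotator_2 = {(0, 2, "NP"), (2, 5, "VP"), (3, 5, "NP")}
    print(f"agreement: {bracket_agreement(annotator_1, annotator_2):.3f}")
    print(f"accuracy:  {bracket_accuracy(annotator_1, annotator_2):.3f}")
```

Because a bracket counts as matching only if its span and label are both identical, this measure is strict: a phrase with the right span but the wrong label counts as a disagreement.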

We also intend to develop our tools further, improving the automatic annotation analysis and evaluation program so that it can detect potential annotation errors and inconsistencies. Visualization tools are also being developed to support keyword searching, context indexing, and annotation case searching. Once Stage 5 is complete, we intend to make the PolyU Treebank data publicly available. In addition, we are developing a shallow parser that uses the PolyU Treebank as training and testing data.

 


 

Last modified on Thu, 11 May 2006 11:54:24 +0800
THE HONG KONG POLYTECHNIC UNIVERSITY