VI. Current Progress and Future Work
As mentioned earlier, we are now in Stage 5 of the annotation. The resulting annotation contains 2,639 articles selected from the PKU People’s Daily corpus. These articles contain 1,022,761 segmented Chinese words, an average of around 387 words per article. There are a total of 286,057 bracketed phrases, including nested phrases. A summary of the different SS labels used is given in Table 1.
NP | TP | FP | SV | VP | PP | DP | AP | QP | IC |
143063 | 5623 | 2652 | 5677 | 81648 | 27055 | 197 | 19580 | 11049 | 4198 |
Table 1. Statistics of annotated syntactical phrases
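To make the relative weight of each phrase type easier to see, the counts in Table 1 can be turned into shares. The following is only a minimal sketch: the names `SS_COUNTS` and `label_shares` are hypothetical, and the shares are taken relative to the sum of the table entries, which is not assumed to equal the bracketed-phrase total quoted in the text.

```python
# Counts copied from Table 1 (SS label -> number of annotated phrases).
SS_COUNTS = {
    "NP": 143063, "TP": 5623, "FP": 2652, "SV": 5677, "VP": 81648,
    "PP": 27055, "DP": 197, "AP": 19580, "QP": 11049, "IC": 4198,
}


def label_shares(counts):
    """Return (label, count, share) tuples sorted by descending count.

    The share is computed against the sum of the given counts.
    """
    total = sum(counts.values())
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return [(label, n, n / total) for label, n in ranked]


if __name__ == "__main__":
    for label, n, share in label_shares(SS_COUNTS):
        print(f"{label}\t{n}\t{share:.1%}")
```

Sorting by count puts the dominant phrase types first, which is how such a distribution is usually read.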
For each bracketed phrase, the FF labels are annotated explicitly only if the phrase does not fit the default pattern for its phrase type. For a noun phrase (NP), for example, the default grammatical structure takes the last noun in the phrase as the headword and the other components as modifiers, using PZ tags. The statistics of the annotated FF tags are listed in Table 2.
BL | FZ | PZ | ZZ | SD | PO | DU | FJ | JY |
14264 | 6279 | 0 | 682 | 16290 | 9387 | 5246 | 0 | 939 | 276 | 2789 |
DL | ML | SL | YY | DX | DD | FS | MD | GJ | SJ | OT |
27 | 134 | 415 | 528 | 10645 | 669 | 1314 | 1124 | 106 | 3016 | 0 |
NT | NS | NR | NZ | DE | SU | XD | BA |
7813 | 1516 | 0 | 335 | 1248 | 124 | 194 | 1041 | 385 |
Table 2. Statistics of function and structure tags
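The default-pattern rule for noun phrases described above can be sketched as follows. This is an illustration under stated assumptions, not the project's annotation tool: phrase components are simplified to (word, POS) pairs, noun POS tags are assumed to start with "n" (as in n, nr, ns, nt, nz), `HEAD` is a hypothetical marker for the headword, and the function names are invented for this example. Only the PZ modifier tag comes from the text.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (word, POS tag)

HEAD = "HEAD"    # hypothetical headword marker, not a PolyU Treebank tag
MODIFIER = "PZ"  # modifier tag named in the text for the default NP pattern


def is_noun(pos: str) -> bool:
    # Assumption: noun POS tags start with "n" (n, nr, ns, nt, nz).
    return pos.startswith("n")


def default_np_labels(tokens: List[Token]) -> List[str]:
    """Default NP pattern: the last noun is the head, all others are PZ."""
    noun_positions = [i for i, (_, pos) in enumerate(tokens) if is_noun(pos)]
    if not noun_positions:
        raise ValueError("no noun found; the default NP pattern does not apply")
    head = noun_positions[-1]
    return [HEAD if i == head else MODIFIER for i in range(len(tokens))]


def needs_explicit_labels(tokens: List[Token], intended: List[str]) -> bool:
    """True when the intended labels differ from the default pattern."""
    try:
        return default_np_labels(tokens) != intended
    except ValueError:
        # No default is available, so the labels must be written out.
        return True


if __name__ == "__main__":
    np_tokens = [("中国", "ns"), ("经济", "n")]
    print(default_np_labels(np_tokens))
    print(needs_explicit_labels(np_tokens, [MODIFIER, HEAD]))
```

Under this reading, a phrase that matches the default carries no explicit FF labels, which would be consistent with default tags being rare or absent in the explicit counts.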
For the material annotated in duplicate by multiple annotators, the evaluation program reports that the accuracy of phrase annotation is higher than 99.5% and the consistency between annotators is higher than 99.8%. For the other annotated material, the quality evaluation program preliminarily reports that the accuracy of phrase annotation is higher than 98%. Further checking and evaluation are ongoing to ensure that the final overall accuracy reaches 99%.
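The text does not define how accuracy and consistency are computed. One plausible reading, sketched below purely as an assumption, treats each annotation as a set of labeled brackets and measures consistency as the overlap between two annotators' sets, and accuracy as the share of brackets that match a checked reference. The names `Bracket`, `bracket_consistency`, and `bracket_accuracy` are hypothetical.

```python
from typing import Set, Tuple

# (start word index, end word index, SS label); an assumed representation.
Bracket = Tuple[int, int, str]


def bracket_consistency(a: Set[Bracket], b: Set[Bracket]) -> float:
    """Dice overlap of two annotators' bracket sets (1.0 = identical)."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


def bracket_accuracy(annotated: Set[Bracket], reference: Set[Bracket]) -> float:
    """Share of annotated brackets that also appear in the reference."""
    if not annotated:
        # Nothing was annotated: perfect only if the reference is empty too.
        return 1.0 if not reference else 0.0
    return len(annotated & reference) / len(annotated)


if __name__ == "__main__":
    annotator_1 = {(0, 1, "NP"), (2, 4, "VP"), (0, 4, "SV")}
    annotator_2 = {(0, 1, "NP"), (2, 4, "VP"), (0, 4, "IC")}
    print(f"consistency: {bracket_consistency(annotator_1, annotator_2):.3f}")
    print(f"accuracy:    {bracket_accuracy(annotator_1, annotator_2):.3f}")
```

A bracket counts as matching only if both its span and its label agree, so a boundary error and a label error are penalized alike. A finer-grained evaluation could score the two separately.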
We also intend to further develop our tools, improving the automatic annotation analysis and evaluation program so that it can detect potential annotation errors and inconsistencies. Other visualization tools are being developed to support keyword searching, context indexing, and annotation case searching. Once we complete Stage 5, we intend to make the PolyU Treebank data available for public access. Furthermore, we are developing a shallow parser that uses the PolyU Treebank as training and testing data.