Collocation describes how one word is used in relation to (co-occurs with) others in a language structure to form a specific meaning. Finding the collocations of words is an important part of language use and understanding. It is fundamental to many natural language processing (NLP) applications such as Information Retrieval (IR) and Information Extraction (IE). Research in this area has been conducted using lexical statistics of the words within the combination. However, the lexical statistical approach performs poorly in both precision and recall, and it cannot identify collocations that appear sparsely or irregularly. It should also be noted that the collocation information obtained so far is mostly for Western languages.
We consider collocation to be the syntagmatic relationship between words. To determine precisely whether two or more words form a collocation, syntactic information must be introduced. The best way to identify the structure of a sentence is to parse it. In the last several years, a new shallow parsing approach has emerged for linguistic analysis in the study of Western languages. Shallow parsing aims to give only the local structures in a sentence, such as non-recursive noun phrases (base NP), V-O structures and S-V structures.
In this study, we plan to investigate how to apply shallow parsing to the automatic extraction of collocations in Chinese text. Because a shallow parser can give the local structures of a sentence with high precision, speed and robustness, it can be used to process a large collection of unrestricted text. Once local structures are identified by the shallow parser, extraction of collocations, whether adjacent or disjoint, becomes much easier. The main work in the project consists of four parts: (1) developing a robust shallow parser which can identify the local structures formed by content words in Chinese running text; (2) building a large Chinese “shallow” treebank; (3) developing an automatic collocation extraction tool which extracts collocations from the output of the shallow parser; and (4) building a Chinese collocation database with the help of the collocation extraction tool.
Motivation and Long-term Significance
Collocation is a very important part of language processing and understanding. It goes one step beyond the grammatical analysis of sentences, toward an understanding of how one word is used in combination with other words in a language structure to form a specific meaning. It reflects natural and habitual usage in language. For example, in English, people say ‘warm welcome' rather than ‘hot welcome', and ‘strong tea' rather than ‘powerful tea'. Similarly, in Chinese, ‘行李', ‘包裹' and ‘包袱' are three similar nouns; however, we say ‘思想包袱' rather than ‘思想行李', and ‘托运行李' rather than ‘托运包袱'. Briefly speaking, collocations cover word pairs and phrases that are commonly used in a language but for which no general syntactic or semantic rules can be directly applied. They depend on natural, habitual language use.
Even though collocations occur naturally in text and are easily understood by human beings, it is still difficult to give a clear and strict formal definition. Some researchers restricted collocations to frequently used consecutive word sequences, while others also considered disconnected forms such as 'make … decision'. From the semantic point of view, some researchers required that the whole meaning of a collocation cannot be predicted from its components, while others considered such a restriction too strict [Allerton 1984][Benson 1989][Cruse 1986]. A more detailed discussion of collocation will be presented in Chapter 3. To avoid confusion, the definition of collocation used in this study is given as follows:
A Chinese collocation is a recurrent and conventional expression that holds syntactic and semantic relations, consists of two or more words, and contains at least two content words.
In other words, our research on Chinese collocation focuses on collocations between content words that form recurrent and conventional expressions. Here, content words in Chinese include nouns, verbs, adjectives, adverbs, determiners, and directional words.
Generally speaking, a collocation consists of two or more words, at least one of which must be a content word. According to the number of components, collocations are divided into BI-Gram collocations and N-Gram collocations [Smadja 1988]. The words in a collocation can appear consecutively or in an interrupted context; that is, both uninterrupted and interrupted collocations are objects of this research [Manning 2000]. For example, 浓 茶, 烈 酒 and 思想 包袱 are uninterrupted two-word collocations, whereas 裁减 - 员额 and 消减 - 人工 are interrupted two-word collocations because modifiers can be inserted between the two words. Most multi-word collocations are uninterrupted, like ‘安定 团结 的 政治 局面'.
From the linguistic point of view, collocations can be classified into fully fixed collocations, fixed collocations, strong collocations, and normal collocations according to the strength of internal restriction, compositionality, substitutability, and modifiability (discussed in detail later) [Sinclair 1991] [Brundage 1992] [Benson 1989] [Hoey 1991] [Smadja 1993].
Over the past decades, collocation has become an important research topic in computational linguistics. We see two reasons behind this research. On the one hand, collocation has been neglected in structural linguistic traditions [Chomsky Saussure]. These traditions began with Chomsky's Transformational Grammar, and many further frameworks were proposed, including Systemic Grammar [Halliday], Case Grammar [Fillmore], Generalized Phrase Structure Grammar (GPSG) [Gazdar], and Head-Driven Phrase Structure Grammar (HPSG). Structural linguistics concentrates on general abstractions about the construction and properties of phrases and sentences, and applies these grammars or rules in language processing. However, a large number of linguistic phenomena still cannot be covered. In the example above, ‘powerful tea' satisfies every grammatical restriction but is never used in natural language. Collocation research does not aim to oppose the structural linguistic tradition; rather, it can enrich and strengthen it. On the other hand, collocation research focuses on how words combine into collocations and on identifying the appropriate word to use in a context. As Firth famously concluded, ‘a word is characterized by the company it keeps' [Firth 1959]. That is, both distinguishing the different meanings of a word in context and identifying an appropriate word to fill a context depend on the usage of collocations. That is why collocations are widely employed in many natural language processing applications, especially Machine Translation (MT), Information Retrieval (IR), Natural Language Generation (NLG), Word Sense Disambiguation (WSD) and speech recognition.
First, collocations differ from one language to another; therefore, a mapping of collocations between languages can help an MT system avoid some problems [Gitaski et al. 2000]. For example, when translating ‘strong tea' into Chinese, we want the output to be ‘浓 茶' rather than ‘强 茶' (the word-for-word mapping result).
Secondly, employing collocations ensures that the output of an NLG system sounds natural, so that mistakes like ‘powerful tea' [Manning 1999], ‘烈茶' and ‘浓酒' can be avoided.
Thirdly, collocations can be employed in IR research. The accuracy of retrieval can be improved if the similarity between a user query and a document is determined according to common collocations (or phrases) instead of common words. The techniques can also be applied in cross-language information retrieval.
Fourthly, collocations can be used in word sense disambiguation. As Wittgenstein says, ‘the meaning of words lies in their use' [Wittgenstein 1995]; the different senses of a word are indicated by its collocations. Another application related to WSD is computational lexicography. The techniques can be used to automatically identify the important collocations to be listed in a dictionary entry. Furthermore, collocations can help to describe the senses of a word precisely (contrastive collocations separate the different senses of a word) [Church 1989][Sinclair 1995].
Lastly, the techniques can be employed to improve the language models in speech recognition. Conventional language models in speech technology are based on such basic linguistic units as words and phones. Language models based on collocation can describe the pronunciations of words more accurately [Stolcke 1997].
The objective of this research is to construct an effective collocation extraction system with high precision and recall, and to apply collocation knowledge to natural language applications. Both newly designed algorithms and improvements to existing techniques are investigated to address current limitations and unsolved problems.
The detailed tasks to achieve this objective may be summarized as follows:
- Manually analyze the status and properties of Chinese collocations. Based on this work, give a clear and accurate definition that is computationally operable. Furthermore, analyze the linguistic properties of collocations. This part of the work is the foundation of the whole research.
- Apply the techniques of Xtract to Chinese and evaluate their performance. Improve the system through parameter optimization and algorithm modification, and analyze and evaluate its output.
- Design and implement a new collocation extraction technique based on BI-directional BI-Grams, which analyzes the association between the headword and the co-word in both directions and is expected to improve both precision and recall. Furthermore, evaluate the association using hypothesis testing to strengthen this technique.
- Design and construct a large manually annotated Treebank for extracting syntactic knowledge and collocation patterns.
- Extract collocation patterns by automatically analyzing the cases in the Treebank.
- Establish a new technique that incorporates lexical statistics and collocation patterns. Investigate and evaluate appropriate integration strategies.
- Demonstrate the application of collocation knowledge by improving an existing natural language processing system. A reasonable improvement of this system is expected.
Definition of Chinese Collocation
Generally speaking, collocations are close and frequently used word combinations; they depend mainly on habitual language use rather than on general syntactic or semantic rules.
Linguists have given several definitions of collocation. According to Gitsak [Gitsak 2000] and Lin [Lin 1998], ‘collocation is habitual and recurrent word combination'. Manning [Manning 1999] defined it thus: ‘a collocation is an expression consisting of two or more words that correspond to some conventional way of saying things'. Besides these, many researchers adopted Benson's definition: ‘a collocation is an arbitrary and recurrent word combination' [Benson 1990]. This definition is adopted in English collocation extraction research such as Smadja [Smadja 1993] and Cruse [Cruse 1986], as well as in Chinese collocation extraction by Sun [Sun M.S. 1997]. Even though collocations occur naturally in text and the above definitions are easily understood by human beings, it is still difficult to formulate them in a computer system for automatic extraction.
Furthermore, compared to English, Chinese has a rather simpler grammar, and its lexical usage is much more flexible. Thus function words play a less important role in Chinese, and collocations mainly exist between content words. Meanwhile, the part of speech and meaning of content words in Chinese are much more flexible and mostly depend on the context. For example, the word ‘学习' can be used as a verb, as in ‘努力 学习', as well as a noun, as in ‘政治 学习'. Therefore, our research on Chinese collocation extraction will mainly focus on the collocations between content words. In this research, we define a Chinese collocation as:
A Chinese collocation is a recurrent and conventional expression that holds syntactic and semantic relations, consists of two or more words, and contains at least two content words.
In other words, our research on Chinese collocation focuses on collocations between content words that form recurrent and conventional expressions. Here, content words in Chinese include nouns, verbs, adjectives, adverbs, determiners, and directional words. Function words comprise conjunctions, exclamations, quantifiers, numerals, onomatopoeia, prepositions, postpositions, pronouns, space words, time words, auxiliaries, adjuncts, modals, affixes, and degree terms.
Here, two kinds of special cases should be mentioned:
First, idioms, proverbs, sayings and abbreviations are regarded as independent and closed units. In this study, they are themselves regarded as a special kind of fully fixed collocation, and thus possible collocations between them and other words are ignored.
Furthermore, the collocation between a quantifier and a noun is an interesting kind of collocation [Lee 1996*]. It can be easily extracted from a POS-tagged corpus or POS-tagged running text, and thus it is not an object of this research.
Characteristics of Chinese Collocation
In this section, we analyze the characteristics of Chinese collocations from two points of view: syntactic and semantic.
From the syntactic point of view, a Chinese collocation has the following characteristics:
1. A collocation is the occurrence of two or more words within a short context [Sinclair 1991]. These words may be adjacent or distant.
That is to say, we do not require the words of a collocation to be adjacent. However, they must occur within a short context: in this research, we require a collocation to occur within a clause, and possible collocations that cross a clause boundary are ignored. This restriction is intended to reduce the difficulty of collocation extraction and to ensure high precision.
For example, “历史 包袱” is a typical collocation of adjacent words, and 打击 -- 犯罪 (the cases include 打击 犯罪，打击 职务 犯罪，打击 毒品 犯罪，打击 飞车 抢劫 犯罪 and so on) is a collocation of distant words.
2. A collocation in this research must contain at least two content words.
This is attributed to the fact that content words convey most of the information in a language. Thus, most collocations occur between content words. Combinations of a content word and a function word, like verb-preposition and noun-affix, are not regarded as collocations in this research; the function word is regarded as a modifier of the content word.
Furthermore, word combinations of a quantifier and a noun are regarded as collocations but will not be processed in our collocation extraction system.
3. A collocation must be grammatically well formed and has its own syntactic structure, including word order and word distance.
Firstly, the words in a collocation always occur in a fixed order. For example, in collocations of the verb-object form, the verb must appear before the object. We can say “弹 钢琴” and “踢 足球”, but never “钢琴 弹” or “足球 踢”.
Secondly, the words of a collocation often appear at a fixed or similar distance. For example, the words in adjective-noun collocations are often adjacent or separated by a short distance; for instance, “历史” and “包袱” often appear adjacently. By contrast, the distance between the words in verb-object collocations is freer, but the order is always fixed. For instance, in 打击 - 犯罪 (the cases include 打击 犯罪，打击 职务 犯罪，打击 毒品 犯罪，打击 飞车 抢劫 犯罪 and so on), the words of the collocation occur both adjacently and separated.
4. In the spectrum of word combinations, collocations are co-occurrences of words that fall between idioms, which are very stable and unbreakable, at one end and free combinations at the other end.
Generally speaking, collocations express the lexical and semantic restrictions within a sentence. Our research on collocation extraction focuses mainly on the phrase level; only a few special cases of short sentences are considered.
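The syntactic constraints above (co-occurrence inside one clause, fixed order, adjacent or distant words) can be illustrated with a small sketch. It is not the extraction system of this research: the delimiter set, the function names and the `max_distance` parameter are assumptions made for illustration, and the input is assumed to be already word-segmented.

```python
from collections import defaultdict

# Hypothetical clause delimiters; a real system would use a fuller punctuation set.
CLAUSE_DELIMITERS = {"，", "。", "；", "！", "？", ",", ".", ";", "!", "?"}


def split_clauses(tokens):
    """Split a segmented token list into clauses at punctuation tokens."""
    clauses, current = [], []
    for tok in tokens:
        if tok in CLAUSE_DELIMITERS:
            if current:
                clauses.append(current)
            current = []
        else:
            current.append(tok)
    if current:
        clauses.append(current)
    return clauses


def cooccurrence_offsets(tokens, max_distance=5):
    """Map each ordered pair (w1, w2) to the distances at which w2 follows w1.

    Pairs are only collected inside a clause, so candidates that cross a
    clause boundary are ignored, and the order of the pair is preserved.
    """
    offsets = defaultdict(list)
    for clause in split_clauses(tokens):
        for i, w1 in enumerate(clause):
            upper = min(i + 1 + max_distance, len(clause))
            for j in range(i + 1, upper):
                offsets[(w1, clause[j])].append(j - i)
    return offsets


tokens = "打击 犯罪 ， 打击 毒品 犯罪 ， 打击 飞车 抢劫 犯罪".split()
offsets = cooccurrence_offsets(tokens)
print(offsets[("打击", "犯罪")])
```

For the three clauses in the usage example, the pair (打击, 犯罪) is collected at distances 1, 2 and 3, while the reversed pair is never collected because it only arises across clause boundaries.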
From the semantic point of view, a Chinese collocation has the following characteristics:
1. A collocation is a bound combination with limited compositionality [Manning 1999] [Brundage 1992][Benson 1989].
Brundage introduced the term ‘compositional' as follows: ‘a natural language expression is called compositional if the meaning of the expression can be predicted from the meaning of the parts'.
Free word combinations can be generated by general linguistic rules, and their meaning is mostly the combination of the meanings of their components; furthermore, only weak restrictions exist between the words. For example, the meaning of the word combination ‘写 文章' is directly composed of the meanings of the two words. Each word of this combination can be substituted by many words; newly generated word combinations like ‘写 作业' and ‘看 文章' are grammatically well formed and take their meanings from their components. Thus, it is not regarded as a collocation. By contrast, the meaning of the word combination ‘思想 包袱' is obviously different from the direct combination of the meanings of the two words, and it is naturally a collocation.
Brundage requires a collocation to be non-compositional. We think this restriction is too strict; a collocation should be a bound combination with limited compositionality. On the one hand, collocations are expected to carry additional meaning, i.e., meaning that cannot be derived directly from the meanings of their components. On the other hand, those collocations that have little additional meaning beyond the combination of words still show a close semantic restriction between the components.
Here we should pay attention to two kinds of special cases: idioms and terms.
Idioms are considered fully fixed collocations that are non-compositional; they are fixed combinations of words with specific meanings. Idioms exist widely in Chinese and many other languages. For example, the meanings of “刻舟求剑” and “守株待兔” cannot be derived from the combined meanings of their components.
Another kind of case is terms. When we extract collocations from a specific domain, many of them are terms. In this research, the extracted term phrases are regarded as collocations.
2. A collocation has limited substitutability and limited modifiability
Brundage believes a collocation must be non-substitutable and non-modifiable [Brundage 1992]. That means we cannot substitute any word of a collocation, even with its synonyms, to construct a new collocation with the same meaning. Furthermore, collocations cannot be freely modified by adding modifiers or through grammatical transformations, because collocations are basically based on conventional usage.
We think such a restriction is too strict, and many interesting word combinations would be lost. In this research, we require a collocation to have limited substitutability and limited modifiability. That is, when the components of a word combination are replaced by their synonyms, if none or only very few of the newly generated combinations are themselves strong combinations and the rest are meaningless or ill-formed, the word combination is regarded as a collocation.
For example, consider the word combination ‘裁减 员额'. The synonyms of ‘裁减' recorded in the Chinese synonym dictionary (同义词词林) [Mei ***] are 减少, 缩减, 减缩, 压缩, 削减, 裁减 and 节减. When we use these words to substitute for ‘裁减', none of the new word combinations occurs in practical text. Naturally, ‘裁减 - 员额' fulfills the requirement of non-substitutability. As for the word combination ‘裁减 职位', when we substitute the synonyms of ‘裁减', the new word combinations ‘减少职位' and ‘缩减职位' are found in the corpus. According to Brundage's requirement, these are not collocations. In this research, however, the fact that only two of the seven synonyms can substitute for ‘裁减' fulfills the requirement of limited substitutability and limited modifiability, so they are regarded as collocations.
3. A collocation is recurrent [Smadja 1993]
Hoey pointed out, “collocation has long been the name given to the relationship a lexical item has with items that appear with greater than random probability in its context” [Hoey 1991].
That means these word combinations are not exceptions but occur frequently in similar contexts. This is easy to understand: collocation is based on conventional usage, so only frequently used word combinations are regarded as collocations.
4. A collocation is domain-dependent [Smadja 1993]
Within a specific domain, many collocations tend to be term phrases. Furthermore, some word combinations can be regarded as collocations in one specific domain, while in other domains they tend to be free combinations with high co-occurrence.
For example, ‘模糊 聚类' means fuzzy clustering, which is frequently used in computer science but seldom appears in general text. Even when these two words appear in the same context there, they normally do not form a collocation.
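The substitution test described under characteristic 2 can be sketched as a simple lookup. This is a minimal illustration, not the procedure actually used in this research: the function name is hypothetical, and the set of attested pairs is toy data that only mirrors the 裁减 example in the text instead of real corpus counts.

```python
def attested_substitutes(head, partner, synonyms, corpus_pairs):
    """Return the synonyms of `head` that are also observed with `partner`.

    `corpus_pairs` is assumed to be a set of (word, partner) pairs that
    were observed in a corpus; `head` itself is skipped if it is listed
    among its own synonyms.
    """
    return [s for s in synonyms if s != head and (s, partner) in corpus_pairs]


# Synonyms of 裁减 as listed in the text.
SYNONYMS = ["减少", "缩减", "减缩", "压缩", "削减", "裁减", "节减"]

# Toy stand-in for corpus evidence, mirroring the example in the text.
CORPUS_PAIRS = {
    ("裁减", "员额"),
    ("裁减", "职位"),
    ("减少", "职位"),
    ("缩减", "职位"),
}

print(attested_substitutes("裁减", "员额", SYNONYMS, CORPUS_PAIRS))
print(attested_substitutes("裁减", "职位", SYNONYMS, CORPUS_PAIRS))
```

With this toy data no substitute is attested for 裁减 员额, and two are attested for 裁减 职位, which is the distinction between non-substitutable and limitedly substitutable combinations drawn in the text.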
Classifications of Chinese Collocation
Collocations vary. Some are very rigid, whereas others are flexible. According to internal restriction, substitutability, and modifiability, we classify Chinese collocations into four types. Such a classification strategy is computationally operable.
Type 0 Collocation: Fully fixed collocation
A fully fixed collocation is the strictest type; it fulfills two conditions:
a. the components are non-substitutable; that is, a fully fixed phrase allows no syntactic transformation and no internal lexical variation;
b. it has a frozen form; that is, the components cannot be shifted around, added to, or altered.
Type 0 collocations include some idioms, proverbs and sayings, for example ‘缘木求鱼' and ‘釜底抽薪'. None of the components can be substituted, the word order cannot be changed, and no words can be inserted.
Type 1 Collocation: Fixed collocation
A Type 1 collocation fulfills:
a. the components are non-substitutable;
b. the order of the components cannot be changed, but modifiers may be inserted.
The components within a Type 1 collocation are fully internally restricted; that is, the appearance of one word implies the co-occurrence of the other. No component of the collocation can be substituted by its synonyms.
For example, ‘裁减 员额' is a Type 1 collocation because its components cannot be substituted by their synonyms and the word order cannot be changed. Meanwhile, modifiers can be inserted into this collocation to form new multi-word collocations like ‘裁减 军队 员额' and ‘裁减 政府 员额'.
Type 2 Collocation: Strong collocation
A Type 2 collocation fulfills:
a. very limited substitutability of the components;
b. the order of the components cannot be changed, but modifiers may be inserted.
Unlike Type 1 collocations, a Type 2 collocation allows very limited substitution of its components. Here ‘very limited substitution' means that only one word in the collocation can be substituted, by very few of its synonyms, while the other components must stay fixed.
For example, ‘裁减 职位' is a Type 2 collocation because only two of the seven synonyms of ‘裁减' can be substituted to form collocations: ‘减少职位' and ‘缩减职位'.
Type 3 Collocation: Normal collocation
A Type 3 collocation fulfills:
a. limited substitutability of the components;
b. modifiers may be inserted;
c. the order of the components may be changed.
Type 3 collocations have fewer internal restrictions. More substitution of components is allowed, but a limit is still required: once the components of a word combination can be substituted by most of their synonyms, the combination tends to be a grammatical combination rather than a true collocation.
For example, ‘减少 开支', ‘缩减 开支', ‘压缩 开支' and ‘消减开支' are Type 3 collocations because the word combinations ‘裁减 开支', ‘减缩 开支' and ‘节减 开支' are not found in practical text.
The linguistic discussion of collocation is very important to this research. Based on the definition and classification of collocation, we can distinguish true collocations from pseudo ones, so that the correct answers can be identified in the output of an automatic collocation extraction system. Meanwhile, the characteristics and classification conditions described above motivate and direct the research on eliminating pseudo collocations from automatic extraction results.
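The four types above can be read as a decision procedure over three observations: how many synonym substitutions are attested, whether modifiers can be inserted, and whether the order can change. The sketch below is one possible reading under stated assumptions. The text gives no numeric boundaries for "very limited" and "limited", so the thresholds `strong_max` and `normal_max` are hypothetical values chosen only so that the examples in the text come out as described; the class and function names are likewise invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Observations:
    n_synonyms: int          # candidate synonyms tried for the substituted component
    n_attested: int          # substitutions actually found in the corpus
    allows_insertion: bool   # modifiers can be inserted between the components
    allows_reordering: bool  # the components occur in either order


def classify(obs, strong_max=0.3, normal_max=0.6):
    """Return the collocation type 0-3, or None for a free combination.

    strong_max and normal_max are assumed cut-offs on the share of
    attested substitutions; they are not taken from the source text.
    """
    ratio = obs.n_attested / obs.n_synonyms if obs.n_synonyms else 0.0
    if obs.n_attested == 0 and not obs.allows_reordering:
        # Non-substitutable: frozen form is Type 0, insertable is Type 1.
        return 1 if obs.allows_insertion else 0
    if ratio <= strong_max and not obs.allows_reordering:
        return 2
    if ratio <= normal_max:
        return 3
    return None


print(classify(Observations(7, 0, True, False)))   # 裁减 员额
print(classify(Observations(7, 2, True, False)))   # 裁减 职位
print(classify(Observations(7, 4, True, False)))   # 减少 开支 and its variants
```

Under these assumed thresholds the three calls yield Type 1, Type 2 and Type 3 respectively, matching the examples given for each type; different corpora would very likely need different cut-offs.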
Research on the automatic extraction of collocations
Research on automatic collocation extraction began with the work of Choueka in 1983; in 1988 he conducted an experiment on an 11-million-word corpus of the New York Times. Church conducted an experiment on the 44-million-word Associated Press corpus in 1991. Smadja carried out the most comprehensive and in-depth work in this field. He developed a lexicographic tool, Xtract, which was applied to a 10-million-word corpus of stock market news reports.
The techniques in these studies were mainly based on lexical statistics. The collocations were selected based on the following statistical criteria:
- Mutual Information [7,11,13]
- Mean and variance of the distribution of the collocation [6,14].
- t-test or chi-square for hypothesis testing.
These techniques use word statistics to reflect the strength of the association. However, some problems remain to be solved. Firstly, the accuracy of automatic extraction is still unsatisfactory. As reported by Smadja, the accuracy of English collocation extraction based on lexical statistics was only 40%. Similar work was done for Chinese. As reported by Sun et al., the accuracy of Chinese collocation extraction using a method similar to Smadja's was 29.3%. Sun (1998) improved the performance by changing the “watch window” used in the statistics and obtained a 46.1% accuracy rate and a 64.5% recall rate. Smadja also reported that the accuracy could rise from 40% to 80% by parsing the text before lexical statistics. His work showed that syntactic filtering can greatly improve the performance of automatic collocation extraction. Secondly, the techniques, being based on statistics of words in a large corpus, cannot be applied to identify collocations in a small corpus or a single document. Thirdly, although different approaches have been taken, the scale of the work has been very limited. There have been few reports on large collocation databases extracted automatically or semi-automatically from running text. Finally, almost all of the past work was done for English. There is some work on this topic for Chinese, but it is at a very limited scale [13, 14, 16].
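The statistical criteria listed above can be sketched in a few lines. The following is a textbook-style illustration, not the exact formulation used in the cited systems: the function names are hypothetical, probabilities are estimated by relative frequency, and the t statistic uses the common approximation that the sample variance of a rare event is close to its mean.

```python
import math
from statistics import mean, pvariance


def pointwise_mutual_information(pair_count, w1_count, w2_count, total):
    """log2 of P(w1, w2) / (P(w1) * P(w2)), estimated by relative frequency."""
    return math.log2((pair_count * total) / (w1_count * w2_count))


def t_score(pair_count, w1_count, w2_count, total):
    """t statistic for the null hypothesis that w1 and w2 occur independently.

    The variance of the observed proportion is approximated by the
    proportion itself, which is reasonable when the pair is rare.
    """
    observed = pair_count / total
    expected = (w1_count / total) * (w2_count / total)
    return (observed - expected) / math.sqrt(observed / total)


def offset_profile(offsets):
    """Mean and population variance of the distances between two words.

    A low variance suggests a rigid pair; a high variance a flexible one.
    """
    return mean(offsets), pvariance(offsets)


# Toy counts, chosen only to make the arithmetic easy to follow.
print(pointwise_mutual_information(8, 16, 32, 1024))
print(t_score(8, 16, 32, 1024))
print(offset_profile([1, 2, 3]))
```

All three measures depend on corpus counts, which is why, as noted above, they become unreliable for sparse pairs and for small corpora.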
Research on shallow parsing
The aim of shallow parsing is to reliably recognize relatively simple syntactic elements, rather than to produce complete parses. Previous research in this field can be classified into two categories: the rule-based approach and the statistics-based approach.
1. Rule-based approach
Abney [17-19] used cascaded finite-state machines for shallow parsing. A finite-state cascade consists of strata, each of which is defined by a set of regular expression patterns for recognizing phrases. Voutilainen [20-21] used a constraint-grammar-based approach to assign to every word an appropriate functional tag indicating its syntactic function in the context. Noun phrases can be recognized readily from the functional tags. As compiling a large set of grammatical rules is tedious and time-consuming, machine learning techniques have been applied to learn the rules automatically from a training corpus. Ramshaw and Marcus used the error-driven transformation-based learning proposed by Brill to learn a set of sequential rules for recognizing base NPs. Cardie & Pierce used instance-based learning to acquire a set of base NP rules from treebanks. Argamon et al. used memory-based learning to acquire rules for the recognition of noun phrases, subject-verb and verb-object patterns. Although none of the current techniques yields results competitive with hand-written grammars, research in this direction has shown great potential.
2. Statistics-based approach
Church applied Hidden Markov Models (HMMs) to the recognition of base NPs. He modeled the recognition of base NPs as the insertion of open and close brackets between tag pairs. Bikel et al. used HMMs for the named entity task in information extraction (finding names and other non-recursive entities in text). Statistical criteria such as mutual information and chi-square are also used to determine phrase boundaries [28-29].
Most of the work has been experimental and on a limited scale. Researchers have focused on one specific problem, such as base NP recognition, phrase boundary determination, or named entity recognition. None tried to apply the techniques to all the tasks of collocation extraction to be investigated here.
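To make the rule-based idea concrete, the sketch below implements a single stratum of the kind of regular-expression pattern described for finite-state cascades, recognizing base NPs over a POS tag sequence. It is an illustration only and is not taken from any of the cited systems: the single-letter tag codes, the pattern `a*n+` and the function name are all assumptions.

```python
import re

# Hypothetical single-letter tag codes: a = adjective, n = noun,
# v = verb, r = pronoun. Each tag must be exactly one character so that
# positions in the tag string line up with positions in the token list.
BASE_NP = re.compile(r"a*n+")


def chunk_base_np(tagged):
    """Return base NP chunks as lists of words.

    `tagged` is a list of (word, tag) pairs. The tags are joined into one
    string and every non-overlapping match of the pattern is mapped back
    to the corresponding words.
    """
    tag_string = "".join(tag for _, tag in tagged)
    return [
        [word for word, _ in tagged[m.start():m.end()]]
        for m in BASE_NP.finditer(tag_string)
    ]


sentence = [("他", "r"), ("喝", "v"), ("浓", "a"), ("茶", "n")]
print(chunk_base_np(sentence))
```

A full cascade would feed the output of one such stratum into the next, for example combining a verb with a following base NP into a V-O structure; once such local structures are available, the words inside them become natural collocation candidates.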