Institute of Systems Science, Chinese Academy of Sciences

Information Extraction and Text Summarization in Thai Language

Speaker: Thanaruk Theeramunkong (SIIT, Thammasat University, Thailand)
Time: 9:00 a.m., October 21, 2011 Venue: Room 712, Siyuan Building

Abstract: In this work, we study named entity recognition and text summarization in the Thai language. Named entity recognition is a nontrivial and challenging task for information extraction in Thai, since Thai text has no explicit word, phrase, or sentence boundaries. In the first work, we propose a method that exploits the concept of character clusters, i.e., sequences of inseparable characters, to group characters into clusters, and then uses statistics among characters and their clusters to extract Thai words and recognize named entities simultaneously. The method integrates two phases, a word-segmentation model and a named-entity-recognition model. Context features are exploited to learn the parameters of these two discriminative probabilistic models, i.e., conditional random fields (CRFs), which rank the generated word and named entity candidates. Moreover, three alternative discriminative probabilistic approaches, namely (1) a phase-independent approach, (2) a phase-merging approach, and (3) a phase-cascading approach, are proposed and compared. In the second work, we study a number of techniques for constructing a comprehensive summary from multiple documents. Toward summarizing multiple news articles related to a specific event, we studied a method for finding relationships among entities using association rule mining, and proposed a graph-based summarization method that constructs a summarization graph by modelling text portions as nodes and the relationships among them as edges. An ideal summary should include only the important descriptions common to these articles, together with some dominant differences among them.
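
As a rough illustration of the graph-based idea in the abstract (text portions as nodes, relationships as edges), the following is a minimal sketch and not the speaker's method. It makes several assumptions: the edge weight is plain Jaccard word overlap rather than relationships found by association rule mining, the input is already segmented into space-separated tokens (which raw Thai text is not), and the function names `build_graph` and `summarize` are hypothetical. It ranks portions only by how strongly they connect to others, so it captures common descriptions but not the dominant differences.

```python
from itertools import combinations


def build_graph(portions):
    """Return weighted edges between text portions (nodes are list indices).

    Assumption: the edge weight is the Jaccard overlap of the two token sets.
    """
    tokens = [set(p.lower().split()) for p in portions]
    edges = {}
    for i, j in combinations(range(len(portions)), 2):
        union = tokens[i] | tokens[j]
        weight = len(tokens[i] & tokens[j]) / len(union) if union else 0.0
        if weight > 0:
            edges[(i, j)] = weight
    return edges


def summarize(portions, k=2):
    """Pick the k portions with the highest total edge weight, in original order."""
    edges = build_graph(portions)
    score = [0.0] * len(portions)
    for (i, j), w in edges.items():
        score[i] += w
        score[j] += w
    top = sorted(range(len(portions)), key=lambda n: score[n], reverse=True)[:k]
    return [portions[n] for n in sorted(top)]


if __name__ == "__main__":
    # Toy, pre-segmented English input standing in for news text portions.
    news = [
        "flood hits bangkok city",
        "bangkok flood damages city roads",
        "stock market rises",
    ]
    print(summarize(news, k=2))
```

Summing edge weights per node is the simplest centrality measure available here; a fuller treatment would replace it with whatever ranking the summarization graph actually calls for.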