Please use this identifier to cite or link to this item:
http://hdl.handle.net/1947/9900
| Title: | The Use of Systemic-Functional Linguistics in Automated Text Mining. |
| Report number: | DSTO-RR-0339 |
| AR number: | AR-014-419 |
| Classification: | Unclassified |
| Report type: | Research Report |
| Authors: | Kappagoda, A. |
| Issue Date: | 2009-03 |
| Division: | Command, Control, Communication and Intelligence Division |
| Abbreviation: | C3ID |
| Release authority: | Chief, Command, Control, Communication and Intelligence Division |
| Task sponsor: | ASCP EXEC DIR CTSTC |
| Task number: | INT 07/020 |
| File number: | 2009/1016253/1 |
| Pages or format: | 82 |
| References: | 38 |
| DSTORL/DEFTEST terms: | Information extraction; Machine learning |
| Other descriptors: | Systemic-functional linguistics; Text mining; Text categorisation |
| Abstract: | Systemic-functional linguistics is a linguistic framework for the analysis of grammatical and
semantic information in text, with a potential role in automated text mining. This report outlines
essential features of the theory, its application in computational work, and the rationale for use in
automated text mining, and develops a grammatical annotation scheme (word functions) to
enrich a mixed text corpus of newspaper articles and e-mails, for machine learning of
semantically-oriented grammatical patterns. Testing demonstrates high accuracy in predicting
word functions in unseen text in co-training with other grammatical information, providing the
basis for further grammatical and semantic text processing. |
| Executive summary: | Using grammatical and semantic patterns as the basis for large-scale text processing has
wide potential to improve the quality and speed of information management and
analytical tasks in the defence and intelligence domains. It is proposed that a robust
linguistic model is needed to support the automation of these tasks, which is achieved by
co-training semantic and grammatical information with unstructured text, and that
systemic-functional linguistics (SFL) provides a prime means for achieving this. SFL is a
linguistic theory that has had a substantial presence in natural language processing work
for the past 40 years, with recent developments in rule-based and machine learning (ML)-based
text processing. An outline of the theoretical apparatus of SFL is presented, focusing
on a detailed treatment of the functional structure of word groups and phrases. This is
used to derive a grammatical annotation scheme for the labelling of the functions of single
tokens in unstructured text (WFG). A justification for using this scheme is presented, and a
method is outlined for the preprocessing of unstructured text and for annotation with the
WFG scheme, in order to produce training and testing corpora for an ML system employing
the 'conditional random fields' algorithm. It is demonstrated via this system that
automated WFG annotation can be achieved with high accuracy, and that such labelling
supports the automated recognition of other grammatical information such as chunk
labelling. It is proposed that WFG annotation provides a robust semantically-oriented
foundation for other kinds of semantically-based text processing, such as information
extraction and text categorisation, which are important elements in information
management in defence and intelligence tasks. |
| Appears in Collections: | DSTO Formal Reports |