
Please use this identifier to cite or link to this item: http://hdl.handle.net/1947/9900

Title: The Use of Systemic-Functional Linguistics in Automated Text Mining.
Report number: DSTO-RR-0339
AR number: AR-014-419
Classification: Unclassified
Report type: Research Report
Authors: Kappagoda, A.
Issue Date: 2009-03
Division: Command, Control, Communication and Intelligence Division
Abbreviation: C3ID
Release authority: Chief, Command, Control, Communication and Intelligence Division
Task sponsor: ASCP; EXEC DIR CTSTC
Task number: INT 07/020
File number: 2009/1016253/1
Pages or format: 82
References: 38
DSTORL/DEFTEST terms: Information extraction; Machine learning
Other descriptors: Systemic-functional linguistics; Text mining; Text categorisation
Abstract: Systemic-functional linguistics is a linguistic framework for the analysis of grammatical and semantic information in text, with a potential role in automated text mining. This report outlines essential features of the theory, its application in computational work, and the rationale for its use in automated text mining. It then develops a grammatical annotation scheme (word functions) to enrich a mixed text corpus of newspaper articles and e-mails for machine learning of semantically-oriented grammatical patterns. Testing demonstrates high accuracy in predicting word functions in unseen text in co-training with other grammatical information, providing the basis for further grammatical and semantic text processing.
Executive summary: Using grammatical and semantic patterns as the basis for large-scale text processing has wide potential to improve the quality and speed of information management and analytical tasks in the defence and intelligence domains. It is proposed that a robust linguistic model is needed to support the automation of these tasks, which is achieved by co-training semantic and grammatical information with unstructured text, and that systemic-functional linguistics (SFL) provides a prime means for achieving this. SFL is a linguistic theory that has had a substantial presence in natural language processing work for the past 40 years, with recent developments in rule-based and machine learning (ML)-based text processing. An outline of the theoretical apparatus of SFL is presented, focusing on a detailed treatment of the functional structure of word groups and phrases. This is used to derive a grammatical annotation scheme (WFG) for labelling the functions of single tokens in unstructured text. A justification for using this scheme is presented, and a method is outlined for the preprocessing of unstructured text and for annotation with the WFG scheme, in order to produce training and testing corpora for an ML system employing the 'conditional random fields' algorithm. It is demonstrated via this system that automated WFG annotation can be achieved with high accuracy, and that such labelling supports the automated recognition of other grammatical information such as chunk labelling. It is proposed that WFG annotation provides a robust semantically-oriented foundation for other kinds of semantically-based text processing, such as information extraction and text categorisation, which are important elements in information management in defence and intelligence tasks.
Appears in Collections: DSTO Formal Reports

Files in This Item:

File: DSTO-RR-0339 PR.pdf
Size: 365.67 kB
Format: Adobe PDF

Items in DSTO Publications Online are protected by copyright, with all rights reserved, unless otherwise indicated.

 
