Wednesday, March 9, 2011

GATE Plugins and CREOLE Resources

GATE (General Architecture for Text Engineering) is highly extensible. Its architecture is based on components (also called resources), and its framework functions as a backplane into which users can plug components.

Each component (a Java Bean) is a reusable chunk of software with a well-defined interface that may be deployed in a variety of contexts. You can define applications as processing pipelines built from these reusable components. In GATE, the set of these resources is officially named CREOLE (a Collection of REusable Objects for Language Engineering).

A set of components plus the framework is a deployment unit that can be embedded in a user's application.

CREOLE Resources

GATE components are one of three types:

  1. Language Resources (LRs) represent entities such as lexicons (e.g. WordNet), corpora or ontologies
  2. Processing Resources (PRs) represent entities that are primarily algorithmic, such as parsers, generators or n-gram modellers
  3. Visual Resources (VRs) represent visualisation and editing components that participate in GUIs

To better organize CREOLE resources, resource implementations can be grouped together as ‘plugins’ and stored at a URL. When the resources are stored in the local file system, this can be a file URL (e.g. file:///D:/Gate/Workspace/GoldFish/).
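
As a small illustration of the file URL mentioned above, a local directory can be turned into such a URL with the standard Java API. This is only a sketch: the class and method names are made up for the example, and the temporary directory merely stands in for a real plugin directory.

```java
import java.io.File;
import java.net.MalformedURLException;
import java.net.URL;

/** Illustrative helper (not part of GATE): turns a local directory into a file URL. */
class PluginUrlDemo {

    static URL toPluginUrl(File pluginDir) {
        try {
            // File.toURI() escapes special characters and, for an existing
            // directory, appends the trailing slash that marks a directory URL.
            return pluginDir.toURI().toURL();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Not a valid plugin location: " + pluginDir, e);
        }
    }

    public static void main(String[] args) {
        // The temp directory stands in for a plugin directory such as D:/Gate/Workspace/GoldFish.
        File dir = new File(System.getProperty("java.io.tmpdir"));
        System.out.println(toPluginUrl(dir));
    }
}
```

Going through toURI() first, rather than building the URL string by hand, avoids problems with spaces and other special characters in the path.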

CREOLE Plugins

To create a CREOLE plugin, you lay out its contents in a directory. The directory can contain a JAR file that holds the resource implementations, a configuration file (creole.xml), and external resources such as rules, gazetteer lists and schemas in a resources folder.

To create one, you can use the BootStrap Wizard in GATE Developer. For example, we create a new plugin with a single Processing Resource named GoldFish.

The following files and directories are created:
GoldFish/
+-- src/
      put all your Java sources in here.
+-- resources/
      any external files used by your plugin (e.g. configuration files,
      JAPE grammars, gazetteer lists, etc.) go in here.
+-- build.xml
      Ant build file for building your plugin.
+-- build.properties
      property definitions that control the build process go in here,
      in particular, make sure that gate.home points to your copy of GATE.
+-- creole.xml
      plugin configuration file for GATE - edit this to add parameters, etc.,
      for your resources.

Using CREOLE Resources

In applications that use GATE Embedded, you can construct an information extraction (IE) pipeline using CREOLE resources from different CREOLE plugins. For example, the Gold Fish example constructs a pipeline (a SerialAnalyserController) from three different PRs:
import gate.Factory;
import gate.creole.SerialAnalyserController;

// Class names of the PRs, in the order in which they should run
String[] processingResources = {
    "gate.creole.tokeniser.DefaultTokeniser",
    "gate.creole.splitter.SentenceSplitter",
    "sheffield.creole.example.GoldFish"};
SerialAnalyserController pipeline = (SerialAnalyserController) Factory
    .createResource("gate.creole.SerialAnalyserController");

// Instantiate each PR by class name and append it to the pipeline
for (int pr = 0; pr < processingResources.length; pr++) {
    System.out.print("\t* Loading " + processingResources[pr] + " ... ");
    pipeline.add((gate.LanguageAnalyser) Factory
        .createResource(processingResources[pr]));
    System.out.println("done");
}

Two of them are provided by the ANNIE plugin, and the third (sheffield.creole.example.GoldFish) is provided by the GoldFish plugin.

In order to use a CREOLE resource, the relevant CREOLE plugin must be loaded first. For example, the Gold Fish example loads two plugins as follows:
// Load the GoldFish plugin from the current working directory
Gate.getCreoleRegister().registerDirectories(
    new File(System.getProperty("user.dir")).toURI().toURL());
// Load the ANNIE plugin for the DefaultTokeniser and SentenceSplitter
Gate.getCreoleRegister().registerDirectories(
    new File(Gate.getPluginsHome(), ANNIEConstants.PLUGIN_DIR).toURI().toURL());

Note that all CREOLE resources (i.e., LRs, PRs, and VRs) require the appropriate plugin to be loaded first. The only exceptions are Document, Corpus and DataStore, which can be used without loading a plugin.

In the above statements, we use the registerDirectories() API to load plugins from a given CREOLE directory URL. Note that a CREOLE directory URL should point to the directory that contains the creole.xml file.

When a plugin is loaded, GATE looks for a configuration file called creole.xml relative to the plugin URL. It uses the contents of this file to determine which resources the plugin declares and where to find the classes that implement them (typically these classes are stored in a JAR file in the plugin directory).
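
To make "relative to the plugin URL" concrete, the sketch below applies standard URI resolution to a plugin location. It only demonstrates the generic resolution rule from the Java standard library; the class and method names are invented for the example, and how GATE performs the lookup internally is not shown in this article.

```java
import java.net.URI;

/** Illustrative only: resolves creole.xml against a plugin location. */
class CreoleXmlLocation {

    static URI creoleXmlFor(String pluginLocation) {
        // Standard relative resolution: the last path segment of the base is
        // replaced unless the base ends with a slash.
        return URI.create(pluginLocation).resolve("creole.xml");
    }

    public static void main(String[] args) {
        // With a trailing slash the file is looked up inside the directory:
        // the path is /D:/Gate/Workspace/GoldFish/creole.xml
        System.out.println(creoleXmlFor("file:///D:/Gate/Workspace/GoldFish/").getPath());

        // Without it, "GoldFish" is treated as a file name and replaced:
        // the path is /D:/Gate/Workspace/creole.xml
        System.out.println(creoleXmlFor("file:///D:/Gate/Workspace/GoldFish").getPath());
    }
}
```

This is why it is safest to hand over a directory URL that ends with a slash, which is what File.toURI() produces for an existing directory.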

In the next sections, we will examine the structures of two CREOLE plugins:

  1. ANNIE plugin
  2. GoldFish plugin

ANNIE Plugin

The ANNIE plugin has the following layout:
/plugins/ANNIE/ (i.e., ANNIE's plugin directory)
+-- resources/
    +-- BengaliNE/
    +-- gazetteer/
    +-- heptag/
    +-- NE/
    +-- orthomatcher/
    +-- regex-splitter/
    +-- schema/
    +-- sentenceSplitter/
    +-- tokeniser/
    +-- VP/
+-- build.xml
+-- creole.xml

From creole.xml (i.e., plugin configuration file), we can find the following resources declared:

  • Annotation Schema
  • PRs
    • GATE Unicode Tokeniser
    • ANNIE English Tokeniser
    • ANNIE Gazetteer
    • Sharable Gazetteer
    • Hash Gazetteer
    • Jape Transducer
    • ANNIE NE Transducer
    • ANNIE Sentence Splitter
    • RegEx Sentence Splitter
    • ANNIE POS Tagger
    • ANNIE OrthoMatcher
    • ANNIE Pronominal Coreferencer
    • ANNIE Nominal Coreferencer
    • Document Reset PR
  • VR
    • Jape Viewer

ANNIE is unique in that it is part of the GATE framework: all of its components are implemented in the framework itself (i.e., included in gate.jar), so there is no JAR file in its plugin directory. However, it does provide external resources such as gazetteer lists, JAPE rules and schemas, which are referenced from the CREOLE resource definitions. For example, the GATE Unicode Tokeniser is defined as:

<RESOURCE>
<NAME>GATE Unicode Tokeniser</NAME>
<CLASS>gate.creole.tokeniser.SimpleTokeniser</CLASS>
<COMMENT>A customisable Unicode tokeniser.</COMMENT>
<HELPURL>http://gate.ac.uk/userguide/sec:annie:tokeniser</HELPURL>
<PARAMETER NAME="document"
COMMENT="The document to be tokenised" RUNTIME="true">
gate.Document
</PARAMETER>
<PARAMETER NAME="annotationSetName" RUNTIME="true"
COMMENT="The annotation set to be used for the generated annotations"
OPTIONAL="true">
java.lang.String
</PARAMETER>
<PARAMETER
DEFAULT="resources/tokeniser/DefaultTokeniser.rules"
COMMENT="The URL to the rules file" SUFFIXES="rules"
NAME="rulesURL">
java.net.URL
</PARAMETER>
<PARAMETER DEFAULT="UTF-8"
COMMENT="The encoding used for reading the definitions"
NAME="encoding">
java.lang.String
</PARAMETER>
<ICON>tokeniser</ICON>
</RESOURCE>

Its rulesURL parameter has a default value that points to a rules file stored in the resources subfolder:
resources/tokeniser/DefaultTokeniser.rules
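
To see how such defaults can be read out of a resource definition, here is a minimal sketch that parses an abbreviated copy of the definition above with the JDK's DOM parser. It is not GATE's own loader; the class and method names are invented for the example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Illustrative only: lists the DEFAULT values declared in a RESOURCE definition. */
class ParameterDefaults {

    /** Returns parameter name -> DEFAULT value for every PARAMETER that declares one. */
    static Map<String, String> defaultsOf(String resourceXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(resourceXml.getBytes(StandardCharsets.UTF_8)));
            Map<String, String> defaults = new LinkedHashMap<>();
            NodeList params = doc.getElementsByTagName("PARAMETER");
            for (int i = 0; i < params.getLength(); i++) {
                Element param = (Element) params.item(i);
                if (param.hasAttribute("DEFAULT")) {
                    defaults.put(param.getAttribute("NAME"), param.getAttribute("DEFAULT"));
                }
            }
            return defaults;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse resource definition", e);
        }
    }

    public static void main(String[] args) {
        // Abbreviated version of the GATE Unicode Tokeniser definition.
        String xml = "<RESOURCE>"
                + "<PARAMETER NAME=\"document\" RUNTIME=\"true\">gate.Document</PARAMETER>"
                + "<PARAMETER NAME=\"rulesURL\" DEFAULT=\"resources/tokeniser/DefaultTokeniser.rules\">"
                + "java.net.URL</PARAMETER>"
                + "<PARAMETER NAME=\"encoding\" DEFAULT=\"UTF-8\">java.lang.String</PARAMETER>"
                + "</RESOURCE>";
        // prints {rulesURL=resources/tokeniser/DefaultTokeniser.rules, encoding=UTF-8}
        System.out.println(defaultsOf(xml));
    }
}
```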

Gold Fish Plugin

The GoldFish plugin has the following layout:
GoldFish/ (i.e., GoldFish's plugin directory)
+-- build.xml
+-- build.properties
+-- creole.xml
+-- GoldFish.jar

The class "sheffield.creole.example.GoldFish" in GoldFish.jar provides the implementation of the new PR. Because this PR doesn't need any gazetteer lists or rules, its resources folder is empty. Its creole.xml is as simple as:
<CREOLE-DIRECTORY>
<JAR SCAN="true">GoldFish.jar</JAR>
</CREOLE-DIRECTORY>

This tells GATE to load GoldFish.jar and scan its contents looking for resource classes annotated with @CreoleResource.
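
The idea of annotation-driven discovery can be sketched in plain Java. The example below uses a hypothetical stand-in annotation (DemoResource) rather than GATE's real @CreoleResource, and it checks classes that are passed in directly instead of scanning a JAR, so treat it as an illustration of the principle only.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: finds classes that carry a marker annotation. */
class AnnotationScanDemo {

    /** Hypothetical stand-in for GATE's @CreoleResource, defined just for this sketch. */
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.TYPE)
    @interface DemoResource {
        String name();
    }

    @DemoResource(name = "GoldFish")
    static class GoldFishLike { }

    static class PlainHelper { }

    /** Returns the declared names of all candidates that carry the marker annotation. */
    static List<String> scan(Class<?>... candidates) {
        List<String> found = new ArrayList<>();
        for (Class<?> candidate : candidates) {
            DemoResource marker = candidate.getAnnotation(DemoResource.class);
            if (marker != null) {
                found.add(marker.name());
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // Only the annotated class is reported; prints [GoldFish]
        System.out.println(scan(GoldFishLike.class, PlainHelper.class));
    }
}
```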

Configuration Data

Configuration data for the resources may be stored directly in the creole.xml file, or it may be stored as Java annotations on the resource classes themselves; in either case GATE retrieves this configuration information and adds the resource definitions to the CREOLE register. When a user requests an instantiation of a resource, GATE creates an instance of the resource class in the virtual machine.
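
The register-then-instantiate flow described above can be modelled with a few lines of plain Java. This is a toy model under stated assumptions, not GATE's CreoleRegister or Factory: ToyRegister, Resource and GoldFishLike are all hypothetical names invented for the sketch.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of a resource register; not GATE's actual implementation. */
class ToyRegister {

    /** Hypothetical stand-in for a resource interface. */
    interface Resource {
        String describe();
    }

    static class GoldFishLike implements Resource {
        public String describe() {
            return "GoldFish-like resource";
        }
    }

    // Resource definitions, keyed by class name, as added when a plugin is loaded.
    private final Map<String, Class<? extends Resource>> definitions = new HashMap<>();

    void register(Class<? extends Resource> type) {
        definitions.put(type.getName(), type);
    }

    /** Creates a new instance of a registered resource class. */
    Resource create(String className) {
        Class<? extends Resource> type = definitions.get(className);
        if (type == null) {
            // Mirrors the situation where the plugin was never loaded.
            throw new IllegalStateException("No plugin has registered " + className);
        }
        try {
            return type.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Could not instantiate " + className, e);
        }
    }

    public static void main(String[] args) {
        ToyRegister register = new ToyRegister();
        register.register(GoldFishLike.class);
        System.out.println(register.create(GoldFishLike.class.getName()).describe());
    }
}
```

The model also shows why a resource cannot be created before its plugin is loaded: without a definition in the register there is nothing to instantiate.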

To learn more on creole.xml, read this section of GATE's user guide. To learn more on Java annotations, read this section.

Friday, March 4, 2011

How to Create a Standalone Application Using GATE Embedded

General Architecture for Text Engineering or GATE is a Java suite of tools originally developed at the University of Sheffield beginning in 1995 and now used worldwide by a wide community of scientists, companies, teachers and students for all sorts of natural language processing tasks, including information extraction in many languages.

Non-programmers can use GATE Developer for all of their NLP tasks. Programmers can use GATE Embedded to embed its language processing functionality in their own applications. In this article, we will demonstrate how to use GATE in a standalone application named GoldFish.

Gold Fish Example

Andrew Golightly has noticed a lot of programmers having problems running GATE outside of the GUI (i.e., wanting to run it as a standalone program).

So, he has written a sample program which essentially implements the "Goldfish" example in the User Guide for GATE (see http://www.gate.ac.uk/sale/tao/index.html#x1-220002.7).

This program counts the number of times the word "Goldfish" appears in a sentence. It uses three Processing Resources (PRs) to achieve that:
  • DefaultTokeniser
  • SentenceSplitter
  • GoldFish
The first two PRs are provided by the ANNIE plugin, while the third is a new PR provided by this sample program. The sample program was created in 2003 and is a bit dated. This article tries to fill in the gaps and show what changes are needed to the original program provided by Andrew.
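
To give a feel for what the pipeline computes, here is a rough plain-Java approximation of the per-sentence count. It does not use GATE at all: the JDK's BreakIterator stands in for the sentence splitter, a regular expression stands in for the tokeniser, and the case-insensitive matching is an assumption made for this sketch.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Rough stand-in for the example's logic; does not use GATE. */
class GoldfishCountSketch {

    // Whole-word match; ignoring case is an assumption of this sketch.
    private static final Pattern GOLDFISH =
            Pattern.compile("\\bgoldfish\\b", Pattern.CASE_INSENSITIVE);

    /** Returns the number of "goldfish" occurrences in each sentence of the text. */
    static List<Integer> countPerSentence(String text) {
        List<Integer> counts = new ArrayList<>();
        BreakIterator sentences = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        sentences.setText(text);
        int start = sentences.first();
        for (int end = sentences.next(); end != BreakIterator.DONE;
                start = end, end = sentences.next()) {
            Matcher matcher = GOLDFISH.matcher(text.substring(start, end));
            int count = 0;
            while (matcher.find()) {
                count++;
            }
            counts.add(count);
        }
        return counts;
    }

    public static void main(String[] args) {
        String text = "I have a goldfish. Goldfish like goldfish food. Cats do not.";
        System.out.println(countPerSentence(text));
    }
}
```

In the GATE version, the tokeniser and sentence splitter produce annotations on the document, and the GoldFish PR works on those annotations instead of on raw strings.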

BootStrap Wizard

To create a new PR you need to:
  • Write a Java class (i.e., GoldFish.java) that implements GATE’s beans model
  • Compile the class, and any others that it uses, into a Java Archive (JAR) file
  • Write some XML configuration data (i.e., creole.xml) for the new resource
  • Tell GATE the URL of the new JAR and XML files.
GATE Developer helps you with this process by creating a set of directories and files that implement a basic resource, including a Java source file and an Ant build file. This process is called ‘bootstrapping’. To bootstrap:
  • Start up GATE Developer
  • Start up BootStrap Wizard (Tools > BootStrap Wizard)
  • Fill in the information as shown below

A new folder named GoldFish is created as follows:

GoldFish/
+-- classes/
+-- src/
      put all your Java sources in here.
+-- lib/
+-- resources/
      any external files used by your plugin (e.g. configuration files,
      JAPE grammars, gazetteer lists, etc.) go in here.
+-- build.xml
      Ant build file for building your plugin.
+-- build.properties
      property definitions that control the build process go in here,
      in particular, make sure that gate.home points to your copy of GATE.
+-- creole.xml
      plugin configuration file for GATE - edit this to add parameters, etc.,
      for your resources.


In my environment, I need to update the gate.home property in build.properties to point to my GATE installation:
gate.home=D:/Gate/gate-6.0-build3764-BIN
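
If you want to double-check the setting, the value can be read with java.util.Properties, which uses the standard properties file syntax. The helper below is a hypothetical sanity check written for this article, not something the BootStrap Wizard generates.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Illustrative sanity check for the gate.home entry of build.properties. */
class BuildPropertiesCheck {

    /** Returns the gate.home value found in the given properties text. */
    static String gateHome(String propertiesText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        String home = props.getProperty("gate.home");
        if (home == null || home.trim().isEmpty()) {
            throw new IllegalStateException("gate.home is not set");
        }
        return home.trim();
    }

    public static void main(String[] args) {
        // Forward slashes are used on purpose: in a properties file a
        // backslash is an escape character and would be dropped from the path.
        String text = "gate.home=D:/Gate/gate-6.0-build3764-BIN\n";
        System.out.println(gateHome(text));
    }
}
```

For a real check you would load the build.properties file itself and also verify that the returned path is an existing directory.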

Eclipse


Next, we use the Eclipse IDE to develop the project. Proceed as follows:

  • Start up Eclipse
  • Bring up New Project Dialog (File > New > Project)
  • Select "Java Project from Existing Ant Buildfile" wizard
  • Specify GoldFish as your project name
  • Select GoldFish/build.xml as your Ant buildfile
  • Click Finish

A new project named GoldFish is created. Under the src folder, there is a file named GoldFish.java in a package named sheffield.creole.example, which was created by the BootStrap Wizard.

Now, copy the two source files of the example from the GATE example code repository and put them in src/sheffield/creole/example. Note that you need to fix up the package name (from andrewgolightly.nlp.gate to sheffield.creole.example) and the class name (from Goldfish to GoldFish). Your src/sheffield/creole/example folder then looks like this:

src/
+-- sheffield/
    +-- creole/
        +-- example/
            +-- GoldFish.java
            +-- TotalGoldfishCount.java


Before we proceed, we need to add two statements after Gate.init() in TotalGoldfishCount.java:

// Register the GoldFish plugin (the current working directory)
// File.toURL() is deprecated, so go through toURI() first
Gate.getCreoleRegister().registerDirectories(
    new File(System.getProperty("user.dir")).toURI().toURL());
// Register the ANNIE plugin for the DefaultTokeniser and SentenceSplitter
Gate.getCreoleRegister().registerDirectories(
    new File(Gate.getPluginsHome(), ANNIEConstants.PLUGIN_DIR).toURI().toURL());

Without these statements, you'll see the following ResourceInstantiationException:

gate.creole.ResourceInstantiationException: Couldn't get ...

Next, edit the Run/Debug Settings in the project properties to add a single program argument:

testFile.txt

This is our input document to be processed by the GATE pipeline. You can copy it from here.

Now we need to compile the class and package it into a JAR file. The BootStrap Wizard creates an Ant build file that makes this very easy. As long as you have Ant set up properly, you can simply run
ant jar

from the command line. This compiles the Java source code and packages the resulting classes into GoldFish.jar.

Finally, you can run TotalGoldfishCount by right-clicking it in the Package Explorer (TotalGoldfishCount.java > Run As > Java Application). If everything is set up correctly, you should see the following output in the console:

== OBTAINING DOCUMENTS ==
1) testFile.txt -- success
== USING GATE TO PROCESS THE DOCUMENTS ==
* Loading gate.creole.tokeniser.DefaultTokeniser ... done
* Loading gate.creole.splitter.SentenceSplitter ... done
* Loading sheffield.creole.example.GoldFish ... done
Creating corpus from documents obtained...done
Running processing resources over corpus...done
== DOCUMENT FEATURES ==
The features of document "/D:/Gate/GoldFishExample/GoldFish/testFile.txt" are:
*) Number of tokens --> 56
*) Total "Goldfish" count --> 9
*) Number of words --> 46
*) Number of characters --> 322
*) Number of sentences --> 7

Demo done... :)