Xml and More: GATE Plugins


Wednesday, July 13, 2011

Language Identification

Language identification is a supervised learning task. In this article, we will cover a specific Processing Resource (PR) in GATE: the TextCat (or Language Identification) PR. According to its documentation, it:

Recognizes the document language using TextCat. Possible languages: german, english, french, spanish, italian, swedish, polish, dutch, norwegian, finnish, albanian, slovakian, slovenian, danish, hungarian.

N-Gram-Based Text Categorization

The TextCat PR uses N-grams for text categorization. You can find the details in this article. See the following diagram for its data flow.

There are two phases in the language identification task:

Training
Application

We'll discuss those in the following sections.

Training Phase

In the training phase, the goal is to generate category profiles from the given category samples. In the Language Identification PR (or TextCat PR), the categories are languages. So, we take document samples from different languages (e.g., English, German, etc.) and use them to generate category profiles.

These category profiles are already provided with the TextCat PR. At runtime, the TextCat PR looks for a configuration file named textcat.conf. This file has the following content:

language_fp/german.lm    german
language_fp/english.lm    english
language_fp/french.lm    french
language_fp/spanish.lm    spanish
language_fp/italian.lm    italian
language_fp/swedish.lm    swedish
language_fp/polish.lm    polish
language_fp/dutch.lm    dutch
language_fp/norwegian.lm    norwegian
language_fp/finnish.lm    finnish
language_fp/albanian.lm    albanian
language_fp/slovak-ascii.lm    slovakian
language_fp/slovenian-ascii.lm    slovenian
language_fp/danish.lm    danish
language_fp/hungarian.lm    hungarian

In a sub-folder named language_fp, located relative to textcat.conf, there are multiple category profile files with the .lm suffix. For example, german.lm is the category profile for German and english.lm is the category profile for English.
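To make the mapping concrete, here is a minimal Java sketch that reads lines in the textcat.conf format into a language-to-profile table and resolves a profile path relative to the configuration file. The class and method names (TextCatConfig, parse, resolveProfile) are made up for this post and are not part of TextCat. The sketch assumes every useful line holds exactly two whitespace-separated fields, as in the listing above, and that the files live on the file system rather than inside a jar.

```java
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not part of the TextCat library.
final class TextCatConfig {

    /** Maps each language name to the relative path of its profile file. */
    static Map<String, String> parse(List<String> lines) {
        Map<String, String> profiles = new LinkedHashMap<>();
        for (String line : lines) {
            String[] parts = line.trim().split("\\s+");
            // Keep only lines of the form "<profile path> <language>"
            if (parts.length == 2) {
                profiles.put(parts[1], parts[0]);
            }
        }
        return profiles;
    }

    /** Resolves a profile path against the folder that contains textcat.conf. */
    static Path resolveProfile(Path confFile, String relativeProfile) {
        return confFile.resolveSibling(relativeProfile);
    }
}
```

With the configuration shown above, parse(...).get("slovakian") would return language_fp/slovak-ascii.lm, which shows that the language name and the file name do not have to match.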

The English profile, for example, looks like this:

_     20326
e     6617
t     4843
o     3834
n     3653
i     3602
a     3433
s     2945
r     2921
h     2507
e_     2000
d     1816
_t     1785
c     1639
l     1635
th     1535
he     1351
_th     1333
...

On each line, there are two elements:

N-gram (N is from 1 to 5)
Frequency

N-grams are sorted in descending order of frequency. For example, the most frequent character in the English training documents is the space character (represented by '_'), which occurs 20326 times. From the training data, we can also see that the most frequent 2-gram is 'e_' (i.e., the letter 'e' followed by a space).
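As a rough illustration of how such a profile could be consumed, here is a minimal Java sketch that turns the lines of an .lm file into a rank table, where rank 0 is the most frequent N-gram. The names ProfileReader and parseRanks are made up for this post and are not part of TextCat. The sketch assumes each line holds an N-gram and a count separated by whitespace, and it relies on the file order instead of re-sorting by count.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not part of the TextCat library.
final class ProfileReader {

    /** Maps each N-gram to its rank (0 = most frequent), following file order. */
    static Map<String, Integer> parseRanks(List<String> lines) {
        Map<String, Integer> ranks = new LinkedHashMap<>();
        for (String line : lines) {
            String[] parts = line.trim().split("\\s+");
            // Skip blank or malformed lines (anything that is not "<ngram> <count>")
            if (parts.length != 2) {
                continue;
            }
            // The first occurrence of an N-gram decides its rank
            ranks.putIfAbsent(parts[0], ranks.size());
        }
        return ranks;
    }
}
```

Only the rank of each N-gram is kept here, because the distance measure used in the application phase compares positions in the sorted lists rather than raw counts.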

Application Phase

In the application phase, the TextCat PR reads the learned model (i.e., the category profiles) and then applies the model to the data. Given a new document, we first generate a document profile (i.e., an N-gram frequency profile) similar to the category profiles.
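Here is a minimal Java sketch of how a document profile could be built. It is a simplification and not TextCat's actual code: it assumes that words are runs of letters, that each word is padded with '_' on both sides, and that no case folding is applied. The names DocumentProfile, countNGrams and rankedNGrams are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; the real tokenisation may differ.
final class DocumentProfile {

    /** Counts every N-gram (N = 1..5) of each '_'-padded word in the text. */
    static Map<String, Integer> countNGrams(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : text.split("[^\\p{L}]+")) {
            if (word.isEmpty()) {
                continue;
            }
            String padded = "_" + word + "_";
            for (int n = 1; n <= 5; n++) {
                for (int i = 0; i + n <= padded.length(); i++) {
                    counts.merge(padded.substring(i, i + n), 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    /** Returns the N-grams sorted by descending count (ties broken alphabetically). */
    static List<String> rankedNGrams(String text) {
        Map<String, Integer> counts = countNGrams(text);
        List<String> ngrams = new ArrayList<>(counts.keySet());
        ngrams.sort((a, b) -> {
            int byCount = Integer.compare(counts.get(b), counts.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return ngrams;
    }
}
```

For the single word "the", the padded form is "_the_", so '_' is counted twice and every other N-gram once.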

The language classification task is then to measure profile distance: For each N-gram in the document profile, we find its counterpart in the category profile, and then calculate how far out of place it is.

Finally, the bubble labelled "Find Minimum Distance" simply takes the distance measures from all of the category profiles to the document profile, and picks the smallest one.
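The two steps above can be sketched in Java as follows. This is an illustration under stated assumptions, not TextCat's implementation: the document profile is given as a list of N-grams ordered from most to least frequent, each category profile is a table from N-gram to rank, and an N-gram missing from a category profile is charged a fixed maximum penalty (the maxPenalty parameter, whose value is up to the caller). The names OutOfPlace, distance and closest are hypothetical.

```java
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not part of the TextCat library.
final class OutOfPlace {

    /** Sums how far each document N-gram is out of place in the category profile. */
    static int distance(List<String> documentRanking,
                        Map<String, Integer> categoryRanks,
                        int maxPenalty) {
        int total = 0;
        for (int docRank = 0; docRank < documentRanking.size(); docRank++) {
            Integer catRank = categoryRanks.get(documentRanking.get(docRank));
            // N-grams unknown to the category get the fixed maximum penalty
            total += (catRank == null) ? maxPenalty : Math.abs(catRank - docRank);
        }
        return total;
    }

    /** Returns the category whose profile has the smallest distance to the document. */
    static String closest(List<String> documentRanking,
                          Map<String, Map<String, Integer>> categoryProfiles,
                          int maxPenalty) {
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (Map.Entry<String, Map<String, Integer>> entry : categoryProfiles.entrySet()) {
            int d = distance(documentRanking, entry.getValue(), maxPenalty);
            if (d < bestDistance) {
                bestDistance = d;
                best = entry.getKey();
            }
        }
        return best;
    }
}
```

For example, if an N-gram is at rank 1 in the document profile and at rank 2 in the category profile, it contributes 1 to the distance for that category.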

What's in TextCat PR?

If you look inside the textcat-1.0.1.jar, you can identify the following structure:

org/
+-- knallgrau/
    +-- utils/
        +-- textcat/
            +-- FingerPrint.java
            +-- MyProperties.java
            +-- NGramEntryComparator.java
            +-- TextCategorizer.java
            +-- textcat.conf
            +-- language_fp/
                +-- english.lm
                +-- german.lm
                +-- ...

Unfortunately, these source files are not included in GATE's downloads. However, after a Google search, I found them on Google Code here.

Wednesday, March 9, 2011

GATE Plugins and CREOLE Resources

GATE (or General Architecture for Text Engineering) is very extensible. Its architecture is based on components (or resources). Its framework functions as a backplane into which users can plug components.

Each component (i.e., a Java Bean) is a reusable chunk of software with well-defined interfaces that may be deployed in a variety of contexts. You can define applications with processing pipelines using these reusable components. In GATE, these resources are officially named CREOLE (i.e., Collection of REusable Objects for Language Engineering).

A set of components plus the framework is a deployment unit which can be embedded in users' applications.

CREOLE Resources

GATE components are one of three types:

Language Resources (LRs) represent entities such as lexicons (e.g., WordNet), corpora or ontologies
Processing Resources (PRs) represent entities that are primarily algorithmic, such as parsers, generators or n-gram modellers
Visual Resources (VRs) represent visualisation and editing components that participate in GUIs

To better organize CREOLE resources, CREOLE plugins are used. In other words, resource implementations can be grouped together as ‘plugins’ and stored at a URL. When the resources are stored in the local file system, this can be a file URL (e.g., file:///D:/Gate/Workspace/GoldFish/).

CREOLE Plugins

To create a CREOLE plugin, you lay out its contents in a directory. The directory can hold a jar with the resource implementations, a configuration file (i.e., creole.xml), and external resources such as rules, gazetteer lists, schemas, etc., in a resources folder.

To create one, you can use the BootStrap Wizard in GATE Developer. For example, we create a new plugin with a single Processing Resource named GoldFish as shown below:

The following files and directories are created:

GoldFish/

+-- src/
  put all your Java sources in here.
+-- resources/
  any external files used by your plugin (e.g. configuration files,
  JAPE grammars, gazetteer lists, etc.) go in here.
+-- build.xml
  Ant build file for building your plugin.
+-- build.properties
  property definitions that control the build process go in here,
  in particular, make sure that gate.home points to your copy of GATE.
+-- creole.xml
   plugin configuration file for GATE - edit this to add parameters, etc.,
   for your resources.

Using CREOLE Resources

In applications using GATE Embedded, you can construct an information extraction (or IE) pipeline using CREOLE resources from different CREOLE plugins. For example, the Gold Fish example constructs a pipeline (i.e., a SerialAnalyserController) using three different PRs:

import gate.Factory;
import gate.LanguageAnalyser;
import gate.creole.SerialAnalyserController;

// Class names of the PRs to run, in pipeline order
String[] processingResources = {
    "gate.creole.tokeniser.DefaultTokeniser",
    "gate.creole.splitter.SentenceSplitter",
    "sheffield.creole.example.GoldFish"};
SerialAnalyserController pipeline = (SerialAnalyserController) Factory
        .createResource("gate.creole.SerialAnalyserController");

// Instantiate each PR and append it to the pipeline
for (int pr = 0; pr < processingResources.length; pr++) {
    System.out.print("\t* Loading " + processingResources[pr] + " ... ");
    pipeline.add((LanguageAnalyser) Factory
            .createResource(processingResources[pr]));
}

Two of them are provided by the ANNIE plugin and the third one (i.e., sheffield.creole.example.GoldFish) is provided by the GoldFish plugin.

In order to use a CREOLE resource, the relevant CREOLE plugin must be loaded. For example, the Gold Fish example loads two plugins as follows:

import java.io.File;
import gate.Gate;
import gate.creole.ANNIEConstants;

// Load the GoldFish plugin (its creole.xml is expected in the current working directory)
Gate.getCreoleRegister().registerDirectories(
        new File(System.getProperty("user.dir")).toURI().toURL());
// Load the ANNIE plugin for the DefaultTokeniser and SentenceSplitter
Gate.getCreoleRegister().registerDirectories(
        new File(Gate.getPluginsHome(), ANNIEConstants.PLUGIN_DIR).toURI().toURL());

Note that all CREOLE resources (i.e., LRs, PRs, and VRs) require that the appropriate plugin be loaded first. The only exceptions are Document, Corpus and DataStore. For those, you do not need to load a plugin first.

In the above statements, we use the registerDirectories() API to load plugins from a given CREOLE directory URL. Note that a CREOLE directory URL should point to the directory that contains the creole.xml file.

When a plugin is loaded into GATE it looks for a configuration file called creole.xml relative to the plugin URL and uses the contents of this file to determine what resources this plugin declares and where to find the classes that implement the resource types (typically these classes are stored in a JAR file in the plugin directory).
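To illustrate what "relative to the plugin URL" means, here is a small Java sketch that derives the location of creole.xml from a plugin URI using only the standard library. It is not GATE code, and the names PluginLocator and creoleXml are made up for this post. The one detail worth noting is the trailing slash: without it, standard URI resolution would replace the last path segment instead of appending to it, so the sketch adds the slash when it is missing.

```java
import java.net.URI;

// Hypothetical helper for illustration only; GATE does its own resolution.
final class PluginLocator {

    /** Returns the URI of creole.xml inside the given plugin directory URI. */
    static URI creoleXml(URI pluginUri) {
        String base = pluginUri.toString();
        // Make sure the base denotes a directory before resolving against it
        URI directory = URI.create(base.endsWith("/") ? base : base + "/");
        return directory.resolve("creole.xml");
    }
}
```

For the plugin URL file:///D:/Gate/Workspace/GoldFish/, the resolved path is /D:/Gate/Workspace/GoldFish/creole.xml.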

In the next sections, we will examine the structures of two CREOLE plugins:

ANNIE plugin
GoldFish plugin

ANNIE Plugin

The ANNIE plugin has the following layout:

/plugins/ANNIE/ (i.e., ANNIE's plugin directory)
+-- resources/
  +-- BengaliNE/
  +-- gazetteer/
  +-- heptag/
  +-- NE/
  +-- orthomatcher/
  +-- regex-splitter/
  +-- schema/
  +-- sentenceSplitter/
  +-- tokeniser/
  +-- VP/
+-- build.xml
+-- creole.xml

From creole.xml (i.e., plugin configuration file), we can find the following resources declared:

Annotation Schema
PRs
- GATE Unicode Tokeniser
- ANNIE English Tokeniser
- ANNIE Gazetteer
- Sharable Gazetteer
- Hash Gazetteer
- Jape Transducer
- ANNIE NE Transducer
- ANNIE Sentence Splitter
- RegEx Sentence Splitter
- ANNIE POS Tagger
- ANNIE OrthoMatcher
- ANNIE Pronominal Coreferencer
- ANNIE Nominal Coreferencer
- Document Reset PR
VR
- Jape Viewer

ANNIE is unique in that it's part of the GATE framework. So, all of its components are implemented in the framework (i.e., included in gate.jar). Therefore, it doesn't have a jar file in its directory. However, it does provide external resources like gazetteer lists, JAPE rules, schemas, etc. These resources are referenced from the CREOLE resource definitions. For example, GATE Unicode Tokeniser is defined as:

<RESOURCE>
<NAME>GATE Unicode Tokeniser</NAME>
<CLASS>gate.creole.tokeniser.SimpleTokeniser</CLASS>
<COMMENT>A customisable Unicode tokeniser.</COMMENT>
<HELPURL>http://gate.ac.uk/userguide/sec:annie:tokeniser</HELPURL>
<PARAMETER NAME="document"
 COMMENT="The document to be tokenised" RUNTIME="true">
 gate.Document
</PARAMETER>
<PARAMETER NAME="annotationSetName" RUNTIME="true"
 COMMENT="The annotation set to be used for the generated annotations"
 OPTIONAL="true">
 java.lang.String
</PARAMETER>
<PARAMETER
 DEFAULT="resources/tokeniser/DefaultTokeniser.rules"
 COMMENT="The URL to the rules file" SUFFIXES="rules"
 NAME="rulesURL">
 java.net.URL
</PARAMETER>
<PARAMETER DEFAULT="UTF-8"
 COMMENT="The encoding used for reading the definitions"
 NAME="encoding">
 java.lang.String
</PARAMETER>
<ICON>tokeniser</ICON>
</RESOURCE>

Its rulesURL parameter has a default value which points to a rule file stored in the resources subfolder:
resources/tokeniser/DefaultTokeniser.rules
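If you want to inspect such defaults programmatically, the following Java sketch lists every PARAMETER that declares a DEFAULT attribute, using the JDK's built-in DOM parser. It is only an illustration of the file format shown above and not how GATE itself reads creole.xml; the names CreoleXmlReader and parameterDefaults are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper for illustration only; not part of GATE.
final class CreoleXmlReader {

    /** Maps parameter NAME to DEFAULT for each PARAMETER that has a DEFAULT attribute. */
    static Map<String, String> parameterDefaults(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse the XML", e);
        }
        Map<String, String> defaults = new LinkedHashMap<>();
        NodeList parameters = doc.getElementsByTagName("PARAMETER");
        for (int i = 0; i < parameters.getLength(); i++) {
            Element parameter = (Element) parameters.item(i);
            if (parameter.hasAttribute("DEFAULT")) {
                defaults.put(parameter.getAttribute("NAME"),
                        parameter.getAttribute("DEFAULT"));
            }
        }
        return defaults;
    }
}
```

Applied to the definition above, the result would contain rulesURL and encoding, but not document or annotationSetName, since those two have no default value.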

Gold Fish Plugin

The GoldFish plugin has the following layout:

GoldFish/ (i.e., GoldFish's plugin directory)
+-- build.xml
+-- build.properties
+-- creole.xml
+-- GoldFish.jar

The class "sheffield.creole.example.GoldFish" in GoldFish.jar provides the implementation of the new PR. Because this PR doesn't need any gazetteer list or rules, it has an empty resources folder. In its creole.xml, the content is as simple as:

<CREOLE-DIRECTORY>
<JAR SCAN="true">GoldFish.jar</JAR>
</CREOLE-DIRECTORY>

This tells GATE to load GoldFish.jar and scan its contents looking for resource classes annotated with @CreoleResource.

Configuration Data

Configuration data for the resources may be stored directly in the creole.xml file, or it may be stored as Java annotations on the resource classes themselves; in either case GATE retrieves this configuration information and adds the resource definitions to the CREOLE register. When a user requests an instantiation of a resource, GATE creates an instance of the resource class in the virtual machine.

To learn more about creole.xml, read this section of GATE's user guide. To learn more about the Java annotations, read this section.
