Recognizes the document language using TextCat. Possible languages: german, english, french, spanish, italian, swedish, polish, dutch, norwegian, finnish, albanian, slovakian, slovenian, danish, hungarian.
There are two phases in the language identification task: a training phase and an application phase.
We'll discuss each of them in the following sections.
In the training phase, the goal is to generate category profiles from the given category samples. In the Language Identification PR (or TextCat PR), the categories are languages. So, we take document samples from different languages (e.g., English, German) and use them to generate category profiles.
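To make the idea of a category profile concrete, here is a minimal sketch of how one could be built from a text sample. It is not the code inside TextCat PR: the function name `build_profile`, the `top_k` cutoff, and the whitespace handling are assumptions made for illustration. The only details taken from this post are that N runs from 1 to 5, that a space is represented by '_', and that N-grams are ranked by frequency.

```python
from collections import Counter
import re

def build_profile(text, max_n=5, top_k=400):
    """Return (ngram, count) pairs ranked by descending count.

    Hypothetical helper: top_k and the padding rule are assumptions,
    not values read from the TextCat PR sources.
    """
    # Represent each run of whitespace as a single '_' and pad both ends
    normalized = "_" + re.sub(r"\s+", "_", text.strip()) + "_"
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(normalized) - n + 1):
            counts[normalized[i:i + n]] += 1
    # Sort by descending count; break ties alphabetically for stable output
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_k]

# Tiny made-up samples; real training data would be much larger
samples = {
    "english": "the cat and the hat",
    "german": "der Hund und die Katze",
}
profiles = {language: build_profile(text) for language, text in samples.items()}
```

With real training samples, each entry of `profiles` would play the role of one category profile.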
These category profiles are already provided in TextCat PR. At runtime, TextCat PR looks for a configuration file named textcat.conf. This file has the following content:
In a sub-folder named language_fp, located relative to textcat.conf, there are multiple category profile files with the .lm suffix. For example, german.lm is the category profile for German and english.lm is the category profile for English.
Using the English profile as an example, its content looks like this:
On each line, there are two elements:
- N-gram (N is from 1 to 5)
- Count of occurrences of that N-gram in the training data
N-grams are sorted in descending order of frequency. For example, the most frequently found character in English documents is the space character (represented by '_'), whose count of occurrences is 20326. From the training data, we also find that the most frequently found 2-gram is 'e_' (i.e., the letter 'e' followed by a space).
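Reading such a profile back is straightforward. The sketch below assumes that the two elements on each line are separated by whitespace (which is safe only because spaces inside N-grams are written as '_'); the helper name `parse_profile` is hypothetical and this is not how TextCat PR itself loads the files.

```python
def parse_profile(lines):
    """Parse profile lines into (ngram, count) pairs, keeping file order."""
    entries = []
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        ngram, count = parts
        entries.append((ngram, int(count)))
    return entries

# Only the '_' line reflects a value quoted in this post;
# the other two lines are there to show that bad input is skipped.
sample_lines = ["_\t20326", "", "not a valid line"]
entries = parse_profile(sample_lines)

# Because the file is sorted by descending count, the position in the
# list is the rank of the N-gram (0 = most frequent).
ranks = {ngram: rank for rank, (ngram, _count) in enumerate(entries)}
```

To read a real profile, the same function can be given an open file object, since iterating over a file yields its lines.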
In the application phase, the TextCat PR reads the learned model (i.e., the category profiles) and then applies the model to the data. Given a new document, we first generate a document profile (i.e., an N-gram frequency profile) similar to the category profiles.
The language classification task is then to measure profile distance: for each N-gram in the document profile, we find its counterpart in the category profile and then calculate how far out of place it is.
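The "out of place" measure can be sketched as follows, with each profile given as a list of N-grams ordered from most to least frequent. The post does not say what happens when an N-gram from the document is missing from the category profile, so the fixed penalty used here (the length of the category profile by default) is an assumption, as are the function name and the toy profiles.

```python
def out_of_place_distance(document_ngrams, category_ngrams, max_penalty=None):
    """Sum of rank differences between a document and a category profile."""
    if max_penalty is None:
        # Assumed penalty for N-grams absent from the category profile
        max_penalty = len(category_ngrams)
    category_rank = {ngram: rank for rank, ngram in enumerate(category_ngrams)}
    distance = 0
    for document_rank, ngram in enumerate(document_ngrams):
        if ngram in category_rank:
            # How far out of place this N-gram is between the two rankings
            distance += abs(document_rank - category_rank[ngram])
        else:
            distance += max_penalty
    return distance

# Toy profiles, made up for illustration
document = ["_", "e", "t", "e_"]
category = ["_", "e_", "e", "a"]
distance = out_of_place_distance(document, category)
```

In this toy case, '_' is in place (0), 'e' is one position off (1), 't' is missing (penalty 4), and 'e_' is two positions off (2), so `distance` is 7.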
Finally, the bubble labelled "Find Minimum Distance" simply takes the distance measures from all of the category profiles to the document profile, and picks the smallest one.
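That final step amounts to taking a minimum over the per-language distances. A minimal sketch, where the function name is hypothetical and the distance values are made up purely for illustration:

```python
def closest_language(distances):
    """Return the language whose category profile is nearest to the document."""
    return min(distances, key=distances.get)

# Made-up distances from one document profile to three category profiles
distances = {"english": 1200, "german": 1850, "french": 1720}
best = closest_language(distances)  # "english" has the smallest distance
```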
What's in TextCat PR?
If you look inside the textcat-1.0.1.jar, you can identify the following structure:
Unfortunately, the source files above are not included in GATE's downloads. However, after a Google search, I found them on Google Code here.