Sunday, December 4, 2011

Volume Rendering using MapReduce

Reference [1] gives a few examples of applications that can be easily expressed as MapReduce computations:
  1. Distributed Grep
  2. Count of URL Access Frequency
  3. Reverse Web-Link Graph
  4. Term-Vector per Host
  5. Inverted Index
There are many other applications that can be easily expressed in the MapReduce programming model. In this article, we will show one more: Volume Rendering.

Applying MapReduce

To determine whether MapReduce might be a suitable solution to a concurrent programming problem, here are the questions to ask:
  • Does the algorithm break down into two separate phases (i.e., Map and Reduce)?
  • Can the data be easily decomposed into equal-size partitions in the first phase (i.e., Map)?
  • Can the same processing be applied to each partition, with no dependencies in the computations, and no communication required between tasks in the first phase?
  • Is there some “mapping” of data to keys involved?
  • Can you “reduce” the results of the first phase to compute the final answer(s)?
    If all the answers are yes, you have an ideal candidate for the MapReduce computation. In [2], Jeff A. Stuart et al. have demonstrated a multi-GPU parallel volume rendering implementation built using the MapReduce programming model.

    Volume Rendering

In [2], Jeff A. Stuart et al. used a volume rendering technique called segmented ray casting [5] (or ray partitioning [6]).

In [3,4], my colleagues and I demonstrated an alternative parallel implementation of volume rendering on Denali. In Fig. 1, we see that sample points along the rays that are the same distance from the image plane lie in the same plane. So, instead of casting rays, we can equally well sample the volume perpendicular to the viewing direction at different distances from the image plane. This parallelization scheme is called parallel plane cutting.

    Figure 1. Parallel plane cutting vs. segmented ray casting

    In this article, I'll explore the possibility of adapting parallel plane cutting to MapReduce computation.

    MapReduce Basics[7,8]

    MapReduce is an algorithmic framework, like divide-and-conquer or backtracking. Its model derives from the map and reduce combinators from a functional language like Lisp. It is an abstraction that allows Google engineers to perform simple computations while hiding the details of:
    • Parallelization
    • Data distribution
    • Load balancing
    • Fault tolerance
A MapReduce job is a unit of work that the client wants performed; it consists of:
    • Input data
    • MapReduce program
    • Configuration information
    The user configures and submits a MapReduce job to the framework (e.g., Hadoop), which will decompose the job into a set of map tasks, shuffles, a sort, and a set of reduce tasks. The framework will then manage the distribution and execution of the tasks, collect the output, and report the status to the user.
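To make this concrete, here is a minimal sketch (using Hadoop's classic org.apache.hadoop.mapred API) of how such a job might be configured and submitted for our volume rendering example. The VolumeRenderJob, VolumeMapper, VolumeReducer, and SubImageValue names are assumptions introduced for this article; VolumeInputFormat, SubVolumeKey, SubImageKey, and the other custom types are described in the sections that follow.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class VolumeRenderJob {
  public static void main(String[] args) throws Exception {
    // Configuration information
    JobConf conf = new JobConf(VolumeRenderJob.class);
    conf.setJobName("volume-rendering-mip");

    // Input data: one file per sub-volume; output: the rendered image tiles
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    // MapReduce program: custom input format, mapper, and reducer
    conf.setInputFormat(VolumeInputFormat.class);
    conf.setMapperClass(VolumeMapper.class);
    conf.setReducerClass(VolumeReducer.class);
    conf.setMapOutputKeyClass(SubImageKey.class);
    conf.setMapOutputValueClass(SubImageValue.class);

    // Submit the job and wait for it to complete
    JobClient.runJob(conf);
  }
}

The framework takes it from there: it splits the input, schedules the map and reduce tasks, and handles the shuffle and sort between them.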

    A MapReduce job implemented in Hadoop is illustrated below:


Figure 2. Components of a Hadoop MapReduce Job[7]

    The data flow of the model is shown in Figure 3. This diagram shows why the data flow between map and reduce tasks is colloquially known as “the shuffle,” as each reduce task is fed by many map tasks.

    Figure 3. Data Flow of MapReduce programming model

    Map and Reduce Tasks

    In this article, we will use Hadoop as the framework for our design consideration. Hadoop supports the MapReduce model which was introduced by Google as a method of solving a class of petascale problems with large clusters of inexpensive machines. Hadoop runs the MapReduce job by dividing it into tasks, of which there are two main types:
    • Map tasks
    • Reduce tasks

    The idea behind map is to take a collection of data items and associate a value with each item in the collection. That is, to match up the elements of the input data with some relevant value to produce a collection of key-value pairs. In terms of concurrency, the operation of pairing up keys and values should be completely independent for each element in the collection.

    The reduce operation takes all the pairs resulting from the map operation and does a reduction computation on the collection. The purpose of a reduction is to take in a collection of data items and return a value derived from those items. In more general terms, we can allow the reduce operation to return with zero, one, or any number of results. This will all depend on what the reduction operation is computing and the input data from the map operation.

    Data Decomposition

As shown in Figure 2, the first design consideration is data decomposition (or splitting). There are at least two factors to consider:
    • Data locality
    • Task granularity vs. parallel overhead cost
Data locality promotes performance. Hadoop does its best to run each map task on a node where the input data resides in the Hadoop Distributed Filesystem (HDFS). However, reduce tasks don't have the advantage of data locality, since the input to a single reduce task is normally the output from all mappers. For our volume rendering example, keeping sub-volume data local will help the performance of map tasks.

Fine-grained parallelism allows for a more uniform distribution of load among nodes, but has the potential for significant overhead. On the other hand, coarse-grained parallelism incurs a small overhead, but may not produce balanced loading. For our volume rendering, there will be an optimal sub-volume size (TBD) that incurs a smaller overhead while producing better load balancing.

    InputFormat

    In Hadoop (see Figure 2), user-provided InputFormat can be used for custom data decomposition. An InputFormat describes both how to present the data to the Mapper and where the data originates from. An important job of the InputFormat is to divide the input data sources (e.g., input files) into fragments that make up the inputs to individual map tasks. These fragments are called splits and are encapsulated in instances of the InputSplit interface.

In the parallel plane cutting approach, we subdivide the volume into sub-volumes for rendering. Volume data can be stored in different formats. To simplify this discussion, we assume our input data are stored as sub-volumes (i.e., voxels belonging to the same sub-volume are stored consecutively in an individual file).

Objects that are marshaled to or from files and across the network must implement a particular interface, called Writable, which allows Hadoop to read and write the data in a serialized form for transmission. If the objects are keys, the WritableComparable interface should be used instead.
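As a concrete illustration of the Writable contract, a value type holding one sub-volume's voxels might look like the sketch below. The field layout (dimensions plus a flat voxel array) is an assumption made for this article, not a prescribed format.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// Hypothetical value type carrying one sub-volume's voxel data.
public class SubVolumeValue implements Writable {
  private int nx, ny, nz;   // sub-volume dimensions (assumed layout)
  private short[] voxels;   // voxel intensities, x varying fastest

  public void write(DataOutput out) throws IOException {
    out.writeInt(nx);
    out.writeInt(ny);
    out.writeInt(nz);
    for (short v : voxels) {
      out.writeShort(v);
    }
  }

  public void readFields(DataInput in) throws IOException {
    nx = in.readInt();
    ny = in.readInt();
    nz = in.readInt();
    voxels = new short[nx * ny * nz];
    for (int i = 0; i < voxels.length; i++) {
      voxels[i] = in.readShort();
    }
  }
}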

    To support our volume renderer, a custom InputFormat with two custom data types (i.e., SubVolumeKey and SubVolumeValue) needs to be created. A high level description of the implementation is provided below:
public class VolumeInputFormat
    extends SequenceFileInputFormat<SubVolumeKey, SubVolumeValue> {

  public RecordReader<SubVolumeKey, SubVolumeValue> getRecordReader(
      InputSplit input, JobConf job, Reporter reporter) throws IOException {

    reporter.setStatus(input.toString());
    return new VolumeRecordReader(job, (FileSplit) input);
  }
  ...
}
    The RecordReader implementation is where the actual file information is read and parsed.

class VolumeRecordReader implements RecordReader<SubVolumeKey, SubVolumeValue> {

  public VolumeRecordReader(JobConf job, FileSplit split) throws IOException {
    ...
  }

  public boolean next(SubVolumeKey key, SubVolumeValue value) throws IOException {
    // read the next sub-volume into key/value; return false when the split is exhausted
  }

  public SubVolumeKey createKey() {
    return new SubVolumeKey();
  }

  public SubVolumeValue createValue() {
    return new SubVolumeValue();
  }
  ...
}
    In SubVolumeKey, you need to provide the following minimum information:
    • 2D footprint offset (Fx, Fy)
    • Transformation matrix (M)
    • 3D sub-volume offset (Vx, Vy, Vz)
    • Resampling mode (R)
    • 3D Zooming and 2D Scaling factors (Z and S)
    • Projection function (P; for example max operation)
Resampling of sub-volumes on each cutting plane can be done independently as long as we provide sufficient information, as carried by the sample SubVolumeKey, to each map task. For a detailed description of SubVolumeKey's parameters, refer to [3,4].
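A sketch of how the fields listed above might be declared on SubVolumeKey is shown below. The field types and accessor names are assumptions for illustration; serialization follows the same write/readFields pattern as the SubVolumeValue sketch and is omitted here.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// Hypothetical key carrying everything a map task needs to resample its sub-volume.
public class SubVolumeKey implements WritableComparable<SubVolumeKey> {
  private int fx, fy;                 // 2D footprint offset (Fx, Fy)
  private float[] m = new float[16];  // transformation matrix (M), row-major 4x4
  private int vx, vy, vz;             // 3D sub-volume offset (Vx, Vy, Vz)
  private int resamplingMode;         // resampling mode (R): point sampling or trilinear
  private float zoom, scale;          // 3D zooming (Z) and 2D scaling (S) factors
  private int projection;             // projection function (P), e.g., max for MIP

  public int getFx() { return fx; }
  public int getFy() { return fy; }

  // Ordering of input keys is not significant; compare the footprint offset
  // only to give the framework a consistent sort order.
  public int compareTo(SubVolumeKey other) {
    int c = Integer.compare(fx, other.fx);
    return (c != 0) ? c : Integer.compare(fy, other.fy);
  }

  // write/readFields would serialize the fields above, as in the SubVolumeValue sketch.
  public void write(DataOutput out) throws IOException { /* ... */ }
  public void readFields(DataInput in) throws IOException { /* ... */ }
}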

    Map Function

    In this article, we will use Maximum Intensity Projection (MIP) as our volume rendering example. In scientific visualization, MIP is a volume rendering method for 3D data that projects in the visualization plane the voxels with maximum intensity that fall in the way of parallel rays traced from the viewpoint to the plane of projection.

The same principles used for MIP can be applied to isosurface rendering (SR). In SR, a Z-buffer, or depth matrix, is generated as the result. This matrix is actually a 2D image whose values are the depths at which an isosurface threshold occurs for a given viewing direction. A shading procedure using depth-gradient shading is then applied to generate a colored image.

    In [3], we have demonstrated other parallel volume rendering methods too:
    • Multi-Planar Reformatting
    • Volume Resampling
    • Ray Sum
    MIP has a nice property—you can apply the reduction computations to individual items and partial results of previous reductions:
    • max(0, 20, 10, 25, 15) = max(max(0, 20, 10), max(25, 15)) = max(20, 25) = 25
The map task of volume rendering is to generate 2D footprints from a given sub-volume. To perform the projection, we apply the transformation matrix (M) to the volume coordinates (Vx, Vy, Vz) and find the bounding box of each sub-volume. Based on the bounding box and the zooming factor (Z), we can determine the number of cutting planes that need to be sampled in the sub-volume. In the x and y directions, all coordinates are rounded or truncated to the closest discrete pixel position in the image plane. In the z direction, we define discrete plane levels (not necessarily integer coordinates) and all coordinates are rounded or truncated to the closest plane level. After adjusting the coordinates of the bounding box as described above, we sample the bounding box of the sub-volume via plane cutting.

    For MIP, the map task includes the following sub-tasks:
    • Resample voxels on each cutting plane
    • Prepare intermediate results for the consumption of reduce tasks
Each map task will generate as many 2D footprints as required and send them to reduce tasks. 3D resampling can be done with either point sampling or linear interpolation. The projected footprint is then scaled based on the 2D scaling factor (S) before being sent out.
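Under those assumptions, the map function might look like the following sketch, again using the classic Hadoop API. The Projection.projectMip helper and the SubImageKey constructor are hypothetical; they stand in for the bounding-box transformation, plane cutting, resampling, and scaling steps described above.

import java.io.IOException;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Sketch of the map side of MIP: one sub-volume in, one 2D footprint tile out.
public class VolumeMapper extends MapReduceBase
    implements Mapper<SubVolumeKey, SubVolumeValue, SubImageKey, SubImageValue> {

  public void map(SubVolumeKey key, SubVolumeValue subVolume,
                  OutputCollector<SubImageKey, SubImageValue> output,
                  Reporter reporter) throws IOException {
    // Transform the sub-volume's bounding box with M and Z, cut it with the
    // sampling planes, resample each plane, keep the per-pixel maximum, and
    // scale the footprint by S (all inside the hypothetical helper).
    SubImageValue footprint = Projection.projectMip(key, subVolume);

    // The intermediate key carries the footprint offset (Fx, Fy) so that
    // overlapping tiles meet in the same reduce call.
    output.collect(new SubImageKey(key.getFx(), key.getFy()), footprint);
    reporter.progress();
  }
}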

    Sort and Shuffle

    Custom data types are needed for the intermediate results (i.e., 2D image tiles):
    • SubImageKey
    • SubImageValue
    In SubImageKey, you need to provide the following minimum information:
    • 2D footprint offset (Fx, Fy)
    • Projection function (P; for example max operation)
    • 2D footprint distance (Fz; but this is not needed in MIP)
The compareTo method of SubImageKey, which implements the WritableComparable interface, should use (Fx, Fy) in the comparison for the shuffle and sort. For MIP, the ordering of intermediate results doesn't matter.
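A minimal sketch of such a key is shown below; the field names follow the assumptions used earlier in this article.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// Hypothetical intermediate key for 2D image tiles.
public class SubImageKey implements WritableComparable<SubImageKey> {
  private int fx, fy;       // 2D footprint offset (Fx, Fy)
  private int projection;   // projection function (P); Fz would be added if depth mattered

  public SubImageKey() { }
  public SubImageKey(int fx, int fy) { this.fx = fx; this.fy = fy; }

  // The shuffle and sort group tiles by footprint offset; for MIP the relative
  // order of tiles with the same offset is irrelevant.
  public int compareTo(SubImageKey other) {
    int c = Integer.compare(fx, other.fx);
    return (c != 0) ? c : Integer.compare(fy, other.fy);
  }

  public void write(DataOutput out) throws IOException {
    out.writeInt(fx);
    out.writeInt(fy);
    out.writeInt(projection);
  }

  public void readFields(DataInput in) throws IOException {
    fx = in.readInt();
    fy = in.readInt();
    projection = in.readInt();
  }

  public int hashCode() { return 31 * fx + fy; }   // used by the default partitioner
  public boolean equals(Object o) {
    return (o instanceof SubImageKey) && compareTo((SubImageKey) o) == 0;
  }
}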

    Reduce Function

The footprint of each sub-volume after projection is a 2D image tile. In Figure 4, we see that image tiles may overlap each other. The final image is created by recombining image tiles. Therefore, alignment of image tiles in the projection and recombination process is an important task in this work. If it is not correct, you may introduce artifacts into the final image. For medical imaging, no such artifacts can be tolerated.


    Figure 4. Projection and Recombination

For MIP, the reduce task includes the following sub-tasks (a sketch of the reduce function follows this list):
• Apply the projection function (i.e., max) to each pixel of the intermediate results
• Assemble the final image in a global output file with a specified format.
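A sketch of the reduce function under the same assumptions is shown below. The copy and maxInPlace methods on SubImageValue are hypothetical; the final placement of each aligned tile into the output image would be handled by the emitting step (for example, a custom OutputFormat).

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Sketch of the reduce side of MIP: all tiles sharing a footprint offset are
// combined with a per-pixel max and emitted for final image assembly.
public class VolumeReducer extends MapReduceBase
    implements Reducer<SubImageKey, SubImageValue, SubImageKey, SubImageValue> {

  public void reduce(SubImageKey key, Iterator<SubImageValue> tiles,
                     OutputCollector<SubImageKey, SubImageValue> output,
                     Reporter reporter) throws IOException {
    SubImageValue merged = null;
    while (tiles.hasNext()) {
      SubImageValue tile = tiles.next();
      if (merged == null) {
        merged = tile.copy();        // hypothetical deep copy (Hadoop reuses value objects)
      } else {
        merged.maxInPlace(tile);     // hypothetical per-pixel max(merged, tile)
      }
    }
    if (merged != null) {
      output.collect(key, merged);
    }
  }
}

Because max is associative and commutative, the same class could also be registered as a combiner to shrink the data moved during the shuffle.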

    Conclusion

    For a divide-and-conquer approach, the construction of the final image requires a number of stages. Image tiles of individual sub-volumes are generated after sampling and blending. A recombination process which takes care of pixel alignments is used to place these tiles into the final image under a specific merging condition. Finally, a post-rendering process called Z merging, with a depth compare done upon merging, can be used to integrate volume images with 3D graphics.

Finally, I want to use this article to pay tribute to Dr. Bruce H. McCormick (1928-2007), who was my most respected college professor and Ph.D. adviser [10].

    References

    1. Introduction to Parallel Programming and MapReduce
2. Multi-GPU Volume Rendering using MapReduce
3. S. Y. Guan and R. Lipes, "Innovative Volume Rendering Using 3D Texture Mapping," Proceedings of Medical Imaging 1994: Image Capture, Formatting, and Display, vol. 2164, pp. 382-392, Feb. 1994.
4. S. Y. Guan, A. Bleiweiss, and R. Lipes, "Parallel Implementation of Volume Rendering on Denali Graphics Systems," Proceedings of the 9th International Parallel Processing Symposium, pp. 700-706, 1995.
5. E. Camahort and I. Chakravarty, "Integrating Volume Data Analysis and Rendering on Distributed Memory Architectures," Proceedings of the 1993 Parallel Rendering Symposium, pp. 89-96, San Jose, CA, Oct. 1993.
6. W. M. Hsu, "Segmented Ray Casting for Data Parallel Volume Rendering," Proceedings of the 1993 Parallel Rendering Symposium, pp. 7-14, San Jose, CA, Oct. 1993.
7. Pro Hadoop by Jason Venner
    8. Hadoop: The Definitive Guide, Second Edition by Tom White
    9. Yahoo! Hadoop Tutorial
    10. Brain Networks Laboratory at Texas A&M University
    11. Learn Hadoop: How To Process Data with Apache Pig

    Saturday, November 5, 2011

    Sub-flow Design Pattern

A design pattern is a formal documentation of a proven solution to a common problem. Within Oracle Fusion Web Applications [1], there are many design patterns embedded in the design. One of them is the sub-flow design pattern.

    Before you start, read this companion article first.

    Usage

    In sub-flow design pattern, there are two task flows involved:
    • Parent task flow (top-level)
    • Sub-flow
In Oracle Fusion Web Applications, all top-level task flows can be bookmarked and launched from either the task list or the Recent Items menu. However, if the application requires that sub-flows can also be bookmarked and launched from the Recent Items menu, then this sub-flow design pattern can be used to provide that functionality.

If sub-flows are bookmarked, they can be relaunched from the Recent Items menu. From within a sub-flow, users must also be able to navigate back to its parent flow. The sub-flow design pattern takes that into consideration as well.

    Overview

To record sub-flows in the Recent Items list, applications need to call the openSubTask API right before sub-flows are launched [2]. openSubTask takes parameters similar to openMainTask's. One of them is the task flow ID. For this, you need to specify the parent flow's ID (or main task's ID). In other words, sub-flows need to be executed via the parent flow even when they are launched from the Recent Items menu. See the Sample Implementation section for details.
    If your sub-flow doesn't need to be bookmarked by Recent Items, you don't need to change anything. Otherwise, you need to modify your parent flow and sub-flow as described in the following task. After the changes, sub-flows can be launched in two ways:
    1. From original flows
    2. From Recent Items menu items using recorded information
Both will start execution in the parent flow. Because the sub-flow needs to be launched via the parent flow in the second case above, you need to change the parent flow as follows:
    1. Add a new router activity at the beginning of the parent flow. Based on a test condition (to be described later), it will route the control to either the original parent flow or the task flow call activity (i.e., the sub-flow).
2. Add an optional method call activity to initialize the sub-flow before it's launched in the second case (i.e., launching from the Recent Items menu). Fusion developers can code the method in such a way that it navigates to the sub-flow after initializing the parent state. This allows applications to render the contextual area, support navigation back to the parent flow from the sub-flow, and apply any other customizations.
3. Bind openSubTask to the command component (i.e., link or button) that causes the flow to navigate to the task flow call activity in the original parent flow. The openSubTask API registers the parent flow details (to be launched as a sub-flow later) on the Applications Core task flow history stack.
Usually, you don't need to modify your sub-flow for this task. However, you can consolidate the initialization steps from the two execution paths as follows:
1. Remove the initialization parts from both paths in the parent flow. Instead, set only the input parameters (which will be used as test conditions in the sub-flow) in both paths.
2. Modify the sub-flow to take input parameters.
3. Add a new method call (say, initSubFlow) at the beginning of the sub-flow to initialize states in the parent flow (for example, the parent table) so that the sub-flow can be launched in the appropriate context.
Note that the design pattern also requires that the application be capable of navigating back to the parent flow from the sub-flow. So, the initialization code should take this into consideration (i.e., set up states to allow the sub-flow to navigate back) too.
    In the following, we'll use an Employee sample implementation to demonstrate the details of this design pattern.

    Sample Implementation

    In this Fusion Web Application, users select Subflow Design Pattern from the Task list. They then specify some criteria for searching a specific employee or employees. From the list, they can choose the employee that they want to show the details for. This procedure is demonstrated in the following screen shots:
    Ename in the search result table is a link which can be used to navigate to the employee detail page of a specific employee. When this link is clicked, a sub-flow (or nested bounded task flow) is called and it displays the Employee Complete Detail page.

If users would like to add the Employee Complete Detail page of a specific employee (say, an employee named 'Allen') to their Recent Items list, application developers need to set up something extra to make this happen. Once this page (actually, what gets recorded is a bounded task flow whose default page is displayed) has been bookmarked, users can later click it on the Recent Items menu and launch it directly, skipping the search step (i.e., identifying the employee whose details need to be displayed).

    Implementation Details

    Our parent task flow named ToParentSFFlow is shown below:


    decideFlow in the diagram is the router activity that decides whether the control flow should go to either original parent flow path (i.e., "initParent") or sub-flow path (i.e., "toChild"). The condition we used is defined as follows:
    <router id="decideFlow">
     <case>
       <expression>#{pageFlowScope.Empno == null}</expression>
       <outcome id="__9">initParent</outcome>
     </case>
    
     <case>
       <expression>#{pageFlowScope.Empno != null}</expression>
       <outcome id="__10">toChild</outcome>
     </case>
    
     <default-outcome>initParent</default-outcome>
    </router>
In the test, we check whether the Empno variable in the parent flow's pageFlowScope is null or not. #{pageFlowScope.Empno} is set via the input parameter Empno when the parent flow is called. The input parameters on the parent flow (i.e., ToParentSFFlow) are defined as follows:
    <input-parameter-definition>
      <name>Empno</name>
      <value>#{pageFlowScope.Empno}</value>
      <class>java.lang.String</class>
    </input-parameter-definition>
     
When the parent flow is launched from the task list, the parameter Empno is not set (i.e., not defined in the application menu's itemNode). Therefore, it's null and the router will route control to the "initParent" path.
When the sub-flow is recorded via the openSubTask API, we set Empno on the parametersList as follows:
    <methodAction id="openSubTask" RequiresUpdateModel="true"
                     Action="invokeMethod" MethodName="openSubTask"
                     IsViewObjectMethod="false" DataControl="FndUIShellController"
                     InstanceName="FndUIShellController.dataProvider"
                     ReturnName="FndUIShellController.methodResults.openSubTask_FndUIShellController_dataProvider_openSubTask_result">
         <NamedData NDName="taskFlowId" NDType="java.lang.String"
             NDValue="/WEB-INF/oracle/apps/xteam/demo/ui/flow/ToParentSFContainerFlow.xml#ToParentSFContainerFlow"/>
         <NamedData NDName="parametersList" NDType="java.lang.String"
                    NDValue="Empno=#{row.Empno}"/>
         <NamedData NDName="label" NDType="java.lang.String"
                    NDValue="#{row.Ename} complete details"/>
         <NamedData NDName="keyList" NDType="java.lang.String"/>
         <NamedData NDName="taskParametersList" NDType="java.lang.String"/>
         <NamedData NDName="viewId" NDType="java.lang.String"
                    NDValue="/DemoWorkArea"/>
         <NamedData NDName="webApp" NDType="java.lang.String"
                    NDValue="DemoAppSource"/>
         <NamedData NDName="methodParameters"
             NDType="oracle.apps.fnd.applcore.patterns.uishell.ui.bean.FndMethodParameters"/>
    </methodAction>
     
    We also set up:
    • taskFlowId to be parent flow's, not subflow's
    • label to be subflow's
When end users click the link (i.e., Ename), to which the openSubTask method is bound, openSubTask will be called. The link component is defined as follows:
    <af:column sortProperty="Ename" sortable="false"
               headerText="#{bindings.ComplexSFEmpVO.hints.Ename.label}"
               id="resId1c2">
      <af:commandLink id="ot3" text="#{row.Ename}"
                      actionListener="#{bindings.openSubTask.execute}"
                      disabled="#{!bindings.openSubTask.enabled}"
                      action="toChild">
        <af:setActionListener from="#{row.Empno}"
                              to="#{pageFlowScope.Empno}"/>
      </af:commandLink>
    </af:column>
    
    
     
Note that when the link is clicked, the actionListener and action specified on the link are executed, in that order. Also note that openSubTask needs to be called only from the original parent flow path (i.e., "initParent"), not the sub-flow path (i.e., "toChild").
The EmployeeDetails activity in the figure above is a task flow call activity which invokes our sub-flow (i.e., ToChildSFFlow). Before the sub-flow is executed, you need to add some initialization steps. These initialization steps could include, but are not limited to:
    • Set up parent states. For our example, we need to set selected employee's row to be current.
    • Set up contextual area state.
    • Set up states to allow sub-flow to navigate back to parent flow.
    There are two approaches to set up initialization steps:
    1. In the parent flow
    2. In the sub-flow
For the first approach, you add logic to initialize both paths before the task flow call activity in the parent flow. For the second approach, you initialize states in the sub-flow by using the sub-flow's input parameters. In our example, the sub-flow takes an input parameter named Empno. So, the second approach just postpones the initialization to the sub-flow.
    Let's see how input parameters are defined in Task Flow Call activity and sub-flow.
    Here is the definition of input parameters in our Task Flow Call activity:
    
    <task-flow-call id="EmployeeDetails">
         <task-flow-reference>
           <document>/WEB-INF/oracle/apps/xteam/demo/ui/flow/ToChildSFFlow.xml</document>
           <id>ToChildSFFlow</id>
         </task-flow-reference>
         <input-parameter>
           <name>Empno</name>
           <value>#{pageFlowScope.Empno}</value>
         </input-parameter>
    </task-flow-call>
    
     
Note that this means that the calling task flow needs to store the value of Empno in #{pageFlowScope.Empno}. For example, on the original parent flow path, it is set to #{row.Empno} using the setActionListener tag. On the sub-flow path, it is set using the parent flow's input parameter named Empno. On the sub-flow, we need to specify its input parameters as below:
    <task-flow-definition id="ToChildSFFlow">
       <default-activity>TochildSFPF</default-activity>
       <input-parameter-definition>
         <name>Empno</name>
         <value>#{pageFlowScope.Empno}</value>
         <class>java.lang.String</class>
       </input-parameter-definition>
       ...
    </task-flow-definition>
    
Note that the name of the input parameter (i.e., "Empno") needs to be the same as the parameter name defined on the task flow call activity. When the parameter is available, ADF will place it in:
#{pageFlowScope.Empno}
to be used within the sub-flow. However, this pageFlowScope is different from the one referenced in the task flow call activity because the two have different owning task flows (i.e., parent task flow vs. sub-flow).
    Here is the definition of sub-flow:
In the sample implementation, we chose to implement the initialization step in the sub-flow. Empno is passed as a parameter to the sub-flow and used to initialize the parent state. When the sub-flow is launched, its default view activity (i.e., TochildSFPF) is displayed. Before it renders, the initPage method on ChildSFBean is executed first. The page definition of the default page is defined as follows:
    <pageDefinition xmlns="http://xmlns.oracle.com/adfm/uimodel">
     <parameters/>
     <executables>
       ...
       <invokeAction id="initPageId" Binds="initPage" Refresh="always"/>
     </executables>
     <bindings>
       ...
       <methodAction id="initPage" InstanceName="ChildSFBean.dataProvider"
                     DataControl="ChildSFBean" RequiresUpdateModel="true"
                     Action="invokeMethod" MethodName="initPage"
                     IsViewObjectMethod="false"
                     ReturnName="ChildSFBean.methodResults.initPage_ChildSFBean_dataProvider_initPage_result"/>
        ...
     </bindings>
    </pageDefinition>
    
    
    As shown above, initPage is specified in the executables tag and will be invoked when the page is refreshed. initPage method itself is defined as follows:

    public void initPage()
    {
       FacesContext facesContext = FacesContext.getCurrentInstance();
       ExpressionFactory exp = facesContext.getApplication().getExpressionFactory();
       DCBindingContainer bindingContainer =
         (DCBindingContainer)exp.createValueExpression(
             facesContext.getELContext(),"#{bindings}",DCBindingContainer.class).getValue(facesContext.getELContext());
       ApplicationModule am = bindingContainer.getDataControl().getApplicationModule();
    
       ViewObject vo = am.findViewObject("ComplexSFEmpVO");
       vo.executeQuery();
    
       Map map = AdfFacesContext.getCurrentInstance().getPageFlowScope();
       if(map !=null){
            Object empObj = map.get("Empno");
            if(empObj instanceof Integer){
Integer empno = (Integer) map.get("Empno");
                Object[] obj = {empno};
                Key key = new Key(obj);
                Row row = vo.getRow(key);
                vo.setCurrentRow(row);
            }
            else
            {
                String empnoStr = (String)map.get("Empno");
                Integer empno = new Integer(empnoStr);
                Object[] obj = {empno};
                Key key = new Key(obj);
                Row row = vo.getRow(key);
                vo.setCurrentRow(row);
            }
        }
    }
In initPage, the input parameter Empno (i.e., from #{pageFlowScope.Empno}) is used as a key to select a row and set it as the current row in the master table (i.e., the Employee table).
    References
    1. Oracle® Fusion Applications Developer's Guide 11g Release 1 (11.1.1.5)
    2. openSubTask and closeSubTask APIs
    3. Oracle ADF Task Flow in a Nutshell

    Thursday, September 22, 2011

    Book Review: "Oracle WebCenter 11g PS3 Administration Cookbook"

    There are three major components in the WebCenter product stack:
    1. WebCenter Framework
      • Allows you to embed portlets, ADF Taskflows, content, and customizable components to create your WebCenter Portal application
      • All Framework pieces are integrated into the Oracle JDeveloper IDE, providing access to these resources as you build your applications
    2. WebCenter Services
      • Are a set of independently deployable collaboration services
      • Incorporates Web 2.0 components such as content, collaboration, discussion, announcement and communication services
    3. WebCenter Spaces
      • Is an out-of-the-box WebCenter Portal application for team collaboration and enterprise social networking
      • Is built using the WebCenter Framework, WebCenter services, and Oracle Composer
    As the strategic portal product of Oracle, WebCenter Framework plays in the Enterprise portal space, and WebCenter Services/Spaces plays in the Collaboration Workspace space.

What's a Portal Application?


    A portal can be thought of as an aggregator of content and applications or a single point of entry to a user's set of tools and applications. It is a web-based application that is customizable by the end-user both in the look and feel of the portal and in the available content and applications which the portal contains.

    The key elements of portals include:
    • Page hierarchy
    • Navigation
    • Delegated administration and other security features
    • Runtime customization and personalization.

    To design a successful enterprise web portal is hard, but getting easier and more practical with Oracle WebCenter which is built on top of Oracle ADF technology. As an enterprise portal, security is extremely important. Unauthorized people should never get access, and different groups may have different permissions. Customers, partners and employees should be able to use a single login to access all relevant information and applications.

    The Book


    To design, test, deploy, and maintain a successful web portal is nontrivial to say the least. Therefore, a cookbook like Oracle WebCenter 11g PS3 Administration Cookbook is needed. In fourteen chapters, it provides over a hundred step-by-step recipes that help the reader through a wide variety of tasks ranging from portal and portlet creation to securing, supporting, managing, and administering Oracle WebCenter.

The book covers many new features introduced in the 11g R1 Patch Set 3 release of the Oracle WebCenter product. It also touches upon all three components (WebCenter Framework, WebCenter Services, and WebCenter Spaces), roughly in that order. Besides important topics such as customization and security, it also discusses the analytics aspect of the product (i.e., Activity Graph).

    Resource Catalog

Using the resource catalog as an example, in this book you'll learn:
• How to create a resource catalog either at design time or at runtime
• How to specify a catalog filter or a catalog selector
• How to add a link to the resource catalog
• How to add an existing resource catalog to the catalog
• How to add custom components to a resource catalog
• How to add a custom folder to the resource catalog
    At each step, you'll learn how it works and why. For example, when you add a resource catalog at runtime, an XML file will also be created, but it will be stored in the MDS (Metadata Service Repository) which is a repository used by WebCenter to store metadata.

    Trade-offs


After introducing the different approaches, the author also discusses the trade-offs of each. For example, WebCenter Spaces allows you to build collaborative intranets without much development effort. The problem you will have with WebCenter Spaces is that it is not as easily customizable as a regular WebCenter Portal application. Therefore, you can combine the best of both worlds: when you need a high level of customization or you need to extend the site with custom functionality, create a WebCenter Portal application; when you need a collaborative environment where customization or added functionality is not as important as the collaborative services, go for WebCenter Spaces.

    References
    1. Oracle WebCenter 11g PS3 Administration Cookbook
    2. Creating a Successful Web Portal
3. Oracle WebCenter (Wikipedia)
    4. Oracle ADF Task Flow in a Nutshell
    5. Book Review: Web 2.0 Solutions with Oracle WebCenter 11g
    6. Oracle® Fusion Middleware Enterprise Deployment Guide for Oracle WebCenter Content 11g Release 1 (11.1.1)

    Saturday, September 17, 2011

    ADF View Criteria By Example

    There are different filtering approaches to query row data provided in Oracle ADF 11g:
    • By adding WHERE clause to View Object SQL statement
    • By creating View Criteria Programmatically
    • By using named View Criteria
In this article, we will examine these different approaches, followed by a discussion of view criteria.

Note that all three examples shown in the article are defined in the application module.

    Adding WHERE Clause

In the first example (i.e., getChannel1), we get the query statement from the View Object and append a WHERE clause to it. Then a PreparedStatement is executed with the specified filtering.
public OracleCachedRowSet getChannel1(Long channelId) throws SQLException
{
  ResultSet rs = null;
  try
  {
    ViewObjectImpl vo = (ViewObjectImpl) this.findViewObject("ChannelOnly");
    StringBuffer query = new StringBuffer(vo.getQuery());
    query.append(" where ChannelEO.CHANNEL_ID =").append(channelId);
    DBTransaction txn = this.getDBTransaction();
    PreparedStatement ps = txn.createPreparedStatement(query.toString(), 1);
    rs = ps.executeQuery();
    OracleCachedRowSet ocs = new OracleCachedRowSet();
    ocs.populate(rs);
    return ocs;
  }
  catch (Exception e)
  {
    if (AppsLogger.isEnabled(AppsLogger.SEVERE))
    {
      AppsLogger.write(OsmmSetupUiModelAMImpl.class, e);
    }
  }
  finally
  {
    if (rs != null)
      rs.close();
  }
  return null;
}
    

    Creating View Criteria Programmatically

In the second example (i.e., getChannel2), a ViewCriteria object is created at runtime using ViewCriteriaRow objects, which in turn are composed of ViewCriteriaItem objects. This ViewCriteria object is then applied to the View Object and used in the filtering.
public void getChannel2(Long channelId)
{
  // Create and populate criteria rows to support query-by-example.
  ViewObject channelVO = this.findViewObject("ChannelOnly");
  ViewCriteria vc = channelVO.createViewCriteria();
  ViewCriteriaRow vcRow = vc.createViewCriteriaRow();

  // ViewCriteriaRow attribute name is case-sensitive.
  // ViewCriteriaRow attribute value requires operator and value.
  // Note also single-quotes around string value.
  ViewCriteriaItem jobItem = vcRow.ensureCriteriaItem("ChannelId");
  jobItem.setOperator("=");
  jobItem.getValues().get(0).setValue(channelId);
  vc.add(vcRow);

  channelVO.applyViewCriteria(vc);

  // Multiple rows are OR-ed in WHERE clause.
  System.out.println("Demo View Criteria");

  // Should print channel with specified channel ID
  printViewObject(channelVO);
}
    
public void printViewObject(ViewObject vo)
{
  // Execute the query, print results to the screen.
  vo.executeQuery();

  // Print the View Object's query
  System.out.println("Query: " + vo.getQuery());

  while (vo.hasNext())
  {
    Row row = vo.next();
    String rowDataStr = "";

    // How many attributes (columns) is the View Object using?
    int numAttrs = vo.getAttributeCount();

    // Column numbers start with 0, not 1.
    for (int columnNo = 0; columnNo < numAttrs; columnNo++)
    {
      // See also Row.getAttribute(String name).
      Object attrData = row.getAttribute(columnNo);
      rowDataStr += (attrData + "\t");
    }
    System.out.println(rowDataStr);
  }
}
    

    Using Named View Criteria

In the third example (i.e., getChannel3), we find a named view criteria (i.e., findByChannelId) which is defined at design time. After setting the value of the named bind variable (i.e., ChannelIdBV), the view criteria is applied to the View Object and used to query the row data.

public Row[] getChannel3(Long channelId)
{
  ChannelVOImpl viewObj = (ChannelVOImpl) this.getChannelOnly();
  if (viewObj != null)
  {
    ViewCriteria vc = viewObj.getViewCriteria("findByChannelId");
    viewObj.setNamedWhereClauseParam("ChannelIdBV", channelId);
    viewObj.applyViewCriteria(vc);
    viewObj.executeQuery();
    viewObj.setRangeSize(-1);
    Row[] allRows = viewObj.getAllRowsInRange();
    return allRows;
  }
  return null;
}
    

    What's View Criteria

    Before the advent of Oracle ADF 11g, to show an employee list filtered by company role on one page and by department number on another page, you would have needed to either create separate view objects for each page or write custom code to selectively modify a view object's WHERE clause and bind variable values. With the new release, you can now use a single view object with multiple named view criteria filters to accomplish the same task.

    A view criteria you define lets you specify filter information for the rows of a view object collection. The view criteria object is a row set of one or more view criteria rows, whose attributes mirror those in the view object. The view criteria definition comprises query conditions that augment the WHERE clause of the target view object. Query conditions that you specify apply to the individual attributes of the target view object. Check out here for:
    • How to Create Named View Criteria Declaratively
    • How to Test View Criteria Using the Business Component Browser
    • How to Create View Criteria Programmatically

    Advantages of Using Named View Criteria

    Among the different approaches, the third one is the preferred approach. This is because view criteria that you define at design time can participate in these scenarios where filtering results is desired at runtime:
    • Supporting Query-by-Example search forms that allow the end user to supply values for attributes of the target view object[2].
• Filtering list of values (LOV) components, allowing the end user to select from a filtered attribute list (displayed in the UI as an LOV component)[3].
    • Validating attribute values using a view accessor with a view criteria applied to filter the view accessor results[4].
    • Creating the application module's data model from a single view object definition with a unique view criteria applied for each view instance[5].

    References

    1. Working with Named View Criteria
    2. Creating Query Search Forms
    3. Creating a Selection List
    4. How to Validate Against a View Accessor
    5. How to Define the WHERE Clause of the Lookup View Object Using View Criteria
    6. Reusable ADF Components—Application Modules
    7. Oracle Application Development Framework
    8. Oracle ADF Essentials

    Sunday, August 14, 2011

    Book Review: "Overview of Oracle Enterprise Manager Grid Control 11g R1: Business Service Management"

    There are different console applications or flavors provided in Oracle Enterprise Manager (OEM):
    • OEM Database Control
    • OEM Application Server and Fusion Middleware Control
    • OEM Grid Control
    The Business Service Management (BSM) capabilities of Oracle Enterprise Manager are available only in the Grid Control flavor.

    In this book "Overview of Oracle Enterprise Manager Grid Control 11g R1: Business Service Management", it covers OEM's Business Service Management capabilities in great details as described in this article.

    Business Service Management

    Business Service Management (BSM) is a methodology for monitoring and measuring Information Technology (IT) services from a business perspective. It allows IT departments to operate by service rather than by individual manageable entity or target.

    BSM software and services are provided by major vendors. Oracle Enterprise Manager (OEM) 11g is a product offering from Oracle that provides solutions to the typical IT infrastructure management issues.

    Management Issues

    Any enterprise IT infrastructure contains numerous disparate components that are geographically distributed across various data centers. These components include:
    • Hardware components
      • Such as servers hosting different applications, network switches, routers, storage devices, and so on
    • Software components
      • Such as operating systems, database servers, application servers, middleware components, packaged applications, distributed applications, and so on
To make things even worse, an IT infrastructure also has the following characteristics:
    • Hardware and software could be sourced from multiple vendors
    • Multiple versions of the same software product, from the same vendor, could be deployed across the enterprise
• Newer technologies such as service-oriented architectures (SOA), virtualization, cloud computing, portal frameworks, grid architectures, and mashups within an organization make troubleshooting and monitoring of business services very difficult
    These heterogeneous, disparate and geographically distributed components give rise to the complexity of IT management issues.

    The Needs

    Facing these challenges, a successful management solution must:
    • Have the capability to model, monitor, administer, and configure higher-level logical entities that map to business functions
    • Provide different perspectives to get a comprehensive view of the health of the various business services and the underlying IT infrastructure
    • Be able to perform complex computations and scale very easily with a simple architecture and a small footprint
    • Take into consideration the geographical spread of the infrastructure landscape

    The Solution

    Oracle Enterprise Manager (OEM) is one of the industry leaders in the system management products arena. It provides the following capabilities:
    • A single unified platform for modeling and managing enterprise data centers
    • Comprehensive monitoring and management capabilities for the entire Oracle Grid within the enterprise
    • Discovery, monitoring, and management of various pieces of the IT infrastructure
      • Includes Non-Oracle Software Products
    • Supports both passive and active monitoring paradigms
    • Two distinct perspectives:
      • Target-based focus
        • This provides a highly specialized set of views exclusive for a specific target
      • Business service-based focus
        • This provides a holistic view that dwells on different targets within an enterprise and their interactions with each other to achieve a business objective
    • Capabilities of defining and tracking Service-Level Agreements (SLAs) of different business functions
    The Grid Control architecture (see the Figure above) is distributed in nature and relies on the agents to collect data on the individual hosts. It includes the following components:
    • Oracle Management Agent
• A piece of software installed on a host that collects information about the targets on the host or on remote hosts. The collected data is then passed on to the management service.
      • In case of remote monitoring (vs. local monitoring), there is no automatic discovery support and the administrator must use the console UI pages to initiate the remote discovery.
    • Oracle Management Service (OMS)
      • This is the brain of the OEM. It acts as the centralized management solution and also acts as the server to which all the management agents upload the collected data.
      • The OMS provides current and future insights into business functions and services by looking at the historical data that is stored in its management repository.
    • OEM Console
      • This is the user interface that exposes all the management functionalities to the end user of OEM.
      • It provides views into each of the targets and also allows the user to initiate actions and configuration changes on these targets.
    • Oracle Management Repository
      • This is the central repository that is used by the OMS to store all data.
By distributing the data collection to individual agents, the Oracle Management Service (OMS) is freed up to perform more important tasks.

    The Book

The book uses a travel portal as an example to:
    • Illustrate the concepts of IT infrastructure management
    • Showcase OEM's BSM capabilities
    • Provide step-by-step instructions of using OEM
    The travel portal provides various business services such as flight search, car rental services, and so on to the end users. It also consumes the payment gateway services from various business partners.

    In the travel portal illustration, these services are configured as different service targets such as:
    • CarRentalService:
      • Modeled as a Generic Service target based on the TravelPortal-CarRental-System
    • FlightSearchWebSite:
      • Modeled as a Web Application service based on a Service Test from two different beacons
    • PaymentGatewayService:
      • Modeled as a Forms Application based on the PaymentGatewaySystem
    • TravelPortalSearchServices:
      • Modeled as an Aggregate Service comprising the CarRentalService and FlightSearchWebSite service targets

    Resources

    1. Overview of Oracle Enterprise Manager Grid Control 11g R1: Business Service Management
    2. Oracle Grid Products
    3. Oracle Grid Engine
    4. Oracle Enterprise Manager
    5. Enterprise Manager Grid Control
    6. Oracle Enterprise Manager Cloud Control 12c: Best Practices for Middleware Management

    Tuesday, July 26, 2011

    Beautifying Table and Column Comments for Design Review

Data model design is an iterative process. As soon as the conceptual data model is accepted by the functional team, development of the logical data model gets started. Once the logical data model is completed, it is forwarded to the functional team for review. A good data model is created by thinking clearly about current and future business requirements.

    To facilitate the review process, you need to present descriptions of entities and attributes in the data model to the functional team. Some database developers prefer working at source level (i.e., SQL DDL). For example, you can present the following EMP table to the team for review:
    -- Employee Data
    CREATE TABLE "SCOTT"."EMP"
    (
    "EMPNO" NUMBER(4,0),        -- employee number
    "ENAME" VARCHAR2(10 BYTE),  -- employee name
    "JOB"   VARCHAR2(9 BYTE),   -- job description
    "MGR"   NUMBER(4,0),        -- manager ID
    "HIREDATE" DATE,            -- hiring date
    "SAL"    NUMBER(7,2),       -- salary
    "COMM"   NUMBER(7,2),       -- commission
    "DEPTNO" NUMBER(2,0),       -- department number
    CONSTRAINT "PK_EMP" PRIMARY KEY ("EMPNO") USING INDEX PCTFREE 10 INITRANS 2 MAXTRANS 255 COMPUTE STATISTICS STORAGE(INITIAL 65536 NEXT 1048576 MINEXTENTS 1 MAXEXTENTS 2147483645 PCTINCREASE 0 FREELISTS 1 FREELIST GROUPS 1 BUFFER_POOL DEFAULT FLASH_CACHE DEFAULT CELL_FLASH_CACHE DEFAULT) TABLESPACE "USERS" ENABLE,
    CONSTRAINT "FK_DEPTNO" FOREIGN KEY ("DEPTNO") REFERENCES "SCOTT"."DEPT" ("DEPTNO") ENABLE
    )
    In this article, we will show another way which presents the following table generated semi-automatically from the offline database using JDeveloper and Microsoft Excel:
    Using the first approach, the drawbacks include:
    • SQL DDL scripts tend to be error-prone
• Comments are only for the human reader and are not part of the DB definitions
That's why we propose the second approach, which resolves these two issues.


    Offline Database

    In JDeveloper, database development is available offline in the context of a project, allowing developers to create and manipulate schemas of database objects which can be generated to a database or to SQL scripts. Database objects can also be imported from a database into a project. See my previous post for more details.

You can follow the instructions in [1, 2] to create offline database objects. For the demo, I've created a database diagram and dragged an existing EMP table from the SCOTT schema onto it to create a table.


    Adding Comments

Double-click the EMP table component on the diagram to open the Edit Table dialog.

Select Comment in the navigation panel to enter the table's comment as shown above.
Select Columns in the navigation panel and navigate through them one by one. In the Comment field, enter each column's comment as shown above. Click Save All to save your work.

In the Application Navigator, under Offline Database Sources | EMP_DATABASE | SCOTT, right-click the EMP node and choose Generate To > SQL script ... to create a SQL script file named emp.sql.
Open emp.sql in the editor window. Look for the table's and columns' comments at the bottom of the script, as shown below:

    COMMENT ON TABLE EMP IS 'Employee Data';
    
    COMMENT ON COLUMN EMP.EMPNO IS 'employee number';
    
    COMMENT ON COLUMN EMP.ENAME IS 'employee name';
    
    COMMENT ON COLUMN EMP.JOB IS 'job description';
    
    COMMENT ON COLUMN EMP.MGR IS 'manager ID';
    
    COMMENT ON COLUMN EMP.HIREDATE IS 'hiring date';
    
    COMMENT ON COLUMN EMP.SAL IS 'salary';
    
    COMMENT ON COLUMN EMP.COMM IS 'commission';
    
    COMMENT ON COLUMN EMP.DEPTNO IS 'department number';

Select the above comments and copy them into a text file (e.g., emp.txt).


    Generating Comment Table

Start up Microsoft Excel and import the text file as follows:
In the Text Import Wizard, specify delimiters using space and paired single quotes as shown below:

After clicking the Finish button, you can remove columns A, B, and E. Excel will then present you with the final comment table shown at the beginning of this article.


    Conclusion

Comment tables generated with the second approach have the following advantages:
• The source of the comment table is an offline database object, which can be validated by JDeveloper and placed under source control.
• They are part of the DB definitions and can be queried as follows:
      • select comments
        from user_tab_comments
        where table_name = 'EMP'
        /
      • select column_name, comments
        from user_col_comments
        where table_name = 'EMP'
        order by column_name
        /


    References

    1. Database Development with JDeveloper
    2. Modeling Data with Offline Database in JDeveloper

    Modeling Data with Offline Database in JDeveloper

For Oracle Applications developers, the JDeveloper offline database modeler replaces the Oracle Designer repository, or CASE as it was referred to. Applications developers should not use SQL DDL scripts for deployment and source control of database objects, because they tend to be error-prone and do not serve as a single source of truth. Instead, developers should use the JDeveloper offline database object files.

    What is the Offline Database

    JDeveloper provides the tools you need to create and edit database objects, such as tables and constraints, outside the context of a database, using the offline Database model. You can create new tables and views, and generate the information to a database, or you can import database objects from a database schema, make the changes you want, and generate the changes back to the same database schema, to a new database schema, or to a file that you can run against a database at a later date.

    Offline Database Model

    The JDeveloper Offline database supports the following object types:
    • Function
    • Materialized View
    • Materialized View Log
    • Package
    • Procedure
    • Sequence
    • Synonym
    • Table
    • Trigger
    • Type
    • View
    Currently, JDeveloper offline DB objects do not support these objects:
    • Queue
    • Queue tables
    • Policy
    • Context
    However, SXML persistence files for these object types can be imported using the applxdf extension.
JDeveloper provides tools to create and edit database objects such as tables, views, etc. outside the context of a database. This tool, called the Offline Database Definition, will be used to model physical database objects in Fusion applications. The migration tool will support migrating all user-selected database objects defined in CASE to this offline database definition in JDeveloper, along with SXML/XDF1 deployment files.

    Metadata SXML Comparison Tool[3]

    Offline table definitions can be version controlled and shared using a source control system. If you just create objects in the DB schema via the database navigator, you have nothing to source control. JDeveloper provides a comparison tool optimized for working with offline table definitions, which handles:
    • The table data, properties, columns and constraints.
    • The identity of objects, to track name changes.
    • Checking for consistency, for example, ensuring:
      • That a column which is used in a key is not dropped.
      • That a constraint which uses an absent column is not added.
• That a primary key column cannot be made optional.
Using this comparison tool, you can compare object metadata of the same type from different databases. This comparison depends on SXML, an XML representation which maps more closely to the SQL creation DDL. Two SXML documents of the same type can be compared, and a new SXML document is produced which describes their differences.
    Using this comparison tool, you're able to:
    • Compare object definitions in different JDeveloper projects
      • Since the objects to be compared are in separate projects, you need to create a dependency between them to be able to perform this comparison.
    • Compare versioned copies of DB objects
• Versioning components allows you to browse through the historical changes of a component and make comparisons between these versions. With JDeveloper, it's possible to compare different versions of database models.

    Working on Data Modeling at Different Levels:

    • UML class diagram
      • You can create a logical model using a UML class diagram to visually create or inspect classes, interfaces, attributes, operations, associations, inheritance relationships, and implementation relations and transform it to an Offline or Online Database definitions later.
      • See this tutorial for how-to.
      • You usually do logical modeling using a UML class model in the following steps:
        1. Preparing a class model diagram Environment
        2. Creating a Class Model Diagram
        3. Enhancing the Class Model
        4. Transform the Class Model into a Database Model
    • Database diagram
      • You can follow [4, 5] to create new database diagram.
      • You can also drag tables, views, materialized views, synonyms, and sequences from a database schema onto a database diagram, where they become accessible as offline database objects.
    • Offline Database
      • You can create new offline database objects, or capture them from a connection to a live database. After you have finished working with them, you can generate new and updated database definitions to online database schemas or to SQL scripts.
      • You can follow the instructions in [4, 5] to create new offline database objects. When you create an offline database, you choose the database emulation (for example, Oracle11g Database Release 1) the offline database should have.
      • You can also copy offline database objects to a project. In general, it is a good idea to make sure that the offline database uses the same database emulation as the source database.
      • Note that generation to a database is not certified against non-Oracle databases.

    Notes


    1. Prior to SXML migration, these were referred to as xdf (extension) files.

    References

    1. http://susanduncan.blogspot.com/
    2. Database Development with JDeveloper
    3. Metadata SXML Comparison Tool
    4. Database Development with JDeveloper

    Wednesday, July 13, 2011

    Language Identification

Language identification is a supervised learning method. In this article, we will cover a specific Processing Resource (PR) in GATE (i.e., the TextCat or Language Identification PR). According to its documentation, it:

    Recognizes the document language using TextCat. Possible languages: german, english, french, spanish, italian, swedish, polish, dutch, norwegian, finnish, albanian, slovakian, slovenian, danish, hungarian.

    N-Gram-Based Text Categorization

The TextCat PR uses N-grams for text categorization. You can find the details in this article. See the following diagram for its data flow.



    There are two phases in the language identification task:
    1. Training
    2. Application

    We'll discuss those in the following sections.

    Training Phase

    In the training phase, the goal is to generate category profiles from the given category samples. In Language Identification PR (or TextCat PR), the categories are languages. So, we take document samples from different languages (i.e., English, German, etc.) and use them to generate category profiles.

These category profiles are already provided with the TextCat PR. At runtime, the TextCat PR looks for a configuration file named textcat.conf. This file has the following content:

    language_fp/german.lm    german
    language_fp/english.lm english
    language_fp/french.lm french
    language_fp/spanish.lm spanish
    language_fp/italian.lm italian
    language_fp/swedish.lm swedish
    language_fp/polish.lm polish
    language_fp/dutch.lm dutch
    language_fp/norwegian.lm norwegian
    language_fp/finnish.lm finnish
    language_fp/albanian.lm albanian
    language_fp/slovak-ascii.lm slovakian
    language_fp/slovenian-ascii.lm slovenian
    language_fp/danish.lm danish
    language_fp/hungarian.lm hungarian

In a sub-folder named language_fp, relative to the location of textcat.conf, there are multiple category profile files with the .lm suffix. For example, german.lm is the category profile for German and english.lm is the category profile for English.

Using the English profile as an example, its content looks like this:

    _     20326
    e 6617
    t 4843
    o 3834
    n 3653
    i 3602
    a 3433
    s 2945
    r 2921
    h 2507
    e_ 2000
    d 1816
    _t 1785
    c 1639
    l 1635
    th 1535
    he 1351
    _th 1333
    ...

    On each line, there are two elements:

    • N-gram (N is from 1 to 5)
    • Frequency

N-grams are sorted in decreasing order of frequency. For example, the most frequently found character in English documents is the space character (represented by '_'), whose count of occurrences is 20326. From the training data, we also find that the most frequent 2-gram is 'e_' (i.e., the letter 'e' followed by a space).
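To make the training step concrete, here is a simplified sketch of how such an N-gram frequency profile could be built. It is not the TextCat source itself; it just folds whitespace into '_', counts all 1- to 5-grams, and ranks them by decreasing frequency like the .lm files shown above.

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified sketch: build an N-gram (N = 1..5) frequency profile from sample text.
public class NGramProfileBuilder {

  public static List<Map.Entry<String, Integer>> buildProfile(String text) {
    // Represent whitespace as '_', as in the category profiles.
    String normalized = "_" + text.trim().replaceAll("\\s+", "_") + "_";

    Map<String, Integer> counts = new HashMap<String, Integer>();
    for (int n = 1; n <= 5; n++) {
      for (int i = 0; i + n <= normalized.length(); i++) {
        String gram = normalized.substring(i, i + n);
        Integer c = counts.get(gram);
        counts.put(gram, (c == null) ? 1 : c + 1);
      }
    }

    // Rank N-grams by decreasing frequency, the order used in the .lm files.
    List<Map.Entry<String, Integer>> ranked =
        new ArrayList<Map.Entry<String, Integer>>(counts.entrySet());
    Collections.sort(ranked, new Comparator<Map.Entry<String, Integer>>() {
      public int compare(Map.Entry<String, Integer> a, Map.Entry<String, Integer> b) {
        return b.getValue().compareTo(a.getValue());
      }
    });
    return ranked;
  }
}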

    Application Phase


In the application phase, the TextCat PR reads the learned model (i.e., the category profiles) and then applies the model to the data. Given a new document, we first generate a document profile (i.e., an N-gram frequency profile) similar to the category profiles.

The language classification task is then to measure the profile distance: for each N-gram in the document profile, we find its counterpart in the category profile and calculate how far out of place it is.

    Finally, the bubble labelled "Find Minimum Distance" simply takes the distance measures from all of the category profiles to the document profile, and picks the smallest one.
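The rank-based ("out-of-place") distance itself is easy to sketch. Given two profiles that are simply lists of N-grams sorted by decreasing frequency, a possible implementation (an illustration, not the TextCat code) is:

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Out-of-place distance between a document profile and a category profile.
public class ProfileDistance {

  public static int outOfPlace(List<String> documentProfile, List<String> categoryProfile) {
    // Map each category N-gram to its rank in the category profile.
    Map<String, Integer> categoryRank = new HashMap<String, Integer>();
    for (int i = 0; i < categoryProfile.size(); i++) {
      categoryRank.put(categoryProfile.get(i), i);
    }

    int maxPenalty = categoryProfile.size();  // penalty for N-grams missing from the category
    int distance = 0;
    for (int i = 0; i < documentProfile.size(); i++) {
      Integer rank = categoryRank.get(documentProfile.get(i));
      distance += (rank == null) ? maxPenalty : Math.abs(rank - i);
    }
    return distance;  // the category (language) with the smallest distance wins
  }
}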

    What's in TextCat PR?

    If you look inside the textcat-1.0.1.jar, you can identify the following structure:

org/
+-- knallgrau/
    +-- utils/
        +-- textcat/
            +-- FingerPrint.java
            +-- MyProperties.java
            +-- NGramEntryComparator.java
            +-- TextCategorizer.java
            +-- textcat.conf
            +-- language_fp/
                +-- english.lm
                +-- german.lm
                +-- ...

Unfortunately, you cannot find the above source files in GATE's downloads. However, after a Google search, I found them on Google Code here.