Tuesday, July 26, 2011

Beautifying Table and Column Comments for Design Review

Data model design is an iterative process. As soon as the conceptual data model is accepted by the functional team, development of the logical data model begins. Once the logical data model is completed, it is forwarded to the functional team for review. A good data model is created by thinking clearly about both current and future business requirements.

To facilitate the review process, you need to present descriptions of the entities and attributes in the data model to the functional team. Some database developers prefer working at the source level (i.e., SQL DDL). For example, you can present the following EMP table to the team for review:
-- Employee Data
CREATE TABLE "SCOTT"."EMP"
(
"EMPNO" NUMBER(4,0),        -- employee number
"ENAME" VARCHAR2(10 BYTE),  -- employee name
"JOB"   VARCHAR2(9 BYTE),   -- job description
"MGR"   NUMBER(4,0),        -- manager ID
"HIREDATE" DATE,            -- hiring date
"SAL"    NUMBER(7,2),       -- salary
"COMM"   NUMBER(7,2),       -- commission
"DEPTNO" NUMBER(2,0),       -- department number
CONSTRAINT "PK_EMP" PRIMARY KEY ("EMPNO") USING INDEX PCTFREE 10 INITRANS 2 MAXTRANS 255 COMPUTE STATISTICS STORAGE(INITIAL 65536 NEXT 1048576 MINEXTENTS 1 MAXEXTENTS 2147483645 PCTINCREASE 0 FREELISTS 1 FREELIST GROUPS 1 BUFFER_POOL DEFAULT FLASH_CACHE DEFAULT CELL_FLASH_CACHE DEFAULT) TABLESPACE "USERS" ENABLE,
CONSTRAINT "FK_DEPTNO" FOREIGN KEY ("DEPTNO") REFERENCES "SCOTT"."DEPT" ("DEPTNO") ENABLE
);
In this article, we show another way: a comment table (shown at the beginning of this article) generated semi-automatically from the offline database using JDeveloper and Microsoft Excel.
The first approach has two drawbacks:
  • SQL DDL scripts tend to be error-prone
  • Comments are only for the human reader and are not part of the DB definitions
That's why we propose the second approach, which resolves both issues.


Offline Database

In JDeveloper, database development is available offline in the context of a project, allowing developers to create and manipulate schemas of database objects which can be generated to a database or to SQL scripts. Database objects can also be imported from a database into a project. See my previous post for more details.

You can follow the instructions in [1, 2] to create offline database objects. For the demo, I've created a database diagram and dragged an existing EMP table from the SCOTT schema onto it.


Adding Comments

Double-click the EMP table component on the diagram to open the Edit Table dialog,

Select Comment in the navigation panel and enter the table's comment as shown above.
Select Columns in the navigation panel and navigate the columns one by one. In the Comment field, enter each column's comment as shown above. Click Save All to save your work.

In the Application Navigator, under Offline Database Sources | EMP_DATABASE | SCOTT, right-click the EMP node and choose Generate To > SQL Script... to create a SQL script file named emp.sql.
Open emp.sql in the editor window. The table and column comments appear at the bottom of the script, as shown below:

COMMENT ON TABLE EMP IS 'Employee Data';

COMMENT ON COLUMN EMP.EMPNO IS 'employee number';

COMMENT ON COLUMN EMP.ENAME IS 'employee name';

COMMENT ON COLUMN EMP.JOB IS 'job description';

COMMENT ON COLUMN EMP.MGR IS 'manager ID';

COMMENT ON COLUMN EMP.HIREDATE IS 'hiring date';

COMMENT ON COLUMN EMP.SAL IS 'salary';

COMMENT ON COLUMN EMP.COMM IS 'commission';

COMMENT ON COLUMN EMP.DEPTNO IS 'department number';

Select the above comments and copy them into a text file (e.g., emp.txt).


Generating Comment Table

Start up Microsoft Excel and import the text file as follows:
In the Text Import Wizard, specify space as the delimiter and the single quote as the text qualifier, as shown below:

After clicking the Finish button, you can remove columns A, B, and E. You will then have the final comment table as shown at the beginning of this article.


Conclusion

Comment tables generated in the second approach have the following advantages:
  • The source of the comment table is an offline database object, which can be validated by JDeveloper and placed under source control.
  • They are part of the DB definitions and can be queried as follows:
    • select comments
      from user_tab_comments
      where table_name = 'EMP'
      /
    • select column_name, comments
      from user_col_comments
      where table_name = 'EMP'
      order by column_name
      /


References

  1. Database Development with JDeveloper
  2. Modeling Data with Offline Database in JDeveloper

Modeling Data with Offline Database in JDeveloper

For Oracle Applications developers, the JDeveloper offline database modeler replaces the Oracle Designer repository (also referred to as CASE). Applications developers should not use SQL DDL scripts for deployment and source control of database objects, because they tend to be error-prone and do not serve as a single source of truth. Instead, developers should use the JDeveloper offline database object files.

What is the Offline Database

JDeveloper provides the tools you need to create and edit database objects, such as tables and constraints, outside the context of a database, using the offline Database model. You can create new tables and views, and generate the information to a database, or you can import database objects from a database schema, make the changes you want, and generate the changes back to the same database schema, to a new database schema, or to a file that you can run against a database at a later date.

Offline Database Model

The JDeveloper Offline database supports the following object types:
  • Function
  • Materialized View
  • Materialized View Log
  • Package
  • Procedure
  • Sequence
  • Synonym
  • Table
  • Trigger
  • Type
  • View
Currently, JDeveloper offline DB objects do not support these objects:
  • Queue
  • Queue tables
  • Policy
  • Context
However, SXML persistence files for these object types can be imported using the applxdf extension.
JDeveloper provides tools to create and edit database objects such as tables and views outside the context of a database. This tool, called the Offline Database Definition, will be used to model physical database objects in Fusion applications. The migration tool will support migrating all user-selected database objects defined in CASE to this offline database definition in JDeveloper, along with SXML/XDF1 deployment files.

Metadata SXML Comparison Tool [3]

Offline table definitions can be version controlled and shared using a source control system. If you just create objects in the DB schema via the database navigator, you have nothing to source control. JDeveloper provides a comparison tool optimized for working with offline table definitions, which handles:
  • The table data, properties, columns and constraints.
  • The identity of objects, to track name changes.
  • Checking for consistency, for example, ensuring:
    • That a column which is used in a key is not dropped.
    • That a constraint which uses an absent column is not added.
    • That a primary key column cannot be optional.
Using this comparison tool, you can compare object metadata of the same type from different databases. The comparison depends on SXML, an XML representation that maps more closely to the SQL creation DDL. Two SXML documents of the same type can be compared, producing a new SXML document that describes their differences.
Using this comparison tool, you're able to:
  • Compare object definitions in different JDeveloper projects
    • Since the objects to be compared are in separate projects, you need to create a dependency between them to be able to perform this comparison.
  • Compare versioned copies of DB objects
    • Versioning components allows you to browse through the historical changes of a component and compare these versions. With JDeveloper, it's possible to compare different versions of database models.

Working on Data Modeling at Different Levels:

  • UML class diagram
    • You can create a logical model using a UML class diagram to visually create or inspect classes, interfaces, attributes, operations, associations, inheritance relationships, and implementation relations, and transform it into offline or online database definitions later.
    • See this tutorial for how-to.
    • You usually do logical modeling with a UML class model in the following steps:
      1. Prepare the class model diagram environment
      2. Create a class model diagram
      3. Enhance the class model
      4. Transform the class model into a database model
  • Database diagram
    • You can follow [4, 5] to create new database diagram.
    • You can also drag tables, views, materialized views, synonyms, and sequences from a database schema onto a database diagram, where they become accessible as offline database objects.
  • Offline Database
    • You can create new offline database objects, or capture them from a connection to a live database. After you have finished working with them, you can generate new and updated database definitions to online database schemas or to SQL scripts.
    • You can follow the instructions in [4, 5] to create new offline database objects. When you create an offline database, you choose the database emulation (for example, Oracle11g Database Release 1) the offline database should have.
    • You can also copy offline database objects to a project. In general, it is a good idea to make sure that the offline database uses the same database emulation as the source database.
    • Note that generation to a database is not certified against non-Oracle databases.

Notes


  1. Prior to SXML migration, these were referred to as xdf (extension) files.

References

  1. http://susanduncan.blogspot.com/
  2. Database Development with JDeveloper
  3. Metadata SXML Comparison Tool
  4. Database Development with JDeveloper

Wednesday, July 13, 2011

Language Identification

Language identification is a supervised learning task. In this article, we cover a specific Processing Resource (PR) in GATE: the TextCat (or Language Identification) PR. Its documentation says that it:

Recognizes the document language using TextCat. Possible languages: german, english, french, spanish, italian, swedish, polish, dutch, norwegian, finnish, albanian, slovakian, slovenian, danish, hungarian.

N-Gram-Based Text Categorization

The TextCat PR uses N-grams for text categorization. You can find the details in this article. See the following diagram for its data flow.



There are two phases in the language identification task:
  1. Training
  2. Application

We'll discuss those in the following sections.

Training Phase

In the training phase, the goal is to generate category profiles from the given category samples. In the Language Identification PR (or TextCat PR), the categories are languages. So we take document samples from different languages (e.g., English, German) and use them to generate category profiles.

These category profiles are already provided in the TextCat PR. At runtime, the TextCat PR looks for a configuration file named textcat.conf. This file has the following content:

language_fp/german.lm    german
language_fp/english.lm english
language_fp/french.lm french
language_fp/spanish.lm spanish
language_fp/italian.lm italian
language_fp/swedish.lm swedish
language_fp/polish.lm polish
language_fp/dutch.lm dutch
language_fp/norwegian.lm norwegian
language_fp/finnish.lm finnish
language_fp/albanian.lm albanian
language_fp/slovak-ascii.lm slovakian
language_fp/slovenian-ascii.lm slovenian
language_fp/danish.lm danish
language_fp/hungarian.lm hungarian

In a sub-folder named language_fp, relative to the location of textcat.conf, there are multiple category profile files with the .lm suffix. For example, german.lm is the category profile for German and english.lm is the category profile for English.
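The textcat.conf format above is simple enough to parse by hand: each non-empty line maps a profile file path to a language name, separated by whitespace. Here is a minimal Java sketch of such a parser (the class name TextcatConf is hypothetical; the actual PR loads the file through its own MyProperties class):

```java
import java.util.*;

// Hypothetical sketch: parse textcat.conf lines into a map from
// profile file path (e.g. language_fp/english.lm) to language name.
public class TextcatConf {

    public static Map<String, String> parse(List<String> lines) {
        Map<String, String> profileToLanguage = new LinkedHashMap<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) continue;          // skip blank lines
            String[] parts = trimmed.split("\\s+");   // path, then language
            profileToLanguage.put(parts[0], parts[1]);
        }
        return profileToLanguage;
    }
}
```

The LinkedHashMap preserves the file order, so the languages come back in the same order they are listed in the configuration.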

Using the English profile as an example, its content looks like this:

_     20326
e 6617
t 4843
o 3834
n 3653
i 3602
a 3433
s 2945
r 2921
h 2507
e_ 2000
d 1816
_t 1785
c 1639
l 1635
th 1535
he 1351
_th 1333
...

On each line, there are two elements:

  • N-gram (N is from 1 to 5)
  • Frequency

N-grams are sorted in descending order of frequency. For example, the most frequent character in English documents is the space character (represented by '_'), with 20326 occurrences. The training data also shows that the most frequent 2-gram is 'e_' (the letter 'e' followed by a space).
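As a rough illustration of how such a profile could be produced, here is a hedged Java sketch (the class name NGramProfile is hypothetical; the real implementation lives in FingerPrint.java inside the TextCat jar). It counts all 1- to 5-grams of a text, with whitespace replaced by '_' as in the .lm files, and ranks them by descending frequency:

```java
import java.util.*;
import java.util.stream.*;

// Hypothetical sketch of TextCat-style profile generation.
public class NGramProfile {

    // Count every 1- to 5-gram; whitespace is normalized to '_'
    // and the text is padded with '_' at both ends, as in the .lm files.
    public static Map<String, Integer> build(String text) {
        String s = "_" + text.toLowerCase().replaceAll("\\s+", "_") + "_";
        Map<String, Integer> counts = new HashMap<>();
        for (int n = 1; n <= 5; n++) {
            for (int i = 0; i + n <= s.length(); i++) {
                counts.merge(s.substring(i, i + n), 1, Integer::sum);
            }
        }
        return counts;
    }

    // Rank N-grams by descending frequency and keep the top 'limit'
    // entries -- the on-disk order of a .lm category profile.
    public static List<String> rank(Map<String, Integer> counts, int limit) {
        return counts.entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                .limit(limit)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }
}
```

Note that the real TextCat implementation also tokenizes per word before counting; this sketch only shows the counting-and-ranking idea.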

Application Phase


In the application phase, the TextCat PR reads the learned model (i.e., the category profiles) and applies it to new data. Given a new document, we first generate a document profile (an N-gram frequency profile) in the same way as the category profiles.

The language classification task then measures profile distance: for each N-gram in the document profile, we find its counterpart in the category profile and calculate how far out of place it is.

Finally, the bubble labelled "Find Minimum Distance" simply takes the distance measures from all of the category profiles to the document profile, and picks the smallest one.
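The out-of-place measure and the minimum-distance pick described above can be sketched in a few lines of Java (class and constant names here are hypothetical; the real logic lives in TextCategorizer.java and FingerPrint.java):

```java
import java.util.*;

// Hypothetical sketch of the "out-of-place" distance between a document
// profile and a category profile, each given as a ranked N-gram list.
public class ProfileDistance {

    // Fixed penalty for an N-gram that is absent from the category profile.
    static final int MAX_OUT_OF_PLACE = 400;

    // Sum, over each document N-gram, of how far its rank differs from
    // the rank of the same N-gram in the category profile.
    public static int distance(List<String> docProfile, List<String> categoryProfile) {
        int total = 0;
        for (int docRank = 0; docRank < docProfile.size(); docRank++) {
            int catRank = categoryProfile.indexOf(docProfile.get(docRank));
            total += (catRank < 0) ? MAX_OUT_OF_PLACE : Math.abs(docRank - catRank);
        }
        return total;
    }

    // "Find Minimum Distance": pick the category whose profile is closest.
    public static String classify(List<String> docProfile,
                                  Map<String, List<String>> categories) {
        String best = null;
        int bestDist = Integer.MAX_VALUE;
        for (Map.Entry<String, List<String>> e : categories.entrySet()) {
            int d = distance(docProfile, e.getValue());
            if (d < bestDist) { bestDist = d; best = e.getKey(); }
        }
        return best;
    }
}
```

A document whose profile exactly matches a category profile has distance 0 to that category, so that language wins.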

What's in TextCat PR?

If you look inside the textcat-1.0.1.jar, you can identify the following structure:

org/
+-- knallgrau/
    +-- utils/
        +-- textcat/
            +-- FingerPrint.java
            +-- MyProperties.java
            +-- NGramEntryComparator.java
            +-- TextCategorizer.java
            +-- textcat.conf
            +-- language_fp/
                +-- english.lm
                +-- german.lm
                +-- ...
Unfortunately, you cannot find the above source files in GATE's downloads. However, after a Google search, I found them on Google Code here.

Saturday, July 9, 2011

Book Review: "Oracle Warehouse Builder 11g R2: Getting Started 2011"

What's Data Warehouse

A data warehouse is a database used for reporting. Data from the source systems is cleaned, transformed, cataloged, and made available for use by managers and other business professionals for data mining, online analytical processing, market research, and decision support (Marakas & O'Brien 2009).

The data warehouse must handle large amounts of data, and must be simple for end users to query and understand. A data warehouse maintains its functions in three layers:
  • Staging: stores raw data for use by developers (analysis and support).
  • Integration: integrates data and provides a level of abstraction from users.
  • Access: gets data out for users.


Data Warehouse vs. Transactional Database

A data warehouse (DW) is different from a transactional database (TD). A DW is a database used for reporting, where the focus is on getting data out of the system. On the other hand, a TD is concerned with daily operational processing, where the focus is on getting data into the system.

In design, a TD uses a normalized approach while a DW uses a dimensional approach.

The normalized structure divides data into entities, which creates several tables in a relational database. When applied in large enterprises, the result is dozens of tables linked together by a web of joins.

In a dimensional approach, transaction data is partitioned into facts (actual numerical measures) and dimensions (descriptive data about those measures) that place the facts in a context understandable to the end-user decision maker.

In this article, we review a book named Oracle Warehouse Builder 11g R2: Getting Started 2011, which covers Oracle Warehouse Builder (one of Oracle's data warehousing solutions).


Introduction of the Book

If you are new to data warehousing and need a way to get started, this book offers great help. The task of designing a data warehouse is nontrivial, and you can use some help from a good tool. On this, I agree with the author:
Oracle Warehouse Builder is a great application to use to build your warehouse.
This book provides good coverage of both the theory of data warehousing and practical ways of implementing a data warehouse that can be applied directly in the real world. It's the final product of the author's 27 years of working experience in the database industry and his thorough, careful research on Oracle Warehouse Builder.

The following books are perused and referenced by the author:
  • Oracle Warehouse Builder Data Modeling, ETL and Data Quality Guide
  • Oracle Warehouse Builder Concepts Guide
  • Oracle Database SQL Language Reference
  • Oracle Warehouse Builder Users Guide
  • Oracle Database Error Messages 11g Release 2 (11.2)
  • The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling


Oracle Warehouse Builder (OWB)

The Oracle Warehouse Builder is a tool provided by Oracle, which can be used at every stage of the implementation of a data warehouse, from initial design and creation of the table structure to the extract, transform, and load (ETL) process and data-quality management.

OWB has the ability to design the objects logically using cubes and dimensions in a dimensional design. It also has the ability to implement them physically in the underlying database as either a relational structure or a dimensional structure, simply by checking a box. For a relational (physical) implementation, OWB actually provides two options:
  • Pure relational option
  • Relational OLAP option
The figure below shows the OWB components and architecture:


ACME Toys and Gizmos

In this book, a fictional organization named ACME Toys and Gizmos is used to demonstrate practical ways of implementing a data warehouse that can be directly applied in the real world. This book provides explanations throughout on how to use the Oracle Warehouse Builder tool to build a data warehouse within the context of this invented company, which is involved in storefront and online Internet sales.

For instance, a customer makes a purchase of a toy with ACME Toys and Gizmos on a particular day over the Internet, which results in a dollar amount of the transaction. The dollar amount becomes the fact and the toy purchased, the customer, and the location of the purchase (the Internet in this case) become the dimensions that provide a scope of the fact measurement and give it a meaning.

The design of the data warehouse is drawn out in a star schema configuration, showing the cube surrounded by the dimensions with their individual items of information (attributes). It looks like the following:


With a star schema, managers can easily use the data for data mining, online analytical processing, market research, and decision support. For example, if a manager at ACME Toys and Gizmos needs to know which products sold well in the last quarter, the query involves only two tables: the main fact table, containing the number of items sold, and the product dimension table, containing all the information about the products.


Summary

The purpose of this book is to introduce Oracle Warehouse Builder and use it to design and build your first data warehouse. By following the author's carefully designed components and instructions, you will easily achieve that goal.


See Also

  1. Oracle Warehouse Builder (OWB) 11g
  2. Oracle Warehouse Builder 11g R2: Getting Started 2011
  3. The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling
  4. Oracle Warehouse Builder Data Modeling, ETL and Data Quality Guide
  5. Oracle Warehouse Builder Concepts Guide
  6. Oracle Database SQL Language Reference
  7. Oracle Warehouse Builder Users Guide
  8. Oracle Database Error Messages 11g Release 2 (11.2)