Xml and More: Oracle Text (11g)

What's Oracle Text

Oracle Text is a powerful search technology built into Oracle11g Standard and Enterprise Editions. Oracle Text uses standard SQL to index, search, and analyze text stored in structured form inside Oracle database, or in unstructured form in either local file system, or on the Web.

The Text index is in the database – no separate data or index storage required. Text queries are executed in the Oracle process – no separate query processing. The Oracle optimizer is aware of the Text index, so it can choose the best execution plan for any query – gives the best performance for ad hoc queries involving Text and structured criteria.

Oracle Text search functionality includes:

Boolean operators (AND, OR, NOT, NEAR etc)
Exact phrase match
Section searching
Fuzzy (words that are spelled similarly)
Stemming (search for mice and find mouse)
Wildcard
Thesaurus (synonyms)
Stopwords
Case sensitivity
Search scoring proximity (Searches for words near one another)
Results ranking
Keyword highlighting.

The advanced features of Oracle Text include:

Search for a document by theme
Get the themes of a document
Get the gist of a document
Get the theme gist of a document
Extend the knowledge base to include terms from a particular industry
Extend the knowledge base to make new associations between terms

Finally, Oracle Text also provides:

Classification
Clustering
Query refinement

SQL API

Oracle Text’s SQL API makes it simple and intuitive for application developers and DBA’s to create and maintain Text indexes and execute Text searches.

Create a Text index

To create an Oracle Text index, simply use the CREATE INDEX command. Forn example:

CREATE INDEX description_idx ON product_information(product_description) INDEXTYPE IS CTXSYS.CONTEXT;

Execute Text Searches

Now you can search for information about all products where “monitor” is near “high resolution” in the product description:

SELECT score(1),  product_id,  product_name FROM product_information WHERE CONTAINS(product_description, 'monitorNEARhighresolution', 1) > 0 ORDERBY score(1) DESC;

Maintain a Text index

As with any full-text retrieval index, changes to the underlying data are applied to the index in batch. To synchronize the text index “index_one” using 2 megabytes of memory, use:

exec ctx_ddl.sync_index(‘index_one’,’2M’);

The synchronize can be done periodically (every 5 minutes, every hour, etc.) or immediately. Note that the catalog indextype - designed specifically for the short pieces of text typically found in eBusiness catalogs – needs no synchronization. The catalog index is always up-to-date.

Simple Management

Oracle Text management can be done using the Oracle Enterprise Manager.

Creating Indices

Oracle Text provides different index types that are suitable for different purposes. There are four Text index types:

Standard index type (context)
Catalog index type (ctxcat)
Classification index type (ctxrule)
XPath index type (ctxxpath)

Different indexing approaches will be covered in more details in later section.

You create a text index as a type of extensible index to Oracle Database using standard SQL. This means that an Oracle Text index operates like an Oracle Database index. It has a name by which it is referenced and can be manipulated with standard SQL statements.

The benefits of a creating an Oracle Text index include fast response time for text queries with the CONTAINS, CATSEARCH, and MATCHES Oracle Text operators. These operators query the CONTEXT, CTXCAT, and CTXRULE index types respectively.

Oracle Text Indexing Process

The Oracle Text indexing process is modeled after a pipeline, where data items retrieved from a data store pass through a series of transformations before their keywords are added to the index. The indexing process is split into multiple phases, which is configurable by the application developer.

The indexing process includes the following phases:

Data Retrieval: Data is simply fetched from a data store, for example, a Web page, database large object, or local file system, and passed as a stream of data to the next phase.
Filtering: The filters are responsible for converting data in different file formats to plain text. The other components in the indexing pipeline only process plain text data and don't know about file formats such as Microsoft Word or Excel.
Sectioning: The sectioner adds metadata about the structure of the original data item.
Lexing: A stream of characters is split into words based on the language of the item.
Indexing: In this final phase, the keywords are added to the actual index.

Index Types

CONTEXT Index

CONTEXT index enables users to search document collections, such as Web sites, digital libraries, or document warehouses. The collection is typically static with no significant change in content after the initial indexing run. Documents can be of any size and of different formats, such as HTML, PDF, or Microsoft Word. These documents are stored in a document table.

To query this index, the application uses the SQL CONTAINS operator in the WHERE clause of a SELECT statement.

CTXCAT index

CONTEXT index enables users to search catalog information which consists of inventory type information, such as that of an online book store or auction site. The stored information consists of text information, such as book titles, and related structured information, such as price. The information is usually updated regularly to keep the online catalog up to date with the inventory.

To query this index, the application uses the CATSEARCH operator in the WHERE clause of a SELECT statement. Queries are usually a combination of a text component and a structured component. Results are almost always sorted by a structured component, such as date or price. Good response time is always an important factor with this type of query application.

CTXRULE index

CONTEXT index enables users to classify an incoming stream or a set of documents by comparing to a pre-defined set of rules. When a document matches one or more rules, the application performs some action.

For example, assume there is an incoming stream of news articles. You can define a rule to represent the category of Finance. The rule is essentially one or more queries that select document about the subject of Finance. When a document arrives about a Wall Street earnings forecast and satisfies the rules for this category, the application takes an action, such as tagging the document as Finance or e-mailing one or more users.

To classify an incoming stream of text, use the MATCHES operator in the WHERE clause of a SELECT statement. Before you start, you need to create a table of rules and then create a CTXRULE index.

CTXXPATH Index

CTXXPATH index is provided for improving performance on XPath searches on XML documents. Note that this indextype will be deprecated and is only supported for backward compatibility with older releases of Oracle Database where XMLIndex is not available. If you are building a new application, Oracle recommends that you use the XMLIndex. Although this index type can be helpful for existsNode() queries, it is not required for XML searching. For more information on XML search, see here.

Theme Index

A document theme is a concept that is sufficiently developed in the document. By default in English and French, Oracle Text indexes theme information. Themes can be queried with the ABOUT operator.You can enable and disable theme indexing with the index_themes attribute of the BASIC_LEXER preference type. You can also index theme information in other languages provided you have loaded and compiled a knowledge base for the language.

Document Services

In addition to the search capabilities of Oracle Text, a number of other features are provided to simplify application development:

Document Format Support - In order to index documents stored in a variety of native formats, such as Word, Excel, PowerPoint, WordPerfect, HTML, and Acrobat/PDF, interMedia supplies a broad variety of "filters" that allow documents stored in their native formats to be indexed.
Document Viewing and Highlighting - Oracle Text Services can convert any supported document format to either plain text or formatted text (an HTML approximation retaining as much as possible of the original formatting; available for all formats except PDF). Both plain text and HTML versions may be viewed in a standard browser, allowing maximum flexibility in deployment, especially for public Web applications. Both plain text and HTML versions may be viewed with or without highlighting and navigation of the search words or themes. It is also possible to download the original document in its native form for viewing in either the original authoring system (Word, PowerPoint etc.) or in any third-party viewer. For viewing PDF documents, the Acrobat viewer should be used.
Document Storage - Oracle Text Services supports several document storage options. The most straightforward option is "direct" storage where documents are kept in an Oracle database table. Where this is not possible, documents may be indexed directly from a file system or URL address by simply storing a file pointer or URL in the indexed column.
Text Manager - Oracle Text supplies an administration tool, integrated with the Oracle. Enterprise Manager, through which all major text maintenance and administration functions may be performed.

Resources

1 comment:

Jenice said...: Great post. Absolutely its a great tool for searching process that has been added in 11g version of Oracle. I too have used this feature in many of the projects in which I have worked. Reading your post gave me opportunity to learn so many new facts and points about this feature that will surely help me out in learning this technology better.
sap erp 6.0; June 23, 2011 at 5:30 AM

Thursday, October 14, 2010

Oracle Text (11g)