Sunday, April 11, 2010

The Role of URI in ROME for Duplicate Detection

As James Holderness describes in his article RSS Duplicate Detection, it's tricky to detect duplicate items in an RSS feed while still allowing an item to be updated. Because RSS lacks a standard element for this purpose, RSS aggregators have to rely on different strategies to resolve the issue. His article covers the issue only from the RSS perspective. Once other feed types such as Atom are added, the issue becomes messier.

In this article, we'll look at ROME's strategy for dealing with this issue.

RSS Duplicate Detection

After running 150 tests on more than 20 different RSS aggregators, James has concluded that:


Supporting guids seems like an essential starting point. Also, when a feed doesn’t
contain guids, the link element is probably a good fallback (possibly combined with
or as an alternative to other elements).

ROME's Solution

The ROME Synd* beans (i.e., SyndFeed and SyndEntry) define the concept of a URI at both the feed and entry levels. The returned URI is a normalized URI as specified in RFC 2396bis. The purpose of this URI is to help uniquely identify feeds and entries when they are processed with ROME. This is particularly useful when a system persists feeds, or manipulates data in a way that needs a primary key for the feeds or the entries.

How the URI maps to the underlying feed elements depends on the concrete feed type (RSS or Atom). This is explained in detail in the ROME documentation, Feed and entry URI mapping.

Sources of URI for SyndFeed:

  • atom:id element (Atom 0.3 and 1.0)
    • The "atom:id" element conveys a permanent, universally unique identifier for an entry (or feed).
    • The content of an atom:id element MUST be created in a way that assures uniqueness.
    • Instances of atom:id elements can be compared to determine whether a feed (or entry) is the same as one seen before.
Sources of URI for SyndEntry:
  • rss:guid (RSS 0.94 and 2.0)
    • guid stands for globally unique identifier. It's a string that uniquely identifies the item. When present, an aggregator may choose to use this string to determine if an item is new.
  • rss:link (RSS 0.91, RSS 0.92, RSS 0.93 & RSS 1.0)
    • The URL of the item. Will be used to set uri of the entry if one exists.
  • atom:id element (Atom 0.3 and 1.0)
    • See the description above.
  • atom:link element (Atom 0.3 and 1.0)
    • used only if the atom:id element is missing,
    • if the link's relation type is alternate, and
    • if one and only one such link exists
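
The Atom fallback rule in the list above can be sketched as follows. This is only an illustration, not ROME's converter code: the Link class and the resolveUri method are hypothetical stand-ins, and treating a link without a rel attribute as an alternate link is an assumption taken from the Atom 1.0 specification.

```java
import java.util.ArrayList;
import java.util.List;

class AtomEntryUriSketch {

    /** Hypothetical stand-in for an Atom link element; not a ROME class. */
    static final class Link {
        final String rel;
        final String href;

        Link(String rel, String href) {
            this.rel = rel;
            this.href = href;
        }
    }

    /**
     * Returns atom:id when it is present. Otherwise returns the href of the
     * alternate link, but only if exactly one alternate link exists.
     * Returns null when neither source is available.
     */
    static String resolveUri(String atomId, List<Link> links) {
        if (atomId != null) {
            return atomId;
        }
        List<Link> alternates = new ArrayList<>();
        for (Link link : links) {
            // Assumption: a missing rel means rel="alternate" (Atom 1.0).
            if (link.rel == null || "alternate".equals(link.rel)) {
                alternates.add(link);
            }
        }
        return alternates.size() == 1 ? alternates.get(0).href : null;
    }
}
```

With two alternate links and no atom:id, this sketch returns null, which matches the "one and only one" condition above.
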
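For RSS 0.94 and 2.0, the uri and link properties of the SyndEntry object are set by the following algorithm:

// guid and item come from the parsed RSS item
if (guid != null) {
    // The guid always becomes the entry's uri
    syndEntry.setUri(guid.getValue());
    // A permalink guid doubles as the link when the item has none
    if (item.getLink() == null && guid.isPermaLink()) {
        syndEntry.setLink(guid.getValue());
    }
} else {
    // Without a guid, fall back to the item's link
    syndEntry.setUri(item.getLink());
}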

When URI Property Could Be Missing

The uri property on SyndFeed and SyndEntry will not always be present. The cases where it is missing (i.e., uri == null) are summarized here:
  • SyndFeed
    • Every syndication feed type other than Atom 0.3 and 1.0. For Atom feeds, uri is missing only when atom:id is not set.
  • SyndEntry
    • (RSS 0.91, RSS 0.92, RSS 0.93 & RSS 1.0) and (rss:link == null)
    • (RSS 0.94 & RSS 2.0) and (rss:guid == null) and (rss:link == null)
    • (Atom 0.3 & Atom 1.0) and (atom:id == null) and (alternate link == null)

Conclusions

The uri property that ROME exposes at the feed or entry level (i.e., on SyndFeed or SyndEntry) can be used as the feed's or entry's primary key. For example, it can be used to detect duplicate items in a syndication feed.

However, the uri property may be missing from a SyndFeed or SyndEntry. In that case, other elements or combinations of them may be used as primary keys.
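
As a rough illustration of these conclusions, the sketch below filters duplicates using uri as the key. The Entry class is a hypothetical stand-in for the few SyndEntry properties involved, and falling back to link plus title when uri is null is only one possible choice, assumed here for the example.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class DuplicateFilterSketch {

    /** Hypothetical stand-in for the SyndEntry properties used here. */
    static final class Entry {
        final String uri;
        final String link;
        final String title;

        Entry(String uri, String link, String title) {
            this.uri = uri;
            this.link = link;
            this.title = title;
        }
    }

    /** Uses uri as the key; falls back to link plus title when uri is null. */
    static String keyOf(Entry entry) {
        if (entry.uri != null) {
            return "uri:" + entry.uri;
        }
        // Assumed fallback: any other combination of elements could be used.
        return "fallback:" + entry.link + "|" + entry.title;
    }

    /** Keeps the first entry seen for each key and drops later ones. */
    static List<Entry> dropDuplicates(List<Entry> entries) {
        Set<String> seen = new HashSet<>();
        List<Entry> unique = new ArrayList<>();
        for (Entry entry : entries) {
            if (seen.add(keyOf(entry))) {
                unique.add(entry);
            }
        }
        return unique;
    }
}
```

A system that persists entries would keep the set of seen keys in its store rather than in memory, and would also need a policy for updated items that reuse the same key.
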

Friday, April 9, 2010

Understanding Module Implementation in ROME

ROME is a set of Atom/RSS Java utilities that make it easy to work in Java with most syndication formats. It provides a Java-friendly abstraction layer on top of the various syndication specifications that maps the commonalities of the feed formats into a single, simple JavaBeans data model.

ROME is designed to be extensible. It uses a plugin mechanism as described here. Support for all the feed types (RSS and Atom) is implemented by plugins.

The article How Rome works describes what happens during ROME newsfeed parsing:

  1. Your code calls SyndFeedInput to parse a Newsfeed, for example (see also Using Rome to read a syndication feed):

    SyndFeedInput input = new SyndFeedInput();
    SyndFeed feed = input.build(new XmlReader(feedUrl));

  2. SyndFeedInput delegates to WireFeedInput to do the actual parsing.
  3. WireFeedInput uses a PluginManager of class FeedParsers to pick the right parser and then calls that parser to parse the Newsfeed.
  4. The appropriate parser parses the Newsfeed, using JDOM, into a WireFeed. If the Newsfeed is in an RSS format, the WireFeed is of class Channel and contains Items, Clouds, and other RSS things from the com.sun.syndication.feed.rss package. If the Newsfeed is in Atom format, the WireFeed is of class Feed from the com.sun.syndication.feed.atom package. In the end, WireFeedInput returns a WireFeed.
  5. SyndFeedInput uses the returned WireFeed to create a SyndFeedImpl, which implements SyndFeed. SyndFeed is an interface, the root of an abstraction that represents a format-independent Newsfeed.
  6. SyndFeedImpl uses a Converter to convert between the format-specific WireFeed representation and a format-independent SyndFeed.
  7. SyndFeedInput returns to you a SyndFeed containing the parsed Newsfeed.


How the Extensibility Is Supported

Using parsing as an example, the key implementation is the FeedParsers class (a subclass of PluginManager). At runtime, the parsers that support the different feed types are identified and created on demand using the context ClassLoader of the current thread. Parser classes are listed in the properties files (i.e., rome.properties) as below:
WireFeedParser.classes=com.sun.syndication.io.impl.RSS090Parser \
com.sun.syndication.io.impl.RSS091NetscapeParser \
com.sun.syndication.io.impl.RSS091UserlandParser \
com.sun.syndication.io.impl.RSS092Parser \
com.sun.syndication.io.impl.RSS093Parser \
com.sun.syndication.io.impl.RSS094Parser \
com.sun.syndication.io.impl.RSS10Parser \
com.sun.syndication.io.impl.RSS20wNSParser \
com.sun.syndication.io.impl.RSS20Parser \
com.sun.syndication.io.impl.Atom10Parser \
com.sun.syndication.io.impl.Atom03Parser
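
The loading mechanism described above can be sketched with plain JDK classes. This is not the PluginManager source: the PluginLoaderSketch class and its loadPlugins method are hypothetical, and the sketch only assumes that each listed class has a public no-argument constructor.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

class PluginLoaderSketch {

    /**
     * Reads a whitespace-separated list of class names from the given
     * property and instantiates each class through the context ClassLoader
     * of the current thread.
     */
    static List<Object> loadPlugins(String propertiesText, String key) {
        try {
            Properties props = new Properties();
            // Properties.load joins lines that end with a backslash.
            props.load(new StringReader(propertiesText));
            String value = props.getProperty(key, "").trim();
            List<Object> plugins = new ArrayList<>();
            if (value.isEmpty()) {
                return plugins;
            }
            ClassLoader loader = Thread.currentThread().getContextClassLoader();
            for (String className : value.split("\\s+")) {
                Class<?> type = Class.forName(className, true, loader);
                plugins.add(type.getDeclaredConstructor().newInstance());
            }
            return plugins;
        } catch (IOException | ReflectiveOperationException e) {
            throw new IllegalStateException("Could not load plugins for " + key, e);
        }
    }
}
```

Passing the text of a rome.properties file and the key WireFeedParser.classes would yield one instance per listed parser, in the order in which they are listed, provided the ROME classes are on the classpath.

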
In the parser-selection step described above, the WireFeedInput class picks the right parser based on the default namespace declaration in the document (i.e., the XML feed). For example, the following document is an Atom 1.0 feed:

<feed xmlns="http://www.w3.org/2005/Atom">
...
</feed>

and WireFeedInput will choose com.sun.syndication.io.impl.Atom10Parser as its parser.
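
To make the namespace-based selection concrete, here is a minimal sketch that uses the JDK's DOM parser rather than JDOM. It is not how ROME implements the lookup: FeedTypeSketch and detect are hypothetical names, only the two Atom namespaces are mapped, and any other document simply yields null.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class FeedTypeSketch {

    // Namespace of the root element mapped to the feed type prefix.
    private static final Map<String, String> TYPES_BY_NAMESPACE = new HashMap<>();
    static {
        TYPES_BY_NAMESPACE.put("http://www.w3.org/2005/Atom", "atom_1.0");
        TYPES_BY_NAMESPACE.put("http://purl.org/atom/ns#", "atom_0.3");
    }

    /** Returns the feed type for the root element's namespace, or null if unknown. */
    static String detect(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required for getNamespaceURI()
            Element root = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
            if (!"feed".equals(root.getLocalName())) {
                return null;
            }
            return TYPES_BY_NAMESPACE.get(root.getNamespaceURI());
        } catch (Exception e) {
            throw new IllegalArgumentException("Document could not be parsed", e);
        }
    }
}
```
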

Module

Modules are supported in RSS 1.0, RSS 2.0, Atom 0.3, and Atom 1.0. The primary objective of modules is to extend the basic XML schema established for more robust syndication of content. This inherently allows for more diverse, yet standardized, transactions without modifying the core syndication specification.

To establish this extension, a tightly controlled vocabulary for the module is declared through an XML namespace, which gives names to concepts and to the relationships between those concepts. For example, the GeoRSS module used later in this article is declared under the namespace http://www.georss.org/georss.

The extensibility of ROME also includes support for module plugins. There are two types of module plugins:
  1. Module parser plugins
  2. Module generator plugins
Both types of module plugins can be defined at feed and item (or entry) level.

Module Plugins

When a parser is instantiated, the modules for the same feed type are identified and created on demand using the context ClassLoader of the current thread. Module classes are also listed in the properties files (i.e., rome.properties) as below:

atom_1.0.feed.ModuleParser.classes=com.sun.syndication.feed.module.georss.SimpleParser \
com.sun.syndication.feed.module.georss.W3CGeoParser
atom_1.0.item.ModuleParser.classes=com.sun.syndication.feed.module.georss.SimpleParser \
com.sun.syndication.feed.module.georss.W3CGeoParser

As shown above, two module parser plugins, SimpleParser and W3CGeoParser, are specified for the Atom 1.0 feed type at both the feed and item levels. Similarly, module generator plugins can be specified like this:

atom_1.0.feed.ModuleGenerator.classes=com.sun.syndication.feed.module.georss.SimpleGenerator \
com.sun.syndication.feed.module.georss.W3CGeoGenerator \
com.sun.syndication.feed.module.georss.GMLGenerator
atom_1.0.item.ModuleGenerator.classes=com.sun.syndication.feed.module.georss.SimpleGenerator \
com.sun.syndication.feed.module.georss.W3CGeoGenerator \
com.sun.syndication.feed.module.georss.GMLGenerator
To specify module parser or generator plugins for other feed types, just replace the type prefix (i.e., atom_1.0) with other types:
  • atom_0.3
  • rss_1.0
  • rss_2.0
In the above, we have used the GeoRSS modules as examples. With GeoRSS modules, users can quickly and easily add location information to their existing feeds in an interoperable manner, as shown in the example below:

<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom"
xmlns:georss="http://www.georss.org/georss"
xmlns:gml="http://www.opengis.net/gml">
<title>Earthquakes</title>

<subtitle>International earthquake observation labs</subtitle>
<link href="http://example.org/"/>
<updated>2005-12-13T18:30:02Z</updated>
<author>
<name>Dr. Thaddeus Remor</name>
<email>tremor@quakelab.edu</email>
</author>
<id>urn:uuid:60a76c80-d399-11d9-b93C-0003939e0af6</id>
<entry>
<title>M 3.2, Mona Passage</title>
<link href="http://example.org/2005/09/09/atom01"/>
<id>urn:uuid:1225c695-cfb8-4ebb-aaaa-80da344efa6a</id>
<updated>2005-08-17T07:02:32Z</updated>
<summary>We just had a big one.</summary>
<georss:where>
<gml:Point>
<gml:pos>45.256 -71.92</gml:pos>
</gml:Point>
</georss:where>
</entry>
</feed>
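
To give a feel for what a module parser has to extract from such a feed, the sketch below reads the first gml:pos value with the JDK's DOM parser. It does not use the GeoRSS module's API: GeoPointSketch and firstPosition are hypothetical names chosen for this illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class GeoPointSketch {

    static final String GML_NS = "http://www.opengis.net/gml";

    /**
     * Returns {latitude, longitude} from the first gml:pos element,
     * or null when the document contains none.
     */
    static double[] firstPosition(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // needed to look elements up by namespace
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList positions = doc.getElementsByTagNameNS(GML_NS, "pos");
            if (positions.getLength() == 0) {
                return null;
            }
            // gml:pos holds "latitude longitude" separated by whitespace.
            String[] parts = positions.item(0).getTextContent().trim().split("\\s+");
            return new double[] {
                Double.parseDouble(parts[0]),
                Double.parseDouble(parts[1])
            };
        } catch (Exception e) {
            throw new IllegalArgumentException("Document could not be parsed", e);
        }
    }
}
```

Because the lookup goes through the namespace URI rather than the gml prefix, it keeps working when a feed binds the same namespace to a different prefix.
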

Summary

The default plugins definition file, com/sun/syndication/rome.properties, is included in the ROME JAR file and is the first plugins definition file to be processed. It defines the default parsers, generators, and converters for the feeds and modules that ROME provides.

After loading the default plugins definition file, ROME looks for additional plugins definition files in all the CLASSPATH entries, this time at root level (/rome.properties), and appends their plugin definitions to the existing ones. Note that if there are several /rome.properties files in different CLASSPATH entries, all of them are processed. The order of processing depends on how the ClassLoader processes the CLASSPATH entries, which is normally the order in which the entries appear in the CLASSPATH.

The plugin classes are then loaded and instantiated. All plugins have some kind of primary key. In the case of parsers, generators, and converters, the primary key is the type of feed they handle. In the case of modules, the primary key is the module URI.
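
The appending behavior summarized above can be sketched as a merge over already-loaded definition files. This is only an illustration with hypothetical names (PluginDefinitionsSketch, merge). A real loader would first have to locate the files, for example with ClassLoader.getResources, and that part is left out here.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;

class PluginDefinitionsSketch {

    /**
     * Merges plugin definition files in processing order: for each property
     * key, the class names from later files are appended to the class names
     * from earlier files.
     */
    static Map<String, List<String>> merge(List<Properties> filesInProcessingOrder) {
        Map<String, List<String>> merged = new LinkedHashMap<>();
        for (Properties file : filesInProcessingOrder) {
            for (String key : file.stringPropertyNames()) {
                String value = file.getProperty(key).trim();
                if (value.isEmpty()) {
                    continue;
                }
                List<String> classes = merged.computeIfAbsent(key, k -> new ArrayList<>());
                for (String className : value.split("\\s+")) {
                    classes.add(className);
                }
            }
        }
        return merged;
    }
}
```

The first element of the input list plays the role of the default file from the ROME JAR, and the remaining elements stand for the /rome.properties files in CLASSPATH order.
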