Sunday, April 11, 2010

The Role of URI in ROME for Duplicate Detection

As James Holderness describes in his article RSS Duplicate Detection, it's tricky to detect duplicate items in an RSS feed while still allowing an item to be updated. Because RSS lacks a standard element for this purpose, aggregators have to rely on a variety of strategies to resolve the issue. His article covers the problem only from the RSS perspective. Once other feed formats such as Atom are added, the issue becomes messier.

In this article, we'll look at ROME's strategy for dealing with this issue.

RSS Duplicate Detection

After running 150 tests on more than 20 different RSS aggregators, James has concluded that:


Supporting guids seems like an essential starting point. Also, when a feed doesn’t
contain guids, the link element is probably a good fallback (possibly combined with
or as an alternative to other elements).

ROME's Solution

ROME's Synd* beans (i.e., SyndFeed and SyndEntry) define the concept of a URI at both the feed and entry levels. The returned URI is a normalized URI as specified in RFC 2396bis. Its purpose is to help uniquely identify feeds and entries processed with ROME. This is particularly useful when a system persists feeds, or manipulates data in a way that needs a primary key for feeds or entries.

Which feed elements the URI is taken from depends on the concrete feed type (RSS or Atom). This is explained in detail in the ROME documentation, under Feed and entry URI mapping.

Sources of URI for SyndFeed:

  • atom:id element (Atom 0.3 and 1.0)
    • The "atom:id" element conveys a permanent, universally unique identifier for an entry (or feed).
    • The content of an atom:id element MUST be created in a way that assures uniqueness.
    • Instances of atom:id elements can be compared to determine whether a feed (or entry) is the same as one seen before.

Sources of URI for SyndEntry:

  • rss:guid (RSS 0.94 and 2.0)
    • guid stands for globally unique identifier. It's a string that uniquely identifies the item. When present, an aggregator may choose to use this string to determine if an item is new.
  • rss:link (RSS 0.91, RSS 0.92, RSS 0.93 & RSS 1.0)
    • The URL of the item. If the item has a link, it is used to set the uri of the entry.
  • atom:id element (Atom 0.3 and 1.0)
    • See the description above.
  • atom:link element (Atom 0.3 and 1.0)
    • Used only if the atom:id element is missing,
    • the link's relation type is alternate, and
    • exactly one such link exists.
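
The Atom entry rules above can be sketched in plain Java. This is only an illustration of the rules as listed here, not ROME's own converter code: the AtomEntryUri class, its Link type and the resolveUri method are hypothetical names, and treating a missing rel attribute as alternate follows the Atom 1.0 specification rather than anything stated in this post.

```java
import java.util.List;

final class AtomEntryUri {

    /** Hypothetical stand-in for an atom:link element; not a ROME class. */
    static final class Link {
        final String rel;   // null means the rel attribute was omitted
        final String href;

        Link(String rel, String href) {
            this.rel = rel;
            this.href = href;
        }
    }

    /** Returns the value an entry's uri would get under the rules above, or null. */
    static String resolveUri(String atomId, List<Link> links) {
        // atom:id wins whenever it is present.
        if (atomId != null) {
            return atomId;
        }
        String candidate = null;
        int alternates = 0;
        for (Link link : links) {
            // Assumption: a link without rel counts as "alternate" (Atom 1.0).
            if (link.rel == null || "alternate".equals(link.rel)) {
                alternates++;
                candidate = link.href;
            }
        }
        // The link is used only when exactly one alternate link exists.
        return alternates == 1 ? candidate : null;
    }
}
```

With no atom:id and two alternate links, resolveUri returns null, which is one of the cases where an entry ends up without a uri.
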
For RSS 0.94 and 2.0, the uri and link properties of the SyndEntry object are set according to the following algorithm:

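if (guid != null) {
    // The guid, when present, always becomes the entry's uri.
    syndEntry.setUri(guid.getValue());
    // A permalink guid also stands in for a missing link.
    if (item.getLink() == null && guid.isPermaLink()) {
        syndEntry.setLink(guid.getValue());
    }
} else {
    // Without a guid, fall back to the item's link (which may be null).
    syndEntry.setUri(item.getLink());
}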

When URI Property Could Be Missing

The uri property on SyndFeed and SyndEntry will not always be present. The cases where it is missing (i.e., uri == null) are summarized here:
  • SyndFeed
    • Every feed type other than Atom 0.3 and 1.0, plus Atom feeds that do not have atom:id set.
  • SyndEntry
    • (RSS 0.91, RSS 0.92, RSS 0.93 & RSS 1.0) and (rss:link == null)
    • (RSS 0.94 & RSS 2.0) and (rss:guid == null) and (rss:link == null)
    • (Atom 0.3 & Atom 1.0) and (atom:id == null) and (alternate link == null)

Conclusions

The uri property at the feed or entry level of ROME's Synd* beans (i.e., SyndFeed or SyndEntry) can be used as the feed's or entry's primary key. For example, it can be used to detect duplicate items in a syndication feed.
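
As a rough sketch of that use, duplicate detection can be as simple as remembering every uri already seen. The SeenEntries class below is a hypothetical helper, not part of ROME. In a real aggregator the string passed in would presumably come from SyndEntry.getUri(), and the set would be backed by persistent storage rather than memory.

```java
import java.util.HashSet;
import java.util.Set;

final class SeenEntries {

    // URIs of every entry offered so far (in memory only in this sketch).
    private final Set<String> seenUris = new HashSet<>();

    /** Returns true the first time a uri is offered and false for repeats. */
    boolean isNew(String uri) {
        if (uri == null) {
            // An entry without a uri cannot be keyed this way.
            throw new IllegalArgumentException("entry has no uri");
        }
        // Set.add returns false when the value was already present.
        return seenUris.add(uri);
    }
}
```

An entry whose content has changed but whose uri is the same is reported as already seen, which is what lets an aggregator treat it as an update instead of a new item.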

However, the uri property may be missing from a SyndFeed or SyndEntry. In that case, other elements, or combinations of them, can be used as primary keys.
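
One possible fallback chain is sketched below, purely as an assumption about what a reasonable key might look like: use the uri if it is there, then the link, then the title. The EntryKey class and the choice and order of fallback fields are hypothetical. The arguments are plain strings here, though with ROME they would presumably come from getUri(), getLink() and getTitle() on SyndEntry.

```java
final class EntryKey {

    /**
     * Builds a key for an entry, or returns null when nothing usable exists.
     * The prefix records which field the key came from, so a link can never
     * collide with a title that happens to contain the same text.
     */
    static String keyFor(String uri, String link, String title) {
        if (uri != null && !uri.isEmpty()) {
            return "uri:" + uri;
        }
        if (link != null && !link.isEmpty()) {
            return "link:" + link;
        }
        // Weakest option: two different entries may share a title.
        if (title != null && !title.isEmpty()) {
            return "title:" + title;
        }
        return null;
    }
}
```

A title-only key is fragile, so a real system would likely combine it with something else, such as the publication date, before trusting it.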
