Content Syndication

Web Architecture (INFO 290-03)

Erik Wilde, UC Berkeley School of Information
2008-10-30

Creative Commons License

This work is licensed under a CC Attribution 3.0 Unported License

Abstract

For many information sources on the Web, it is useful to have a standardized way of subscribing to information updates. Syndication formats such as RSS and Atom can be used by these information sources to publish a feed of updated information items. While RSS and Atom are read-only formats, the Atom Publishing Protocol (AtomPub) builds on top of Atom and provides a protocol for submitting new items to feeds.


Content Feeds


Syndication Formats

Outline (Syndication Formats)

  1. Syndication Formats [18]
    1. RSS [11]
    2. Atom [7]
  2. Syndication Aggregation [7]
    1. FeedBurner [5]
  3. Atom Publishing Protocol [8]
  4. Conclusions [1]

RSS

RSS History

  • The Myth of RSS Compatibility provides a good overview
  • RSS is a textbook example of why standards are a good thing
    • RSS 0.9 was created for the My Netscape portal in March 1999
    • RSS 0.91 (a simplification) was introduced in July 1999 (as an interim solution)
    • the AOL/Netscape merger removed the format from the company's portal
    • RSS was without an owner, and different parties claimed/denied ownership
    • RSS 1.0 was created by an informal developer group
    • RSS 0.92 (and 0.93 and 0.94) were published without acknowledging RSS 1.0
    • finally, RSS 2.0 was released as a follow-up to the RSS 0.9x versions
  • Using RSS has become an exercise in managing a menagerie of versions

RSS 0.9

  • RSS means RDF Site Summary (or Rich Site Summary?)
    • based on an RDF draft and not compatible with the final RDF specification
    • RDF was considered too cumbersome and unstable
    • 0.90 (proto-RDF) was quickly replaced by the non-RDF 0.91 version
  • RSS 0.92+ versions were developed as unilateral specifications
    • starting with RSS 0.91, RSS means Rich Site Summary
    • it is no longer built on RDF, instead it simply uses XML
    • the 0.9x branch eventually was renamed to RSS 2.0

RSS 0.91 Example

<rss version="0.91">
 <channel>
  <title>XML.com</title>
  <link>http://www.xml.com/</link>
  <description>XML.com features a rich mix of information and services for the XML community.</description>
  <language>en-us</language>
  <item>
   <title>Normalizing XML, Part 2</title>
   <link>http://www.xml.com/pub/a/2002/12/04/normalizing.html</link>
   <description>In this second and final look at applying relational normalization techniques to W3C XML Schema data modeling, Will Provost discusses when not to normalize, the scope of uniqueness and the fourth and fifth normal forms.</description>
  </item>
 </channel>
</rss>
http://www.xml.com/pub/a/2002/12/18/dive-into-xml.html

RSS 1.0

  • RSS means RDF Site Summary (this time for real)
    • based on the final RDF specification and thus incompatible with any RSS 0.9x version
    • developed when the Semantic Web and RDF were first heavily marketed (1999)
    • RDF was expected to become the format for metadata on the Web
  • RSS 1.0 makes heavy use of XML Namespaces
  • RSS 1.0 introduces features which were not present in 0.91
    • date information for published items (very relevant for news feeds)
    • individual authors for various items in a feed
  • RSS 1.0 is the latest version of RDF-based RSS
    • the Semantic Web wave is not over yet, but RDF has lost its novelty appeal
    • for a more XML-oriented encoding, the RSS 0.9x branch provides a better foundation

RSS 1.0 Example

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns="http://purl.org/rss/1.0/" xmlns:dc="http://purl.org/dc/elements/1.1/">
 <channel rdf:about="http://www.xml.com/cs/xml/query/q/19">
  <title>XML.com</title>
  <link>http://www.xml.com/</link>
  <description>XML.com features a rich mix of information and services for the XML community.</description>
  <language>en-us</language>
  <items>
   <rdf:Seq>
    <rdf:li rdf:resource="http://www.xml.com/pub/a/2002/12/04/normalizing.html"/>
    <rdf:li rdf:resource="http://www.xml.com/pub/a/2002/12/04/som.html"/>
    <rdf:li rdf:resource="http://www.xml.com/pub/a/2002/12/04/svg.html"/>
   </rdf:Seq>
  </items>
 </channel>
 <item rdf:about="http://www.xml.com/pub/a/2002/12/04/normalizing.html">
  <title>Normalizing XML, Part 2</title>
  <link>http://www.xml.com/pub/a/2002/12/04/normalizing.html</link>
  <description>In this second and final look at applying relational normalization techniques to W3C XML Schema data modeling, Will Provost discusses when not to normalize, the scope of uniqueness and the fourth and fifth normal forms.</description>
  <dc:creator>Will Provost</dc:creator>
  <dc:date>2002-12-04</dc:date>
 </item>
</rdf:RDF>
http://www.xml.com/pub/a/2002/12/18/dive-into-xml.html

RSS 2.0

  • RSS now means Really Simple Syndication
    • RSS 2.0 is the continuation of the 0.91 branch (which dropped RDF)
    • together with RSS 1.0 it is the most popular version of RSS
    • migration from 0.91 to 2.0 is easily possible
  • RSS 2.0 tries to avoid the use of XML Namespaces
  • RSS 2.0 is increasingly used with extensions for vendor-specific information
    • the RSS core is minimal, so many applications need extensions
    • many extensions have overlapping functionality
    • most extensions have unclear semantics and unclear versioning policies

RSS 2.0 Example

<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
 <channel>
  <title>XML.com</title>
  <link>http://www.xml.com/</link>
  <description>XML.com features a rich mix of information and services for the XML community.</description>
  <language>en-us</language>
  <item>
   <title>Normalizing XML, Part 2</title>
   <link>http://www.xml.com/pub/a/2002/12/04/normalizing.html</link>
   <description>In this second and final look at applying relational normalization techniques to W3C XML Schema data modeling, Will Provost discusses when not to normalize, the scope of uniqueness and the fourth and fifth normal forms.</description>
   <dc:creator>Will Provost</dc:creator>
   <dc:date>2002-12-04</dc:date>
  </item>
 </channel>
</rss>
http://www.xml.com/pub/a/2002/12/18/dive-into-xml.html

The Case for Content Management

  • RSS is very rarely produced by hand
    • by definition, RSS contains redundant information for a specific purpose
  • If a Content Management System (CMS) is used, RSS can be generated
    • basic metadata can be generated by the CMS (title, author, date)
    • better tagging of content results in better tagging of feeds
    • well-tagged feeds are better foundations for large-scale reuse of feed items
  • Blogging is simply a specialized case of a CMS
    • Web-based interface for controlling everything
    • strictly time-ordered sequence of published items
    • navigation features primarily based on the time-specific facets of the blog (maybe tags)
    • all blogging tools include feed support

Consuming RSS

  • RSS feeds often have quality problems
    • surprisingly often feeds do not even deliver well-formed XML
    • the use of embedded markup in RSS is not well-defined
  • Writing an RSS reader from scratch is not a good idea
  • There are three major tasks which RSS readers must do
    1. accept non-XML RSS feeds and fix them to be XML
    2. look at the feed contents and bring them into a unified form
    3. produce a unified view of feeds regardless of the RSS version
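
Tasks 2 and 3 above can be sketched in a few lines. This is a minimal illustration, not a replacement for a real feed library: the function name `unify` and the reduced item model (title and link only) are assumptions made for this example, and it expects input that is already well-formed XML, so task 1 is not covered.

```python
import xml.etree.ElementTree as ET

RDF_ROOT = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}RDF"
RSS1 = "{http://purl.org/rss/1.0/}"
ATOM = "{http://www.w3.org/2005/Atom}"


def unify(feed_xml):
    """Return a list of {'title', 'link'} dicts for RSS 0.9x/2.0, RSS 1.0, or Atom."""
    root = ET.fromstring(feed_xml)
    items = []
    if root.tag == "rss":
        # RSS 0.91/0.92/2.0: no namespace, items are nested inside channel
        for item in root.iter("item"):
            items.append({"title": item.findtext("title"),
                          "link": item.findtext("link")})
    elif root.tag == RDF_ROOT:
        # RSS 1.0: namespaced items are siblings of channel
        for item in root.findall(RSS1 + "item"):
            items.append({"title": item.findtext(RSS1 + "title"),
                          "link": item.findtext(RSS1 + "link")})
    elif root.tag == ATOM + "feed":
        # Atom: entries carry their link in an href attribute
        for entry in root.findall(ATOM + "entry"):
            link = entry.find(ATOM + "link")
            items.append({"title": entry.findtext(ATOM + "title"),
                          "link": link.get("href") if link is not None else None})
    else:
        raise ValueError("unrecognized feed format: " + root.tag)
    return items
```

The caller sees the same list of dictionaries regardless of which format the feed was published in.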

RSS Technical Problems

  • What to put into an item's description
    • the fundamental question is whether a description is text or HTML
    • if there is no well-defined way, then interpretation is client-specific
      <description>This is a <em>very important</em> blog post …
      <description>This is a &lt;em>very important&lt;/em> blog post …
      <description>This is a blog post about <em> in RSS feeds …
      <description>This is a blog post about &lt;em> in RSS feeds …
      <description>This is a blog post about &amp;lt;em> in RSS feeds …
  • Underspecified and not very robust in various other areas
    • broken RSS is accepted by most readers (but fixing it can change the interpretation)
    • the interpretation of relative URIs is not mentioned in the specifications
    • some minimal semantics (classification) for items would be very useful
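
The description ambiguity can be made concrete with a small sketch. The helper `readings` is a hypothetical name introduced here; it returns what a reader would show if the description is treated as plain text and what it would show if the same string is treated as escaped HTML (approximated by keeping only the visible text).

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class _VisibleText(HTMLParser):
    """Collects only the character data of an HTML fragment."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def readings(description_xml):
    """Return both client-specific readings of a description element."""
    # The XML parser already removes one level of escaping.
    raw = ET.fromstring(description_xml).text or ""
    parser = _VisibleText()
    parser.feed(raw)
    parser.close()
    return {"as_text": raw, "as_html": "".join(parser.parts)}
```

For a post that talks about the em element itself, the two readings differ in a way the author cannot control: the text reading shows the tag name, while the HTML reading silently swallows it.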

RSS Political Problems

  • Multiple incompatible RSS versions are still in widespread use
    • RSS 1.0 and RSS 2.0 are incompatible by design (RDF vs. non-RDF)
    • none of the RSS versions is maintained by a universally accepted standards body
  • None of the specifications is being updated or fixed
    • some of the lessons learned by RSS deployment are not used in a new version
    • it is unlikely that a new version will be produced which merges the RSS landscape
  • Invent something new instead of trying to fix RSS
    • Atom started in 2003 (called Echo at first)
    • W3C or IETF would have been promising candidates for a new RSS
    • W3C is more formal, IETF is more developer-centered
    • IETF was chosen over W3C because of the Atom community's preferences

Atom

Atom History

  • RSS's shortcomings were very apparent and could not be fixed
  • In mid-2003, discussions started about an improved format
  • It also became apparent that the format should have a protocol
  • Atom 0.3 was released in December 2003 but had no formal home
  • IETF was chosen as the new home with a working group in June 2004
  • RFC 4287 was published in December 2005
  • AtomPub was published as RFC 5023 in October 2007

Atom vs. RSS

  • Standardized by the IETF (well-defined process)
  • Classification of entries (user-defined categories)
  • More XML-like markup design (more nesting)
  • Namespaces are used and supported as standard mechanism
  • Atom feeds must be well-formed XML (there even is a schema)
  • Interpretation of content is well-defined (various content types)
  • Support for xml:lang and xml:base
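
Support for xml:base means that relative URIs in a feed have a defined interpretation. A minimal sketch of the resolution, assuming the feed's own URI serves as the outermost base; the helper `resolve` and the relative reference `APP-Interop` are hypothetical and chosen only for illustration:

```python
from urllib.parse import urljoin


def resolve(reference, *bases):
    """Resolve `reference` against a chain of base URIs, outermost first."""
    current = ""
    for base in bases:
        # each nested xml:base is itself resolved against the enclosing base
        current = urljoin(current, base)
    return urljoin(current, reference)


absolute = resolve(
    "APP-Interop",                                # relative reference inside an entry
    "http://www.tbray.org/ongoing/ongoing.atom",  # URI of the feed document (assumed base)
    "When/200x/2007/04/02/",                      # xml:base on the entry
)
```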

Atom Example

<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en-us">
 <title>ongoing</title>
 <id>http://www.tbray.org/ongoing/</id>
 <link rel='self' href="http://www.tbray.org/ongoing/ongoing.atom"/>
 <updated>2007-04-11T12:55:09-07:00</updated>
 <author>
  <name>Tim Bray</name>
 </author>
 <subtitle>ongoing fragmented essay by Tim Bray</subtitle>
 <entry xml:base="When/200x/2007/04/02/">
  <title>Atom Publishing Protocol Interop!</title>
  <id>http://www.tbray.org/ongoing/When/200x/2007/04/02/APP-Interop</id>
  <published>2007-04-02T13:00:00-07:00</published>
  <updated>2007-04-10T14:24:00-07:00</updated>
  <category scheme="http://www.tbray.org/ongoing/What/" term="Technology/Atom"/>
  <category scheme="http://www.tbray.org/ongoing/What/" term="Technology"/>
  <category scheme="http://www.tbray.org/ongoing/What/" term="Atom"/>
  <content type="xhtml">
   <div xmlns="http://www.w3.org/1999/xhtml">
    <p>Mark your calendar: <a href="http://www.intertwingly.net/wiki/pie/April2007Interop">April 16-17 at Google</a>. <em>Everybody</em> is invited, provided they bring along an APP implementation, client or server. This was just announced a couple of days ago, and as I write this there are already <s>six</s> twelve client and <s>seven</s> fourteen server implementations signed up to be there and try to <a href="http://www.intertwingly.net/wiki/pie/InteropGrid">fill in the grid</a>. Let’s drop some names, in alphabetical order: AOL, Flock, Google, IBM, Lotus, Microsoft, Oracle, O’Reilly, Six Apart, Sun, WordPress. Um, have I mentioned that the APP is going to be huge?</p>
   </div>
  </content>
 </entry>
</feed>
atom.xml

Atom Content

  • RSS had no safe way of finding out what an entry's content is
    • this led to different implementations using heuristics to guess what the RSS author really wanted
    • one of Atom's main goals was to improve this in a well-defined way
    • Atom allows escaped markup (the only way to include non-XML HTML in an XML format)
  • Each content element should have a type (the default is text)
  • Atom's content interpretation algorithm (use first applicable rule):
    1. if type is text, no child elements are allowed (plain text content)
    2. if type is html then RSS's method of escaped markup is used
    3. if type is xhtml then there must be a div containing XHTML markup
    4. if type is an XML media type then the content should be treated as this type
    5. if type starts with text/ then no child elements are allowed
    6. for all other values, the content must be a base64-encoded entity of the specified MIME type
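
The rule order above can be sketched as a simple classifier. The function name `content_kind` and the returned labels are assumptions made for this example, and the XML media type check is simplified to the `+xml` and `/xml` suffixes.

```python
def content_kind(type_attr=None):
    """Classify an Atom content element by its type attribute (first applicable rule wins)."""
    t = (type_attr or "text").strip().lower()  # a missing type defaults to text
    if t == "text":
        return "text"
    if t == "html":
        return "escaped-html"
    if t == "xhtml":
        return "xhtml-div"
    if t.endswith("+xml") or t.endswith("/xml"):
        return "inline-xml"   # must be tested before the text/ rule (text/xml)
    if t.startswith("text/"):
        return "text"
    return "base64"
```

The order matters: `text/xml` starts with `text/`, but because the XML rule comes first it is treated as inline XML rather than as text.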

Atom Categories

  • Atom allows categories to be assigned to entries
    • each category element must have a term attribute for the category
    • an optional scheme identifies the categorization scheme (ontology, taxonomy, …)
    • an optional label attribute provides a human-readable label for the category
  • AtomPub defines a document format for Category Documents
  • Three different cases of categorization can be distinguished
    1. use a well-known scheme (such as Dublin Core)
    2. use a private but well-designed scheme (which has a URI and can be reused reliably)
    3. use tags without schemes, which then are little more than content labels
  • Widely-known tags are not easy to handle
    • they are more than just privately assigned tags
    • there is no formal scheme for them, just an emerging consensus

Switching from RSS to Atom

  • Generate both feeds but serve RSS with an HTTP redirect (301)
    • old subscribers with broken clients can still use the RSS feed
    • old subscribers with correct clients will use the Atom feed
  • Atom exposes more information than RSS (category for tags)
    • the mapping of publishing info to the feed has to be changed/extended
    • for standard metadata use Atom's built-in metadata elements
    • for application-specific metadata consider reusing an existing metadata schema
  • Atom can be used to publish snippets as well as full content
    • content allows any type of content to be used and may contain a complete entry
    • summary allows only text and should provide a condensed version of an entry
    • some Atom sources publish two feeds for summaries and content
  • Generate good Atom and downgrade it to RSS 1.0 & 2.0
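
Downgrading is mostly a matter of dropping information. The following is a rough sketch of one possible mapping from an Atom entry to an RSS 2.0 item; the function name `entry_to_rss_item` and the choice of which fields to carry over are assumptions for this example, not a prescribed mapping.

```python
import xml.etree.ElementTree as ET
from datetime import datetime
from email.utils import format_datetime

ATOM = "{http://www.w3.org/2005/Atom}"


def entry_to_rss_item(entry):
    """Build an RSS 2.0 item element from an Atom entry element."""
    item = ET.Element("item")
    ET.SubElement(item, "title").text = entry.findtext(ATOM + "title")
    for link in entry.findall(ATOM + "link"):
        # a link without rel counts as rel="alternate"
        if link.get("rel", "alternate") == "alternate":
            ET.SubElement(item, "link").text = link.get("href")
            break
    guid = ET.SubElement(item, "guid", isPermaLink="false")
    guid.text = entry.findtext(ATOM + "id")
    updated = entry.findtext(ATOM + "updated")
    if updated:
        # Atom uses RFC 3339 timestamps, RSS 2.0 uses RFC 822 style dates
        stamp = datetime.fromisoformat(updated)
        ET.SubElement(item, "pubDate").text = format_datetime(stamp)
    for category in entry.findall(ATOM + "category"):
        # scheme and label have no direct counterpart in this sketch and are dropped
        ET.SubElement(item, "category").text = category.get("term")
    return item
```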

Syndication Aggregation

End-User Aggregation

<link rel="alternate" type="application/rdf+xml" title="…" href="…" />
<link rel="alternate" type="application/rss+xml" title="…" href="…" />
<link rel="alternate" type="application/atom+xml" title="…" href="…" />
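
Link elements like these let a browser or aggregator discover the feeds that belong to a Web page. A minimal sketch of how a client might collect them, using only the standard library; the names `FeedLinkFinder` and `find_feeds` are made up for this example:

```python
from html.parser import HTMLParser

FEED_TYPES = {"application/rdf+xml", "application/rss+xml", "application/atom+xml"}


class FeedLinkFinder(HTMLParser):
    """Collects (media type, href) pairs of feed links found in an HTML page."""

    def __init__(self):
        super().__init__()
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = dict(attrs)
        rels = (attributes.get("rel") or "").lower().split()
        media_type = (attributes.get("type") or "").lower()
        href = attributes.get("href")
        if "alternate" in rels and media_type in FEED_TYPES and href:
            self.feeds.append((media_type, href))


def find_feeds(page_html):
    finder = FeedLinkFinder()
    finder.feed(page_html)
    finder.close()
    return finder.feeds
```

The href values may be relative, so a real client would still have to resolve them against the page's URI.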

Aggregation Intermediaries


FeedBurner

Fixing Feeds

Cleaning Up Feeds

Load Balancing

Providing Feed Load Balancing

Statistics/Analytics

Providing Feed Statistics

Query Capabilities

  • Feed technology is still evolving
  • Feeds are mostly viewed as ordered by time
    • allows optimization for accesses and caching
    • makes it hard to use feeds for non-timed information
  • Feeds could be ordered by any sort key
    • makes server-side feed processing much more expensive
    • enables customized feeds that are processed on the server-side

Supporting Queryable Feeds

Atom Publishing Protocol

Syndication Format Protocols


RESTified Syndication


Collections, Members, Entries, Media


Protocol Summary

Resource       HTTP Method  Representation         Description
Introspection  GET          Atom Service Document  Enumerates a set of collections and lists their URIs and other information about the collections
Collection     GET          Atom Feed              A list of members of the collection (this may be a subset of all entries in the collection)
Collection     POST         Atom Entry             Create a new entry in the collection
Member         GET          Atom Entry             Get the Atom Entry
Member         PUT          Atom Entry             Update the Atom Entry
Member         DELETE       n/a                    Delete the Atom Entry from the collection
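
The write operations of the table map directly onto HTTP requests. The sketch below only builds the requests and does not send them; the helper names are hypothetical, and a real client would also need authentication and error handling.

```python
import urllib.request

ATOM_ENTRY = "application/atom+xml;type=entry"


def create_entry(collection_uri, entry_xml):
    """POST a new Atom Entry to a collection."""
    return urllib.request.Request(
        collection_uri,
        data=entry_xml.encode("utf-8"),
        headers={"Content-Type": ATOM_ENTRY},
        method="POST",
    )


def update_entry(member_uri, entry_xml):
    """PUT a changed Atom Entry to its member URI."""
    return urllib.request.Request(
        member_uri,
        data=entry_xml.encode("utf-8"),
        headers={"Content-Type": ATOM_ENTRY},
        method="PUT",
    )


def delete_entry(member_uri):
    """DELETE a member; no representation is sent."""
    return urllib.request.Request(member_uri, method="DELETE")
```

Passing one of these objects to `urllib.request.urlopen` would perform the operation against a server.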

Service Documents

Service Documents represent server-defined groups of Collections, and are used to initialize the process of creating and editing resources.

Service Document Example

<service xmlns="http://www.w3.org/2007/app" xmlns:atom="http://www.w3.org/2005/Atom">
 <workspace>
  <atom:title>Main Site</atom:title>
  <collection href="http://example.org/reilly/main">
   <atom:title>My Blog Entries</atom:title>
   <categories href="http://example.com/cats/forMain.cats"/>
  </collection>
  <collection href="http://example.org/reilly/pic">
   <atom:title>Pictures</atom:title>
   <accept>image/*</accept>
  </collection>
 </workspace>
 <workspace>
  <atom:title>Side Bar Blog</atom:title>
  <collection href="http://example.org/reilly/list">
   <atom:title>Remaindered Links</atom:title>
   <accept>application/atom+xml;type=entry</accept>
   <categories fixed="yes">
    <atom:category scheme="http://example.org/extra-cats/" term="joke"/>
    <atom:category scheme="http://example.org/extra-cats/" term="serious"/>
   </categories>
  </collection>
 </workspace>
</service>
atom-service.xml
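
A client typically starts by reading the service document to find out which collections exist. A minimal sketch, assuming the AtomPub namespace from RFC 5023 as the default (documents written against earlier drafts use a different namespace URI, which can be passed in); the function name `list_collections` is an assumption for this example:

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"
APP_NS = "http://www.w3.org/2007/app"  # RFC 5023; earlier drafts used other URIs


def list_collections(service_xml, app_ns=APP_NS):
    """Return (workspace title, collection title, href, accepted types) tuples."""
    ns = {"app": app_ns, "atom": ATOM_NS}
    root = ET.fromstring(service_xml)
    result = []
    for workspace in root.findall("app:workspace", ns):
        ws_title = workspace.findtext("atom:title", default="", namespaces=ns)
        for collection in workspace.findall("app:collection", ns):
            accepts = [(a.text or "").strip()
                       for a in collection.findall("app:accept", ns)]
            result.append((
                ws_title,
                collection.findtext("atom:title", default="", namespaces=ns),
                collection.get("href"),
                accepts,
            ))
    return result
```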

Category Documents


Category Document Example

<app:categories xmlns:app="http://www.w3.org/2007/app" xmlns="http://www.w3.org/2005/Atom" fixed="yes" scheme="http://example.com/cats/big3">
 <category term="animal"/>
 <category term="vegetable"/>
 <category term="mineral"/>
</app:categories>
atom-category.xml
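
In RFC 5023, fixed="yes" marks the listed categories as a closed set, while the default allows other categories as well. A small sketch of how a client could check a term before posting; the function name `term_allowed` is hypothetical, and schemes are ignored to keep the example short:

```python
import xml.etree.ElementTree as ET

ATOM_CATEGORY = "{http://www.w3.org/2005/Atom}category"


def term_allowed(categories_xml, term):
    """Return True if `term` may be used according to a Category Document."""
    root = ET.fromstring(categories_xml)
    listed = {category.get("term") for category in root.findall(ATOM_CATEGORY)}
    is_fixed = root.get("fixed") == "yes"
    # an open set accepts any term, a fixed set only the listed ones
    return term in listed or not is_fixed
```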

Conclusions

Semantic Web Light