Microsoft Communities
Posted By: Joshua Allen | Sep 24th, 2007 @ 7:07 PM

Last week we updated the Tafiti site with a new Halo 3 theme, and the Halo 3 online manual and videos site went live on Silverlight today.  It's nice to see us using Silverlight in some places, but the use of Silverlight in Tafiti raises a more interesting point -- it demonstrates that you can effectively separate presentation from content using Silverlight -- something that is necessary if we hope to see people authoring "semantic XAML" the way that the best web designers use "semantic HTML".

First I'll summarize why semantic HTML is important, and then describe some ways that the same aims could be accomplished using XAML (Silverlight's markup language).

Semantic HTML

"Semantic HTML" is the name for a set of best practices for web page authors which result in:

  • Pages which are easier to style with CSS, enabling re-skinning and reuse across multiple device form factors.
  • Pages whose base content is easier to update without breaking things.
  • Pages which are optimized for mashups and screen-scraping by tools such as jQuery and Dapper.
  • Pages which are more accessible to people with disabilities.
  • Pages which better support complete separation of code from content (another best practice).
  • Pages which are better optimized for search engines.

There is no official specification for "semantic HTML", and one could argue that the practice of "semantic HTML" is a set of hacks shoehorned into HTML that run counter to the actual intent of the original web standards (I'll address this criticism later).  Suffice it to say that, although there is no canonical reference, these are indeed best practices.  A good start is Tantek's presentation on Semantic XHTML.

Alternatives to Semantic HTML

HTML started as a minimal presentation language, with a small set of tags which were for the most part semantically rich as well as presentational.  In fact, the standard way for a search engine, database, or other tool to extract the content from an HTML document was to strip all the tags, and concatenate all text nodes in document order.  But people very soon started demanding more presentational and dynamic capabilities, and these were added directly to the document content with little to no additional semantic value.  Web sites today are often an unintelligible mishmash of intertwined content, additional semantics, and piles of extraneous and semantically meaningless stuff.
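The "strip all the tags" extraction described above can be sketched in a few lines of Python. This is only an illustration using the standard library's html.parser; the names TextExtractor and strip_tags and the sample markup are assumptions made up for the example, not anything a particular search engine is known to use.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML document in document order."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for each run of text found between tags.
        self.chunks.append(data)


def strip_tags(html):
    """Drop every tag and return the concatenated text content."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Join the text nodes in order, then collapse runs of whitespace.
    return " ".join(" ".join(parser.chunks).split())


# Hypothetical sample page, for illustration only.
page = "<h1>Halo 3</h1><p>The <b>online manual</b> is live.</p>"
print(strip_tags(page))  # Halo 3 The online manual is live.
```

Note that everything the tags expressed (which text was the heading, which was emphasized) is thrown away; only the raw content survives.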

HTML's mixture of semantic and presentational elements was considered confusing, HTML was considered too limiting for real semantics, and CSS had limited presentational capabilities.  So the authors of the web standards began looking for cleaner and better alternatives:

  • XML plus XSLT.  The W3C went to work on XML+XSLT to enable complete separation between content and presentation.  The idea was that you would use non-presentational XML for your data and documents, and then use XSLT to create HTML "views" over that data.  In this scheme, HTML was relegated to being primarily a presentational language, since the XML document contained the true semantics.  XSLT enabled far more complex visual transformations of the content, and XML gave more flexibility in defining semantics, so this would seem to have been the ideal strategy.  In fact, it is possible to build very powerful applications this way today, but it has not become the norm.
  • XML Data Islands in HTML.  Another approach is to embed the semantic XML in the HTML document, and then use the HTML purely to build the UI skeleton.  As AJAX has become popular, developers have rediscovered cross-browser techniques for doing this.  It's equally possible to bind to XML data that is external to the file, though this adds some complications.
  • RDF.  I mention this only for the sake of completeness.  The W3C, the consortium that authors the HTML, XML and RDF specifications, does not traditionally consider HTML to be about "semantics".  W3C's "Semantic Web" initiatives rely partly on XML, but primarily on RDF.  RDF is a data model for "knowledge representation".

Why Semantic HTML Wins

Despite the fact that there are cleaner, more standards-friendly ways to separate semantics and content from presentation, plain old "semantic HTML" is the preferred antidote to "ugly HTML".  Why?

The web has been optimized to handle HTML.  If you publish a document as raw XML, it's unlikely that any search engines, screen readers, or screen-scraping tools will understand enough to do anything useful with your document.  The only exception to this rule is perhaps if you use RSS or Atom -- no other XML vocabularies have enough adoption to enable web-scale semantics.

In short, "semantic HTML" can be seen as a lowest common denominator approach -- authoring your page such that it makes sense even when read as tag soup, and such that everything semantically important can easily be addressed by tools such as a screen reader or CSS stylesheet.
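To make "easily addressed by tools" concrete, here is a rough sketch of how a scraper might pull out just the semantically labeled parts of a page by class name. It assumes every tag in the input is explicitly closed; the class names and the helper names ClassTextCollector and text_by_class are illustrative assumptions, not part of any standard.

```python
from html.parser import HTMLParser


class ClassTextCollector(HTMLParser):
    """Collects the text inside every element carrying a given class name.

    Assumes all tags are explicitly closed (XHTML-style markup).
    """

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.depth = 0      # greater than 0 while inside a matching element
        self.results = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.depth:
            self.depth += 1             # nested tag inside a match
        elif self.class_name in classes:
            self.depth = 1              # entering a matching element
            self.results.append("")

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.results[-1] += data


def text_by_class(html, class_name):
    collector = ClassTextCollector(class_name)
    collector.feed(html)
    collector.close()
    return collector.results


# Hypothetical page: the class names carry the meaning, not the tag choice.
page = """
<div class="sidebar"><a href="/">Home</a></div>
<h1 class="entry-title">Towards Semantic XAML</h1>
<p class="entry-summary">Skin <em>semantic</em> HTML with XAML.</p>
"""
print(text_by_class(page, "entry-title"))    # ['Towards Semantic XAML']
print(text_by_class(page, "entry-summary"))  # ['Skin semantic HTML with XAML.']
```

The same class names that a CSS stylesheet would hook into are what let the tool ignore the sidebar and find the content.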

Towards Semantic XAML

One of the oft-cited benefits of Silverlight is the fact that the markup language, XAML, is plaintext and can be crawled by search engines.  Content rests in text nodes, just like HTML, so search engines can get good information by simply stripping off the tags.

But beyond stripping the tags to get at the content, most browser-based screen readers, search engines, and similar software can't infer any additional semantics from XAML.

XAML is an interesting case, because we designed XAML to work well for desktop applications (using WPF).  And accessibility and automation are well-solved problems for desktop applications.  WPF supports MSAA and UIA, as does Flash on Windows.  But MSAA/UIA are not cross-platform, and neither WPF nor Flash apps are typically designed to enable all of the aforementioned benefits of "semantic HTML".

Just as with web applications, smart XAML authors will work to modularize and separate presentation from content as much as possible.  As the Tafiti example demonstrates, the basic application structure can remain the same, while the visuals can be swapped out.  And already there are many examples of Silverlight applications which use XAML for the visuals and bind to data that is stored as XML.

But there is no reason to continue authoring rich applications as if the web doesn't exist.  The most interesting pattern to me is the use of semantic HTML that is "skinned" with XAML.  It is straightforward to use XAML to data bind to XML documents, and that includes XHTML.  The key insight here is that once you have applied the best practices collectively known as "semantic HTML", you have already isolated the base content and semantics that you need in a data source.
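As a rough sketch of what "semantic XHTML as a data source" means in practice, the Python below treats a well-formed XHTML page as plain XML and pulls out the records a presentation layer would bind to. It does not show Silverlight's own binding mechanism; the markup, the class names (result, title, summary), the example.com URLs and the helper functions are all assumptions invented for the illustration.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "{http://www.w3.org/1999/xhtml}"

# Hypothetical semantic XHTML page; readable as-is by browsers and crawlers.
page = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <ul class="results">
      <li class="result">
        <a class="title" href="http://example.com/halo3">Halo 3 manual</a>
        <span class="summary">Online manual and videos.</span>
      </li>
      <li class="result">
        <a class="title" href="http://example.com/tafiti">Tafiti</a>
        <span class="summary">Search with a <em>Halo 3</em> theme.</span>
      </li>
    </ul>
  </body>
</html>"""


def has_class(element, name):
    return name in element.get("class", "").split()


def find_by_class(parent, name):
    # Return the first element in the subtree carrying the class, or None.
    for element in parent.iter():
        if has_class(element, name):
            return element
    return None


def extract_results(xhtml):
    """Turn the semantic markup into plain records a skin could bind to."""
    root = ET.fromstring(xhtml)
    records = []
    for item in root.iter(XHTML_NS + "li"):
        if not has_class(item, "result"):
            continue
        title = find_by_class(item, "title")
        summary = find_by_class(item, "summary")
        records.append({
            "title": "".join(title.itertext()).strip(),
            "url": title.get("href"),
            "summary": "".join(summary.itertext()).strip(),
        })
    return records


for record in extract_results(page):
    print(record["title"], "->", record["url"])
```

The point of the design is that the XHTML stays the single source of content: a rich skin reads the records from it, while a browser or crawler without the skin still sees an ordinary, meaningful page.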

Using semantic XHTML as your bound data source, as opposed to some other XML vocabulary, gives you all of the benefits of semantic HTML with all of the rich dynamic capabilities of Silverlight.  This is essentially the approach that companies like Avenue A and Blitz have begun using to optimize RIAs for search engines, and I think this approach will continue to grow in popularity -- especially for content-oriented applications.
