XML Basics

XML Foundations (INFO 242)

Erik Wilde, UC Berkeley School of Information
2008-09-04

Creative Commons License

This work is licensed under a CC
Attribution 3.0 Unported License

Abstract

The Extensible Markup Language (XML) defines a simple way of structuring data. The power and popularity of XML can be explained by its versatility, its platform independence, the standards and technologies that build on it, and the number of tools and products that support it. Understanding XML itself is rather simple because it depends on only a very small set of other technologies, the most important being Unicode and URIs. XML itself specifies two different things: a format for structured data, called XML documents, and a constraint language for XML documents, called Document Type Definition (DTD).


Foundations for XML

Outline (Foundations for XML)

  1. Foundations for XML
    1. Unicode
    2. Uniform Resource Identifier (URI)
  2. XML
    1. XML Documents
    2. Processing XML
  3. Conclusions

Identifications


Unicode

XML's Idea of Content and Names

XML documents can use a wide array of characters. These characters are defined by Unicode, which currently (Version 5.0) defines more than 100,000 characters (character number 100,000 was added in 2005).

<?xml version="1.0" encoding="UTF-8"?>
<JAPANESE>
 <TITLE>専門家リスト </TITLE>
 <ITEM>アシム・アブドゥラー氏(コマースネット事務局長)</ITEM>
 <ITEM>アラン・A・メッコラー氏(メッコラーメディア会長兼CEO)</ITEM>
 <ITEM>アラン・サルディッチ氏(メトリコムディレクター)</ITEM>
 <ITEM>ウィスター・ウォルコット氏(パイロットネットワーク・サービシズ副社長)</ITEM>
 <ITEM>・エリック・リンゲワルド氏(ビー・インク副社長)</ITEM>
 <ITEM>ジェームス・L・バークスデール氏(ネットスケープ・コミュニケーションズ社長)</ITEM>
</JAPANESE>
japanese1.xml

<?xml version="1.0" encoding="UTF-8"?>
<文書 改訂日付="1999年3月1日">
 <題>サンプル</題>
 <段落>これはサンプル文書です。</段落>
 <!-- コメント -->
 <段落>会社名</段落>
 <図面 図面実体名="サンプル" />
</文書>
japanese2.xml
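
Non-ASCII element and attribute names need no special handling in a Unicode-aware parser. As a minimal sketch, assuming Python's standard xml.etree.ElementTree module (a tool choice that is not part of the original slides), the second document above can be parsed and its names read back as ordinary strings:

```python
import xml.etree.ElementTree as ET

# The second example document, encoded as UTF-8 bytes to match its declaration.
DOCUMENT = """<?xml version="1.0" encoding="UTF-8"?>
<文書 改訂日付="1999年3月1日">
 <題>サンプル</題>
 <段落>これはサンプル文書です。</段落>
 <!-- コメント -->
 <段落>会社名</段落>
 <図面 図面実体名="サンプル" />
</文書>""".encode("utf-8")

root = ET.fromstring(DOCUMENT)

# Element names, attribute names, and attribute values are Unicode strings.
root_name = root.tag
revision_date = root.get("改訂日付")
child_names = [child.tag for child in root]  # the comment is not reported

print(root_name, revision_date, child_names)
```

The default ElementTree parser skips comments, so only the four child elements appear in the list.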

XML and Unicode

  • XML is based on Unicode
  • How are XML documents encoded?
    • applications can use any character encoding they like
    • XML processors must support UTF-8 and UTF-16
    • XML processors may support any number of additional encodings
  • How is the encoding encoded?
    • part of the XML document: <?xml version="1.0" encoding="UTF-8"?>
    • bootstrap problem solved heuristically or by out-of-band information
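
The two mandatory encodings can be tried out directly. The following is a small sketch, assuming Python's standard xml.etree.ElementTree module; the one-element document and the helper name parse_as are made up for illustration. The same document is serialized once as UTF-8 and once as UTF-16, and the parser is left to work out the encoding from the bytes and the declaration:

```python
import xml.etree.ElementTree as ET

# Hypothetical one-element document; the encoding name is filled in below.
TEMPLATE = '<?xml version="1.0" encoding="{encoding}"?><title>サンプル</title>'


def parse_as(encoding: str) -> str:
    """Serialize the document in the given encoding, then parse the raw bytes."""
    data = TEMPLATE.format(encoding=encoding).encode(encoding)
    return ET.fromstring(data).text


utf8_text = parse_as("UTF-8")
utf16_text = parse_as("UTF-16")  # Python's UTF-16 codec prepends a byte order mark

print(utf8_text == utf16_text)
```

Both calls yield the same text: the byte sequences differ, but the character content of the parsed tree does not.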

Uniform Resource Identifier (URI)

Identifiers are Essential

  • Uniform Resource Locator (URL) is the old concept
    • introduced to distinguish between locating and naming
    • locating and naming are two ways of identification
    • URLs have been replaced by URIs; technically, URLs do not exist anymore
  • URIs identify resources
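
Locators and names share one generic syntax. As a rough illustration, assuming Python's standard urllib.parse module and an ISBN-based URN chosen purely as an example, both kinds of identifier can be split with the same function:

```python
from urllib.parse import urlsplit

# A URI that locates a resource (taken from the mixed content slide).
locator = urlsplit("http://www.w3.org/TR/xml/#sec-mixed-content")

# A URI that only names a resource (illustrative URN, not from the slides).
name = urlsplit("urn:isbn:0451450523")

print(locator.scheme, locator.netloc, locator.path, locator.fragment)
print(name.scheme, name.path)
```

The naming URI has a scheme and a path but no network location, which is all that distinguishes it syntactically from the locating one.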

XML

XML Use Cases


XML Documents

Markup?

  • Structures are encoded using special characters
    • a fundamental difference when comparing to binary formats
    • markup languages can be read and modified using text-based tools
    • programs must treat markup characters in a special way
  • Documents are content interspersed with markup (i.e., structures)
    • XML-aware software interprets the markup
    • XML-unaware software just sees a text file
    • modifications must be made in an XML-aware way (e.g., inserting AT&T as AT&amp;T)
  • You have to pay The Price for Markup
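
The AT&T case can be reproduced in a few lines. This is a sketch assuming Python's standard library (xml.etree.ElementTree and xml.sax.saxutils.escape); the element name company and the helper is_well_formed are invented for the example:

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts the text as an XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


company = "AT&T"

# XML-unaware editing: plain string concatenation, no escaping.
naive = "<company>" + company + "</company>"

# XML-aware editing: markup characters in the content are escaped first.
aware = "<company>" + escape(company) + "</company>"

print(is_well_formed(naive), is_well_formed(aware))
print(ET.fromstring(aware).text)
```

The unescaped version is rejected by the parser, while the escaped version parses and yields the original string AT&T as its text content.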

Basic Concepts

  • XML Documents have an XML declaration (optional)
  • There is exactly one document element (a.k.a. root element)
  • Elements may be nested (there is no conceptual limit)
    • elements may be repeated (they can be identified by position)
  • Elements are marked up using tags
    • most elements have content, surrounded by start and end tags
    • empty elements are allowed and may use a special notation
  • Elements may have attributes (zero to any number)
    • an attribute name can occur only once on an element (i.e., attributes cannot be repeated)
<?xml version="1.0" encoding="UTF-8"?>
<element>
 <subelement attribute="value">Content</subelement>
 <subelement a2="value2">More Content</subelement>
 <empty-element a3="v3"></empty-element>
 <empty-element a4="v4" a5="v5"/>
</element>
my-first.xml
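
To see these concepts from a program's point of view, the document above can be loaded and walked. The sketch below assumes Python's standard xml.etree.ElementTree module, which the slides do not prescribe:

```python
import xml.etree.ElementTree as ET

# The example document from above, as UTF-8 bytes.
MY_FIRST = """<?xml version="1.0" encoding="UTF-8"?>
<element>
 <subelement attribute="value">Content</subelement>
 <subelement a2="value2">More Content</subelement>
 <empty-element a3="v3"></empty-element>
 <empty-element a4="v4" a5="v5"/>
</element>""".encode("utf-8")

root = ET.fromstring(MY_FIRST)

# One (name, attributes, text) triple per child of the document element.
summary = [(child.tag, dict(child.attrib), child.text) for child in root]

for entry in summary:
    print(entry)
```

Both notations for empty elements produce the same result: an element without text content (None in this API).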

Tree Syntax

  • Markup is important, but only a notation
  • XML documents are trees with different node types
    • nodes so far: document, element, attribute, text
    XML document tree
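
The four node types named above can be counted in a parsed tree. This is a minimal sketch assuming Python's standard xml.dom.minidom module; the helper count_nodes and the shortened sample document are assumptions made for the example:

```python
from collections import Counter
from xml.dom import minidom, Node

NODE_TYPE_NAMES = {
    Node.DOCUMENT_NODE: "document",
    Node.ELEMENT_NODE: "element",
    Node.TEXT_NODE: "text",
}


def count_nodes(node, counts=None):
    """Recursively count nodes by type; attributes are counted per element."""
    if counts is None:
        counts = Counter()
    counts[NODE_TYPE_NAMES.get(node.nodeType, "other")] += 1
    if node.nodeType == Node.ELEMENT_NODE:
        counts["attribute"] += len(node.attributes)
    for child in node.childNodes:
        count_nodes(child, counts)
    return counts


# Written on one line so that no whitespace-only text nodes appear.
doc = minidom.parseString(
    '<element><subelement attribute="value">Content</subelement>'
    '<empty-element a4="v4" a5="v5"/></element>'
)
counts = count_nodes(doc)
print(dict(counts))
```

In the DOM, attributes are not children of their element, which is why the sketch counts them separately from the child traversal.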

Elements

  • Elements can use a wide variety of names
    • Allowed: html, id9832798472, _, :, こんにちは
    • Disallowed: leading digits, spaces, control characters
  • Element names usually convey some information about the content
    • this is not reliable and highly language-dependent
    • it is very useful when working with a known vocabulary
    • it is potentially harmful when working with an unknown vocabulary
  • Elements are the foundation for XML's versatility
    • they can be nested (<address><city>Berkeley</city><zip>94709</zip>…)
    • they can be repeated (<givenname>Erik</givenname><givenname>Thomas</givenname>)
    • their sequence can convey additional information (given names have a sequence)
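
Nesting, repetition, and sequence all survive parsing. The following sketch assumes Python's standard xml.etree.ElementTree module and a hypothetical person wrapper element around the fragments shown in the bullets:

```python
import xml.etree.ElementTree as ET

# The <person> wrapper is an assumption; the slides only show the fragments.
person = ET.fromstring(
    "<person>"
    "<givenname>Erik</givenname><givenname>Thomas</givenname>"
    "<address><city>Berkeley</city><zip>94709</zip></address>"
    "</person>"
)

# Repeated elements come back in document order.
given_names = [element.text for element in person.findall("givenname")]

# A repeated element can be addressed by its position (1-based).
second_given_name = person.find("givenname[2]").text

# Nested elements are reached by following the path.
city = person.find("address/city").text

print(given_names, second_given_name, city)
```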

Attributes

  • Additional information pertaining to elements
  • Traditionally, anything that is not considered content
    • SGML is a document markup language
    • XML uses SGML's concepts
    • XML has its roots in the document world
  • Elements: Content (i.e., Data); Attributes: Metadata
  • Document vocabularies often draw the line at what is textual content
 <section id="xml" author="bob">
  <title>Extensible Markup Language (XML)</title>
  <p>XML is based on SGML (Section <ref name="sgml"/>) ...</p>
  <p type="example">XML can be used ...</p>
  <section id="xml-syntax" author="dret">
   <title>XML Syntax</title>
   <p>Section <ref name="sgml-syntax"/> describes ...</p>
  </section>
 </section>
section.xml (lines 12-20)

Attribute Syntax

  • Naming rules are the same as for Elements
  • Attributes always appear within an element's start tag
  • Attributes are name/value-pairs
    • the value is enclosed in single or double quotes
  • Attribute with a single-quote value: <elem attr="Single: '"/>
  • Attribute with a double-quote value: <elem attr='Double :"'/>
  • How can attribute values contain both?

The Price for Markup

  • Markup characters have a special meaning
    • < opens a tag
    • for attribute values, quotes delimit the value
  • The literal use of a markup character requires escaping
    • XML's entities can refer to pieces of content
    • the syntax &name; refers to the entity called name
    • XML has 5 predefined entities: &lt;, &gt;, &amp;, &apos;, &quot;
  • Attribute using both kinds of quotes: <elem attr="Single ' and Double &quot;"/>
<li>Attribute using both kinds of quotes: <code>&lt;elem attr="Single ' and Double &amp;quot;"/></code></li>
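
Quoting an attribute value that contains both kinds of quotes does not have to be done by hand. As a sketch, assuming Python's standard xml.sax.saxutils.quoteattr function, the value from the bullet above can be quoted and then read back:

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr

value = "Single ' and Double \""

# quoteattr picks the delimiter and escapes whatever would clash with it.
quoted = quoteattr(value)
element_source = "<elem attr=" + quoted + "/>"

# Parsing reverses the escaping and returns the original value.
parsed_value = ET.fromstring(element_source).get("attr")

print(element_source)
print(parsed_value == value)
```

Here the function chooses double quotes as delimiters and replaces the embedded double quote with &quot;, which is the same form as the slide's example.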

Mixed Content

The term Mixed content in XML refers to elements which have text content mixed with elements. What these elements do depends on the elements smiley.gif, but the important point is that they are on the same level as the text nodes of the mixed content.

<p>The term <em>Mixed content</em> in XML refers to elements <a href="http://www.w3.org/TR/xml/#sec-mixed-content">which have text content mixed with elements</a>. What these elements do depends on the elements <img style="height : 1em" src="smiley.gif"/>, but the important point is that they are on the same level as the text nodes of the mixed content.</p>
XML tree for mixed content
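
How mixed content is exposed depends on the API. The sketch below assumes Python's standard xml.etree.ElementTree module, which does not use separate text nodes but attaches text to elements (text before the first child, tail after an element's end tag). The paragraph is a shortened version of the one above:

```python
import xml.etree.ElementTree as ET

p = ET.fromstring(
    "<p>The term <em>Mixed content</em> in XML refers to elements "
    '<a href="http://www.w3.org/TR/xml/#sec-mixed-content">which have text '
    "content mixed with elements</a>.</p>"
)
em, a = list(p)

# Text and child elements alternate on the same level inside <p>.
pieces = [p.text, em.text, em.tail, a.text, a.tail]

# Ignoring the markup yields the plain text of the paragraph.
full_text = "".join(p.itertext())

print(pieces)
print(full_text)
```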

Mixed Content Usage

  • Database people find mixed content irritating
    • cannot be easily mapped to relational structures
    • is more document-like than data-like
    • much harder to optimize for query analysis and query processing
  • Document people find mixed content very intriguing
    • textual content can still be used as simple text
    • markup provides additional information for rich text
    • start with a text-only document and use markup to add structure to it

Whitespace

  • XML documents often are pretty-printed
  • Whitespace text nodes often are not really content
    • XML whitespace characters are space, tab, newline, and carriage return
    • whitespace text nodes are text nodes containing only whitespace characters
    XML tree with whitespace text nodes
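
Whitespace text nodes become visible as soon as a pretty-printed document is parsed. This is a small sketch assuming Python's standard xml.dom.minidom module and a three-element document made up for the example:

```python
from xml.dom import minidom, Node

# A pretty-printed document: line breaks and indentation between elements.
PRETTY = """<element>
 <subelement>Content</subelement>
 <empty-element/>
</element>"""

root = minidom.parseString(PRETTY).documentElement

text_nodes = [n for n in root.childNodes if n.nodeType == Node.TEXT_NODE]

# Text nodes that consist only of space, tab, newline, or carriage return.
whitespace_only = [n.data for n in text_nodes if n.data.strip(" \t\n\r") == ""]

print(len(root.childNodes), whitespace_only)
```

The document element has five children rather than two, because the line breaks and indentation around the two child elements are reported as text nodes.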

Significant Whitespace

  • Some whitespace text nodes are relevant
  • Usually text nodes in mixed content elements

Whitespace can be very important!

<p>Whitespace <i>can be</i> <u>very</u> <b>important</b>!</p>
XML tree containing significant whitespace
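
The effect of discarding whitespace-only text in mixed content can be checked with the paragraph above. The sketch assumes Python's standard xml.etree.ElementTree module:

```python
import xml.etree.ElementTree as ET

p = ET.fromstring("<p>Whitespace <i>can be</i> <u>very</u> <b>important</b>!</p>")

texts = list(p.itertext())

# Keeping every piece of text preserves the sentence.
kept = "".join(texts)

# Dropping whitespace-only pieces glues neighbouring words together.
stripped = "".join(t for t in texts if t.strip() != "")

print(kept)
print(stripped)
```

A blanket rule such as "ignore whitespace-only text" is therefore only safe for elements that are known not to have mixed content.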

Processing XML

Observing XML Syntax

  • XML's syntax requires you to use the right characters
  • XML processors (a.k.a. XML parsers) check for these rules
    • if there are problems, the document cannot be interpreted as XML
    • otherwise, the document is said to be well-formed
  • Only well-formed documents can be regarded as a tree
    • other documents are not XML at all, even though they may be close
    • XML processors must report problems to the application (no silent recovery)
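
The all-or-nothing nature of well-formedness can be observed with any conforming parser. The following sketch assumes Python's standard xml.etree.ElementTree module; the helper check and the sample strings are made up for illustration:

```python
import xml.etree.ElementTree as ET


def check(text):
    """Return None for a well-formed document, else the parser's error message."""
    try:
        ET.fromstring(text)
    except ET.ParseError as error:
        return str(error)
    return None


samples = {
    "well-formed": "<a><b/></a>",
    "mismatched tags": "<a><b></a></b>",
    "duplicate attribute": '<a x="1" x="2"/>',
    "two document elements": "<a/><b/>",
    "unescaped ampersand": "<a>AT&T</a>",
}

results = {label: check(text) for label, text in samples.items()}

for label, message in results.items():
    print(label, "->", message or "ok")
```

The parser reports an error for each of the four broken samples instead of guessing what the author meant, and no tree is produced for them.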

Validity

  • Well-formed documents observe XML rules
    • they observe the XML syntax
    • they observe all well-formedness constraints
  • Applications require the right elements and attributes
  • Validity is a more comprehensive concept
  • Valid documents observe additional rules
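
The difference between well-formed and valid can be illustrated with a hand-written check. This is not DTD validation; it is only a sketch of the kind of additional rules a schema expresses, assuming Python's standard xml.etree.ElementTree module and a hypothetical rule set for the section vocabulary used earlier:

```python
import xml.etree.ElementTree as ET

# Hypothetical rules: allowed child elements and required attributes per element.
RULES = {
    "section": {
        "children": {"title", "p", "section"},
        "required_attributes": {"id"},
    },
}


def violations(element):
    """Collect rule violations for an element and all of its descendants."""
    problems = []
    rule = RULES.get(element.tag)
    if rule is not None:
        for name in sorted(rule["required_attributes"] - set(element.attrib)):
            problems.append(f"<{element.tag}> lacks attribute {name!r}")
        for child in element:
            if child.tag not in rule["children"]:
                problems.append(f"<{child.tag}> not allowed in <{element.tag}>")
    for child in element:
        problems.extend(violations(child))
    return problems


good = ET.fromstring('<section id="xml"><title>XML</title><p>Text</p></section>')
bad = ET.fromstring("<section><heading>XML</heading></section>")

print(violations(good))
print(violations(bad))
```

Both documents are well-formed and parse without error; only the second one breaks the additional rules.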

Semantics

  • XML is a language for encoding trees
    • elements and attributes are labeled nodes in this tree
    • the labels can be chosen freely by document authors
  • The tree's meaning is nothing XML is concerned with
    • peers must have a mutual understanding of the semantics
    • XML without mutual understanding is almost useless
    • reverse engineering often is possible, but it is risky and brittle

Conclusions

XML Documents