What is XML?:
XML is a markup language for documents containing structured information.
Structured information contains both content (words, pictures, etc.) and
some indication of what role that content plays (for example, content in a
section heading has a different meaning from content in a footnote, which means
something different than content in a figure caption or content in a database
table, etc.). Almost all documents have some structure.
A markup language is a mechanism to identify structures in a document. The
XML specification defines a standard way to add markup to documents.
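As a small illustration of content plus role, the sketch below parses a made-up XML fragment with Python's standard `xml.etree.ElementTree` module. The element names (`section`, `heading`, `para`, `footnote`) and the helper `roles_and_content` are invented for this example; XML itself does not define them.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the element names are illustrative, not part of XML.
DOC = """<section>
  <heading>What is XML?</heading>
  <para>XML is a markup language.<footnote>See the XML specification.</footnote></para>
</section>"""

def roles_and_content(xml_text):
    """Return (role, content) pairs: the tag names the role, the text is the content."""
    root = ET.fromstring(xml_text)
    pairs = []
    for elem in root.iter():
        # Only the text directly after the start tag is considered here.
        text = (elem.text or "").strip()
        if text:
            pairs.append((elem.tag, text))
    return pairs

for role, content in roles_and_content(DOC):
    print(f"{role}: {content}")
```

The same words would mean something different inside `<footnote>` than inside `<heading>`; the markup is what records that difference.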
What's a Document?:
The number of applications currently being developed that are based on, or
make use of, XML documents is truly amazing (particularly when you consider that
XML is not yet a year old)! For our purposes, the word "document"
refers not only to traditional documents, like this one, but also to the myriad
of other XML "data formats". These include vector graphics,
e-commerce transactions, mathematical equations, object meta-data, server APIs,
and a thousand other kinds of structured information.
So XML is Just Like HTML?
No. In HTML, both the tag semantics and the tag set are fixed. An <h1>
is always a first-level heading and the tag <ati.product.code> is
meaningless. The W3C, in conjunction with browser vendors and the WWW
community, is constantly working to extend the definition of HTML to allow new
tags to keep pace with changing technology and to bring variations in
presentation (stylesheets) to the Web. However, these changes are always
rigidly confined by what the browser vendors have implemented and by the fact
that backward compatibility is paramount. And for people who want to
disseminate information widely, features supported by only the latest releases
of Netscape and Internet Explorer are not useful.
XML specifies neither semantics nor a tag set. In fact, XML is really a
meta-language for describing markup languages. In other words, XML provides a
facility to define tags and the structural relationships between them. Since
there's no predefined tag set, there can't be any preconceived semantics. All
of the semantics of an XML document will be defined either by the applications
that process it or by stylesheets.
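To make this concrete, here is a minimal sketch, using Python's standard `xml.etree.ElementTree` module, in which two hypothetical applications give the same element different meanings. The tag name comes from the example above; the value `X-1000`, the wrapping `order` element, and the handler functions are assumptions made up for illustration.

```python
import xml.etree.ElementTree as ET

# The tag name comes from the article; the value and handlers are hypothetical.
DOC = "<order><ati.product.code>X-1000</ati.product.code></order>"

def inventory_meaning(elem):
    # One application treats the element as a stock lookup key.
    return {"lookup_key": elem.text}

def display_meaning(elem):
    # Another application treats the same element as text to show a reader.
    return f"Product code: {elem.text}"

def interpret(xml_text, handlers):
    """Apply application-supplied handlers to elements, keyed by tag name."""
    root = ET.fromstring(xml_text)
    return [handlers[e.tag](e) for e in root.iter() if e.tag in handlers]

print(interpret(DOC, {"ati.product.code": inventory_meaning}))
print(interpret(DOC, {"ati.product.code": display_meaning}))
```

The parser accepts the element without knowing anything about it; whatever meaning it has is supplied entirely by the handler table.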
So XML Is Just Like SGML?
No. Well, yes, sort of. XML is defined as an application profile of SGML.
SGML is the Standard Generalized Markup Language defined by ISO 8879. SGML has
been the standard, vendor-independent way to maintain repositories of
structured documentation for more than a decade, but it is not well suited to
serving documents over the web (for a number of technical reasons beyond the
scope of this article). Defining XML as an application profile of SGML means
that any fully conformant SGML system will be able to read XML documents. However,
using and understanding XML documents does not require a system that is
capable of understanding the full generality of SGML. XML is, roughly speaking,
a restricted form of SGML.
For technical purists, it's important to note that there may also be subtle
differences between documents as understood by XML systems and those same
documents as understood by SGML systems. In particular, treatment of white
space immediately adjacent to tags may be different.
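The XML half of that comparison is easy to observe. The sketch below, which assumes Python's standard `xml.etree.ElementTree` parser and a made-up `para` element, shows that line breaks directly inside the tags are reported to the application as part of the content. It does not demonstrate SGML behavior; it only shows the kind of white space an SGML system may treat differently.

```python
import xml.etree.ElementTree as ET

# A line break directly follows the start tag and precedes the end tag.
DOC = "<para>\nHello\n</para>"

def text_as_reported(xml_text):
    """Return the character data exactly as the XML parser reports it."""
    return ET.fromstring(xml_text).text

# repr() makes the preserved line breaks visible.
print(repr(text_as_reported(DOC)))
```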
Why XML?:
In order to appreciate XML, it is important to understand why it was
created. XML was created so that richly structured documents could be used over
the web. The only viable alternatives, HTML and SGML, are not practical for
this purpose.
HTML, as we've already discussed, comes bound with a set of semantics and
does not provide arbitrary structure.
SGML provides arbitrary structure, but is too difficult to implement just
for a web browser. Full SGML systems solve large, complex problems that justify
their expense. Viewing structured documents sent over the web rarely carries
such justification.
This is not to say that XML can be expected to completely replace SGML.
While XML is being designed to deliver structured content over the web, some of
the very features it omits to make this practical are the ones that make SGML a
more satisfactory solution for the creation and long-term storage of complex
documents. In many organizations, filtering SGML to XML will be the standard
procedure for web delivery.
XML Development Goals:
The XML specification sets out the following goals for XML:
- It shall be straightforward
to use XML over the Internet. Users must be able to view XML documents as
quickly and easily as HTML documents. In practice, this will only be
possible when XML browsers are as robust and widely available as HTML
browsers, but the principle remains.
- XML shall support a wide
variety of applications. XML should be beneficial to a wide variety of
diverse applications: authoring, browsing, content analysis, etc. Although
the initial focus is on serving structured documents over the web, it is
not meant to narrowly define XML.
- XML shall be compatible with
SGML. Most of the people involved in the XML effort come from
organizations that have a large, in some cases staggering, amount of
material in SGML. XML was designed pragmatically, to be compatible with
existing standards while solving the relatively new problem of sending
richly structured documents over the web.
- It shall be easy to write
programs that process XML documents. The colloquial way of expressing this
goal while the spec was being developed was that it ought to take about
two weeks for a competent computer science graduate student to build a
program that can process XML documents.
- The number of optional
features in XML is to be kept to an absolute minimum, ideally zero.
Optional features inevitably raise compatibility problems when users want
to share documents and sometimes lead to confusion and frustration.
- XML documents should be
human-legible and reasonably clear. If you don't have an XML browser and
you've received a hunk of XML from somewhere, you ought to be able to look
at it in your favorite text editor and actually figure out what the
content means.
- The XML design should be
prepared quickly. Standards efforts are notoriously slow. XML was needed
immediately and was developed as quickly as possible.
- The design of XML shall be
formal and concise. In many ways a corollary to the fourth goal (that programs
which process XML should be easy to write), it essentially
means that XML must be expressed in EBNF and must be amenable to modern
compiler tools and techniques.
There are a number of technical reasons why the SGML grammar cannot
be expressed in EBNF. Writing a proper SGML parser requires handling a
variety of rarely used and difficult to parse language features. XML does
not.
- XML documents shall be easy
to create. Although there will eventually be sophisticated editors to
create and edit XML content, they won't appear immediately. In the
interim, it must be possible to create XML documents in other ways:
directly in a text editor, with simple shell and Perl scripts, etc.
- Terseness in XML markup is of
minimal importance. Several SGML language features were designed to
minimize the amount of typing required to manually key in SGML documents.
These features are not supported in XML. From an abstract point of view,
these documents are indistinguishable from their more fully specified
forms, but supporting these features adds a considerable burden to the
SGML parser (or the person writing it, anyway). In addition, most modern
editors offer better facilities to define shortcuts when entering text.
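Two of these goals, that XML documents be easy to create and that programs which process them be easy to write, can be sketched in a few lines. The example below is only an illustration in Python rather than shell or Perl; the `catalog`, `item`, `name`, and `price` element names and both helper functions are hypothetical. It builds a document with plain string formatting, then reads it back with the standard `xml.etree.ElementTree` module.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

def make_catalog(items):
    """Build an XML document from (name, price) pairs using only string formatting."""
    lines = ["<catalog>"]
    for name, price in items:
        # escape() replaces &, < and > so the text cannot be mistaken for markup.
        lines.append(f"  <item><name>{escape(name)}</name><price>{price}</price></item>")
    lines.append("</catalog>")
    return "\n".join(lines)

def item_names(xml_text):
    """Read the document back and return the item names in document order."""
    return [n.text for n in ET.fromstring(xml_text).iter("name")]

doc = make_catalog([("Nuts & Bolts", "4.50"), ("Widget", "12.00")])
print(doc)
print(item_names(doc))
```

The generated text is verbose but readable in any editor, which is the trade-off the legibility and terseness goals describe.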
How Is XML Defined?:
XML is defined by a number of related specifications:
Extensible Markup Language (XML) 1.0:
Defines the syntax of XML. The XML
specification is the primary focus of this article.
XML Pointer Language (XPointer) and XML Linking Language (XLink):
Define a standard way to represent
links between resources. In addition to simple links, like HTML's <A>
tag, XML has mechanisms for links between multiple resources and links between
read-only resources. XPointer describes how to address a resource; XLink
describes how to associate two or more resources.
Extensible Style Language (XSL):
Defines the standard stylesheet
language for XML.