The Promise of XML
 
The Internet can be described as a massive system of interconnected computers
that send millions of documents every second. These documents could be web
pages, e-mail, news postings, advanced information documents, and more. The
Internet exists as it does today because of a document format called HTML that
offers an easy way to transfer information between computers of all types.

Even though HTML has been so successful, the technology is aging. It cannot
define exactly what type of information a document contains. This leaves the
Internet very unorganized, with billions of virtually undefined documents just
waiting for someone to find them, often by chance. With the Internet in obvious
need of a better information infrastructure, a new format called XML has been
developed to address these data definition problems. XML has the potential to
change the Internet into a much more efficient medium.

The Origin of the Markup Languages

To understand these document formats, and how they compare, it is useful to
understand their relationships. HTML and XML are related formats: both descend
from a third format called SGML. HTML is an application of SGML, and XML is a
simplified subset of it. SGML itself is very complex, and its predecessor was
GML, the initial format of this family, called the markup languages.

GML is an abbreviation for Generalized Markup Language. IBM first developed it
under the name Text Description Language with the explicit purpose of storing
law office documentation. IBM quickly realized how universally useful this
format was, and began to test it on other computer applications. However, using
different document types on a computer in the early 1970s was no small task, as
computers at that time were very specialized. GML was useful because it could
easily be ported to different types of computers, with no need to build a whole
new computer system to use it.

An example of GML is demonstrated here:

    .ce Centered Title
    .ll 39
    .ss
    This is a paragraph that is single spaced with a line length of 39 characters.
  
A document created this way would then be saved as plain text. A computer with
GML capability reading this document would then process the information and
display it on the monitor or printer like this:

                Centered Title
    This is a paragraph that is single
    spaced with a line length of 39
    characters.
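
The processing step described above can be sketched in Python. This is only a
toy formatter for the three control lines in the example, not an implementation
of GML; the function name `render` and the starting width of 60 characters are
assumptions made for illustration.

```python
import textwrap

def render(source, default_width=60):
    """Format text using three control lines: .ce (center), .ll (line length), .ss (single space)."""
    width = default_width          # assumed starting width, not taken from the original
    output, paragraph = [], []

    def flush():
        # Emit any buffered paragraph text, wrapped to the current line length.
        if paragraph:
            output.extend(textwrap.wrap(" ".join(paragraph), width=width))
            paragraph.clear()

    for raw in source.splitlines():
        line = raw.strip()
        if line.startswith(".ce"):
            flush()
            output.append(line[3:].strip().center(width).rstrip())
        elif line.startswith(".ll"):
            flush()
            width = int(line[3:])
        elif line.startswith(".ss"):
            flush()                # single spacing is the only mode this toy supports
        elif line:
            paragraph.append(line)
    flush()
    return "\n".join(output)

sample = """.ce Centered Title
.ll 39
.ss
This is a paragraph that is single spaced with a line length of 39 characters."""

rendered = render(sample)
print(rendered)
```

Because `.ce` comes before `.ll 39` in the sample, this sketch centers the title
within the assumed starting width and wraps only the paragraph at 39 characters.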

This example has shown how a simple GML document can produce a formatted
document with a title and a defined paragraph. However, GML was also adapted
for use with repetitive types of data, such as lists. This application let GML
serve as a raw data storage format, a role that no markup language filled again
until XML was developed.

SGML is the second iteration of the markup language, and it dates back to its
conception in 1974 by Charles Goldfarb, a co-developer of GML at IBM. SGML
stands for Standard Generalized Markup Language, and it has become an
internationally used document standard.

The major goal of SGML was to be useful for electronic manuscripts and
documentation. SGML is still widely used for these applications today, in a
much more advanced state.

Goldfarb wanted SGML to become a standardized format to ensure its usefulness
and compatibility between computers:

    SGML is designed to make your information last longer than the computer
    systems that created it. Such longevity also implies immunity to
    short-term changes – such as a change from one application to another –
    so SGML is also inherently designed for re-purposing and portability.
    [. . .] But the real key to SGML’s success – both politically and
    technically – is the fact that SGML is a bona fide International
    Standard, not the creation of a dominant vendor or a consortium. I say
    "politically" because large users feel they can safely invest millions to
    convert to SGML because the SGML specification is stable and is
    maintained by a neutral organization.

SGML also introduced a new syntax that was much easier to read, more versatile,
and less prone to error when compared to GML. This syntax has also been carried from
SGML to HTML and XML. The best way to explain the concept is by comparison:

    “This is a statement.”

The statement is defined as a quotation because of the double quotes before and
after it. Notice also that the two double quotes differ: one is an opening
quote and the other a closing quote.

Taking this quotation and putting it in an arbitrary SGML format, it would look
like this:

    <quotation>This is a statement.</quotation>

There is a beginning tag and an ending tag, and these tags have replaced the
double quotes. However, the statement is still defined as a quotation because
of the tags surrounding it.

Multiple tags can encapsulate a statement, or any type of marked-up data. By
doing this, the data can be better defined. For example, if we wanted to define
a statement as both a quotation and as referring to Goldfarb, we could do this:

    <goldfarb>
        <quotation>This is a statement.</quotation>
    </goldfarb>
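
The nesting in this fragment can be made concrete with a short Python sketch.
It uses the standard library's XML parser as a stand-in, which works here only
because the fragment happens to be well-formed XML as well; it is not a full
SGML parser. The helper name `describe` is hypothetical.

```python
import xml.etree.ElementTree as ET

fragment = "<goldfarb><quotation>This is a statement.</quotation></goldfarb>"

def describe(element, depth=0):
    """Return one line per element, indented by its nesting depth."""
    text = (element.text or "").strip()
    lines = ["  " * depth + element.tag + (": " + text if text else "")]
    for child in element:
        lines.extend(describe(child, depth + 1))
    return lines

root = ET.fromstring(fragment)
outline = describe(root)
print("\n".join(outline))
```

The outline lists the outer tag first and the inner tag beneath it, so the
statement is reached through both the Goldfarb tag and the quotation tag.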

As shown above, tags can be nested in any order; however, it is the sequence of
the tags that defines the document. For readability, all marked-up documents
should be well organized.

The full SGML specification of today is extremely complex. For example, not
only do tags define the data; the tags themselves are defined by something
called a Document Type Definition, or DTD, resulting in a full definition
language with specific rules.

Since SGML can be so complex, computer programs have been written to aid in the
creation of SGML documents, thereby reducing errors and increasing production
speed dramatically.

The Rise of HTML

Because SGML is a very large specification, it is only suitable for industrial
and professional applications where data integrity is a priority. With the
creation of the Internet, it was obvious that SGML would not be suited for
Internet applications, so a new, simpler language derived from it had to be
created.

This new specification was called HTML, or HyperText Markup Language, and it
first appeared in the early 1990s. The World Wide Web Consortium, or W3C, an
Internet standards organization formed in 1994, went on to develop the HTML
specification as a simple language. The HTML tags were predefined by a standard
DTD to be used by all HTML documents. The tags that were defined focused mostly
on items such as titles, paragraphs, and their properties, much like the GML
example. However, simplified syntax was borrowed from SGML for embedded images
and hyperlinks, making HTML more useful as an Internet medium.

An HTML web page is displayed on a computer using a web browser. The web
browser reads HTML sent to the computer across the Internet by a server. A
server is a computer that has the task of talking to other computers by sending
and receiving data. In order to receive an HTML document, you have to request
it by typing in a web address or clicking on a link in a web page. The two
actions produce exactly the same type of request, even though they seem very
different in use.

HTML makes creating a basic web page easy because it is such a simple format.
However, making a complex web page becomes tedious because not all web browsers
read HTML the same way, understand the same tags, or conform to the standards.
This is the type of fragmentation that Goldfarb was able to avoid by setting up
a strict standards system for SGML. Even though HTML is a standard, intense
competition in the Internet software industry has caused fragmentation.

A basic HTML document can be shown by example:

    <html><body>
    <img src="johnsphoto.jpg"/>
    <b><font color="red">John Doe. </font></b>
    <a href="resume.html">Link to my Resume.</a>
    </body>
    </html>
  
This document would display a photo with a bold, red "John Doe" next to it.
Next would be a link to his resume, which is a separate HTML document. Even
with a handful of HTML, a useful web page can be made.

The Need for a New Language

A drawback of HTML is that it cannot define the data within it. The tags in
HTML say how the data should be displayed but nothing about what the data is,
which leaves it ambiguous. Searching HTML data can therefore return inaccurate
results, and time is lost sorting through them manually.

This drawback can be shown by example. Suppose we have an HTML document with
some items for sale:

    <html><body>
    <p>Red chair for sale, $40.</p>
    <p>Blue table for sale, $60.</p>
    </body>
    </html>
  
Imagine that you are a buyer looking for a red table. Your search is going to
return the above example document because it contains both the words "red" and
"table". However, this document does not contain a red table for sale, and time
was lost looking at this irrelevant page.

What if we could combine the precise data definition of SGML and the Internet
capabilities of HTML? Such a language would likely have saved us time in our
search for a red table. There is such a language, and it is XML. With XML, it
is even possible to tell what a document contains by looking only at the tags,
not the content. If we rewrite the above example in XML, we could have
something like this:

    <?xml version="1.0"?>
    <sale>
        <red>
            <chair>Red chair for sale, $40.</chair>
        </red>
        <blue>
            <table>Blue table for sale, $60.</table>
        </blue>
    </sale>
  
The sole <table> tag is nested inside <blue>, which defines that table as blue.
There is no <table> tag inside <red>, but if the seller had another table that
was red, it would go there.

Since no item is defined as a red table, a search that reads the tags would not
return this document, which saves time. In applications like this, XML can make
the Internet more useful and reliable.
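
The difference between the two searches can be sketched in Python. This is a
minimal illustration, not how any real search engine works: the function names
`keyword_match` and `structured_match` are hypothetical, and the two documents
are the HTML and XML examples from this section.

```python
import re
import xml.etree.ElementTree as ET

html_doc = """<html><body>
<p>Red chair for sale, $40.</p>
<p>Blue table for sale, $60.</p>
</body></html>"""

xml_doc = """<?xml version="1.0"?>
<sale>
  <red><chair>Red chair for sale, $40.</chair></red>
  <blue><table>Blue table for sale, $60.</table></blue>
</sale>"""

def keyword_match(document, words):
    """Naive search: strip the tags, then check that every word appears somewhere."""
    text = re.sub(r"<[^>]+>", " ", document).lower()
    found = set(re.findall(r"[a-z]+", text))
    return all(word in found for word in words)

def structured_match(document, color, item):
    """Structured search: look for an <item> element nested inside a <color> element."""
    root = ET.fromstring(document)
    return root.find(f"./{color}/{item}") is not None

print(keyword_match(html_doc, ["red", "table"]))    # True: a false positive
print(structured_match(xml_doc, "red", "table"))    # False: no red table is listed
print(structured_match(xml_doc, "blue", "table"))   # True
```

The keyword search matches the HTML page even though no red table is for sale,
while the structured search over the XML version rejects it and still finds the
blue table.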
 