What SGML Can Teach Us About XML & the Web
Interviewed by Tony Byrne,
Founder, Principal, CMSWatch
Betty Harvey is founder and president of Electronic Commerce Connection, Inc. (ECC), a consulting and training firm specializing in SGML/XML technologies.
CMSWatch: When did you start working with markup languages?
And how did you fall in love with them?
Harvey: In 1992, I was working with the US Navy in scientific and engineering
computer support. I loved that job, but after changes to the organization and
the supercomputing center being moved, I decided to switch to something different.
At the time, we had this other project -- which was kind of a black hole --
called “CALS” (computer-aided logistics support), which included EDI and SGML.
In our organization, it was always joked that if you worked with CALS it was
the end of your career -- it was reputed to be a dead-end job. I thought I
would just move there until I could find something else, but it was actually
kind of interesting because we were also responsible for the CALS standards
for the Navy. These were technical publishing standards based in part on SGML.
But I didn’t fall in love with SGML. There were problems with SGML because
it was way too expensive, it was way too complicated due to all the various
options, and there was limited vendor support. You couldn’t do it for under
a million dollars. But at the same time, while I was still working in the CALS
arena, the Web hit, in 1994, and HTML proved that SGML could work, somehow.
At the first World Wide Web conference in 1994 it clicked: SGML is going to
revolutionize the way we think of information, and HTML proved you
didn’t need a million dollars to get involved in it. That was like a lightbulb
to me. I got involved in an “SGML on the Web” initiative that helped pave the
way for XML.
CMSWatch: So what are the key differences between SGML and XML?
Harvey: Fundamentally, XML has taken SGML and made it less complicated,
but more standardized. Especially if you take just XML 1.0, anybody can learn
it. Yet everything has to be well formed, whereas in SGML that wasn’t the case.
In SGML you could eliminate the end tags, you could eliminate the beginning
tags, or you could eliminate both tags. It was crazy.
So it’s stricter – which is good -- but it’s more flexible in other ways.
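To make the well-formedness point concrete, here is a minimal sketch in Python, assuming the standard-library xml.etree.ElementTree parser. The sample markup and the helper name is_well_formed are invented for illustration. It shows an XML parser rejecting the kind of end-tag omission that SGML permitted.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample markup, invented for illustration.
WELL_FORMED = "<para>Every element is <emph>closed</emph>.</para>"
# SGML-style tag omission: the </emph> end tag is left out.
OMITTED_END_TAG = "<para>Every element is <emph>closed.</para>"


def is_well_formed(text: str) -> bool:
    """Return True if an XML parser accepts the text, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


print(is_well_formed(WELL_FORMED))      # the fully tagged version parses
print(is_well_formed(OMITTED_END_TAG))  # the omitted end tag is rejected
```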
CMSWatch: Many of us came to XML via Web development. What
lessons does SGML impart to anyone thinking of working with XML?
Harvey: Perhaps the biggest lesson – if you look at the history of
SGML – is to be wary of unnecessary complexity around specifications.
The real danger for a developer is that you don’t know which standard you should
work under. Let’s take XML transport: should you use SOAP or ebXML for a transport/routing
protocol, or should you use XML-RPC instead? And some of these specs aren’t
finished, yet people are developing to them. But you don’t know whether,
five years down the road, vendors will support these specs.
If you look at SGML, we had a similar problem. When SGML just started, formatting
was supposed to remain separate from the content. But in reality, you need
something to display this data – no one is going to pick up an SGML document
and look at it in the raw SGML, unless you’re a geek. Same goes for XML. So
you need a way of displaying it. There was an ISO standard called “DSSSL” --
Document Style Semantics and Specification Language -- but it took DSSSL ten years to get
through the ISO world and there were modifications along the way – similar to
what has happened with the W3C specs. In the meantime, the Department of Defense
(where I worked) said, “we can’t wait for DSSSL -- we’re going to do our own
standard,” called FOSI, Formatting Output Specification Instance.
They developed FOSI, and vendors were helping the Defense Department define
it and develop product around it, but every vendor dropped out, except two --
ArborText and DataLogics -- who both still support it. But the spec went through
with ambiguities, and FOSI implementations between the two remaining vendors
I think history is repeating itself, especially where schemas are concerned.
There is a real danger of the same mistakes being made again: The specs revolving
around XML are conflicting, complicated, and in some cases ambiguous, and you
really don’t realize that until you start to use multiple products. So you’re
still in danger of being locked into a particular vendor, which was just what
XML was supposed to solve. It was supposed to be vendor neutral.
CMSWatch: Then how closely does the manager of a major corporate
e-business effort need to track major XML-based schemas and languages emerging
in her industry?
Harvey: She could spend her whole career doing nothing but that. I
think it’s important to be aware of what’s going on, but that’s mostly
I have always stressed that the important thing is your information.
If you have your information structured in a way that is meaningful to you,
then going to an industry standard is just another transformation that you can
make -- like the transformation to HTML. It’s nothing you should find daunting.
The thing with these standards is that you have 100 to 300 to -- in the case
of ebXML -- a thousand people working on them. Standards become inflexible
and incorporate all kinds of different features that don’t make sense for an
individual organization. So what you really want to do is keep your organization’s
information in mind, but still recognize how your data can be taken to that
industry standard. Remember that in lots of cases, industry standards have
not been finalized; they’re being worked on – so they’re going to change.
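As a rough illustration of treating an industry standard as "just another transformation," the sketch below renames the elements of an in-house record into a different vocabulary. Every element name and the mapping itself are hypothetical, and a real industry schema would usually need structural changes as well as renaming.

```python
import xml.etree.ElementTree as ET

# Hypothetical in-house record; element names are invented for illustration.
IN_HOUSE = "<part><partno>A-100</partno><descr>Valve</descr></part>"

# Hypothetical mapping from in-house names to an imagined industry vocabulary.
TAG_MAP = {"part": "Item", "partno": "ItemID", "descr": "Description"}


def to_industry(xml_text: str, tag_map: dict) -> str:
    """Rename elements according to tag_map, leaving the content untouched."""
    root = ET.fromstring(xml_text)
    for element in root.iter():
        element.tag = tag_map.get(element.tag, element.tag)
    return ET.tostring(root, encoding="unicode")


print(to_industry(IN_HOUSE, TAG_MAP))
```

If the industry vocabulary changes, only the mapping needs to change; the in-house source stays as it is.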
CMSWatch: If I have an SGML-based document repository, when
should I seriously consider migrating it to XML?
Harvey: First, you have to think about your current infrastructure.
If you’ve put $1-2 million into your repository, and your authors are working
fine in that arena, then there is less incentive to go with XML. Now if you
have to migrate – a good example would be if your SGML repository resides
within a vendor platform that is out of business and you must change software
regardless – that would be a good time to convert to XML. This is not a big
deal, actually. But the time to do it is when you want to look at new software.
CMSWatch: If you had 30 seconds to explain the rationale
for XML to a non-technical business manager, what do you say?
Harvey: Repurposing of information is typically the biggest advantage
of going to XML. You can take it to traditional paper, you can take it to Web,
you can take it to wireless applications, and you can take it to eBooks.
And you have one source that goes to all these different formats or areas.
Of course, in the future -- five or ten years down the road -- we don’t know
where the biggest bang for your buck is going to be, but we know that if your
data is in XML, you can get there.
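The single-source idea can be sketched in a few lines of Python. This is only an illustration: the sample document, its element names, and the two output functions are assumptions, not anything described in the interview. One XML source is rendered as both HTML and plain text.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical single-source document; element names are invented.
SOURCE = (
    "<article><title>Pump Maintenance</title>"
    "<para>Check seals &amp; gaskets monthly.</para></article>"
)


def to_html(root: ET.Element) -> str:
    """Render the source as a small HTML fragment for Web delivery."""
    title = escape(root.findtext("title", default=""))
    paras = "".join(f"<p>{escape(p.text or '')}</p>" for p in root.findall("para"))
    return f"<h1>{title}</h1>{paras}"


def to_plain_text(root: ET.Element) -> str:
    """Render the same source as plain text, e.g. for a print or text channel."""
    lines = [root.findtext("title", default="").upper(), ""]
    lines += [p.text or "" for p in root.findall("para")]
    return "\n".join(lines)


root = ET.fromstring(SOURCE)
print(to_html(root))
print(to_plain_text(root))
```

A new output channel then means writing one more function against the same source, not re-authoring the content.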
CMSWatch: Can’t you go from Microsoft Word to all those formats?
Harvey: No. There’s no inherent structure in Word. If you export
a Word file, you’re going to get RTF or HTML. And if you’re trying to get some
sort of complex information from the data, Word just isn’t going to do it for you.
CMSWatch: But can’t you repurpose content with databases?
Harvey: Yes. And in some cases, that makes sense. It depends on the
Let’s say you’re looking at XML seriously and you already have a relational
database. I’ve seen companies take the relational database, throw it away,
and go with an XML, object-oriented database. But this doesn’t make sense,
because you already have fielded data, and you can still use XML on top of the
Remember that with all of these CM systems at the moment, we don’t know which
ones are really going to be survivors. But you know Oracle and SQL Server are
not going away. So if you have your data in a relational database – and of
course it may be that you can’t get all your data into that format – you can
still use XML where it makes sense.
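A minimal sketch of keeping fielded data in a relational database and generating XML only where it is useful follows. It assumes Python's standard-library sqlite3 module as a stand-in for a production database such as Oracle or SQL Server; the table, columns, and element names are hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET


def rows_to_xml(connection: sqlite3.Connection) -> str:
    """Export the hypothetical products table as an XML document."""
    root = ET.Element("products")
    cursor = connection.execute(
        "SELECT sku, name, price FROM products ORDER BY sku"
    )
    for sku, name, price in cursor:
        product = ET.SubElement(root, "product", sku=sku)
        ET.SubElement(product, "name").text = name
        ET.SubElement(product, "price").text = f"{price:.2f}"
    return ET.tostring(root, encoding="unicode")


# In-memory database used purely for illustration.
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE products (sku TEXT PRIMARY KEY, name TEXT, price REAL)"
)
connection.executemany(
    "INSERT INTO products VALUES (?, ?, ?)",
    [("A-100", "Valve", 19.5), ("B-200", "Gasket", 2.25)],
)
print(rows_to_xml(connection))
```

The database remains the system of record, and the XML is a view generated on demand.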
CMSWatch: You do a lot of XML education. What do you see
as the biggest hurdle for its adoption in Web Content Management?
Harvey: One of the biggest problems is that most people have put their
websites together with spit and glue. You have to sit down and do in-depth
content and data analysis, and most people don’t want to do that. It’s not
something that you can go in overnight and handle.
You have to look at the entire infrastructure, and not strictly from the standpoint
of web delivery, but workflow as well. Where does the data come from, and how
is it used? In most cases, the XML project gets pushed back further within
the infrastructure of the business, not just in Web delivery. If you’re just
going to do it for Web delivery, I don’t think it makes sense. It’s something
that you need to make a part of your information infrastructure. And this can
be daunting, because it can’t be done overnight.
One of the major reasons why some XML projects fail is that people think, “OK,
from here on out everything’s going to be XML.” But they don’t take an in-depth
look at how the work flows, how the information flows, where the information
comes from and when it comes in, what they do with it, and so forth. You need
to look at the entire scope of the organization, not just web delivery.
CMSWatch: What about the “fear of pointy brackets”? Should
we tell people to just get over it and accept them?
Harvey: Yes, I think so. And I think it’s already happened with HTML.
People aren’t afraid of it anymore the way they used to be. That was probably
the biggest hurdle in SGML, because pointy brackets were something new and alien,
but now people are used to seeing them every day.
CMSWatch: But one of the big pushes in XML editors is to make all those
tags go away so that the casual business user doesn’t have to deal with them.
Do you hold much promise for that?
Harvey: Actually, it reminds me of working with [Corel] WordPerfect.
People who worked with WordPerfect loved it, because one of the things you could
do in WordPerfect that you can’t do in [Microsoft] Word is get to the code.
In WordPerfect you could “reveal codes” and fix things that weren’t quite right.
That’s the way I view XML. When you don’t have the tags “on,” it’s really
nice to work with it, but if something’s not working right, you can reveal the
underlying structure and see what’s going on. In most cases, you’re going to
be able to work without the tags revealed, but in some instances you still need
to see what’s going on.
CMSWatch: Do you ever share the recipe for your famous chocolate-chip cookies?
Harvey: Actually, it’s the Tollhouse recipe on the back of the chocolate
chip bag. I consider it a kind of industry standard. But like all good organizations,
I modify its implementation to my liking: I add twice the number of pecans.