I Don't Endorse Ill-Formed XML, Part 1
My product CMarkup can sometimes parse ill-formed XML. I definitely do not encourage anyone to intentionally produce ill-formed XML, but the bottom line is that sometimes you don't have control over the "XML" you need to consume.
Specification versus reality
There is a row of identical new houses, all built to the same specification, and a carpenter is tasked with building and installing shelves in an alcove in each of the homes. Would the carpenter take the measurements from the specification, go away, cut all the materials to the exact measurements and even preassemble some parts? No, the carpenter would at least measure some of the houses to get an idea of potential variations, and wait to cut the shelves to the exact measurements on site.
Mature industries like carpentry have practices and toolsets that allow for the realities of the job. XML is a relatively immature area of the software industry in which developers are still discovering the best practices and toolsets.
Ill-formed XML (i.e. XML that is not well-formed) is like a house that is not built precisely to the specification. (I am purposefully using the term "specification" instead of "blueprint" because blueprint would be more analogous to schema or DTD; I know it is not a perfect metaphor, but bear with me.) I am not talking about validation. I'm talking about unescaped ampersands and invalid characters, dropped end tags and inconsistent case in tag names.
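To make those categories concrete, here is a minimal sketch that feeds one sample of each to a strict parser. It uses Python's standard `xml.etree.ElementTree` module purely for illustration; the sample documents and the `is_well_formed` helper are made up for this example and have nothing to do with CMarkup's own API.

```python
import xml.etree.ElementTree as ET

# Hypothetical samples, one per kind of ill-formedness named above.
SAMPLES = {
    "well-formed": "<item><name>Salt &amp; pepper</name></item>",
    "unescaped ampersand": "<item><name>Salt & pepper</name></item>",
    "invalid character": "<item><name>Salt\x01pepper</name></item>",
    "dropped end tag": "<item><name>Salt</item>",
    "inconsistent case": "<item><Name>Salt</name></item>",
}

def is_well_formed(text):
    """Return True if a strict parser accepts the text, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

results = {label: is_well_formed(text) for label, text in SAMPLES.items()}
for label, ok in results.items():
    print(f"{label}: {'accepted' if ok else 'rejected'}")
```

Only the first sample is accepted. Each of the other four is rejected outright, even though a person reading them can tell what was meant.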
Yes, ill-formed XML is bad! As a producer of XML, you can check your output by loading it with a standard parser such as the one in Internet Explorer. This is easy, except that it does not prove that you are producing XML correctly for all potential future data variations. You should probably use an XML tool to generate XML to help ensure it is well-formed, but depending on your platform and circumstances that may not be the best option.
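As a rough illustration of what "use an XML tool to generate XML" buys you, the sketch below builds a small document with Python's standard `xml.etree.ElementTree` module instead of concatenating strings. The `build_item` function and the element names are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

def build_item(name):
    """Build <item><name>...</name></item>, leaving escaping to the library."""
    item = ET.Element("item")
    ET.SubElement(item, "name").text = name
    return ET.tostring(item, encoding="unicode")

# The data contains characters that would break naive string concatenation.
doc = build_item("Salt & pepper <large>")
print(doc)

# Round trip: a strict parser accepts the output and recovers the original data.
recovered = ET.fromstring(doc).find("name").text
```

The library escapes the ampersand and angle brackets in the text, so the producer does not have to remember to do it for every future data variation.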
No way, Jose
An XML tool should guide you into generating well-formed XML when you are a producer of XML. But when you are a consumer of XML, it is a disservice if the XML parser absolutely requires well-formedness because you often do not have control over the XML that you are loading. An XML parser that simply throws its hands up in the air and says "no way, Jose" should be criminal. That is like the carpenter arriving at a house and saying "forget it, this house is 3 inches off the specification; I can't help you." But the founders of XML carefully crafted a message that has brain-washed developers the world over to build and use XML tools that say "no way, Jose" and even to expect this disservice.
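For comparison, here is one way a consumer might be lenient instead of saying "no way, Jose". This is only a sketch of a single repair (escaping bare ampersands) layered on top of Python's strict standard parser. It is not how CMarkup works internally, and the `parse_leniently` function and `BARE_AMP` pattern are hypothetical names introduced for this example.

```python
import re
import xml.etree.ElementTree as ET

# An "&" that does not begin an entity reference (&name;) or a
# character reference (&#123; or &#x1F;).
BARE_AMP = re.compile(r"&(?!(?:[A-Za-z_][\w.-]*|#[0-9]+|#x[0-9A-Fa-f]+);)")

def parse_leniently(text):
    """Try a strict parse first; on failure, escape bare ampersands and retry.

    Returns (root_element, was_repaired). The retry can still raise
    ET.ParseError if the document has problems this repair does not cover.
    """
    try:
        return ET.fromstring(text), False
    except ET.ParseError:
        repaired = BARE_AMP.sub("&amp;", text)
        return ET.fromstring(repaired), True

root, was_repaired = parse_leniently(
    "<item><name>Salt & pepper &lt;x&gt;</name></item>"
)
print(root.find("name").text, was_repaired)
```

Returning the `was_repaired` flag lets the consumer log or report the problem to the producer while still getting on with the job, much like the carpenter who notes the discrepancy and cuts the shelf to fit anyway.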
Why? In the beginning, seeing the potential for a markup data-interchange standard on the Internet, the creators of XML were concerned about the lesson of HTML: almost all HTML on the web is badly formed (with ambiguously nested tags and indeterminate character encodings). And since browsers don't reject this bad HTML, nobody fixes it. The people who designed XML considered themselves Internet visionaries. They wanted to do their part to fix the ill-formed web. So one of their primary purposes from the very beginning was to frame the XML specification in such a way that ill-formed XML should always be met with outright rejection.
The grand coalescence of information
Of course, it was not this strictness that vaulted XML into the hype and prevalence it attained. The message of simplicity combined with a mystique about its potential is what somehow caught on. After all, unlike EDI and CSV and binary and fixed formats, you can read XML with your own eyes! Everyone wanted to use XML for data file formats rather than create "yet another proprietary format." With the advent of all this standardized markup, some people had visions of a grand coalescence of information (also known as "the semantic web").
The voices of the XML industry felt that a key to evolving towards this vision was enforcing well-formedness. They wanted to head off the great looming evil of ill-formed XML before it hit. The solution was to enlist the armies of developers around the world to help with the work of badgering and coercing XML producers into making sure their XML was well-formed. A noble effort, no doubt, but badgering and coercing are not really in the job description of the average developer.
One way of looking at the lesson of HTML is that, left unenforced, markup will tend to be ill-formed. However, another way of looking at it is that the explosion of markup as the way to structure and give meaning to information on the web is part and parcel of its looseness and approachability. The realities of the industry forced browsers to be lenient with HTML, but at the same time web developers are still always encouraged to produce well-formed and even validated HTML and XHTML. So the tool is lenient, but the best practice is strict. Why not the same with XML? Part 2 digs deeper into how the strictness of XML adversely affects XML development.