I Don't Endorse Ill-Formed XML, Part 2
I believe in adhering to the XML standard, but I want to call attention to the ridiculous things that many programmers experience as part and parcel of XML.
Doing everything but your original task
XML comes with strings attached: if you use XML, you are drafted as a soldier in an epic battle for well-formed markup around the globe. They don't tell you the significance of that up front. They lure you in with a hint of a panacea, and vague notions of a host of accompanying benefits that come to those who buy into the whole XML way of doing things.
The huge hype and popularity of XML, together with the industry focus on standards compliance, created a unique situation in which counter-productive efforts became a major part of the development cycle. The preoccupation with strict adherence spilled over into the related area of "validation", in which a companion format declares the arrangement and values allowed in a document. Validation with the DTD and XML Schema quasi-languages is often counter-productive (that is another story), and at the very least it involves new mechanisms that are less powerful than the traditional techniques applied to transaction formats since the beginning of programming.
Suffice it to say that someone's original programming task did not include fighting the global battle for well-formed XML and learning new and questionable validation techniques.
XML data generated by company XYZ
Take an example developer at hypothetical company ABC who is tasked with processing XML data generated by hypothetical company XYZ. Happy to get into the exciting world of XML, she downloads a free, industry-standard XML parser from the web and eventually gets it installed and working in her project. After several weeks she has it working on some sample data, is getting pretty confident in her solution, and assumes she is on schedule for completion. Suddenly she comes across many XML documents that are getting rejected, and discovers that the data is ill-formed due to un-encoded ampersand characters (a bare & where &amp; is required).
She tells her bosses at company ABC that some of the XML data from company XYZ is bad. The bosses tell her that they have no control over company XYZ, but they will ask them to fix their data. In the meantime, being productive-minded, she starts researching ways of dealing with the bad XML data herself. She spends some time looking into her industry-standard XML parser to see if it can help, but no luck there. It is adamant that the XML must be perfectly well-formed before it will do anything with it. After a couple of weeks, the bosses come back and say "nope, even if company XYZ were willing to cooperate, there is already a lot of legacy data beyond their reach."
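To make the failure concrete, here is a minimal sketch using Python's standard xml.etree.ElementTree parser as a stand-in for her strict parser. The sample documents and the strict_parse helper are hypothetical, invented for illustration; the story does not say which parser or data she used.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data: identical except for the ampersand encoding.
GOOD = "<company><name>Smith &amp; Sons</name></company>"
BAD = "<company><name>Smith & Sons</name></company>"  # bare ampersand


def strict_parse(text):
    """Return (root, None) on success, or (None, message) if the parser rejects the text."""
    try:
        return ET.fromstring(text), None
    except ET.ParseError as err:
        return None, str(err)


good_root, good_err = strict_parse(GOOD)
bad_root, bad_err = strict_parse(BAD)

# The well-formed document yields its data; the other yields only an error.
print(good_root.find("name").text)
print(bad_err)
```

A single stray character is enough for the whole document to be refused, which is exactly the position she finds herself in.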
Ironically, the XML tool is no help
Now she is back to square one. She has a great solution for all the well-formed XML data, but that's not good enough. So she starts looking into text parsing tools and finally finds something that she can tailor to locate the markup tags, extract the data strings, and process them appropriately. It takes several weeks. At first she considers generating well-formed XML to feed into her original solution. But having learned much more than she ever intended about XML, and having obtained the data already, she decides it is more efficient just to use the data directly.
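The kind of tailored text parsing she ends up with might look something like the sketch below. It is only an illustration under stated assumptions: the record layout, the element names, and the extract_leaves helper are all hypothetical, and the pattern only handles simple leaf elements with no attributes and no nested markup.

```python
import re

# Hypothetical ill-formed record; the element names are assumptions.
RAW = "<order><customer>Smith & Sons</customer><item>Nuts & Bolts</item></order>"

# A simple leaf element: <tag>text</tag>, where the text contains no markup.
LEAF = re.compile(r"<([A-Za-z_][\w.-]*)>([^<]*)</\1>")


def extract_leaves(text):
    """Return (tag, text) pairs for every simple leaf element found."""
    return [(m.group(1), m.group(2)) for m in LEAF.finditer(text)]


for tag, value in extract_leaves(RAW):
    print(tag, "=", value)
```

Because the pattern never interprets the ampersand, the bare & characters pass straight through as data. The price is that she now owns a hand-rolled parser with its own blind spots.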
At this point she realizes that ironically the XML tool she originally spent many weeks with is completely useless for her task because she was forced to use another parsing tool. And perhaps, in defense of company XYZ, the XML generator tool simply had a bug that they have since patched, but this does not help her with the bad XML already generated during that time.
One thing to report ill-formed, another to simply shut down
Some people will read this story and say that we must all just shout the mantra of well-formed XML even louder to avoid these cases in the first place. Some people are worried that any sign of weakness could open the floodgates of ill-formed XML on the web. However, that view fails to recognize the prevalence of "roll your own" XML solutions that helped fuel the popularity of XML in the first place.
With most of the web's HTML badly formed but functional, it is unrealistic to expect XML to become hugely popular without some bad XML. Why can't a tool indicate the XML data is not well-formed and then continue to provide reasonable functionality rather than simply shutting down? Let the consumers of XML data be the judge of its value.
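As a rough sketch of what "report, then continue" could look like, the following wraps Python's standard xml.etree.ElementTree parser. The lenient_parse function and the BARE_AMP pattern are hypothetical names of my own, not the API of any real product, and the repair covers only one kind of error: ampersands that do not begin an entity or character reference.

```python
import re
import xml.etree.ElementTree as ET

# An ampersand that does not start an entity or character reference.
BARE_AMP = re.compile(r"&(?!(?:[A-Za-z_][\w.-]*|#[0-9]+|#x[0-9A-Fa-f]+);)")


def lenient_parse(text):
    """Parse text, reporting ill-formedness as a warning instead of giving up.

    Returns (root, warnings). If the repaired text is still not
    well-formed, the second parse raises ET.ParseError.
    """
    warnings = []
    try:
        return ET.fromstring(text), warnings
    except ET.ParseError as err:
        warnings.append(f"not well-formed: {err}")
    # Assumption: bare ampersands are the problem. Escape them and retry.
    repaired = BARE_AMP.sub("&amp;", text)
    return ET.fromstring(repaired), warnings


root, warnings = lenient_parse("<a><b>Nuts & Bolts &amp; more</b></a>")
print(root.find("b").text)
print(warnings)
```

The caller still learns that the document was bad, and can decide what that is worth. The sketch has limits: it would also rewrite a bare ampersand inside a CDATA section of a document that fails for some other reason, and it leaves undefined entities untouched.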
Interestingly, RSS readers generally support invalid RSS feeds (feeds that would fail XML validation) because of the same industry pressures that led browsers to support invalid and ill-formed HTML. Dare Obasanjo (On Crappy XML Formats) writes:
if every aggregator rejected invalid feeds then they wouldn't exist. However, just like in the browser wars, aggregator authors consider it a competitive advantage to be able to handle malformed feeds
All ill-formed markup is bad. It reduces the potential usefulness of the information in it, and its interoperability. Nevertheless, if you invest in learning and integrating an XML parser product that has no support for ill-formed XML, you should recognize that as a disservice to you. And yes, this is another reason why my product CMarkup is great.


