\subsection{XML}
\label{sec:xml}
\begin{quotation}%
XML has succeeded beyond the wildest expectations as a convenient format
for encoding information in an open and easily computable fashion. But it
is just a format, and the difficult work of analysis and modeling
information has not and will never go away.
\\\quotationsource \textcite{Wilde2008}
\end{quotation}
\noindent
The \Tacro{Extensible Markup Language}{XML} was designed between 1996
and 1998 as a simplified subset of the \tacro{Standard Generalized
Markup Language}{SGML} for the Web \cite{Bray1998}. Its origin in
\acro{SGML} (see section~\ref{sec:markuplanguages} about \acro{SGML} and
markup languages in general) gave \acro{XML} strong support for marked
up text documents, but also some features that, for most applications,
only add unnecessary complexity. Beginning in the late 1990s, more and
more domain-specific data formats were created based on \acro{XML}, or they
migrated to \acro{XML} from \acro{SGML}. \acro{XML}~1.0 was first published
as a \acro{W3C} recommendation in February 1998. Soon it was accompanied
by numerous extensions and revisions, such as the
\tacro{Document Object Model}{DOM} in late 1998,
% http://www.w3.org/TR/1998/REC-DOM-Level-1-19981001/
\term{XML Namespaces} (1999),
% http://www.w3.org/TR/1999/REC-xml-names-19990114/
\term{XPath} (1999),
%, see example~\ref{ex:xpath} at \pageref{ex:xpath}), % http://www.w3.org/TR/1999/REC-xpath-19991116
\acro{XSLT} \cite{Clark1999x}, % http://www.w3.org/TR/1999/REC-xslt-19991116
\tacro{XML Schema}{XSD} (2001), % http://www.w3.org/TR/2001/REC-xmlschema-0-20010502/
\term{Canonical XML} (2001), % http://www.w3.org/TR/2001/REC-xml-c14n-20010315
\term{XML Base} (2001), % http://www.w3.org/TR/2001/REC-xmlbase-20010627/
\term{XML Infoset} (2001), % http://www.w3.org/TR/2001/REC-xml-infoset-20011024/
and \term{XInclude} (2004). % http://www.w3.org/TR/2004/REC-xinclude-20041220/
% W3C. XML Linking Language (XLink) Version 1.0, June 2001.
% http://www.w3.org/TR/2000/REC-xlink-20010627/.
% W3C. XPointer xpointer() Scheme, December 2002.
% http://www.w3.org/TR/2002/WD-xptr-xpointer-20021219/.
\acro{XML}~1.1 was introduced in 2004 as a successor to \acro{XML}~1.0
\cite{Bray2004}, but it was never widely adopted.
% because it broke compatibility
% and introduced no compelling new features.% see "Xml 1.1 failed"
The listed extensions define slightly different models of \acro{XML},
and the degree of their support varies among applications, which
complicates an exact definition of \acro{XML} documents \cite{Dodds2002}.
However, all definitions share a common subset that can be described as
an ordered tree with Unicode strings and key-value pairs as node properties.
Beginning with \acro{XML}~1.0, we will first describe the most common parts of
\acro{XML} syntax, then discuss aspects of \acro{XML} processing and differences
between models of the \acro{XML} family of standards, and finally give an
overview and review of the most common \acro{XML} structures.
\acro{XML}~1.0 is defined based on a context-free grammar over a sequence of
\term{Unicode} characters with some additional \Term[well-formed]{well-formedness}
constraints. The grammar is given in a variant of \term{Backus-Naur-Form}.
Figure~\ref{fig:xmlbnf} shows a slightly adapted subset of the grammar rules:
A \bnf{document} starts with an optional \bnf{prolog}, followed by a mandatory
root \bnf{element}, and optional \bnf{comment}, processing-instructions (\bnf{pi}),
and whitespace (\bnf{s}). The \bnf{prolog} usually contains an
\acro{XML} declaration, which can specify, among other information, the
character encoding and a standalone flag, and it may contain a \tacro{document type definition}{DTD}.
An \bnf{element} in \acro{XML} syntax either consists of a \bnf{starttag} and
an \bnf{endtag} with the same \bnf{name}\footnote{The same-name requirement
is one of the constraints that cannot be expressed in \acro{BNF}.} and some
\bnf{content} in between, or it is an \bnf{emptytag}.
Start tags and empty tags can have a list of attributes (\bnf{attribute}), which are
key-value pairs with a unique \bnf{name} per attribute list.%
\footnote{The uniqueness requirement of attribute names is another
additional well-formedness constraint.} A \bnf{content} may contain
other \bnf{elements}, resulting in the general tree of \acro{XML} documents
(see example~\ref{ex:xmlmods} for a document).
\begin{figure}[h]
\centering
\begin{lstlisting}[language=BNF]
document = prolog element misc*
misc = comment | pi | s
s = ( #x20 | #x9 | #xA | #xD )+
element = starttag content endtag | emptytag
starttag = "<" name (s attribute)* s? ">"
endtag = "</" name s? ">"
emptytag = "<" name (s attribute)* s? "/>"
content = text? ((element | reference | cdata | pi | comment) text?)*
text = chars - (chars ("<" | "&" | "]]>") chars)
reference = charref | entityref
charref = "&#" [0-9]+ ";" | "&#x" [0-9a-fA-F]+ ";"
entityref = "&" name ";"
value = (text | reference)*
cdata = "<![CDATA[" (chars - (chars "]]>" chars)) "]]>"
comment = "<!--" (chars - (chars "--" chars | chars "-")) "-->"
pi = "<?" pitarget s (chars - (chars "?>" chars)) "?>"
attribute = name s? "=" s? ( '"' (value - (value '"' value)) '"'
| "'" (value - (value "'" value)) "'" )
\end{lstlisting}
\caption{Subset of the formal grammar of \acrostyle{XML}}
\label{fig:xmlbnf}
\end{figure}
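The well-formedness constraints that go beyond the grammar can be observed
with an existing parser. The following is a minimal sketch, assuming
\term{Python} and its standard module \verb|xml.etree.ElementTree|; the helper
name \verb|is_well_formed| and the sample strings are hypothetical and only
serve as illustration.
\begin{lstlisting}[language=Python]
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if text parses as well-formed XML, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Hypothetical samples: only the first one is well-formed.
samples = {
    "matching tags": "<a><b/></a>",
    "mismatched end tag": "<a><b></a></b>",
    "duplicate attribute name": '<a x="1" x="2"/>',
    "two root elements": "<a/><b/>",
}

for label, text in samples.items():
    print(label, is_well_formed(text))
\end{lstlisting}
The second and third samples violate the two constraints mentioned in the
footnotes above (same name in start and end tag, unique attribute names),
while the last sample violates the rule of exactly one root \bnf{element}.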
Textual data (\bnf{text}) in \acro{XML} can be any \term{Unicode}
string, except some codepoints below \U{0020}, \U{FFFE} and \U{FFFF}.
Furthermore, the characters `\verb|<|' and `\verb|&|', and in \bnf{content}
the sequence `\verb|]]>|', are not allowed. To include these characters
in an \acro{XML} document, you can use character references (\bnf{charref})
which can refer to an allowed Unicode character by its \acro{UCS} codepoint.
In addition there are predefined named entities (\bnf{entityref}): `\verb|&lt;|'
for `\verb|<|', `\verb|&gt;|' for `\verb|>|', `\verb|&amp;|' for `\verb|&|',
`\verb|&apos;|' for `\verb|'|', and `\verb|&quot;|' for `\verb|"|'.
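How character references and predefined entities behave on a round trip can
be sketched as follows, assuming \term{Python} with the standard modules
\verb|xml.sax.saxutils| and \verb|xml.etree.ElementTree|. The sample string
\verb|raw| is a hypothetical value chosen to contain all problematic
sequences.
\begin{lstlisting}[language=Python]
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Text that could not be written literally as element content.
raw = 'a < b && "c" ]]> d'

# escape() replaces "&", "<" and ">" by predefined entities.
escaped = escape(raw)
document = "<t>" + escaped + "</t>"

# Parsing replaces the references by the characters they stand for.
parsed = ET.fromstring(document).text

# Decimal reference, hexadecimal reference and predefined entity
# all denote the same character "<".
same = ET.fromstring("<t>&#60;&#x3C;&lt;</t>").text

print(escaped)
print(parsed == raw)
print(same)
\end{lstlisting}
After parsing, the model no longer records which kind of reference was used
in the syntax, so the three spellings of `\verb|<|' cannot be told apart.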
\acro{XML} is further complicated by the possibility of defining named entities
in a \acro{DTD}. These entities can either stand for an arbitrary piece
of \bnf{content} (\Term{internal entity}) or serve as a placeholder for some other data
that is referenced by a \acro{URI} (\Term{external entity}).
Most entities are replaced by their content when an \acro{XML} document is
read by an \Term{XML processor} (a piece of software that parses the syntax
of an \acro{XML} document and provides access to its content and structure).
However, some named entities can remain as unparsed artifacts because they
are external or because the \acro{DTD} is not taken into account by the
processor. In practice the \tacro{Simple API for XML}{SAX} \cite{Megginson2004}
\label{note:sax}
is a common abstraction in \acro{XML} processors, especially for the \term{Java}
programming language. \acro{SAX} is not a formal specification but it originates
in an implementation of an \acro{XML} parser that was first discussed in early
1998. The \acro{API} of \acro{SAX} provides a stream of parsing events that can
be used to construct an \acro{XML} document, if the stream of events follows the
well-formedness constraints of \acro{XML} (every \acro{XML} document can be mapped
to a stream of \acro{SAX} events but not vice versa).
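The stream of parsing events can be made visible with a small handler. The
following sketch assumes \term{Python}, whose standard library contains a
\acro{SAX}-style interface in the module \verb|xml.sax|; the class name
\verb|EventRecorder| and the sample document are hypothetical.
\begin{lstlisting}[language=Python]
import xml.sax

class EventRecorder(xml.sax.ContentHandler):
    """Collect parsing events as tuples in the order they are reported."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name, dict(attrs)))

    def endElement(self, name):
        self.events.append(("end", name))

    def characters(self, content):
        # A parser may report one text node in several chunks.
        self.events.append(("text", content))

    def processingInstruction(self, target, data):
        self.events.append(("pi", target, data))

recorder = EventRecorder()
xml.sax.parseString(b"<a b='d'><e f='g'>h</e><?j k?></a>", recorder)

for event in recorder.events:
    print(event)
\end{lstlisting}
Each start event is matched by a later end event with the same name. A
sequence of events that lacks this pairing does not correspond to any
well-formed document, which illustrates why the mapping is not reversible.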
\acro{XML} 1.0 defines two types of \acro{XML} processors:
validating and non-validating processors. Non-validating processors must
only check whether a document is well-formed, but they do not need to
process all aspects of a \acro{DTD}.\footnote{Some simple \acro{XML}
processors just ignore the \acro{DTD} although this is against the
specification. Removal of \acro{DTD} is one of the most common requests
in discussions about a future ``\acro{XML} 2.0'', as most \acro{XML}
documents have no \acro{DTD}, and validating is mostly done by
using other schema languages.}
Validating parsers must analyze the entire \acro{DTD}, including other
documents referenced from the \acro{DTD}, and they must check whether
the document matches the additional rules from its schema (see
section~\ref{sec:xmlschemas}). A processor may even change the content of an
\acro{XML} document by normalizing strings and by adding default values.
\begin{figure}[h]
\centering
\begin{tikzpicture}
\node[text width=2.5cm,align=center] (d) {\textbf{\acrostyle{XML} document}\\(syntax)};
\node[right=1.8cm of d,text width=3cm,align=center](m){\textbf{document model}\\(structure)};
\draw[->] (d) to node[yshift=3mm] {parsing} (m);
\node[right=0 of m,text width=5cm,yshift=-9mm] (l) {%
parsed document can be:
\begin{itemize}
\item not well-formed\\(syntax error, no model)
\item of some model type\\ (\acrostyle{DOM}, \term{Infoset}, \term{Canonical} \ldots)
\item modified by validation
\item invalid by validation
\end{itemize}
};
\node[below=3mm of d,text width=3cm,xshift=4mm]{\texttt{<a b:c="d">\\%
~<e f='g'>h</e>\\
~\&i;<?j k?>\\</a>}};
\node[below=2mm of m.south west,xshift=6mm] (a0) {};
\begin{scope}[orm];
\node at (a0) (a) {a};
\node[right=2mm of a,yshift=0mm] (bc) {b:c};
\node[right=2mm of bc] (d) {d};
\draw (a) -- (bc) -- (d);
\node[below=9mm of a] (e) {e};
\node[right=2mm of a,yshift=-5mm] (j) {j};
\node[right=0mm of a,yshift=-9mm] (i) {i};
\node[right=1mm of e,yshift=-5mm] (h) {h};
\draw (a) -- (e);
\draw (e) -- (h);
\draw (a) -- (i);
\node[right=2mm of e] (f) {f};
\node[right=2mm of f] (g) {g};
\draw (e) -- (f) -- (g);
\draw (e) -- (h);
\node[right=2mm of j] (k) {k};
\draw (a) -- (j) -- (k);
\end{scope}
\draw[decoration={brace},decorate,yshift=2mm] (l.south west) to (l.north west);
\end{tikzpicture}
\caption{\acrostyle{XML} document and \acrostyle{XML} document models}
\label{fig:xmlparsing}
\end{figure}
Parsing \acro{XML} can best be understood as a process that converts \acro{XML}
syntax, given as a sequence of characters, to another data structure (figure%
~\ref{fig:xmlparsing}). In general the act of parsing an \acro{XML} document
is not reversible, because some aspects of \acro{XML} syntax are considered
as irrelevant (figure~\ref{fig:irrelevantxmlparts}). The resulting data
structure is a model not only of the parsed document, but of all other
``logically equivalent'' documents that result in the same model. Parsing
\acro{XML} can result in different structures. If the
original data was not well-formed,
there is no model, and the document is not \acro{XML} by definition.\footnote{%
In practice you sometimes have to deal with not-well-formed documents that were
intended to be \acro{XML}. You can call these documents `broken' \acro{XML} if
there is a chance to recover well-formedness.} The specific type of model
defines which parts of syntax are translated to which parts of a model and
which parts are omitted as irrelevant to the given model
(figure~\ref{fig:irrelevantxmlparts}). A processor may also modify the
document to some degree or it may mark the document as invalid.
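The notion of ``logically equivalent'' documents can be illustrated by
comparing canonical forms. The sketch below assumes \term{Python}~3.8 or
later, where \verb|xml.etree.ElementTree.canonicalize| serializes a document
according to Canonical \acro{XML}~2.0. This is only one of several possible
models, and the three sample documents are hypothetical.
\begin{lstlisting}[language=Python]
from xml.etree.ElementTree import canonicalize

# Same model, different syntax: quote style, attribute order,
# a CDATA section versus references, and whitespace inside tags.
doc1 = """<a x='1' y="2"><![CDATA[<b>]]></a>"""
doc2 = '<a  y="2"   x="1">&lt;b&#62;</a >'

# Different model: whitespace in content is significant here.
doc3 = '<a x="1" y="2"> &lt;b&gt;</a>'

print(canonicalize(doc1) == canonicalize(doc2))
print(canonicalize(doc1) == canonicalize(doc3))
\end{lstlisting}
The first comparison holds although the character sequences differ, so the
original syntax of \verb|doc1| cannot be recovered from its model. A model
that also ignored whitespace in content would treat \verb|doc3| as
equivalent as well.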
\label{p:xmlmodel}
The most prominent models of \acro{XML} are the \Tacro{Document Object Model}{DOM}
and \Term{XML Infoset}. \acro{DOM} evolved in parallel with \acro{XML} in the late 1990s.
It was created to harmonize existing \term{JavaScript} interfaces that had been created
by Web browser makers for manipulating \acro{HTML} documents. The part of
\acro{DOM} that deals with \acro{XML} documents is `\acro{DOM} Core'.
Actually there are three variants: Level 1 is based on the
tree structure of \acro{XML} 1.0, Level 2 expresses the structure of \acro{XML}
with Namespaces, and Level 3 expresses a model compatible with XML Infoset
\cite{Cowan2004}. Another model of \acro{XML} is shared by XPath 1.0 and
Canonical \acro{XML} \cite{Boyer2008}; XPath 2.0 and XQuery define yet
another model \cite{Berglund2010}. A given model may also be expressed in
languages other than \acro{XML} syntax. For instance, \Term{Fast Infoset}
\cite{FastInfoset2005} is a binary representation of Infoset based on
\acro{ASN.1} and \textcite{Tobin2001} defines an \acro{RDF} Schema to serialize
\acro{XML} document models as \acro{RDF} instances.
\begin{figure}
\begin{multicols}{2}
\begin{itemize}
\item type of attribute delimiters ("/')
\item type of character entities
\item original character encoding
\item CDATA sections
\item standalone flag
\item all entity references
\item specified schemas
\item whitespaces
\item position of namespace declarations
\item namespace prefixes
\item attribute types (e.g. \dtd{ID}, \dtd{IDREF}\ldots)
\item explicit default attributes
\item original form of normalized attributes
\item original form of normalized Unicode
\item comments
\item processing instructions
\end{itemize}
\end{multicols}
\caption{Some properties of \acrostyle{XML} considered as irrelevant by some processors}
\label{fig:irrelevantxmlparts}
\end{figure}
Despite all minor differences, all document and processing models of \acro{XML}
share a basic structure that can be described as an ordered tree with nodes of
different types. Basically, there are element nodes with
exactly one element as root, attribute nodes, and text nodes. Other node
types (processing instructions, comments, external entity references etc.) are
much less used to hold relevant information, and they depend more on the
particular document model.
Each element node has a (possibly empty) set of unordered attribute
nodes with unique attribute names, and an ordered (possibly empty) list
of text and/or element nodes as child nodes. Attribute nodes cannot hold
nested structures but only one text node each, and text nodes are Unicode
strings with some code points excluded.
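The common subset described so far can be written down as a small data
structure. The following is a minimal sketch in \term{Python}, not the
definition of any particular \acro{XML} model; the names \verb|Element| and
\verb|text_content| are hypothetical, and names are simplified to plain
strings.
\begin{lstlisting}[language=Python]
from dataclasses import dataclass, field
from typing import Dict, List, Union

@dataclass
class Element:
    name: str
    # Attribute names are unique and unordered: a dict models this.
    attributes: Dict[str, str] = field(default_factory=dict)
    # Children are ordered: text nodes (str) and element nodes.
    children: List[Union["Element", str]] = field(default_factory=list)

def text_content(node):
    """Concatenate all text nodes below node in document order."""
    if isinstance(node, str):
        return node
    return "".join(text_content(child) for child in node.children)

# Model of <a x="1">h<e f="g">i</e>j</a>
root = Element("a", {"x": "1"}, ["h", Element("e", {"f": "g"}, ["i"]), "j"])
print(text_content(root))
\end{lstlisting}
Attribute values are plain strings in this sketch, which reflects that
attributes cannot hold nested structures, while the list of children keeps
the document order of text and element nodes.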
Each attribute and each element node has a name. The exact definition of a
\bnf{name} from figure~\ref{fig:xmlbnf} depends on the specific
\acro{XML} model: in \acro{XML} 1.0 a name is just a Unicode
string that does not contain certain disallowed characters. The dependence on a particular version
of Unicode was lifted with the fifth edition \cite{Bray2008}. The most
important (and often confusing) extension to \acro{XML}~1.0 is
\acro{XML} Namespaces \cite{Bray2009}: namespaces allow names of elements
and attributes to be qualified by a \acro{URI}. This way names can be grouped
together in vocabularies
and elements from different vocabularies can be mixed in one document.
In the model of \acro{XML} with namespaces (and in other techniques that
build upon namespaces, such as \acro{DOM} Level 2 and 3, \term{Infoset} etc.)
a name is a triple consisting of the namespace \acro{URI}, a local name, and a
namespace prefix. In \acro{XML} syntax namespaces are declared by
special attributes that start with \verb|xmlns| (in example~\ref{ex:xmlmods}
the namespace is declared at the root element so it applies to the whole
document).
Example~\ref{ex:xmlns} shows three \acro{XML} elements that make use of
a namespace declaration. In most cases only the namespace
\acro{URI} and the local name matter, so the first two elements should
be treated as equivalent. The prefix is also included in most
models, and some applications rely on it.\footnote{
See \url{http://www.w3.org/TR/xml-c14n\#NoNSPrefixRewriting} for details.}
The third element in example~\ref{ex:xmlns} is always
different from the first two: in contrast to \acro{RDF} Turtle syntax
(see section~\ref{sec:rdf}), namespace \acro{URI} and local name cannot simply be
concatenated to construct a canonical name, but they must be used together, as a pair, to identify the
full name of an \acro{XML} element or attribute.\footnote{Some vocabularies
may specify \emph{additional} identifiers for \acro{XML} elements, for
example in \term{XML Schema} each element has a \acro{URI} that happens to
be constructable by appending the local name to the namespace \acro{URI}. However,
there is no general rule to do so in other vocabularies.}
\begin{example}
\begin{tabular}{ll}
\textbf{element in \acrostyle{XML} syntax} & \textbf{namespace, local name, prefix} \\
\hline
\lstinline[language=XML]|<x:zz xmlns:x="http://example.org/"/>| & ( http://example.org/, zz, x ) \\
\lstinline[language=XML]|<y:zz xmlns:y="http://example.org/"/>| & ( http://example.org/, zz, y ) \\
\lstinline[language=XML]|<xz:z xmlns:xz="http://example.org/z"/>| & ( http://example.org/z, z, xz )\\
\end{tabular}
\caption{Namespaces in \acrostyle{XML}}
\label{ex:xmlns}
\end{example}
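The three elements of example~\ref{ex:xmlns} can be compared with a
namespace-aware parser. The sketch below assumes \term{Python} and its
module \verb|xml.etree.ElementTree|, which represents a name as the string
\verb|{URI}local| and discards the prefix. Other processors keep the prefix,
so this shows only one possible model.
\begin{lstlisting}[language=Python]
import xml.etree.ElementTree as ET

elements = [
    '<x:zz xmlns:x="http://example.org/"/>',
    '<y:zz xmlns:y="http://example.org/"/>',
    '<xz:z xmlns:xz="http://example.org/z"/>',
]

# Names as pairs of namespace URI and local name.
tags = [ET.fromstring(element).tag for element in elements]
pairs = [tuple(tag[1:].split("}")) for tag in tags]

# Naive concatenation of URI and local name loses the distinction.
concatenated = [uri + local for uri, local in pairs]

for tag, joined in zip(tags, concatenated):
    print(tag, joined)
\end{lstlisting}
The first two names are equal as pairs, the third one differs, and yet all
three concatenated strings are identical. This is why the namespace
\acro{URI} and the local name have to be kept apart.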
To allow more complex graph structures, there are several techniques to extend
the basic tree model of \acro{XML} with links: attributes can be defined to
only hold unique \dtd{ID} values or references to other identifiers
(\dtd{IDREF} in \acro{DTD} or keyref constraints in \term{XML Schema}).
\term{XLink} \cite{DeRose2010} and \term{XPointer} \cite{Grosso2003} describe
other extensions to \acro{XML} to create links to portions of \acro{XML}
documents. However, like other extensions to \acro{XML} 1.0, this adds another
layer of complexity and another model that first must be agreed on to achieve
interoperability. To reduce complexity within the family of \acro{XML}
specifications, simplified subsets have been proposed by \textcite{Bray2002},
\textcite{Clark2010} and others, but none of them has been widely adopted yet.
Nevertheless, \acro{XML} is successfully being used to encode and exchange data
on the Web and in other areas from markup languages such as \acro{TEI} to
structured metadata formats such as \acro{METS}, \acro{MODS}, and \acro{EAD}.
Furthermore several serialization forms of other formats in \acro{XML} exist,
for instance \acro{RDF/XML} for \acro{RDF} and \acro{MARCXML} for \acro{MARC}.
As described by \textcite{Wilde2008}, many problems with \acro{XML} arose from
overbroad claims for \acro{XML}, which in the end is just a format. It is still
best suited for marked-up textual data and other records that can be modeled
well as an ordered tree, but less so for data with arbitrary order and links.
\begin{example}
\centering
\begin{lstlisting}[language=XML,morekeywords={xmlns}]
<?xml version="1.0" encoding="UTF-8"?>
<mods xmlns="http://www.loc.gov/mods/v3" version="3.4">
  <titleInfo>
    <nonSort>The </nonSort>
    <title>C programming language</title>
  </titleInfo>
  <name type="personal">
    <namePart>Kernighan, Brian W.</namePart>
  </name>
  <name type="personal">
    <namePart>Ritchie, Dennis M.</namePart>
  </name>
  <originInfo>
    <place>
      <placeTerm type="text">Englewood Cliffs, NJ</placeTerm>
    </place>
    <publisher>Prentice-Hall</publisher>
    <dateIssued>1978</dateIssued>
  </originInfo>
</mods>
\end{lstlisting}
\acro*{MODS}
\caption{\acrostyle{MODS} record in \acrostyle{XML}}
\label{ex:xmlmods}
\end{example}
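Reading such a record requires the namespace \acro{URI} in every name,
because the namespace declared at the root element applies to all
descendants. The following sketch assumes \term{Python} and
\verb|xml.etree.ElementTree|; the record is a shortened copy of
example~\ref{ex:xmlmods}, and the variable names are hypothetical.
\begin{lstlisting}[language=Python]
import xml.etree.ElementTree as ET

# Namespace URI of the record in the notation used by ElementTree.
MODS = "{http://www.loc.gov/mods/v3}"

record = """<mods xmlns="http://www.loc.gov/mods/v3" version="3.4">
  <titleInfo>
    <nonSort>The </nonSort>
    <title>C programming language</title>
  </titleInfo>
  <name type="personal">
    <namePart>Kernighan, Brian W.</namePart>
  </name>
  <name type="personal">
    <namePart>Ritchie, Dennis M.</namePart>
  </name>
</mods>"""

root = ET.fromstring(record)

# Element order matters: nonSort precedes title.
info = root.find(MODS + "titleInfo")
title = info.findtext(MODS + "nonSort", "") + info.findtext(MODS + "title", "")

# Repeated name elements form an ordered list of authors.
authors = [name.findtext(MODS + "namePart")
           for name in root.findall(MODS + "name")]

print(title)
print(authors)
\end{lstlisting}
A lookup with the unqualified name \verb|titleInfo| finds nothing in this
record, which is a common source of confusion when working with
\acro{XML} namespaces.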