XXE Attacks— Part 1: XML Basics

Jan 7, 2019

XXE was officially included in the OWASP Top 10 2017 as a separate category, which has given XML-based vulnerabilities more prominence in web application security. This XML series will cover the relevant XML basics and vulnerabilities.

This first post covers the XML basics that I believe are essential to understand from a pentester’s point of view.

What is XML?

eXtensible Markup Language. The essence of XML resides in its name, so let’s break the name down.

Extensible

In a nutshell, XML is a skeleton that can be extended. It lets you define your own tags, the order in which these custom tags occur, and how they are displayed or processed. As for what an XML document can be: a machine can use one for configuration, the data exchanged back and forth by web services such as SOAP and WebDAV is XML, and document formats like DOCX and ODF and image formats like SVG are all built on XML.

Markup Language

Markup means giving meaning to text and symbols. Just like HTML, XML is a markup language. Actually, it is a meta-markup language, i.e. it allows us to create or define other markup languages such as MathML and RSS.

XML vs HTML:-

HTML is a presentation language, meaning it doesn’t provide information about how a document is structured or what it means.

XML is a data description language, meaning one can deduce the semantics of a document from its XML. Hence, XML documents are said to be self-describing.

A quick question: what if the tags do not directly convey the information they mark up? A bit of Googling turned up this:

An XML document is not actually self-describing in itself!

For XML syntax and tags to be self-describing, they should simultaneously convey the specific information they mark up, all the semantics needed to distinguish between identical tags, and all the rules that govern their relationship to the content, all without any additional information.

Consider a simple example,

The above doc describes itself: we are talking about the location of CP. In short, the XML tags convey the information together with what they mark up, without any extra info!

Here, the doc is syntactically correct but it doesn’t describe itself, right? Anyway, there is no point arguing over whether XML is truly self-describing or not. Just food for thought.

Some XML technologies:-

XPath

It’s a query language for locating and processing nodes in an XML document. Because of the document’s hierarchical structure, it can be navigated in a logical, path-like way.
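As a small illustration, Python's standard xml.etree.ElementTree module supports a limited subset of XPath. The document below, including its element and attribute names, is made up for this sketch:

```python
import xml.etree.ElementTree as ET

# Hypothetical document; element and attribute names are illustrative only.
doc = """
<city name="example">
  <sector id="1"><shop>Books</shop><shop>Tea</shop></sector>
  <sector id="2"><shop>Shoes</shop></sector>
</city>
"""

root = ET.fromstring(doc)

# ".//shop" walks the hierarchy and selects every <shop> descendant.
all_shops = [shop.text for shop in root.findall(".//shop")]

# A predicate narrows the path to the sector whose id attribute is "2".
sector_two_shops = [shop.text for shop in root.findall("./sector[@id='2']/shop")]

print(all_shops)
print(sector_two_shops)
```

ElementTree only implements part of XPath 1.0, so more complex expressions may need a fuller XPath engine.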

XSLT

eXtensible Stylesheet Language Transformations. It’s a language used for transforming XML documents into other formats.

XHTML

It’s a stricter, cleaner version of HTML that was designed to replace HTML. But why was it designed in the first place? Simple: to bridge the gap between XML and HTML.

XML structure

Elements, tags and Nodes

An XML document is made up of elements, tags and nodes. Tags come in opening and closing forms. An element consists of an opening tag, some content and a closing tag. Node is a generic term that applies to any type of object in an XML document and is a part of the hierarchical structure.

Attributes

Just like in HTML, XML attributes are basically adjectives, or characteristics, of an element.
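To make the terms concrete, here is a minimal sketch using Python's standard xml.etree.ElementTree. The one-element document is hypothetical and only there to show which part is which:

```python
import xml.etree.ElementTree as ET

# Hypothetical one-element document, used only for illustration.
root = ET.fromstring('<location type="landmark">CP</location>')

tag_name = root.tag        # name shared by the opening and closing tag
content = root.text        # content between the two tags
attributes = root.attrib   # attributes of the element, as a plain dict

print(tag_name, content, attributes)
```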

Legality of XML document

For an XML document to be ‘legal’, there are two levels of legality:

Well-formedness
Validity

A well-formed XML document follows rules such as these:-

There must be a single root element.
Elements must be properly nested.
Attribute values must be quoted.
Attribute values must not contain a literal “<” or an unescaped “&”.
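The rules above can be tried out with a rough sketch like the following. It assumes Python's standard xml.etree.ElementTree parser, which is non-validating and rejects input that is not well-formed; the helper name is_well_formed and the sample strings are made up for illustration:

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts `text` as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# One good document followed by one violation per rule.
samples = {
    "ok": '<a><b id="1">x</b></a>',
    "two roots": "<a/><b/>",
    "bad nesting": "<a><b></a></b>",
    "unquoted attribute": "<a id=1/>",
    "raw < in attribute": '<a id="<"/>',
    "raw & in attribute": '<a id="&"/>',
}

results = {label: is_well_formed(text) for label, text in samples.items()}

for label, ok in results.items():
    print(label, "->", "well-formed" if ok else "rejected")
```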

Now, for a document to be valid, it must be well-formed and follow all the rules set down in the document’s DTD. We’ll talk about DTDs shortly.

But why are there two levels of legality?

Because many XML parsers are non-validating. The XML parsers in browsers, for instance, are non-validating: they only check for well-formedness.

Hence, well-formedness is mandatory and validity is optional.

Document Type Definition (DTD)

To achieve validity of an XML doc, we need a DTD. Let’s focus on the word “valid”. When a thing is said to be validated, it is being compared against some pre-set rules, laws or logic. If an XML document needs to be shared and “validated”, some rules must be laid down so that the data is formatted consistently.

A DTD defines the document structure with a list of valid elements, the order in which these elements occur and what data they contain. Hence, in short, a DTD is about element ordering and containment. A Document Type Definition can be declared inline inside an XML document, or as an external reference.

Oh! One more thing. DTD can also refer to the Document Type Declaration, or simply DOCTYPE. It is an instruction that connects an XML doc with a DTD (Document Type Definition). The word “connects” muddies things a little, so more precisely: the XML Document Type Declaration contains or points to markup declarations that provide the Document Type Definition for the XML doc. It can contain a definition, point to one, or do both.

A markup declaration can be an element, attribute-list, entity or notation declaration.

The document type declaration must appear before the first element in the document.

The Document Type Definition for the whole document consists of both subsets taken together, i.e. the internal as well as the external subset.

Note: We’ll use DTD to mean Document Type Definition unless otherwise specified.
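The difference between a declaration that contains a definition and one that points to a definition can be sketched with Python's standard xml.parsers.expat module, which reports what it finds in the DOCTYPE. The two documents and the helper name doctype_info are made up for illustration; a.dtd is only named here and, with expat's default settings, is not expected to be fetched:

```python
from xml.parsers import expat

def doctype_info(text):
    """Parse `text` and return what its DOCTYPE declaration reports."""
    info = {}

    def on_doctype(name, system_id, public_id, has_internal_subset):
        info.update(name=name, system_id=system_id,
                    has_internal_subset=bool(has_internal_subset))

    parser = expat.ParserCreate()
    parser.StartDoctypeDeclHandler = on_doctype
    parser.Parse(text, True)
    return info

# Declaration that CONTAINS the definition (internal subset).
internal = """<!DOCTYPE address [
  <!ELEMENT address (#PCDATA)>
]>
<address>somewhere</address>"""

# Declaration that POINTS TO the definition (external subset in a.dtd).
external = """<!DOCTYPE address SYSTEM "a.dtd">
<address>somewhere</address>"""

print(doctype_info(internal))
print(doctype_info(external))
```

Note that expat is non-validating, so this only shows how the declaration is read, not whether the document obeys the DTD.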

For example:

We have an address.xml

Internal DTD Example

External DTD example, where a.dtd contains all the rules. When referencing an external DTD, set standalone="no" in the XML declaration (this is also the value assumed when the attribute is omitted).

Note: As mentioned above regarding the Document Type Declaration (DOCTYPE), the external DTD (a.dtd) does not need the <!DOCTYPE preamble, as it is already present in the XML document (address.xml).

a.dtd

Let’s check it out. Our a.dtd is served at localhost:8000

The parser validated address.xml against a.dtd successfully.

In the above DTDs, we defined some elements. Below are element declaration rules:

A DTD element declaration consists of a tag name and a definition in parentheses. These parentheses can contain rules for any of:

Plain text
A single child element or elements
A sequence of elements

There are also occurrence indicators that help when an element needs to be specified more than once. Suppose we need to say that a single element can appear as many times as necessary. How do we do that? Much like in regex, a suffix after the element name does the job: ? means zero or one, * means zero or more, and + means one or more.

Elements that contain only text are specified by #PCDATA, which stands for Parsed Character Data and refers to anything other than XML elements. In the above DTD, the elements other than <address> are declared as #PCDATA, which means that they contain parsed text. Because PCDATA is parsed, it needs to be well-formed; thus, a literal < in it will make the parser throw an error.
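For a feel of how element declarations are read, here is a sketch that uses Python's standard xml.parsers.expat to report each declaration in an internal subset. The DTD and its element names are hypothetical, and the occurrence suffixes (?, *, +) are the standard DTD ones. Since expat is non-validating, it reads the declarations but does not check the document against them:

```python
from xml.parsers import expat

# Hypothetical internal subset; element names are illustrative only.
doc = """<!DOCTYPE address [
  <!ELEMENT address (name, street+, phone*, email?)>
  <!ELEMENT name (#PCDATA)>
]>
<address><name>x</name><street>y</street></address>"""

# Map expat's quantifier constants back to the DTD suffix characters.
QUANTIFIERS = {
    expat.model.XML_CQUANT_NONE: "",
    expat.model.XML_CQUANT_OPT: "?",
    expat.model.XML_CQUANT_REP: "*",
    expat.model.XML_CQUANT_PLUS: "+",
}

declared = {}

def on_element_decl(name, model):
    # model is (type, quantifier, name, children); children share that shape.
    _type, _quant, _name, children = model
    declared[name] = [child[2] + QUANTIFIERS[child[1]] for child in children]

parser = expat.ParserCreate()
parser.ElementDeclHandler = on_element_decl
parser.Parse(doc, True)

print(declared)
```

An element declared as (#PCDATA) has no child elements, so its list of children comes back empty.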

In connection with PCDATA, another thing that keeps appearing in web pages is CDATA. A CDATA section is not parsed by the parser: tags inside it are not treated as markup, and one can include the <, > and / characters literally.

<![CDATA[anything here]]>

CDATA is generally used directly in the XML document, telling the parser not to parse the content inside it. CDATA sections have no escaping mechanism, so there is no way to include the string “]]>” inside one. A CDATA section is a valid part of the document.

In the above, we inserted a <test> tag inside CDATA without closing it, and no error was thrown.
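The same behaviour can be reproduced with a short sketch, assuming Python's standard xml.etree.ElementTree; the note element and its text are made up for illustration:

```python
import xml.etree.ElementTree as ET

# The unclosed <test> sits inside a CDATA section, so it is plain text.
with_cdata = "<note><![CDATA[an unclosed <test> tag & a bare ampersand]]></note>"
note_text = ET.fromstring(with_cdata).text

# As ordinary parsed character data the same <test> is markup, and it fails.
without_cdata = "<note>an unclosed <test> tag</note>"
try:
    ET.fromstring(without_cdata)
    parsed_ok = True
except ET.ParseError:
    parsed_ok = False

print(repr(note_text))
print("parsed without CDATA:", parsed_ok)
```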

Entities

An entity is a piece of XML that can be used and reused in a document by referencing it. It’s a sort of symbolic representation of information.

Entities can be used to substitute bits of information or difficult-to-type characters, or to include a complete document.

Entity Declaration

Entities must be declared before they can be used or referenced. They may be declared in the DTD, in either the external subset or the internal subset.

One more interesting thing: if the same entity is declared more than once, only the first declaration is taken into consideration.
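The first-declaration-wins behaviour can be checked with a quick sketch, assuming Python's standard xml.etree.ElementTree (which expands internal entities declared in the internal subset). The entity values "first" and "second" are made up:

```python
import xml.etree.ElementTree as ET

# The entity asd is declared twice; only the first declaration should bind.
doc = """<!DOCTYPE note [
  <!ENTITY asd "first">
  <!ENTITY asd "second">
]>
<note>&asd;</note>"""

expanded = ET.fromstring(doc).text
print(expanded)
```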

Types of Entities:

Based on context, Entities can be divided into different categories:

If the context is where the replacement comes from, i.e. whether it is written directly in the declaration or lives in an external resource, then the entities are categorized as Internal and External.
If the context is whether the declared entities will be parsed or not, then the entities are categorized as Parsed and Unparsed.
If the context is where the replacement will be used, i.e. in the document content or in the DTD itself, then the entities are categorized as General and Parameter.
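The three classifications above can be read off entity declarations programmatically. The sketch below assumes Python's standard xml.parsers.expat and its EntityDeclHandler callback; the entity names, file names and the notation are hypothetical, and nothing external is fetched because the entities are only declared, never referenced:

```python
from xml.parsers import expat

# Hypothetical declarations covering each category; names are illustrative.
doc = """<!DOCTYPE doc [
  <!NOTATION GIF SYSTEM "image/gif">
  <!ENTITY greeting "hello">
  <!ENTITY chapter SYSTEM "chapter1.xml">
  <!ENTITY photo SYSTEM "photo.gif" NDATA GIF>
  <!ENTITY % fields "name, street">
]>
<doc/>"""

kinds = {}

def on_entity_decl(name, is_parameter, value, base, system_id, public_id, notation):
    kinds[name] = (
        "parameter" if is_parameter else "general",
        # Internal entities carry their value in the declaration itself.
        "internal" if value is not None else "external",
        # Unparsed entities are the ones tied to a notation via NDATA.
        "unparsed" if notation is not None else "parsed",
    )

parser = expat.ParserCreate()
parser.EntityDeclHandler = on_entity_decl
parser.Parse(doc, True)

for name, kind in kinds.items():
    print(name, kind)
```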

Internal Entities

Entities whose replacement text is given directly in the declaration.

<!ENTITY asd "pppppppppppppppppp">

The above entity can be referenced by &asd;

Internal entities are always parsed.

Five internal entities are predefined in XML: &lt; (<), &gt; (>), &amp; (&), &apos; (') and &quot; (").

All XML parsers support references to these entities by default.
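A minimal sketch of the predefined entities in action, assuming Python's standard library (xml.etree.ElementTree for parsing and xml.sax.saxutils.escape for going the other way); the sample strings are made up:

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# No DTD is needed: the five references below are built into XML itself.
doc = "<chars>&lt; &gt; &amp; &apos; &quot;</chars>"
decoded = ET.fromstring(doc).text

# escape() replaces &, < and > so that text can be placed safely in a document.
escaped = escape("1 < 2 & 3 > 2")

print(decoded)
print(escaped)
```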

Character References

Character references look just like entity references but are not; they allow any Unicode character to be referenced in a document. These references are numeric, in the format &#nnn; or &#xhhh;, where nnn is the character’s Unicode code point in decimal and hhh is the same code point in hexadecimal. XML parsers expand these references as soon as they find them. For comparison, the HTML 4 DTD defines 252 named character entity references.
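Both numeric forms can be tried with a short sketch, again assuming Python's standard xml.etree.ElementTree; the element name c is arbitrary:

```python
import xml.etree.ElementTree as ET

# &#65; (decimal) and &#x41; (hexadecimal) both name U+0041, the letter "A".
same_letter = ET.fromstring("<c>&#65;&#x41;</c>").text

# Any code point can be written this way, e.g. U+20AC, the euro sign.
euro = ET.fromstring("<c>&#x20AC;</c>").text

print(same_letter, euro)
```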

External Entities

For longer, multi-line replacements, it is better to store the entity value in an external file. An entity whose replacement text is fetched from such an external resource is, simply, an external entity. Further, external entities can refer to internal or other external entities, but there must not be any circular reference.

External entities are of two types: Public and Private.

Private External Entity: These are identified by the keyword SYSTEM and are intended for use by a single author.

<!ENTITY name SYSTEM "URI/resource">

Public External Entity: These are identified by the keyword PUBLIC and are intended for broader use.

<!ENTITY name PUBLIC "public_id" "URI/resource">

Parsed Entities

Simply, the entities which are parsed are Parsed entities.

Unparsed Entities

Entities that refer to non-XML data, identified by a notation, are “unparsed”. A NOTATION declaration describes the format of the non-XML data.

<!NOTATION GIF SYSTEM "CompuServe Graphics Interchange Format 87a">

For example:

<!ENTITY mypicture SYSTEM "normphoto.gif" NDATA GIF>

Unparsed entities can only be used as attribute values on elements with ENTITY attributes.

To embed an unparsed entity in a document, first insert an element with an ENTITY-type attribute whose value is the name of the unparsed entity declared in the DTD. An ENTITY attribute can only contain the name of an external, unparsed entity: the name itself, not a reference to the entity.

You could also declare the image attribute as CDATA and simply type the filename.

General Entities

All the entities defined, declared and referenced above are general entities. They are used within the XML document content, as shorthand or substitution macros.

Parameter Entities

There is a limitation, though. Suppose a reusable piece of replacement text is needed inside the DTD itself; general entity references are not expanded in DTD markup, so what to do? For this there is another kind of entity that is defined and referenced exclusively inside a DTD: the parameter entity. These entities let you reuse parts of a DTD in multiple places, with some conditions. Parameter entities can’t be used in the content of the document; references to them can only occur within the DTD.

Creating & Referencing Parameter entities

Let’s try to understand using an example,

Consider a DTD that requires a separate element for each sector of a city. The elements would be declared somewhat like this:

One thing that stands out here is the amount of repetition. The first thought that pops up is: can these repeated element groups be substituted using some kind of entity reference inside the DTD? The answer is parameter entities.

<!ENTITY % res "address, flat, shop, public_parking"> can be referenced using %res; and the final, simpler element declaration will be:

But this kind of substitution is limited to external DTDs. In the above procedure, the parameter entities hold element groups in the external subset of the DTD.

The internal DTD subset is damn strict.

In the internal subset of the DTD, references to parameter entities are not allowed within markup declarations. The parameter entity reference below fails:

PE references within markup declarations are forbidden in the internal subset.

An XML parser throws an error such as the following:

The parameter entity reference “%param;” cannot occur within markup in the internal subset of the DTD.
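The restriction can be reproduced with a small sketch, assuming Python's standard xml.etree.ElementTree (backed by expat). The sector document is hypothetical, reusing the res entity from earlier, and expat words its error differently from the message quoted above:

```python
import xml.etree.ElementTree as ET

# %res; is referenced inside the <!ELEMENT ...> declaration, which the
# internal subset does not allow.
doc = """<!DOCTYPE sector [
  <!ENTITY % res "address, flat, shop, public_parking">
  <!ELEMENT sector (%res;)>
]>
<sector/>"""

try:
    ET.fromstring(doc)
    error_message = None
except ET.ParseError as exc:
    error_message = str(exc)

print(error_message)
```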

In the internal subset, a parameter entity may stand for complete declarations, but it cannot supply a value that is used immediately inside another declaration. The syntax below will also fail:

But there is a way to use parameter entities in the internal DTD subset: pull in the markup declarations through an external parameter entity.

Where a.xml at http://192.168.56.102 contains the following:

Parameter entities allow creating other entities and parameter entities.

Regarding the parameter entities defined above, the following points are worth noting:

References to parameter entities can ONLY occur within the DTD.
Parameter entities can’t be used in the document body.
A parameter entity reference is not allowed within markup in the internal DTD subset.
Parameter entities allow creating other entities and parameter entities.

So, here we are done with necessary XML basics.

The next post will discuss XML DTD-related attacks.

Till then,😎🕛

References:

and lots of Googling.

Thanks to my friends Pяαкαѕн and Lokesh for helping me out…

Written by klose