Basic XML constructs - XML elements, tags, attributes, processing instructions, CDATA sections, comments. XML elements. Empty and non-empty XML elements


SQL injections, cross-site request forgery, corrupted XML... Scary, scary things that we would all like to be protected from, and we would also like to know why they happen at all. This article explains the fundamental concept behind it all: strings and handling strings within strings.

The main problem

It's just text. Yes, just text: that's the main problem. Almost everything in a computer system is represented by text (which, in turn, is represented by bytes). Some texts are intended for computers, while others are intended for people, but both of them are still just text. To understand what I'm talking about, here's a small example:
<author>Homo Sapiens</author>
<text>Suppose, there is the English text, which I don't wanna translate into Russian</text>
You won't believe it: this is text. Some people call it XML, but it's just text. It may not be suitable for showing to your English teacher, but it's still just text. You can print it on a poster and go to rallies with it, you can write it in a letter to your mother... it's text.

However, we want certain parts of this text to have some meaning to our computer. We want the computer to be able to extract the author of the text and the text itself separately so that we can do something with it. For example, convert the above to this:
Suppose, there is the English text, which I don't wanna translate into Russian by Homo Sapiens
How does the computer know how to do this? Well, because we very conveniently wrapped certain parts of the text with special words in funny brackets, like <author> and <text>. Since we've done this, we can write a program that looks for these specific parts, extracts the text, and uses it for some invention of our own.

In other words, we used certain rules in our text to indicate some special meaning that someone else, following the same rules, could use.
Okay, this isn't all that hard to understand. What if we want to use these funny brackets, which have a special meaning, in our text, but without invoking that meaning? Something like this:
<author>Homo Sapiens</author>
<text>Basic math tells us that if x < n and y > n, x cannot be larger than y.</text>
The "<" and ">" characters are nothing special. They can legally be used anywhere, in any text, as in the example above. But what about our idea of special words, like "< n and y >"? Does this mean that it is also some kind of keyword? In XML, perhaps yes. Or perhaps not. This is ambiguous. Since computers are not very good at dealing with ambiguities, something can end up giving an unexpected result if we don't dot the i's ourselves and resolve the ambiguities.
This dilemma can be solved by replacing the ambiguous symbols with something unambiguous.
<author>Homo Sapiens</author>
<text>Basic math tells us that if x &lt; n and y &gt; n, x cannot be larger than y.</text>
Now the text should be completely unambiguous: "&lt;" stands for "<" and "&gt;" stands for ">".
The technical term for this is escaping: we escape special characters when we don't want them to have their special meaning.
escape |iˈskāp| [no obj.] break free [...] [with obj.] fail to notice / fail to remember [...] [with obj.] IT: cause to be interpreted differently [...]
If certain characters or sequences of characters in a text have special meanings, then there must be rules that specify how to handle situations where those characters must be used without invoking their special meaning. Or, in other words, escaping answers the question: "If these symbols are so special, how can I use them in my text?".
As you can see in the example above, the ampersand (&) is also a special character. But what if we want to write "&lt;" literally, without it being read as "<"? Then the ampersand itself has to be escaped, giving "&amp;lt;".
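The escaping rule described above can be tried out with Python's standard library. This is only a minimal sketch: the `<author>` and `<text>` tag names and the helper `wrap_text` are assumptions made for illustration, not part of any real format.

```python
from xml.sax.saxutils import escape, unescape


def wrap_text(author: str, text: str) -> str:
    """Build the (assumed) <author>/<text> markup, escaping both values first."""
    return f"<author>{escape(author)}</author>\n<text>{escape(text)}</text>"


sentence = "Basic math tells us that if x < n and y > n, x cannot be larger than y."
markup = wrap_text("Homo Sapiens", sentence)
print(markup)

# Escaping is reversible: unescape() restores the original characters.
assert unescape(escape(sentence)) == sentence

# The ampersand is escaped too, so a literal "&lt;" survives as "&amp;lt;".
print(escape("&lt;"))
```

The point of the sketch is that the values are escaped at the moment they are placed inside the markup, not earlier and not later.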


If your users are good and kind, they will post quotes from old philosophers, and the messages will look something like this:

Posted by Plato on January 2, 15:31

I am said to have said "Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat."


If users are smart, they will probably talk about math, and the messages will be like this:

Posted by Pascal on November 23, 04:12

Basic math tells us that if x < n and y > n, x cannot be larger than y.


Hmm... These desecrators of our brackets again. Well, from a technical point of view they may be ambiguous, but the browser will forgive us for that, right?


Okay, STOP, what the hell? Some prankster slipped JavaScript tags into your forum? Anyone looking at this message on your site is now downloading and executing scripts in the context of your site that can do who knows what. And this is not good.

Not to be taken literally

In the above cases, we want to somehow tell our database or browser that this is just text, don't do anything with it! In other words, we want to "remove" the special meanings of all special characters and keywords from any information provided by the user, because we do not trust them. What to do?

What? What are you saying, boy? Oh, you say, "escaping"? And you're absolutely right, take a cookie!
If we apply escaping to the user data before merging it with the query, then the problem is solved. For our database queries it will be something like:
// Note: the mysql_* extension was removed in PHP 7.0; mysqli_real_escape_string() is its successor.
$name = $_POST["name"];
$name = mysql_real_escape_string($name);
$query = "SELECT phone_number FROM users WHERE name = '$name'";
$result = mysql_query($query);
Just one line of code, but now no one can "hack" our database anymore. Let's see again what the SQL queries will look like, depending on the user input:
Alex
SELECT phone_number FROM users WHERE name = 'Alex'
Mc'Donalds
SELECT phone_number FROM users WHERE name = 'Mc\'Donalds'
Joe'; DROP TABLE users; --
SELECT phone_number FROM users WHERE name = 'Joe\'; DROP TABLE users; --'
mysql_real_escape_string indiscriminately places a backslash in front of anything that might have some special meaning.


We apply the htmlspecialchars function to all user data before outputting it. Now the pest's message looks like this:

Posted by JackTR on July 18, 12:56


Note that the values received from users are not actually "corrupted". Any browser will parse the escaped text as HTML and display every character on the screen exactly as the user typed it.
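The same idea can be sketched in Python, where `html.escape` plays roughly the role that htmlspecialchars plays in PHP. The function `render_post`, the `post` class name and the hostile message are hypothetical and serve only as an illustration.

```python
import html


def render_post(author: str, body: str) -> str:
    """Return an HTML fragment for one forum post with user data escaped."""
    return (
        f'<div class="post">Posted by {html.escape(author)}'
        f"<p>{html.escape(body)}</p></div>"
    )


hostile = '<script>alert("gotcha")</script>'
fragment = render_post("JackTR", hostile)
print(fragment)

# Nothing is lost: unescaping gives back exactly what the user typed.
print(html.unescape(html.escape(hostile)) == hostile)
```

The browser receives only harmless text between the `<p>` tags, yet the reader still sees the original characters.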

Which brings us back to...

All of the above demonstrates a problem common to many systems: text within text must be escaped if it is not supposed to carry special meaning. When placing text values in SQL, they must be escaped according to SQL rules. When placing text values in HTML, they must be escaped according to HTML rules. When placing text values in (technology name), they must be escaped according to (technology name) rules. That's all.

For completeness, there are, of course, other ways to deal with user input that may or may not contain special characters:
  • Validation
    You can check whether user input matches some given specification. If you require a number to be entered and the user enters something else, the program should inform the user and reject the input. If all this is organized correctly, then there is no risk of catching "DROP TABLE users" where the user was supposed to enter "42". This is not very practical for avoiding HTML/SQL injections, because you often need to accept free-format text that may contain tricks. Typically, validation is used in addition to other measures.
  • Sanitization
    You can also “quietly” remove any symbols that you consider dangerous. For example, simply remove anything that looks like an HTML tag to keep scripts from being added to your forum. The problem is that you may remove perfectly legitimate parts of the text.
  • Prepared SQL statements
    There are special functions that do what we wanted: make the database understand the difference between the SQL query itself and the information provided by users. In PHP they look something like this:
    $stmt = $pdo->prepare("SELECT phone_number FROM users WHERE name = ?");
    $stmt->execute([$_POST["name"]]);
    In this case, sending occurs in two stages, clearly distinguishing between the query and the variables. The database has the ability to first understand the structure of the query and then fill it with values.

In the real world, these are all used together for different levels of protection. You should always use validation to ensure that the user is entering the correct data. You can then (but are not required to) sanitize the entered data. If a user is clearly trying to slip you some script, you can simply delete it. Then, you should always, always escape user data before putting it into an SQL query (the same goes for HTML).
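The prepared-statement approach from the list above can also be tried without a PHP setup. The following is a minimal sketch using Python's built-in `sqlite3` module; the in-memory database, the sample row and the helper `find_phone` are assumptions for illustration only.

```python
import sqlite3

# Throwaway in-memory database with one sample user (illustrative data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, phone_number TEXT)")
conn.execute("INSERT INTO users VALUES (?, ?)", ("Alex", "555-0100"))


def find_phone(name: str) -> list:
    """Look up phone numbers; the name travels separately from the SQL text."""
    cursor = conn.execute(
        "SELECT phone_number FROM users WHERE name = ?", (name,)
    )
    return [row[0] for row in cursor.fetchall()]


print(find_phone("Alex"))
print(find_phone("Joe'; DROP TABLE users; --"))

# The users table is still there after the hostile lookup.
print(conn.execute("SELECT COUNT(*) FROM users").fetchone()[0])
```

Because the value is never pasted into the query string, the hostile input is simply treated as an odd name that matches no row.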

For a long time, the standard has prescribed the construction &quot; for inserting regular quotation marks into HTML text, because inside tags quotation marks are used to delimit attribute values.

However, I have not yet come across a browser that would not show the plain " character OUTSIDE any tags as a quotation mark. So tell me, dear colleagues, maybe using &quot; outside tags is simply pedantry that no one needs? Can one calmly and without further ado just write "? Especially in texts where there are a lot of quotation marks, and compliance with strict typographic rules (regarding the correct use of national quotation marks) is irrelevant.

IMHO, many people do this... but the question is not entirely clear: if you understand that according to the standards you need to write quotes as &quot; but are too lazy to, even though a lot of sites work like that, then what do you expect to hear? I think the point is that no one knows whether plain quotes will still be displayed in new versions of browsers, so most likely we can only give the obvious recommendation: if you don't want problems in the future, stick to the standards 100% :) But you already know this. Or are you waiting for confirmation: "yes, it's pedantry, that's all, forget it, and in 10 years everything will be the same, I (Microsoft, Mozilla, etc.) guarantee"?

Lynn "Coffee Man"
yes, by the way... I just went and read it: nowhere is it stated that quotation marks must be written as &quot;
http://www2.stack.ru/~julia/HTML401/charset.html:

Some authors use the character entity reference "&quot;" to encode instances of double quotes (") because this character can be used to delimit attribute values.

The need to use an entity is stated only for <, > and &:

If the author wants to place the "<" character in the text, "&lt;" (ASCII decimal code 60) should be used to avoid confusion with the start of a tag. Similarly, "&gt;" (ASCII decimal code 62) should be used in the text instead of ">".

To avoid confusion with character references (the start mark of a character reference), the "&amp;" reference (ASCII decimal code 38) should be used instead of the "&" character. Additionally, the "&amp;" reference should also be used in attribute values, since character references within CDATA attribute values are allowed.

But I was expecting exactly something like Lynn's answer: that there is actually no such standard. It didn't even occur to me; my information comes from popular textbooks and from "everyone does it."

Or another option: if you follow newer standards that I have not encountered in my practice, like XHTML (precisely, I checked XHTML), then this trick will not work. Therefore, there is no need to create problems with the portability of the written HTML code.

Or finally: how do you do it yourself?

&amp;, by the way, raises a similar question. The document above says "to avoid confusion". But confusion is possible only if & is followed by one of the defined entity names. What if it's, say, a URL like "..../script?A=1&B=2"? Am I risking anything if I mistakenly put this URL into an href as is (which, of course, works correctly during testing)? Anything other than the extremely unlikely situation that in 10 years (when the site is out of date or has already been rewritten ten times) an entity will appear with the extravagant name &B, without a final ";"? In other words, how carefully should all such cases be checked?

Daniel, if you are sure that you have no conflicts with existing entity names, then you can simply write &. If a new entity appears in the future, I think it will be declared in some specification other than HTML 4.01, so it should not affect a properly declared document. Or do you expect to support future standards simply by changing the document declaration?

Daniel Alievsky
In XML, a regular quotation mark in text does not pose any problem either (and so, of course, neither in XHTML). IMHO quotes are usually converted to &quot; for only one reason: you don't want to write two functions to convert text to a safe form for substitution into XML / HTML / XHTML.

The purpose of this lesson:

  • BI must know the XML recording format
  • BI must be able to draw up a document in the form of XML code
  • BI must know data types and be able to use them

  Note: XML is more extensive than described in this tutorial. We consider only those features of the XML language that will be used in the ODA-TM system.

    XML. The basis

    XML was created to structure, store and transport information.

    The following example, “A note from a friend to a friend,” has the XML form:

    <note>
      <to>Nikolay</to>
      <from>Ivan</from>
      <heading>Reminder</heading>
      <body>I hope you haven’t forgotten about our meeting</body>
    </note>

    Visually, this code can be represented in the following form (Fig. 1.).

    The code has a sender and a recipient of information, it also has a message header and body.

    It is meant to be processed, sent and displayed by someone.

    But still, this XML document doesn't do anything. It's just information wrapped in tags.

    XML - tree

    XML has a tree structure. The document always has a root element (the XML declaration at the top of a document is not part of the tree). A tree element always has descendants and ancestors, except for the root element, which has no ancestors, and dead-end elements (tree leaves), which have no descendants. Each element of the tree is located at a certain nesting level (hereinafter referred to as the “level”). Elements that share a parent have previous and next siblings.
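The tree structure described above can be inspected with Python's standard `xml.etree.ElementTree` module. This is a minimal sketch: the tag names in `DOCUMENT` (`note`, `to`, `from`, `heading`, `body`) are assumed for illustration, and `walk` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Sample document; the tag names are an assumption for illustration.
DOCUMENT = """<?xml version="1.0"?>
<note>
  <to>Nikolay</to>
  <from>Ivan</from>
  <heading>Reminder</heading>
  <body>I hope you haven't forgotten about our meeting</body>
</note>"""


def walk(element, level=0):
    """Yield (level, tag, is_leaf) for the element and all its descendants."""
    children = list(element)
    yield level, element.tag, not children
    for child in children:
        yield from walk(child, level + 1)


root = ET.fromstring(DOCUMENT)
for level, tag, is_leaf in walk(root):
    print("  " * level + tag + (" (leaf)" if is_leaf else ""))
```

The root sits at level 0 and the four leaves at level 1; the XML declaration does not show up in the tree at all.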

    Create your own tags using XML

    There is no standard format for creating tags (descriptors, elements).

    XML has no predefined tags.

    • XML allows the author to define his own tags and his own document structure.
    • XML is used to transfer data
    • XML is a software and hardware independent tool for transferring information.
    • XML is now as important to the web as HTML
    • XML is the most common tool for transferring data between different applications
    • XML is used in many aspects of web development, often to simplify data storage and exchange
    XML syntax

    The syntax rules of XML are very simple and logical:

    • All XML elements must have a closing tag
    • XML elements must be nested correctly (one inside the other, never overlapping)
    • XML documents must have a root element: one element that is the parent of all other elements
    • XML attribute values must be enclosed in quotation marks
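Each of the rules above can be checked by handing a small document to a parser. The sketch below uses Python's standard `xml.etree.ElementTree`; the helper `is_well_formed` and the sample documents are made up for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(document: str) -> bool:
    """Return True if the parser accepts the document, False otherwise."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


# One sample per syntax rule; only the first one follows all of them.
samples = {
    "closed and nested": "<note><to>Ivan</to></note>",
    "missing closing tag": "<note><to>Ivan</note>",
    "overlapping elements": "<note><b><i>text</b></i></note>",
    "two root elements": "<note/><note/>",
    "unquoted attribute": "<note id=1/>",
}
for label, document in samples.items():
    print(f"{label}: {is_well_formed(document)}")
```

Unlike a forgiving HTML browser, an XML parser rejects the whole document as soon as one rule is broken.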
    Comments

    If you need to make some fragment of an XML document completely “invisible” to the parser, you can format it as a comment by writing the characters <!-- before it and the characters --> after it.

    For example:

    <!-- This is a comment -->

    The analyzer program will skip this entire structure without even “looking” into it.

    This comment syntax imposes two restrictions on it:

    • You cannot write two hyphens in a row in a comment;
    • a comment cannot be ended with a hyphen.
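Both the "invisibility" of comments and the two restrictions can be observed with Python's standard parser. A minimal sketch, in which the helper `parses` and the sample documents are assumptions for illustration:

```python
import xml.etree.ElementTree as ET


def parses(document: str) -> bool:
    """Return True if the document is accepted by the parser."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


# The default parser drops the comment: only <b> is left as a child.
root = ET.fromstring("<a><!-- This is a comment --><b/></a>")
print([child.tag for child in root])

# Two hyphens in a row inside a comment, or a hyphen right before
# the closing "-->", are both rejected.
print(parses("<a><!-- two -- hyphens --></a>"))
print(parses("<a><!-- trailing hyphen ---></a>"))
```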
    XML elements

    An XML element is everything from the element's start tag to its end tag.

    The element may contain:

    • other elements
    • text
    • attributes
    • or a combination of all of the above...
    XML Naming Rules

    XML elements must follow these naming rules:

    • Names can contain letters, numbers and other symbols
    • Names cannot begin with a number or punctuation mark
    • Names cannot contain spaces
    Attributes

    Attributes provide additional information about elements that is not part of the data.

    In the example below, the file type is not relevant to the data, but is important to software that may manipulate the element:

    <file type="gif">computer.gif</file>

    XML attributes must be enclosed in quotes

    Attribute values must always be in quotes. Either single or double quotes can be used. Example: to specify the gender of a person, the element can be written like this:

    <person sex="female">

    or like this:

    <person sex='female'>

    If the attribute value itself contains double quotes you can use single quotes, like in this example:

    <gangster name='George "Shotgun" Ziegler'>

    or you can use character entities:

    <gangster name="George &quot;Shotgun&quot; Ziegler">
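The quoting choices described above can be left to a library. The sketch below uses `quoteattr` from Python's standard `xml.sax.saxutils` and checks that both spellings of the attribute carry the same value; the `gangster` element is used purely as sample data.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr

name = 'George "Shotgun" Ziegler'

# quoteattr() picks a quoting style that is safe for the given value.
print(quoteattr(name))
print(quoteattr("female"))

# Both spellings of the attribute parse to the same value.
single = ET.fromstring("<gangster name='George \"Shotgun\" Ziegler'/>")
entity = ET.fromstring('<gangster name="George &quot;Shotgun&quot; Ziegler"/>')
print(single.get("name") == entity.get("name") == name)
```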

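    Some examples of using the Date data type

    Date as an attribute

    <note date="10/01/2008"><to>Tove</to><from>Jani</from><heading>Reminder</heading><body>Don't forget me this weekend!</body></note>

    Date as element

    <note><date>10/01/2008</date><to>Tove</to><from>Jani</from><heading>Reminder</heading><body>Don't forget me this weekend!</body></note>

    Date as an extended element

    <note><date><day>01</day><month>10</month><year>2008</year></date><to>Tove</to><from>Jani</from><heading>Reminder</heading><body>Don't forget me this weekend!</body></note>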

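    Metadata Attributes

    Sometimes identifiers are assigned to elements. These identifiers can be used to identify XML elements.

    Example:

    <messages><note id="501"><to>Tove</to><from>Jani</from><heading>Reminder</heading><body>Don't forget me this weekend!</body></note><note id="502"><to>Jani</to><from>Tove</from><heading>Re: Reminder</heading><body>I will not</body></note></messages>

    Data about data must be stored as attributes, and the data itself must be stored as elements.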

    XML. Data types

    Built-in simple types

    Date and time
    • dateTime contains the date and time in the format CCYY-MM-DDThh:mm:ss
    • duration represents a length of time expressed in Gregorian years, months, days, hours, minutes and seconds.

    For example, the record P1Y2M3DT10H30M45S means one year (1Y), two months (2M), three days (3D) and then, after the separator T, ten hours (10H), thirty minutes (30M) and 45 seconds (45S).

    The record can be abbreviated: P120M means 120 months, and PT120M means 120 minutes.

    • time contains the time in the usual format hh:mm:ss
    • date contains the date in the format CCYY-MM-DD
    • gYearMonth contains the year and month in the format CCYY-MM
    • gYear contains the year in the format CCYY
    • gMonthDay contains the month and day in the format --MM-DD
    • gDay contains the day of the month in the format ---DD
    • gMonth contains the month in the format --MM
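The duration notation in the list above is compact enough to take apart with a regular expression. The following is a simplified sketch, not a full implementation of the duration type: it assumes whole, non-negative numbers only, and `parse_duration` is a hypothetical helper.

```python
import re

# Date part before the "T" separator, time part after it (simplified).
DURATION = re.compile(
    r"^P(?:(?P<years>\d+)Y)?(?:(?P<months>\d+)M)?(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$"
)


def parse_duration(value: str) -> dict:
    """Split a duration such as P1Y2M3DT10H30M45S into named integer parts."""
    match = DURATION.match(value)
    if match is None or value == "P" or value.endswith("T"):
        raise ValueError(f"not a duration: {value!r}")
    return {name: int(part) for name, part in match.groupdict().items() if part}


print(parse_duration("P1Y2M3DT10H30M45S"))
print(parse_duration("P120M"))   # months: the M comes before any T
print(parse_duration("PT120M"))  # minutes: the M comes after the T
```

The letter M is the reason the separator T matters: its meaning depends on which side of T it appears.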
    Character strings

    string is the basic character type.

    A character string is a sequence of Unicode characters, including the space, tab, carriage return, and line feed characters.

    • normalizedString - a subtype of string: strings that do not contain line feeds "\n", carriage returns "\r" or horizontal tabs "\t".
      • token - a subtype of normalizedString: strings that, in addition, have no leading or trailing spaces and no runs of several consecutive spaces.
        • language - a subtype of token, defined to record the name of a language according to RFC 1766, for example, ru, en, de, fr.
        • NMTOKEN - a subtype of token, used only in attributes to record their enumerated values.
        • Name - a subtype of token, made up of XML names: sequences of letters, digits, hyphens, periods, colons and underscores, starting with a letter or an underscore. Names starting with the letters x, m, l (in any combination of cases) are reserved for use by the XML specification itself.
          • NCName - a subtype of Name that does not contain a colon. Three subtypes are defined: ID, IDREF, ENTITY
    Binary types
    • boolean - logical type. Accepts the values true or false (1 or 0)
    • base64Binary - binary data encoded in Base64
    • hexBinary - binary data in hexadecimal form without any additional characters
    Real numbers
    • decimal - real numbers written with a fixed point: 123.45, -0.48747798, etc.
    • double and float - types that comply with the IEEE 754-1985 standard, written with a fixed or floating point.
    Whole numbers
    • integer - the basic integer type: numbers with no fractional digits, defined as a subtype of decimal
    • number - defines a number (without restrictions on the number of digits); may contain a sign, a fractional part, and an exponent. Values range in magnitude from 2.2250738585072014E-308 to 1.7976931348623157E+308

    We continue our study of XML, and in this article we will get acquainted with such XML constructs as processing instructions, comments, attributes and other XML elements. These constructs are basic and allow you to mark up documents of any complexity flexibly and in strict accordance with the standard.

    We have already partially discussed some points, such as XML tags, in the previous article. Now we will touch upon this topic again and examine it in more detail. This is done specifically to make it easier for you to get the full picture of XML constructs.

    XML elements. Empty and non-empty XML elements

    As mentioned in the previous article, tags in XML do not simply mark up text, as is the case in HTML, but delimit individual elements (objects). Elements, in turn, organize the information in a document hierarchically, which makes them the main structural units of the XML language.

    In XML, elements can be of two types - empty and non-empty. Empty elements do not contain any data, such as text or other constructs. Unlike empty elements, non-empty elements can contain any data, such as text or other XML elements and constructs. To understand the point of the above, let's look at examples of empty and non-empty XML elements.

    Empty XML element

    <myElement />

    Non-empty XML element

    <myElement>Element content...</myElement>

    As we can see from the example above, the main difference between empty and non-empty elements is that empty elements consist of only one tag. In addition, it is also worth noting that in XML all names are case sensitive. This means that the names myElement, MyElement, MYELEMENT, etc. differ from each other, so it is worth remembering this right away to avoid mistakes in the future.
    So, we have figured out the elements. Now let's move on to the next point, which is the logical organization of XML documents.
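The difference between empty and non-empty elements, and the case sensitivity of names, can be observed with Python's standard `xml.etree.ElementTree`. A minimal sketch; the helper `is_empty` is hypothetical and the element name `myElement` is taken from the text above.

```python
import xml.etree.ElementTree as ET

empty = ET.fromstring("<myElement />")
non_empty = ET.fromstring("<myElement>Element content...</myElement>")


def is_empty(element) -> bool:
    """An element is empty if it has neither text nor child elements."""
    return not element.text and len(element) == 0


print(is_empty(empty), is_empty(non_empty))

# <myElement></myElement> is just the long spelling of the empty element.
print(is_empty(ET.fromstring("<myElement></myElement>")))

# Names are case sensitive: the closing tag must match exactly.
try:
    ET.fromstring("<myElement></MyElement>")
except ET.ParseError as error:
    print("rejected:", error)
```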

    Logical organization of XML documents. Tree structure of XML data

    As you remember, the main constructs of the XML language are elements, which can contain other nested constructs and thereby form a hierarchical tree structure. In this case, the top-level parent element is the root, and all of its descendants are the branches and leaves of the XML tree.

    To make it easier to understand the above, let's look at the following image with an example.

    As we can see, a tree is a fairly simple structure to process, yet its expressive power is considerable. This makes the tree representation well suited to describing objects in XML.

    XML attributes. Rules for writing attributes in XML

    In XML, elements can also contain attributes with values assigned to them, which are placed in single or double quotes. The attribute for an element is set as follows:

    <myElement attribute="value"></myElement>

    In this case, an attribute with the name “attribute” and the value “value” was used. It’s worth noting right away that an XML attribute must always be given a value: unlike in HTML, a bare attribute name without a value is not allowed. Otherwise, the code will be incorrect from an XML point of view.

    It is also worth paying attention to the use of quotation marks. Attribute values can be enclosed in either single or double quotes. In addition, it is also possible to use one kind of quotes inside the other. To demonstrate, consider the following examples.

    <myElement attribute="value"></myElement>
    <myElement attribute='value'></myElement>
    <myElement attribute='"value"'></myElement>

    Before we look at other XML constructs, it's also worth noting that special characters such as the ampersand "&" or the angle brackets "<" and ">" cannot be used as is in attribute values. These characters are reserved as control characters ("&" starts an entity reference, and "<" and ">" open and close an element tag) and cannot be used in their “pure form” (strictly speaking, XML forbids only "<" and "&" there, but ">" is usually escaped as well). To use them, you need to replace them with the entities &lt;, &gt; and &amp;.

    XML processing instructions (processing instructions). XML declaration

    XML makes it possible to include instructions in a document that carry specific information for the applications that will process it. Processing instructions in XML are created as follows.

    <?target content?>

    As you can see from the example above, in XML, processing instructions are enclosed in angle brackets with question marks. This is a bit like the usual <?php ... ?> block that we looked at in the first PHP lessons. The first part of a processing instruction names the application or system for which the second part, its content, is intended. Processing instructions are meaningful only to the applications to which they are addressed. An example of a processing instruction could be the following instruction.

    It is worth noting that XML has a special construct that is very similar to a processing instruction, but is not one. This is the XML declaration, which tells the processing software about properties of the XML document such as its encoding and the version of the language in which it is written.

    <?xml version="1.0" encoding="UTF-8"?>

    As you can see from the example above, the XML declaration contains so-called pseudo-attributes, which are very similar to the regular attributes that we talked about just above. The fact is that, by definition, an XML declaration and processing instructions cannot contain attributes, so these name-value pairs are called pseudo-attributes. This is worth remembering for the future to avoid various mistakes.

    Since we've dealt with pseudo-attributes, let's look at what they mean.

    • encoding - the character encoding of the XML document. UTF-8 is typically used.
    • version - the version of the XML language in which this document is written. Typically this is XML version 1.0.
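Both constructs can be produced with Python's standard `xml.etree.ElementTree`. This is only a sketch: the `xml-stylesheet` instruction and its content are an illustrative choice, and the `xml_declaration` argument of `tostring` assumes Python 3.8 or newer.

```python
import xml.etree.ElementTree as ET

# A processing instruction: a target name followed by free-form content.
pi = ET.ProcessingInstruction("xml-stylesheet", 'type="text/css" href="style.css"')
pi_text = ET.tostring(pi, encoding="unicode")
print(pi_text)

# Asking for the declaration explicitly puts it at the top of the output,
# with the version and encoding pseudo-attributes filled in.
root = ET.Element("myElement")
root.text = "Element content..."
document = ET.tostring(root, encoding="utf-8", xml_declaration=True)
print(document.decode("utf-8"))
```

Note that the declaration is written by the serializer, not stored in the tree, which matches the remark that it is not a regular processing instruction.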

    Well, now let's move on to the concluding part of the article and consider such XML constructs as comments and CDATA sections.

    Hello, dear site visitors! Let's continue the topic of XML markup language and look at the use of attributes. Attributes can be present in XML elements, just like in HTML. Attributes provide additional information about an element.

    XML Attributes

    In HTML, attributes provide additional information about the elements:

    XML Attributes Must Be Enclosed in Quotes

    Attribute values in XML must always be enclosed in quotation marks. Both single and double quotes can be used. To indicate the gender of a person element, you can write it like this:

    <person sex="female">

    If the attribute value itself contains double quotes, you can use single quotes, as in this example:

    <gangster name='George "Shotgun" Ziegler'>

    XML Elements vs. Attributes

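    Take a look at the following examples:

    <person sex="female">
      <firstname>Victoria</firstname>
      <lastname>Petrova</lastname>
    </person>

    <person>
      <sex>female</sex>
      <firstname>Victoria</firstname>
      <lastname>Petrova</lastname>
    </person>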

    In the first example, sex is an attribute. In the latter, sex is an element. Both examples provide the same information.

    There are no rules about when to use attributes and when to use elements. Attributes are handy in HTML. In XML, I advise avoiding them. Use elements instead.

    My Favorite Way

    The following three XML documents contain exactly the same information:

    The XML date attribute is used in the first example:

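    The extended date element is used in the third one (THIS IS MY FAVORITE WAY):

    <note>
      <date>
        <day>10</day>
        <month>01</month>
        <year>2008</year>
      </date>
      <to>Peter</to>
      <from>Sveta</from>
      <heading>Reminder</heading>
    </note>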

    Avoid XML Attributes?

    Some of the problems with using xml attributes:

    • attributes cannot contain multiple values (elements can)
    • attributes cannot contain tree structures (elements can)
    • attributes are harder to extend (for future changes)

    Don't do it like this:


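    XML Attributes for Metadata

    <messages>
      <note id="501">
        <to>Vasya</to>
        <from>Sveta</from>
        <heading>Reminder</heading>
        <body>Don't forget to call me tomorrow!</body>
      </note>
      <note id="502">
        <to>Sveta</to>
        <from>Vasya</from>
        <heading>Re: Reminder</heading>
        <body>OK</body>
      </note>
    </messages>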

    The id attributes above are used to identify different notes. They are not part of the note itself.

    What I'm trying to say here is that metadata (data about data) should be stored as xml attributes and the data itself should be stored as elements.

    Thank you for your attention!