
Lesson 2: Describe parsing of common data formats (XML, JSON, YAML) to Python data structures


As we discussed earlier, XML is a markup language much like HTML and consists of a set of rules for encoding documents that are both human-readable and machine-readable. XML is formally defined in W3C specifications.

Using XML, you can define your own tags or elements, their order, and how they are supposed to be processed or displayed on screen. An XML-encoded document can be stored as a file on a server or exist only transiently while it is being transmitted between two machines.

One of the most distinguishing characteristics of XML is that it allows you to define your own tags or elements, as opposed to HTML, where tags are standardized. It is similar to HTML but more flexible: it is both a language and a meta-language, meaning you can define other languages using it as the basis, for example RSS or XSLT.

XML Parsing in Python

Parsing means analyzing a message and breaking it into its components. When messages are transmitted over the wire, they travel as a stream of characters. Upon arrival, they need to be parsed into a semantically appropriate data structure in which each component is recognized as an integer, float, string, and so on. Compiling source code also involves parsing. Serialization, on the other hand, converts a data structure into a format that can be transmitted. For example, when a client of a REST API reads data from Python dictionaries and sends it to the remote resource as the equivalent JSON, YAML, or XML string, it is serializing (encoding) that data. De-serialization is a particular type of parsing (or decoding): it takes serialized data and recreates the original data structure from it.
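As a rough illustration of the difference between serializing and de-serializing, the sketch below round-trips a Python dictionary through JSON. The device dictionary and its field names are made up for the example, and the "wire" is only simulated by encoding the string to bytes.

```python
import json

# Hypothetical payload, standing in for data exchanged with a REST API.
device = {"hostname": "router1", "port": 22, "uptime_days": 12.5, "enabled": True}

# Serialization (encoding): data structure -> string -> bytes for the wire.
wire_bytes = json.dumps(device).encode("utf-8")

# De-serialization (decoding/parsing): bytes -> string -> data structure.
received = json.loads(wire_bytes.decode("utf-8"))

# Each component is recognized as a typed value again, not just characters.
for key, value in received.items():
    print(key, type(value).__name__, value)
```

After the round trip, the integer, float, string, and boolean values come back with their original Python types.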

Python allows you to parse, modify, and build XML documents. Your XML document can be stored in a file or held in a string. There are two well-known ways to parse XML with Python: you can use the ElementTree (ET) APIs or the Minidom module to load and parse XML.

The XML data format is hierarchical, and the most fitting way to represent that data is with a tree. ET provides two classes that work at two levels of this hierarchy: ElementTree, which represents the whole XML document as a tree, and Element, which represents a single node in that tree.

Interactions with the entire document, such as reading and writing files, are commonly done at the ElementTree level, whereas interactions with a single XML element and its sub-elements (children) are carried out at the Element level.
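The split between the two levels can be sketched as follows. This is a minimal example, not the lesson's original code: the persons/person/name tags and the attribute values are hypothetical, and an in-memory io.BytesIO buffer stands in for a file opened in binary mode.

```python
import io
import xml.etree.ElementTree as ET

# Element level: build and modify individual nodes.
root = ET.Element("persons")
person = ET.SubElement(root, "person", id="1")
ET.SubElement(person, "name").text = "Alice"
person.set("role", "admin")  # add an attribute to an existing element

# ElementTree level: treat the whole document as a unit and write it out.
tree = ET.ElementTree(root)
buffer = io.BytesIO()  # stands in for a file opened in binary mode
tree.write(buffer, encoding="utf-8", xml_declaration=True)

xml_text = buffer.getvalue().decode("utf-8")
print(xml_text)
```

To write to disk instead, tree.write() also accepts a file name in place of the buffer.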

Using ElementTree APIs to parse XML
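The following is a minimal sketch of parsing with ElementTree, assuming a small, made-up persons document. The tag names, the id attribute, and the values are illustrative only; the XML is kept in a string so that the example is self-contained.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; element and attribute names are illustrative.
XML_TEXT = """<persons>
    <person id="1">
        <name>Alice</name>
        <age>30</age>
    </person>
    <person id="2">
        <name>Bob</name>
        <age>25</age>
    </person>
</persons>"""

# fromstring() parses a string and returns the root Element.
root = ET.fromstring(XML_TEXT)

# Convert each <person> element into a Python dictionary.
people = []
for person in root.findall("person"):
    people.append({
        "id": person.get("id"),              # attribute value (a string)
        "name": person.findtext("name"),     # text of the <name> child
        "age": int(person.findtext("age")),  # XML text is a string; convert it
    })

print(root.tag)
print(people)
```

For a document stored in a file, ET.parse("persons.xml") returns an ElementTree, and its getroot() method gives the same kind of root Element used here.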

Using the Minidom Module to parse XML

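You can also use the Minimal Document Object Model (or Mini DOM) module to parse XML documents. However, the ElementTree module is generally preferred. Note that the Python documentation warns that neither module is secure against maliciously constructed data, so take care when parsing XML from untrusted sources.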

Using Minidom, you can achieve parsing in three simple steps.

  • Import the xml.dom.minidom module, e.g. from xml.dom import minidom
  • Use the parse function to parse the document, e.g. doc = minidom.parse("persons.xml")
  • Get the XML elements using doc.getElementsByTagName("element")
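The three steps above can be sketched as follows. The persons document is made up for illustration, and minidom.parseString() is used in place of minidom.parse("persons.xml") so that the example runs without a file on disk.

```python
from xml.dom import minidom

# Hypothetical document; element and attribute names are illustrative.
XML_TEXT = """<persons>
    <person id="1"><name>Alice</name></person>
    <person id="2"><name>Bob</name></person>
</persons>"""

# parseString() is the in-memory counterpart of minidom.parse("persons.xml").
doc = minidom.parseString(XML_TEXT)

names = []
for person in doc.getElementsByTagName("person"):
    # The text of <name> lives in a child text node of the element.
    name_node = person.getElementsByTagName("name")[0]
    names.append((person.getAttribute("id"), name_node.firstChild.data))

print(names)
```

Compared with ElementTree, the DOM API exposes text as separate child nodes, which is why the example goes through firstChild.data to reach the name.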

JSON Parsing in Python

JavaScript Object Notation (or JSON) is a language-agnostic data format that is documented as its own data encoding standard. It supports primitive types such as strings, numbers, booleans, and null, along with nested lists (arrays) and objects.

Python includes a native json package in its standard library that you can use to both encode and decode data. Use “import json” to import the package and parse JSON data into a Python dictionary or list. json.load() parses a JSON file (an open file object) into Python data structures; a JSON object becomes a dictionary organized in key-value pairs. To read and write JSON strings, use json.loads() and json.dumps() respectively.
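A minimal sketch of these calls is shown below. The JSON document is invented for the example, and io.StringIO stands in for an opened file so that the block is self-contained.

```python
import io
import json

# Hypothetical JSON document; io.StringIO stands in for an opened file.
JSON_TEXT = """
{
    "persons": [
        {"id": 1, "name": "Alice", "languages": ["Python", "Go"]},
        {"id": 2, "name": "Bob", "languages": []}
    ]
}
"""

# json.load() reads from a file-like object.
with io.StringIO(JSON_TEXT) as json_file:
    data = json.load(json_file)

# JSON objects become dictionaries and JSON arrays become lists.
first = data["persons"][0]
print(type(data).__name__, first["name"], first["languages"])

# json.loads() parses a string; json.dumps() produces one.
same_data = json.loads(JSON_TEXT)
compact = json.dumps(same_data, separators=(",", ":"))
print(compact)
```

With a real file, the same pattern would be open("persons.json") followed by json.load() on the resulting file object; the file name here is only an example.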

YAML Parsing in Python

YAML Ain’t Markup Language (or YAML) is a data encoding or serialization standard designed to be human-friendly. Much like JSON, it is language-agnostic. You can use the PyYAML library to read and write YAML data.

PyYAML is a third-party library, so it has to be installed first (for example with pip install pyyaml). You can then import it using “import yaml” and load a YAML file into a Python dictionary or other data structure using the yaml.safe_load() function. Use the yaml.dump() function to write YAML.
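The sketch below shows one way these two functions might be used, assuming PyYAML is installed. The YAML document and the added "Carol" entry are made up for illustration, and the YAML is kept in a string rather than a file.

```python
import yaml  # provided by the third-party PyYAML package

# Hypothetical YAML document; keys and values are illustrative.
YAML_TEXT = """
persons:
  - id: 1
    name: Alice
    languages: [Python, Go]
  - id: 2
    name: Bob
    languages: []
"""

# safe_load() accepts a string or an open file and returns Python objects.
data = yaml.safe_load(YAML_TEXT)
print(data["persons"][0]["name"])

# Modify the data structure, then serialize it back to YAML text.
data["persons"].append({"id": 3, "name": "Carol", "languages": ["Rust"]})
yaml_out = yaml.dump(data, sort_keys=False)
print(yaml_out)
```

safe_load() is used rather than load() because it only constructs plain Python types such as dictionaries, lists, strings, and numbers.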

