This was originally written around 2013 and may need updating. (2021)
XML
XML is a textual representation of structured data. For example, it is used for web pages (XHTML) and book sources (DocBook).
The tag pair “<tag> ... </tag>” or the self-closing tag “<tag />” is used to mark up the text data. This simple XML data structure makes it possible to build generic data processing tools for it, such as XSLT, DOM, SAX, … .
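For instance, here is a minimal sketch of parsing such markup with Python’s ElementTree module (introduced below); the element names are made up for illustration.
Parsing a tiny XML fragment (sketch)
import xml.etree.ElementTree as ET

# a tiny made-up document using both a tag pair and a self-closing tag
doc = ET.fromstring('<book><title>Debian Reference</title><pagebreak /></book>')
print(doc.find('title').text)  # -> Debian Reference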
XSLT
I used XSLT (Extensible Stylesheet Language Transformations) to manipulate the DocBook XML source files for my Debian Reference.
It automated inserting popcon data and reformatting URLs to suit the po4a translation framework.
Its source is available at the alioth subversion repo for debian-reference.
My impression is: “yes, we can write it … but debugging is hell and the code is unintuitive.”
So my next update of these build scripts may not use XSLT.
Python
There are several Python XML processing modules to choose from.
- DOM (generic XML parser which reads the whole document first and organizes its objects in a tree structure)
- SAX (generic XML parser which reads the document sequentially with an event-driven API)
- ElementTree (new easy-to-use pythonic API, now also provided by lxml; see the sketch after this list)
  - parse() is like DOM
  - iterparse() is like SAX
  - limited XPath support (no parent node)
- … others
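A minimal sketch of the parse()/iterparse() difference, using the jouyou.xml file downloaded in the next section:
parse() vs. iterparse() (sketch)
import xml.etree.ElementTree as ET

# DOM-like: parse() reads the whole document into a tree first.
tree = ET.parse('jouyou.xml')
print(tree.getroot().tag)

# SAX-like: iterparse() yields elements one by one as "end" events arrive.
for event, elem in ET.iterparse('jouyou.xml', events=('end',)):
    if elem.tag == 'a':
        pass  # handle each <a> element here
    elem.clear()  # discard processed elements to keep memory use low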
When I wanted to extract the list of the jōyō kanji (literally “regular-use Chinese characters”) data from the 常用漢字一覧 (list of jōyō kanji) page of the Japanese Wikipedia site, I chose ElementTree since it looked easier to program.
Download data
Let’s download the target HTML page (in XHTML).
Download XML file
$ wget -4 -O - http://ja.wikipedia.org/wiki/%E5%B8%B8%E7%94%A8%E6%BC%A2%E5%AD%97%E4%B8%80%E8%A6%A7 |zcat - > jouyou.xml
--2013-08-18 10:17:35--  http://ja.wikipedia.org/wiki/%E5%B8%B8%E7%94%A8%E6%BC%A2%E5%AD%97%E4%B8%80%E8%A6%A7
Resolving ja.wikipedia.org (ja.wikipedia.org)... 208.80.154.225
Connecting to ja.wikipedia.org (ja.wikipedia.org)|208.80.154.225|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 102499 (100K) [text/html]
Saving to: ‘STDOUT’
  0K .......... .......... .......... .......... ..........  49%  134K 0s
 50K .......... .......... .......... .......... ..........  99%  266K 0s
100K                                                        100%  184G=0.6s
2013-08-18 10:17:36 (179 KB/s) - written to stdout [102499/102499]
Please note that the original downloaded data was compressed; that is why the command pipes it through zcat.
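If you would rather stay in Python, the same download can be sketched with the standard library; the explicit gzip handling mirrors the zcat step. This is only an illustration, under the assumption that the server honors the Accept-Encoding header.
Download sketch in Python
import gzip
import urllib.request

url = ('http://ja.wikipedia.org/wiki/'
       '%E5%B8%B8%E7%94%A8%E6%BC%A2%E5%AD%97%E4%B8%80%E8%A6%A7')
req = urllib.request.Request(url, headers={'Accept-Encoding': 'gzip'})
with urllib.request.urlopen(req) as f:
    data = f.read()
# decompress only if the server actually sent gzip data
if data[:2] == b'\x1f\x8b':
    data = gzip.decompress(data)
with open('jouyou.xml', 'wb') as f:
    f.write(data)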
Find XPath to target
Since it is very awkward to work out the XPath location by counting elements manually, I created a text-pattern-to-XPath search tool. (This works for XML somewhat like what grep does for plain text.)
Program to search XPath (search-xpath.py)
#!/usr/bin/python3
# vim: set sts=4 expandtab:
#
import xml.etree.ElementTree as ET
import re

def search_retext(element, text, xpath=''):
    # Walk the tree recursively and print the XPath of every element
    # whose text matches the regular expression in `text`.
    retext = re.compile(text)
    if xpath == '':
        xpath = element.tag
    subelements = element.findall('*')
    tagcount = {}
    if subelements != []:
        for s in subelements:
            # count siblings with the same tag to build the [n] index
            if s.tag in tagcount:
                tagcount[s.tag] += 1
            else:
                tagcount[s.tag] = 1
            subxpath = xpath + '/' + s.tag + '[' + str(tagcount[s.tag]) + ']'
            if not (s.text is None):
                if retext.search(s.text.strip()):
                    print("=> %s" % subxpath)
                    print("with '%s': '%s'" % (text, s.text.strip()))
            search_retext(s, text, subxpath)
    return

if __name__ == '__main__':
    tree = ET.parse('jouyou.xml')
    root = tree.getroot()
    print("search_retext(root, '^一覧')")
    search_retext(root, '^一覧')
    print("search_retext(root, '^本表')")
    search_retext(root, '^本表')
    print("search_retext(root, '^亜')")
    search_retext(root, '^亜')
    print("search_retext(root, '^葛')")
    search_retext(root, '^葛')
Let’s run this script to find the XPaths of key locations containing specific texts such as “一覧” (“list”).
Searching XPath
$ ./search-xpath.py
search_retext(root, '^一覧')
=> html/body[1]/div[3]/div[3]/div[4]/div[1]/ul[1]/li[1]/a[1]/span[2]
with '^一覧': '一覧'
=> html/body[1]/div[3]/div[3]/div[4]/h2[1]/span[1]
with '^一覧': '一覧'
search_retext(root, '^本表')
=> html/body[1]/div[3]/div[3]/div[4]/div[1]/ul[1]/li[1]/ul[1]/li[1]/a[1]/span[2]
with '^本表': '本表'
=> html/body[1]/div[3]/div[3]/div[4]/h3[1]/span[1]
with '^本表': '本表'
search_retext(root, '^亜')
=> html/body[1]/div[3]/div[3]/div[4]/table[2]/tr[2]/td[2]/a[1]
with '^亜': '亜'
search_retext(root, '^葛')
=> html/body[1]/div[3]/div[3]/div[4]/ul[1]/li[7]/ul[1]/li[20]/span[1]
with '^葛': '葛'
=> html/body[1]/div[3]/div[3]/div[4]/table[2]/tr[239]/td[2]/a[1]
with '^葛': '葛'
Extract data from XML
Now we know that the kanji data is under html/body[1]/div[3]/div[3]/div[4]/table[2]/tr[*]/td[2]/a[1], where * iterates over the table rows. The following script extracts the data from there. (This works for XML somewhat like what sed does for plain text.)
Program to extract data (kanji.py)
#!/usr/bin/python3
# vim: set sts=4 expandtab:
import xml.etree.ElementTree as ET

def search_kanji(element):
    kanji = []
    for row in element.findall('body[1]/div[3]/div[3]/div[4]/table[2]/tr'):
        # processing table for each row (expandable!)
        k = row.find('td[2]/a')
        if k is None:
            continue
        kanji.append(k.text)
    return kanji

if __name__ == '__main__':
    element = ET.parse('jouyou.xml')
    root = element.getroot()
    print(search_kanji(root))
Let’s run this script to list all the jōyō kanji.
List of all jōyō kanji (truncated).
$ ./kanji.py
['亜', '哀', '挨', '愛', '曖', '悪', '握', '圧', '扱', '宛', '嵐', '安', '案', '暗', '以', '衣', ...
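The “expandable!” comment marks the natural place to pull more columns out of each row. A hypothetical sketch follows; the td index for the extra column is an assumption, so verify the real layout with search-xpath.py first.
Extracting an extra column (hypothetical sketch)
def search_kanji_and_more(element):
    result = []
    for row in element.findall('body[1]/div[3]/div[3]/div[4]/table[2]/tr'):
        k = row.find('td[2]/a')
        # td[6] as the reading column is hypothetical; check the actual table
        r = row.find('td[6]')
        if k is None:
            continue
        result.append((k.text, r.text if r is not None else None))
    return result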
XML with namespaces
For example, this fun2prog web page is in the typical XHTML format, with the namespace embedded in the “<html>” tag attribute as “<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">”. This declaration applies to the child elements until it is overridden by another definition.
In an ElementTree, such qualified names are stored as universal names in Clark’s notation. So if “<tag>” appears in such XML, ElementTree reads it as:
{http://www.w3.org/1999/xhtml}tag
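So a find() call has to spell out this universal name. A minimal sketch of both spellings, assuming a local copy fun2prog.html of this page (see below):
Finding namespaced elements (sketch)
import xml.etree.ElementTree as ET

XHTML = '{http://www.w3.org/1999/xhtml}'
root = ET.parse('fun2prog.html').getroot()

# spell out the universal name in Clark's notation ...
body = root.find(XHTML + 'body')

# ... or map a prefix to the namespace URI and let ElementTree expand it
body = root.find('x:body', namespaces={'x': 'http://www.w3.org/1999/xhtml'})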
For example, let’s assume that a copy of this web page, fun2prog.html, is in the current directory. The first thing to do is to change the above example code to the following.
...
tree = ET.parse('fun2prog.html')
root = tree.getroot()
...
Running this will produce very noisy output (folded here to fit the screen).
{http://www.w3.org/1999/xhtml}html/ \
{http://www.w3.org/1999/xhtml}body[1]/ \
{http://www.w3.org/1999/xhtml}div[2]/ ...
The following should let you suppress these curly brackets.
...
# read the page into a string and drop the first xmlns declaration,
# so that the parsed tags carry no namespace prefix
f = open('fun2prog.html', 'r')
xhtmlstring = f.read()
f.close()
xhtmlstring = re.sub(' xmlns="[^"]+"', '', xhtmlstring, count=1)
root = ET.fromstring(xhtmlstring)
...
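Stripping the xmlns attribute with a regular expression keeps all the earlier path expressions unchanged, at the cost of editing the document text before parsing; the namespaces= mapping sketched above keeps the document intact but requires a prefix on every step of every path. For a one-off extraction script like this, either approach seems defensible.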