This was originally written around 2013 and may need updating. (2021)
XML
XML is a textual representation of structured data. For example, it is used for web pages (XHTML) and book sources (DocBook).
The tag pair “<tag> ... </tag>” or the self-closing tag “<tag />” is used to mark up the text data. This simple XML data structure makes it possible to build generic data processing tools for it, such as XSLT, DOM, SAX, … .
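For instance, here is a minimal sketch of parsing such markup with Python’s ElementTree module (introduced below); the element names are made up for illustration.
Parsing a tiny XML fragment (sketch)
import xml.etree.ElementTree as ET

# a tiny made-up document using both a tag pair and a self-closing tag
doc = ET.fromstring('<book><title>Debian Reference</title><pagebreak /></book>')
print(doc.find('title').text)  # -> Debian Reference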
XSLT
I used XSLT (Extensible Stylesheet Language Transformations) to manipulate the DocBook XML source files for my Debian Reference.
It automated inserting popcon data and reformatting URLs to suit the po4a translation framework.
Its source is available at the alioth subversion repo for debian-reference.
My impression is: “yes, we can write it … but debugging is hell and the code is unintuitive.”
So my next update of these build scripts may not use XSLT.
Python
There are several Python XML processing modules to choose from.
- DOM (generic XML parser which reads the whole document first and organizes its objects in a tree structure)
- SAX (generic XML parser which reads the document sequentially with an event-driven API)
- ElementTree (new easy-to-use pythonic API, now also provided by lxml; see the sketch after this list)
  - parse() is like DOM
  - iterparse() is like SAX
  - limited XPath support (no parent node)
- … others
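A minimal sketch of the parse()/iterparse() difference, using the jouyou.xml file downloaded in the next section:
parse() vs. iterparse() (sketch)
import xml.etree.ElementTree as ET

# DOM-like: parse() reads the whole document into a tree first.
tree = ET.parse('jouyou.xml')
print(tree.getroot().tag)

# SAX-like: iterparse() yields elements one by one as "end" events arrive.
for event, elem in ET.iterparse('jouyou.xml', events=('end',)):
    if elem.tag == 'a':
        pass  # handle each <a> element here
    elem.clear()  # discard processed elements to keep memory use low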
When I wanted to extract the list of the jōyō kanji (literally “regular-use Chinese characters”) data from the 常用漢字一覧 (list of jōyō kanji) page of the Japanese Wikipedia site, I chose ElementTree since it looked easier to program.
Download data
Let’s download the target HTML page (in XHTML).
Download XML file
$ wget -4 -O - http://ja.wikipedia.org/wiki/%E5%B8%B8%E7%94%A8%E6%BC%A2%E5%AD%97%E4%B8%80%E8%A6%A7 |zcat - > jouyou.xml
--2013-08-18 10:17:35--  http://ja.wikipedia.org/wiki/%E5%B8%B8%E7%94%A8%E6%BC%A2%E5%AD%97%E4%B8%80%E8%A6%A7
Resolving ja.wikipedia.org (ja.wikipedia.org)... 208.80.154.225
Connecting to ja.wikipedia.org (ja.wikipedia.org)|208.80.154.225|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 102499 (100K) [text/html]
Saving to: ‘STDOUT’
  0K .......... .......... .......... .......... ..........  49%  134K 0s
 50K .......... .......... .......... .......... ..........  99%  266K 0s
100K                                                        100%  184G=0.6s
2013-08-18 10:17:36 (179 KB/s) - written to stdout [102499/102499]
Please note that the original downloaded data was compressed; that is why the command pipes it through zcat.
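If you would rather stay in Python, the same download can be sketched with the standard library; the explicit gzip handling mirrors the zcat step. This is only an illustration, under the assumption that the server honors the Accept-Encoding header.
Download sketch in Python
import gzip
import urllib.request

url = ('http://ja.wikipedia.org/wiki/'
       '%E5%B8%B8%E7%94%A8%E6%BC%A2%E5%AD%97%E4%B8%80%E8%A6%A7')
req = urllib.request.Request(url, headers={'Accept-Encoding': 'gzip'})
with urllib.request.urlopen(req) as f:
    data = f.read()
# decompress only if the server actually sent gzip data
if data[:2] == b'\x1f\x8b':
    data = gzip.decompress(data)
with open('jouyou.xml', 'wb') as f:
    f.write(data)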
Find XPath to target
Since it is very awkward to work out the XPath location by counting elements manually, I created a text-pattern-to-XPath search tool. (This works for XML somewhat like what grep does for plain text.)
Program to search XPath (search-xpath.py)
#!/usr/bin/python3
# vim: set sts=4 expandtab:
#
import xml.etree.ElementTree as ET
import re

def search_retext(element, text, xpath=''):
    # Walk the tree recursively and print the XPath of every element
    # whose text matches the regular expression in `text`.
    retext = re.compile(text)
    if xpath == '':
        xpath = element.tag
    subelements = element.findall('*')
    tagcount = {}
    if subelements != []:
        for s in subelements:
            # count siblings with the same tag to build the [n] index
            if s.tag in tagcount:
                tagcount[s.tag] += 1
            else:
                tagcount[s.tag] = 1
            subxpath = xpath + '/' + s.tag + '[' + str(tagcount[s.tag]) + ']'
            if not (s.text is None):
                if retext.search(s.text.strip()):
                    print("=> %s" % subxpath)
                    print("with '%s': '%s'" % (text, s.text.strip()))
            search_retext(s, text, subxpath)
    return

if __name__ == '__main__':
    tree = ET.parse('jouyou.xml')
    root = tree.getroot()
    print("search_retext(root, '^一覧')")
    search_retext(root, '^一覧')
    print("search_retext(root, '^本表')")
    search_retext(root, '^本表')
    print("search_retext(root, '^亜')")
    search_retext(root, '^亜')
    print("search_retext(root, '^葛')")
    search_retext(root, '^葛')
Let’s run this script to find the XPaths of key locations containing specific texts such as “一覧” (“list”).
Searching XPath
$ ./search-xpath.py
search_retext(root, '^一覧')
=> html/body[1]/div[3]/div[3]/div[4]/div[1]/ul[1]/li[1]/a[1]/span[2]
with '^一覧': '一覧'
=> html/body[1]/div[3]/div[3]/div[4]/h2[1]/span[1]
with '^一覧': '一覧'
search_retext(root, '^本表')
=> html/body[1]/div[3]/div[3]/div[4]/div[1]/ul[1]/li[1]/ul[1]/li[1]/a[1]/span[2]
with '^本表': '本表'
=> html/body[1]/div[3]/div[3]/div[4]/h3[1]/span[1]
with '^本表': '本表'
search_retext(root, '^亜')
=> html/body[1]/div[3]/div[3]/div[4]/table[2]/tr[2]/td[2]/a[1]
with '^亜': '亜'
search_retext(root, '^葛')
=> html/body[1]/div[3]/div[3]/div[4]/ul[1]/li[7]/ul[1]/li[20]/span[1]
with '^葛': '葛'
=> html/body[1]/div[3]/div[3]/div[4]/table[2]/tr[239]/td[2]/a[1]
with '^葛': '葛'
Extract data from XML
Now we know that the kanji data is under html/body[1]/div[3]/div[3]/div[4]/table[2]/tr[*]/td[2]/a[1], where * iterates over the table rows. The following script extracts the data from there. (This works for XML somewhat like what sed does for plain text.)
Program to extract data (kanji.py)
#!/usr/bin/python3
# vim: set sts=4 expandtab:
import xml.etree.ElementTree as ET

def search_kanji(element):
    kanji = []
    for row in element.findall('body[1]/div[3]/div[3]/div[4]/table[2]/tr'):
        # processing table for each row (expandable!)
        k = row.find('td[2]/a')
        if k is None:
            continue
        kanji.append(k.text)
    return kanji

if __name__ == '__main__':
    element = ET.parse('jouyou.xml')
    root = element.getroot()
    print(search_kanji(root))
Let’s run this script to list all the jōyō kanji.
List of all jōyō kanji (truncated).
$ ./kanji.py
['亜', '哀', '挨', '愛', '曖', '悪', '握', '圧', '扱', '宛', '嵐', '安', '案', '暗', '以', '衣', ...
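The “expandable!” comment marks the natural place to pull more columns out of each row. A hypothetical sketch follows; the td index for the extra column is an assumption, so verify the real layout with search-xpath.py first.
Extracting an extra column (hypothetical sketch)
def search_kanji_and_more(element):
    result = []
    for row in element.findall('body[1]/div[3]/div[3]/div[4]/table[2]/tr'):
        k = row.find('td[2]/a')
        # td[6] as the reading column is hypothetical; check the actual table
        r = row.find('td[6]')
        if k is None:
            continue
        result.append((k.text, r.text if r is not None else None))
    return result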
XML with namespaces
For example, this fun2prog web page is in the typical XHTML format, with the namespace embedded in the “<html>” tag attribute as “<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">”. This declaration applies to the child elements until it is overridden by another definition.
In an ElementTree, such qualified names are stored as universal names in Clark’s notation. So if “<tag>” appears in such XML, ElementTree reads it as:
{http://www.w3.org/1999/xhtml}tag
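So a find() call has to spell out this universal name. A minimal sketch of both spellings, assuming a local copy fun2prog.html of this page (see below):
Finding namespaced elements (sketch)
import xml.etree.ElementTree as ET

XHTML = '{http://www.w3.org/1999/xhtml}'
root = ET.parse('fun2prog.html').getroot()

# spell out the universal name in Clark's notation ...
body = root.find(XHTML + 'body')

# ... or map a prefix to the namespace URI and let ElementTree expand it
body = root.find('x:body', namespaces={'x': 'http://www.w3.org/1999/xhtml'})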
For example, let’s assume that a copy of this web page, fun2prog.html, is in the current directory. The first thing to do is to change the above example code to the following.
...
tree = ET.parse('fun2prog.html')
root = tree.getroot()
...
Running this will produce very noisy output (folded here to fit the screen).
{http://www.w3.org/1999/xhtml}html/ \
{http://www.w3.org/1999/xhtml}body[1]/ \
{http://www.w3.org/1999/xhtml}div[2]/ ...
The following should let you suppress these curly brackets.
...
# read the page into a string and drop the first xmlns declaration,
# so that the parsed tags carry no namespace prefix
f = open('fun2prog.html', 'r')
xhtmlstring = f.read()
f.close()
xhtmlstring = re.sub(' xmlns="[^"]+"', '', xhtmlstring, count=1)
root = ET.fromstring(xhtmlstring)
...
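Stripping the xmlns attribute with a regular expression keeps all the earlier path expressions unchanged, at the cost of editing the document text before parsing; the namespaces= mapping sketched above keeps the document intact but requires a prefix on every step of every path. For a one-off extraction script like this, either approach seems defensible.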