mediawiki-export-parser

A simple tool for retrieval of specific information from a MediaWiki XML export.

Why bother?

While upgrading and heavily modifying a MediaWiki instance, I corrupted the database and had to start from scratch. I then had difficulty getting the MediaWiki export XML to import properly (possibly due to the jump across several major revisions of MediaWiki and the way the current instance is set up).

MediaWiki documentation is rather byzantine. Rather than go down that rabbit hole, I wrote this script to give users a fairly straightforward way to get the latest, or another specific, revision of all or selected pages in their MediaWiki XML export file.

Basic implementation

To print the name of each page and its latest revision from the MediaWiki XML in an easily readable manner:

rev_list = get_latest_revisions('C:\\path\\to\\export.xml')

# Each key is a page title; each value is a dict describing that page.
for key, val in rev_list.items():
    print('~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~')
    print(key)
    print('~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~')
    # Print only the text of the latest revision, if present.
    if 'latest_text' in val:
        print(val['latest_text'])

...which outputs:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Royal Enfield
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Royal Enfield was a brand name under which...

Alternatively, to dump all the information that get_latest_revisions() considers relevant:

import json

rev_list = get_latest_revisions('C:\\path\\to\\export.xml')
print(json.dumps(rev_list, sort_keys=True, indent=4))

...which prints:

"Royal Enfield": {
    "latest_rev": "2018-06-24T00:44:37Z",
    "latest_text": "Royal Enfield was a brand name under which...",
    "rev_timestamps": [
        "2018-06-23T21:48:06Z",
        "2018-06-23T23:37:52Z",
        "2018-06-23T23:38:26Z",
        "2018-06-24T00:42:55Z",
        "2018-06-24T00:43:56Z",
        "2018-06-24T00:44:10Z",
        "2018-06-24T00:44:37Z"
    ]
},
...
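
The timestamps in this structure are UTC strings in a fixed format, so they can be parsed and compared. As a minimal sketch, assuming the dictionary shape shown above, one could check whether the revision reported as latest is also the last one listed. The helper names `parse_ts` and `latest_is_last` are hypothetical and not part of this script, and the sample data is a shortened copy of the output above.

```python
from datetime import datetime


def parse_ts(ts):
    """Parse a timestamp such as '2018-06-24T00:44:37Z' into a datetime."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ")


def latest_is_last(page_info):
    """True if 'latest_rev' is both the newest and the last-listed timestamp."""
    stamps = page_info["rev_timestamps"]
    newest = max(stamps, key=parse_ts)
    return page_info["latest_rev"] == newest == stamps[-1]


# Shortened copy of the sample output above (middle timestamps omitted).
sample = {
    "latest_rev": "2018-06-24T00:44:37Z",
    "latest_text": "Royal Enfield was a brand name under which...",
    "rev_timestamps": [
        "2018-06-23T21:48:06Z",
        "2018-06-24T00:42:55Z",
        "2018-06-24T00:44:37Z",
    ],
}

print(latest_is_last(sample))  # True
```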

MediaWiki XML format

In a MediaWiki export where a page has only a single revision (one entry), the page is structured as follows:

<mediawiki>
    <page>
        <title>Royal Enfield</title>
        <ns>0</ns>
        <id>100</id>
        <revision>
            <id>20</id>
            <timestamp>2018-05-12T02:40:17Z</timestamp>
            <contributor>
                <username>admin</username>
                <id>1</id>
            </contributor>
            <comment>Created page with &quot;lorem ipsum...&quot;</comment>
            <model>wikitext</model>
            <format>text/x-wiki</format>
            <text xml:space="preserve" bytes="288">
                Royal Enfield was a brand name under which...
            </text>
            <sha1>64l7o1kg4c4j222oqktfcf5446y3a2l</sha1>
        </revision>
    </page>
    <page>
        ...
</mediawiki>
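
To make the structure concrete, here is a minimal sketch of walking it with Python's standard `xml.etree.ElementTree`. It is not this script's implementation. The helpers `local` and `child_text` are hypothetical, and the namespace stripping is an assumption: real exports typically declare an XML namespace on the root element, which the simplified example above omits.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the single-revision example above.
SAMPLE = """<mediawiki>
    <page>
        <title>Royal Enfield</title>
        <ns>0</ns>
        <id>100</id>
        <revision>
            <id>20</id>
            <timestamp>2018-05-12T02:40:17Z</timestamp>
            <text xml:space="preserve" bytes="288">
                Royal Enfield was a brand name under which...
            </text>
        </revision>
    </page>
</mediawiki>"""


def local(tag):
    """Drop a namespace prefix: '{uri}page' becomes 'page'."""
    return tag.rsplit("}", 1)[-1]


def child_text(elem, name):
    """Stripped text of the first direct child named `name`, or None."""
    for child in elem:
        if local(child.tag) == name:
            return (child.text or "").strip()
    return None


rows = []
for page in ET.fromstring(SAMPLE):
    if local(page.tag) != "page":
        continue
    title = child_text(page, "title")
    for rev in page:
        if local(rev.tag) == "revision":
            rows.append((title, child_text(rev, "timestamp"), child_text(rev, "text")))

for row in rows:
    print(row)
```

Matching on the local tag name keeps the loop working whether or not the export declares a namespace.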

Where a page has multiple revisions, it contains multiple <revision> elements, but the file is otherwise identical:

<mediawiki>
    <page>
        <title>Royal Enfield</title>
        <ns>0</ns>
        <id>100</id>
        <revision>
            <id>20</id>
            <parentid>90</parentid>
            <timestamp>2018-05-12T02:40:17Z</timestamp>
            ...
        </revision>
        <revision>
            <id>146</id>
            <parentid>147</parentid>
            <timestamp>2019-08-24T00:33:45Z</timestamp>
            ...
        </revision>
        <revision>
            <id>147</id>
            <parentid>148</parentid>
            <timestamp>2019-08-24T01:37:54Z</timestamp>
            ...
        </revision>
    </page>
    ...
</mediawiki>        

Note that the <id> values inside <revision> do not follow on consecutively from one revision to the next; in my example above, the first is 20 and the second is 146. There is, however, some relationship between <id> and <parentid>.

From this, I would assume revision IDs are assigned sequentially across the whole wiki rather than per page, but I haven't looked into it, and it is not relevant to the current scope. What we're looking for, with latest_revision=True, is the <revision> with the most recent <timestamp>.
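
Selecting by timestamp rather than by position or ID can be sketched as follows. This is an illustration under stated assumptions, not the code behind get_latest_revisions(): the function name `latest_by_timestamp` and its helpers are hypothetical, the sample XML is made up (with the revisions deliberately out of file order), and every <revision> is assumed to carry a <timestamp> in the format shown above.

```python
import xml.etree.ElementTree as ET

# Made-up sample: the newest revision is listed first on purpose.
SAMPLE = """<mediawiki>
    <page>
        <title>Royal Enfield</title>
        <revision>
            <id>147</id>
            <timestamp>2019-08-24T01:37:54Z</timestamp>
            <text>newest text</text>
        </revision>
        <revision>
            <id>20</id>
            <timestamp>2018-05-12T02:40:17Z</timestamp>
            <text>oldest text</text>
        </revision>
    </page>
</mediawiki>"""


def local(tag):
    """Drop a namespace prefix: '{uri}page' becomes 'page'."""
    return tag.rsplit("}", 1)[-1]


def child_text(elem, name):
    """Stripped text of the first direct child named `name`, or '' if absent."""
    for child in elem:
        if local(child.tag) == name:
            return (child.text or "").strip()
    return ""


def latest_by_timestamp(xml_string):
    """Map each page title to its newest revision, judged by <timestamp>."""
    result = {}
    for page in ET.fromstring(xml_string):
        if local(page.tag) != "page":
            continue
        revisions = [
            (child_text(rev, "timestamp"), child_text(rev, "text"))
            for rev in page
            if local(rev.tag) == "revision"
        ]
        if not revisions:
            continue
        # Timestamps in this fixed UTC format sort chronologically as strings.
        latest_ts, latest_text = max(revisions, key=lambda r: r[0])
        result[child_text(page, "title")] = {
            "latest_rev": latest_ts,
            "latest_text": latest_text,
            "rev_timestamps": sorted(ts for ts, _ in revisions),
        }
    return result


pages = latest_by_timestamp(SAMPLE)
print(pages["Royal Enfield"]["latest_rev"])  # 2019-08-24T01:37:54Z
```

The result mirrors the keys shown in the JSON output earlier (latest_rev, latest_text, rev_timestamps), so it can be printed the same way.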

To-do

Built from immediate necessity, this is fairly rudimentary, but - I hope - helpful. Below is a list of specific items to add, as well as a "wish list" of medium-term additions necessary to make this a standalone application, and/or because, as a data geek, I have ridiculous aspirations to see the depth of others' projects.

Functions and parameters to add

  • get_latest_revisions()
    • return_errors=False: if True, return a list of errors for pages where the latest revision (by timestamp) is not the last revision in file order.
    • return_latest=True: the function currently always returns the latest entry. One may want to retrieve an earlier revision instead, by setting this to False.
    • revision=0: return a specific revision number, used when return_latest=False.
  • revision_errors() Call to return errors from get_latest_revisions()
  • revisions_to_csv() Rather than just dumping the data out with print(), ideally, this would be exported as a CSV file, with each(?) XML tag as its own field (or at least <revision>, <timestamp>, and <page>).
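
As a rough idea of what revisions_to_csv() might look like, here is a minimal sketch built on the standard `csv` module. The signature, the column names, and the choice of one row per page are all assumptions, since this function does not exist yet; the input is assumed to have the dictionary shape that get_latest_revisions() is shown returning above.

```python
import csv
import io


def revisions_to_csv(rev_list, out):
    """Write one row per page to the file-like object `out`."""
    writer = csv.writer(out)
    writer.writerow(["title", "latest_rev", "revision_count", "latest_text"])
    for title, info in sorted(rev_list.items()):
        writer.writerow([
            title,
            info["latest_rev"],
            len(info["rev_timestamps"]),
            info["latest_text"],
        ])


# Shortened copy of the sample output shown earlier.
sample = {
    "Royal Enfield": {
        "latest_rev": "2018-06-24T00:44:37Z",
        "latest_text": "Royal Enfield was a brand name under which...",
        "rev_timestamps": ["2018-06-23T21:48:06Z", "2018-06-24T00:44:37Z"],
    }
}

buffer = io.StringIO()
revisions_to_csv(sample, buffer)
print(buffer.getvalue())
```

When writing to a real file instead of a buffer, open it with `newline=""` so the `csv` module controls line endings.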

Wish list

  • It'd be cool to see the revision IDs visualized somehow, showing work throughout one's wiki over time. Maybe char count? Revision count? Complexity of entry? Sentiment analysis? I should understand MediaWiki IDs better...