Steve edited this page Jan 30, 2018 · 11 revisions

We get Infoboxes from the page's parse tree XML and convert them to a dict with wptools.utils.template_to_dict().
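To make the idea concrete, here is a minimal sketch of pulling template parameters out of parse tree XML using only the standard library. It is not the wptools implementation: the function name `template_params`, the "Infobox" title-prefix rule, and the sample XML are assumptions for illustration, and the sketch flattens nested markup to plain text.

```python
import xml.etree.ElementTree as ET


def template_params(parsetree_xml, prefix="Infobox"):
    """Return the parameters of the first template whose title starts with prefix.

    Hypothetical helper: assumes each <part> holds a <name> and a <value>,
    with positional parameters marked as <name index="N"/>.
    """
    root = ET.fromstring(parsetree_xml)
    for template in root.iter("template"):
        title = (template.findtext("title") or "").strip()
        if not title.lower().startswith(prefix.lower()):
            continue
        params = {}
        for part in template.findall("part"):
            name = part.find("name")
            value = part.find("value")
            # Named parameters carry text; positional ones carry an index attribute.
            key = (name.text or "").strip() or name.get("index")
            # itertext() drops nested tags, so nested templates lose their structure.
            params[key] = "".join(value.itertext()).strip()
        return params
    return None


# Hand-written sample in the shape of a parse tree, not real API output.
sample = (
    "<root><template><title>Infobox person</title>"
    "<part><name>birth_name</name>=<value>Mohandas Karamchand Gandhi</value></part>"
    "<part><name>image</name>=<value>MKGandhi.jpg</value></part>"
    "</template></root>"
)
print(template_params(sample))
```

In practice, page.data['infobox'] already holds this kind of dict, so a helper like this is only useful for understanding where the values come from.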

Getting data from Infoboxes may be unavoidable, but getting Wikidata (via page.get_wikidata()) is preferred. Wikidata is structured but sometimes data-poor, while Infoboxes are unstructured and frequently data-rich. If the information you want is only available in a MediaWiki instance, please consider adding it to Wikidata so that others may benefit from open, linked data.

Example:

>>> page = wptools.page('Gandhi')
>>> page.get_parse()
en.wikipedia.org (parse) Gandhi
en.wikipedia.org (imageinfo) File:MKGandhi.jpg
Mahatma Gandhi (en) data
{
  image: <list(1)> {u'size': 2951123, 'kind': 'parse-image', u'des...
  infobox: <dict(25)> known_for, other_names, image, signature, bi...
  iwlinks: <list(10)> https://biblio.wiki/wiki/Mohandas_K._Gandhi,...
  pageid: 19379
  parsetree: <str(330951)> <root><template><title>Redirect</title>...
  requests: <list(2)> parse, imageinfo
  title: Mahatma Gandhi
  wikibase: Q1001
  wikidata_url: https://www.wikidata.org/wiki/Q1001
  wikitext: <str(260607)> {{Redirect|Gandhi}}{{pp-move-indef}}{{pp...
}
>>> page.data['infobox']
{'alma_mater': '[[University College London]]<ext><name>ref</name><attr/><inner>{{cite book|author1=Jeffrey M. Shaw |author2=Timothy J. Demy |title=War and Religion: An Encyclopedia of Faith and Conflict |url=https://books.google.com/books?id=KDlFDgAAQBAJ&amp;pg=PA309|year=2017|publisher=ABC-CLIO|isbn=978-1-61069-517-6|pages=309 }}</inner><close>&lt;/ref&gt;</close></ext><br>[[Inner Temple]]',
 'alt': u'The face of Gandhi in old age\u2014smiling, wearing glasses, and with a white sash over his right shoulder',
 'birth_date': '{{Birth date|df|=|yes|1869|10|2}}',
 'birth_name': 'Mohandas Karamchand Gandhi',
 'birth_place': '[[Porbandar]], [[Porbandar State]], [[Kathiawar Agency]], [[Bombay Presidency]], [[British Raj|British India]]<ext><name>ref</name><attr> name="Gandhi DOB"</attr><inner>[[#Rajmohan|Gandhi, Rajmohan (2006)]] [https://books.google.com/?id=FauJL7LKXmkC pp. 1&#8211;3].</inner><close>&lt;/ref&gt;</close></ext><br />(present-day [[Gujarat]], India)',
 'children': '{{hlist|[[Harilal Gandhi|Harilal]]|[[Manilal Gandhi|Manilal]]|[[Ramdas Gandhi|Ramdas]]|[[Devdas Gandhi|Devdas]]}}',
 'death_cause': '[[Assassination of Mahatma Gandhi|Assassination]]',
 'death_date': '{{Death date and age|df|=|yes|1948|1|30|1869|10|2}}',
 'death_place': '[[New Delhi]], [[Delhi]], [[Dominion of India]] (present-day India)',
 'father': '[[Karamchand Uttamchand Gandhi|Karamchand Gandhi]]',
 'honorific_prefix': u'[[Mah\u0101tm\u0101]]',
 'image': 'MKGandhi.jpg',
 'known_for': '[[Indian Independence Movement]],<br>[[Peace movement]]',
 'mother': 'Putlibai Gandhi',
 'movement': '[[Indian independence movement]]',
 'name': 'Mohandas Karamchand Gandhi',
 'nationality': '[[Indian people|Indian]]',
 'native_name': u'\u0aae\u0acb\u0ab9\u0aa8\u0aa6\u0abe\u0ab8 \u0a95\u0ab0\u0aae\u0a9a\u0a82\u0aa6 \u0a97\u0abe\u0a82\u0aa7\u0ac0',
 'native_name_lang': 'Gujarati',
 'occupation': '{{hlist|Lawyer|Politician|Activist|Writer}}',
 'other_names': 'Mahatma Gandhi, Bapu ji, Gandhi ji',
 'party': '[[Indian National Congress]]',
 'resting_place': '[[Raj Ghat and associated memorials|Raj Ghat]], [[Delhi]], India',
 'signature': 'Mohandas K. Gandhi signature.svg',
 'spouse': '{{marriage|[[Kasturba Gandhi]]|1883|1944|end|=|died}}'}
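As the output shows, Infobox values are raw wikitext, with references wrapped in `<ext>` elements and links in double brackets. The sketch below shows one way to reduce a value to plain text. The helper `clean_value` and its regular expressions are assumptions for illustration, not part of wptools, and templates such as `{{hlist|...}}` are left as they are.

```python
import re

# <ext>...</ext> spans, which is how <ref> tags appear in the values above.
EXT_RE = re.compile(r"<ext>.*?</ext>", re.DOTALL)
# [[target]] or [[target|label]]; group 2 is the label when present.
LINK_RE = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]+))?\]\]")
# <br>, <br/>, and <br />.
BR_RE = re.compile(r"<br\s*/?>", re.IGNORECASE)


def clean_value(value):
    """Return the plain-text parts of one Infobox value, split on line breaks."""
    text = EXT_RE.sub("", value)  # drop references first, including links inside them
    text = LINK_RE.sub(lambda m: m.group(2) or m.group(1), text)
    parts = [re.sub(r"\s+", " ", part).strip(" ,") for part in BR_RE.split(text)]
    return [part for part in parts if part]


print(clean_value("[[Indian Independence Movement]],<br>[[Peace movement]]"))
print(clean_value("[[Assassination of Mahatma Gandhi|Assassination]]"))
```

Regular expressions are a pragmatic choice for values this small, but they will not handle nested `<ext>` elements or every wikitext construct. For anything beyond light cleanup, a dedicated wikitext parser is a safer bet.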

Alternate parser

Sometimes the wikitext in an Infobox is not easily transformed into a dict. In those cases, we fall back to a more general approach. The resulting data structure is verbose, but it's probably better than nothing:

>>> page = wptools.page('Okapi', lang='fr')
>>> page.get_parse()
fr.wikipedia.org (parse) Okapi
Okapi (fr) data
{
  infobox: <dict(2)> count, boxes
  ...
}
>>> page.data['infobox']['count']
13

>>> page.data['infobox']['boxes']
[{u'Taxobox d\xe9but': [[{'index': '1'}, 'animal'],
   [{'index': '2'}, "''Okapia johnstoni''"],
   [{'index': '3'}, 'Okapi2.jpg'],
   [{'index': '4'}, 'Okapi']]},
 {'Taxobox': [[{'index': '1'}, 'embranchement'],
   [{'index': '2'}, 'Chordata']]},
 {'Taxobox': [[{'index': '1'}, 'classe'], [{'index': '2'}, 'Mammalia']]},
 {'Taxobox': [[{'index': '1'}, 'sous-classe'], [{'index': '2'}, 'Theria']]},
 {'Taxobox': [[{'index': '1'}, 'ordre'], [{'index': '2'}, 'Artiodactyla']]},
 {'Taxobox': [[{'index': '1'}, 'famille'], [{'index': '2'}, 'Giraffidae']]},
 {'Taxobox taxon': [[{'index': '1'}, 'animal'],
   [{'index': '2'}, 'genre'],
   [{'index': '3'}, 'Okapia'],
   [{'index': '4'}, '[[Edwin Ray Lankester|Lankester]], [[1901]]']]},
 {'Taxobox taxon': [[{'index': '1'}, 'animal'],
   [{'index': '2'}, u'esp\xe8ce'],
   [{'index': '3'}, 'Okapia johnstoni'],
   [{'index': '4'}, '([[Philip Lutley Sclater|Sclater]], [[1901]])']]},
 {'Taxobox synonymes': [[{'index': '1'},
    "* ''Equus johnstoni'' <small>P.L. Sclater, 1901</small>"]]},
 {'Taxobox UICN': [[{'index': '1'}, 'EN'], [{'index': '2'}, 'A2abcd+4abcd']]},
 {u'Taxobox r\xe9partition': [[{'index': '1'}, 'Okapi map.jpg']]},
 {u'Taxobox r\xe9partition': [[{'index': '1'}, 'Okapi distribution.PNG']]},
 {'Taxobox fin': []}]
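The verbose structure is easier to work with once it is flattened. The sketch below assumes every parameter has the two-item shape shown above (a key dict followed by a value). The helper `flatten_boxes` is hypothetical, and the `boxes` literal is a shortened copy of the output above rather than a live result.

```python
def flatten_boxes(boxes):
    """Turn the 'boxes' list into (template name, list of values) pairs."""
    flat = []
    for box in boxes:
        for name, params in box.items():
            # Each parameter looks like [{'index': '1'}, 'value']; keep the value only.
            values = [value for _key, value in params]
            flat.append((name, values))
    return flat


# Shortened copy of page.data['infobox']['boxes'] from the example above.
boxes = [
    {"Taxobox": [[{"index": "1"}, "classe"], [{"index": "2"}, "Mammalia"]]},
    {"Taxobox": [[{"index": "1"}, "famille"], [{"index": "2"}, "Giraffidae"]]},
    {"Taxobox fin": []},
]

# Collect the rank/name pairs carried by the plain 'Taxobox' templates.
ranks = {
    values[0]: values[1]
    for name, values in flatten_boxes(boxes)
    if name == "Taxobox" and len(values) == 2
}
print(ranks)
```

Keeping the result as a list of pairs, rather than a dict keyed by template name, preserves repeated templates such as the several 'Taxobox' entries.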