Commit

readme
spencermountain committed Sep 20, 2017
1 parent 446832e commit 8fd4e46
Showing 1 changed file with 17 additions and 20 deletions.
37 changes: 17 additions & 20 deletions README.md
@@ -19,28 +19,17 @@
wp2mongo({file:'./enwiki-latest-pages-articles.xml.bz2', db: 'enwiki'}, callback)
```

then check out the articles in mongo:
-````javascript
-mongo //enter the mongo shell
-use enwiki //grab the database
+````bash
+$ mongo //enter the mongo shell
+use enwiki //grab the database

db.wikipedia.find({title:"Toronto"})[0].categories
//[ "Former colonial capitals in Canada",
// "Populated places established in 1793",
// ...]
#[ "Former colonial capitals in Canada",
# "Populated places established in 1793" ...]
db.wikipedia.count({type:"redirect"})
-// 124,999...
+# 124,999...
````

-### how it works:
-this library uses:
-* [unbzip2-stream](https://github.com/regular/unbzip2-stream) to stream-uncompress the gnarly bz2 file
-
-* [xml-stream](https://github.com/assistunion/xml-stream) to stream-parse its xml format
-
-* [wtf_wikipedia](https://github.com/spencermountain/wtf_wikipedia) to brute-parse the article wikiscript contents into JSON.
-
-* [redis](http://redis.io/) to (optionally) put wikiscript parsing on separate threads :metal:
-
# 1)
-you can do this.
+a few Gb. you can do this.
@@ -84,12 +73,10 @@
db.wikipedia.count({type:"redirect"})
db.wikipedia.findOne({title:"Toronto"}).categories
````


-## Same for the English wikipedia:
+### Same for the English wikipedia:
the english wikipedia will work with the same process, but the download will take an afternoon, and the loading/parsing a couple of hours. The en-wikipedia dump is a 4gb download that uncompresses into a pretty legit mongo collection of something like 40gb, but mongo can do it... You can do it!
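
once the import finishes, a quick way to sanity-check those numbers in the mongo shell (assuming the `enwiki` db and `wikipedia` collection names from the snippets above):
````bash
$ mongo
use enwiki
db.wikipedia.count()          # total pages, redirects included
db.stats(1024 * 1024 * 1024)  # dataSize / storageSize, scaled to gb
````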


### Options
#### human-readable plaintext **--plaintext**
```js
@@ -123,6 +110,16 @@
node src/worker.js
node node_modules/kue/bin/kue-dashboard -p 3000
````

+### how it works:
+this library uses:
+* [unbzip2-stream](https://github.com/regular/unbzip2-stream) to stream-uncompress the gnarly bz2 file
+
+* [xml-stream](https://github.com/assistunion/xml-stream) to stream-parse its xml format
+
+* [wtf_wikipedia](https://github.com/spencermountain/wtf_wikipedia) to brute-parse the article wikiscript contents into JSON.
+
+* [redis](http://redis.io/) to (optionally) put wikiscript parsing on separate threads :metal:
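
roughly, the pipeline looks something like this. This is only a sketch, not the library's actual source: the wtf_wikipedia call and the mongo insert are left as comments, and the real element handling may differ.
````javascript
const fs = require('fs')
const bz2 = require('unbzip2-stream')
const XmlStream = require('xml-stream')

// stream-uncompress the .bz2 dump, then walk its xml one <page> at a time
const stream = fs.createReadStream('./enwiki-latest-pages-articles.xml.bz2').pipe(bz2())
const xml = new XmlStream(stream)

xml.on('endElement: page', (page) => {
  // each <page> carries a title and its wikiscript under <revision><text>;
  // that text is what gets handed to wtf_wikipedia (directly, or via a redis queue)
  // before the resulting JSON lands in the mongo collection
  console.log(page.title)
})
````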

### Addendum:
#### \_ids
since wikimedia gives every page a globally unique title, we also use the titles as the mongo `_id` fields.
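
so, assuming the title is stored verbatim, you can grab a page straight by its `_id`:
````bash
# same document as the {title:"Toronto"} query above
db.wikipedia.findOne({_id: "Toronto"})
````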
