Contribution instructions, IS paragraphed (#52)
* Bump version

* Use paragraph level join for IS

* Add finscraper.gif

* Added contribution instructions

* Remove useless imports

* Add setup.py classifiers
jmyrberg authored May 24, 2020
1 parent 036ddc0 commit 29d7d92
Showing 9 changed files with 47 additions and 13 deletions.
27 changes: 27 additions & 0 deletions CONTRIBUTING.md
@@ -0,0 +1,27 @@
# Contributing

When websites change, spiders tend to break. I can't promise to keep this
repository up-to-date all by myself, so pull requests are more than welcome!

## Adding a new spider

1. Create a branch for the spider, e.g. ``mtvarticle``

2. Add the spider name in [pytest.ini](https://github.com/jmyrberg/finscraper/blob/master/pytest.ini) and [.travis.yml](https://github.com/jmyrberg/finscraper/blob/master/.travis.yml)

3. Add the spider in [tests/test_spiders.py](https://github.com/jmyrberg/finscraper/blob/master/tests/test_spiders.py) similar to others

4. Add the spider API in [finscraper/spiders.py](https://github.com/jmyrberg/finscraper/blob/master/finscraper/spiders.py)

5. Write the Scrapy spider under [finscraper/scrapy_spiders](https://github.com/jmyrberg/finscraper/blob/master/finscraper/scrapy_spiders), naming the module exactly after the spider, e.g. ``mtvarticle.py``; flake8 linting and [Google style docstrings](https://sphinxcontrib-napoleon.readthedocs.io/en/latest/example_google.html) are recommended

6. Make sure the spider passes all non-benchmark tests within [test_spiders.py](https://github.com/jmyrberg/finscraper/blob/master/tests/test_spiders.py)

7. Push your branch to GitHub and make a pull request against master

8. *(OPTIONAL)*: Bump up the version in [VERSION](https://github.com/jmyrberg/finscraper/blob/master/VERSION) and [re-build the documentation](https://github.com/jmyrberg/finscraper/blob/master/scripts/build-documentation.sh)


## Updating an existing spider

Follow steps 5-8 above.
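Step 5's naming and docstring conventions can be sketched as follows. This is a hypothetical skeleton (the spider name, fields, and methods are illustrative); a real spider would subclass Scrapy's Spider, but a plain class is used here so the sketch runs without Scrapy installed:

```python
# Hypothetical sketch of finscraper/scrapy_spiders/mtvarticle.py.
# The module name matches the spider name exactly (step 5), and
# docstrings follow the recommended Google style.

class MTVArticleSpider:
    """Scrape articles from mtv.fi.

    Args:
        allow_domains: Domains the crawler is allowed to follow.
    """

    name = 'mtvarticle'  # must equal the module name

    def __init__(self, allow_domains=('mtv.fi',)):
        self.allow_domains = allow_domains

    def parse(self, response):
        """Parse a single article page into an item dict.

        Args:
            response: Fetched page (a Scrapy response in the real spider).

        Returns:
            dict: Scraped fields such as ``title`` and ``content``.
        """
        # Placeholder fields; a real spider extracts these via XPath.
        return {'title': None, 'content': None}
```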
8 changes: 6 additions & 2 deletions README.md
@@ -37,10 +37,14 @@ spider = ISArticle().scrape(10)
articles = spider.get()
```

The API is similar for all the spiders:

![Finscraper in action](https://github.com/jmyrberg/finscraper/blob/master/docs/finscraper.gif)


## Contributing

When websites change, spiders tend to break. I can't make a promise to keep this
repository up-to-date all by myself - pull requests are more than welcome!
Please see [CONTRIBUTING.md](https://github.com/jmyrberg/finscraper/blob/master/CONTRIBUTING.md) for more information.


---
2 changes: 1 addition & 1 deletion VERSION
@@ -1 +1 @@
0.1.0a
0.1.0b
Binary file added docs/finscraper.gif
1 change: 1 addition & 0 deletions docs/source/contributing.rst
@@ -0,0 +1 @@
.. mdinclude:: ../../CONTRIBUTING.md
1 change: 1 addition & 0 deletions docs/source/index.rst
@@ -9,6 +9,7 @@
installation
spiders
usage
contributing


.. toctree::
2 changes: 0 additions & 2 deletions finscraper/scrapy_spiders/ilarticle.py
@@ -3,8 +3,6 @@

import time

from functools import partial

from scrapy import Item, Field, Selector
from scrapy.crawler import Spider
from scrapy.linkextractors import LinkExtractor
11 changes: 6 additions & 5 deletions finscraper/scrapy_spiders/isarticle.py
@@ -3,8 +3,6 @@

import time

from functools import partial

from scrapy import Item, Field, Selector
from scrapy.crawler import Spider
from scrapy.linkextractors import LinkExtractor
@@ -59,9 +57,12 @@ def _parse_item(self, resp):
il.add_xpath(
'ingress',
'//section//article//p[contains(@class, "ingress")]//text()')
il.add_xpath(
'content',
'//article//p[contains(@class, "body")]//text()')

pgraphs_xpath = '//article//p[contains(@class, "body")]'
content = [''.join(Selector(text=pgraph).xpath('//text()').getall())
for pgraph in resp.xpath(pgraphs_xpath).getall()]
il.add_value('content', content)

il.add_xpath(
'published',
'//article//div[contains(@class, "timestamp")]//text()')
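The isarticle.py change above is the commit's "paragraph level join for IS": instead of flattening every text node of the article body into one stream, text nodes are first joined within each matching ``<p>`` element, so the item's ``content`` field becomes a list of paragraph strings and paragraph boundaries survive. A stdlib-only sketch of the same idea (the real spider does this with Scrapy ``Selector`` XPath calls; the class name and sample HTML here are illustrative):

```python
from html.parser import HTMLParser

class ParagraphJoiner(HTMLParser):
    """Collect text per <p class="...body..."> element, one string per paragraph."""

    def __init__(self):
        super().__init__()
        self._depth = 0       # nesting depth inside a matching <p>
        self._buf = []        # text nodes of the current paragraph
        self.paragraphs = []  # one joined string per paragraph

    def handle_starttag(self, tag, attrs):
        if tag == 'p' and 'body' in (dict(attrs).get('class') or ''):
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == 'p' and self._depth:
            self._depth -= 1
            if self._depth == 0:
                # Join text nodes *within* the paragraph, as in the diff.
                self.paragraphs.append(''.join(self._buf))
                self._buf = []

    def handle_data(self, data):
        if self._depth:
            self._buf.append(data)

html = ('<article>'
        '<p class="body">First <b>paragraph</b>.</p>'
        '<p class="body">Second paragraph.</p>'
        '</article>')
parser = ParagraphJoiner()
parser.feed(html)
# parser.paragraphs -> ['First paragraph.', 'Second paragraph.']
```

A flat ``//text()`` join over the whole article would instead yield one string with no way to tell where one paragraph ends and the next begins.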
8 changes: 5 additions & 3 deletions setup.py
@@ -29,9 +29,11 @@
packages=setuptools.find_packages(),
include_package_data=True,
classifiers=[
'Development Status :: 3 - Alpha',
'Development Status :: 4 - Beta',
'Programming Language :: Python :: 3',
'Framework :: Scrapy',
'Natural Language :: Finnish',
'Intended Audience :: Developers',
'License :: OSI Approved :: MIT License',
'Programming Language :: Python :: 3'
'License :: OSI Approved :: MIT License'
]
)
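The setup.py hunk interleaves old and new lines; reconstructed from the diff, the classifiers list after this commit reads:

```python
# Final state of the setup.py classifiers after the commit (old lines
# 'Development Status :: 3 - Alpha' and the duplicated trailing entries
# are removed by the diff).
classifiers = [
    'Development Status :: 4 - Beta',
    'Programming Language :: Python :: 3',
    'Framework :: Scrapy',
    'Natural Language :: Finnish',
    'Intended Audience :: Developers',
    'License :: OSI Approved :: MIT License',
]
```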
