Contribution instructions, IS paragraphed (#52)
* Bump version

* Use paragraph level join for IS

* Add finscraper.gif

* Added contribution instructions

* Remove useless imports

* Add setup.py classifiers
jmyrberg authored May 24, 2020
1 parent 036ddc0 commit 29d7d92
Showing 9 changed files with 47 additions and 13 deletions.
27 changes: 27 additions & 0 deletions CONTRIBUTING.md
@@ -0,0 +1,27 @@
# Contributing

When websites change, spiders tend to break. I can't promise to keep this
repository up-to-date all by myself, so pull requests are more than welcome!

## Adding a new spider

1. Create a branch for the spider, e.g. ``mtvarticle``

2. Add the spider name in [pytest.ini](https://github.com/jmyrberg/finscraper/blob/master/pytest.ini) and [.travis.yml](https://github.com/jmyrberg/finscraper/blob/master/.travis.yml)

3. Add the spider in [tests/test_spiders.py](https://github.com/jmyrberg/finscraper/blob/master/tests/test_spiders.py) similar to others

4. Add the spider API in [finscraper/spiders.py](https://github.com/jmyrberg/finscraper/blob/master/finscraper/spiders.py)

5. Write the Scrapy spider under [finscraper/scrapy_spiders](https://github.com/jmyrberg/finscraper/blob/master/finscraper/scrapy_spiders), naming the module exactly after the spider, e.g. ``mtvarticle.py``; flake8 linting and [Google style docstrings](https://sphinxcontrib-napoleon.readthedocs.io/en/latest/example_google.html) are recommended

6. Make sure the spider passes all non-benchmark tests within [test_spiders.py](https://github.com/jmyrberg/finscraper/blob/master/tests/test_spiders.py)

7. Push your branch to GitHub and make a pull request against master

8. *(OPTIONAL)*: Bump up the version in [VERSION](https://github.com/jmyrberg/finscraper/blob/master/VERSION) and [re-build the documentation](https://github.com/jmyrberg/finscraper/blob/master/scripts/build-documentation.sh)


## Updating an existing spider

Follow steps 5-8 above.
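Step 5's naming and docstring conventions can be sketched as follows. This is a hypothetical skeleton (the spider name, fields, and methods are illustrative); a real spider would subclass Scrapy's Spider, but a plain class is used here so the sketch runs without Scrapy installed:

```python
# Hypothetical sketch of finscraper/scrapy_spiders/mtvarticle.py.
# The module name matches the spider name exactly (step 5), and
# docstrings follow the recommended Google style.

class MTVArticleSpider:
    """Scrape articles from mtv.fi.

    Args:
        allow_domains: Domains the crawler is allowed to follow.
    """

    name = 'mtvarticle'  # must equal the module name

    def __init__(self, allow_domains=('mtv.fi',)):
        self.allow_domains = allow_domains

    def parse(self, response):
        """Parse a single article page into an item dict.

        Args:
            response: Fetched page (a Scrapy response in the real spider).

        Returns:
            dict: Scraped fields such as ``title`` and ``content``.
        """
        # Placeholder fields; a real spider extracts these via XPath.
        return {'title': None, 'content': None}
```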
8 changes: 6 additions & 2 deletions README.md
@@ -37,10 +37,14 @@ spider = ISArticle().scrape(10)
articles = spider.get()
```

The API is similar for all the spiders:

![Finscraper in action](https://github.com/jmyrberg/finscraper/blob/master/docs/finscraper.gif)


## Contributing

When websites change, spiders tend to break. I can't make a promise to keep this
repository up-to-date all by myself - pull requests are more than welcome!
Please see [CONTRIBUTING.md](https://github.com/jmyrberg/finscraper/blob/master/CONTRIBUTING.md) for more information.


---
2 changes: 1 addition & 1 deletion VERSION
@@ -1 +1 @@
0.1.0a
0.1.0b
Binary file added docs/finscraper.gif
1 change: 1 addition & 0 deletions docs/source/contributing.rst
@@ -0,0 +1 @@
.. mdinclude:: ../../CONTRIBUTING.md
1 change: 1 addition & 0 deletions docs/source/index.rst
@@ -9,6 +9,7 @@
installation
spiders
usage
contributing


.. toctree::
2 changes: 0 additions & 2 deletions finscraper/scrapy_spiders/ilarticle.py
@@ -3,8 +3,6 @@

import time

from functools import partial

from scrapy import Item, Field, Selector
from scrapy.crawler import Spider
from scrapy.linkextractors import LinkExtractor
11 changes: 6 additions & 5 deletions finscraper/scrapy_spiders/isarticle.py
@@ -3,8 +3,6 @@

import time

from functools import partial

from scrapy import Item, Field, Selector
from scrapy.crawler import Spider
from scrapy.linkextractors import LinkExtractor
@@ -59,9 +57,12 @@ def _parse_item(self, resp):
il.add_xpath(
'ingress',
'//section//article//p[contains(@class, "ingress")]//text()')
il.add_xpath(
'content',
'//article//p[contains(@class, "body")]//text()')

pgraphs_xpath = '//article//p[contains(@class, "body")]'
content = [''.join(Selector(text=pgraph).xpath('//text()').getall())
for pgraph in resp.xpath(pgraphs_xpath).getall()]
il.add_value('content', content)

il.add_xpath(
'published',
'//article//div[contains(@class, "timestamp")]//text()')
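The isarticle.py change above is the commit's "paragraph level join for IS": instead of flattening every text node of the article body into one stream, text nodes are first joined within each matching ``<p>`` element, so the item's ``content`` field becomes a list of paragraph strings and paragraph boundaries survive. A stdlib-only sketch of the same idea (the real spider does this with Scrapy ``Selector`` XPath calls; the class name and sample HTML here are illustrative):

```python
from html.parser import HTMLParser

class ParagraphJoiner(HTMLParser):
    """Collect text per <p class="...body..."> element, one string per paragraph."""

    def __init__(self):
        super().__init__()
        self._depth = 0       # nesting depth inside a matching <p>
        self._buf = []        # text nodes of the current paragraph
        self.paragraphs = []  # one joined string per paragraph

    def handle_starttag(self, tag, attrs):
        if tag == 'p' and 'body' in (dict(attrs).get('class') or ''):
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == 'p' and self._depth:
            self._depth -= 1
            if self._depth == 0:
                # Join text nodes *within* the paragraph, as in the diff.
                self.paragraphs.append(''.join(self._buf))
                self._buf = []

    def handle_data(self, data):
        if self._depth:
            self._buf.append(data)

html = ('<article>'
        '<p class="body">First <b>paragraph</b>.</p>'
        '<p class="body">Second paragraph.</p>'
        '</article>')
parser = ParagraphJoiner()
parser.feed(html)
# parser.paragraphs -> ['First paragraph.', 'Second paragraph.']
```

A flat ``//text()`` join over the whole article would instead yield one string with no way to tell where one paragraph ends and the next begins.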
8 changes: 5 additions & 3 deletions setup.py
@@ -29,9 +29,11 @@
packages=setuptools.find_packages(),
include_package_data=True,
classifiers=[
'Development Status :: 3 - Alpha',
'Development Status :: 4 - Beta',
'Programming Language :: Python :: 3',
'Framework :: Scrapy',
'Natural Language :: Finnish',
'Intended Audience :: Developers',
'License :: OSI Approved :: MIT License',
'Programming Language :: Python :: 3'
'License :: OSI Approved :: MIT License'
]
)
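The setup.py hunk interleaves old and new lines; reconstructed from the diff, the classifiers list after this commit reads:

```python
# Final state of the setup.py classifiers after the commit (old lines
# 'Development Status :: 3 - Alpha' and the duplicated trailing entries
# are removed by the diff).
classifiers = [
    'Development Status :: 4 - Beta',
    'Programming Language :: Python :: 3',
    'Framework :: Scrapy',
    'Natural Language :: Finnish',
    'Intended Audience :: Developers',
    'License :: OSI Approved :: MIT License',
]
```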
