Spider: Illinois Finance Authority #914

Open
pjsier opened this issue Oct 24, 2019 · 18 comments

Comments

@pjsier
Collaborator

pjsier commented Oct 24, 2019

URL: https://www.il-fa.com/
Documents URL: https://www.il-fa.com/public-access/board-documents/
Spider Name: il_finance_authority
Agency Name: Illinois Finance Authority

See the contribution guide for information on how to get started

@aneesh404

I'd like to work on this issue

@pjsier
Collaborator Author

pjsier commented Oct 24, 2019

@aneesh404 sounds good!

@aneesh404

Hi! I'm sorry, I don't have time to work on this issue. Please feel free to assign it to someone else.

@janeskim

Hi! It looks like this issue isn't claimed. Is it ok if I work on this issue?

@pjsier
Collaborator Author

pjsier commented Oct 30, 2019

@janeskim all yours!

@mesterhammerfic

If this one is open, I'm going to work on it.

@pjsier
Collaborator Author

pjsier commented Mar 9, 2020

@mesterhammerfic sounds great! Assigning you now

pjsier added the claimed label and removed the help wanted label Mar 9, 2020

@ledaliang
Contributor

Hi, I was wondering if I could work on this issue, since it hasn't been active recently. Thanks!

@pjsier
Collaborator Author

pjsier commented Jun 16, 2020

@ledaliang thanks for your interest! We try to limit contributors to one issue at a time, but once your other PR is merged, feel free to work on this one.

@PatrickKlingler

Hi there, I've only just asked for a Slack invite, but could I start working on this now?

@pjsier
Collaborator Author

pjsier commented Jun 26, 2020

@PatrickKlingler sure! Marking it claimed now

@PatrickKlingler

Hey Patrick, would it be possible to add another PDF parser?

The PyPDF2 parser does not seem to work for the PDFs on IFA's website: it returns an empty string. I copied this code to parse the PDF: https://github.com/City-Bureau/city-scrapers/blob/main/city_scrapers/spiders/il_pollution_control.py#L103

Apparently PyPDF2 can only extract text from certain kinds of PDF encodings: https://stackoverflow.com/questions/30272269/python-text-extraction-does-not-work-on-some-pdfs

I ended up using pdfplumber, and that works, but it would introduce another dependency.

@pjsier
Collaborator Author

pjsier commented Jun 27, 2020

@PatrickKlingler gotcha. We've run into issues with PyPDF2 as well, so I think it's fine to add something additional here. On other projects we've been working with pdfminer.six directly, so if that works for you, I'm fine with adding pdfminer.six as a dependency here, since we'll eventually try to remove PyPDF2. We have an example of using it here: https://github.com/City-Bureau/city-scrapers-cle/blob/46cf904f87f7c78fe2733eafc4ac97a68ce47d02/city_scrapers/spiders/cuya_developmental_disabilities.py#L36-L44
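
For illustration only, here is a minimal sketch of handling the empty-string case described above: try a list of extractor callables in order and keep the first non-empty text. `first_nonempty_text` and `pdfminer_extractor` are hypothetical names, not code from this repo or the linked example. The only library call assumed is `pdfminer.high_level.extract_text` from pdfminer.six, which accepts a file-like object; it is imported lazily so the rest of the sketch runs without the package installed.

```python
from io import BytesIO
from typing import Callable, Iterable


def pdfminer_extractor(pdf_bytes: bytes) -> str:
    """Extract text with pdfminer.six (assumed to be installed)."""
    # Lazy import: only needed when this extractor is actually called.
    from pdfminer.high_level import extract_text

    return extract_text(BytesIO(pdf_bytes))


def first_nonempty_text(
    pdf_bytes: bytes, extractors: Iterable[Callable[[bytes], str]]
) -> str:
    """Return the first non-blank result from the given extractors.

    An extractor that raises or returns only whitespace is skipped,
    so a parser that silently yields "" does not end the attempt.
    """
    for extract in extractors:
        try:
            text = extract(pdf_bytes)
        except Exception:
            # Treat a failing parser the same as one that found nothing.
            continue
        if text and text.strip():
            return text
    return ""
```

In a spider, this might be called as `first_nonempty_text(response.body, [pdfminer_extractor])`, with any other parser wrapped as an extra callable in the list.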

@pjsier
Collaborator Author

pjsier commented Jul 14, 2020

@PatrickKlingler wanted to follow up on this: we just replaced PyPDF2 with pdfminer.six throughout all of our repos, so hopefully that makes this easier!

@PatrickKlingler

PatrickKlingler commented Jul 14, 2020 via email

@solisedwin

Hey, it seems like this issue has been open for a while. I would like to tackle it as my first contribution, if that's fine by you. It also seems like a good opportunity, since I have built projects using Scrapy before.

@pjsier
Collaborator Author

pjsier commented Sep 29, 2020

@solisedwin yep, this has been inactive for more than 30 days, so it's all yours if you're interested! I can assign you now.

@solisedwin

Hey, I'm still working on this web crawler. I've been rewriting and fine-tuning it for better code readability. I should have it done soon. Thanks!
