
Spider: Illinois Health Facilities and Services Review Board #1001

Open

pjsier opened this issue Jan 25, 2021 · 20 comments

@pjsier
Collaborator

pjsier commented Jan 25, 2021

URL: https://www2.illinois.gov/sites/hfsrb/events/Pages/Board-Meetings.aspx
Spider Name: il_health_facilities
Agency Name: Illinois Health Facilities and Services Review Board

@masoodqq

masoodqq commented Feb 6, 2021

I would like to work on this issue.

@pjsier
Collaborator Author

pjsier commented Feb 8, 2021

@masoodqq Sounds great! Assigning you now.

@pjsier pjsier removed the help wanted label Feb 8, 2021
@masoodqq masoodqq removed their assignment Feb 23, 2021
@Ni3dzwi3dz

Hi,
is this issue still open? If yes, I'm willing to help.

@palakshivlani-11

Hi,
is this issue still open?

@haileyhoyat
Collaborator

@palakshivlani-11 Hi. Thanks so much for checking out the project. Go for it.

@mohdyawars

Willing to help.

@haileyhoyat
Collaborator

@yawar1101 Hi. Thanks so much for checking us out. Go for it.

@godclause

Hi @haileyhoyat!

I'd like to try my hand at this one.

@haileyhoyat
Collaborator

@godclause Hello! Go for it. Cheers.

@haileyhoyat haileyhoyat assigned godclause and unassigned mohdyawars Dec 14, 2023
@godclause

godclause commented Dec 30, 2023

@haileyhoyat @palakshivlani-11 @yawar1101 @pjsier

Hi!

I hope I'm not overthinking these questions:

The challenge with URL: https://www2.illinois.gov/sites/hfsrb/events/Pages/Board-Meetings.aspx is that the "juicy" meeting details are only available as downloadable PDFs linked from ASP.NET web pages.

I have found only a few Python libraries useful for scraping PDFs, and none seem to work for remote scraping, if that makes sense.

  1. Is the general consensus around scraping PDF documents that they absolutely must be downloaded, stored, and scraped locally?

  2. Is there a Python solution, i.e. a library, for scraping PDFs as they live on the web, without storing them locally?

  3. Is there a standard procedure / code of conduct I should follow for applying security and dependency updates to this repo?

@appills

appills commented Jan 9, 2024

> 1. Is the general consensus around scraping PDF documents that they absolutely must be downloaded, stored, and scraped locally?

No need to store them locally. You're going to get a stream of bytes regardless, so you can just read the byte stream directly from the request. I don't know how package management is handled here (e.g. for pypdf):

from io import BytesIO

import requests
from pypdf import PdfReader

# Fetch the PDF over HTTP; resp.content holds the raw bytes in memory
resp = requests.get("https://hfsrb.illinois.gov/content/dam/soi/en/web/hfsrb/events/documents/2024/january-23-2024-state-board-meeting-/21-007(4)%20Permit%20Renewal%20Winchester%20ASTC.pdf")
print(resp.headers)

# Wrap the bytes in a file-like object so PdfReader can parse them without touching disk
reader = PdfReader(BytesIO(resp.content))
print(reader.pages[0].extract_text())
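
The BytesIO wrapper gives PdfReader the seekable, file-like object it expects, so the PDF is parsed entirely in memory and never written to disk.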

@godclause

godclause commented Jan 17, 2024

> 1. Is the general consensus around scraping PDF documents that they absolutely must be downloaded, stored, and scraped locally?
>
> No need to store them locally. You're going to get a stream of bytes regardless, so you can just read the byte stream directly from the request. [...]

@appills @haileyhoyat

I have come to the view that running Python operations for city-scrapers is not wholly shell-agnostic (e.g. zsh, bash, csh). With this in mind, I believe package management has proven problematic and ought to be considered when starting a project.

The problem is that zsh is now the default shell on Macs, while the commands provided in our docs assume bash.

For the folks running zsh, should there be verbiage in our docs reminding us to consider changing our shell if we're running Catalina and beyond?

If not, should the docs be updated with language explaining the following scenarios:

For Macs

  1. Switch to the bash shell and run all commands from there.
  2. Build a Linux virtual machine (Ubuntu, Mint, etc.) and run your terminal / CLI commands via bash.

For Windows-based machines

  1. Build a Linux virtual machine (Ubuntu, Mint, etc.) and run terminal / CLI commands via bash.

Is any of this at all necessary?

@onyangojerry

@godclause

Yes, there should be verbiage guiding and/or reminding us to either switch to bash or use a virtual machine, just to save us the agony that comes with continual frustration. I believe it is necessary.

@haileyhoyat
Collaborator

haileyhoyat commented Jan 31, 2024

@godclause @appills @onyangojerry

Hi All. I want to introduce you to Dan (@SimmonsRitchie). Dan has officially taken over as project lead for the entire City Scrapers project.

Dan, I don't know if this conversation is relevant for you, particularly as you're fixing a lot of infrastructure things.

Cheers, all.

@godclause

godclause commented Jan 31, 2024

> Hi All. I want to introduce you to Dan (@SimmonsRitchie). Dan has officially taken over as project lead for the entire City Scrapers project. [...]

Thank you for the introduction @hails. Hi Dan, nice to meet you here.

@SimmonsRitchie
Contributor

Hi there, @godclause! Nice to meet you too! And thanks for the intro, @haileyhoyat.

@godclause My apologies, I took over the city-scrapers project very recently and I'm juggling a lot of fixes and upgrades right now across the project's 15 repos. I overlooked this issue and conversation.

Re: saving PDFs as files before parsing
I think @appills may have already answered your question, but yes, you can just parse the in-memory byte sequence of the PDF rather than writing it to a file and then parsing it. This is generally more efficient.

Re: shell/OS issues
I am a Mac user, but I have experienced my own headaches with a number of the city-scrapers repos. To my mind, it may make a lot of sense to dockerize all the projects. I hope this will make them OS-agnostic and improve the dev experience overall (especially for newcomers). I'd be very interested in any feedback on this subject, though. Let me know if you have thoughts!

@godclause

godclause commented Feb 1, 2024

> Re: saving PDFs as files before parsing: yes, you can just parse the in-memory byte sequence of the PDF rather than writing it to a file and then parsing it. [...]
>
> Re: shell/OS issues: To my mind, it may make a lot of sense to dockerize all the projects. [...] I'd be very interested in any feedback on this subject, though. Let me know if you have thoughts!

@SimmonsRitchie Hi!

I have 'some' thoughts...

  1. For my parsing issue, @appills's answer prompted my initial inquiry into the need for clarity about OS (macOS) updates and how Python dependency installation is affected.

  2. How does Docker compare with Vagrant in usability for platform (OS) agnosticism? And what's the expected long-term benefit of Docker over Vagrant for City Scrapers projects, from a support perspective? I'm all for improved performance, a better experience for newcomers, etc., but what will implementing Docker instead of Vagrant (or vice versa) cost City Scrapers' repos?

I'm hoping I'm within scope on these concerns.

@appills

appills commented Feb 1, 2024 via email

@godclause

godclause commented Feb 2, 2024

> Python module/package dependencies should work regardless of platform, are you having problems?

@appills I have edited my comment above. Please excuse the error. Thank you in advance.

To your question: I don't believe the problems I encountered are specific to me or to module / package dependencies as such. Dependencies should work regardless of platform (OS) and shell environment, but the case I ran into suggests otherwise for zsh.

Also, there is an evolving consensus that containerizing city-scrapers addresses that concern.

@godclause

> No need to store them locally. You're going to get a stream of bytes regardless, so you can just read the byte stream directly from the request. [...]

@appills @onyangojerry

Hello:

It seems the expected behavior of this code snippet is to parse text from only a single file.

How does our spider parse PDFs for all future / additional meetings, given the start URL 'https://www2.illinois.gov/sites/hfsrb/events/Pages/Board-Meetings.aspx'?
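
To make the question concrete, here is a rough sketch of the crawl shape I have in mind. It is only an illustration, not the project's actual spider: the spider name, the selector, and the assumption that meeting PDFs appear as plain .pdf anchor links on the listing page are all mine.

from io import BytesIO

import scrapy
from pypdf import PdfReader


class BoardMeetingPdfSpider(scrapy.Spider):
    # Hypothetical name for illustration; not the real il_health_facilities spider
    name = "il_health_facilities_pdfs"
    start_urls = [
        "https://www2.illinois.gov/sites/hfsrb/events/Pages/Board-Meetings.aspx"
    ]

    def parse(self, response):
        # Follow every PDF link on the listing page; each crawl re-reads the
        # page, so PDFs for newly posted meetings are picked up automatically.
        for href in response.css("a::attr(href)").getall():
            if href.lower().endswith(".pdf"):
                yield response.follow(href, callback=self.parse_pdf)

    def parse_pdf(self, response):
        # response.body is the raw PDF bytes; parse them in memory, no file on disk
        reader = PdfReader(BytesIO(response.body))
        yield {
            "url": response.url,
            "first_page_text": reader.pages[0].extract_text(),
        }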
