
Spider: Illinois Health Facilities and Services Review Board #1001

Open

pjsier opened this issue Jan 25, 2021 · 20 comments

@pjsier
Collaborator

pjsier commented Jan 25, 2021

URL: https://www2.illinois.gov/sites/hfsrb/events/Pages/Board-Meetings.aspx
Spider Name: il_health_facilities
Agency Name: Illinois Health Facilities and Services Review Board

@masoodqq

masoodqq commented Feb 6, 2021

I would like to work on this issue.

@pjsier
Collaborator Author

pjsier commented Feb 8, 2021

@masoodqq Sounds great! Assigning you now.

@pjsier pjsier removed the help wanted label Feb 8, 2021
@masoodqq masoodqq removed their assignment Feb 23, 2021
@Ni3dzwi3dz

Hi,
is this issue still open? If yes, I'm willing to help.

@palakshivlani-11

Hi,
is this issue still open?

@haileyhoyat
Collaborator

@palakshivlani-11 Hi. Thanks so much for checking out the project. Go for it.

@mohdyawars

Willing to help.

@haileyhoyat
Collaborator

@yawar1101 Hi. Thanks so much for checking us out. Go for it.

@godclause

Hi @haileyhoyat!

I'd like to try my hand at this one.

@haileyhoyat
Collaborator

@godclause Hello! Go for it. Cheers.

@haileyhoyat haileyhoyat assigned godclause and unassigned mohdyawars Dec 14, 2023
@godclause

godclause commented Dec 30, 2023

@haileyhoyat @palakshivlani-11 @yawar1101 @pjsier

Hi!

I hope I'm not overthinking these questions:

The challenge with URL: https://www2.illinois.gov/sites/hfsrb/events/Pages/Board-Meetings.aspx is that the "juicy" meeting details are only available as downloadable PDFs linked from ASP.NET web pages.

I have found only a few Python libraries useful for scraping PDFs, and none seem to work for remote scraping, if that makes sense.

  1. Is the general consensus around scraping PDF documents that they absolutely must be downloaded, stored, and scraped locally?

  2. Is there a Python solution, i.e. a library, for scraping PDFs as they live on the web, without storing them locally?

  3. Is there a standard procedure / code of conduct I should follow for applying security and dependency updates to this repo?

@appills

appills commented Jan 9, 2024

> 1. Is the general consensus around scraping PDF documents that they absolutely must be downloaded, stored, and scraped locally?

No need to store them locally. You're going to get a stream of bytes regardless, so you can just read the byte stream directly from the request. I don't know how package management is handled here (e.g. for pypdf):

from io import BytesIO

import requests
from pypdf import PdfReader

# Fetch the PDF over HTTP; resp.content holds the raw bytes in memory
resp = requests.get("https://hfsrb.illinois.gov/content/dam/soi/en/web/hfsrb/events/documents/2024/january-23-2024-state-board-meeting-/21-007(4)%20Permit%20Renewal%20Winchester%20ASTC.pdf")
print(resp.headers)

# Wrap the bytes in a file-like object so PdfReader can parse them without touching disk
reader = PdfReader(BytesIO(resp.content))
print(reader.pages[0].extract_text())
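
The BytesIO wrapper gives PdfReader the seekable, file-like object it expects, so the PDF is parsed entirely in memory and never written to disk.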

@godclause

godclause commented Jan 17, 2024

> 1. Is the general consensus around scraping PDF documents that they absolutely must be downloaded, stored, and scraped locally?
>
> No need to store them locally. You're going to get a stream of bytes regardless, so you can just read the byte stream directly from the request. [...]

@appills @haileyhoyat

I have come to the view that running Python operations for city-scrapers is not wholly shell-agnostic (e.g. zsh, bash, csh). With this in mind, I believe package management has proven problematic and ought to be considered when starting a project.

The problem is that zsh is now the default shell on Macs, while the commands provided in our docs assume bash.

For the folks running zsh, should there be verbiage in our docs reminding us to consider changing our shell if we're running Catalina and beyond?

If not, should the docs be updated with language explaining the following scenarios:

For Macs

  1. Switch to the bash shell and run all commands from there.
  2. Build a Linux virtual machine (Ubuntu, Mint, etc.) and run your terminal / CLI commands via bash.

For Windows-based machines

  1. Build a Linux virtual machine (Ubuntu, Mint, etc.) and run terminal / CLI commands via bash.

Is any of this at all necessary?

@onyangojerry

@godclause

Yes, there should be verbiage guiding and/or reminding us to either switch to bash or use a virtual machine, just to save us the agony that comes with continual frustration. I believe it is necessary.

@haileyhoyat
Collaborator

haileyhoyat commented Jan 31, 2024

@godclause @appills @onyangojerry

Hi All. I want to introduce you to Dan (@SimmonsRitchie). Dan has officially taken over as project lead for the entire City Scrapers project.

Dan, I don't know if this conversation is relevant for you, particularly as you're fixing a lot of infrastructure things.

Cheers, all.

@godclause

godclause commented Jan 31, 2024

> Hi All. I want to introduce you to Dan (@SimmonsRitchie). Dan has officially taken over as project lead for the entire City Scrapers project. [...]

Thank you for the introduction @hails. Hi Dan, nice to meet you here.

@SimmonsRitchie
Contributor

Hi there, @godclause! Nice to meet you too! And thanks for the intro, @haileyhoyat.

@godclause My apologies, I took over the city-scrapers project very recently and I'm juggling a lot of fixes and upgrades right now across the project's 15 repos. I overlooked this issue and conversation.

Re: saving PDFs as files before parsing
I think @appills may have already answered your question, but yes, you can just parse the in-memory byte sequence of the PDF rather than writing it to a file and then parsing it. This is generally more efficient.

Re: shell/OS issues
I am a Mac user, but I have experienced my own headaches with a number of the city-scrapers repos. To my mind, it may make a lot of sense to dockerize all the projects. I hope this will make them OS-agnostic and improve the dev experience overall (especially for newcomers). I'd be very interested in any feedback on this subject, though. Let me know if you have thoughts!

@godclause

godclause commented Feb 1, 2024

> Re: saving PDFs as files before parsing: yes, you can just parse the in-memory byte sequence of the PDF rather than writing it to a file and then parsing it. [...]
>
> Re: shell/OS issues: To my mind, it may make a lot of sense to dockerize all the projects. [...] I'd be very interested in any feedback on this subject, though. Let me know if you have thoughts!

@SimmonsRitchie Hi!

I have 'some' thoughts...

  1. For my parsing issue, @appills's answer prompted my initial inquiry into the need for clarity about OS (macOS) updates and how Python dependency installation is affected.

  2. How does Docker compare with Vagrant in usability for platform (OS) agnosticism? And what's the expected long-term benefit of Docker over Vagrant for City Scrapers projects, from a support perspective? I'm all for improved performance, a better experience for newcomers, etc., but what will implementing Docker instead of Vagrant (or vice versa) cost City Scrapers' repos?

I'm hoping I'm within scope on these concerns.

@appills

appills commented Feb 1, 2024 via email

@godclause

godclause commented Feb 2, 2024

> Python module/package dependencies should work regardless of platform, are you having problems?

@appills I have edited my comment above. Please excuse the error. Thank you in advance.

To your question: I don't believe the problems I encountered are specific to me or to module / package dependencies as such. Dependencies should work regardless of platform (OS) and shell environment, but the case I ran into suggests otherwise for zsh.

Also, there is an evolving consensus that containerizing city-scrapers addresses that concern.

@godclause

> No need to store them locally. You're going to get a stream of bytes regardless, so you can just read the byte stream directly from the request. [...]

@appills @onyangojerry

Hello:

It seems the expected behavior of this code snippet is to parse text from only a single file.

How does our spider parse PDFs for all future / additional meetings, given the start URL 'https://www2.illinois.gov/sites/hfsrb/events/Pages/Board-Meetings.aspx'?
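
To make the question concrete, here is a rough sketch of the crawl shape I have in mind. It is only an illustration, not the project's actual spider: the spider name, the selector, and the assumption that meeting PDFs appear as plain .pdf anchor links on the listing page are all mine.

from io import BytesIO

import scrapy
from pypdf import PdfReader


class BoardMeetingPdfSpider(scrapy.Spider):
    # Hypothetical name for illustration; not the real il_health_facilities spider
    name = "il_health_facilities_pdfs"
    start_urls = [
        "https://www2.illinois.gov/sites/hfsrb/events/Pages/Board-Meetings.aspx"
    ]

    def parse(self, response):
        # Follow every PDF link on the listing page; each crawl re-reads the
        # page, so PDFs for newly posted meetings are picked up automatically.
        for href in response.css("a::attr(href)").getall():
            if href.lower().endswith(".pdf"):
                yield response.follow(href, callback=self.parse_pdf)

    def parse_pdf(self, response):
        # response.body is the raw PDF bytes; parse them in memory, no file on disk
        reader = PdfReader(BytesIO(response.body))
        yield {
            "url": response.url,
            "first_page_text": reader.pages[0].extract_text(),
        }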
