-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update fars table #8
base: main
Are you sure you want to change the base?
Conversation
import os | ||
|
||
# place to save data locally | ||
data_folder = "/mnt/c/Users/tputta/OneDrive - Toole Design/Desktop/SSPF/FARS" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It probably doesn't matter, but since this is a public repo, do you think that you should put your local filepath in here? Maybe instead make a data
folder either in the top level repo directory or just in this folder, and save it there?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since it is a one-time thing I didn't bother creating a data
folder and adding it to gitignore.
if not os.path.exists(f"{data_folder}/{yr}"): | ||
os.mkdir(f"{data_folder}/{yr}") | ||
request.urlretrieve(f"https://static.nhtsa.gov/nhtsa/downloads/FARS/{yr}/National/FARS{yr}NationalCSV.zip", f"{data_folder}/{yr}/FARS{yr}NationalCSV.zip") | ||
yr += 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we incrementing yr
here? Won't it just go to the next value once it goes back to the top of the loop?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I started with while loop and this is a vestige of that that I forgot to remove. It does not really harm anything in this case, but I will remove it
from io import StringIO | ||
|
||
# get environment variables | ||
load_dotenv("rds_conn_vars.env") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there documentation somewhere that says how to create this? If not, should we add it here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The .env file itself is not very different from how we define variables in a shell script. There are plenty of resources online for the proper syntax of it and it is pretty straightforward. I don't think we need to add any additional documentation details for this.
As new years of FARS data comes out, what would be the workflow for updating these scripts? Can year range variables be pulled out to the top of files or separated into adjacent JSON or .py files as global variables to simplify updates? |
Assuming that the URL format stays the same, maybe just test to see what the highest year's URL that is valid?
|
Yes, we could set it up to be more dynamic, but since this is going to be no more than a once a year update, I just kept it simple. All we need is to just change the |
SELECT | ||
st_case, | ||
year, | ||
array_agg(per_typ::INT) AS per_typ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might be overkill, but it could be worth replicating the logic shown in Table 3-39 on pg 560 of https://crashstats.nhtsa.dot.gov/Api/Public/ViewPublication/813417, since the per_type
code changes little year to year.
Also, according to that table, we might want to assign 'motor vehicle' to the 'Driver' (1
) and 'Passenger' types (2
and 9
) and remove 3
as its listed as 'Other non-occupant'.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI, ran the numbers, the removing 3
and adding 9
to assign 'motor vehicle'
removes all instances of 'other'
and assigns them to 'motor vehicle'
. For 2015-2019 this was 167 crashes (0.10% of all crashes), and for 2017-2021 it was 133 crashes (0.08% of all crashes). So pretty minor, but still might be worth changing.
) | ||
; | ||
|
||
ALTER TABLE automated.fars_processed_{start_yr}_{end_yr} ADD pkey SERIAL PRIMARY KEY; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the unique ID is comprised on st_case
which is the 2 letter state code + sequential case number (which resets each year) and crash_year
, should the primary key be (st_case, crash_year)
?
cur = conn.cursor() | ||
|
||
query = f""" | ||
CREATE TABLE static.fars_processed_2015_2019_backup (LIKE static.fars_processed INCLUDING ALL); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you use INCLUDING ALL
when making the table, will that make static.fars_processed_2015_2019_backup
a dependent and create an issue if static.fars_processed
is deleted when the new dataset is uploaded in the next script?
# Export env vars | ||
export $(grep -v '^#' rds_conn_vars.env | xargs) | ||
|
||
ogr2ogr \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this script copything things from our server to the SSPF server just meant for testing, or is it meant to be how its going to be done going forward?
@theja Okay, I finished reviewing my review of things. In gerenal it's good, but I had a few specific questions that I left comments about. |
@tariqshihadah @Jacob816
Here is the code for updating the FARS data in the SSPF database. I have not replaced the final table yet, but the updated FARS table is saved to
scratch._tmp_fars_processed_2017_2021
. Once you have a chance to look at this, I can copy it over to swap out thestatic.fars_processed
table