fix some typos in the version column of the efiler_master_concordance.csv file #35

Open: wants to merge 2 commits into base: master

Conversation

frisson commented Jul 27, 2020

Description

A cursory glance at the version column showed some anomalies in the return versions.

Running the pandas code below on the file this PR branches off of shows some of the anomalies.

import pandas as pd

concordance = pd.read_csv(
    "./efiler_master_concordance.csv",
    dtype=dict(
        variable_name="category",
        description="object",
        scope="category",
        location_code="category",
        form="category",
        part="category",
        data_type="category",
        required="boolean",
        cardinality="float64",
        rdb_table="float64",
        xpath="object",
        version="category",
        production_rule="float64",
        last_version_modified="object",
    ),
)


# normalize versions by splitting the string on ';' and stripping each element
# before joining them again
def normalize_versions(cell):
    # str() also turns a missing value into the literal "nan"
    return ";".join(v.strip().lower() for v in str(cell).split(";") if v.strip())


concordance["version"] = concordance["version"].apply(normalize_versions)

# collect every distinct version string that appears in any row
versions = sorted(
    {
        ver
        for cell in concordance["version"].unique()
        for ver in str(cell).split(";")
        if ver
    }
)
print(versions)
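
A rough way to flag the anomalies automatically, rather than eyeballing the printed list, might look like the sketch below. It assumes a valid version token has the shape `2013v3.0` (four-digit year, `v`, major.minor); that pattern, the `find_anomalies` helper, and the sample cells are all illustrative assumptions, not something taken from the concordance file.

```python
import re

# Assumed shape of a valid version token, e.g. "2013v3.0" (an assumption,
# not a rule taken from the concordance documentation).
VERSION_PATTERN = re.compile(r"\d{4}v\d+\.\d+")


def find_anomalies(version_cells):
    """Return the sorted tokens that do not match VERSION_PATTERN."""
    anomalies = set()
    for cell in version_cells:
        for token in str(cell).split(";"):
            token = token.strip().lower()
            if token and not VERSION_PATTERN.fullmatch(token):
                anomalies.add(token)
    return sorted(anomalies)


# Made-up cells for illustration, not rows from the real file.
sample = [
    "2013v3.0;2013v3.1",
    "2014v5.0; 2014V6.0",
    "2015v2.1;2015v.2.1",
    "20162v3.0",
]
print(find_anomalies(sample))
```

On the real file, something like `find_anomalies(concordance["version"].unique())` should list only the tokens worth a closer look.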


Member

lecy commented Jul 27, 2020

helpful, thank you! there are some major changes to the master concordance files coming shortly - mostly much better variable names, cleaner variable mapping, and extended documentation. if useful (since you are using the concordance) i can share the draft versions of these.

Author

frisson commented Jul 27, 2020

hey @lecy, glad to hear this is useful. it'd be great to get a look at those drafts, especially if the changes are coming soon.

Member

lecy commented Jul 27, 2020

Sharing one section here for a preview:

https://github.com/Nonprofit-Open-Data-Collective/irs-efile-master-concordance-file/blob/master/emc-f990-part-01-v2.csv

And also the updated instructions that are being used to revise the concordance files:

https://github.com/Nonprofit-Open-Data-Collective/irs-efile-master-concordance-file/blob/master/Instructions%20for%20Updating%20Concordance%20v3.1.pdf

In summary:

  1. Variable names are the biggest overhaul, since the version 1.0 names were script-generated and lacked human interpretability.
  2. Refinement of all of the xpath-to-variable mappings.
  3. Moving the SCOPE flag (whether variables occur on the 990 only, 990-EZ only, or PZ for both) from the variable name to a distinct column so it's easier to select as an attribute when selecting variables for a study.
  4. Adding table names to split form sections into a relational database separating one-to-one and one-to-many fields.
  5. The aggregated master concordance file with all xpaths is being split into separate CSV files for forms + parts to make them easier to validate and maintain (for example form-990-part-01 above: emc-f990-part-01-v2.csv).
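
As a rough illustration of points 3 and 5, the per-part files could be stacked back into one table and filtered on the scope column. This is only a sketch: the two inline CSVs stand in for per-part files, the variable names and xpaths are placeholders, and the `PC` and `EZ` scope codes are assumed (only `PZ` is named above). `load_parts` and `select_scope` are hypothetical helpers.

```python
import io

import pandas as pd

# Placeholder stand-ins for two per-part CSV files; the real files have
# more columns and real variable names and xpaths.
PART_A = """variable_name,scope,xpath
EXAMPLE_VAR_1,PZ,/Return/ReturnData/ExampleA
EXAMPLE_VAR_2,PC,/Return/ReturnData/ExampleB
"""
PART_B = """variable_name,scope,xpath
EXAMPLE_VAR_3,EZ,/Return/ReturnData/ExampleC
"""


def load_parts(sources):
    """Read each per-part CSV and stack them into one concordance table."""
    frames = [pd.read_csv(src) for src in sources]
    return pd.concat(frames, ignore_index=True)


def select_scope(concordance, scopes):
    """Keep only rows whose scope column is one of the given codes."""
    return concordance[concordance["scope"].isin(list(scopes))]


combined = load_parts([io.StringIO(PART_A), io.StringIO(PART_B)])
# Variables usable for a 990-EZ study: EZ-only plus those shared by both forms.
ez_vars = select_scope(combined, ["EZ", "PZ"])
print(ez_vars["variable_name"].tolist())
```

With real files, `load_parts` would take file paths instead of `StringIO` buffers; keeping scope as its own column is what makes the filter a one-liner.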

We have all sections of the 990 (Part I to Part XII) complete, and a handful of schedules.

Working on finishing up schedules this summer, and getting started on foundation files (the 990-PF).

Then just need to update all of the documentation as part of the release.

If you are actively using the concordance I can share these with you directly before the official release. Just let me know.

Author

frisson commented Jul 28, 2020

Hey @lecy,

Thanks so much for sharing the preview. It's very very useful. I've successfully used the preview to parse and extract the fields for a number of IRS990 returns from the 2018 return index. Are there any previews handy for the IRS990EZ return types?

Example mapping/transform for reference: https://gist.github.com/frisson/f9eaf2f4ea60ee5de694114c0a26e3e3
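
For anyone skimming, the general idea of extracting fields with a variable-to-xpath mapping can be sketched roughly as below. This is not the code from the gist: the sample XML, the element names, the mapping, and the `to_et_path` / `extract` helpers are made up for illustration, and it assumes the concordance xpaths are absolute paths starting at `/Return` and that returns use the `http://www.irs.gov/efile` default namespace.

```python
import xml.etree.ElementTree as ET

# Namespace assumed for e-file returns; adjust if the files differ.
NS = {"irs": "http://www.irs.gov/efile"}

# Tiny synthetic return with placeholder element names.
SAMPLE_RETURN = """<Return xmlns="http://www.irs.gov/efile">
  <ReturnData>
    <IRS990>
      <ExampleRevenueAmt>1000</ExampleRevenueAmt>
      <ExampleExpensesAmt>750</ExampleExpensesAmt>
    </IRS990>
  </ReturnData>
</Return>"""

# Hypothetical variable_name -> xpath pairs, shaped like concordance rows.
MAPPING = {
    "EXAMPLE_REVENUE": "/Return/ReturnData/IRS990/ExampleRevenueAmt",
    "EXAMPLE_EXPENSES": "/Return/ReturnData/IRS990/ExampleExpensesAmt",
    "EXAMPLE_MISSING": "/Return/ReturnData/IRS990/NotPresent",
}


def to_et_path(xpath):
    """Turn an absolute xpath into a namespaced path relative to the root."""
    steps = [step for step in xpath.split("/") if step]
    # Drop the first step because the root element itself is <Return>.
    return "/".join(f"irs:{step}" for step in steps[1:]) or "."


def extract(xml_text, mapping):
    """Return {variable_name: element text, or None if the xpath is absent}."""
    root = ET.fromstring(xml_text)
    values = {}
    for name, xpath in mapping.items():
        node = root.find(to_et_path(xpath), NS)
        values[name] = node.text if node is not None else None
    return values


print(extract(SAMPLE_RETURN, MAPPING))
```

Values come back as strings; casting by the concordance's data_type column would be a natural next step.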

Author

frisson commented Aug 10, 2020

hey @lecy, just pinging you here to let you know i'm back from account review purgatory.

Author

frisson commented Sep 16, 2020

Hey @lecy, i hope all is well. Just wanted to bump this thread.

Member

lecy commented Sep 16, 2020

Let me get back to you tonight - have a bunch of files to share (just wrapping up the 990, 990-ez, and schedules).

lecy closed this Sep 16, 2020
lecy reopened this Sep 16, 2020
2 participants