Skip to content

Latest commit

 

History

History
210 lines (162 loc) · 9.6 KB

readme.md

File metadata and controls

210 lines (162 loc) · 9.6 KB

College Enrollment

The data this week comes from Data.World and Data.World and was originally from the NCES.

High school completion and bachelor's degree attainment among persons age 25 and over by race/ethnicity & sex 1910-2016

Fall enrollment in degree-granting historically Black colleges and universities (HBCU)

Consider donating to HBCUs, to help fund student's financial assistance programs.

Donation link: https://thehbcufoundation.org/donate/

There's other additional HBCU datasets at Data.World as well.

HBCU Donations Article

... Donation will be placed in an endowment for students to fund need-based scholarships. President Reynold Verret believes the donation will provide an opportunity for students who don't have the same financial support as others.

"Xavier has roughly more than half of our students who are Pell-eligible. Which means they are in the lowest fifth of the socioeconomic ladder in the country. The lowest quintile. So these students really have significant family needs," said Verret. "They're often the first generation in their families to attend college, and meeting the gap between what Pell and the small loans provide and making it affordable is where that need-based is, which is not just based on merit, on your highest ACT or GPA, but basically to qualify students who are able who have the talent and the ability to succeed at Xavier."

I've left the datasets relatively "untidy" this week so you can practice some of the pivot_longer() functions from tidyr. Note that all of the individual CSVs that are duplicates of the raw Excel files.

Get the data here

# Get the Data

# Read in with tidytuesdayR package 
# Install from CRAN via: install.packages("tidytuesdayR")
# This loads the readme and all the datasets for the week of interest

# Either ISO-8601 date or year/week works!

tuesdata <- tidytuesdayR::tt_load('2021-02-02')
tuesdata <- tidytuesdayR::tt_load(2021, week = 6)

hbcu_all <- tuesdata$hbcu_all

# Or read in the data manually

hbcu_all <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2021/2021-02-02/hbcu_all.csv')

Data Dictionary

hbcu.csv

hs_students.csv

  • The percentage of students broken down by race/ethnicity, aged 25 and over who have graduated HS.

bach_students, female_bach_students, female_hs_students, male_bach_students, male_hs_students:

  • Same as above, but for specific gender and education combination.
variable class description
Total double Year
Total, percent of all persons age 25 and over double Total combined population,
Standard Errors - Total, percent of all persons age 25 and over character Standard errors (SE)
White1 character White students
Standard Errors - White1 character SE
Black1 character Black students
Standard Errors - Black1 character SE
Hispanic character Hispanic students
Standard Errors - Hispanic character SE
Total - Asian/Pacific Islander character Asian Pacific Islander Total students
Standard Errors - Total - Asian/Pacific Islander character SE
Asian/Pacific Islander - Asian character Asian Pacific Islandar - Asian students
Standard Errors - Asian/Pacific Islander - Asian character SE
Asian/Pacific Islander - Pacific Islander character Asian/Pacific Islander - Pacific Islander
Standard Errors - Asian/Pacific Islander - Pacific Islander character SE
American Indian/ Alaska Native character American Indian/ Alaska Native Students
Standard Errors - American Indian/Alaska Native character SE
Two or more race character Two or more races students
Standard Errors - Two or more race character SE

hbcu_all.csv

  • Enrollment by year for types of HBCUs
  • Note that hbcu_black.csv has duplicate information, but specific to black-student enrollment only.
variable class description
Year double Year
Total enrollment double Total enrollment
Males double Male enrollment
Females double Female Enrollment
4-year double 4 Year college enrollment
2-year double 2 Year college enrollment
Total - Public double Total public school enrollment
4-year - Public double 4 year public school enrollment
2-year - Public double 2 Year public college enrollment
Total - Private double Total private college enrollment
4-year - Private double 4 year private school enrollment
2-year - Private double 2 Year private college enrollment

Cleaning Script

This is an optional cleaning script, but shows examples of how to take the raw data this week and prep it for analysis.

See more expansive walkthroughs of cleaning this data by Jack Davison and Alex Cookson.

library(tidyverse)
library(readxl)
library(glue)

# student data

hs_students <- read_excel("2021/2021-02-02/104.10.xlsx", sheet = 1)
bach_students <- read_excel("2021/2021-02-02/104.10.xlsx", sheet = 2)
male_hs_students <- read_excel("2021/2021-02-02/104.10.xlsx", sheet = 3)
male_bach_students <- read_excel("2021/2021-02-02/104.10.xlsx", sheet = 4)
female_hs_students <- read_excel("2021/2021-02-02/104.10.xlsx", sheet = 5)
female_bach_students <- read_excel("2021/2021-02-02/104.10.xlsx", sheet = 6)

# HBCU data
hbcu_all <- read_excel("2021/2021-02-02/tabn313.20.xls", sheet = 1)
hbcu_black <- read_excel("2021/2021-02-02/tabn313.20.xls", sheet = 2)


list(
  datasets = list(
    bach_students,
    female_bach_students,
    female_hs_students,
    hbcu_all,
    hbcu_black,
    hs_students,
    male_bach_students,
    male_hs_students
  ),
  names = ls()
) %>%
  pmap(
    .f = function(datasets, names) {
      write_csv(datasets, glue::glue("2021/2021-02-02/{names}.csv"))
    }
  )

# Example 1
hs_students %>% 
  mutate(Total = if_else(Total > 10000, str_sub(Total, 1, 4) %>% as.double(), Total)) %>% 
  rename(year = Total) %>% 
  select(!contains("Standard")) %>% 
  select(!contains("Total")) %>% 
  mutate(across(White1:last_col(), as.double)) %>% 
  pivot_longer(cols = 2:last_col(), names_to = "group", values_to = "percentage") %>% 
  filter(year >= 1980) %>% 
  ggplot(aes(x = year, y = percentage, color = group)) +
  geom_line()

# example 2
hbcu_all %>% 
  select(Year, `4-year`, `2-year`) %>% 
  pivot_longer(cols = `4-year`:`2-year`) %>% 
  ggplot(aes(x = Year, y = value, color = name)) +
  geom_line() +
  scale_x_continuous(breaks = seq(1980, 2020, by = 4))
  
# Alex Cookson Examples

### Load packages -------------------------------------------------------------
library(tidyverse) # General-purpose cleaning
library(janitor) # For the clean_names() function

### Import data ---------------------------------------------------------------
hbcu_all <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2021/2021-02-02/hbcu_all.csv') %>%
  # clean_names() converts field names to snake_case
  clean_names()


### Clean data ----------------------------------------------------------------
# We can separate by gender OR by program length and public/private, not both

### Gender breakdown
hbcu_by_gender <- hbcu_all %>%
  # We only need year and gender columns
  select(year, males, females) %>%
  # Convert to tidy format, collapsing male/female into one descriptor field
  pivot_longer(males:females,
               names_to = "gender",
               values_to = "students") %>%
  # Convert from plural to singular for cleaner data
  # "s%" specifies an s character at the end of a string
  # ("$" is end of string in regular expressions)
  mutate(gender = str_remove(gender, "s$"))


### Program breakdown
hbcu_by_program <- hbcu_all %>%
  # We need fields with "public" or "private" in the name
  # (They also have 2- vs 4-year)
  # We DON'T need fields with "total" in the name, since this is redundant
  select(year,
         contains(c("public", "private")),
         -contains("total")) %>%
  # names_pattern argument does the heavy lifting
  # It separates names into groups, as specified by parentheses "(group)"
  # Field names are structured so that program length is followed by public/private
  # We also specift "x_" as an optional argument using regular expressions
  pivot_longer(cols = x4_year_public:x2_year_private,
               names_pattern = "[x_]?(.*)_(.*)",
               names_to = c("program_length", "public_private"),
               values_to = "students") %>%
  mutate(program_length = paste(parse_number(program_length), "years"))