-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathproject_charter.qmd
104 lines (87 loc) · 5.62 KB
/
project_charter.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
---
title: "Project Charter"
author: "Keenan Smith"
date: "19 Jan 2023"
format:
pdf:
df-print: paged
tbl-colwidths: auto
geometry:
- top=20mm
- left=10mm
- bottom=20mm
- right=10mm
editor_options:
chunk_output_type: inline
---
# Project Charter
## Project Background
- Who is the client, what business domain the client is in.
- Clients
- NCSU MEM Faculty
- Myself
- NCSU MEM Cohort
- The General Public
- Domain
- Natural Language Processing
- Politics
- History
- Bias
- What business problems are we trying to address?
- The project is addressing the current political climate in the United States of America. We currently live in an information rich and ideological polarized state specifically in the U.S. This project is to determined whether intellectual political speech can be accurately modeled to determine political bias.
## Scope
- I will be building a data science solution that encompasses a large corpus of political speech web scraped from top think tanks with clear ideological leanings. This project will encapsulate the entire data science spectrum from data collection, data cleaning, data exploration, data modeling, and model deployment.
- Data has already been collected from 4 sources and preliminary data modeling has occurred. The lessons learned from that initial analysis will be pulled through into this project.
- Additional data will be acquired through similar means as the first 4 data sources.
- The end product will most likely be a series of deployed models using the `vetiver` library
- If there is additional time, I will attempt to create a `shiny` application that allows ease of access to the models and the data for stakeholders to explore but this may go out of scope as time permits.
## Personnel
- Who are on this project:
- Student:
- Keenan Smith
- Advisor:
- Dr. Brandon McConnell
- Faculty Committee
## Metrics
- Qualitative objectives
- At least 4 deploy-able models using the Vetiver library that have been trained and optimized using grid search to find the highest accuracy model
- The modeling and code must be left in a state where it is able to be redeployed with an expanded corpus
- The hooks must be left so that there is potential for a generative model in the future
- What is a quantifiable metric (e.g. reduce the fraction of users with 4-week inactivity)
- At least one model with a classification accuracy of 70% or more (This is subject to change as modeling is performed)
- What is the baseline (current) value of the metric?
- The current baseline for this project is the null hypothesis that political speech does not indicate bias
- How will we measure the metric? (e.g. A/B test on a specified subset for a specified period; or comparison of performance after implementation to baseline)
- The metric will be measured by observing an accuracy score above 70%
## Plan
- This is a loose definition of the current plan
- Business Understanding
- Defining Objectives which is encapsulated in this project charter
- Identifying Data Sources
- Currently there are 4 but this will be expanded
- The criteria is ease of the ability to scrape a large corpus of articles from the think tank, the explicit nature of the political bias present in popular culture, the ability to get the corpus into similar data as the currently acquired corpus
- Data Acquisiton and Understanding
- Ingest the Data into R and the model workflow
- Writing Data cleaning functions to clean the data prior to vectorization
- Explore the data using a variety of popular NLP vectorization techniques
- Modeling
- Feature Engineering and Model Selection
- Modeling the data using lessons learned from Exploration of the data
- Evaluating the Model
- As a note, this project will utilize the `tidymodels` framework as much as possible as it uses a unified syntax across a variety of popular machine learning algorithms to output easily understood metrics and visualizations
- Deployment
- Deployed the selected models using the `vetiver` library.
- Optionally deploy the model using the `plumber` library to create an API for others to use the models
- The key areas that I foresee that will take the most time is data acquisition and understanding. Using prior experience in this area and experience from others, data acquisition and cleaning the data take up to 70% of the work of a data analysis project. A project plan will be developed using project planning software to help plan this project in more detail for the stakeholders.
## Architecture
- Data
- The data will be provided via webscraping using the `rvest` package in R
- The list of think tanks is currently being finalized at this current edition of the project charter
- What tools and data storage/analytics resources will be used in the solution e.g.,
- Currently the data will be stored as CSV or .rda compressed data files
- SQLite will be explored as a way to store the data for deployment
- How will the models be operationalized?
- The models will be deployed using the `vetiver` library and possibly expanded to an API using the `plumber` library
## Communication
- How will we keep in touch? Weekly meetings?
- There will be weekly update emails at a minimum to provide an update of current status. In-person or zoom meetings will be as required but at least occur on a bi-weekly basis.