Releases: BloomTech-Labs/dep-story-squad-ds-b
v2.0
Final Release for 8-Sprint Product Timespan (for Labs26)
New features since Version 1.0:
- Squad Score 1.1 (#36), see v1.1 documentation
- Clustering / Gamification (#41)
- Generates clusters of 4 by Squad Score within each cohort for a given week, to be used to create the game play of Story Squad
- Visualizations (#39)
- Generates two different visualizations for the parent dashboard Progress screen for each child: one histogram comparing the child's submission to the rest of their grade's for that week, and one showing the child's progress in Squad Score over time.
New API Endpoints: (#44)
- Viz/linegraph
- Corresponds to Visualizations feature
- Viz/histogram
- Corresponds to Visualizations feature
- Cluster
- Corresponds to Clustering feature
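As an illustration of the Clustering feature, here is a minimal sketch of grouping submissions into clusters of 4 by Squad Score within each cohort. The function name, the input shape, and the handling of leftover submissions (the last cluster is simply smaller) are assumptions made for this example, not the behavior of the Cluster endpoint.

```python
from collections import defaultdict


def cluster_by_squad_score(submissions, group_size=4):
    """Group submissions into clusters of `group_size` within each cohort.

    `submissions` is an iterable of (cohort_id, submission_id, squad_score)
    tuples. This input shape is hypothetical.
    """
    by_cohort = defaultdict(list)
    for cohort_id, submission_id, score in submissions:
        by_cohort[cohort_id].append((score, submission_id))

    clusters = {}
    for cohort_id, scored in by_cohort.items():
        # Sort by score so neighbors in the list have similar Squad Scores.
        ordered = [submission_id for _score, submission_id in sorted(scored)]
        # Slice the ordered list into consecutive groups of `group_size`.
        # Assumption: any remainder forms a final, smaller cluster.
        clusters[cohort_id] = [
            ordered[i:i + group_size]
            for i in range(0, len(ordered), group_size)
        ]
    return clusters
```

Sorting before slicing keeps each cluster made up of submissions with similar scores, which is one plausible reading of "clusters of 4 by Squad Score".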
v1.1
New Squad Score Version (#36)
New Squad Score version that still complies with the restriction of using only features that are either representative of validated complexity models or requested by the stakeholder, and in a form that is least susceptible to the errors we will encounter in stories and transcriptions (e.g. using characters for length rather than syllables or words).
- New changes
- Added feature of adj_num
- Formula now requires an external library (nltk)
- Reasoning
- Per stakeholder, one qualitative indication of skilled creative writing is descriptive language, specifically sensory descriptions. This feature implementation is limited, in that the mere presence of an adjective does not inherently indicate sensory description, but it was selected as a way to capture the spirit of the creative writing feature while still respecting the limitations of this use case. "In most cases, NLTK correctly tags words that have typos" (p. 52), so it is a good candidate for use in our environment, which will likely contain misspellings and mis-transcriptions.
- Dependencies
- nltk library
- nltk downloads: punkt and averaged_perceptron_tagger, both included in Dockerfile RUN commands
- Features
- sl: story length (in characters)
- awl: average word length (in characters)
- qn: quotes number
- uw: unique words count (over two characters)
- an: number of adjectives
- Weights (no change from v1.0)
- Squad Score is initialized with a weight of 1 for each feature, as there were not enough labels on the data to tune weights in a generalizable way.
- There is also a standardized “range scaler” of 30, meant to bring the overall Squad Score closer to a 0-100 range, purely for ease of metric reading.
- Formula
- sl(1)(30) + awl(1)(30) + qn(1)(30) + uw(1)(30) + an(1)(30)
- Range (no change from v1.0)
- the score bottoms out at 0, but does not have a bounded upper range
- Metrics
- Similar to v1.0, the only labels available at the time of this development were a 1-25 ranking of 25 of the training set stories. Applying this Squad Score formula to these 25 stories resulted in a -.63 correlation coefficient of scores to rankings, which is an improvement of .03 over v1.0. The generalizability of this improvement is unconfirmed.
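The v1.1 pieces above can be sketched as follows: counting adjectives from part-of-speech tags, applying the five-feature formula, and checking how scores correlate with rankings. This is a minimal sketch under several assumptions. The function names are hypothetical; the release notes do not say how feature values are normalized before the formula, so `squad_score_v1_1` takes whatever values it is given; and the notes do not name the correlation coefficient, so Pearson is assumed here.

```python
import math

RANGE_SCALER = 30
WEIGHTS = {"sl": 1, "awl": 1, "qn": 1, "uw": 1, "an": 1}


def count_adjectives(text, tagger=None):
    """Count tokens tagged as adjectives (Penn Treebank JJ, JJR, JJS)."""
    if tagger is None:
        # Requires the nltk downloads listed under Dependencies.
        import nltk
        tagger = lambda t: nltk.pos_tag(nltk.word_tokenize(t))
    return sum(1 for _word, tag in tagger(text) if tag.startswith("JJ"))


def squad_score_v1_1(features):
    """sl(1)(30) + awl(1)(30) + qn(1)(30) + uw(1)(30) + an(1)(30), floored at 0."""
    total = sum(features[name] * WEIGHTS[name] * RANGE_SCALER for name in WEIGHTS)
    return max(0.0, total)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)
```

The `tagger` parameter exists only so the counting logic can be exercised without nltk installed; passing nothing falls back to `nltk.pos_tag` over `nltk.word_tokenize`.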
v1.0
Data Science Team’s MVP for Story Squad Release Canvas 2
Functional features by category:
Transcription and Moderation
- Transcription
- Google Cloud Vision OCR (#10)
- Connects to Google Cloud Vision API and uses their Optical Character Recognition model to transcribe the handwritten stories uploaded by users.
- Low confidence flag (#20)
- During transcription, returns Google Cloud Vision’s confidence in each transcribed submission. Raises a flag if the transcription confidence is below 85%, which signifies poor image or handwriting quality and, consequently, possibly inaccurate evaluation metrics.
- Text Moderation
- Bad/Inappropriate words filter (#31)
- Added a method to the Google API service that checks the word tokens against a list of words that are known to be inappropriate.
- Image Moderation
- Safe Search (#10)
- Connects to Google Cloud Vision API and utilizes their built-in Safe Search service to flag if a user’s uploaded illustration has racy, adult, or violent content.
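The three moderation checks above can be sketched in plain Python. This is a minimal illustration rather than the service's code: all function names are hypothetical, the tokenization is a simple regular expression, the Safe Search result is assumed to arrive as a dict of likelihood names, and the "LIKELY" cutoff is an assumption (the release notes only give the 85% transcription threshold).

```python
import re

LOW_CONFIDENCE_THRESHOLD = 0.85  # from the release notes: flag below 85%

# Likelihood names used by Google Cloud Vision, ordered least to most likely.
LIKELIHOODS = [
    "UNKNOWN", "VERY_UNLIKELY", "UNLIKELY", "POSSIBLE", "LIKELY", "VERY_LIKELY",
]


def is_low_confidence(confidence, threshold=LOW_CONFIDENCE_THRESHOLD):
    """Return True when the transcription confidence is below the threshold."""
    return confidence < threshold


def find_bad_words(text, bad_words):
    """Return the sorted set of tokens in `text` that appear in `bad_words`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    blocked = {word.lower() for word in bad_words}
    return sorted({token for token in tokens if token in blocked})


def is_flagged_illustration(annotation, min_level="LIKELY"):
    """Flag when adult, racy, or violence likelihood reaches `min_level`.

    `annotation` is assumed to be a dict such as {"adult": "UNLIKELY", ...}.
    """
    cutoff = LIKELIHOODS.index(min_level)
    return any(
        LIKELIHOODS.index(annotation.get(category, "UNKNOWN")) >= cutoff
        for category in ("adult", "racy", "violence")
    )
```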
Complexity Analysis
- Complexity Metric - “Squad Score” (#18)
- Cleans transcribed text and returns a custom complexity score
- This baseline implementation includes four features generated only with Python and Pandas. It is intended to be iterated upon.
- Given the limited number of labels to train a model/formula toward, this formula only uses features that are representative of features seen in validated complexity models or requested by the stakeholder, and in a form that is least susceptible to errors in child writing/handwriting and transcription (e.g. using characters for the length metric rather than syllables or words).
- Features:
- sl: story length (in characters)
- awl: average word length (in characters)
- qn: quotes number
- uw: unique words count (over two characters)
- Weights:
- Squad Score is initialized with a weight of 1 for each feature, as there were not enough labels on the data to tune weights in a generalizable way.
- There is also a standardized “range scaler” of 30, meant to bring the overall Squad Score closer to a 0-100 range, purely for ease of metric reading.
- Formula: sl(1)(30) + awl(1)(30) + qn(1)(30) + uw(1)(30)
- Range: the score bottoms out at 0, but does not have a bounded upper range
- Metrics:
- The only labels available at the time of this development were a 1-25 ranking of 25 of the training set stories. Applying this Squad Score formula to these 25 stories resulted in a -.60 correlation coefficient of scores to rankings.
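To make the four features and the formula concrete, here is a minimal sketch in plain Python. It is not the released implementation (which is described as using Python and Pandas). The function names, the word tokenization, and the reading of "quotes number" as pairs of double-quote characters are assumptions. The notes also do not state how raw feature values are normalized before the formula is applied, so `squad_score` simply uses the values it receives.

```python
import re

WEIGHTS = {"sl": 1, "awl": 1, "qn": 1, "uw": 1}
RANGE_SCALER = 30


def extract_features(text):
    """Compute the four raw v1.0 features from a transcribed story."""
    words = re.findall(r"[A-Za-z']+", text)
    return {
        # sl: story length in characters
        "sl": len(text),
        # awl: average word length in characters
        "awl": sum(len(w) for w in words) / len(words) if words else 0.0,
        # qn: quotes number (assumption: one quote per pair of " characters)
        "qn": text.count('"') // 2,
        # uw: unique words longer than two characters, case-insensitive
        "uw": len({w.lower() for w in words if len(w) > 2}),
    }


def squad_score(features, weights=WEIGHTS, scaler=RANGE_SCALER):
    """sl(1)(30) + awl(1)(30) + qn(1)(30) + uw(1)(30), floored at 0."""
    total = sum(features[name] * weights[name] * scaler for name in weights)
    return max(0.0, total)
```

Keeping the weights in a dict makes it straightforward to tune them per feature later, once more labels are available.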
Deployment
- API Endpoints
- Submission/text (#19)
- REST API endpoint that transcribes the submission, computes its Squad Score, and returns that information to the web backend.
- Submission/illustration (#19)
- REST API endpoint that submits the illustration to the Google Cloud Vision API's Safe Search service to flag inappropriate content in user-submitted illustrations.
- GitHub Actions
- Header Security Token Checking
- AuthRouteHandler (#27)
- Feature that checks a request’s headers against a known security token to allow access to API endpoints.
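The header check can be illustrated with a small framework-independent sketch. This is not the AuthRouteHandler code: the function name, the `Authorization` header name, and the plain-dict headers are assumptions made for the example.

```python
import hmac


def is_authorized(headers, expected_token, header_name="Authorization"):
    """Return True when the request header carries the expected token.

    `headers` is a plain dict of header names to values (hypothetical shape).
    """
    # HTTP header names are case-insensitive, so normalize before lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    supplied = normalized.get(header_name.lower(), "")
    # compare_digest avoids leaking information through comparison timing.
    return hmac.compare_digest(supplied.encode(), expected_token.encode())
```

A request that fails this check would typically be rejected before reaching the endpoint logic; how the rejection is reported is left to the web framework in use.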