- For this project, you will be scraping data from GitHub repository README files. The goal is to build a model that can predict a repository's primary programming language, given the text of its README file.
- You will have to build a dataset yourself. Decide on a list of GitHub repositories to scrape, and write the Python code necessary to extract the text of the README file and the primary language for each repository.
- Which repositories you use are up to you, but you should include at least 100 repositories in your data set.
- As examples of which repositories to use, consider GitHub's trending repositories, the most forked repositories, and the most starred repositories.
- Explore the data that you have scraped.
- Transform your documents into a form that can be used in a machine learning model. You should use the programming language of the repository as the label to predict.
- Try fitting several different models and using several different representations of the text (e.g. a simple bag of words, then also the TF-IDF values for each).
- Build a function that takes in the text of a README file and tries to predict the programming language.
- Build a dataset of GitHub repositories' README text.
- Explore the text of the READMEs and find connections to programming language.
- Build a classification ML model that predicts the programming language used in a repo based on README content.
"A readme file is a text file that is often included with software that contains general information or instructions about the software. The specific nature of this information varies significantly from file to file...There is no general formula for writing a readme, however, and in the end the content depends on the whims of the developer."
"The content you are currently reading is what you will get for each repository you scrape."
"This entire web page is the repository for this project."
- A well-documented Jupyter notebook that contains your analysis
- One or two Google Slides suitable for a general audience that summarize your findings. Include a well-labelled visualization in your slides.
- Codeup's Curriculum
Feature Name | Description |
---|---|
language | The primary language of the repository, scraped directly from each GitHub repository's home page. |
clean | A string of words that have been cleaned through ASCII encoding, tokenizing, lemmatizing, and removing stopwords. |
words | A list of cleaned words from the clean column. |
doc_length | Number of individual words in each document (row). |
pred_bow | Our Logistic Regression Model's prediction using Bag of Words as the feature generator. |
pred_tf | Our Logistic Regression Model's prediction using TF-IDF as the feature generator. |
pred_tfidf_tree | Our Decision Tree Model's prediction that used TF-IDF as the feature generator. |
What's the proportion of each language in our data?
What are the most common words in READMEs?
Does the length of the README vary by programming language?
Do different programming languages use a different number of unique words?
Which word combinations (i.e., n-grams) occur most frequently?
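The n-gram question can be explored with a few lines of standard-library Python. This is a minimal sketch, not the project's own explore.py helper: the function name `ngram_counts` and the toy token list are illustrative assumptions, standing in for one row of the `words` column.

```python
from collections import Counter


def ngram_counts(words, n=2):
    """Count every run of n consecutive tokens in a list of words."""
    # Zip n shifted views of the list: (w0, w1), (w1, w2), ... for n=2.
    grams = zip(*(words[i:] for i in range(n)))
    return Counter(grams)


# Toy tokens (hypothetical), standing in for one cleaned README.
words = ["pip", "install", "package", "then", "pip", "install", "extras"]
top_bigrams = ngram_counts(words, n=2).most_common(1)
```

Summing these counters per language and comparing the most common entries is one way to see which combinations separate Python READMEs from JavaScript ones.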
Is the average document length for Python READMEs longer or shorter than the overall average document length?
Two-tailed t-test:
Null: The average `doc_length` of the Python READMEs does not
differ from the overall population average `doc_length`.
Alternative: The average `doc_length` of the Python READMEs
differs from the overall population average `doc_length`.
Is the average document length for JavaScript READMEs longer or shorter than the overall average document length?
Two-tailed t-test:
Null: The average `doc_length` of the JavaScript READMEs does not
differ from the overall population average `doc_length`.
Alternative: The average `doc_length` of the JavaScript READMEs
differs from the overall population average `doc_length`.
Data was acquired using the BeautifulSoup library. Helper functions sent requests to the first 30 search result pages of the most starred repos for JavaScript and Python, then parsed the HTML to find the elements containing the programming language, the repo sub-URL, and the README content for each repo on those pages. The results were saved to a DataFrame and stored locally as a JSON file for reproducibility.
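Storing the scraped data locally means the scrape only has to run once. The pattern can be sketched as below; this is an assumption about how such a cache might look, not the code in acquire.py. The names `load_or_scrape` and `fake_scrape`, the file name `data.json`, and the record fields are all hypothetical.

```python
import json
import os
import tempfile


def load_or_scrape(path, scrape_fn):
    """Return cached repo records from `path`, scraping only if the file is missing."""
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    records = scrape_fn()  # expected to return a list of dicts, one per repo
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, indent=2)
    return records


def fake_scrape():
    # Stand-in for the real scraping helpers; field names are hypothetical.
    return [{"repo": "owner/project", "language": "Python", "readme_contents": "# Project"}]


with tempfile.TemporaryDirectory() as tmp:
    cache_path = os.path.join(tmp, "data.json")
    first = load_or_scrape(cache_path, fake_scrape)   # scrapes and writes the file
    second = load_or_scrape(cache_path, fake_scrape)  # reads the cached file
```

Deleting the JSON file forces a fresh scrape, which is useful when the list of most starred repos has changed.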
README content is normalized, tokenized, stemmed, lemmatized, and stripped of stopwords to produce "clean" content. Duplicate repos are removed and 2 columns are created. The data is split into train, validate, and test sets, stratifying on the programming language.
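The cleaning steps can be illustrated with a standard-library sketch. It covers only normalization, ASCII encoding, tokenizing on whitespace, and stopword removal; stemming and lemmatizing are left out because they need a library such as NLTK. The function names and the tiny stopword list are assumptions for illustration, not the contents of prepare.py.

```python
import re
import unicodedata

# A tiny illustrative stopword list; a real run would use a much fuller one.
STOPWORDS = {"a", "an", "the", "and", "or", "of", "to", "in", "is", "for", "this"}


def basic_clean(text):
    """Lowercase, strip accents and other non-ASCII characters, drop punctuation."""
    text = unicodedata.normalize("NFKD", text.lower())
    text = text.encode("ascii", "ignore").decode("ascii")
    # Anything that is not a lowercase letter, digit, or whitespace becomes a space.
    return re.sub(r"[^a-z0-9\s]", " ", text)


def clean_readme(text, stopwords=STOPWORDS):
    """Return the cleaned README as a single space-separated string."""
    words = [w for w in basic_clean(text).split() if w not in stopwords]
    return " ".join(words)


clean = clean_readme("Installing the Café-API: run `npm install` in this folder!")
words = clean.split()      # analogous to the `words` column
doc_length = len(words)    # analogous to the `doc_length` column
```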
Summary
The distribution of JavaScript and Python data is nearly 1:1. Words whose occurrences are split roughly 40-60% between the two languages are likely to be useless for classification. Words at either extreme of that distribution will be more significant when classifying language in the modeling section. Word combinations may be more useful in classification since combinations are more distinctive than individual words.
Hypothesis Testing
Hypothesis 1:
Two-Tailed T-Test: Is the average document length for Python READMEs longer or shorter than the overall average document length?
- $H_0$: The average `doc_length` of the Python READMEs does not differ from the overall population average `doc_length`.
- $H_a$: The average `doc_length` of the Python READMEs differs from the overall population average `doc_length`.
Result: We failed to reject the null hypothesis, meaning there is no statistically significant difference between the mean Python README document length and the overall mean README document length.
Hypothesis 2:
Two-Tailed T-Test: Is the average document length for JavaScript READMEs longer or shorter than the overall average document length?
- $H_0$: The average `doc_length` of the JavaScript READMEs does not differ from the overall population average `doc_length`.
- $H_a$: The average `doc_length` of the JavaScript READMEs differs from the overall population average `doc_length`.
Result: We failed to reject the null hypothesis, meaning there is no statistically significant difference between the mean JavaScript README document length and the overall mean README document length.
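Both tests compare one group's mean `doc_length` against a fixed overall mean, which is a one-sample t-test. The sketch below computes the t statistic with the standard library only. As a labeled assumption, it approximates the p-value with a standard normal distribution, which is reasonable only for large samples; `scipy.stats.ttest_1samp` would give the exact value. The function name and the toy lengths are hypothetical and are not the project's data.

```python
import math
import statistics


def one_sample_t(sample, population_mean):
    """Return (t statistic, approximate two-tailed p-value) for a one-sample test."""
    n = len(sample)
    mean = statistics.mean(sample)
    std_err = statistics.stdev(sample) / math.sqrt(n)  # stdev uses n - 1
    t_stat = (mean - population_mean) / std_err
    # Assumption: n is large enough that the t distribution is close to normal.
    p_value = 2 * (1 - statistics.NormalDist().cdf(abs(t_stat)))
    return t_stat, p_value


# Toy document lengths (hypothetical): 41 values with a mean of exactly 100.
python_lengths = list(range(80, 121))
t_same, p_same = one_sample_t(python_lengths, population_mean=100)
t_diff, p_diff = one_sample_t(python_lengths, population_mean=90)
```

With a significance level of 0.05, the null hypothesis is rejected only when the p-value falls below 0.05, as it does for the second call but not the first.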
Summary
- Baseline: JavaScript is the more frequently occurring of the two languages, so our baseline predicts that every README belongs to a JavaScript repository. This baseline is accurate approximately 52% of the time.
- Feature Extraction: Used Bag of Words and TF-IDF to assign a numerical value to each word for modeling. Set X and y variables for modeling. Used helper functions from model.py to keep the notebook readable.
- Models:
  - Logistic Regression Using Bag of Words
  - Logistic Regression Using TF-IDF
  - Decision Tree Using TF-IDF
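The baseline and the two text representations listed above can be sketched with the standard library. This is not the code in model.py: the function names and toy documents are assumptions, and the TF-IDF here is the plain textbook form (term frequency times the log of inverse document frequency). Library implementations typically add smoothing and normalization, so their numbers will differ.

```python
import math
from collections import Counter


def baseline_accuracy(labels):
    """Accuracy of always predicting the most common label."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)


def bag_of_words(docs):
    """One Counter of raw term counts per document."""
    return [Counter(doc.split()) for doc in docs]


def tf_idf(docs):
    """Per-document weights: (count / doc length) * log(N / docs containing term)."""
    bows = bag_of_words(docs)
    n_docs = len(docs)
    # Iterating a Counter yields each distinct term once, so this counts documents.
    doc_freq = Counter(term for bow in bows for term in bow)
    weights = []
    for bow in bows:
        length = sum(bow.values())
        weights.append(
            {t: (c / length) * math.log(n_docs / doc_freq[t]) for t, c in bow.items()}
        )
    return weights


# Toy cleaned documents and labels (hypothetical).
docs = ["npm install react", "npm run build", "pip install flask"]
labels = ["JavaScript", "JavaScript", "Python"]

baseline = baseline_accuracy(labels)
counts = bag_of_words(docs)
weights = tf_idf(docs)
```

In the toy data, "react" appears in only one document and so receives a higher TF-IDF weight than "npm", which appears in two. That down-weighting of common terms is the difference between the two representations.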
Evaluation
The Decision Tree model is most likely overfit; it performed worse than the others on validate. The TF-IDF Logistic Regression model (Model 2) performed best, so we moved forward with this model for testing on unseen data.
Summary
- Repository language classes:
  - JavaScript
  - Python
- We ran 3 different classification models:
  - Logistic Regression Using Bag of Words
  - Logistic Regression Using TF-IDF
  - Decision Tree Using TF-IDF
- The results of the tests show that the model with the most consistently high accuracy is the Logistic Regression model using TF-IDF, with an average of 90.5% accuracy across all datasets.
  - We suspect that the high degree of accuracy is caused both by some overfitting (accounted for by adjusting the hyperparameters of the models) and by the task being only a binary classification.
  - As shown in the exploration stage, the words typically used in Python and JavaScript repositories are distinct enough that the models could determine the language of a repository with relative ease.
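One way to quantify that distinctness is the share of each word's occurrences that comes from each language, the same idea behind the 40-60% rule of thumb in the exploration summary. The sketch below is illustrative only: `word_shares` and the toy rows are hypothetical, not taken from explore.py or the scraped data.

```python
from collections import Counter, defaultdict


def word_shares(rows):
    """Map each word to the share of its occurrences that comes from each language."""
    counts = defaultdict(Counter)  # word -> Counter({language: count})
    for language, words in rows:
        for word in words:
            counts[word][language] += 1
    shares = {}
    for word, by_lang in counts.items():
        total = sum(by_lang.values())
        shares[word] = {lang: n / total for lang, n in by_lang.items()}
    return shares


# Toy (language, words) rows, hypothetical.
rows = [
    ("JavaScript", ["npm", "install", "node"]),
    ("JavaScript", ["npm", "run", "install"]),
    ("Python", ["pip", "install", "run"]),
    ("Python", ["pip", "install", "venv"]),
]
shares = word_shares(rows)
```

Here "install" is split evenly between the languages and carries little signal, while "npm" and "pip" each appear in only one language and are strong indicators.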
Note: If additional languages had been added, e.g. Java or R, we expect that the overall accuracy and recall of the models would have gone down. We hypothesize that this would be due to similarities in the purpose of those languages (not their syntax), which would make the natural language surrounding them harder for the model to distinguish.
Next Steps
- As we continue to expand the project, we would like to introduce additional languages into our repository scraping. That is the single biggest step we can take to improve the robustness of the model.
- We would also like to explore which language introduced the most inaccuracy; i.e., was it more difficult for the model to classify the Python repositories accurately, or the JavaScript repositories? This question would extend to the additional languages under the expanded scraping mentioned above.
What should the user viewing this project do to recreate the project?
- Fork or clone this repository.
- Copy and paste the contents of:
  - nlp_final_notebook for action steps
  - wrangle_scratchpad for acquiring and preparing multiple parts of repo data
  - acquire.py for helper functions
  - prepare.py for helper functions
  - explore.py for helper functions
  - model.py for helper functions
An Easy Way to Download
To save the file straight in your project directory, follow these steps:
- Click the file in this repository you want to copy and paste. It should open to the page as shown below.
- Right-click Raw.
- Click save as.
- Click the folder you want to save the file in, such as your project directory.
- Rename the file as file_name.
- Make sure the file is saving as the proper file type before clicking save.
- You can now edit the file how you want within your project directory.
What tools did you use and what version were they?
Python version 3.8.5 (all imports can be found in the import code block of each section)
Anyone can use this project for reproduction and educational purposes.
- Brandon Martinez
- Luke Becker