GitHub - JeffW12345/formality-classification-using-machine-learning: Investigation into the use of supervised machine learning to predict formality classifications of written communication

For the final project of my MSc in Software Development at the University of Glasgow, I carried out an investigation using supervised machine learning. This project explored to what extent it was possible to predict, for each sentence in a corpus of sentences, whether humans had determined the sentence in question to be formal or informal in nature.

This repository contains:

My code (which is written in Python and uses scikit-learn), along with a requirements.txt file.
The data files I used for the tests: the original data file and a modified version to which I added data fields. Both consist of sentences and fields relating to those sentences.
Files relating to testing the program, which are stored in the Program integrity test material folder.
A results spreadsheet.
The project report: MSc_Dissertation_Predicting_The_Formality_Of_Language.pdf

USER NOTES

PROGRAM FILES:

ngram-only-tests.py runs formality classification tests using only n-grams as features.
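As an illustration only, and not the code in the repository, the sketch below shows what word n-gram features look like for a single sentence. It uses just the Python standard library; the function names word_ngrams and ngram_counts and the example sentence are hypothetical. scikit-learn's CountVectorizer offers an ngram_range parameter that serves a similar purpose.

```python
from collections import Counter


def word_ngrams(sentence, n):
    """Return the word n-grams of a sentence as a list of tuples."""
    tokens = sentence.lower().split()
    # Slide a window of length n across the token list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(sentence, n_values=(1, 2)):
    """Count every n-gram of the requested sizes in one sentence."""
    counts = Counter()
    for n in n_values:
        counts.update(word_ngrams(sentence, n))
    return counts


# Hypothetical example sentence: unigrams and bigrams become the features.
features = ngram_counts("Please find the report attached")
```

Each distinct n-gram becomes one feature, and its count in the sentence is the feature value.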

non-ngram-only-tests.py runs formality classification tests using only non-n-gram features (such as the number of verbs in each sentence).
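A minimal sketch of what sentence-level, non-n-gram features can look like is given below. The features shown (word count, average word length, contractions, exclamation marks) are assumptions chosen because they need no external tools. They are not necessarily the features used in this project, and counting verbs would additionally require a part-of-speech tagger.

```python
def surface_features(sentence):
    """Compute a few simple, hypothetical surface features for one sentence."""
    tokens = sentence.split()
    word_count = len(tokens)
    # Guard against empty sentences to avoid division by zero.
    avg_word_length = sum(len(t) for t in tokens) / word_count if tokens else 0.0
    return {
        "word_count": word_count,
        "avg_word_length": avg_word_length,
        # Tokens containing an apostrophe are treated as contractions (a rough heuristic).
        "contraction_count": sum("'" in t for t in tokens),
        "exclamation_count": sentence.count("!"),
    }


example = surface_features("I can't wait!")
```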

ngram-and-non-ngram-tests-combined.py runs formality classification tests that combine n-grams with other features.

add-new-fields.py reads data from the data file, creates additional data fields and populates them, and then writes both the original data and the new data back to file.
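The read, extend and write pattern described above can be sketched with the standard csv module. This is an assumption-laden illustration rather than the script itself: the column names sentence, formality and word_count, the function name add_word_count_field and the two sample rows are all made up, and an in-memory string stands in for the data file.

```python
import csv
import io


def add_word_count_field(csv_text, sentence_column="sentence"):
    """Read CSV text, append a word_count column, and return the new CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    # Keep the original columns and add the new one at the end.
    fieldnames = list(reader.fieldnames) + ["word_count"]
    for row in rows:
        row["word_count"] = len(row[sentence_column].split())
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


# Made-up sample data, not taken from the real dataset.
sample = "sentence,formality\nKindly confirm receipt.,1.8\nyeah ok,-2.1\n"
updated = add_word_count_field(sample)
```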

mcnemar-stats.py is used to check whether the probability that the difference between two sets of prediction results is due to chance is less than 5%.
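For readers unfamiliar with McNemar's test, here is a minimal sketch of the exact (binomial) form of the test, written with the standard library only. Whether the script uses the exact form or the chi-squared approximation is not stated here, so treat this as one possible variant. The function name mcnemar_exact_p and the toy predictions are hypothetical.

```python
from math import comb


def mcnemar_exact_p(labels, preds_a, preds_b):
    """Two-sided exact McNemar p-value for two classifiers on the same items."""
    # b: classifier A correct, classifier B wrong; c: the reverse.
    b = sum(1 for y, pa, pb in zip(labels, preds_a, preds_b) if pa == y and pb != y)
    c = sum(1 for y, pa, pb in zip(labels, preds_a, preds_b) if pa != y and pb == y)
    n = b + c
    if n == 0:
        return 1.0  # The classifiers never disagree on correctness.
    # Under the null hypothesis, the discordant pairs follow Binomial(n, 0.5).
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Toy data: A is right on all ten items, B is wrong on eight of them.
labels = [1] * 10
preds_a = [1] * 10
preds_b = [0] * 8 + [1] * 2
p_value = mcnemar_exact_p(labels, preds_a, preds_b)
significant = p_value < 0.05
```

Only the items on which exactly one classifier is correct influence the result; items where both are right or both are wrong are ignored.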

checkForDuplicateSentences.py - This is located in the 'Program integrity test material' folder and can be used to check for duplicate sentences.
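A duplicate check of this kind could look roughly like the sketch below. The normalisation step (lower-casing and collapsing whitespace) is an assumption, and the script may well compare sentences exactly; the function name find_duplicate_sentences is hypothetical.

```python
from collections import Counter


def find_duplicate_sentences(sentences):
    """Return a dict mapping each repeated sentence to its number of occurrences."""
    # Assumed normalisation: ignore case and differences in whitespace.
    normalised = [" ".join(s.lower().split()) for s in sentences]
    counts = Counter(normalised)
    return {sentence: n for sentence, n in counts.items() if n > 1}


duplicates = find_duplicate_sentences(["Hello there.", "hello  there.", "Goodbye."])
```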

DATASETS:

The original dataset is original_formality_dataset.csv.

The modified dataset, which contains the additional fields added by add-new-fields.py, is new_formality_data.csv.

PROGRAM INTEGRITY TEST FILES FOLDER:

These files were used to test that the programs were working as they should.

Dummy_Data_Before.csv is populated with 20 randomly generated sentences, and the rest of the fields are populated with random numbers.

Dummy_Data_After.csv contains additional data fields and data relating to the aforementioned sentences.

Program_integrity_tests.docx - Sample console output of tests used to ensure that the program was working correctly. Includes test print statements.

checkForDuplicateSentences.py - Described in 'PROGRAM FILES' above.

RESULTS SPREADSHEET:

Formality_Classification_Results.xlsx contains the results of the formality classification prediction tests.

The spreadsheet is organised into sheets, each of which relates to a different category of test.

SAMPLE CONSOLE OUTPUT:

Sample_Console_Output.docx displays sample console output relating to the three modules responsible for the machine learning tests.

REPOSITORY FILES:

Program integrity test material (folder)
Formality_Classification_Results.xlsx
MSc_Dissertation_Predicting_The_Formality_Of_Language.pdf
README.md
add-new-fields.py
mcnemar-stats.py
new_formality_data.csv
ngram-and-non-ngram-tests-combined.py
ngram-only-tests.py
non-ngram-only-tests.py
original_formality_dataset.csv
requirements.txt
