Lingo

An experimental project that seeks to infer the gender of a person based on their name.

Requirements

1 GB RAM
Python 3.6 and above

Usage

Run these commands from the root of the project directory.

You must first train the model with the following command. It reads the file data/training.txt and saves the trained model as JSON in training.json.

python3 learn.py

Then, to use the model, run the file Lingo.py. You will be greeted with a Name: prompt as soon as the training data is loaded into memory.

python3 Lingo.py

TODO

How It Works

TL;DR: BAYES' THEOREM.

At both training and use time, each name is divided into about 300 components called 'metrics'. A few metrics include:

Letter pairs. For example: adnan is split into ad, dn, na ...
Letter triplets. For example: adnan is split into adn, dna, nan ...
Pairs and Triplets with offset from the end of the name like: 0:an, 1:na or 0:nan, 1:dna
Single letters with offsets from the end of the name. (0:n, 1:a, 2:n ...)
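The character-based metrics listed above can be sketched as follows. This is a minimal illustration, not the code in metrics.py: the function names `ngrams`, `offset_ngrams` and `character_metrics` are hypothetical, and the sketch assumes offsets are counted from the last n-gram backwards, as the 0:an, 1:na example suggests.

```python
def ngrams(name, n):
    """Return every run of n consecutive letters, left to right."""
    return [name[i:i + n] for i in range(len(name) - n + 1)]


def offset_ngrams(name, n):
    """Tag each n-gram with its distance from the end of the name."""
    grams = ngrams(name, n)
    return [f"{offset}:{gram}" for offset, gram in enumerate(reversed(grams))]


def character_metrics(name):
    """Collect plain pairs and triplets, plus end-offset singles, pairs and triplets."""
    name = name.lower()
    metrics = []
    for n in (2, 3):
        metrics.extend(ngrams(name, n))
    for n in (1, 2, 3):
        metrics.extend(offset_ngrams(name, n))
    return metrics
```

For adnan, `ngrams("adnan", 2)` yields ad, dn, na, an, and `offset_ngrams("adnan", 2)` starts with 0:an, 1:na.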

Each letter is also represented phonetically in several different ways. For example, a can be GutturalVowel, LongGutturalVowel, LongVowel, LongGuttural, Vowel, Guttural, or Long (see phonetics.py for a list of representations of each letter).

These phonetic attributes are taken from the Bengali Alphabet page on Wikipedia by matching each English letter to its closest phonetic doppelgänger in the Bengali language.

Afterwards, all the combinations that can occur between the two (or three) lists of phonetic representations of the two (or three) letters in a pair (or triplet) are found and used as metrics. Examples: GutturalVowel-LabialConsonant, Long-LabialAspiratedGenericConsonant-GutturalUnaspirated, Vowel-Consonant-Aspirated

The combinations mentioned above are then combined with the offset from the end of the name to create yet another set of metrics. Example: 0:GutturalVowel-LabialConsonant. These two processes account for the meat of the metrics and are what give the model its high accuracy.
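One way to picture the phonetic combinations is as a Cartesian product over the per-letter trait lists. The sketch below assumes exactly that. The `TRAITS` table is a hypothetical, heavily abridged stand-in for the mapping in phonetics.py, and the function names are illustrative only.

```python
from itertools import product

# Hypothetical, abridged trait table; the real mapping lives in phonetics.py.
TRAITS = {
    "a": ["GutturalVowel", "Vowel", "Guttural"],
    "b": ["LabialConsonant", "Consonant", "Labial"],
}


def phonetic_combinations(gram, traits=TRAITS):
    """Join every choice of one representation per letter with '-'."""
    # A letter missing from the table contributes no options, so no combinations.
    options = [traits.get(letter, []) for letter in gram]
    return ["-".join(choice) for choice in product(*options)]


def offset_phonetic_metrics(name, n, traits=TRAITS):
    """Prefix each combination with the n-gram's offset from the end of the name."""
    grams = [name[i:i + n] for i in range(len(name) - n + 1)]
    metrics = []
    for offset, gram in enumerate(reversed(grams)):
        for combo in phonetic_combinations(gram, traits):
            metrics.append(f"{offset}:{combo}")
    return metrics
```

With three traits per letter, a pair produces 9 combinations and a triplet 27, which is consistent with phonetic metrics making up most of the roughly 300 metrics per name.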

Note: Internally, Lingo uses single-letter shorthands for traits (for example, Vowel is just v), so the actual metrics look similar to: 0:xwe-fiu

Training

When learning, the roughly 300 metrics that each name produces are tallied and stored in the training file for later use. The number of male and female names seen is also tallied for later use in Bayesian inference.
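The tallying step might look roughly like the sketch below. The layout of the returned dictionary, the `train` and `letter_pairs` names, and the (name, gender) input shape are all assumptions made for illustration. The actual formats of data/training.txt and training.json are not described here.

```python
import json
from collections import Counter


def train(labelled_names, extract_metrics):
    """Tally metrics per gender; labelled_names yields (name, 'male' | 'female')."""
    tallies = {"male": Counter(), "female": Counter()}
    name_counts = Counter()
    for name, gender in labelled_names:
        name_counts[gender] += 1
        tallies[gender].update(extract_metrics(name))
    return {
        "names": dict(name_counts),
        "metrics": {gender: dict(counts) for gender, counts in tallies.items()},
    }


def letter_pairs(name):
    """Stand-in metric extractor that uses letter pairs only."""
    return [name[i:i + 2] for i in range(len(name) - 1)]


model = train([("adnan", "male"), ("anna", "female")], letter_pairs)
serialized = json.dumps(model)  # a plain dict of counts serializes directly to JSON
```

Keeping only counts in the saved model means the probabilities can be computed at inference time from the tallies and the per-gender name totals.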

Inferencing

When making an inference, Lingo creates two buckets in memory: the female bucket and the male bucket. Then all the metrics for the name are computed again using the methods above.

Finally, the tally for each metric is run through a Bayes probability function, multiplied by a weight based on offset and metric type, and added to the bucket.

Metrics that pertain to the ends of names are given higher weights than other metrics.
Phonetic-trait-based metrics are given precedence over character-based metrics.

If the percentage difference between the levels of the two buckets is higher than 15%, an inference is made. Otherwise, the name is considered to be Unisex.
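The inference steps above can be sketched as follows. Several details are assumptions rather than Lingo's actual behavior: the add-one smoothing in `bayes_posterior`, the weight values in `example_weight`, the model layout (per-gender name counts under "names" and metric tallies under "metrics"), and the reading of "percentage difference" as the gap between buckets divided by their sum.

```python
def bayes_posterior(metric, gender, model):
    """P(gender | metric) via Bayes' theorem, with add-one smoothing (an assumption)."""
    total_names = sum(model["names"].values())
    joint = {}
    for g, n_names in model["names"].items():
        likelihood = (model["metrics"][g].get(metric, 0) + 1) / (n_names + 2)
        prior = n_names / total_names
        joint[g] = likelihood * prior
    return joint[gender] / sum(joint.values())


def example_weight(metric):
    """Hypothetical weights: favor small end offsets and phonetic ('-' joined) metrics."""
    offset, sep, body = metric.rpartition(":")
    weight = 1.0
    if sep:  # the metric carries an offset from the end of the name
        weight += 1.0 / (1 + int(offset))
    if "-" in body:  # phonetic combinations are joined with '-'
        weight *= 2.0
    return weight


def infer(metrics, model, weight=example_weight, threshold=0.15):
    """Fill one bucket per gender, then decide or fall back to 'unisex'."""
    buckets = {"male": 0.0, "female": 0.0}
    for metric in metrics:
        for gender in buckets:
            buckets[gender] += weight(metric) * bayes_posterior(metric, gender, model)
    total = buckets["male"] + buckets["female"]
    if total == 0:
        return "unisex"
    gap = abs(buckets["male"] - buckets["female"]) / total
    if gap <= threshold:
        return "unisex"
    return max(buckets, key=buckets.get)
```

Summing weighted posteriors, rather than multiplying likelihoods as a naive Bayes classifier would, matches the bucket description above. Whether Lingo normalizes the buckets in this way is not stated.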

Accuracy

We trained the model on 32 thousand names and checked it against 3,200 names, and found the model to be 91% accurate. To reproduce this statistic, execute the file checker.py. It should report the correct and incorrect percentages soon enough.

python3 checker.py
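The kind of held-out check described above can be sketched in a few lines. This is not the code in checker.py: `accuracy` and its `predict` argument are hypothetical, and how Unisex predictions are scored is left to the caller because it is not specified here.

```python
def accuracy(labelled_names, predict):
    """Return the fraction of held-out names whose prediction matches the label."""
    labelled_names = list(labelled_names)
    if not labelled_names:
        raise ValueError("need at least one labelled name")
    correct = sum(1 for name, label in labelled_names if predict(name) == label)
    return correct / len(labelled_names)
```

The names used for checking should not appear in the training data, otherwise the reported figure overstates how well the model generalizes.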

License

MIT.

Made With ♥ By


Samiha Tahsin mahir.samiha@gmail.com
Omran Jamal o.jamal97@gmail.com

Reasonence/Lingo
