ga-decipher

A simple implementation of beam search and a genetic algorithm to solve substitution ciphers, aimed at the decipherment of unknown scripts. The model uses a simple n-gram model to score candidate mappings and beam search or a genetic algorithm to search for the best solution. The program was devised to solve the Rongorongo script of Easter Island, but in theory can be used for any other unknown script.

Installation

Install dependencies

pip install -r requirements.txt

Run the program

python main.py <source_file> <target_file> [options]

Usage

To run the program, execute the main.py script with the following command-line arguments:

source_file: Path to the file containing the source text.
target_file: Path to the file containing the target text.

Where the source is the cipher text (i.e. texts written in an unknown script) and the target is the plain text - a corpus presumably in the same language as the source, but transliterated. They can be independent, as long as representing the same language.

Additional options:

-a, --algorithm: Algorithm to use (bs or ga, default: bs)
-ng, --n-gram: N-gram size (default: 3)
-s, --num-symbols: Number of symbols (default: 50)
-i, --ignore: Symbols to ignore during decoding (comma-separated)
-n, --nodes: Number of nodes in the beam search (default: 10)
-bw, --beam-width: Beam width in the beam search (default: 100)
-p, --pop-size: Population size in the genetic algorithm (default: 2000)
-np, --num-parents: Number of parents in the genetic algorithm (default: 1000)
-m, --mutation-rate: Mutation rate in the genetic algorithm (default: 0.5)
-c, --crossover-rate: Crossover rate in the genetic algorithm (default: 0.8)
-g, --generations: Number of generations (default: 200)
-nc, --n-cores: Number of CPU cores to use (default: 0, uses all available cores)
-e, --eval: Evaluation file to assess accuracy
-o, --output: Output folder to save results

The source and target texts are expected to be simple text files with newline separated sentences. If an evaluation file is provided, it is expected to be a text file with pairs of source and target symbols separated by a space. Refer to the sample files in the data folder as examples.

Example

For running the Japanese experiment with the hiragana text as source, evaluating the results on the correct mapping file and writing the results to a japanese_results folder:

python main.py data/hiragana.txt data/japanese.txt -e data/japanese_mapping.txt -o japanese_results

Experiment

The model was tested on a corpus of Japanese sentences obtained from tatoeba. The choice of Japanese is due to its logosyllabic writing system and relatively simple syllable structure, which make it an appropriate modern parallel for Rongorongo and Rapanui.

Two versions of the Japanese corpus were used - an artifically generated one where only hiragana was used, and the original one with mixed kanji, hiragana and katakana. The former allows for assessing performance for a purely syllabic writing system, while the latter more realistically represents a noisy scenario where logograms are mixed with syllables. A sample of 1000 sentences were used for fitting the language model, and another 1000 for decipherment.

The model correctly identifies 86% of the syllabic symbols in the hiragana corpus. Of the top 10 symbols, all are correctly identified (if one counts the reading of は as wa).

Results

The Rapanui language model was fitted on a sample of short recitations and songs assumed to represent the genres and language present in Rongorongo (Barthel 1960; Blixen 1979; Campbell 1971). For the written corpus, the set of texts removing parallel passages and repetitive structured sequences usually called the "independent text" (Horley 2007) was used. The original Barthel encoding was converted to the simplified proposal by Horley (2021). For details, the reader is referred to the discussion and code in my other rongorongo repository.

A second corpus was prepared by removing the anthropomorphs 200, 381 and 256, as well as glyph 3, under the assumption the former are allographs and are frequently omitted, and that the latter appears to be a decorator (see the discussion here). The results are presented in the table below.

Code	Syllable₁	Syllable₂
200	a
6	i	a
10	u	e
1	e	u
600	te	i
4	ma	te
2	ta	ta
3	o
381	ka
256	ru
8	tu	ma
430	re	ra
9	ha	ha
5	ro	ka

It is curious that the reading of 381 as ka has previously been suggested based on independent evidence (Davletshin 2022).

This is by no means an endorsement that the above solution is correct, as it heavily depends on the correct identification of the glyph inventory, the nature of rongorongo as mainly syllabic, the assumption that the chosen rapanui corpus is representative of the language in the glyphs, among other factors.

Name		Name	Last commit message	Last commit date
Latest commit History 8 Commits
data		data
img		img
japanese_results		japanese_results
kanji_results		kanji_results
rongo_cut_results		rongo_cut_results
rongo_results		rongo_results
.gitignore		.gitignore
README.md		README.md
beam_search.py		beam_search.py
genetic_algorithm.py		genetic_algorithm.py
main.py		main.py
ngram_model.py		ngram_model.py
processing.py		processing.py
requirements.txt		requirements.txt
scoring.py		scoring.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

ga-decipher

Installation

Usage

Example

Experiment

Results

About

Releases

Packages

Languages

Code	Syllable₁	Syllable₂
200	a
6	i	a
10	u	e
1	e	u
600	te	i
4	ma	te
2	ta	ta
3	o
381	ka
256	ru
8	tu	ma
430	re	ra
9	ha	ha
5	ro	ka

Code	Syllable₁	Syllable₂
200	a
6	i	a
10	u	e
1	e	u
600	te	i
4	ma	te
2	ta	ta
3	o
381	ka
256	ru
8	tu	ma
430	re	ra
9	ha	ha
5	ro	ka

jgregoriods/ga-decipher

Folders and files

Latest commit

History

Repository files navigation

ga-decipher

Installation

Usage

Example

Experiment

Results

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages

Code	Syllable₁	Syllable₂
200	a
6	i	a
10	u	e
1	e	u
600	te	i
4	ma	te
2	ta	ta
3	o
381	ka
256	ru
8	tu	ma
430	re	ra
9	ha	ha
5	ro	ka