Skip to content

Latest commit

 

History

History
86 lines (62 loc) · 5.65 KB

README.md

File metadata and controls

86 lines (62 loc) · 5.65 KB

ga-decipher

A simple implementation of beam search and a genetic algorithm to solve substitution ciphers, aimed at the decipherment of unknown scripts. The model uses a simple n-gram model to score candidate mappings and beam search or a genetic algorithm to search for the best solution. The program was devised to solve the Rongorongo script of Easter Island, but in theory can be used for any other unknown script.

Installation

  1. Install dependencies
pip install -r requirements.txt
  1. Run the program
python main.py <source_file> <target_file> [options]

Usage

To run the program, execute the main.py script with the following command-line arguments:

  • source_file: Path to the file containing the source text.
  • target_file: Path to the file containing the target text.

Where the source is the cipher text (i.e. texts written in an unknown script) and the target is the plain text - a corpus presumably in the same language as the source, but transliterated. They can be independent, as long as representing the same language.

Additional options:

  • -a, --algorithm: Algorithm to use (bs or ga, default: bs)
  • -ng, --n-gram: N-gram size (default: 3)
  • -s, --num-symbols: Number of symbols (default: 50)
  • -i, --ignore: Symbols to ignore during decoding (comma-separated)
  • -n, --nodes: Number of nodes in the beam search (default: 10)
  • -bw, --beam-width: Beam width in the beam search (default: 100)
  • -p, --pop-size: Population size in the genetic algorithm (default: 2000)
  • -np, --num-parents: Number of parents in the genetic algorithm (default: 1000)
  • -m, --mutation-rate: Mutation rate in the genetic algorithm (default: 0.5)
  • -c, --crossover-rate: Crossover rate in the genetic algorithm (default: 0.8)
  • -g, --generations: Number of generations (default: 200)
  • -nc, --n-cores: Number of CPU cores to use (default: 0, uses all available cores)
  • -e, --eval: Evaluation file to assess accuracy
  • -o, --output: Output folder to save results

The source and target texts are expected to be simple text files with newline separated sentences. If an evaluation file is provided, it is expected to be a text file with pairs of source and target symbols separated by a space. Refer to the sample files in the data folder as examples.

Example

For running the Japanese experiment with the hiragana text as source, evaluating the results on the correct mapping file and writing the results to a japanese_results folder:

python main.py data/hiragana.txt data/japanese.txt -e data/japanese_mapping.txt -o japanese_results

Experiment

The model was tested on a corpus of Japanese sentences obtained from tatoeba. The choice of Japanese is due to its logosyllabic writing system and relatively simple syllable structure, which make it an appropriate modern parallel for Rongorongo and Rapanui.

Two versions of the Japanese corpus were used - an artifically generated one where only hiragana was used, and the original one with mixed kanji, hiragana and katakana. The former allows for assessing performance for a purely syllabic writing system, while the latter more realistically represents a noisy scenario where logograms are mixed with syllables. A sample of 1000 sentences were used for fitting the language model, and another 1000 for decipherment.

The model correctly identifies 86% of the syllabic symbols in the hiragana corpus. Of the top 10 symbols, all are correctly identified (if one counts the reading of は as wa).

Results

The Rapanui language model was fitted on a sample of short recitations and songs assumed to represent the genres and language present in Rongorongo (Barthel 1960; Blixen 1979; Campbell 1971). For the written corpus, the set of texts removing parallel passages and repetitive structured sequences usually called the "independent text" (Horley 2007) was used. The original Barthel encoding was converted to the simplified proposal by Horley (2021). For details, the reader is referred to the discussion and code in my other rongorongo repository.

A second corpus was prepared by removing the anthropomorphs 200, 381 and 256, as well as glyph 3, under the assumption the former are allographs and are frequently omitted, and that the latter appears to be a decorator (see the discussion here). The results are presented in the table below.

Glyph Code Syllable1 Syllable2
200 a
6 i a
10 u e
1 e u
600 te i
4 ma te
2 ta ta
3 o
381 ka
256 ru
8 tu ma
430 re ra
9 ha ha
5 ro ka

It is curious that the reading of 381 as ka has previously been suggested based on independent evidence (Davletshin 2022).

This is by no means an endorsement that the above solution is correct, as it heavily depends on the correct identification of the glyph inventory, the nature of rongorongo as mainly syllabic, the assumption that the chosen rapanui corpus is representative of the language in the glyphs, among other factors.