Update README
thatbudakguy committed Mar 19, 2021
1 parent 9d137df commit 4b55685
Showing 1 changed file with 9 additions and 6 deletions.
15 changes: 9 additions & 6 deletions README.md
@@ -29,17 +29,20 @@ a simple invocation of `dphon` might look like:
$ dphon text_a.txt text_b.txt
```

which would look for phonetically similar passages between `text_a` and `text_b`. the output will be a list of sequences, with an identifier based on the file's name and an indicator of where in the text the sequence occurs:
which would look for phonetically similar passages between `text_a` and `text_b`. the output will be a list of sequences and their phonemic transcriptions, with an identifier based on the file's name and an indicator of where in the text the sequence occurs:

```sh
score 9, weighted 1.0
趙怱及齊將顏聚代之 (text_a 107505–107512)
趙蔥及齊將顏聚代李 (text_b 95016–95024)
1. text_a (2208–2216):
夏后啟曰以為可為故為之為之天下弗能
*ləʔ ɢʷraj kʰˤajʔ ɢʷraj kˤaʔs ɢʷraj tə ɢʷraj tə
2. text_b (3340–3348):
不可弗爲以爲可 故爲之爲之繇其道物
*ləʔ ɢʷraj kʰˤajʔ kˤaʔs ɢʷraj tə ɢʷraj tə pit
```

the numbers next to the identifiers are _token indices_, and may vary depending on how the text is tokenized – `dphon` currently uses character-based tokenization. whitespace will be removed, and the output will be aligned to make it easier to spot differences between the two sequences. by default, insertions are highlighted in green, and mismatches (differences between the two sequences) are highlighted in yellow.
the numbers next to the identifiers are _token indices_, and may vary depending on how the text is tokenized – `dphon` currently uses character-based tokenization. whitespace will be removed, and the output will be aligned to make it easier to spot differences between the two sequences. by default, insertions are highlighted in green, and mismatches (differences between the two sequences) are highlighted in red. additional (non-matching) context added to either side of match sequences is displayed using a dimmed color.

the score is an indicator of how many characters in the sequences were a phonetic match, while the weighted score normalizes the score by the length of the match. results are sorted by score, which results in the longest contiguous matches being listed first.
matches are sorted by the ratio of their phonemic similarity to their graphic similarity – in other words, matches between texts that sound highly similar but were written very differently will be at the top of the list.
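
to make the ordering concrete, here is a minimal sketch of sorting by that ratio. it is an illustration only, not `dphon`'s actual implementation: the `Match` class, its `phonemic_similarity` and `graphic_similarity` fields, and the example values are all hypothetical, and it assumes both similarities are non-negative numbers.

```python
from dataclasses import dataclass


@dataclass
class Match:
    """Hypothetical match record; not dphon's actual data model."""

    label: str
    phonemic_similarity: float  # assumed: higher means the texts sound more alike
    graphic_similarity: float   # assumed: higher means the texts are written more alike


def variation_ratio(match: Match) -> float:
    """Ratio of phonemic to graphic similarity for one match."""
    # Assumption: a match with no graphic overlap at all ranks first,
    # which also avoids dividing by zero.
    if match.graphic_similarity == 0:
        return float("inf")
    return match.phonemic_similarity / match.graphic_similarity


def sort_matches(matches: list[Match]) -> list[Match]:
    """Return matches with the highest ratio first."""
    return sorted(matches, key=variation_ratio, reverse=True)


matches = [
    Match("identical text", 1.0, 1.0),
    Match("same sound, different graphs", 0.9, 0.3),
    Match("partial overlap", 0.6, 0.5),
]

for m in sort_matches(matches):
    print(m.label)
```

under these made-up values, the match that sounds alike but is written differently has the largest ratio, so it is listed before the verbatim match.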

by default, `dphon` only returns matches that display at least one instance of _graphic variation_ – a case where two different graphemes are used in the same place to represent the same sound. these cases are highlighted in blue. if you're interested in all instances of reuse, regardless of graphic variation, you can use the `--all` flag:
