Regex to specify lower-priority collation tokens #69

Open
tla opened this issue Nov 17, 2018 · 0 comments
tla commented Nov 17, 2018

It often happens in automated collation that very common / frequent tokens, e.g. punctuation or words like 'and' or 'the', get matched a little too eagerly by the algorithm so that more substantive tokens are misaligned. Moreover, the set of tokens that cause this problem will vary according to language / text type / etc.

At the moment I am dealing with this by assigning random strings of characters to the n field of the JSON object for these tokens, so that CollateX won't match them with anything else. This works, but leads to a bunch of duplicated tokens in the output, which I deal with using a graph search algorithm.
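For illustration, the workaround could look roughly like the sketch below. It assumes the usual CollateX JSON witness layout (a list of witnesses, each with an "id" and a list of "tokens" carrying "t" and "n"). The function name `mask_low_priority` and the `__low_` prefix are hypothetical, made up for this example, and not part of CollateX.

```python
import re
import uuid


def mask_low_priority(witnesses, pattern):
    """Return a copy of the witnesses in which every token matching
    `pattern` gets a unique random "n" value, so that no two such
    tokens compare as equal during collation.

    Hypothetical helper for illustration; not a CollateX API.
    """
    low = re.compile(pattern)
    masked = []
    for wit in witnesses:
        tokens = []
        for tok in wit["tokens"]:
            new_tok = dict(tok)  # leave the caller's token untouched
            # Compare against "n" if present, otherwise fall back to "t".
            normal_form = tok.get("n", tok["t"]).strip()
            if low.fullmatch(normal_form):
                # Unique per token, so it cannot match anything else.
                new_tok["n"] = "__low_" + uuid.uuid4().hex
            tokens.append(new_tok)
        masked.append({"id": wit["id"], "tokens": tokens})
    return masked


witnesses = [
    {"id": "A", "tokens": [{"t": "the ", "n": "the"}, {"t": "cat", "n": "cat"}]},
    {"id": "B", "tokens": [{"t": "the ", "n": "the"}, {"t": "dog", "n": "dog"}]},
]
masked = mask_low_priority(witnesses, r"the|and|[.,;:!?]")
```

The original "t" values are kept, so the readings can still be recovered from the output; only the "n" values used for matching are randomised.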

Since what I am doing in post-processing looks and smells an awful lot like collation, it seems like something CollateX should be able to handle internally - match the 'substantive' tokens on a first pass, and the non-substantive ones on a second pass, relative to the alignment that has already been done. The easiest way of specifying these 'unimportant' tokens might be a regular expression, since (as mentioned) they will vary from text to text.
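As a rough sketch of what the regex-driven split might look like (an assumption about one possible design, not how CollateX works today): each witness is partitioned into substantive tokens for the first pass and deferred tokens for the second, with every deferred token remembering which substantive token it follows. The name `partition_tokens` and the anchor convention are hypothetical.

```python
import re


def partition_tokens(tokens, pattern):
    """Split a witness's tokens into first-pass and second-pass sets.

    Returns (substantive, deferred), where `deferred` maps the index of
    the preceding substantive token (-1 for "before the first one") to
    the low-priority tokens that follow it, in order.
    """
    low = re.compile(pattern)
    substantive = []
    deferred = {}
    for tok in tokens:
        if low.fullmatch(tok):
            anchor = len(substantive) - 1  # last substantive token seen so far
            deferred.setdefault(anchor, []).append(tok)
        else:
            substantive.append(tok)
    return substantive, deferred


substantive, deferred = partition_tokens(
    ["the", "cat", "and", "the", "dog", "."], r"the|and|\."
)
```

The first pass would then align only the `substantive` lists across witnesses, and the second pass would align the `deferred` groups whose anchors ended up in the same alignment column.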
