[data reading] Qumin always re-segments wordforms, even in the presence of spaces #35
Comments
Actually, when there is no space, as long as [...] Currently, the behaviour also silently ignores any unknown segments, which looks like a good way to get bugs (unless one has forced a segcheck). I think I also need to remove that.
Normally, Paralex already ensures that this doesn't happen, but maybe it would be better to throw an exception.
We usually throw an error when a dataset is not Paralex compliant, so we should probably be consistent with that. But, if we do...
...Then I agree that it should be done only if an overt keyword is given. That way, the user will always be aware of what is going on behind the scenes.
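The policy discussed in the comments above (trust space-separated forms by default, raise on unknown segments instead of dropping them, and re-segment only when an overt keyword is given) might look roughly like the sketch below. The names `parse_form`, `resegment` and `UnknownSegmentError` are hypothetical and chosen for illustration; they are not Qumin's actual API.

```python
import re


class UnknownSegmentError(ValueError):
    """Hypothetical error: the form contains material absent from the inventory."""


def parse_form(form, inventory, resegment=False):
    """Return the segments of `form` (hypothetical helper, not Qumin's API).

    Spaces are trusted by default. Re-segmentation only happens when the
    caller explicitly passes resegment=True.
    """
    if not resegment:
        segments = form.split()
        unknown = [s for s in segments if s not in inventory]
        if unknown:
            raise UnknownSegmentError(f"Unknown segments in {form!r}: {unknown}")
        return segments

    # Overt re-segmentation: drop the spaces and match known sounds,
    # longest first, so that a two-character sound wins over its prefix.
    sounds = sorted((s for s in inventory if s), key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(s) for s in sounds))
    compact = form.replace(" ", "")
    segments = pattern.findall(compact)
    # findall silently skips unmatched characters, so check nothing was lost.
    if "".join(segments) != compact:
        raise UnknownSegmentError(f"Cannot fully segment {form!r} with this inventory")
    return segments
```

With this shape, a form that is not valid under the inventory fails loudly in both modes, and the user has to opt in to any change of segmentation.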
I hadn't noticed that this wasn't the case, but we should definitely do that.
Why not. I do not see a use case, but since this is already implemented, it is probably easy to keep.
The use case is recent datasets in which I mark non-segmentals (tone, stress, length) directly on a segment (not separated by spaces), while my sound inventory neatly defines them on separate rows. More recent software is able to parse this format, e.g. "b aː b a" with an inventory of three sounds "b", "a", and "ː" (non-segmental). Qumin, however, would only work with an inventory of "b", "a", "aː" (and then a long version of every sound). With the first inventory, Qumin can still work if it re-parses "baːba" using the list of sounds "b", "a", "ː", and finds "b a ː b a".
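To make the example concrete, here is a minimal sketch of re-segmentation by greedy longest match against an inventory. It is an illustration under my own assumptions (the `segment` function is hypothetical), not the code Qumin actually runs, and greedy matching does not backtrack, so it can fail on inventories where a longer match blocks a valid parse.

```python
def segment(form, inventory):
    """Greedy longest-match segmentation of `form`, ignoring existing spaces."""
    compact = form.replace(" ", "")
    longest = max(map(len, inventory), default=0)
    segments = []
    pos = 0
    while pos < len(compact):
        # Try the longest possible candidate first, then shorter ones.
        for size in range(min(longest, len(compact) - pos), 0, -1):
            candidate = compact[pos:pos + size]
            if candidate in inventory:
                segments.append(candidate)
                pos += size
                break
        else:
            # No candidate of any length matched at this position.
            raise ValueError(f"No known sound at position {pos} of {compact!r}")
    return segments


split_inventory = {"b", "a", "ː"}    # length defined on its own row
merged_inventory = {"b", "a", "aː"}  # one long version per vowel

print(" ".join(segment("b aː b a", split_inventory)))   # b a ː b a
print(" ".join(segment("b aː b a", merged_inventory)))  # b aː b a
```

The same written form thus yields different segmentations depending only on the inventory, which is why re-parsing lets the first inventory work.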
All right, I see the link with the first post now. Then we can try this. I am not very familiar with this part of Qumin, though. If I understand correctly, this will only work if the non-segmental information is written in the phon_form in a way that looks segmental (which is the case for ː, but maybe not for all segments, for instance if someone writes a tone with a diacritic: â or á).
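A small illustration of the diacritic concern, assuming the forms are stored as precomposed Unicode characters: "ː" is always its own code point, whereas "â" and "á" only expose the tone mark as a separate character after canonical decomposition (NFD). Whether Qumin normalizes its input at all is not something this thread states; `split_marks` is a hypothetical helper.

```python
import unicodedata


def split_marks(form):
    """Return the characters of `form` after canonical decomposition (NFD)."""
    return list(unicodedata.normalize("NFD", form))


# "\u00e2" is precomposed "â", "\u00e1" is precomposed "á", "\u02d0" is "ː".
for form in ("a\u02d0", "\u00e2", "\u00e1"):
    names = [unicodedata.name(char) for char in split_marks(form)]
    print(form, len(form), names)
```

In precomposed form, "â" is a single character that an inventory containing only "a" and a separate tone mark would not match. After decomposition it becomes "a" followed by a combining accent, which an inventory lookup could treat like any other sound.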
I guess that the best solution would be to make Qumin itself tier-compatible, but this would probably be way too much (useless) work on the pattern module.
At least, that's not a short-term goal :)
The behaviour is that of Qumin V.1: always re-segment. However, it might be better to make it possible to respect the given segmentation. Unfortunately, Qumin has quirks regarding phonology, and the segmentation needed tends to be different from that which I use in later datasets (where I write tiered information, such as length, tones, stress, on the syllable's vowel).
Note that all Paralex datasets now MUST have space separation:
I think we need the following cases:
@JPapir: what do you think? Is that reasonable?
The line doing the splitting is here:
Qumin/src/qumin/representations/segments.py
Line 44 in 2ea1782