[data reading] Qumin always re-segments wordforms, even in the presence of spaces #35

Open
XachaB opened this issue Sep 2, 2024 · 6 comments

Comments

@XachaB
Owner

XachaB commented Sep 2, 2024

The current behaviour is inherited from Qumin V.1: always re-segment. However, it might be better to make it possible to respect the given segmentation. Unfortunately, Qumin has quirks regarding phonology, and the segmentation it needs tends to differ from the one I use in later datasets (where I write tiered information, such as length, tone, and stress, on the syllable's vowel).

Note that all paralex datasets now MUST have space separation:

"The value of the phon_form MUST be a sequence of space-separated segments, e.g. not "dominoːrum" but "d o m i n oː r u m"."

I think we need the following cases:

  • There are no spaces => Throw an error, not paralex compliant (although we do have the means to still parse things... should we?)
  • There are spaces => By default, split on spaces
  • There are spaces, but a user-defined config parameter asks to re-split => Re-split (current behaviour)
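
The three cases above could be sketched roughly as follows. This is only an illustration under assumptions: `split_form` and its `resegment` flag are hypothetical names, not Qumin's actual API, and `segmenter` stands in for whatever compiled regex the sound inventory provides.

```python
import re

def split_form(form, segmenter, resegment=False):
    """Hypothetical helper: return the list of segments for one phon_form."""
    if resegment:
        # Case 3: ignore the given segmentation and re-split with the regex.
        return segmenter.findall(form.replace(" ", ""))
    if " " not in form:
        # Case 1: no spaces, so the form is treated as not Paralex compliant.
        raise ValueError(f"phon_form {form!r} is not space-separated")
    # Case 2: respect the segmentation given in the data.
    return form.split()
```

Note that a form made of a single segment contains no space either, so a real check would need another way to tell it apart from unsegmented input; the sketch ignores this.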

@JPapir: what do you think? Is that reasonable?

The line doing the splitting is here:

tokens = Inventory._segmenter.findall(string)

@XachaB
Owner Author

XachaB commented Sep 2, 2024

Actually, when there is no space, as long as resegment=True, I think we shouldn't complain and just re-segment.

Currently, the behaviour also ignores any unknown segments silently, which looks like a good way to get bugs (unless one has forced a segcheck). I think I also need to remove that.
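
For context, a regex `findall` simply skips any character that matches none of its alternatives, which may be how unknown segments get dropped silently. Below is a minimal sketch of a stricter variant that raises instead. The name `strict_segment` and the plain list `sounds` are assumptions for illustration, not existing Qumin code.

```python
import re

def strict_segment(form, sounds):
    """Hypothetical strict segmenter: fail on anything outside `sounds`."""
    # Try longer sounds first, so that "aː" wins over "a" when both are known.
    ordered = sorted(sounds, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(s) for s in ordered))
    tokens, pos = [], 0
    while pos < len(form):
        match = pattern.match(form, pos)
        # An empty match would never advance, so treat it as a failure too.
        if match is None or match.end() == pos:
            raise ValueError(f"unknown segment at position {pos} in {form!r}")
        tokens.append(match.group())
        pos = match.end()
    return tokens
```

With the sounds "a" and "b", `re.findall("a|b", "bxba")` returns `['b', 'b', 'a']` without any warning, whereas `strict_segment("bxba", ["a", "b"])` raises a `ValueError`.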

@JPapir
Collaborator

JPapir commented Sep 2, 2024

> Currently, the behaviour also ignores any unknown segments silently, which looks like a good way to get bugs (unless one has forced a segcheck). I think I also need to remove that.

Normally, Paralex already ensures that this doesn't happen, but maybe it would be better to throw an exception.

> There are no spaces => Throw an error, not paralex compliant (although we do have the means to still parse things... should we?)

We usually throw an error when a dataset is not Paralex compliant, so we should probably be consistent with that. But, if we do...

> Actually, when there is no space, as long as resegment=True, I think we shouldn't complain and just re-segment.

...Then I agree that it should be done only if an explicit keyword is given. That way, the user will always be aware of what is going on behind the scenes.

> There are spaces, by default, split on spaces

I hadn't noticed that this wasn't the case, but we should definitely do that.

> There are spaces, but a user-defined config parameter asks to re-split: re-split (current behaviour)

Why not. I do not see a use case, but since this is already implemented, it is probably easy to keep.

@XachaB
Owner Author

XachaB commented Sep 2, 2024

The use case is this: in recent datasets, I mark non-segmental information (tone, stress, length) directly on a segment, not separated by spaces, while my sound inventory neatly defines these on separate rows. More recent software is able to parse this format, e.g. "b aː b a" with an inventory of three sounds "b", "a", and "ː" (non-segmental). Qumin, however, would only accept that segmentation with an inventory of "b", "a", "aː" (plus a long version of every other sound). With the first inventory, Qumin can still work if it re-parses "baːba" using the list of sounds "b", "a", "ː", and finds "b a ː b a".
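
To make the "b aː b a" example concrete, here is a small sketch of such a re-parse. It assumes a hypothetical `resegment` helper and passes the inventory as a plain list of strings; Qumin's real segmenter is built differently.

```python
import re

def resegment(form, sounds):
    """Hypothetical helper: drop the given spaces and re-split on `sounds`."""
    # Try longer sounds first, so that "aː" is preferred over "a" when listed.
    ordered = sorted(sounds, key=len, reverse=True)
    segmenter = re.compile("|".join(re.escape(s) for s in ordered))
    return segmenter.findall(form.replace(" ", ""))

print(" ".join(resegment("b aː b a", ["b", "a", "ː"])))   # b a ː b a
print(" ".join(resegment("b aː b a", ["b", "a", "aː"])))  # b aː b a
```

The same input yields either segmentation depending only on which inventory is supplied.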

@JPapir
Collaborator

JPapir commented Sep 2, 2024

All right, I see the link with the first post now. Then we can try this. I am not very familiar with this part of Qumin, though. If I understand correctly, this will only work if the non-segmental information is written in a way that looks like a segment in the phon_form (which is the case for ː, but maybe not in every case, for instance if someone writes a tone with a diacritic: â or á).
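
The diacritic concern can be illustrated with Unicode normalization. In the sketch below, the inventory (two segments plus a combining acute accent for tone) is invented for the example, and whether Qumin normalizes its input is not stated in this thread.

```python
import re
import unicodedata

# Hypothetical inventory: "b", "a", and a combining acute accent (U+0301) for tone.
segmenter = re.compile("b|a|\u0301")

composed = unicodedata.normalize("NFC", "ba\u0301")   # "bá", with á as the single code point U+00E1
decomposed = unicodedata.normalize("NFD", composed)   # "b", "a", U+0301

tokens_composed = segmenter.findall(composed)      # the precomposed á matches nothing
tokens_decomposed = segmenter.findall(decomposed)  # the tone comes out as its own token

print(len(tokens_composed), len(tokens_decomposed))  # 1 3
```

So a tone written as a precomposed character is only found by such a re-parse if the form is decomposed first.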

@JPapir
Collaborator

JPapir commented Sep 2, 2024

I guess that the best solution would be to make Qumin itself tier-compatible, but this would probably be way too much (useless) work on the pattern module.

@XachaB
Owner Author

XachaB commented Sep 2, 2024

At least, that's not a short term goal :)
