[data reading] Qumin always re-segments wordforms, even in the presence of spaces #35

XachaB · 2024-09-02T13:05:19Z

The behaviour is that of Qumin V.1: always re-segment. However, it might be better to make it possible to respect the given segmentation. Unfortunately, Qumin has quirks regarding phonology, and the segmentation needed tends to be different from that which I use in later datasets (where I write tiered information, such as length, tones, stress, on the syllable's vowel).

Note that all Paralex datasets now MUST have space separation:

"The value of the phon_form MUST be a sequence of space-separated segments, e.g. not "dominoːrum" but "d o m i n oː r u m"."

I think we need the following cases:

There are no spaces => Throw an error, not Paralex compliant (although we do have the means to still parse things... should we?)
There are spaces, by default, split on spaces
There are spaces, but a user-defined config parameter asks to re-split: re-split (current behaviour)

@JPapir: what do you think? Is that reasonable?

The line doing the splitting is here:

Qumin/src/qumin/representations/segments.py, line 44 in 2ea1782:

tokens = Inventory._segmenter.findall(string)

XachaB · 2024-09-02T13:18:55Z

Actually, when there is no space, as long as resegment=True, I think we shouldn't complain and just re-segment.
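The cases listed above, with this amendment, could be sketched roughly as follows. This is not Qumin's actual code: the function name `segment_forms`, its signature, and the error message are assumptions for illustration; only the `resegment` flag name and the regex-style segmenter come from the discussion.

```python
import re


def segment_forms(forms, resegment=False, segmenter=None):
    """Split each phon_form into a list of segments (hypothetical helper).

    forms: iterable of phon_form strings.
    resegment: when True, discard the given spaces and re-split every form.
    segmenter: compiled regex matching one segment (required if resegment).
    """
    forms = list(forms)
    if resegment:
        if segmenter is None:
            raise ValueError("resegment=True needs a segmenter")
        # Explicit request: re-split, whether or not spaces are present.
        return [segmenter.findall(form.replace(" ", "")) for form in forms]
    # Heuristic: a one-segment form legitimately has no space, so the check
    # looks at the whole column rather than at each form.
    if forms and not any(" " in form for form in forms):
        raise ValueError(
            "phon_form has no spaces (not Paralex compliant); "
            "pass resegment=True to re-segment"
        )
    # Default: respect the given segmentation.
    return [form.split() for form in forms]


segmenter = re.compile("aː|a|b")
print(segment_forms(["b aː b a"]))
print(segment_forms(["baːba"], resegment=True, segmenter=segmenter))
```

The point of the design is that re-segmentation only ever happens when the user asks for it; without the flag, unspaced data is rejected rather than silently repaired.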

Currently, the behaviour also ignores any unknown segments silently, which looks like a good way to get bugs (unless one has forced a segcheck). I think I also need to remove that.

JPapir · 2024-09-02T14:49:05Z

"Currently, the behaviour also ignores any unknown segments silently, which looks like a good way to get bugs (unless one has forced a segcheck). I think I also need to remove that."

Normally, Paralex already ensures that this doesn't happen, but maybe it would be better to throw an exception.
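To illustrate why the silent behaviour is risky: a regex `findall` simply skips any text its pattern cannot match. The sketch below assumes the segmenter is a compiled alternation over the inventory (suggested by the `_segmenter.findall` call, but not confirmed here); `make_segmenter`, `segment_lenient` and `segment_strict` are hypothetical names.

```python
import re


def make_segmenter(inventory):
    """Compile an alternation over the inventory, longest segments first."""
    if not inventory or "" in inventory:
        raise ValueError("inventory must contain non-empty segments")
    ordered = sorted(inventory, key=len, reverse=True)
    return re.compile("|".join(re.escape(seg) for seg in ordered))


def segment_lenient(string, segmenter):
    # findall skips over unmatched text, so unknown segments vanish silently.
    return segmenter.findall(string)


def segment_strict(string, segmenter):
    # Match segment by segment and fail on the first unknown character.
    tokens = []
    pos = 0
    while pos < len(string):
        match = segmenter.match(string, pos)
        if match is None:
            raise ValueError(
                f"unknown segment {string[pos]!r} at position {pos} in {string!r}"
            )
        tokens.append(match.group())
        pos = match.end()
    return tokens


seg = make_segmenter(["b", "a"])
print(segment_lenient("baxba", seg))  # the "x" is dropped without warning
try:
    segment_strict("baxba", seg)
except ValueError as err:
    print(err)
```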

"There are no spaces => Throw an error, not Paralex compliant (although we do have the means to still parse things... should we?)"

We usually throw an error when a dataset is not Paralex compliant, so we should probably be consistent with that. But, if we do...

"Actually, when there is no space, as long as resegment=True, I think we shouldn't complain and just re-segment."

...Then I agree that it should be done only if an explicit keyword is given. That way, the user will always be aware of what is going on behind the scenes.

"There are spaces, by default, split on spaces"

I hadn't noticed that this wasn't the case, but we should definitely do that.

"There are spaces, but a user-defined config parameter asks to re-split: re-split (current behaviour)"

Why not. I do not see a use case, but since this is already implemented, it is probably easy to keep.

XachaB · 2024-09-02T14:53:46Z

The use case is this: in recent datasets, I mark any non-segmentals (tone, stress, length) directly on a segment (not separated by spaces), but my sound inventory neatly defines these on separate rows. More recent software is able to parse this format, e.g. "b aː b a" with an inventory of three sounds "b", "a", and "ː" (non-segmental). But Qumin would only work with an inventory of "b", "a", "aː" (and then a long version of every sound). However, with the first inventory, Qumin can indeed work, if it re-parses "baːba" using the list of sounds "b", "a", "ː", and finds "b a ː b a".
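The example can be reproduced with a few lines of Python. This is only a sketch of the idea, assuming a longest-match regex over the inventory; `reparse` is a hypothetical helper, not a Qumin function.

```python
import re


def reparse(form, inventory):
    """Drop the given spaces and re-split the form using the inventory."""
    # Longest entries first, so that "aː" is preferred over "a" when both exist.
    ordered = sorted(inventory, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(seg) for seg in ordered))
    return pattern.findall(form.replace(" ", ""))


# Inventory with length as a separate (non-segmental) row.
print(reparse("b aː b a", ["b", "a", "ː"]))   # ['b', 'a', 'ː', 'b', 'a']
# Inventory with a dedicated long vowel.
print(reparse("b aː b a", ["b", "a", "aː"]))  # ['b', 'aː', 'b', 'a']
```

The same input therefore yields a different segmentation depending on the inventory, which is why respecting the given spaces and re-segmenting are not interchangeable.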

JPapir · 2024-09-02T14:59:59Z

All right, I see the link with the first post now. Then we can try this. I am not very familiar with this part of Qumin though. If I understand correctly, this will only work if the non-segmental information is marked in a way that looks segmental in the phon_form (which is the case for ː, but maybe not in all cases, for instance if someone writes a tone with a diacritic: â or á).
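The diacritic case can be made concrete with Python's standard `unicodedata` module. A precomposed "á" is a single code point, so an inventory listing "a" and a combining acute accent separately cannot match it unless the form is first decomposed (NFD). Whether Qumin normalizes its input is not stated in this thread; the snippet only illustrates the Unicode side of the problem, with a made-up inventory.

```python
import re
import unicodedata

precomposed = "b\u00e1"  # "bá" with á as one code point (U+00E1)
decomposed = unicodedata.normalize("NFD", precomposed)  # a + U+0301

# Hypothetical inventory: "b", "a", and the combining acute accent as a tone.
pattern = re.compile("|".join(re.escape(s) for s in ["b", "a", "\u0301"]))

# Precomposed: the á matches nothing in the inventory and is skipped.
print(pattern.findall(precomposed))
# Decomposed: the vowel and the tone diacritic come out as separate tokens.
print([unicodedata.name(c) for c in pattern.findall(decomposed)])
```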

JPapir · 2024-09-02T15:01:04Z

I guess that the best solution would be to make Qumin itself tier-compatible, but this would probably be way too much (useless) work on the pattern module.

XachaB · 2024-09-02T15:01:56Z

At least, that's not a short term goal :)

XachaB mentioned this issue Sep 2, 2024

Re-segment explicitly when user passes config variable #36

Open
