Stdin.readByte is broken. #17
What should the correct behavior here be with regards to …?
One idea would be for readByte to have a mini-buffer of its own... such that readByte always pulls a full UTF-8 character off the buffer and subsequent calls return the remaining bytes... but if someone were to alternate willy-nilly between readByte and readLine…
You mean add the support to …? I could definitely see the path to making readLine throw an error if readByte was in the "middle" of a character, but then it almost feels like you need some API to check the status of the stream/buffer...
If you mean indexing on …
To solve the original problem, the proper solution would be to add range support to …
@ruby0x1 The doc is correct, but the undocumented range implementation is not byte indexed...
Yes, right, I read that... but the behavior around "half-characters" is what's causing this bug.
Trying to advance the buffer by a single index seems to eat forward to the next whole UTF-8 character, so bytes end up being dropped. I already have a fix if you assume that one doesn't use both readByte and readLine, which seems problematic.
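For illustration, assuming the pre-patch range behavior captured in the subscript_range.wren tests quoted later in this thread, advancing past one byte of a multi-byte character silently drops its continuation byte:

var s = "øa"                        // bytes: 195, 184, 97
System.print(s.bytes.count)         // 3
System.print(s[1..-1])              // "a" -- the continuation byte 184 is silently dropped
System.print(s[1..-1].bytes.count)  // 1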
How would that help here? It seems (I suppose I'm assuming) that…

// add some UTF-8 characters to the buffer
Stdin.readByte()
// we're now in the "middle" of a UTF-8 character
Stdin.readLine()

What is readLine supposed to return at this point? If it could return the broken string at the exact correct position, that might be one answer, but I'm not sure if that's possible or what that would look like. Is …
@mhermier On the same page, I documented it recently, but the fact that it's not indexed in bytes should be documented too.
@ruby0x1 Test it with Unicode characters; the results should be broken, and based on UTF-8 indexes:

…
EDIT: The previous message contained a broken test. The following patch should be more conformant to the byte indexing specification. I can make a branch if required.

commit 5baaffacb4909c83ea9302c8feed582d101e2600
Author: Michel Hermier <michel.hermier@gmail.com>
Date: Sun Apr 25 21:58:31 2021 +0200
wren/core: Fix `String` range subscript so that it conforms to bytes indexing requirement.
diff --git a/src/vm/wren_value.c b/src/vm/wren_value.c
index 2700c4def..323517034 100644
--- a/src/vm/wren_value.c
+++ b/src/vm/wren_value.c
@@ -724,26 +724,12 @@ Value wrenNewStringLength(WrenVM* vm, const char* text, size_t length)
Value wrenNewStringFromRange(WrenVM* vm, ObjString* source, int start,
uint32_t count, int step)
{
+ ObjString* result = allocateString(vm, count);
uint8_t* from = (uint8_t*)source->value;
- int length = 0;
- for (uint32_t i = 0; i < count; i++)
- {
- length += wrenUtf8DecodeNumBytes(from[start + i * step]);
- }
-
- ObjString* result = allocateString(vm, length);
- result->value[length] = '\0';
-
uint8_t* to = (uint8_t*)result->value;
for (uint32_t i = 0; i < count; i++)
{
- int index = start + i * step;
- int codePoint = wrenUtf8Decode(from + index, source->length - index);
-
- if (codePoint != -1)
- {
- to += wrenUtf8Encode(codePoint, to);
- }
+ to[i] = from[start + i *step];
}
hashString(result);
diff --git a/test/core/string/subscript_range.wren b/test/core/string/subscript_range.wren
index f4d6cd298..6c8e4ddb9 100644
--- a/test/core/string/subscript_range.wren
+++ b/test/core/string/subscript_range.wren
@@ -42,12 +42,13 @@ System.print("abc"[3..-1] == "") // expect: true
// Bytes: 11111
// 012345678901234
// Chars: sø mé ஃ thî ng
+System.print("søméஃthîng".bytes.join(", ")) // expect: 115, 195, 184, 109, 195, 169, 224, 174, 131, 116, 104, 195, 174, 110, 103
System.print("søméஃthîng"[0..3]) // expect: søm
System.print("søméஃthîng"[3...10]) // expect: méஃt
// Only includes sequences whose first byte is in the range.
-System.print("søméஃthîng"[2..6]) // expect: méஃ
-System.print("søméஃthîng"[2...6]) // expect: mé
-System.print("søméஃthîng"[2...7]) // expect: méஃ
+System.print("søméஃthîng"[2..6].bytes.join(", ")) // expect: 184, 109, 195, 169, 224
+System.print("søméஃthîng"[2...6].bytes.join(", ")) // expect: 184, 109, 195, 169
+System.print("søméஃthîng"[2...7].bytes.join(", ")) // expect: 184, 109, 195, 169, 224
// TODO: Strings including invalid UTF-8.
Personally I find the string tests easier to read than the byte value tests... from the tests, though, I can't tell that any behavior changed here? Looks like it's doing exactly what it was before?
It changed indexing with a range (substring) to use bytes instead of whole characters. A single character can be multiple bytes; the old code counted whole characters, the change doesn't.
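Concretely, restating the expectations from the test diff above (nothing new, just the two behaviors side by side):

// "søméஃthîng" -- bytes: 115, 195, 184, 109, 195, 169, 224, 174, 131, 116, 104, 195, 174, 110, 103
// Old behavior: "søméஃthîng"[2..6] == "méஃ" (whole characters whose first byte falls in the range)
// New behavior: the same range yields exactly bytes 2 through 6, which start with a lone continuation byte
System.print("søméஃthîng"[2..6].bytes.join(", ")) // 184, 109, 195, 169, 224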
It is completely different, since some ranges now produce invalid UTF-8 strings that make the test suite explode.
Yeah, I think I see now; it was just very confusing that the entire format of the tests also changed, so you can't look at the test diff at a glance to see the behavior change. I assume that previously, in this specific case, the beginning bytes were "rounded up" and the end bytes were also extended to the next whole character (that's how I'm reading it).
So this patch would actually resolve the whole issue here, I think... but PERHAPS introduce the problem where someone calling … And can …
Yes, and a little bit more than that: if there were legitimate invalid chars in the middle of the sequence, they would previously have been silently removed. The change is breaking some existing code, since now we don't trim invalid characters anymore, and the next valid character is no longer reached at index …
However, while previously it was possible to reverse UTF-8 strings using negative range steps, that now only works with 7-bit ASCII strings...
I don't follow this and I'm not sure what a "legitimate invalid char" is. :)
Legitimate as in: you use an invalid UTF-8 char on purpose to make a split between two strings, in a protocol. Well, while it somewhat fixes things, the fact that we have byte indexing in …
Agreed, it's more than a tad bit confusing.
Rethinking it, maybe we should provide the old behavior unchanged, and provide the new one via …
Argh, we are doomed, …
Doesn't that already work? The problem this issue is raising is that StringByteSequence is being used for …

static readByte() {
  return read_ {
    if (__char == null) {
      __offset = 0
      __char = __buffered[0]
      __max = __char.bytes.count
    }
    // Peel off the next byte.
    var byte = __char.bytes[__offset]
    __offset = __offset + 1
    if (__offset == __max) {
      __char = null
      __buffered = __buffered[1..-1]
    }
    return byte
  }
}

But then you're left with specifying what the correct behavior of … should be.
That does not work: … Yup, this is a mess from the start; I don't know what to think. As is, your code only works on valid UTF-8. For correctness and efficiency you should store …
I guess I'm not sure why we need it to? It would really help if we backed up and first specified the correct/desired behavior here, otherwise we're just chasing our tails without end. How are …
You're talking about the case where our input is just random gibberish? That this sort of relies on the input being UTF-8 at core...? I went with that because I feel like the underlying string implementation does that anyways, but I guess that's not 100% true if … SO... let's go back and figure out the correct behavior first. I get the impression the String semantics of Wren have been designed with intention (even if they are confusing) and that the correct solution here may not be to just break String, effectively turning it into a byte sequence. :-) So:

What is readLine() expected to return here exactly? Or should it error? OTTOMH I feel like it should error, and someone wanting to mix UTF-8 and bytes (building a protocol layer, etc.) should do so very carefully. It seems more common that one would be reading most input via bytes or lines, rather than mixing.
The thing is that there is absolutely no guarantee that any input, even stdin, is always UTF-8, because of file redirection or a broken writer...

As per the original byte container definition, I would say that readLine should return the next bytes regardless of whether we are in the middle of a UTF-8 char. I suspect this is the correct thing to do, since we can also produce broken UTF-8 strings using `\x`...

That also probably means some code will fail because it expects the first valid UTF-8 char to be at index 0, but it seems to be the logical choice of this byte indexing mess...

I don't like the idea, but we will probably be forced to have another string view that enforces UTF-8 encoding, allowing codepoint indexing and proper range access...
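For reference, the `\x` escape mentioned above can already put bytes that are not valid UTF-8 into a string literal:

// 0xC3 announces a two-byte sequence but has no continuation byte after it,
// so this is a perfectly legal Wren string that is not valid UTF-8.
var broken = "a\xc3b"
System.print(broken.bytes.join(", ")) // 97, 195, 98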
That would also mean we should remove the smart decoding of UTF-8 chars from String[Num]. We would be left with String as a glorified byte array, and have some sort of UTF8String view to manage things at the UTF-8 level. It would break some existing code, but it is the only sane way I see to avoid mixing indexing and to make the code easier to reason about.
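A rough sketch of what such a view could look like (the UTF8String name and every detail here are hypothetical; it doesn't validate the encoding, it only layers codepoint access on top of a raw String):

class UTF8String {
  construct new(raw) { _raw = raw }

  // Codepoint-indexed subscript: walk characters from the start, O(n) per access.
  // Iterating a String yields one whole (possibly multi-byte) character at a time.
  [index] {
    var i = 0
    for (char in _raw) {
      if (i == index) return char
      i = i + 1
    }
    Fiber.abort("Index out of bounds.")
  }

  count { _raw.count }
  toString { _raw }
}

System.print(UTF8String.new("søméஃthîng")[4]) // ஃ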
I may have to bow out of this one. I thought I could solve it by storing a list of strings/packets as they arrive, but you end up left with all sorts of edge cases... what if a line is spread across multiple packets? What if a code point is? You can lazily join the packets before every read, but how many packets? Worst case you might scan the WHOLE input stream looking for a line ending you never find. And compiling strings into bigger ones just makes it harder to advance the cursor, since we have no great way to do that one byte at a time... (while also GCing the old data you've moved past).

The simplest thing I tried was just storing the stream as a List of bytes... but this is slow because of all the conversions and also the cost of removing items from the beginning of a list. You might say that doesn't matter for typing in the REPL - but if you wanted to do string processing on a huge input file you were piping in, suddenly it would matter a lot.

The fastest (and simplest) way to do this might just be using a circular buffer on the C side... store the data as literal bytes in RAM, append as needed... never any need to copy, transform, or convert to or from lists. "Truncating" the beginning of the buffer is instant, as it's just updating the head to point to the new location in the buffer. The most expensive operation then would become …

That's my thought for now. Happy to discuss further with anyone who wanted to pick this up and run with it. If we only care about solving the exact issue filed here, you could accomplish that with:

__buffered = __buffered.bytes[1..-1].map { |x| String.fromByte(x) }.join()

This would be far more correct behavior... at the cost of rebuilding the whole buffer every time a character is read...
I think we can reduce the problem by giving a max size to readLine:

readLine() { readLine(1024) }
readLine(maxSize) { ... }

That way you upper-bound lookup and buffering. Making a circular buffer would also require such an upper bound as the size of the buffer. So API-wise, it would not change anything compared to the naive implementation. So we can go with the naive implementation for now to make it work, and if it really becomes a bottleneck, we could move to a dedicated ring buffer.
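A minimal sketch of that shape, assuming read_ keeps retrying the block until it returns something non-null (as the readByte snippet earlier suggests); only the method name and the 1024 default come from the proposal above, everything else is assumption:

static readLine() { readLine(1024) }

static readLine(maxSize) {
  return read_ {
    var newline = __buffered.indexOf("\n")
    // Refuse to buffer without bound: give up once maxSize bytes hold no line break.
    if (newline == -1 && __buffered.bytes.count >= maxSize) {
      Fiber.abort("No line break within %(maxSize) bytes.")
    }
    if (newline == -1) return null
    var line = __buffered[0...newline]
    __buffered = __buffered[(newline + 1)..-1]
    return line
  }
}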
Scoping …? Of course, but the bigger win is in all the other things you'd get for free. Anyways, I just burnt out on this one. So many other things I can better contribute to. Looking forward to seeing if this can truly be solved simply and, if so, how well that will hold up over time. :-)
Ruby, out of curiosity, just how performant was the original to begin with? Is it smart enough to do a memcpy of the original string when given a range? I think because of UTF-8 it has to walk the whole string character by character, yes? If so, then perhaps the "simple fix" posted above might not even be that terrible a performance regression.
The performance is poor, as it was only really used in the CLI. So it is not really important for it to be efficient right now, since most of the effort is still on the VM, but we still need something functional for the CLI to showcase.
Well, if it was already quite slow as-is, then I'd recommend we just patch it as I suggested above and close out this issue. It'll indeed be fine enough either way for reading CLI input from the keyboard. I can still make a PR if I can get confirmation that everyone is on board with that solution.
Slow is relative; it's probably not slow in the sense you're expecting. It's not the fastest it possibly could be, but that doesn't mean performance is poor.
Actually, it is slow in the sense I'm expecting. 🙃 I wasn't suggesting the performance was inadequate. I meant "in an absolute sense" (compared to optimized C code), not whether it's "fast enough" practically speaking. My point was only that IF it was already super optimized on the C side, I'd be more hesitant to drop in a change that then made it 20x slower. BUT since that's not the case, I'll again suggest the tiny patch of just converting to bytes, shifting the data, and converting back to a string. Keep in mind the actual bug/issue here is that we're reading bytes, but shifting by UTF-8. I feel like I may have gone down a rabbit hole and expanded the scope of the discussion too much originally, versus fixing the issue and perhaps creating follow-ups for the larger concerns.

The bad code:

__buffered = __buffered.bytes[1..-1].map { |x| String.fromByte(x) }.join()

Do you have any specific thoughts on that as a fix?
Maybe we could add a …
I think that would certainly make some sense, though what to name it is a question.
Well, the thing is that I can't find a more sensible name for it, and have nothing more to propose for now. The other possibility would be to make it a …
Hi,
When using Stdin.readByte, it incorrectly reads bytes depending on the value of the previously read byte.
If you look at the code:
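(The snippet did not survive the export; the relevant part of Stdin.readByte in the CLI's io.wren looks roughly like this, reconstructed from the rest of the thread.)

static readByte() {
  return read_ {
    // Peel off the first byte...
    var byte = __buffered.bytes[0]
    // ...but advance the buffer with a character-based range subscript, which
    // skips every remaining byte of a multi-byte UTF-8 character.
    __buffered = __buffered[1..-1]
    return byte
  }
}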
Meaning that for every byte value > 127, the whole UTF-8 character is skipped instead of a single byte.