
Stdin.readByte is broken. #17

Open
mhermier opened this issue Nov 2, 2018 · 39 comments · May be fixed by #111

Comments

@mhermier
Contributor

mhermier commented Nov 2, 2018

Hi,
When using Stdin.readByte, it incorrectly reads bytes depending on the value of the previously read byte.
If you look at the code:

static readByte() {
  return read_ {
    // Peel off the first byte.
    var byte = __buffered.bytes[0] // This operates on bytes
    __buffered = __buffered[1..-1] // This operates on UTF8 strings
    return byte
  }
}

This means that for every byte value > 127, the whole UTF-8 character is skipped instead of a single byte.
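
For concreteness, here is a hypothetical session illustrating the effect (assuming stdin receives "é!", i.e. the bytes 0xC3, 0xA9, 0x21; this example is not taken from the issue itself):

import "io" for Stdin

System.print(Stdin.readByte()) // 195 (0xC3), the first byte of "é"
System.print(Stdin.readByte()) // expected 169 (0xA9), but prints 33 ("!"):
                               // the [1..-1] slice dropped the rest of "é"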

@ruby0x1 ruby0x1 transferred this issue from wren-lang/wren Sep 18, 2019
@joshgoebel
Contributor

What should the correct behavior here be with regards to readLine then? It seems there is a conflict... if you read a single byte of a 4-byte UTF-8 character... what is readLine to do if you call it next? Your "position" in the buffer is now invalid for returning lines... the next 3 bytes are invalid...

@joshgoebel
Contributor

One idea would be for readByte to have a mini-buffer of its own... such that readByte always pulls a full UTF-8 character off the buffer and subsequent calls return the remaining bytes... but if someone were to alternate willy-nilly between readByte and readLine, one could still lose data.

@mhermier
Contributor Author

readLine should technically return a buffer no matter what encoding errors occur in it. Strictly speaking, UTF-8 encoding is supported in a String but not required. If we are in the middle of a UTF-8 character, operator [_] should handle it somehow (or bomb, which would be saner than what we currently have).

@joshgoebel
Contributor

joshgoebel commented Apr 25, 2021

If we are in the middle of a UTF-8 character, operator [_] should handle it somehow (or bomb, which would be saner than what we currently have).

You mean add the support to String itself? Right now I can't even see any easy way of truncating the string and cutting off a portion of a character... it seems all the indexing stuff is based on full characters.

I could definitely see the path to making readLine throw an error if readByte was in the "middle" of a character though but then it almost feels like you need some API to check the status of the stream/buffer...

@ruby0x1
Member

ruby0x1 commented Apr 25, 2021

If you mean indexing on String, indices are bytes. See the description here.

@mhermier
Contributor Author

To solve the original problem, the proper solution would be to add range support to StringByteSequence.[_] or do some proper buffering for efficiency.

@mhermier
Contributor Author

@ruby0x1 The doc is correct, but the undocumented range implementation is not byte indexed...
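
For illustration, a minimal sketch of the mismatch (using the string from test/core/string/subscript_range.wren, quoted later in this thread; the expected outputs follow the behavior described above):

var s = "søméஃthîng"
System.print(s.bytes[2]) // 184: single-index byte access, as documented
System.print(s[2..6])    // "méஃ": the range slice snaps to whole UTF-8
                         // characters instead of returning raw bytes 2..6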

@joshgoebel
Contributor

joshgoebel commented Apr 25, 2021

If you mean indexing on String, indices are bytes.

Yes, right, I read that... but the behavior around "half-characters" is what's causing this bug.

__buffered = __buffered[lineSeparator + 1..-1]

Trying to advance the buffer by a single index seems to eat forward to the next whole UTF-8 character, so that results in bytes being dropped. I already have a fix if you assume that one doesn't use both readByte and readLine, which seems problematic.

the proper solution would be to add range support to StringByteSequence.[_]

How would that help here? It seems (I suppose I'm assuming) that readLine expects to work with UTF-8 [strings] while readByte expects to work with bytes... The problem seems to be a desire to hold a cursor at TWO different places in the buffer concurrently.

// add some UTF-8 characters to the buffer
Stdin.readByte() 
// we're now in the "middle" of a UTF-8 character
Stdin.readLine() 

What is readLine() expected to return here exactly? I was suggesting an error but then it would seem we need to add an API to detect this or advance to the next full-character so that user land could handle this, no?

If it could return the broken string at the exact correct position that might be one answer, but I'm not sure if that's possible or what that would look like.

Is Fiber.abort what one would use here to raise an error?
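
For context, Fiber.abort(_) is how Wren code raises a runtime error, and a caller can trap it with try. A minimal sketch, with a made-up message:

var fiber = Fiber.new {
  Fiber.abort("readLine called in the middle of a UTF-8 character")
}
fiber.try()               // the abort is captured instead of crashing the program
System.print(fiber.error) // readLine called in the middle of a UTF-8 character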

@ruby0x1
Member

ruby0x1 commented Apr 25, 2021

@mhermier On the same page; I documented it recently, but the fact that ranges are not indexed in bytes should be documented too.

[screenshot of the String subscript documentation]

@mhermier
Contributor Author

mhermier commented Apr 25, 2021

@ruby0x1 Test it with Unicode characters; the results should be broken, being based on UTF-8 indexes rather than bytes.

@mhermier
Contributor Author

mhermier commented Apr 25, 2021

EDIT: Previous message contained a broken test.

The following patch should conform better to the byte-indexing specification. I can make a branch if required.

commit 5baaffacb4909c83ea9302c8feed582d101e2600
Author: Michel Hermier <michel.hermier@gmail.com>
Date:   Sun Apr 25 21:58:31 2021 +0200

    wren/core: Fix `String` range subscript so that it conforms to bytes indexing requirement.

diff --git a/src/vm/wren_value.c b/src/vm/wren_value.c
index 2700c4def..323517034 100644
--- a/src/vm/wren_value.c
+++ b/src/vm/wren_value.c
@@ -724,26 +724,12 @@ Value wrenNewStringLength(WrenVM* vm, const char* text, size_t length)
 Value wrenNewStringFromRange(WrenVM* vm, ObjString* source, int start,
                              uint32_t count, int step)
 {
+  ObjString* result = allocateString(vm, count);
   uint8_t* from = (uint8_t*)source->value;
-  int length = 0;
-  for (uint32_t i = 0; i < count; i++)
-  {
-    length += wrenUtf8DecodeNumBytes(from[start + i * step]);
-  }
-
-  ObjString* result = allocateString(vm, length);
-  result->value[length] = '\0';
-
   uint8_t* to = (uint8_t*)result->value;
   for (uint32_t i = 0; i < count; i++)
   {
-    int index = start + i * step;
-    int codePoint = wrenUtf8Decode(from + index, source->length - index);
-
-    if (codePoint != -1)
-    {
-      to += wrenUtf8Encode(codePoint, to);
-    }
+    to[i] = from[start + i *step];
   }
 
   hashString(result);
diff --git a/test/core/string/subscript_range.wren b/test/core/string/subscript_range.wren
index f4d6cd298..6c8e4ddb9 100644
--- a/test/core/string/subscript_range.wren
+++ b/test/core/string/subscript_range.wren
@@ -42,12 +42,13 @@ System.print("abc"[3..-1] == "") // expect: true
 // Bytes:           11111
 //        012345678901234
 // Chars: sø mé ஃ  thî ng
+System.print("søméஃthîng".bytes.join(", ")) // expect: 115, 195, 184, 109, 195, 169, 224, 174, 131, 116, 104, 195, 174, 110, 103
 System.print("søméஃthîng"[0..3]) // expect: søm
 System.print("søméஃthîng"[3...10]) // expect: méஃt
 
 // Only includes sequences whose first byte is in the range.
-System.print("søméஃthîng"[2..6]) // expect: méஃ
-System.print("søméஃthîng"[2...6]) // expect: mé
-System.print("søméஃthîng"[2...7]) // expect: méஃ
+System.print("søméஃthîng"[2..6].bytes.join(", ")) // expect: 184, 109, 195, 169, 224
+System.print("søméஃthîng"[2...6].bytes.join(", ")) // expect: 184, 109, 195, 169
+System.print("søméஃthîng"[2...7].bytes.join(", ")) // expect: 184, 109, 195, 169, 224
 
 // TODO: Strings including invalid UTF-8.

@joshgoebel
Contributor

Personally, I find the string tests easier to read than the byte-value tests... though from the tests I can't tell that any behavior changed here? It looks like it's doing exactly what it was doing before?

@ruby0x1
Member

ruby0x1 commented Apr 25, 2021

It changed indexing with a range (substring) to use bytes instead of whole characters. A single character can be multiple bytes; the old behavior counted those whole characters, the change doesn't.

@mhermier
Contributor Author

It is completely different, since some slices now produce invalid UTF-8 strings, which makes the test suite explode.

@joshgoebel
Contributor

joshgoebel commented Apr 25, 2021

Yeah, I think I see now; it just was very confusing that the entire format of the tests also changed, so you can't look at the test diff at a glance and see the behavioral difference.

I assume that previously, in this specific case, the beginning bytes were "rounded up" to the next whole character and the end bytes were as well (that's how I'm reading it).

@joshgoebel
Contributor

joshgoebel commented Apr 25, 2021

So this patch would actually resolve the whole issue here I think... but PERHAPS introduce the problem where someone calling readLine would blow up because they retrieved invalid UTF-8... (I'm not sure what the behavior of indexOf is in this scenario) and that would be a problem for the user?

And can + on a String still operate if the original string is not valid UTF-8? I.e., can we still append the buffer here?

@mhermier
Contributor Author

Yes, and a little more than that: if there were legitimate invalid characters in the middle of the sequence, they would previously have been silently removed.

The change does break some existing code, since we no longer trim invalid characters, so the next valid character is no longer reached at index 1...

+ should be safe; it does some printf magic and does not really care about invalid bytes, since it blindly memcpys the Wren String content.

@mhermier
Contributor Author

However, where previously it was possible to reverse UTF-8 strings using negative range steps, that now only works with 7-bit ASCII strings...

@joshgoebel
Contributor

if there were legitimate invalid characters in the middle of the sequence, they would previously have been silently removed.

I don't follow this and I'm not sure what a "legitimate invalid char" is. :)

@mhermier
Contributor Author

Legitimate as in: you use an invalid UTF-8 character on purpose as a separator between two strings, in a protocol.

Well, while this fixes things somewhat, the fact that we have byte indexing on String is not really logical to me; it would be more logical if this were accessible through StringByteSequence.

@joshgoebel
Contributor

the fact that we have byte indexing on String is not really logical to me

Agreed, it's more than a tad bit confusing.

@mhermier
Contributor Author

Rethinking it, maybe we should keep the old behavior unchanged and provide the new one via the StringByteSequence operator [_]. The previous behavior allowed reversing the letters of a string, so even if the indexing was odd it had some uses (even if some tests were missing).
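
For example, a hedged sketch of what the old character-aware range allowed (output inferred from the pre-patch C code quoted above, not re-verified):

// Old behavior: a negative range step walked characters, so it could reverse
// a string without splitting multi-byte characters.
System.print("søm"[-1..0]) // møs
// With pure byte indexing, the same slice reverses the raw bytes and breaks
// the two-byte "ø" in the middle.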

@mhermier
Contributor Author

Argh, we are doomed: StringByteSequence [_] returns a Num... I have to sleep on this and see if an idea pops up.

@joshgoebel
Contributor

and provide the new one via the StringByteSequence operator [_]

Doesn't that already work? The problem this issue is raising is that StringByteSequence is being used for readByte, whereas the String (with its indexing behavior) is being used to advance the cursor.

readByte itself can be fixed by buffering each character and only advancing after all the bytes in the character are read:

  static readByte() {
    return read_ {
      if (__char == null) {
        __offset = 0
        __char = __buffered[0]
        __max = __char.bytes.count
      }
      // Peel off the next byte.
      var byte = __char.bytes[__offset]
      __offset = __offset + 1

      if (__offset == __max) {
        __char = null
        __buffered = __buffered[1..-1]
      }

      return byte
    }
  }

But then you're left with specifying what the correct behavior of readLine should be... or readCodePoint as I suggested also.

@mhermier
Contributor Author

That does not work: the StringByteSequence operator [_] does not work with a range, because the single-index version returns the byte as a Num.

Yup, this is a mess from the start; I don't know what to think. As it is, your code only works on valid UTF-8. For correctness and efficiency you should store __buffered.bytes and peel off bytes from it.
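
A minimal, self-contained sketch of that idea (the ByteBuffer class below is hypothetical and not part of wren-cli; Stdin would additionally need its blocking read_ loop and foreign input plumbing):

// Hypothetical sketch: buffer raw byte values in a List and peel them off
// directly, so readByte never depends on the string's UTF-8 structure.
class ByteBuffer {
  construct new() { _bytes = [] }

  // Append incoming data as raw byte values, whatever its encoding.
  append(string) {
    for (b in string.bytes) _bytes.add(b)
  }

  // Pop exactly one byte (a Num in 0..255).
  readByte() { _bytes.removeAt(0) }

  // Pop everything up to the next "\n" (byte 10), rebuilt as a string.
  readLine() {
    var nl = 0
    while (nl < _bytes.count && _bytes[nl] != 10) nl = nl + 1
    if (nl == _bytes.count) return null // no complete line buffered yet
    var line = _bytes[0...nl].map {|b| String.fromByte(b) }.join()
    _bytes = _bytes[(nl + 1)..-1]
    return line
  }
}

var buf = ByteBuffer.new()
buf.append("é!\n")
System.print(buf.readByte()) // 195: the first byte of "é", nothing is lost
System.print(buf.readLine()) // the remaining bytes rebuilt as a string
                             // (invalid UTF-8 here, since "é" was split)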

@joshgoebel
Contributor

joshgoebel commented Apr 25, 2021

The StringByteSequence operator [_] does not work with a range, because the single-index version returns the byte as a Num.

I guess I'm not sure why we need it to? It would really help if we backed up and first specified the correct/desired behavior here; otherwise we're just chasing our tails without end. How are readByte and readLine intended to interact in this edge case? My assumptions:

  • readByte always returns the next byte from the string; encoding is irrelevant
  • readLine returns native UTF-8 strings, because that's our native String storage implementation

For correctness and efficiency you should store __buffered.bytes and peel off bytes from it.

You're talking about the case where our input is just random gibberish? That this sort of relies on the input being UTF-8 at core...? I went with that because I feel like the underlying string implementation does that anyway, but I guess that's not 100% true if + is a memcpy...

So... let's go back and figure out the correct behavior first. I get the impression the String semantics of Wren were designed with intention (even if they are confusing) and that the correct solution here may not be to just break String, effectively turning it into a byte sequence. :-) So:

// first, input some UTF-8 characters into the stdin buffer
// then
Stdin.readByte() 
// we're now in the "middle" of a UTF-8 character
Stdin.readLine() 

What is readLine() expected to return here exactly? Or should it error? Off the top of my head, I feel like it should error, and someone wanting to mix UTF-8 and bytes (building a protocol layer, etc.) should do so very carefully. It seems more common that one would be reading most input via bytes or lines, rather than mixing.

@mhermier
Contributor Author

mhermier commented Apr 26, 2021 via email

@mhermier
Contributor Author

mhermier commented Apr 26, 2021 via email

@joshgoebel
Contributor

I may have to bow out of this one. I thought I could solve it by storing a list of strings/packets as they arrive, but you end up with all sorts of edge cases... what if a line is spread across multiple packets, what if a code point is? You can lazily join the packets before every read, but how many packets? Worst case, you might scan the WHOLE input stream looking for a line ending you never find. And compiling strings into bigger ones just makes it harder to advance the cursor, since we have no great way to do that one byte at a time... (while also GCing the old data you've moved past)

The simplest thing I tried was just storing the stream as a List of bytes... but this is slow because of all the conversions and also the cost of removing items from the beginning of a list. You might say that doesn't matter for typing in the REPL, but if you wanted to do string processing on a huge input file you were piping in, suddenly it would matter a lot.

The fastest (and simplest) way to do this might just be using a circular buffer on the C side... store the data as literal bytes in RAM, append as needed... never any need to copy, transform, convert to or from lists. "Truncating" the beginning of the buffer is instant as it's just updating the head to point to the new location in the buffer. The most expensive operation then would become readCodePoint which requires building UTF-8 characters/codes.

Thats my thought for now. Happy to discuss further with anyone who wanted to pick this up and run with it.


If we only care about solving the exact issue filed here you could accomplish that with:

__buffered = __buffered.bytes[1..-1].map { |x| String.fromByte(x) }.join() 

This would be far more correct behavior... at the cost of rebuilding the whole buffer every time a byte is read...

@mhermier
Contributor Author

I think we can reduce the problem by giving readLine() a maximum size, so in code it would be something like:

readLine() { readLine(1024) }
readLine(maxSize) { ... }

That way you put an upper bound on lookup and buffering.

Making a circular buffer would also require such an upper bound as the size of the buffer. So API-wise it would not change anything compared to the naive implementation. We can go with the naive implementation for now to make it work, and if it really becomes a bottleneck we can move to a dedicated ring buffer.

@joshgoebel
Contributor

Scoping readLine indeed helps reduce the worst case on one of the edge cases. :-)

Making a circular buffer would also require such an upper bound as the size of the buffer.

Of course, but the bigger win is in all the other things you'd get for free. Anyway, I just burnt out on this one. So many other things I can better contribute to. Looking forward to seeing if this can truly be solved simply and, if so, how well that will hold up over time. :-)

@joshgoebel
Contributor

__buffered = __buffered[1..-1]

Ruby, out of curiosity, just how performant was the original to begin with? Is it smart enough to do a memcpy of the original string when given a range? I think because of UTF-8 it has to walk the whole string character by character, yes?

If so then perhaps the "simple fix" posted above might not even be that terrible of a performance regression.

@mhermier
Contributor Author

The performance is poor, as it is only really used in the CLI. It is not really important for it to be efficient for now, since most of the effort is still on the VM, but we still need something functional for the CLI to showcase.

@joshgoebel
Contributor

Well, if it was already quite slow as-is, then I'd recommend we just patch it as I suggested above and close out this issue. It will be fine enough either way for reading CLI input from the keyboard.

I can still make a PR if I can get confirmation that everyone is on board with that solution.

@ruby0x1
Member

ruby0x1 commented Apr 30, 2021

Slow is relative, it's probably not slow in the sense you're expecting. It's not the fastest it possibly could be but that doesn't mean performance is poor.

@joshgoebel
Contributor

joshgoebel commented Apr 30, 2021

Slow is relative, it's probably not slow in the sense you're expecting.

Actually, it is slow in the sense I'm expecting. 🙃 I wasn't suggesting the performance was inadequate. I meant "in an absolute sense" (compared to optimized C code), not whether it's "fast enough" practically speaking.

My point was only that IF it was already super optimized on the C side, I'd be more hesitant to drop in a change that made it 20x slower. BUT since that's not the case, I'll again suggest the tiny patch of just converting to bytes, shifting the data, and converting back to a string.

Keep in mind the actual bug/issue here is that we're reading bytes but shifting by UTF-8 characters. I feel like I may have gone down a rabbit hole and expanded the scope of the discussion too much originally, versus fixing the issue and perhaps creating follow-ups for the larger concerns.

The bad code:

var byte = __buffered.bytes[0] // This operates on bytes
__buffered = __buffered[1..-1] // This operates on UTF8 strings

And the proposed fix:

__buffered = __buffered.bytes[1..-1].map { |x| String.fromByte(x) }.join()

Do you have any specific thoughts on that as a fix?

@mhermier
Contributor Author

mhermier commented May 1, 2021

Maybe we could add a String.bytesAt(index_or_range) or similar, which would return a new string but operate on bytes?
It would not solve everything, but it would allow fast byte-array manipulation and would have a low impact on the API.
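
A hypothetical usage sketch (bytesAt does not exist; the results assume it slices by raw byte offset with no UTF-8 snapping, and that a single index yields a one-byte string):

var s = "søméஃthîng"
s.bytesAt(2)    // would be the single byte 184 as a one-byte string
s.bytesAt(2..6) // would be bytes 184, 109, 195, 169, 224 as a
                // (possibly invalid UTF-8) string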

@joshgoebel
Contributor

I think that would certainly make some sense, though what to name it is a question. bytesAt doesn't make me think I'm getting back a string.

@mhermier
Contributor Author

mhermier commented May 1, 2021

Well, the thing is that I can't find a more sensible name for it, and I have nothing more to propose for now.

The other possibility would be to make it a StringByteSequence method, but that would cause a double allocation (the String and the new StringByteSequence).
