-
Notifications
You must be signed in to change notification settings - Fork 784
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Roadmap #159
Comments
I've been using JSMN in a somewhat memory constrained application. The 16-bit micro has 64kB internal flash, but due to other required libraries, I didn't have a lot of room to spare. So keeping a low-memory footprint in mind is important to me.
Some platforms I've used don't have the non-standard |
So, first of all I'd like you to thank for such nicely done piece of free software. I really appreciated of it. And like ahogen, I also use this parser. Ok, since you're planning to make significant changes to this project, I'd like to post here some thoughts about this:
For my humble opinion you should at least check if this is even printable character in case of regular strings. If in double quotes pair suddenly happens binary garbage - that should not be parsed anyway, welp. From other point of view, this project is not JSON validator, but JSON parser. Maybe it's better to throw away any validation since current validation is not even close to what should be and focus on performance and low memory usage. Then, recommend users to use separate modules for JSON validation, like this: http://www.json.org/JSON_checker/JSON_checker.c |
Well, for the JSON to be parsable it has to be valid so as long as the parser is never wrong then if we are able to parse the thing then we are basically validating it. There's no way out of it. How I think it could work is that the utilities for parsing JSON values as seen by JSMN will have to adhere to RFC-8259 anyway and so they will be able to tell you whether there is a problem with value itself. For example if token is of type string but it contains for example a byte of value 0x05, then the value parser should return an error as this is invalid according to the standard. |
Also I'm feeling the a lot of pressure here as it seems like the original creator of this project either abandoned it or is too busy/tired to work on it and beside him I am the only person that can do anything with this repository. I have ambition of keeping JSMN as great as it is when it comes to parsing speed and memory usage but make it better at actually parsing JSON. Right now in my opinion there are quite a bit parsing bugs in JSMN which mostly come from the fact of how it's built. Right now I have an idea of how JSMN could be slightly modified in a sense of idea of how the parsing happens but it would mean somewhat of a large code changes. And so I thought that why not as well just think of what else could be changed and just make JSMN the best thing out there that's easy to use. |
Currently, if
Dunno. It's free software, people are free to modify it for their own needs.
Hmmm... It reminds me an old joke about having a fast, reliable, strong, and cheap vehicle at same time. I have no idea if you can keep the same performance with all required validation. |
Haven't really done any work myself yet. So far it's all planning and getting the courage to do any changes to this repository. After all I'm not @zserge and this is basically official repository of JSMN. I was given responsibility I haven't exactly asked for :p Feel free to create PR. Although I think the way library works should be discussed to make it as close to perfect without weird trade-offs. But you are right, there are limits to what can be achieved here. |
@pt300 I really like the suggested list of changes. But I want to share my vision of the 'non-strict mode'. Basically, it's not a feature of JSMN, but rather a side-effect of "what would happen if we |
Additionally, I would rather invest into advanced documented supersets of JSON, like JSON5, rather than in the undocumented, barely known non-strict mode. |
What about the ability of estimation of token array length? Considering it works only for specific cases I think it's not a really good feature to keep around, especially that it seems to add more complexity to the code than it's worth. |
In my case I need to open a JSON object from an embedded file system based storage that may potentially be too large (up to ~64kb) to fit into the available RAM (~4 - 8kB), pull certain configuration values from it, then I'm done. I'm not particularly concerned about speed. Allocating RAM space for the token array is doable. I currently have JSMN implemented and working well, and have been getting by though minimising the size of the JSON string (remove whitespace, ensure that only required JSON elements are present, etc) and maximising the heap space available to fit larger objects; however we're starting to feel the pinch as other areas of the code grow and start to demand more RAM. I have been toying with the idea of abstracting the access to the underlying original JSON string so that it can still be backed by a raw array of bytes as it currently is, or alternatively a paged based solution that will only require a portion of the JSON string to be in RAM at once. I'm unsure if this is even possible with the structure of JSMN, and even if it is, it is probably too much of a change to be considered based the 'light weight and fast' ethos (which I very much appreciate). If I do get around to implementing it, I will certainly make the changes available. |
Non-strict mode is important to the extent that JSON implementations differ in the wild (or, at least, to the extent that the "few standards" differ.)
JSMN is simple enough to already be "complete" --- so you should consider forking and modifying as you need, manually merging bugfixes if/when bugs are ever discovered. Something I'd like to see is much clearer documentation of the functions, expectations in error cases, etc. Currently this requires code inspection & experimentation. |
Thank you for your work on jsmn. On my private copy I've added some functionality for easier traversal, in particular all tokens have a "first_child" and a "next_sibling". I added a very simple function to populate these values after parsing (speed is not important to me). And then I can iterate over children of a node by:
I am not asking anything from the jsmn project but if you would like to see the code I would be happy to share (not that it's difficult). |
We needed this for a project I'm on. I went ahead and implemented it. See PR #194. The downside is there are some sacrifices made to non-strict mode (see the PR for details on that).
@zserge This seems reasonable to me. (And is why I did not worry about sacrificing some non-strict functionality for RFC 8259 compliance in the PR above.) The rest of the proposed changes sound very nice. If we do add extra utilities, it would be nice to see them in either a separate header or able to be enabled/disabled with a I'd also like to propose a few additional improvements:
I plan to work on these in the near future and will create additional pull requests as they are ready, in case there is interest. Edit: I've submitted pull requests for single-object mode and more granularity for non-strict macros. I'd be interested in updating the README to add more detail once some of the decisions here are finalized. |
From @pt300's original comments:
Strict mode following RFC-8259 should definitely be the default and any flags should extend this functionality.
Letting parents be the rule rather than the exception would reduce the flags that end users need to use by default to increase speed and the cost of a small amount of memory with
This was the original reason for my fork of jsmn years ago and has become power behind my verification of RFC-8259 as well as rules for extending the base.
This was the second reason behind my original fork; I needed a faster way to get from one sibling to the next. This, combined with parent, makes going to the next and previous sibling very efficient. Down to @dominickpastore's additions
I both agree and disagree with you on this point. It is true that the current implementation parses multiple json objects, but I believe that this should be the exception rather than the rule. Testing against the rules from nst/JSONTestSuite, multiple json objects in a buffer should produce a negative result. So instead of your
This would definitely be helpful. I do like your implementation of
This also needs to be done. When I first started with the examples to try to figure out how to implement jsmn I was lost and it took more trial and error than it should have. I think we should also have a better example implementation. I have an
First I propose changing Along these lines, I think we should also extend I have made some of these changes in PR #197 and I intend to implement a few more. UTF-8 support is high on my list with more granularity following close behind (though the latter would be much faster to implement). |
A few comments:
Adding a field in
I would have preferred single-parsing to be the default as well, but I am hesitant to change this behavior since it would break backward compatibility. That might sound a bit silly when both you and I have already introduced major pull requests that break backward compatibility, but here is my ideology:
This last one is the category multiple-parsing falls into. Granted, there was only one test case I saw that applies to it, which is As for RFC compliance of multiple-parsing: RFC 8259 only defines what makes up a single valid JSON text. The context surrounding it is not discussed at all. Thus, it is totally up to jsmn whether it should parse one object at a time or multiple, if it should care whether there is non-JSON garbage following valid objects, etc. (That does not mean JSONTestSuite is incorrect, it's just operating on different assumptions. If the input is supposed to be a single JSON text, input containing multiple objects is, of course, not that.)
I used
I agree that this behavior is a good idea (see discussion under "Arrays and objects as keys" on PR #194), my point was just that it should be documented. Changing the name feels like a step too far though--see notes above on backward compatibility. |
My thought is that since we are changing the default functionality from non-strict to strict that we were breaking backwards compatibility already. If we are trying to follow RFC 8259 as strictly as possible (pun slightly intended), I think we should go all out and follow it as closely as we can. If the As for int test_simple_json(void) {
check(parse("\"Hello world!\"", 1, 1, JSMN_STRING, "Hello world!", 0));
check(parse("42", 1, 1, JSMN_PRIMITIVE, "42"));
check(parse("true", 1, 1, JSMN_PRIMITIVE, "true"));
return 0;
} Both the "42" and the "true" return For
You're probably right on this one. The other option would be to start with something like the following to have the possibility to phase it out. typedef struct jsmntok {
jsmntype_t type;
int start;
int end;
union {
int size __attribute__((deprecated("use children instead")));
int children;
}
#ifdef JSMN_PARENT_LINKS
int parent;
#endif
} jsmntok_t;
Absolutely. Let us get into specifics over in issue #154. |
That is a fair point.
Okay, I think there are two issues here: 1) Whether multiple-parsing violates RFC 8259 and 2) if it should be the default in jsmn. For (1): RFC 8259 ultimately just specifies a grammar. Something like For (2): My original argument was that, as a behavior RFC 8259 is neutral on, we should preserve it in the interest of backward compatibility. As you point out, backward compatibility is right out the window when we make strict mode the default. As long as we are changing the defaults for some configuration macros, there's little reason not to make single-parsing the default if that is more useful. So, that becomes the question: is single-parsing a more useful default? I would prefer to have wider input than just @themobiusproject and me on that, but I haven't seen much other activity in a while, so that might be wishful thinking. In fact, I see three options for this:
I am biased toward (3) being the default because the flexibility seems very useful (in fact, for my use case, I have a stream of JSON-formatted messages that need to be processed), but I can see the argument for (2): it's perhaps the least "surprising" way for jsmn to behave. Also, while one of them has to be the default, I think it is reasonable to have all three be selectable options.
Unfortunately, anonymous unions aren't allowed under C89, or even C99 :( That said, I still don't think it's a major issue as long as
I want to add a little more follow up on this: Parent links nearly universally improve parsing speed, so if memory constraints aren't a concern, you basically always want them. Sibling links aren't as universal of a benefit: they don't improve parsing at all and only benefit applications with the right use case (needing to traverse arrays or objects but not needing to visit every token within). I bring this up because sibling links seem to be a common request, but I think it's partly due to perceived difficulty of implementing array/object traversal. Without arrays/objects as keys and with the current semantics for /* Assume token is a pointer to the current jsmntok_t */
size_t remaining;
for (remaining = 1; remaining > 0; remaining--) {
remaining += token->size;
token++;
}
/* token now points to the next sibling */ Of course, the performance benefits are not to be ignored, and I suspect the majority of applications will benefit from them, so sibling links should likely be the default. I just wanted to mention that there are non-memory-constrained applications that would have reason to turn off sibling links. |
I think this point is what I disliked about multiple parsing by default; if a later json object fails then everything fails.
Advocating for? Pretty close to that. If given the choice between 1 and 2, I would choose 2.
But then you add option 3... Figuring out if there is more to the buffer that was passed in is simple (comparing the jsmn_parse
I recognize that anonymous unions are an extension of compilers rather than part of a standard, but I had to throw in one more option before allowing
My goodness, that is much more elegant solution than I came up with. (Heading back to my code to make some changes.) I would still want to include a To this point, I have an explode function, which I mentioned earlier, that visually helps people understand the data in the tokens and how to utilize them. I wrote it for a coworker to show how to use the tokens. A short example of the output of this (which uses the supplied library.json) is as follows:
This would be part of the better documentation that is sorely needed. |
I agree, I found this to be difficult to work with, even if there are multiple objects to be parsed.
It's almost that simple...if there is trailing whitespace, this won't work. I was thinking about this while I wrote that last night and I thought of two ways to implement it:
It's worth mentioning, this isn't the only sticking point for unenclosed primitives. They are also troublesome for incremental parsing. Suppose we have the following two JSON entities:
If they happen to get split when an application is doing incremental parsing:
the first results in
I think making it the default is reasonable. I think a lot of applications will benefit from the performance boost. And as straightforward as the above is, it's still not as easy as Also worth mentioning: traversing backward is not so easy. It's not as common a need, but should there also be
I agree. (This is also one of the things originally proposed at the top of this thread, of course.) That snippet above is straightforward, but it's not immediately obvious (though it might be more obvious with better documentation for I'd like to see the following in a
That is incredibly useful. That should absolutely be part of the documentation. |
Leading whitespace shouldn't be an issue either. I was planning on implementing it so when
For the previous sibling I go to the toksuper's first child and follow token->next_sibling until the next sibling is the token I started at. This is essentially what I do when I am figuring out next sibling anyways.
I have a good key lookup function that I would be happy to share and compare with what you have. I am getting utf8 implemented to also go from the escaped \u format to utf{8,16,32} and back again. My parser flags tokens as
I'm glad you like it. It has also really helped me debug jsmn as I have been modifying it. |
Thinking about it more I see your point; if you reach the end of valid json object but there is only whitespace left in the buffer, then you could misinform an application saying that there is more data to be parsed. To fix this issue I put a loop after a successful json object has been found right before the return count: while (parser->pos < len && isWhitespace(js[parser->pos])) {
parser->pos++;
} |
Ack on some single object mode! We're abusing it currently by setting This "works" at cost of wasted cycles if there is more than one object, since we discard the buffer covered by the first token and try again (this is a JSON command stream). |
Here I'd like present my suggestions for what I believe could be done to improve quality of JSMN. Everyone should feel free to comment on this and give own suggestions. Also, I will need opinion from @zserge on this.
Split strict and non-strict parsing code
Idea behind it is that this would make it easier to maintain both versions without having to overcomplicate the code and to eliminate risk of breaking strict version when trying to improve non-strict one
Make strict mode the default
Considering that JSMN is mostly aiming at parsing JSON strings and not it's abbreviations the library should do it by default.
Enable parent links by default
Parent links feature enables a great speed boost at a quite low memory cost. I believe a lot of people aren't using this library on really memory constrained platforms. But if there's a need for it the user should just disable parent links. It could be probably nice to rename the define to something like JSMN_LOW_MEMORY.
Strict mode should follow RFC-8259
There are a few standards with slight differences for JSON. I believe it should be explicitly said that we would follow RFC-8259 as THE standard for our JSON parsing.
Maybe create a package for Arduino
Considering the fact JSMN aim at platforms that are similar to what Arduino offers I believe it would be nice to either offer this library as is to Arduino community or separately write a wrapper to make access to this library more C++ like.
Bit flags for token types Make jsmntype_t values bit flags #108
Small change that would make it easier to write parsing code in certain places.
Utilities
Token array traversing
Using JSMN in a safe manner can be difficult. Because of that I think it would be nice to write an optional header file with utilities to easily and safely traverse JSMN's output. These should work at least for the strict mode.
Values decoding
At this point JSMN made the decision of not parsing the values. It could be left this way but then I believe it would be great to optionally offer utilities that could take for example a string from a token and decode it into UTF-8 or a widechar string.
The text was updated successfully, but these errors were encountered: