This document provides a short walkthrough of the source code for the PicoLisp-JSON encoder/decoder.
Note: This document covers v3 of the JSON library. To view the older (C/ffi bindings) version, click here.
It's split into a few sections for easier reading:
- Global variables: Important variables used throughout the library.
- Pure PicoLisp JSON decoding: Decoding JSON in PicoLisp, without external libraries.
- Internal functions: Recursion and datatype-checking.
Make sure you read the README to get an idea of what this library does.
Also, I recommend you visit my PicoLisp Libraries Page for additional PicoLisp tips and ideas.
Prior to version 17.3.4, PicoLisp provided the local function to prevent variables from leaking into the global namespace; however, it was removed in the 32-bit version and its semantics were changed, thus introducing a breaking change for anyone using (local) in their code.
To work around this issue, I modified the library to disable namespaces by specifying the environment variable PIL_NAMESPACES=false.
(unless (= "false" (sys "PIL_NAMESPACES"))
  (symbols 'json 'pico)
  (local MODULE_INFO *Msg err-throw)
This change allows the JSON library to be loaded correctly on all 32/64-bit systems running a version of PicoLisp higher than 3.1.9 (for backwards compatibility); however, if namespaces aren't required, it's probably best to disable them as mentioned above.
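Since (sys) simply reads an environment variable, the guard above can be checked interactively. For example, assuming PicoLisp was launched with PIL_NAMESPACES=false set in its environment:
: (sys "PIL_NAMESPACES")
-> "false"   # NIL when the variable isn't set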
In v2, an external C library was used to perform JSON string decoding. This version gets rid of that dependency and performs all parsing directly in PicoLisp.
The JSON spec requires proper handling of Unicode characters written as \uNNNN, where N is a hexadecimal digit, as well as formfeed \f and backspace \b, which are not handled by PicoLisp. However, it does handle newline \n -> ^J, carriage return \r -> ^M, and tab \t -> ^I.
Similar to the @lib/json.l included with PicoLisp, this library calls str to tokenize the JSON string. Unfortunately, the tokenization removes the single \ from Unicode characters, turning \u006C into u006C, making it impossible to safely differentiate it from a random string containing the u006C character sequence.
In that case, it's necessary to parse the Unicode characters before tokenizing the string:
(str (json-parse-unicode (chop Value)) "_")
The (json-parse-unicode) function receives a chop(ped) list of characters representing the full JSON string, and returns a pack(ed) string with all \uNNNN values converted to their UTF-8 symbol:
[de json-parse-unicode (Value)
  (pack
    (make
      (while Value
        (let R (pop 'Value)
          (cond
            [(and (= "\\" R) (= "u" (car Value))) (let U (cut 5 'Value) (link (char (hex (pack (tail 4 U) ] # \uNNNN hex
            [(and (= "\\" R) (= "b" (car Value))) (pop 'Value) (link (char (hex "08") ] # \b backspace
            [(and (= "\\" R) (= "f" (car Value))) (pop 'Value) (link (char (hex "0C") ] # \f formfeed
            (T (link R)) ]
Let's see what's going on here:
- make is used to initiate a new list
- while loops over the list stored in Value, until the list is empty
- pop removes the first element from the list stored in Value
- A conditional check since we're searching for a \b (backspace), \f (formfeed), or \uNNNN (Unicode) character
- If the character following \\ (i.e: an escaped \) is u, then we pop the next 5 items from the list (i.e: uNNNN) using cut
- link is used to add a new element to the list created with (make)
- Finally, we pack the last 4 items from the previously cut items (i.e: NNNN), and use hex and char to convert NNNN.
For Unicode characters, it ends up like this: "\\u0065" -> "e". Yay!
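As a quick REPL check of that example (assuming the library has been loaded):
: (json-parse-unicode (chop "\\u0065"))
-> "e"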
There's no point in decoding a JSON file that isn't valid, so an early detection method is to determine whether all the curly braces ({}) and square brackets ([]) are matched.
We'll use a stack-based algorithm to count brackets, and only consider it a success if the stack is empty at the end.
First, we provide the tokenized string to the (json-count-brackets) function, and map over each character. For each character, we perform the following:
(if (or (= "{" N) (= "[" N))
  (push 'Json_stack N)
  (case N
    ("]" (let R (pop 'Json_stack) (unless (= "[" R) (err-throw "Unmatched JSON brackets '['"))))
    ("}" (let R (pop 'Json_stack) (unless (= "{" R) (err-throw "Unmatched JSON brackets '{'")))) ) ) )
- If the character is an opening { or [, push it to the stack
- If the character is a closing } or ], pop the top value from the stack, and if that value isn't the matching bracket (i.e: { for }, or [ for ]), then we have unmatched JSON brackets. Easy. A minimal sketch of the full function is shown below.
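Putting those pieces together, here's a minimal sketch of how such a function could be assembled around the conditional shown above. The argument name, the local Json_stack, and the final empty-stack check are my assumptions based on the description; the library's actual definition may differ:
[de json-count-brackets (Tokens)
  (let Json_stack NIL
    (mapc
      '((N)
        (if (or (= "{" N) (= "[" N))
          (push 'Json_stack N)   # opening bracket -> push
          (case N
            ("]" (let R (pop 'Json_stack) (unless (= "[" R) (err-throw "Unmatched JSON brackets '['"))))
            ("}" (let R (pop 'Json_stack) (unless (= "{" R) (err-throw "Unmatched JSON brackets '{'")))) ) ) )
      Tokens )
    (when Json_stack (err-throw "Unmatched JSON brackets")) ]   # leftovers mean an unclosed { or [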
Those who are paying attention will notice the (err-throw) function. It does two things:
(msg Error)
(throw 'invalid-json NIL)
The msg function will output a message to STDERR, because the UNIX Philosophy.
The throw function will raise an error in the program, with the 'invalid-json label and a NIL return value.
The decoder will catch the raised error, as it should, but more importantly, the NIL return value will indicate that decoding failed. This is important for programs which embed this library, as it won't break a running program, and will behave exactly as expected when something goes wrong.
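The catching side isn't shown in this document, but conceptually it's just a plain catch wrapped around the decoding work. A minimal sketch, where (do-the-decoding) is a hypothetical stand-in for the library's internal decode step:
(catch 'invalid-json
  (do-the-decoding) )   # evaluates to the result, or to the thrown NIL if err-throw fired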
We'll briefly cover the validation for objects, arrays, and the separator.
Essentially, (json-array-check) and (json-object-check) simply validate whether the value following the [ or { bracket is allowed.
The (json-object-check-separator) is used to ensure a : separates the string from the value (ex: {"string" : value}).
[de json-object-check (Name)
  (or
    (lst? Name)
    (= "}" Name)
    (err-throw (text "Invalid Object name '@1', must be '}' OR string" Name) ]
As you can see, it's quite simple, and if there's no match, (err-throw) will be called.
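For completeness, here's a sketch of what the separator check could look like, following the same pattern (the body and error message are my assumptions; only the function's purpose is described above):
[de json-object-check-separator (Separator)
  (or
    (= ":" Separator)   # the only valid separator between a name and its value
    (err-throw (text "Invalid Object separator '@1', must be ':'" Separator) ]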
This part of the code was completely rewritten from scratch, so we'll go through it together.
We'll begin by looking at how JSON is decoded in this library.
A fully tokenized JSON string might look like this:
("{" ("t" "e" "s" "t") ":" "[" 1 "," 2 "," 3 "]" "}")
Now, look at the (iterate-object) function. This is a recursive function which iterates through the global *Json variable, a list containing the tokenized JSON string, and quickly builds a sexy PicoLisp list.
[de iterate-object ()
  (let Type (pop '*Json)
    (cond
      ((= "[" Type) (make (link-array T)))
      ((= "{" Type) (make (link-object)))
      ((lst? Type) (pack Type))
      ((num? Type) Type)
      ((= "-" Type) (if (num? (car *Json)) (format (pack "-" (pop '*Json))) (iterate-object)))
      ((= 'true Type) 'true)
      ((= 'false Type) 'false)
      ((= 'null Type) 'null)
      (T (err-throw (text "Invalid Object '@1', must be '[' OR '{' OR string OR number OR true OR false OR null" Type) ]
We treat the *Json list as a stack, and iterate through it after popping one or more elements, until there's nothing left but tears of joy.
The condition for [ will start a new list with make, and call (link-array) with the argument T. We'll see why later.
The rest is quite easy to understand, but I'll focus on the case of (= "-" Type). The tokenization doesn't recognize negative numbers, so -32 would be tokenized to '("-" 32). To solve this, we check for a single "-" (already popped as Type), and if the next item in the list is a number, we pop that number as well, pack it with the "-" ((pack) creates a string), then use format to convert the result to a number.
In other words, our tokenized '("-" 32) -> -32. Please note, since -32 is not a string, this could not have been done in the Unicode parsing stage; it must occur after tokenization with (str).
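At the REPL, that pack/format round-trip looks like this:
: (pack "-" 32)
-> "-32"
: (format (pack "-" 32))
-> -32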
Both the (link-array) and (link-object) functions make a call to the more generic (link-generic) function. It accepts three arguments: the type of item, the closing bracket, and an unevaluated quote(d) function.
(link-generic "array"
  "]"
  '(link (iterate-object)) ]

(link-generic "object"
  "}"
  '(link-object-value Name) ]
They're quite similar. In both cases, the function will iterate once more over the object, depending on various conditions described in (link-generic).
Let's look at some of the magic going on in (link-generic):
# 1. ((any (pack "json-" Type "-check")) Name)
# 2. (unless (= Bracket Name) (eval Iterator))
The first looks a bit weird, but it essentially uses any and pack to dynamically generate a function name, and then calls it with the Name argument.
This gives something like: (json-array-check "[") - dynamically generated Lisp functions ftw!
The second is a bit easier to grok: it simply eval(uates) the given function passed through the variable Iterator.
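The first line's trick can be tried in isolation at the REPL:
: (pack "json-" "array" "-check")
-> "json-array-check"
: (any (pack "json-" "array" "-check"))
-> json-array-check   # an internal symbol, which can then be applied to Name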
Earlier, we saw (link-array T) was called, but sometimes only (link-array) is called, without the T argument. Why?
To differentiate an Array from an Object in PicoLisp, we add T to the start of the list. When recursing, unless it's a new array, we don't provide the T argument:
(when Make (link T))
The previously tokenized JSON string would end up like this:
(("test" T 1 2 3))
The code for encoding JSON strings hasn't changed, so feel free to read about it here.
That's pretty much all I have to explain about the new and improved v3 pure PicoLisp JSON encoder/decoder. I'm very open to providing more details about functionality I've skipped, so just file an issue and I'll look into amending this document.
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Copyright (c) 2018 Alexander Williams, Unscramble license@unscramble.jp