Make FeatureIterator more robust #207

Open · 0 comments

michaelkirk (Member) opened this issue on Sep 2, 2022:

FeatureIterator is intended to be an efficient way to iterate through individual Features in a FeatureCollection without ever having to parse the entire FeatureCollection into memory at once.

To do so, it seeks through the input stream until it finds the features array, and then starts parsing the individual features one by one.
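
Conceptually, the current seek behaves like this (a simplified sketch for illustration, not the actual implementation):

```rust
use std::io::{self, Read};

/// Advance the reader until the first `[` byte, assuming it opens the
/// `features` array.
fn seek_to_first_bracket<R: Read>(reader: &mut R) -> io::Result<()> {
    let mut byte = [0u8; 1];
    loop {
        reader.read_exact(&mut byte)?;
        if byte[0] == b'[' {
            return Ok(());
        }
    }
}
```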

Currently, however, the way we seek to the start of the `features` array is by looking for the first occurrence of `[`. This would fail with a document like:

{ 
  "type": "FeatureCollection",
  "bbox": [100.0, 0.0, 105.0, 1.0],
  "features": [
    ...
  ] 
}

or more complicated ones like this:

{
  "type": "FeatureCollection",
  "some_foreign_member": { "a": [1], "b": { "foo": [2, 3] } },
  "features": [
    ...
  ]
}

We should update `FeatureIterator` to be more robust in how it parses a `FeatureCollection`.
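
One possible direction (a rough sketch, not a drop-in patch): scan with just enough JSON awareness to skip over strings and nested containers, and only accept the `[` that opens a top-level `"features"` member. This handles both examples above, though it is still not a full parser:

```rust
use std::io::{self, Read};

/// Seek to the `[` that opens the root object's "features" array,
/// ignoring brackets inside strings and nested values.
fn seek_to_features_array<R: Read>(reader: &mut R) -> io::Result<()> {
    let mut depth = 0u32; // nesting depth of {} / [] containers
    let mut in_string = false; // inside a JSON string?
    let mut escaped = false; // previous byte was a backslash?
    let mut current = Vec::new(); // bytes of the string being read
    let mut last_string: Option<Vec<u8>> = None; // most recently completed string
    let mut byte = [0u8; 1];

    loop {
        reader.read_exact(&mut byte)?;
        let b = byte[0];
        if in_string {
            if escaped {
                escaped = false;
            } else if b == b'\\' {
                escaped = true;
            } else if b == b'"' {
                in_string = false;
                last_string = Some(std::mem::take(&mut current));
            } else {
                current.push(b);
            }
            continue;
        }
        match b {
            b'"' => in_string = true,
            b'{' => depth += 1,
            b'}' => depth = depth.saturating_sub(1),
            b'[' => {
                // The array we want opens at depth 1 (inside the root
                // object) immediately after the string "features".
                // Tracking only the last completed string is good enough
                // for the examples above, though a real fix should also
                // distinguish keys from string values.
                if depth == 1 && last_string.as_deref() == Some(&b"features"[..]) {
                    return Ok(());
                }
                depth += 1;
            }
            b']' => depth = depth.saturating_sub(1),
            _ => {}
        }
    }
}
```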

bors bot added a commit that referenced this issue Sep 6, 2022
206: stream reading features r=frewsxcv a=michaelkirk

- [x] I agree to follow the project's [code of conduct](https://github.com/georust/geo/blob/master/CODE_OF_CONDUCT.md).
- [x] I added an entry to `CHANGES.md` if knowledge of this change could be valuable to users.
---

This greatly reduces the peak memory usage needed to parse a GeoJSON FeatureCollection by utilizing the FeatureIterator's trick of seeking to the `"features": [` array without actually parsing the FeatureCollection. Note there are [some robustness issues](#207) with this approach, some of which can be improved, but overall I think it's worth the trade-off.
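
For context, usage might look roughly like this (a hypothetical sketch; the `FeatureIterator::new` constructor, the input file name, and the error handling are assumptions for illustration, not necessarily the crate's exact API):

```rust
use std::fs::File;
use std::io::BufReader;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Hypothetical input file for illustration.
    let reader = BufReader::new(File::open("large_collection.geojson")?);
    // Each iteration yields one Feature at a time, so peak memory stays
    // proportional to the largest single feature, not the whole collection.
    for feature in geojson::FeatureIterator::new(reader) {
        let feature = feature?;
        println!("{:?}", feature.properties);
    }
    Ok(())
}
```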

There is a small (0-1%) CPU regression in our deserialization bench due to this change, but the peak memory profile of the `stream_reader_writer` example decreased by 90%, from 2.22 MiB to 237 KiB.

Before:
![Memory profile before (peak 2.22 MiB)](https://user-images.githubusercontent.com/217057/188239657-ade76677-f3e2-4750-b71d-bed0effbe215.png)

After:
![Memory profile after (peak 237 KiB)](https://user-images.githubusercontent.com/217057/188239645-5274e0ff-71b2-4da6-8ac9-1f72d288c2bb.png)

Notice that overall memory allocations stayed the same (see #88 (comment)), but the peak usage is much lower since we never hold the entire data structure in memory at once.


Co-authored-by: Michael Kirk <michael.code@endoftheworl.de>