
Guidance on using the tile_size parameter? #7

Open
rsimon opened this issue Jun 13, 2023 · 8 comments

Comments

@rsimon

rsimon commented Jun 13, 2023

Hi,

do you have any guidance on how to choose the tile_size parameter? I've been experimenting with our dataset of geo-coordinates (projected, EPSG:3857, with X/Y coordinate ranges of roughly [-20,000,000, 20,000,000]).

This seems to work in deepscatter in principle. But rendering is extremely slow, not like the other online demos I've seen. It also appears to render all our data at the initial zoom level. I then changed the tile_size to 1,000,000 and recreated the tileset. This seemed to display the initial render much faster, but then didn't appear to render a higher-quality image after zooming.

Another issue I observed: the initial viewport wasn't automatically adjusted to the coordinate range of the points. Rather, the points were crammed into a tiny corner. It's easy to zoom into the correct region, of course, but I'm wondering if there's something I'm doing wrong with my coordinates?

@bmschmidt
Owner

As a rule I use a tile_size parameter of about 65,000. I don't know that there are major benefits to larger ones; the key sign that you need a larger tile size, in my mind, would be if the loading waterfall is always backlogged and transfers are spending much of their time in a state other than transferring data from server to client. It's relatively flexible, but there may be some cases where it breaks. (Off the top of my head, there is one place where I allocate memory on the GPU in blocks of 64MB, which means that a tile_size greater than 16 million might completely break things in deepscatter; I can't think of any others, but I also haven't tested much. Created an issue there.)
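For concreteness, a quadfeather run with that tile size would look something like this; the flag names below are from memory rather than this thread, so check quadfeather --help for the exact interface:

```sh
# Tile the input into a quadtree with roughly 65,000 points per tile.
# --files / --tile_size / --destination are my recollection of the
# quadfeather flags; verify against your installed version.
quadfeather --files points.csv \
            --tile_size 65000 \
            --destination tiles
```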

If you expose the scatterplot as a window variable plot, you can see what deepscatter thinks the extent is by entering plot.dataset.extent. A miscalculated extent can indeed cause problems, because it means the quadtree will be degenerate for several layers: rather than each tile having four populated children, it will have just one populated child for several levels of partitioning.
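For example, a minimal sketch (the element id and construction details here are illustrative, not taken from this thread):

```js
import Scatterplot from 'deepscatter';

// Build the plot against some container element and expose it
// globally so it can be inspected from the devtools console.
const plot = new Scatterplot('#deepscatter');
window.plot = plot;

// Then, in the browser console:
//   plot.dataset.extent
// For EPSG:3857 data, x and y should both sit roughly within
// [-20,000,000, 20,000,000]; anything wildly outside that range
// points to a miscalculated extent and a degenerate quadtree.
```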

My advice for now is, I think, the same as before: try populating not from CSV, but from a parquet file where you can be confident of the types. It's possible, e.g., that I have some North America-specific character localization choices in the way the Arrow CSV parser is set up. I don't know that I've ever built a quadtile set larger than 20 million points off of a CSV file, but I've done over 1 billion happily on parquet.
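As a sketch of that conversion with PyArrow (file and column names are placeholders; the point is that Parquet carries an explicit schema, so the coordinate types can't be mis-inferred the way a localized CSV can):

```python
import pyarrow as pa
import pyarrow.csv as pacsv
import pyarrow.parquet as pq

# Read the CSV with explicit types for the coordinate columns, so
# locale quirks (e.g. '.' as a thousands separator) can't silently
# turn floats into strings or integers.
table = pacsv.read_csv(
    "points.csv",
    convert_options=pacsv.ConvertOptions(
        column_types={"x": pa.float32(), "y": pa.float32()}
    ),
)
pq.write_table(table, "points.parquet")
```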

@rsimon
Author

rsimon commented Jun 15, 2023

Thanks for this. I converted to Parquet (using PyArrow). This did seem to improve things somewhat - and worked on the whole dataset! 🙏

But I still think I'm getting wrong behavior during tile rendering/zooming. Rendering still takes ages, and my machine freezes almost completely while it renders. (I'm attaching a screenshot below. It was taken mid-render, but after waiting at least 5 minutes or so.)

I also logged the inferred extent, as you suggested. Deepscatter claims it's approx. x [-3,433,116,378, 21,834,822,860], y [-968,008,534, 4,070,507,694]. This is way off, but probably matches exactly the initial viewport I'm seeing. Is there a way to fix the extent during the generation of the tileset?

[screenshot: mid-render state of the map]

@rsimon
Author

rsimon commented Jun 15, 2023

P.S.: for reference, I'm attaching a screenshot of the initial viewport. (It's the tiny orange dot, mid/left.)

[screenshot: initial viewport, points in a tiny orange dot at mid/left]

@rsimon
Author

rsimon commented Jun 15, 2023

Ah, never mind, found it. It's the limits argument.
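For later readers, the re-tiling invocation was along these lines (I'm reconstructing the --limits syntax from memory rather than pasting my shell history, so double-check it against the quadfeather docs):

```sh
# Re-tile with the extent pinned to the EPSG:3857 world bounds
# (x_min y_min x_max y_max) instead of letting quadfeather infer
# it from the data; value order and flag shape are my best guess.
quadfeather --files points.parquet \
            --tile_size 50000 \
            --limits -20000000 -20000000 20000000 20000000 \
            --destination tiles
```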

Ok, fixing the coordinate ranges yields much better results:

[screenshot: map rendering with the corrected extent]

Rendering is still a bit slow (as you can see from the missing tiles in North America), and my machine is still half-frozen. But I assume that's not avoidable?

@bmschmidt
Owner

There shouldn't be major problems rendering 100m points quickly. I'll send you a link to a 1.5b point map that should render fairly smoothly. I suspect the rendering problems and the limits problems are somehow related: if the limits are wrong, it will attempt to draw too many points at a time. Someone else reported a similar bug in quadfeather this week that I'm investigating.

Something else that can cause big performance degradation, because of the way GPUs work, is coincident points; it's possible that adding some random noise with the --randomize parameter might help here. But I suspect something else is going on.
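E.g. something along these lines; whether --randomize takes an explicit jitter amount I'd have to double-check, so treat the value as illustrative:

```sh
# Rebuild the tiles with a small random jitter added to each point,
# so exactly-coincident points stop piling up in a way the GPU
# handles badly.
quadfeather --files points.parquet \
            --tile_size 50000 \
            --randomize 0.1 \
            --destination tiles
```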

@rsimon
Author

rsimon commented Jun 15, 2023

Thanks, I'll try --randomize. But I don't think we should really have coincident points. At least not in significant amounts...

@bmschmidt
Owner

Are you still running with tile_size 1m? On further thought, that could conceivably slow things down a great deal, because it could lead to many unnecessary points being processed on the GPU.

@rsimon
Author

rsimon commented Jun 15, 2023

No, I went back to tile_size 50,000.
