
2019 Toronto Monday

Dzmitry Malyshau edited this page Apr 3, 2019 · 1 revision

Interning positions?

(gw, kvark)

  • we intern sizes but not positions of clips and primitives
  • gecko bakes the scroll offsets
  • new API now allows us to ask for the scroll offsets and “unbake” them
  • however, hit testing is still a problem (TODO: clarify)
  • get_relative_transform needs to be used consistently
  • blocked on flattening rework (TODO: resolve)

Render task graph

(nical, gw, kvark)

Problem 1: a task depends on multiple other tasks.
Case: multiple drop shadows on the same text item.
Opportunity: only downscale the text once, and use it for all shadow tasks.

Note: render task cache can't be used, since it doesn't handle dependencies well (scheduling issue).

  • solution: schedule the RT cache as late as possible
  • when dependencies are in the texture cache, we'd need to render in a pass and blit back to the texture cache

Concern: render task rect allocation assumes that the source is in the previous pass.

  • solution: blit contents of a task across passes
  • schedule as late as possible
  • retain some of the render task slices/rects as opposed to clearing
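The shared-shadow case above can be sketched as interning render tasks by key, so several shadow tasks reuse one downscale task. A minimal sketch; names and types are illustrative, not WebRender's actual API:

```rust
use std::collections::HashMap;

// Hypothetical sketch: intern render tasks by a key so that several shadow
// tasks share one downscale of the same text item instead of re-rendering it.
#[derive(Clone, PartialEq, Eq, Hash)]
enum TaskKey {
    DownscaleText { item_id: u64, scale: u32 },
    Shadow { item_id: u64, blur_radius: u32 },
}

#[derive(Default)]
struct TaskGraph {
    tasks: Vec<TaskKey>,
    deps: Vec<Vec<usize>>, // deps[i]: tasks that task i reads from
    interned: HashMap<TaskKey, usize>,
}

impl TaskGraph {
    // Return the existing task for `key` if one was already added.
    fn add(&mut self, key: TaskKey, deps: Vec<usize>) -> usize {
        if let Some(&id) = self.interned.get(&key) {
            return id;
        }
        let id = self.tasks.len();
        self.tasks.push(key.clone());
        self.deps.push(deps);
        self.interned.insert(key, id);
        id
    }
}

fn main() {
    let mut graph = TaskGraph::default();
    // Three drop shadows on the same text item: the downscale is added once.
    for blur in [2u32, 4, 8].iter() {
        let down = graph.add(TaskKey::DownscaleText { item_id: 1, scale: 2 }, vec![]);
        graph.add(TaskKey::Shadow { item_id: 1, blur_radius: *blur }, vec![down]);
    }
    assert_eq!(graph.tasks.len(), 4); // 1 shared downscale + 3 shadows, not 6 tasks
}
```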

Blits are expensive:

  • bounds are not tight
  • performance on Intel scales with the number of pixels we touch

TODO: check ARM/Mali for when the tiles are resolved:

  • does it happen if the tile is unchanged?
  • what if it was just cleared?

Motivation:

  1. reduce redundant shadow tasks
  2. remove "mark for saving"
  3. SVG filters are expressed as graphs

Q: retain across frames? A: currently, not retaining any shadow tasks

Q: debugging tools for the RT graph resolver? A: a fun thing to write, given that the resolver is fairly standalone
N: need tooling to find out the best scheduling off-line, and compare with the run-time result by the number of pixels

Current "best" strategy:

  • ping-pong between passes, as current WR does
  • schedule late
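The "schedule late" strategy can be sketched as an as-late-as-possible pass assignment over the task graph. A hedged sketch, assuming task indices are in dependency order (each task appears after its dependencies):

```rust
// "Schedule as late as possible": each task is placed in the latest pass
// that still precedes all of its consumers, keeping intermediate results
// alive for as few passes as possible.
fn assign_passes_alap(deps: &[Vec<usize>]) -> Vec<i32> {
    let n = deps.len();
    // Total pass count comes from the longest dependency chain.
    let mut depth = vec![0i32; n];
    for i in 0..n {
        for &d in &deps[i] {
            depth[i] = depth[i].max(depth[d] + 1);
        }
    }
    let last_pass = *depth.iter().max().unwrap_or(&0);
    // Walk backwards: a task with no consumers runs in the last pass;
    // otherwise it runs one pass before its earliest consumer.
    let mut pass = vec![last_pass; n];
    for i in (0..n).rev() {
        for &d in &deps[i] {
            pass[d] = pass[d].min(pass[i] - 1);
        }
    }
    pass
}

fn main() {
    // downscale(0) feeds two shadows (1, 2), which feed the final picture (3)
    let deps = vec![vec![], vec![0], vec![0], vec![1, 2]];
    assert_eq!(assign_passes_alap(&deps), vec![0, 1, 1, 2]);
    // an independent task (1) is pushed into the last pass, not the first
    assert_eq!(assign_passes_alap(&[vec![], vec![], vec![0]]), vec![0, 1, 1]);
}
```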

Q: incremental deployment of the new allocator?

  1. first, integrate with existing behavior
  2. enable for shadows and other things
  3. play with strategies

Future:

  1. Since targets are texture arrays, sub-manage the slices. Try working with slices, not rects.
  2. Render pass as the whole render target, select the slice in VS.
  3. Try identifying the 1:1 pass work, use sub-passes.
    • tile cache memory limits? know how many mask slices there are
    • can provide the whole frame as one giant render pass

Q: can we exploit the axes and auto-rotate things?

  • would be good!
  • segmentation solves the problem to some extent
  • could also exploit the symmetry

Rounded corners optimizations:

  • only render the corners into the mask
  • exploit the symmetry
  • more precise bounding/geometry to reduce fill rate
    • can't apply the local clip rect in this case!
      • quad tree subdivision (or a regular grid) - still draw rects
    • can't multiply the clip mask (TODO: discuss)
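The symmetry idea above can be sketched as reflecting each sample position into a single quadrant, so only one corner needs to be rendered into the mask. An illustrative helper, not actual WR code:

```rust
// Reflect a local position within a rect into the top-left quadrant, so a
// mask rendered for one corner can be sampled for all four corners.
fn mirror_into_corner(x: f32, y: f32, rect_w: f32, rect_h: f32) -> (f32, f32) {
    let mx = if x > rect_w * 0.5 { rect_w - x } else { x };
    let my = if y > rect_h * 0.5 { rect_h - y } else { y };
    (mx, my)
}

fn main() {
    // A point near the top-right corner maps onto the top-left corner mask.
    assert_eq!(mirror_into_corner(90.0, 10.0, 100.0, 40.0), (10.0, 10.0));
    // A point near the bottom-left corner maps the same way.
    assert_eq!(mirror_into_corner(5.0, 35.0, 100.0, 40.0), (5.0, 5.0));
}
```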

Current PLS "optimization"

(gw, kvark, Gankro)

Ideas:

  1. use a test case that doesn't rely on tile elimination
  2. avoid unorm <-> f32 conversions
  3. bind as write-only more often (requires 32-bit chunks to be written)
  4. don't multiply clip values early, only do in the combination pass

TODO: compile a list of questions for ARM

  1. how exactly can we take advantage of tile elimination?
    • does it work for off-screen targets?
    • what are the supported formats?
    • what are the states that affect it?
  2. does anything happen to a tile that no geometry touches?
  3. TODO

WebGPU integration

(kats, kvark, jgilbert)

Process of vendoring wgpu-rs:

  • move into the tree
  • improve the remote layer
  • establish sync scripts to/from GitHub

Gecko will have two implementations as well, in the shape of different structs with the same virtual interface: local and remote. The differences are:

  1. Client parameter in all functions of the remote layer
  2. Swapchain integration (unknown on Gecko side)
  3. Pass dependencies collection in the remote layer
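The local/remote split could look roughly like this; names are hypothetical, only the shape (two structs, one virtual interface, an extra client handle on the remote side) is taken from the notes above:

```rust
// Hypothetical shared interface for the two Gecko-side implementations.
trait GpuBridge {
    fn create_buffer(&mut self, size: u64) -> u32;
}

// In-process implementation: calls straight into the GPU backend.
struct Local { next_id: u32 }
impl GpuBridge for Local {
    fn create_buffer(&mut self, _size: u64) -> u32 {
        self.next_id += 1;
        self.next_id
    }
}

// Remote implementation: threads a client id through every call, standing in
// for the "Client parameter in all functions of the remote layer" above.
struct Remote { client_id: u32, next_id: u32 }
impl GpuBridge for Remote {
    fn create_buffer(&mut self, _size: u64) -> u32 {
        // A real IPC layer would serialize (client_id, size) and send it to
        // the GPU process; here we just mint a namespaced id locally.
        self.next_id += 1;
        (self.client_id << 16) | self.next_id
    }
}

fn main() {
    let mut local = Local { next_id: 0 };
    let mut remote = Remote { client_id: 2, next_id: 0 };
    assert_eq!(local.create_buffer(16), 1);
    assert_eq!(remote.create_buffer(16), (2u32 << 16) | 1);
}
```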

Q: how do we reduce the JS calls in client apps?

Moving into Gecko:

  1. copy into tree as "gfx/webgpu"
  2. connect it into libgkrust's Cargo.toml and lib.rs
  3. Run ./mach vendor rust and check for complaints about licensing. This will add dependencies to third_party/rust; make sure they look sane.
  4. add a build option "--enable-webgpu", similar to WR's; put the libgkrust integration from step 2 behind a feature flag controlled by the build option
    • JG: Why allow non-webgpu builds? We don't let you build without webgl.
    • DM: switch it ON when ready, at least in some form? No need to slow down everyone for now.
    • JG: This is done with a pref, not a build option, usually.
    • DM: WR was a build option at the beginning, before it was able to consider shipping anywhere.
    • JG: We know we're going to be shipping it, and that we want a prototype to play with sooner rather than later. To that end, it seems like all downside to have this be a build option. Just leave it as a pref, if that's acceptable (which I think it is!).
    • DM: OK, sounds reasonable.
    • DM: A tricky part is selecting which backend to build with. If it's optional, we can straightforwardly enable the Vulkan build on Linux CI. If it's mandatory, we'll need to resolve the backend selection logic right here, which complicates the integration a bit.
  5. To add a taskcluster job, first decide whether you want a full Firefox build with webgpu enabled, or a standalone webgpu build. The former might make more sense if you just need to catch build regressions. For the latter, copy and modify the existing webrender standalone jobs such as this one into a new taskcluster/ci/webgpu/kind.yml file. You won't need the wrench-deps stuff; just run cargo build in the gfx/webgpu folder.

WebRender architecture overview

(gw)

Display Lists:

  • Items
    • text
    • box shadow
    • image
  • Stacking contexts
    • filters
  • Clip chains
  • Reference frames

Scene ("model"):

  • picture tree
  • spatial tree
  • clip chains

Q: "scene" term vs WR capturing?
Q: interning? (see picture caching)
Q: tile caching?

The scene can be scrolled around.

Frame ("view"):

  • update spatial tree
  • update picture tree
  • update visibility
  • update primitives
    • generate render tasks
  • assign passes
  • batch

Submit:

  • apply resource updates
  • for each pass (see GPU work topic)

Picture caching

Q: stuff moved but isn't marked as changed by the debug overlay? A: could be a fixed-position element that isn't cached, drawn on top

Interning key:

  1. item itself
  2. clipping
  3. transform
  4. animated properties (e.g. opacity)

Picture = (prim uuid, uuid, uuid, ..). Tiles are 1024x256. Dirty regions are identified and updated with a scissor rect.
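A minimal sketch of the interning key and tile validation described above; field names are illustrative, not WebRender's actual types:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// A tile is valid if the combined key of every primitive that touches it
// matches the key recorded for the previous frame.
#[derive(Hash)]
struct PrimKey {
    item_uuid: u64,       // the item itself
    clip_uuid: u64,       // clipping
    transform_uuid: u64,  // transform
    opacity_binding: u64, // animated properties, e.g. opacity
}

fn tile_key(prims: &[PrimKey]) -> u64 {
    let mut h = DefaultHasher::new();
    for p in prims {
        p.hash(&mut h);
    }
    h.finish()
}

fn tile_is_dirty(prev_key: u64, prims: &[PrimKey]) -> bool {
    tile_key(prims) != prev_key
}

fn main() {
    let prims = vec![PrimKey { item_uuid: 1, clip_uuid: 2, transform_uuid: 3, opacity_binding: 4 }];
    let key = tile_key(&prims);
    // Nothing changed: the tile stays valid.
    assert!(!tile_is_dirty(key, &prims));
    // A changed transform invalidates the tile.
    let moved = vec![PrimKey { item_uuid: 1, clip_uuid: 2, transform_uuid: 9, opacity_binding: 4 }];
    assert!(tile_is_dirty(key, &moved));
}
```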

Q: what happens with complex regions? A: draw the whole thing

  • should set the Z on tiles to reject the pixels over the valid tiles (TODO: verify)
  • blog-post-like pages are still the bad case

Q: tile coordinate space? A: world. If stuff is scrolled, the positions are adjusted, so we get the same world results.

Q: how does the valid content get into the new frame? A: copied through the texture cache

A new API in development exposes the scroll offsets to WR from Gecko; it allows removing hacks in WR and caching more surfaces.

Clusters are built during flattening:

  • bounds
  • spatial node

GPU work in depth

draw_tile_frame:

  • for each pass
    • bind pass n-1 as input
    • for each A8 target
      • draw clips
      • draw blurs
    • for each RGBA8 target
      • draw borders
      • draw alpha batches
      • draw blurs
      • draw scalings

Most drawing looks like:

  1. bind textures
  2. bind shader
  3. update VAO/instances
  4. draw instanced
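That per-batch pattern, as a sketch over a hypothetical Device abstraction (not WebRender's actual device API):

```rust
// Illustrative batch: textures and shader to bind, plus 16-byte instances.
struct Batch { textures: Vec<u32>, shader: u32, instances: Vec<[u8; 16]> }

// Stand-in for the GL device wrapper; only counts draw calls here.
#[derive(Default)]
struct Device { draw_calls: u32 }

impl Device {
    fn bind_textures(&mut self, _t: &[u32]) {}
    fn bind_shader(&mut self, _s: u32) {}
    fn update_instances(&mut self, _data: &[[u8; 16]]) {}
    fn draw_instanced(&mut self, _count: usize) { self.draw_calls += 1; }
}

fn draw_batches(device: &mut Device, batches: &[Batch]) {
    for b in batches {
        device.bind_textures(&b.textures);        // 1. bind textures
        device.bind_shader(b.shader);             // 2. bind shader
        device.update_instances(&b.instances);    // 3. update VAO/instances
        device.draw_instanced(b.instances.len()); // 4. draw instanced
    }
}

fn main() {
    let mut device = Device::default();
    let batches = vec![
        Batch { textures: vec![1], shader: 7, instances: vec![[0; 16]; 3] },
        Batch { textures: vec![2], shader: 7, instances: vec![[0; 16]; 5] },
    ];
    draw_batches(&mut device, &batches);
    assert_eq!(device.draw_calls, 2); // one instanced draw per batch
}
```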

How the shader looks: brush_solid -> brush.glsl

main() { // VS
  fetch_brush();
  brush_vs();
}
main() { // FS
  brush_fs();
  do_clip();
}

Data is passed to shaders:

  1. PrimitiveInstance written to the instance buffer - 16 bytes with prim address, clip address, flags
  2. brush common and specific data is written to the GPU cache
  3. read by fetch_brush
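A sketch of the 16-byte instance layout described in step 1; the field names are illustrative, and the fourth word is an assumption added to reach 16 bytes:

```rust
// Hypothetical packed per-instance payload, matching the "16 bytes with
// prim address, clip address, flags" note above.
#[repr(C)]
#[derive(Clone, Copy)]
struct PrimitiveInstance {
    prim_address: i32, // location of the prim data in the GPU cache
    clip_address: i32, // location of the clip chain data
    flags: i32,        // per-instance flags
    user_data: i32,    // assumed extra word to pad the layout to 16 bytes
}

fn main() {
    // The instance must stay at 16 bytes to keep the vertex stream small.
    assert_eq!(std::mem::size_of::<PrimitiveInstance>(), 16);
}
```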

List of all brush and scene shaders:

GPU cache

Rows are associated with a block count per element (16, 64, 512, etc.). A simple slab allocator finds the next free entry after the user has provided all the data via request().
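A minimal sketch of such a row-based slab allocator; the size classes are taken from the note above, but this is not WebRender's implementation:

```rust
// Each row of the cache texture is dedicated to one block size class, and an
// allocation takes the next free slot in a row of the matching class.
struct Row { block_count: usize, free_slots: Vec<usize> }

struct SlabAllocator { row_width: usize, rows: Vec<Row> }

impl SlabAllocator {
    fn new(row_width: usize) -> Self {
        SlabAllocator { row_width, rows: Vec::new() }
    }

    // Round the request up to a size class, then take a free slot,
    // opening a new row for that class if none has space.
    // Returns (row index, block offset within the row).
    fn allocate(&mut self, block_count: usize) -> (usize, usize) {
        let class = [16usize, 64, 512]
            .iter()
            .copied()
            .find(|&c| c >= block_count)
            .expect("request too large for any size class");
        for (row_index, row) in self.rows.iter_mut().enumerate() {
            if row.block_count == class {
                if let Some(slot) = row.free_slots.pop() {
                    return (row_index, slot * class);
                }
            }
        }
        let slots = self.row_width / class;
        self.rows.push(Row { block_count: class, free_slots: (1..slots).collect() });
        (self.rows.len() - 1, 0)
    }
}

fn main() {
    let mut alloc = SlabAllocator::new(1024);
    assert_eq!(alloc.allocate(10), (0, 0));  // first 16-block element, row 0
    assert_eq!(alloc.allocate(100), (1, 0)); // 512-block class opens a new row
}
```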

TODO: validation could be more comprehensive

Q: do we have i32 textures? A: segmentation! The primitive header's color, UVs, and GPU cache address are written into the f32 prim header.

Texture uploads

(mstange, gw, kvark, jrmuizel, dan, ..)

Client storage on Mac:

  • alignment
  • don't use texture storage
  • don't use it with texture in flight
  • don't use it with texture data

Use as an upload vector only, not as direct texture storage.

Problems:

  1. stalls! no proper PBO renames
  2. forced format conversion: no BGRA8 internal format on Mac

Potential path to fight stalls:

  • switch to Scatter
  • re-initialize GPU cache texture

Q: remove GPU cache texture in favor of vertex data only?

Idea: a small test suite to figure out what works well and what doesn't on a given platform (texture uploads, UBOs, depth testing, etc.)

WebRender debugging workshop

https://paper.dropbox.com/doc/WR-debugging-workshop--AalZzf941wQkvDMDIAURIDBqAQ-RC4fgQlmYHmrU83Sd8Nds