
2019 Toronto Tuesday

Dzmitry Malyshau edited this page Apr 5, 2019 · 2 revisions

Battery life

(jrmuizel, markus)

Need to:

  1. avoid copy of the frame contents
  2. avoid rendering when we only need to scroll
  3. do partial presents
  4. for video, give YUV surface to WS
  5. not composite occluded windows
  6. use the same D3D device for video decoding to avoid a copy

Initial plan:

  • document splitting. Separate OS layers for documents.

Current GL behavior:

  • we call glSwapBuffers
  • Window Server knows what window is changed, considers the opaque-ness
  • unfortunately, our window is transparent, forcing WR to composite it on top

Short-term measures for baseline:

  1. switch to opaque
  2. use scissor rect in GL compositor
  3. use a Core Animation layer (instead of a GL-backed NSView)
  4. multiple layers (with different opaque-ness) on the same surface

Chrome on Mac uses aggressive CA layerization. Can enable debug visualization of the layer borders to see how they are composited and invalidated.

Android is a problem. Chrome is bad at it. Android has recent APIs (both SDK and NDK) that are similar to CA (SurfaceControl), and we need to use them!

Solution ideas:

  • identify multiple scroll roots in WR (ideally, pipe through this information from Gecko)
  • use tile caching and CA layers for our scroll roots
  • fall back to the current way of compositing if something is unusual
  • at first, have one giant WR tile per layer
  • picture caching is relevant since it has tracking for dependencies and knows when it needs to be invalidated
  • could draw directly into the tiles/layers, but that would require batching to be separate as well during invalidation
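The dependency tracking behind tile invalidation can be sketched roughly as below. This is only an illustration of the idea, not WebRender's picture caching code: the `Tile` type, its `update` method, and the choice to compare a single hash of the dependencies are all hypothetical.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical cached tile: remembers a hash of everything it was drawn from.
struct Tile {
    dependency_hash: Option<u64>,
    redraw_count: u32,
}

impl Tile {
    fn new() -> Self {
        Tile { dependency_hash: None, redraw_count: 0 }
    }

    /// Returns true if the tile had to be redrawn for this frame.
    /// A tile is reused as long as the hash of its dependencies is unchanged.
    fn update<D: Hash>(&mut self, dependencies: &[D]) -> bool {
        let mut hasher = DefaultHasher::new();
        dependencies.hash(&mut hasher);
        let new_hash = hasher.finish();
        if self.dependency_hash == Some(new_hash) {
            return false; // nothing changed, keep the cached contents
        }
        self.dependency_hash = Some(new_hash);
        self.redraw_count += 1;
        true
    }
}
```

A tile that is only scrolled keeps the same dependencies and is never redrawn, so only its layer position needs to change. Comparing hashes instead of the full dependency list trades a small collision risk for speed; a real implementation may compare the dependencies themselves.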

Video

Current:

  1. clear
  2. draw video
  3. WR copy to screen

Use WebGL? Already drawn into IOSurface, so we could put them on screen easily. WR needs to be able to mark images as "special" (aka "WebGL" image). WR needs to expose "this layer can go here" semantics. WR makes a decision to make layers, attaches to a root layer that Gecko provides. There is a bit of compositor logic in WR to know about platform layers.

Planeshift crate? Take it and shape into what we need.

Document splitting

One layer for document, one for chrome.

Need to get that first, play with partial present, and then go into layerization of the content.

Q: priorities for platforms A: Windows and Android first

Q: measuring battery perf, avoiding regressions

  • Intel power gadget on Mac
  • Resize many windows with throbbers
  • Mobile has comprehensive tools
  • Need to measure the number of dirty pixels we send to WS, periodically compare against the power metrics

TODO:

  • find a champion to experiment with compositing on document splitting
  • Jessie to follow up with the perf team on Q2 OKRs for measuring battery perf

Problem: currently WR regresses versus non-WR on Windows, it's not using partial present.

First step in partial present: compute the dirty rect for chrome (throbber!). Currently WR doesn't do any dirty tracking on chrome, since it's outside of the tile cache.
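As a rough sketch of that first step, the dirty rect can be the union of everything that changed in a frame, and its area gives the "dirty pixels" number to compare against power metrics. The `Rect` type and `dirty_rect` function are hypothetical names for this illustration, not existing WR APIs.

```rust
/// Axis-aligned rectangle in device pixels (hypothetical type for this sketch).
#[derive(Clone, Copy, Debug, PartialEq)]
struct Rect {
    x0: i32,
    y0: i32,
    x1: i32,
    y1: i32,
}

impl Rect {
    /// Number of pixels covered; empty or inverted rects count as zero.
    fn area(&self) -> i64 {
        (self.x1 - self.x0).max(0) as i64 * (self.y1 - self.y0).max(0) as i64
    }

    /// Smallest rectangle containing both inputs.
    fn union(&self, other: &Rect) -> Rect {
        Rect {
            x0: self.x0.min(other.x0),
            y0: self.y0.min(other.y0),
            x1: self.x1.max(other.x1),
            y1: self.y1.max(other.y1),
        }
    }
}

/// Combine all changed regions into one dirty rect for a partial present.
/// Returns None when nothing changed, meaning no present is needed at all.
fn dirty_rect(changed: &[Rect]) -> Option<Rect> {
    changed.iter().copied().reduce(|a, b| a.union(&b))
}
```

For a 16x16 throbber the dirty area is a few hundred pixels, compared with roughly two million for a full 1920x1080 present.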

Fission update

(rhunt)

Essence: enhance the process isolation.

Old: 4 content processes for all tabs. Spectre attack proved that we can't trust JS memory. New: website is the only thing in an OS process. Any iframes inside it from different sites are in other processes.

Tricky case: site A contains site B that contains site A. Still needs the inner and outer sites in the same process...
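The A-in-B-in-A case falls out naturally if processes are keyed by site within a tab. The sketch below assumes exactly that; `ProcessAllocator` and `process_for` are hypothetical names, and the real Fission allocation policy may differ.

```rust
use std::collections::HashMap;

/// Hypothetical per-tab allocator: one OS process id per site.
struct ProcessAllocator {
    by_site: HashMap<String, u32>,
    next_id: u32,
}

impl ProcessAllocator {
    fn new() -> Self {
        ProcessAllocator { by_site: HashMap::new(), next_id: 0 }
    }

    /// Returns the process for a frame of `site`, creating one on first use.
    /// Frames of the same site share a process regardless of nesting depth.
    fn process_for(&mut self, site: &str) -> u32 {
        if let Some(&id) = self.by_site.get(site) {
            return id;
        }
        let id = self.next_id;
        self.next_id += 1;
        self.by_site.insert(site.to_string(), id);
        id
    }
}
```

With this keying, the outer A and the inner A resolve to the same process, while B gets its own.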

Problem: memory usage rises with many processes.

  • WebRender helps here because it shares a lot of context between tabs/iframes
  • ImageLib is still a problem though, since it lives in the content process

Fission roughly targets Nightly Windows desktop at the end of 2019. Currently in phase 2 (preffable, partially functioning).

Filters

Limitations: no filters on iframes.

TODO: look at how Chrome is doing compositing w.r.t. filters

We have telemetry to know if we are painting a sub-document inside a filter. Could be an SVG filter as well.

WR should eventually support SVG filters natively. Rough plan and prototyping is taking place. Lots of incremental work to follow.

New API

... to recursively draw the contents starting from chrome in its process. Useful for screenshots and other things, where the compositor results aren't enough.

Non-WR codepath

(jrmuizel)

non-accelerated WR

Q: how do we indicate to the user that they are on non-WR and some bugs aren't going to be fixed?

Q: is it worth cleaning up the frame layer builder (FLB)? A: need to establish the timeline

WR ships in 67 on NV, 68 on AMD, then Intel.

Q: Win7 doesn't have direct composition, so WR path will be slow?

  • Win7 end of life is Jan 2020, we'll probably be fine with a slowdown, considering users will be migrating to Win10 next year
  • Main blockers: direct composition and driver quality
  • Today ~50% of Windows users are on 7 and 8 variants
  • half of those have D2D unavailable or blacklisted
  • we can drop D2D (for content only? as opposed to canvas2d) once we have the majority of current D2D users moved to WR

Android

  • ES2 we don't support. Missing things:
    • integer texture (can work around)
    • fp32 textures (should switch universally to fp16)
    • array textures
    • instanced arrays
    • dynamic loops in GLSL for blurs (can work around)
  • should be able to enable on ES3

Software

Changes since Orlando: picture caching helps. Problem: supporting both WR and non-WR code paths for FLB is becoming more of a pain.

Options:

  1. LLVM pipe
    • tried before on desktops, was somewhat usable
    • an on-the-spot benchmark shows it to be usable: it spends 90% of the time in text run rendering (presumably because of subpixel AA), 5% in clip shaders, and the rest in blits
  2. SwiftShader (in the future - Vulkan version)
    • TODO: need to benchmark with WR
    • currently being integrated into Gecko for WebGL (~Q2)
    • exposed GLES extensions
  3. D3D11 WARP
  4. Skia backend for WR?
    • would allow us to take more shortcuts

Depth testing

  • slower in software than on HW (because of memory bandwidth?)
  • could be fast? not at the moment
  • can do more aggressive culling on WR side for this case
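One form the more aggressive culling could take is dropping primitives that are fully hidden behind a single opaque primitive before they reach the rasterizer. The sketch below is an assumption about how that might look, with hypothetical `Prim` and `cull_occluded` names; it does not handle primitives that are only hidden by a combination of several others.

```rust
/// Hypothetical primitive: a screen-space rect (x0, y0, x1, y1) plus opacity.
#[derive(Clone, Copy)]
struct Prim {
    rect: (i32, i32, i32, i32),
    opaque: bool,
}

/// True if `outer` fully covers `inner`.
fn contains(outer: (i32, i32, i32, i32), inner: (i32, i32, i32, i32)) -> bool {
    outer.0 <= inner.0 && outer.1 <= inner.1 && outer.2 >= inner.2 && outer.3 >= inner.3
}

/// Takes primitives sorted front to back and returns the indices of those
/// that still need drawing. A primitive is culled when some opaque primitive
/// in front of it covers it completely.
fn cull_occluded(front_to_back: &[Prim]) -> Vec<usize> {
    let mut visible = Vec::new();
    for (i, prim) in front_to_back.iter().enumerate() {
        let hidden = front_to_back[..i]
            .iter()
            .any(|front| front.opaque && contains(front.rect, prim.rect));
        if !hidden {
            visible.push(i);
        }
    }
    visible
}
```

This does on the CPU what the depth test would otherwise reject per pixel, which matters more when the rasterizer itself runs in software.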

Suggestion: "safe" WR mode:

  • where it doesn't run any complex shaders and basically just does compositing
  • we can run software WR with the existing GPU compositor in Gecko to get that
  • can use the same facilities as we need for direct composition
  • call skia to do all the non-trivial work?

Printing requires SW rendering. It can go through the same path as SVG images.

Plan:

  • FLB stays at least throughout 2019
    • reach the point where its performance doesn't matter any more
  • gradual removal of features:
    • drop support for component alpha
    • drop accelerated layers
  • turn D2D off for content in 2019

Idea: disable texture array for the texture cache (and more). Start with 512, gradually increase to the max size.
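A minimal sketch of the growth policy, assuming the size doubles at each step (the notes only say "gradually", so the doubling and the function names are illustrative choices):

```rust
/// Starting dimension for the texture cache texture, as suggested in the notes.
const INITIAL_SIZE: u32 = 512;

/// Next size to try when the current texture is full: double it,
/// but never exceed the device's maximum texture size.
fn next_texture_size(current: u32, max_size: u32) -> u32 {
    current.saturating_mul(2).min(max_size)
}

/// Full sequence of sizes the cache would go through on a given device.
fn growth_steps(max_size: u32) -> Vec<u32> {
    let mut sizes = vec![INITIAL_SIZE.min(max_size)];
    while *sizes.last().unwrap() < max_size {
        let next = next_texture_size(*sizes.last().unwrap(), max_size);
        sizes.push(next);
    }
    sizes
}
```

On a device with a 4096 maximum this yields 512, 1024, 2048, 4096, so most pages would never pay for the largest allocation.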

Next:

  • Prioritize getting someone to test WR on software GL to see how well it works
  • Need to determine specific performance problems
    • could also then get feedback from SwiftShader to see if they have suggestions to help determine if that is the path to follow
    • SwiftShader integrated behind a pref at some point?
  • When to do Windows 7?
  • Determine the blockers for shipping on more Windows

WebGPU status

Link to Dzmitry's doc

This summer: MVP snapshot of API where we can work towards release version of spec

Q. Binding and shader language resolved by then? TBD. MVPs might not consume same shader language. What to consume is largest outstanding question.

Display Lists

Items we'd definitely like to address this quarter (non-WR display lists):

https://bugzilla.mozilla.org/show_bug.cgi?id=1534549 https://bugzilla.mozilla.org/show_bug.cgi?id=1539597 https://bugzilla.mozilla.org/show_bug.cgi?id=1502049

For WR display lists

  • Potential low hanging fruit there we could address to improve performance
  • Alexis is already looking into some of that
  • But we can discuss in more detail during WR planning

For non-WR:

  • Focus on work that isn't blocked by removing frame layer builder
  • See how far we get in terms of performance improvements this quarter
  • In June: we can make a call around how much more time we want to spend for now

Android GPU debugging and optimization workshop

(gw, kvark, jnicol)

Mali GPU debugger:

  • needs build.rs changes on unrooted devices in order to pre-load their library
  • need the tool to be launched first, then the app
  • shows render passes, but no tile invalidation info
  • shows shader costs for cycles, ALU, loads/stores, TU instructions, and register occupation

We spend 200us CPU time per draw call!
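To put that number in perspective, here is a back-of-the-envelope calculation. It assumes a 60 Hz display and, unrealistically, that the whole frame budget goes to draw call submission; the function names are made up for this sketch.

```rust
/// Frame budget in microseconds for a given refresh rate (integer division).
fn frame_budget_us(refresh_hz: u32) -> u32 {
    1_000_000 / refresh_hz
}

/// How many draw calls fit in the budget at a fixed CPU cost per call.
/// `cost_per_call_us` must be non-zero.
fn max_draw_calls(budget_us: u32, cost_per_call_us: u32) -> u32 {
    budget_us / cost_per_call_us
}
```

At 200us per call, `max_draw_calls(frame_budget_us(60), 200)` gives 83, so a frame with a few hundred draw calls cannot hit 60 fps on CPU time alone.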

Things to keep in mind for Android:

  1. minimize framebuffer switches (causing resolves): bind, clear, draw, done
  2. invalidate earlier and more, i.e. the depth buffer
  3. reduce shader complexity
  4. try using pixel local storage to reduce the writes

Flattening semantics

(kvark, gw)

Two ways to interpret flattening as a concept:

  1. input Z is ignored
  2. output Z is zeroed, but input can affect X and Y

Problem: if Z is zeroed, the transform becomes uninvertible... We rely on invertibility in several places throughout the WR code. The current clip-scroll tree computes the world transform and its inverse. The latter is used in both CPU and GPU code.

Case: plane splitting. We can move the splitting logic into the space of the preserve3D root. But then we need to figure out the view vector for sorting, and that requires the inverse for the world transform of the preserve3D root...

Preserve Z during flattening

The first idea to explore is turning Z into identity transformation on flattening instead of zeroing out. That would still "work" within an assumption that input Z is zero.
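The difference between the two approaches can be checked numerically. The sketch below uses a column-vector convention and one possible reading of "identity Z" (input Z stops affecting X, Y and W, and output Z is just input Z); all function names are hypothetical and none of this is WR code.

```rust
type Mat4 = [[f64; 4]; 4];

/// Determinant via Gaussian elimination with partial pivoting.
fn determinant(mut m: Mat4) -> f64 {
    let mut det = 1.0;
    for col in 0..4 {
        let mut pivot = col;
        for row in (col + 1)..4 {
            if m[row][col].abs() > m[pivot][col].abs() {
                pivot = row;
            }
        }
        if m[pivot][col] == 0.0 {
            return 0.0; // singular: no inverse exists
        }
        if pivot != col {
            m.swap(pivot, col);
            det = -det;
        }
        det *= m[col][col];
        for row in (col + 1)..4 {
            let factor = m[row][col] / m[col][col];
            for k in col..4 {
                let v = m[col][k];
                m[row][k] -= factor * v;
            }
        }
    }
    det
}

/// Flattening by zeroing the output Z row.
fn flatten_zero(mut m: Mat4) -> Mat4 {
    m[2] = [0.0; 4];
    m
}

/// Flattening by making Z an identity pass-through (assumed interpretation).
fn flatten_identity(mut m: Mat4) -> Mat4 {
    for row in m.iter_mut() {
        row[2] = 0.0; // input Z no longer affects any output
    }
    m[2] = [0.0, 0.0, 1.0, 0.0]; // output Z equals input Z
    m
}

/// Multiply a homogeneous column vector by the matrix.
fn transform(m: &Mat4, p: [f64; 4]) -> [f64; 4] {
    let mut out = [0.0; 4];
    for r in 0..4 {
        for c in 0..4 {
            out[r] += m[r][c] * p[c];
        }
    }
    out
}

/// Rotation around the Y axis, used as a sample 3D transform.
fn rotate_y(angle: f64) -> Mat4 {
    let (s, c) = angle.sin_cos();
    [
        [c, 0.0, s, 0.0],
        [0.0, 1.0, 0.0, 0.0],
        [-s, 0.0, c, 0.0],
        [0.0, 0.0, 0.0, 1.0],
    ]
}
```

For a 60 degree rotation, the zeroed version has determinant 0 while the identity version has determinant cos(60°), and both map any point with Z = 0 to the same result.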

Another concept of inversion

The flattened transform is not generally invertible, in the sense that you can't build an un-transform from it that is usable on any vector. However, we know that we are effectively only going to use it on 2D vectors, so the un-transform should be well-defined for these cases.

We might need to come up with a mathematical primitive that explains this 2.5D transform and allows both CPU and GPU code to use it.

Mix-blend rollback

(kvark, gw)

RIP pictur-ization of mix-blend

CSS is written with a real read-back in mind.

CSS Shooter perf

(kvark, gw)

Main cost is on:

  • CPU compositor
  • GPU

Lots of draw calls and FBO switches with blits...

TODO: double check sample queries, ensure we take offscreen surfaces into account, also include the blits!

CSS uses the background blend mode with 2+ textures that Gecko supports but provides to WR as regular mix-blend SC.

Figured out the following solutions:

  1. use KHR_blend_equation_advanced and IHV specific variants
    • fall back to EXT_shader_framebuffer_fetch on some Mobile platforms
    • fall back to the current slow path
  2. special background blend code path?
  3. picture caching on plane splitting SCs:
    • depends on general picture caching work (ability to intern dependencies of all the pictures and find out which need to be cached)
    • blocked by the flattening rewrite...
  4. draw some of the plane split output as opaque
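The fallback chain in option 1 could be selected from the GL extension list roughly as follows. `GL_KHR_blend_equation_advanced`, `GL_NV_blend_equation_advanced` and `GL_EXT_shader_framebuffer_fetch` are real extension names; the `BlendPath` enum and `choose_blend_path` function are hypothetical, and NV is used here only as one example of an IHV-specific variant.

```rust
/// Which mix-blend implementation to use (hypothetical names).
#[derive(Debug, PartialEq)]
enum BlendPath {
    AdvancedBlendEquation,
    FramebufferFetch,
    CurrentSlowPath,
}

/// Pick the cheapest available path from the extensions the driver reports.
fn choose_blend_path(extensions: &[&str]) -> BlendPath {
    let has = |name: &str| extensions.iter().any(|e| *e == name);
    if has("GL_KHR_blend_equation_advanced") || has("GL_NV_blend_equation_advanced") {
        BlendPath::AdvancedBlendEquation
    } else if has("GL_EXT_shader_framebuffer_fetch") {
        BlendPath::FramebufferFetch
    } else {
        // Neither extension: keep the existing readback-based path.
        BlendPath::CurrentSlowPath
    }
}
```

Checking the extension string is only the first gate; given the driver quality concerns elsewhere in these notes, a blocklist on top of it would probably be needed.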