
Application-wide, size-limited cache for scenario networks #940

Merged 6 commits into dev on Oct 3, 2024

Conversation

abyrd (Member) commented on Jul 20, 2024

When running regional analyses on a list of over 70 scenarios with bike egress on a large network with many transit stops (Netherlands), the workers eventually run out of memory and stall. This is because the linkages and egress tables for all these scenarios accumulate in memory and are never evicted. This problem was not usually noticeable when people were running analyses one by one, but becomes apparent when using scripting tools to launch regional analyses for large numbers of slightly varying scenarios.

Here I have used a Caffeine LoadingCache to hold the scenario networks instead of a simple Map. This allows eviction of older items when the cache reaches a maximum size. It was a little tricky to locate all the scenario application logic in the right place together with the CacheLoader function while maintaining access to all needed data structures (such as the ScenarioCache). There were comments in the code about the possibility (and advantages) of a single top-level scenario network cache instead of nesting such a cache within each TransportNetwork. Adopting this approach by switching to a single cache inside TransportNetworkCache simplified/clarified some of the code.
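A minimal sketch of this pattern, using hypothetical names (ScenarioNetworkCacheSketch, resolveAndApplyScenario) rather than the actual r5 code:

```java
import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.LoadingCache;

public class ScenarioNetworkCacheSketch {

    // Stand-in for r5's TransportNetwork; details omitted.
    static class TransportNetwork { }

    private static final int MAX_CACHED_SCENARIO_NETWORKS = 10;

    // One application-wide cache keyed on a scenario identifier, replacing the
    // plain Map formerly nested inside each TransportNetwork. When the size
    // limit is exceeded, older entries become eligible for eviction.
    private final LoadingCache<String, TransportNetwork> scenarioNetworkCache =
            Caffeine.newBuilder()
                    .maximumSize(MAX_CACHED_SCENARIO_NETWORKS)
                    .build(this::resolveAndApplyScenario);

    // CacheLoader function: scenario resolution and application logic gathered
    // in one place, with access to the base network and ScenarioCache.
    private TransportNetwork resolveAndApplyScenario(String scenarioKey) {
        // ... look up the base network, resolve the scenario, apply modifications ...
        return new TransportNetwork();
    }

    public TransportNetwork getNetworkForScenario(String scenarioKey) {
        // On a miss, get() runs the loader under a per-key lock; on a hit it
        // returns the cached network immediately.
        return scenarioNetworkCache.get(scenarioKey);
    }
}
```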

The single-cache approach does have the downside of not evicting the scenario networks together with their base network when it's evicted, as well as the problem of possibly retaining multiple base networks in memory because the N cached scenario networks hold references to different base networks. But in practice, in non-local (cloud) operation, a given worker instance is locked to a single network and these situations should never happen.

Here scenario application is happening inside a CacheLoader. When resolving or applying scenarios, all sorts of errors and exceptions can then happen inside the cache's value-loading code. But as mentioned in the code comments and Caffeine Javadoc, the get method of the LoadingCache allows these exceptions to bubble up. I tested this with some handled and unhandled exceptions and validation problems inside the scenario/modification application code, and everything worked as expected, with errors clearly visible in API responses and the UI.
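A small self-contained illustration of that behavior (placeholder keys, not real scenarios): an unchecked exception thrown inside the CacheLoader propagates out of LoadingCache.get, and no value is cached for the failed key. Per the Caffeine Javadoc, a checked exception thrown by the loader would instead surface wrapped in a CompletionException.

```java
import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.LoadingCache;

public class LoaderExceptionDemo {
    public static void main(String[] args) {
        LoadingCache<String, String> cache = Caffeine.newBuilder()
                .maximumSize(10)
                .build(key -> {
                    if (key.startsWith("bad")) {
                        // Stand-in for a scenario resolution/validation failure.
                        throw new IllegalArgumentException("scenario failed to apply: " + key);
                    }
                    return "network for " + key;
                });

        try {
            cache.get("bad-scenario");
        } catch (IllegalArgumentException e) {
            // The failure bubbles up to the caller, where it can be turned into
            // a clear API error response.
            System.err.println(e.getMessage());
        }
    }
}
```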

Commits in this PR include:
- greatly improves debugging dependent artifacts
- Moved from individual TransportNetwork instances to a single cache under TransportNetworkCache
abyrd (Member, Author) commented on Jul 20, 2024

Initial impression is that this worked as intended in a large batch of regional analyses. In two similar previous runs, workers crashed from running out of memory or ground to a halt from excessive garbage collection, and parts of the batch had to be manually rerun. After this change, the whole batch finished quickly.

Nonetheless this deserves some careful review because it shifts the locking/synchronization behavior onto the loading cache instead of simple synchronized methods, and eviction behavior will be different than before. Some complex interactions may exist with large/numerous regional analyses, or workers performing both regional and single point analyses.

@abyrd abyrd marked this pull request as ready for review October 3, 2024 00:04
ansoncfit (Member) left a comment:


Summarizing some points from a fresh look at this PR and a detailed discussion in a work session:

We expect the changes here to fix a bug (potential out-of-memory errors) and make regional analysis progress smoother, especially when workers are switching between tasks on different scenarios. The change to locking/synchronization behavior may introduce a few edge cases in single-point analyses and local operation, but any risks there are small (in an extreme case, one worker bogging down) and in my judgement worth the improved user experience.

Expanding on:

> [shifting] locking/synchronization behavior onto the loading cache instead of simple synchronized methods,

This PR introduces a Caffeine LoadingCache to use its eviction features for scenario networks. Previously, a LoadingCache was used for the base networks, but a plain Map was used for the scenarios. LoadingCaches also have locking behavior built in. Removing the synchronized keyword from two methods, as proposed here, is not necessary to fix the initially identified bug. But given the use of a LoadingCache, removing the original synchronization allows smoother transitions between regional analyses on different scenarios and arguably improves legibility/maintainability.
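A sketch of the contrast, using illustrative names rather than the actual r5 methods. A synchronized method serializes all callers on a single lock; Caffeine instead locks per key, so a thread building one scenario does not block reads of another, while concurrent requests for the same missing key still compute it only once:

```java
import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.LoadingCache;
import java.util.HashMap;
import java.util.Map;

class LockingContrast {
    private final Map<String, Object> scenarioNetworks = new HashMap<>();
    private final LoadingCache<String, Object> cache = Caffeine.newBuilder()
            .maximumSize(10)
            .build(LockingContrast::buildScenarioNetwork);

    // Before: one lock serializes all callers, even those requesting a
    // scenario network that is already built.
    synchronized Object getNetworkSynchronized(String key) {
        return scenarioNetworks.computeIfAbsent(key, LockingContrast::buildScenarioNetwork);
    }

    // After: per-key locking inside the cache; only callers waiting on the
    // same in-progress load block each other.
    Object getNetworkCached(String key) {
        return cache.get(key);
    }

    private static Object buildScenarioNetwork(String key) {
        return new Object(); // stand-in for applying a scenario to a base network
    }
}
```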

The limit of 10 scenarios ignores large potential differences in scenario sizes (e.g., depending on which egress mode linkages/tables are included in the corresponding linkage cache).
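For reference, Caffeine can bound a cache by total weight instead of entry count, which could in principle account for such size differences. This is a hypothetical sketch only (estimatedSizeOf is an assumed helper, and the PR does not do this):

```java
import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.LoadingCache;

public class WeightedCacheSketch {
    // Stand-in for r5's TransportNetwork; details omitted.
    static class TransportNetwork { }

    private final LoadingCache<String, TransportNetwork> cache = Caffeine.newBuilder()
            .maximumWeight(4_000_000_000L) // total budget, e.g. approximate bytes
            .weigher((String key, TransportNetwork network) ->
                    (int) Math.min(Integer.MAX_VALUE, estimatedSizeOf(network)))
            .build(key -> new TransportNetwork()); // stand-in loader

    // Hypothetical: estimate the memory held by a scenario network's
    // linkages and egress tables.
    private long estimatedSizeOf(TransportNetwork network) {
        return 1L; // placeholder
    }
}
```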

The eviction policy is approximately least-recently used (Caffeine's W-TinyLFU policy, which also considers access frequency). So in the common case of repeatedly comparing one scenario (e.g. some future "baseline") against a range of alternatives, the frequently accessed "baseline" will remain in the cache.

Workers in cluster mode do not switch between networks (the broker will not send them a task for a network they have not requested work on). In local use, a worker can switch between networks, so there may be some unexpected behavior when comparing many different scenarios/networks locally.

In our discussion a few months ago, there was an aside about workers dropping out of the pool while they built egress tables for new modes. That issue is more directly addressed by the features enabled in #941 than by the changes here.

@abyrd abyrd enabled auto-merge October 3, 2024 16:34
@abyrd abyrd merged commit 3437d67 into dev Oct 3, 2024
3 checks passed
@abyrd abyrd deleted the abyrd/scenario-network-cache branch October 3, 2024 16:38