
Minor performance fixes and a small bug fix #45

Open
wants to merge 4 commits into
base: master

Conversation

@fdagenais-cinesite (Contributor) commented Feb 8, 2021

First of all, this branch contains some minor performance fixes:

  1. For this one, the credit goes to Alexander, who pointed it out. In most cases, an agent's id will match its index in the agents list. When fetching an agent id from its index, we can save time by first checking whether the index matches the id. This avoids having to loop over all entries in the vector, which does not scale well as the number of agents increases.
  2. Most of the time is generally spent computing the skin deformation. During that operation, roughly 5% of the time was spent on unnecessary normalization of 4D vectors. These extra computations have been removed. In my test with ~750 agents, this saved roughly ~2s, which is not much, but as the number of agents and/or their mesh complexity increases, the savings could grow.
  3. When loading the scene hierarchy of the AtomsVariationReader or AtomsCrowdGenerator nodes in Gaffer, it can take a few seconds to process. Most of the time is spent loading the atoms meshes in AtomsVariationReader. To speed things up, all meshes from one agent are now loaded in parallel, which significantly improves responsiveness. Note that the gain is limited when processing more than the atoms nodes, since other nodes might be processed on other cores, e.g., during a render. I wasn't sure this was really needed, but it does improve things when looking directly at the Atoms nodes, so I can remove it if preferred. Also note that this is a quick fix; a better solution, which would take more time to implement, would be to change the way we load atoms data. Instead of loading everything at once, we could load only the information required to generate the child names, bbox, etc., and then load the meshes only when requested.
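A rough sketch of the fast path described in (1) — with illustrative names, not the actual Atoms API:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch: in most cases an agent's id equals its index in the
// agents list, so check that first, and only fall back to a linear search
// over all entries when it doesn't.
std::size_t indexForAgentId( const std::vector<int> &agentIds, int id )
{
	// Fast path: the id usually matches its own index.
	if( id >= 0 && static_cast<std::size_t>( id ) < agentIds.size() && agentIds[id] == id )
	{
		return static_cast<std::size_t>( id );
	}
	// Slow path: O(n) linear search, which is what the original code always did.
	const auto it = std::find( agentIds.begin(), agentIds.end(), id );
	return static_cast<std::size_t>( it - agentIds.begin() );
}
```

When the ids are simply 0..n-1 (the common case), the lookup becomes O(1) instead of O(n) per agent.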

Also, this branch contains a small bug fix I found while looking at the skin deformation code in AtomsCrowdGenerator. Instead of using the inverse of the normals matrix to transform the normals back to local space, the code was using the normals matrix (local to world) itself. I believe this went undetected because the agents' transformation matrices only contained translation information, with the inner 3x3 matrix being the identity.
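To illustrate why the bug went unnoticed: normals must be transformed by the inverse transpose of the local-to-world 3x3 matrix, which coincides with the matrix itself when that 3x3 part is the identity (or any rotation). A minimal standalone sketch, using a pure-scale matrix where the two differ — this is not the Atoms code:

```cpp
#include <cassert>
#include <cmath>

struct V3 { double x, y, z; };

V3 normalize( const V3 &v )
{
	const double l = std::sqrt( v.x * v.x + v.y * v.y + v.z * v.z );
	return { v.x / l, v.y / l, v.z / l };
}

// For M = diag(sx, sy, sz), the inverse transpose is diag(1/sx, 1/sy, 1/sz).
// Transforming a normal by M itself (multiplying by the scales) would tilt it
// the wrong way; dividing by the scales keeps it perpendicular to the surface.
V3 transformNormal( const V3 &n, double sx, double sy, double sz )
{
	return normalize( { n.x / sx, n.y / sy, n.z / sz } );
}
```

With identity scales the two formulas agree, which matches the observation above that agents with translation-only matrices masked the bug.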

- Faster agent Id retrieval in most cases
- Removed unnecessary vector normalizations when computing skin
  deformations
Updating the variations scene (child names, sets, etc.) requires
loading all atoms data into memory. This includes all meshes, which
were previously read serially. Meshes from one agent type are now
loaded in parallel to improve speed.
@andrewkaufman (Collaborator) left a comment


FWIW, we don't usually load N in the first place, so the normalization computations never occur. It might be worth looking into whether you need the deformed N from Atoms or can get away without it (e.g. if your meshes are subdivided anyway, or even if they are linear you would fall back to winding-order at rendertime).

```cpp
}
else
{
	std::cout << "Index did not match... Doing a linear search!" << std::endl;
```
Collaborator


We shouldn't leave the debug print in here

Contributor Author


Oh, I missed that one... I will remove it, thanks!

```cpp
if ( !inGeoMap )
// Load atoms meshes in parallel
tbb::blocked_range<size_t> tbb_range( 0, nbGeos, 1 );
tbb::parallel_for( tbb_range, [&]( const tbb::blocked_range<size_t> &range ) {
```
Collaborator


You have to be careful with tbb parallelization inside gaffer nodes because of the way the compute cache policies work. Here's an example:

https://github.com/GafferHQ/gaffer/blob/15f6530800a99028fd60dbfc3cc1541d0bc139c8/src/GafferScene/PrimitiveSampler.cpp#L371

Collaborator


There are two different considerations here :

  1. TBB exception handling. Unless told otherwise, TBB will silently cancel your parallel_for if an exception is thrown in another task inside the same task tree - which as far as you are concerned could be completely unrelated to what you are doing. This is what you need the task_group_context that Andrew linked for - it tells TBB that your parallel_for is a distinct operation that shouldn't be affected by exceptions thrown in any others.
  2. Deadlock due to TBB task stealing. TBB loves to wander off and do whatever TBB tasks it can squeeze in, so the wait for your parallel_for can cause the execution of completely arbitrary TBB tasks elsewhere in Gaffer. Gaffer performs some operations behind mutexes, so if your parallel_for is behind such a mutex, and the arbitrary task TBB chooses to do needs that mutex, you get deadlock. This is what the cache policies Andrew mentioned are for. By declaring a cache policy of either TaskIsolation or TaskCollaboration, you avoid the deadlock by making sure TBB won't do any outer tasks unrelated to your parallel_for. TaskCollaboration has more overhead but is smart enough to allow multiple waiting threads to collaborate on the construction of the engine if they all pull on the engine plug concurrently. There's an example of this in FilterResults.

Yes, there is an unfortunate overloading of the term "isolation" here. This comes from TBB, where task_group_context::isolated means "not cancelled by exceptions elsewhere" and this_task_arena::isolate means "won't wander off and perform unrelated tasks".

Contributor Author


I was not aware of that. Thank you for the clear explanation. :) Just to be sure, does that mean that setting the cache policy on the engine data output to TaskCollaboration will allow the currently running threads (the ones from which the engineDataPlug compute method is called) to wait and pick up the subtasks (atoms mesh loading) while they are waiting for the output to be computed? And does that mean that setting it to TaskIsolation will prevent the subtasks from having access to all cores, or that the subtasks will simply need to create more threads instead of reusing the waiting ones, which causes more overhead?

Collaborator


Just to be sure, that means that setting the cache policy on the engine data output to TaskCollaboration will allow the currently running threads (the ones from which the engineDataPlug compute method are called from) to wait and pick up the subtasks (atoms mesh loading) while they are waiting for the output to be computed?

Yes.

Does that mean that setting it to TaskIsolation will prevent the subtasks to have access to all cores?

Yes, if other tasks/threads also need the result from engineDataPlug(), they would just wait uselessly for it in the TaskIsolation case. TaskIsolation solves the deadlock problem, but does nothing to aid performance. In an ideal world, TaskCollaboration would be the only option, but it has additional overhead so we are keeping TaskIsolation around as an option until we're happy that collaboration is suited for everything - more recent TBBs seem to have a new mechanism that might help us implement it more efficiently.

In your case, I would expect the engine loading to be heavy enough and infrequent enough that TaskCollaboration would be the way to go.

@fdagenais-cinesite (Contributor, Author) commented Feb 23, 2021


Ok, thanks!

- Removed a forgotten cout
- Fixed a few issues related to TBB in EngineData::loadAgentTypeMesh
  (AtomsVariationReader.cpp)
@fdagenais-cinesite (Contributor, Author)

Reviewers' comments have been addressed in the latest commit:

  • The std::cout call that was left over has been removed.
  • The engineDataPlug() cache policy has been changed to TaskCollaboration.
  • The atoms mesh loading tbb loop is now run in its own isolated context group.

Note: I did not add a way to disable normals transformation during skin deformation because I was not sure there was a safe way to guarantee that the normals won't be used. If you want, I can still add a user-facing option to enable/disable normals transformation.

@andrewkaufman (Collaborator)

Sorry, to clarify my comment about skipping N: I mistakenly thought we had an opt-out plug on AtomsVariationReader which would skip loading them here, but checking now I see we only have that for Pref & Nref... If we had an additional toggle for N, then the AtomsCrowdGenerator wouldn't need any changes, because it already returns early if they are missing.

In any case, it's not related to this PR, just a thought for the future.

@johnhaddon (Collaborator)

Latest changes look good to me.

Another thing you might want to consider for the future is adding support for cancellation. This is the mechanism Gaffer uses to abort long-running operations and make the UI responsive again. You do that by calling IECore::Canceller::check( context->canceller() ) periodically in your long compute. If cancellation has been requested, that'll throw an IECore::Cancelled exception which will abort the computation all the way back to the caller. Your only other responsibility is to make sure your code is exception-safe. Here's a super simple example:

https://github.com/GafferHQ/gaffer/blob/c908425b6d2e82105cd92a1398fae8eeb0ca9600/src/GafferImage/Resample.cpp#L552
