discussion: pluggy speedup by using 'codegen' #449
Let me explain the idea by walking you through its evolution from the beginning:
```python
import dataclasses
from typing import Any, Mapping

from pluggy import PluginManager


class NewerPluginManager(PluginManager):
    def do_compile(self):
        """When you're ready to enter !!turbo!! mode, run this. The hook attribute will be replaced and can't be restored."""
        # NOTE: this is not the real code. I don't have a copy of that, so this is
        # approximate, for explanation purposes.
        plugins = []
        code = ""
        for i, impl in enumerate(reversed(self.hook.fun.get_hookimpls())):
            plugins.append(impl.function)
            code += f"result.append(plugins[{i}](hooks, nesting))\n"

        @dataclasses.dataclass
        class Caller:
            locs: Mapping
            pyc: Any

            def fun(self, **kwargs):
                result = []
                # The generated code sees 'result', the call kwargs, and 'plugins'
                # as local names, so no per-call dict lookups remain here.
                exec(self.pyc, None, {'result': result} | kwargs | self.locs)
                return result

        self.hook = Caller({'plugins': plugins}, compile(code, "", "exec"))
    ...
...
```
```python
# And its usage in the test case below:
for i in range(plugins):
    pm.register(Plugin(i), name=f"plug_{i}")
for i in range(wrappers):
    pm.register(PluginWrap(i), name=f"wrap_plug_{i}")
if hasattr(pm, 'do_compile'):
    # This step is TEMPORARY, just for the POC, don't worry :)
    pm.do_compile()
benchmark(pm.hook.fun, hooks=pm.hook, nesting=nesting)
```
I hope that the above example illustrates the concept. The above runs and works as well as the version from step 2. It is a bit slower than step 2 but a lot faster than the baseline pluggy code.
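For reference, this is roughly what the generated `code` string would contain for three registered hookimpls (my reconstruction from the loop above, not output captured from the branch):
```python
result.append(plugins[0](hooks, nesting))
result.append(plugins[1](hooks, nesting))
result.append(plugins[2](hooks, nesting))
```
The per-hookimpl loop is fully unrolled at compile() time, so at call time the only remaining work is the plugin calls themselves.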
The code resulting after step 5 (which is where I've stopped for now) is now in a temporary branch in my fork.
I've started working on a general purpose Python package to facilitate this type of optimization. Perhaps it will be worth revisiting this idea in Pluggy after I've had a chance to analyze and streamline the concept in a context completely external to pluggy.
I intend to prepare some refactorings that move more responsibility into the hook callers, to ensure a simpler inner loop. My personal intent is to eventually move those loops from the Python level to the C level, as the interpreter overhead seems to cost more than, say, a simple loop. Right now I'm under the impression that removing call overhead and hook-parameter overhead is a bigger win than dynamically unrolling loops. That being said, I also don't quite follow the optimization tooling you are preparing; some docs that pull together the research and benchmarks and explain the niche/structure where this technique manifests its wins would be highly appreciated. With my current understanding of the technique you proposed, my impression is that the potential gains would melt away as the tooling moves from POC to feature parity, and I'd love to be wrong there.
Well said. I agree that performance gains often "melt away" in the move to feature parity. I cannot pave the way into C extensions, as I've never done it before. I'd love to see how it is done though and will be happy to help with details in C once the outline is in place. (I do have quite a lot of C experience and love the language in general). Since you asked about specific gains, I will highlight one particular line. As I described in my earliest comment (in the other ticket), I started out with some experiments where I hard coded some values to see where speedups happen. There were two lines in particular that I found to be particularly costly in the original code:
It's that first one there which led me to develop the rather convoluted solution that I am now fine-tuning. Edit: I also wanted to underscore that the performance wins do not come from loop unrolling or from dropping a bit of unused code (those were mentioned more as simple examples of optimizations that also happen to be occurring). The main win is in the way that the call to the plugin is built: instead of assembling an argument list by iterating and doing dict lookups, it uses dynamically hard-coded names against symbols that are injected into the exec namespace.
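To make the contrast concrete, here is a minimal sketch of the two call styles; `Impl`, `call_baseline`, and `build_codegen_caller` are illustrative names I've made up here, not pluggy's actual internals:
```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Impl:
    # Stand-in for a hookimpl: the function plus the argument names it accepts.
    function: Callable
    argnames: Sequence[str]


def call_baseline(impls, caller_kwargs):
    # Baseline style: iterate the impls and assemble each call's arguments
    # through dict lookups on every single invocation.
    results = []
    for impl in impls:
        args = [caller_kwargs[name] for name in impl.argnames]  # per-call lookups
        results.append(impl.function(*args))
    return results


def build_codegen_caller(impls):
    # Codegen style: bake the argument names into generated source once, so at
    # call time each argument is simply a local name in the exec() namespace.
    lines = [
        f"result.append(plugins[{i}].function({', '.join(impl.argnames)}))"
        for i, impl in enumerate(impls)
    ]
    code = compile("\n".join(lines), "<codegen>", "exec")

    def call(**caller_kwargs):
        result = []
        exec(code, None, {"result": result, "plugins": impls} | caller_kwargs)
        return result

    return call


# Both callers produce the same results for the same inputs:
impls = [Impl(lambda x, y: x + y, ["x", "y"]), Impl(lambda x: x * 10, ["x"])]
print(call_baseline(impls, {"x": 2, "y": 3}))  # [5, 20]
print(build_codegen_caller(impls)(x=2, y=3))   # [5, 20]
```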
The cast is just for type annotations. There are a number of potential techniques to simplify certain details: the loop over arguments, for example, can be replaced with a pre-allocated vector call in the C API, and hook calling vs. hook-wrapper calling can be split into more concrete APIs. I still strongly doubt that a Python codegen will bring consistent enhancement to pytest, where calls with different hook sets and additional hooks are common.
I understand that it is for type annotations, but there is a comment in the code that implies pluggy is also counting on the function to effectively provide a type check. In any case, I was just trying to answer your question about the specific areas where I'd observed (empirically) a possible performance impact.
I think this answer resolves my initial question (in the other ticket) about whether a "codegen" approach would be a good fit here. It sounds like the sequence of hooks is too dynamic, and it sounds like there are simpler solutions for performance improvements which are already in the pipeline. I'll close this ticket since I think the idea has been thoroughly discussed!
I wanted to pull this conversation into a new ticket so that I don't clutter up @RonnyPfannschmidt's project ticket. This discussion relates to issue #322.
The idea is not new. The technique is described here, for example: https://medium.com/@yonatanzunger/advanced-python-achieving-high-performance-with-code-generation-796b177ec79
I am shortening the concept to the word 'codegen' for the sake of this discussion.
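As a minimal, self-contained illustration of the technique from that article (the `make_adder` example is mine, not from pluggy):
```python
# Generic 'codegen': instead of re-interpreting a description of the work on
# every call, generate specialized Python source once, compile it, and call
# the resulting function directly.
def make_adder(field_names):
    body = " + ".join(f"record['{name}']" for name in field_names)
    src = f"def adder(record):\n    return {body}\n"
    namespace = {}
    exec(compile(src, "<generated>", "exec"), namespace)
    return namespace["adder"]


add_xy = make_adder(["x", "y"])
print(add_xy({"x": 2, "y": 3}))  # 5
```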
I'm hoping this discussion can allow project leaders to ask and answer these questions:
Some key takeaways: