Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enhancing Variable Naming #181

Open
Theoreb opened this issue Oct 24, 2024 · 1 comment
Open

Enhancing Variable Naming #181

Theoreb opened this issue Oct 24, 2024 · 1 comment

Comments

@Theoreb
Copy link

Theoreb commented Oct 24, 2024

First of all, I'd like to thank the team for developing such a valuable tool. The integration of humanify for JavaScript deobfuscation is incredibly useful, and I appreciate the work that has been done.

However, using the local version of humanify (with the 2b model), I've encountered a few issues that could be resolved to improve the tool's functionality, especially when working with large, complex codebases.

Repeated variable names

In large code bases with many variables, humanify frequently assigns the same name to different variables, leading to confusion and reduced readability - especially when multiple variables are renamed in generic terms such as _______variable.

We should introduce a check to prevent duplication of variable names in the same scope. If a name has already been used, the system could ask for confirmation, ensuring that names remain unique and clear.

Limiting the length of variable names

Currently, variable names are limited to 12 characters, which can have the effect of truncating the names, making them more difficult to understand.

We should raise the limit to 25 characters, which would result in more descriptive names. In addition, we should ensure that names are valid in ASCII and avoid arbitrary word cuts for better readability.
If the model cannot give a concise enough name, we can add a condition to explicitly ask it to give a shorter name.

Insufficient Context for Variable Naming

When variables are defined in isolation (e.g., in very large blocks of declarations), the model struggles to assign meaningful names due to the lack of context:

`
var context = this;
var counter;
var generateCodes;
var d = 0;
var k = 0;
var g = 0;
var e = 8;
var b = -4;
var h = 0;
var n = 2;
var m = 0;
var q = 0.5;
var r = 0;
var s = 15;
var l = [];
var v = 99999;
var C;
var z;
var A = 0;
var G = false;
var J = false;
var K = false;

`

Providing the model with additional context from different sections of the code where the variables are actually used would likely improve the accuracy of the variable names. (We could go through the ast around the declaration zone and take several pieces where the variable is modified and used.) This enhancement would allow the model to make more informed decisions based on actual usage patterns rather than relying solely on initial definitions.

Edit: the '--contextSize' parameter is used to enlarge the program's context window. However, the parameter seems to have been declared as an optional Boolean and not as a valid optional integer (see:

.option(
"--contextSize",
"The context size to use for the LLM",
`${DEFAULT_CONTEXT_WINDOW_SIZE}`
)
).

Base on the commander.js lib documentation, it should be:
.option( "-c, --contextSize <contextSize>", "The context size to use for the LLM",${DEFAULT_CONTEXT_WINDOW_SIZE})

Thank you for considering these suggestions. I look forward to future improvements.

@0xdevalias
Copy link

We should introduce a check to prevent duplication of variable names in the same scope. If a name has already been used, the system could ask for confirmation, ensuring that names remain unique and clear.

Similar/related:

Also it can happen that a variable gets renamed but another one with the same name already exists in the scope, or the code gets split up in such a way that not all references are renamed.

Off the top of my head, if it's not easily solved via tweaking the prompt, you could potentially have it do a 'refinement/feedback round' where you provide feedback to the LLM + get it to correct the issues (which could be repeated up to X times) eg.

The following variables were already defined in the scope, so we can't use the names you chose. Please choose a new name for them so that they don't clash:

  • X
  • Y
  • Z

Originally posted by @0xdevalias in #8 (comment)

The proper fix would be to use toIdentifier from @babel/types

From a 2sec search I couldn't find rendered docs, but here's the relevant source for each:

toIdentifier definitely seems like a more robust approach than the current 'prefix with _' approach for sure.

Though I wonder if a 'proper fix' should also involve tweaking how we prompt for/filter the suggestions coming back from the LLM itself as well. Like instead of just forcing an invalid suggestion to be valid (with toIdentifier), we could detect that it's invalid (with isValidIdentifier) and then provide that feedback to the LLM, asking it to give a new suggestion; probably with some max retry limit; after which we could fall back to using the invalid suggestion run through toIdentifier, or log a warning and leave it un-renamed or similar.

Originally posted by @0xdevalias in #117 (comment)

I suppose the best solution could be that Humanify would retry with a prompt like:

Rename a function ..., here is a list of names that you cannot use: ${failedIdentifiers}

Originally posted by @jehna in #67 (comment)

Related context (from secondary comment on another issue):

Also it can happen that a variable gets renamed but another one with the same name already exists in the scope, or the code gets split up in such a way that not all references are renamed.
Maybe use babel's scope-aware rename that also handles these conflicts (have to find a way of getting rid of the _ prefix): astexplorer.net#/gist/7283141e13dab314521744603a95e9b7/05b11370db8d5ef257550b2916b87d6e72e4eb1d

Originally posted by @j4k0xb in #8

Originally posted by @0xdevalias in #67 (comment)


We should raise the limit to 25 characters, which would result in more descriptive names.

This sounds reasonable to me. Maybe we could also add an option/config to allow this to be specified/tweaked by individuals as needed.


Providing the model with additional context from different sections of the code where the variables are actually used would likely improve the accuracy of the variable names. (We could go through the ast around the declaration zone and take several pieces where the variable is modified and used.) This enhancement would allow the model to make more informed decisions based on actual usage patterns rather than relying solely on initial definitions.

This sounds like a cool approach. It sounds similar to what a lot of the good AI coding assistants do to provide context Eg. Heres a blog post about Aider implementing their 'repo map':

'Stack Graphs' is also another potentially interesting tool/tech in this space, that is part of what powers GitHub's 'precise code navigation':

A few notes/links/references I recently collated RE: stack graphs + related libs

Stack Graphs (an evolution of Scope Graphs) sound like they could be really interesting/useful with regards to code navigation, symbol mapping, etc. Perhaps we could use them for module identification, or variable/function identifier naming stabilisation or similar?

  • https://github.blog/changelog/2024-03-14-precise-code-navigation-for-typescript-projects/
    • Precise code navigation is now available for all TypeScript repositories.
      Precise code navigation gives more accurate results by only considering the set of classes, functions, and imported definitions that are visible at a given point in your code.

      Precise code navigation is powered by the stack graphs framework.
      You can read about how we use stack graphs for code navigation and visit the stack graphs definition for TypeScript to learn more.

      • https://github.blog/2021-12-09-introducing-stack-graphs/
        • Introducing stack graphs

        • Precise code navigation is powered by stack graphs, a new open source framework we’ve created that lets you define the name binding rules for a programming language using a declarative, domain-specific language (DSL). With stack graphs, we can generate code navigation data for a repository without requiring any configuration from the repository owner, and without tapping into a build process or other CI job.

        • LOTS of interesting stuff in this post..
        • As part of developing stack graphs, we’ve added a new graph construction language to Tree-sitter, which lets you construct arbitrary graph structures (including but not limited to stack graphs) from parsed CSTs. You use stanzas to define the gadget of graph nodes and edges that should be created for each occurrence of a Tree-sitter query, and how the newly created nodes and edges should connect to graph content that you’ve already created elsewhere.

        • Why aren’t we using the Language Server Protocol (LSP) or Language Server Index Format (LSIF)?

          To dig even deeper and learn more, I encourage you to check out my Strange Loop talk and the stack-graphs crate: our open source Rust implementation of these ideas.

  • https://docs.github.com/en/repositories/working-with-files/using-files/navigating-code-on-github
    • GitHub has developed two code navigation approaches based on the open source tree-sitter and stack-graphs library:

      • Search-based - searches all definitions and references across a repository to find entities with a given name
      • Precise - resolves definitions and references based on the set of classes, functions, and imported definitions at a given point in your code

      To learn more about these approaches, see "Precise and search-based navigation."

      • https://docs.github.com/en/repositories/working-with-files/using-files/navigating-code-on-github#precise-and-search-based-navigation
        • Precise and search-based navigation
          Certain languages supported by GitHub have access to precise code navigation, which uses an algorithm (based on the open source stack-graphs library) that resolves definitions and references based on the set of classes, functions, and imported definitions that are visible at any given point in your code. Other languages use search-based code navigation, which searches all definitions and references across a repository to find entities with a given name. Both strategies are effective at finding results and both make sure to avoid inappropriate results such as comments, but precise code navigation can give more accurate results, especially when a repository contains multiple methods or functions with the same name.

  • https://pl.ewi.tudelft.nl/research/projects/scope-graphs/
    • Scope Graphs | A Theory of Name Resolution

    • Scope graphs provide a new approach to defining the name binding rules of programming languages. A scope graph represents the name binding facts of a program using the basic concepts of declarations and reference associated with scopes that are connected by edges. Name resolution is defined by searching for paths from references to declarations in a scope graph. Scope graph diagrams provide an illuminating visual notation for explaining the bindings in programs.

Originally posted by @0xdevalias in 0xdevalias/chatgpt-source-watch#11


Edit: the '--contextSize' parameter is used to enlarge the program's context window. However, the parameter seems to have been declared as an optional Boolean and not as a valid optional integer (see:

.option(
"--contextSize",
"The context size to use for the LLM",
`${DEFAULT_CONTEXT_WINDOW_SIZE}`
)

).
Base on the commander.js lib documentation, it should be: .option( "-c, --contextSize <contextSize>", "The context size to use for the LLM",${DEFAULT_CONTEXT_WINDOW_SIZE})

It'd probably be better to raise this as a separate bug issue than to group it within this one; as it's more likely to be seen/fixed quicker that way.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants