Enhancing Variable Naming #181

Theoreb · 2024-10-24T14:57:24Z

First of all, I'd like to thank the team for developing such a valuable tool. The integration of humanify for JavaScript deobfuscation is incredibly useful, and I appreciate the work that has been done.

However, using the local version of humanify (with the 2b model), I've encountered a few issues that could be resolved to improve the tool's functionality, especially when working with large, complex codebases.

Repeated variable names

In large code bases with many variables, humanify frequently assigns the same name to different variables, leading to confusion and reduced readability - especially when multiple variables are renamed in generic terms such as _______variable.

We should introduce a check to prevent duplication of variable names in the same scope. If a name has already been used, the system could ask for confirmation, ensuring that names remain unique and clear.

Limiting the length of variable names

Currently, variable names are limited to 12 characters, which can have the effect of truncating the names, making them more difficult to understand.

We should raise the limit to 25 characters, which would result in more descriptive names. In addition, we should ensure that names are valid in ASCII and avoid arbitrary word cuts for better readability.
If the model cannot give a concise enough name, we can add a condition to explicitly ask it to give a shorter name.

Insufficient Context for Variable Naming

When variables are defined in isolation (e.g., in very large blocks of declarations), the model struggles to assign meaningful names due to the lack of context:

`
var context = this;
var counter;
var generateCodes;
var d = 0;
var k = 0;
var g = 0;
var e = 8;
var b = -4;
var h = 0;
var n = 2;
var m = 0;
var q = 0.5;
var r = 0;
var s = 15;
var l = [];
var v = 99999;
var C;
var z;
var A = 0;
var G = false;
var J = false;
var K = false;

`

Providing the model with additional context from different sections of the code where the variables are actually used would likely improve the accuracy of the variable names. (We could go through the ast around the declaration zone and take several pieces where the variable is modified and used.) This enhancement would allow the model to make more informed decisions based on actual usage patterns rather than relying solely on initial definitions.

Edit: the '--contextSize' parameter is used to enlarge the program's context window. However, the parameter seems to have been declared as an optional Boolean and not as a valid optional integer (see:

humanify/src/commands/local.ts

Lines 24 to 28 in 706108e

    
           .option( 
        
             "--contextSize", 
        
             "The context size to use for the LLM", 
        
             `${DEFAULT_CONTEXT_WINDOW_SIZE}` 
        
           )

).

Base on the commander.js lib documentation, it should be:
.option( "-c, --contextSize <contextSize>", "The context size to use for the LLM",${DEFAULT_CONTEXT_WINDOW_SIZE})

Thank you for considering these suggestions. I look forward to future improvements.

The text was updated successfully, but these errors were encountered:

0xdevalias · 2024-10-25T01:41:18Z

We should introduce a check to prevent duplication of variable names in the same scope. If a name has already been used, the system could ask for confirmation, ensuring that names remain unique and clear.

Similar/related:

Also it can happen that a variable gets renamed but another one with the same name already exists in the scope, or the code gets split up in such a way that not all references are renamed.

Off the top of my head, if it's not easily solved via tweaking the prompt, you could potentially have it do a 'refinement/feedback round' where you provide feedback to the LLM + get it to correct the issues (which could be repeated up to X times) eg.

The following variables were already defined in the scope, so we can't use the names you chose. Please choose a new name for them so that they don't clash:

X

Y

Z

Originally posted by @0xdevalias in #8 (comment)

The proper fix would be to use toIdentifier from @babel/types

From a 2sec search I couldn't find rendered docs, but here's the relevant source for each:

isValidIdentifier: https://github.com/babel/babel/blob/a2025d7d695b0f3b0506f66225d6b15bcfa1cf6a/packages/babel-types/src/validators/isValidIdentifier.ts#L7-L25

toIdentifier: https://github.com/babel/babel/blob/a2025d7d695b0f3b0506f66225d6b15bcfa1cf6a/packages/babel-types/src/converters/toIdentifier.ts#L4-L26

toIdentifier definitely seems like a more robust approach than the current 'prefix with _' approach for sure.

Though I wonder if a 'proper fix' should also involve tweaking how we prompt for/filter the suggestions coming back from the LLM itself as well. Like instead of just forcing an invalid suggestion to be valid (with toIdentifier), we could detect that it's invalid (with isValidIdentifier) and then provide that feedback to the LLM, asking it to give a new suggestion; probably with some max retry limit; after which we could fall back to using the invalid suggestion run through toIdentifier, or log a warning and leave it un-renamed or similar.

Originally posted by @0xdevalias in #117 (comment)

I suppose the best solution could be that Humanify would retry with a prompt like:
Rename a function ..., here is a list of names that you cannot use: ${failedIdentifiers}
Originally posted by @jehna in #67 (comment)

Related context (from secondary comment on another issue):

Also it can happen that a variable gets renamed but another one with the same name already exists in the scope, or the code gets split up in such a way that not all references are renamed.
Maybe use babel's scope-aware rename that also handles these conflicts (have to find a way of getting rid of the _ prefix): astexplorer.net#/gist/7283141e13dab314521744603a95e9b7/05b11370db8d5ef257550b2916b87d6e72e4eb1d

Originally posted by @j4k0xb in #8

Originally posted by @0xdevalias in #67 (comment)

We should raise the limit to 25 characters, which would result in more descriptive names.

This sounds reasonable to me. Maybe we could also add an option/config to allow this to be specified/tweaked by individuals as needed.

Providing the model with additional context from different sections of the code where the variables are actually used would likely improve the accuracy of the variable names. (We could go through the ast around the declaration zone and take several pieces where the variable is modified and used.) This enhancement would allow the model to make more informed decisions based on actual usage patterns rather than relying solely on initial definitions.

This sounds like a cool approach. It sounds similar to what a lot of the good AI coding assistants do to provide context Eg. Heres a blog post about Aider implementing their 'repo map':

https://aider.chat/2023/10/22/repomap.html

'Stack Graphs' is also another potentially interesting tool/tech in this space, that is part of what powers GitHub's 'precise code navigation':

Explore whether stack graphs may be useful in this tool Aider-AI/aider#534

A few notes/links/references I recently collated RE: stack graphs + related libs

Stack Graphs (an evolution of Scope Graphs) sound like they could be really interesting/useful with regards to code navigation, symbol mapping, etc. Perhaps we could use them for module identification, or variable/function identifier naming stabilisation or similar?

https://github.blog/changelog/2024-03-14-precise-code-navigation-for-typescript-projects/

Precise code navigation is now available for all TypeScript repositories.
Precise code navigation gives more accurate results by only considering the set of classes, functions, and imported definitions that are visible at a given point in your code.

Precise code navigation is powered by the stack graphs framework.
You can read about how we use stack graphs for code navigation and visit the stack graphs definition for TypeScript to learn more.

https://github.blog/2021-12-09-introducing-stack-graphs/

Introducing stack graphs

Precise code navigation is powered by stack graphs, a new open source framework we’ve created that lets you define the name binding rules for a programming language using a declarative, domain-specific language (DSL). With stack graphs, we can generate code navigation data for a repository without requiring any configuration from the repository owner, and without tapping into a build process or other CI job.

LOTS of interesting stuff in this post..

As part of developing stack graphs, we’ve added a new graph construction language to Tree-sitter, which lets you construct arbitrary graph structures (including but not limited to stack graphs) from parsed CSTs. You use stanzas to define the gadget of graph nodes and edges that should be created for each occurrence of a Tree-sitter query, and how the newly created nodes and edges should connect to graph content that you’ve already created elsewhere.

https://github.com/tree-sitter/tree-sitter-graph

tree-sitter-graph
The tree-sitter-graph library defines a DSL for constructing arbitrary graph structures from source code that has been parsed using tree-sitter.

https://marketplace.visualstudio.com/items?itemName=tree-sitter.tree-sitter-graph

tree-sitter-graph support for VS Code
This language extension for VS Code provides syntax support for tree-sitter-graph files.

Why aren’t we using the Language Server Protocol (LSP) or Language Server Index Format (LSIF)?

To dig even deeper and learn more, I encourage you to check out my Strange Loop talk and the stack-graphs crate: our open source Rust implementation of these ideas.

https://github.com/github/stack-graphs

Stack graphs
The crates in this repository provide a Rust implementation of stack graphs, which allow you to define the name resolution rules for an arbitrary programming language in a way that is efficient, incremental, and does not need to tap into existing build or program analysis tools.

https://docs.rs/stack-graphs/latest/stack_graphs/

https://github.com/github/stack-graphs/tree/main/languages

This directory contains stack graphs definitions for specific languages.

https://github.com/github/stack-graphs/tree/main/languages/tree-sitter-stack-graphs-javascript

tree-sitter-stack-graphs definition for JavaScript
This project defines tree-sitter-stack-graphs rules for JavaScript using the tree-sitter-javascript grammar.

The command-line program for tree-sitter-stack-graphs-javascript lets you do stack graph based analysis and lookup from the command line.

cargo install --features cli tree-sitter-stack-graphs-javascript

tree-sitter-stack-graphs-javascript index SOURCE_DIR

tree-sitter-stack-graphs-javascript status SOURCE_DIR

tree-sitter-stack-graphs-javascript query definition SOURCE_PATH:LINE:COLUMN

https://github.com/github/stack-graphs/tree/main/languages/tree-sitter-stack-graphs-typescript

tree-sitter-stack-graphs definition for TypeScript
This project defines tree-sitter-stack-graphs rules for TypeScript using the tree-sitter-typescript grammar.

The command-line program for tree-sitter-stack-graphs-typescript lets you do stack graph based analysis and lookup from the command line.

https://dcreager.net/talks/2021-strange-loop/

Redirects to https://dcreager.net/talks/stack-graphs/

Incremental, zero-config Code Navigation using stack graphs.

In this talk I’ll describe stack graphs, which use a graphical notation to define the name binding rules for a programming language. They work equally well for dynamic languages like Python and JavaScript, and for static languages like Go and Java. Our solution is fast — processing most commits within seconds of us receiving your push. It does not require setting up a CI job, or tapping into a project-specific build process. And it is open-source, building on the tree-sitter project’s existing ecosystem of language tools.

https://www.youtube.com/watch?v=l2R1PTGcwrE

"Incremental, zero-config Code Nav using stack graphs" by Douglas Creager

https://media.dcreager.net/dcreager-strange-loop-2021-slides.pdf

https://media.dcreager.net/dcreager-2022-ucsc-lsd-slides.pdf

https://docs.github.com/en/repositories/working-with-files/using-files/navigating-code-on-github

GitHub has developed two code navigation approaches based on the open source tree-sitter and stack-graphs library:

Search-based - searches all definitions and references across a repository to find entities with a given name

Precise - resolves definitions and references based on the set of classes, functions, and imported definitions at a given point in your code

To learn more about these approaches, see "Precise and search-based navigation."

https://docs.github.com/en/repositories/working-with-files/using-files/navigating-code-on-github#precise-and-search-based-navigation

Precise and search-based navigation
Certain languages supported by GitHub have access to precise code navigation, which uses an algorithm (based on the open source stack-graphs library) that resolves definitions and references based on the set of classes, functions, and imported definitions that are visible at any given point in your code. Other languages use search-based code navigation, which searches all definitions and references across a repository to find entities with a given name. Both strategies are effective at finding results and both make sure to avoid inappropriate results such as comments, but precise code navigation can give more accurate results, especially when a repository contains multiple methods or functions with the same name.

https://pl.ewi.tudelft.nl/research/projects/scope-graphs/

Scope Graphs | A Theory of Name Resolution

Scope graphs provide a new approach to defining the name binding rules of programming languages. A scope graph represents the name binding facts of a program using the basic concepts of declarations and reference associated with scopes that are connected by edges. Name resolution is defined by searching for paths from references to declarations in a scope graph. Scope graph diagrams provide an illuminating visual notation for explaining the bindings in programs.

Originally posted by @0xdevalias in 0xdevalias/chatgpt-source-watch#11

Edit: the '--contextSize' parameter is used to enlarge the program's context window. However, the parameter seems to have been declared as an optional Boolean and not as a valid optional integer (see:

humanify/src/commands/local.ts

Lines 24 to 28 in 706108e

.option(

"--contextSize",

"The context size to use for the LLM",

`${DEFAULT_CONTEXT_WINDOW_SIZE}`

)

).
Base on the commander.js lib documentation, it should be: .option( "-c, --contextSize <contextSize>", "The context size to use for the LLM",${DEFAULT_CONTEXT_WINDOW_SIZE})

It'd probably be better to raise this as a separate bug issue than to group it within this one; as it's more likely to be seen/fixed quicker that way.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Enhancing Variable Naming #181

Enhancing Variable Naming #181

Theoreb commented Oct 24, 2024 •

edited

Loading

0xdevalias commented Oct 25, 2024

Enhancing Variable Naming #181

Enhancing Variable Naming #181

Comments

Theoreb commented Oct 24, 2024 • edited Loading

Repeated variable names

Limiting the length of variable names

Insufficient Context for Variable Naming

0xdevalias commented Oct 25, 2024

Theoreb commented Oct 24, 2024 •

edited

Loading