-
-
Notifications
You must be signed in to change notification settings - Fork 17
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Simple diff analysis based on strings and identifiers #10
Comments
According to a random post I read on Hacker News, difftastic uses tree-sitter too, so I suppose it would be general enough for this use case |
@michaelskyba For the variable names like
@michaelskyba An easy way to connect from those strings back to the context in the code would definitely be interesting. Almost feels like something you could do with
I know @j4k0xb has done some cool stuff with using monaco as the web IDE for And I think @pionxzh 's As a first exploration/PoC, I would probably consider the contextual linking back to be out of scope though, as it has the ability to quickly become a complicated rabbithole/distraction from the core features.
@michaelskyba Mm, makes sense.
@michaelskyba Yeah, there are definitely lots of AST parsers out there. Technically you could probably use TreeSitter or similar for this. I'm curious what your reasoning for wanting to avoid using node is though? Pretty much all of the scripts and automation in this repo currently are currently written in node/JS/TS; for many use cases it's a pretty useful tool. If you want to go deep on this subject, there's heaps of stuff on my gists, eg.:
@michaelskyba Yup,
@michaelskyba If you're happy to use node and babel parser/etc, then I can definitely help you get started with it. There are already some examples within this repo (albeit not using traverse I don't think): There are a few 'AST Explorer' tools that are super useful when working with AST's as well: I would personally start by writing it as a standalone script that just writes to the console or a file, that way you can quickly 'prove the concept' and see if it will be beneficial. Then later on, if you wanted to enhance the UX (user experience) of using it by integrating it into a text editor / IDE / GUI / etc, that could be a path for further exploration. |
Here's a quick/dirty example that I hacked together with ChatGPT + some patterns from existing code. I haven't deeply tested it works properly, but from a quick skim through + basic test, it looks like it should probably work, or be close to doing so: Example String Extraction Code#!/usr/bin/env node
const fs = require('fs');
const path = require('path');
const parser = require('@babel/parser');
const traverse = require('@babel/traverse').default;
let DEBUG = false; // Initial debug state
function displayUsage(scriptName) {
console.log(`Usage: ${scriptName} <file-path>
Options:
--debug Enable debug mode.
-h, --help Display this usage information.`);
}
function parseCommandLineArguments() {
const scriptName = path.basename(process.argv[1]);
const args = process.argv.slice(2);
if (args.includes('-h') || args.includes('--help')) {
displayUsage(scriptName);
process.exit(0);
}
const debugFlagIndex = args.indexOf('--debug');
if (debugFlagIndex !== -1) {
DEBUG = true;
args.splice(debugFlagIndex, 1);
}
if (args.length !== 1) {
displayUsage(scriptName);
console.error('\nError: Invalid number of arguments provided.');
process.exit(1);
}
const filePath = args[0];
let code;
try {
code = fs.readFileSync(filePath, 'utf8');
} catch (err) {
console.error(`Error reading file: ${filePath}\n${err.message}`);
process.exit(1);
}
return { code };
}
function parseCodeToAST(code) {
return parser.parse(code, {
sourceType: 'module',
plugins: [
// Add any necessary plugins here
],
});
}
function extractStrings(ast) {
const strings = [];
traverse(ast, {
StringLiteral: ({ node }) => {
strings.push(node.value);
},
TemplateLiteral: ({ node }) => {
node.quasis.forEach((element) => {
strings.push(element.value.cooked);
});
},
});
return strings;
}
function main() {
const { code } = parseCommandLineArguments();
const ast = parseCodeToAST(code);
const strings = extractStrings(ast);
if (DEBUG) {
console.log('[DEBUG] Extracted Strings:', strings);
} else {
console.log(strings.join('\n'));
}
}
main(); The main parts to pay attention to are Example Output⇒ ./scripts/extract-js-strings.js unpacked/_next/static/chunks/webpack.js
define cannot be used indirect
object
function
object
static/chunks/
1092ac6e
9784a43c
queryString
c029043f
tsub-middleware
30750f44
303a24d8
420d684a
schemaFilter
auto-track
legacyVideos
65fd0ec4
sso
5be645cc
remoteMiddleware
ajs-destination
0d9e9cd9
b9f778a8
adb5c70d
.
660fa56f7d9f8841
f366c884f1a0069d
5c208462c867bb6d
11bb38d51776e36c
3ff460de12e80a25
1695899797a18685
c39541320838d2e1
7f0de228a3562cc0
35c9d0cb5854cadd
1aac68555afd4d6b
035f577202ad7345
fd18bd86faeaa7b0
e9a991b898301078
361ecd58fa0ac9f5
797361ea99a892a2
8596a5c250b2c4f7
b4746c7a1f512b5a
36f791c3f2dbf0ec
29104c892767faf2
415799798aaeeffc
04ab16a240f4ce97
32ab3d24800a1a80
3421eea06995d5df
0f0eee9f9d7a9a3a
df174924c6968fe4
c865be216348dd51
2f6ad9ce262d966a
4c4cb925af3442d0
c63200a5c100c707
46b523e60584f44f
739643ec58c60aab
2fad09c9cd727ecc
a08151d3efed095c
0c0ee75dab4b3a1d
f0c19e9fbc4a2762
1ce750acd0106d6e
b80d67713515dff2
d028af8c55d386c9
b41a5c7bc5017269
94b9b8a3298ce71f
352bbad25795ea42
60463b33da33b8b8
6ef5e476dad7ed16
2ae7c3fec20da7cb
73ef26c15091fb86
eecba8f575f628a2
0445a5d12846771b
9dc2787a40cce737
1a42259f0c3d91c1
6c7d9ca4e6244388
70979731ae1a472e
c3edeef63a58e362
270e8342cc6087e0
a671d3e99d9a3215
ee7e1452e33739e6
b0eb71c1fc6253b5
4b3fe5232d753e7c
2177f28b937c71b0
583ec3a0c8764140
018b8b2946dc8b20
10400e67574b5a9a
abf37f1db8be5c89
3a60974229840ffd
c682389becf401d9
f235c81abfebf195
6d3d8ed11b0e9709
367b294ff3d2e4a0
5fa4b187af7d697f
42809fbb8b2f331e
01c9e3f110f368ae
c6c0f36a9a82d8dd
fa6fc1112abad268
87e726622393959c
84cbf69a0bf382da
edae42c864b47324
bc4f77e47caaeac6
cbbca09eae978a72
79c8284e0121e7e4
d018f6e9a9ebea16
16a1a63dd7d6adcb
207481a6c8e707c7
5b368a3b1f211414
32ab39c72c4d4509
3b0e463179d66d5e
6abd0ae9c08306a9
36bb6cb6e7d3b8e4
4ed8f6b71d8bfe92
aead845693cd969c
1f34ff95cbccd78d
d9cd8bc88bf780e8
92961eb1ec331066
550cbb4354158933
.js
static/css/
f3d0bad461ffb80b
944a9aa5c00c7d0e
d92eaad6fd257c73
.css
object
return this
object
_N_E:
script
src
data-webpack
script
utf-8
nonce
data-webpack
/
anonymous
timeout
undefined
Module
__esModule
undefined
nextjs#bundler
https://cdn.oaistatic.com/_next/
link
stylesheet
text/css
load
load
missing
Loading CSS chunk
failed.
(
)
CSS_CHUNK_LOAD_FAILED
/
anonymous
link
data-href
href
stylesheet
style
data-href
load
missing
Loading chunk
failed.
(
:
)
ChunkLoadError
chunk- While the above example shows that the concept works, running it on a larger file ends up with a LOT more output content (eg. 57k lines in this case): ⇒ ./scripts/extract-js-strings.js unpacked/_next/static/chunks/pages/_app.js | wc -l
57025 Obviously most of that wouldn't change between builds though, so would end up being a lot less all in all. Edit: Committed the above basic script in 210a870 for future reference/usage. |
Something like this could also be useful for reducing the 'noise' of code that moves around:
|
Yeah, possibly; that's the kind of thing I was thinking could be useful as a text editor integration. But I think that would make sense as a wrapper around the main script. The output itself could maybe have the line numbers to the left of each finding, which would be parsed by the wrapper
I guess I just personally find the JS/npm ecosystem pretty annoying compared to e.g. Python or Go. It's not a dealbreaker though
Yeah, looks useful, I linked to it as part of my notes for the ChatGPT frontend
I still want to look at Tree-sitter in case I like it more, but this kind of API from function extractStrings(ast) {
const strings = [];
traverse(ast, {
StringLiteral: ({ node }) => {
strings.push(node.value);
},
TemplateLiteral: ({ node }) => {
node.quasis.forEach((element) => {
strings.push(element.value.cooked);
});
},
});
return strings;
} |
Looks like building .so language files from source is deprecated in py-tree-sitter, but there are no official python bindings for the javascript grammar. Welp, that's annoying... The third-party Go bindings in https://github.com/smacker/go-tree-sitter look more promising maybe |
From the testing I've done to this point, I think would prefer go-tree-sitter to Babel, so I'll continue developing the formatting here. It'd still be useful to have your extract-js-strings.js for reference, though. |
That sounds like it could indeed save time on skipping some changes, but I'd expect some minification diffs to be structurally different too, such that the same values are written differently, even if the actual name is the same. But I expect @0xdevalias would have a more informed perspective In other news, I got a basic byte tracker that checks for newlines, to map each name properly:
|
afaik minifiers should produce the exact same structure for unmodified regions of the source |
Oh, that's more convenient, then. Does the formatter (biome/prettier) have the same property? |
depends, they sometimes introduce additional line breaks, parentheses, " " around all properties if one changes, etc |
RE: #10 (comment)
@michaelskyba Yeah definitely. Having a script that is able to be run standalone, but that can also be wrapped into an IDE/other tooling for a nicer GUI is how I would look at approaching it myself. --
@michaelskyba Mm, fair enough. I find they all have their own issues and strengths, but don't tend to find one is objectively better than the other in most cases. Each to their own though. --
@michaelskyba Mm, makes sense. RE: #10 (comment)
@michaelskyba Nice! RE: #10 (comment)
@michaelskyba That's fair enough. Curious, aside from not liking node/etc, what aspects of RE: #10 (comment)
@j4k0xb Yeah, that is more or less the same concept I'm following with my diff minimiser; though in the current PoC version of it, I'm 'stabilising' the identifier names after running an initial diff on it, rather than doing so pre-diff. I still haven't cleaned it up + pushed the code here yet, but you can see some snippets of updates from when I was last working on it here: --
@j4k0xb If by 'get the diff positions' you mean the sections that are still diff after stabilising the names, then that is more or less how my diff minimiser script is doing things currently. It basically ingests a diff, processes the added/removed chunks with an AST parser to normalise any identifiers, and then if the 'identifier normalised' code is equal, then it removes that chunk from the diff that is output at the end. That way the identifiers are only ever changed within the minimiser check itself, but you still end up with the original names in the output. RE: #10 (comment)
@michaelskyba A lot of the time the majority of the code remains stable. Occasionally I'll see webpack chunks move around a bit, and/or change the chunk ID that is referring to them/etc, but I think that usually only happens when related code has actually changed in the build, and then that causes them to be reworked when they are minimised/etc. I pre-process the code with --
@michaelskyba Nice! RE: #10 (comment)
@j4k0xb That aligns with my thoughts/experience with them too. RE: #10 (comment)
@michaelskyba Generally speaking, yeah, they tend to be pretty 'stable' in that regard (with the slight caveat that I mentioned above RE: needing to sometimes run them twice) RE: #10 (comment)
@j4k0xb I don't think I've seen this happen, or if I have, it's been so rare that I don't remember it/notice it. The main times I have noticed an issue is when I need to run it twice because it failed to properly format everything the first time; and when I accidentally change the settings I am passing to it (like from using |
I didn't investigate Babel enough to have an even somewhat-informed opinion, but in my head, the interface that I wanted was to write a recursive function that would DFS through the tree, which is what I ended up being able to make easily with go-tree-sitter. From what I can tell just from seeing -- Something I've realized is that diffs of JavaScript files are further from valid JavaScript than I thought, such that it's impossible for tree-sitter to parse them out of the box. I think the way it would have to work is to extract the strings/identifiers of the new file and But I'd like to view them in the context of the diff rather than just the new file. Just by going to a line and being told that xxx is new, the context of the change isn't obvious. So maybe there's also a way to find, given a line L in the new file, which line in the diff shows L being added. I'd expect it to be more complicated than a
How does your AST parse the chunks of those diffs? For a diff like this: diff --git a/1.js b/2.js
index ad8dee1..923ec19 100644
--- a/1.js
+++ b/2.js
@@ -1,2 +1,2 @@
-console.log("bar \
+console.log("foo \
bar") , I don't see an obvious way to dodge the string quoting, besides mapping the changed lines back to the files they came from, which itself sounds difficult. |
@michaelskyba As far as I am aware, It doesn't inherently have to specify which type of node to run against, it could run for all of them (you can see the most basic example using
You may find that last section on AST traversal a useful read even if you're not using Babel specifically, as it's a pretty common pattern from what I have seen. Even with
@michaelskyba Different in what sense? Because the parts of the diff included aren't necessarily fully formed valid JavaScript?
@michaelskyba Not sure I am following here. Why do you feel this would need to be done; and/or can you demonstrate your thoughts with an example to help explain it?
@michaelskyba Also not sure I'm fully following your logic here. Maybe the same clarity from explaining the above will help me here too?
@michaelskyba What aspect of that are you asking about? The fact that the string is split over 2 lines, where one hasn't changed, and thus wouldn't be seen in the added/removed 'pair'; such that the code to be parsed would end up being just I guess my first question would be whether that is ever a case that would exist after being prettified with
After that, I did end up trying the lenient parser on it's own, which still had some issues even after adding the hacky fixes to it; and then I believe I trialled using just the tokenizer from the lenient parser, which from memory ended up working pretty well for my needs, since it was still able to parse the Though for your use case of extracting the strings, I don't think parsing the diffs is the right approach. I would probably just parse the raw JS, and dump the full lot of strings out to a .txt file; then I would diff that whole text file of the latest version, against the whole text file of the previous version. |
A lot of this will probably be uninteresting/useless noise, but I just made a repo and pushed all of my hacky WIP/PoC AST exploration scripts to it so they're captured in one place:
What might be of interest though, is that I pushed the current hacky state of my diff minimiser code there too. I haven't checked it for a good few weeks, so can't even remember if it's in a runnable state currently, but if nothing else, there might be some ideas hidden away in there that help explain what I was referring to above: When I get a chance to get back to them, I plan to finish my research/refinements on the best method(s), and then clean it all up back to a single useful script that I will commit to this repo. But until then, those may be of some interest to someone. |
Alright, yeah, Babel's system makes more sense now in the context of defining a visitor. I'd still prefer to use a general enter function and then have something like an if statement, but overall I could see it being more convenient than manual recursion. traverse(ast, {
enter(path) {
if (
path.node.type === "Identifier" &&
path.node.name === "n"
) {
path.node.name = "x";
}
}
}); I initially disliked tree-sitter's predicates when I first saw them but I've looked at some of the go.dev docs and maybe they would be cleaner, since I should still be able to track the source iteratively with Now that I think about it, if this is in the context of being added to the
Yes, that arbitrary segments taken from a JavaScript file won't be fully formed valid JavaScript by themselves. But also the surrounding diff syntax of e.g. diff --git a/original.js b/vendor-6a62821ae00843ea.js?dpl=576b52374bc8a5e2b6fbb183818b6924a40171a2
index 05cb4d0..57d05c3 100644
--- a/original.js
+++ b/vendor-6a62821ae00843ea.js?dpl=576b52374bc8a5e2b6fbb183818b6924a40171a2 That could be filtered out depending on the output structure, though.
Example: We have three version of a JS file: 0.js, 1.js, 2.js The idea I'm describing in that quote is to do the following:
This is the alternative to trying to first look at the diff between 0.js and 1.js or 1.js and 2.js, and then running the script on that diff as input. With the former idea of considering each file separately, the only input being passed to the AST program will be valid JavaScript, so it should always work as intended. The downside is that if you look into 2.txt and see e.g. "DynamicModelSwitcherEnabled", you'd want to see which part of the 1.js --> 2.js diff shows that being introduced, which 2.txt doesn't tell you directly.
Here, I'm talking about a potential solution to the downside I mentioned above. If 2.txt can report that "DynamicModelSwitcherEnabled" was found on line L in 2.js, it could be possible to look at the diff between 1.js and 2.js, and find where line L is added. Hopefully that makes a bit more sense?
Yes, which would be completely unparsable out of the box, as the rest of the code in the diff would be incorrectly considered a string. This was meant to be an example of how an AST parser can't pass a raw provided diff. If you mean to not parse the file itself but to split it up into each diff chunk, and then to parse the removed / added lines separately, then that could potentially work better. I haven't tried that with tree-sitter.
I don't think this exact case of a string being split across multiple lines occurs in the real code, but something similar happened when I was first testing that broke most of the parsing. I might check again later if the exact context would be useful.
I see, that should work, but it confirms that there is no simple, universal solution.
Yeah, that is essentially what I was thinking, except I wanted to only search for completely new entries. I'll see if I can make the string/identifier --> find insert line system work. |
@michaelskyba nods, yeah, that's fair. Deciding whether it's better to use a raw
@michaelskyba Quickly skimming through the docs on the using predicate S-expressions: For my own reference, it looks like you:
And then yeah, there seem to be heaps of useful things that I think this method is somewhat close to my 'named node type' visitor pattern that I used in my previous Babel example PoC.
@michaelskyba Yeah, if your goal is to include what you're working on in this repo, then matching the existing style of JS + Babel is ideal (unless there is a good reason to differ from it). Though if this just ends up being your own personal exploration for learning, or to create a tool in it's own repo, obviously you can choose whatever language/lib choices you want to use/explore.
@michaelskyba nods, makes sense. So in my WIP PoC diff minimiser that I recently uploaded (Ref) I first parse the diff format using
@michaelskyba Yup, makes sense.
@michaelskyba Yeah, that was the approach I took in my WIP PoC diff minimiser (Ref): A very rough walkthrough:
It's currently quite messy/hacky code, with multiple implementations scattered throughout as I was exploring different ways of doing things (eg. It looks like the main AST parsing/comparison logic is in You can also see the pre/post-processing steps I was taking to try and hack the 'arbitrary bits of JavaScript' into something that would parse in But after all of that, I think my ultimate conclusion (that I have yet to fully test/polish/turn into my final tool) was that I think I can get pretty good results, with minimal pre/post-hacking required, by just using
I need to play around with that more, and may end up going back to another library (eg. if I can use Babel's tokenizer or similar directly). I originally was looking at
@michaelskyba nods, fair, makes sense.
@michaelskyba Well, I guess in the end, depending on what you mean by 'universal solution', I think the tokenizer approach is likely pretty close to a 'universal solution'.
@michaelskyba Curious why you want to 'only search for completely new entries' at this stage of things within the AST parsing tool? I believe you'll get basically the same outcome (with simpler code) by embracing the unix philosophy (Ref) of combining tools. That way you can have the AST parsing script output all of the strings, and then just do a standard
@michaelskyba Curious to see your approach here if you do get it to work. (As just because the method I described above is how I believe would be simplest to implement it, doesn't mean that there isn't something i'm missing/unaware of/etc that would end up being an even better way)
@michaelskyba nods, right, that makes sense. Though I'm curious, how do you see the method I described above/previously comparing with a pattern like this (better/worse/etc?)? Where you basically:
If this '[filename]-strings.txt' file was committed alongside each Or if we didn't want to store those files in the git repo, we could still diff any arbitrary build version against another, but would just need to re-run the extraction script each time. I haven't tested this, so it may need slight tweaks, but this is the basic command ChatGPT drafted for doing this when I asked: diff \
<(git show commit1:path/to/example.js | node extract-strings.js) \
<(git show commit2:path/to/example.js | node extract-strings.js)
@michaelskyba Right, yup, I think I'm following now. Curious if the method I described above is a satisfactory solution to that, or if you would still be wanting to try and manually keep track of this sort of data between the 'pre-diffed' |
@michaelskyba I haven't tested this at all, so it might not be perfectly correct, but this is what ChatGPT drafted up based on the method I was talking about with using the predicate function of package main
import (
"fmt"
"io/ioutil"
"log"
"github.com/smacker/go-tree-sitter"
"github.com/smacker/go-tree-sitter/javascript"
)
func main() {
// Load your JavaScript code from a file
content, err := ioutil.ReadFile("your_file.js")
if err != nil {
log.Fatalf("Failed to read file: %v", err)
}
// Parse the content using tree-sitter
parser := sitter.NewParser()
parser.SetLanguage(javascript.GetLanguage())
// Create a tree from the content
tree := parser.Parse(nil, content)
defer tree.Delete()
// Prepare the query
queryString := `(string) @string`
// Compile the query
query, err := sitter.NewQuery([]byte(queryString), javascript.GetLanguage())
if err != nil {
log.Fatalf("Failed to create query: %v", err)
}
defer query.Delete()
// Create a query cursor
cursor := sitter.NewQueryCursor()
defer cursor.Delete()
// Execute the query
cursor.Exec(query, tree.RootNode())
// Iterate over the matches
for {
match, ok := cursor.NextMatch()
if !ok {
break
}
for _, capture := range match.Captures {
// Get the node for the current capture
node := capture.Node
// Print the content of the string literal
fmt.Println(node.Content(content))
}
}
} |
This isn't specifically related, but I described it poorly initially, so to clarify, I think identifiers could be useful indicators too, and not just strings, so I'm trying to extract both types.
Hmm, that's a good point. Using Go or tree-sitter at all is probably already a more significant departure than the actual style of manual recursion vs visitor, or if statement vs node type pattern. I might just continue storing it in mine for now, and then perhaps make a rewrite in the future if it would make sense.
That makes more sense now, and is pretty sophisticated. Despite the chunks being arbitrary segments, I'd expect it to perform significantly better out of the box than running the AST on the raw diff. The string split across multiple lines scenario I mentioned, for example, shouldn't ever happen, because it is only caused by the second file's chunk
being parsed as if it was part of the first:
Oh, I see, I previously misunderstood what the context and role of the tokenizer was. Then yeah, I retract that; it should be general enough. In the case of only doing the extraction, it still looks like it'd be easier to run an AST before the diff, though.
There I (ambiguously) meant the "search" in the context of the outer diff, not inside the same AST parsing program. I'm generally a fan of the Unix philosophy, and try to follow it when, for example, writing personal shell scripts. I was thinking of directly using a standard
All of those are pretty easy to solve, though. Maybe it'd make the most sense to only store the line-numbered versions of the entries of each file. Then, diff could still be used by removing the line numbers on the fly with something like
I think this part should be compatible with your method. That was just an idea for finding the context of an entry inside the js --> js diff.
Originally I was thinking of only storing new entries in each txt file, but I think you're right that having I think the line numbers would have to be stored in the pre-diffed txt files, or there'd be a lot of additional complexity for locating lines in the js --> js diff.
Yeah, something like this could also work. I guess the convenience will depend on how long the extractor will take to run. |
@michaelskyba Identifiers churning between builds are sort of specifically the thing that creates the noise we're trying to eliminate though..?
@michaelskyba That's fair :)
@michaelskyba nods it seems to work pretty well. Another alternative is to not do the text-diff first, and just do an 'AST diff' or similar (while ignoring identifier names changing), but I believe that would run a lot slower (or at least, the existing tools I tried definitely seemed to be slower)
@michaelskyba See comment above RE: this. Speed issues + laziness for not wanting to implement my own AST diff algorithm aside, that would probably be the 'ideal' 'pure' way to achieve this. I was trying to leverage a text based diff as a starting point for speed reasons.
@michaelskyba Of those, 1 would be almost trivial to handle as is; 2 would be slightly more annoying (though I also wonder how often that case would come up in reality, and if it needs to be 'solved for'); 3 it seems you already have some decent ideas on how to work around.
@michaelskyba Makes sense :)
@michaelskyba nods yeah, makes sense.
@michaelskyba nods yeah, exactly. I would assume it would run too slow and so be kind of annoying; but that could be improved by implementing a local cache (outside of the git repo) type thing such that for a given build, it only ever needs to run the actual parsing code once, and then would lookup from the cache after that (unless cleared). Another reason I would think about committing the intermediary files is then you can sort of 'interactively explore' them through GitHub's diff interface; which was also one of the reasons I started committing the builds themselves (though the noise of the identifier churn between them sort of makes it a bit unusable for that purpose currently unfortunately) |
I haven't looked at the diffs that much but from what I've seen, the churn only affects a limited number of the identifiers, such as the ones comprised of 1-2 characters like e.d(t, {
HV: function () {
return N;
},
N8: function () {
return S;
},
aP: function () {
return m;
},
do: function () {
return R;
}, This is why I wanted to be able to look at introduced identifiers that have never been seen in any previous JavaScript file version. That way, these common churn identifiers would be found and eliminated, and then we'd only start seeing the more interesting ones like
Interesting, I didn't think about a diff on the AST level. But yeah, I guess it isn't a common enough use case for there to have been a lot of work already out there on matching the efficiency of text diffs.
Number 2 is relevant to the identifier churn mentioned above. I would assume that something like
Yeah, if there's a way to store some piece of cache smaller than the full output that still significantly speeds up the execution time.
Yeah, somebody interested in some ~casual browsing would probably not want to clone the code and run the extractor themselves. Diffs between extraction output would probably be similarly difficult to look at directly, though. |
I rewrote it (no line tracking yet) to use the following query pattern:
Here is the diff between the extractions (| sort | uniq) for two vendor- .js files I tested: https://gist.github.com/michaelskyba/fb93861fe6c4880138a586716e4beb18 If I only look for added entries rather than the diff itself, it's only 8 lines, not including the source map line that prettier added and the $ rg -x -F -v -f 1713546501-cf14668063190b4c-uniq.txt 1713620903-c41f7bc7c59d56a3-uniq.txt
311:find_this_test_123
1396:"16 16 12 12 8 16"
9002:"&dpl=9133c19ee9ae73caaad3f40e3ac2f20dd10e0fdf"
9003:"?dpl=9133c19ee9ae73caaad3f40e3ac2f20dd10e0fdf"
13739:`headlessui-menu-button-${(0, p.M)()}`
13740:`headlessui-menu-item-${(0, p.M)()}`
13741:`headlessui-menu-items-${(0, p.M)()}`
18089:"M20.39 18.39A5 5 0 0 0 18 9h-1.26A8 8 0 1 0 3 16.3"
24471:Re4
27786://# sourceMappingURL=vendor-c41f7bc7c59d56a3.js.map (Edit: I realize now that sourceMappingURL is part of the original file and was not added by Prettier.) The diff between the actual JS files was 13622 lines and just looked like identifier churn to me. Snippet: @@ -140080,7 +139375,7 @@ WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE
return nS(e).getPropertyValue(t);
}
let nA = ["top", "right", "bottom", "left"];
- function nx(e, t, n) {
+ function nT(e, t, n) {
let r = {};
n = n ? "-" + n : "";
for (let i = 0; i < 4; i++) {
@@ -140089,14 +139384,14 @@ WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE
}
return (r.width = r.left + r.right), (r.height = r.top + r.bottom), r;
}
- let nT = (e, t, n) => (e > 0 || t > 0) && (!n || !n.shadowRoot);
+ let nx = (e, t, n) => (e > 0 || t > 0) && (!n || !n.shadowRoot);
function nC(e, t) {
let n, r;
let i = e.touches,
o = i && i.length ? i[0] : e,
{ offsetX: a, offsetY: s } = o,
l = !1;
- if (nT(a, s, e.target)) (n = a), (r = s);
+ if (nx(a, s, e.target)) (n = a), (r = s); |
@michaelskyba Mm, fair enough. There are definitely some that seem to not get minimised, but it's more like the majority of identifiers that get minimised (and thus churn) between builds. Make sense RE: the 'only new' aspect. Though you could probably also eliminate a lot of that noise (with minimal false positives) by filtering out short identifiers that match the patterns used in minification.
@michaelskyba nods true true. You can see some of the tools I explored in this issue: Such as:
@michaelskyba Yeah, absolutely would be.
@michaelskyba It wouldn't even necessarily need to be smaller than the full output. The key difference as I see it is whether it would be a file that you would track within the repo/etc; or more of a 'hidden away in the cache' thing. Obviously depending on your workflow/etc, you could just manually store the pre-calculated versions somewhere outside of a git repo anyway, and it would have a similar effect.
@michaelskyba Mm, to some degree probably yeah. Though I guess it depends how much of a 'casual browser' we're talking about, and how much we reduce the noise in whatever output format it is. My ideal goal would be to be able to re-write the identifiers to something stable (yet meaningful) between builds + be able to account for modules that 'move around', such that the diff shown on github (for the
@michaelskyba It might be more interesting (and yet way larger) to look at the
@michaelskyba Neat!
@michaelskyba Would be curious to see what that same diff would look like running through my diff minimiser (even in it's current non-polished form): Seemingly 104 lines of diff normally: ⇒ git diff 6d96656142aabcc862846ad7852422f0a8f14dbe:unpacked/_next/static/chunks/vendor.js 6eab9108d9ed3124b4ba757ee9f29e892082deb5:unpacked/_next/static/chunks/vendor.js | wc -l
104 And 66 after running through the minimiser: ⇒ git diff 6d96656142aabcc862846ad7852422f0a8f14dbe:unpacked/_next/static/chunks/vendor.js 6eab9108d9ed3124b4ba757ee9f29e892082deb5:unpacked/_next/static/chunks/vendor.js | ./scripts/ast-poc/diff-minimiser.js 2>/dev/null | wc -l
66 But a lot of those extra lines are context/diff boilerplate: Minimised Diff outputdiff --git a/unpacked/_next/static/chunks/vendor.js b/unpacked/_next/static/chunks/vendor.js
index dcd0d0a..14aa329 100644
--- a/unpacked/_next/static/chunks/vendor.js
+++ b/unpacked/_next/static/chunks/vendor.js
@@ -69965,7 +69965,7 @@
r +
"&q=" +
(i || 75) +
- "&dpl=fbb61498fbf3416255234adfde4320b0577c7e49"
+ "&dpl=46b7f37021d2244b5b4454b927e0735229de5626"
);
}
Object.defineProperty(t, "__esModule", { value: !0 }),
@@ -73156,7 +73156,7 @@ WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE
64302: function (e, t) {
"use strict";
function n() {
- return "?dpl=fbb61498fbf3416255234adfde4320b0577c7e49";
+ return "?dpl=46b7f37021d2244b5b4454b927e0735229de5626";
}
Object.defineProperty(t, "__esModule", { value: !0 }),
Object.defineProperty(t, "getDeploymentIdQueryOrEmptyString", {
@@ -135660,13 +135656,11 @@ WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE
if (
((this.options = this.#O.defaultQueryOptions(e)),
(0, r.VS)(n, this.options) ||
- this.#O
- .getQueryCache()
- .notify({
- type: "observerOptionsUpdated",
- query: this.#A,
- observer: this,
- }),
+ this.#O.getQueryCache().notify({
+ type: "observerOptionsUpdated",
+ query: this.#A,
+ observer: this,
+ }),
void 0 !== this.options.enabled &&
"boolean" != typeof this.options.enabled)
)
@@ -136602,13 +136596,11 @@ WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE
let t = this.options;
(this.options = this.#O.defaultMutationOptions(e)),
(0, s.VS)(t, this.options) ||
- this.#O
- .getMutationCache()
- .notify({
- type: "observerOptionsUpdated",
- mutation: this.#Y,
- observer: this,
- }),
+ this.#O.getMutationCache().notify({
+ type: "observerOptionsUpdated",
+ mutation: this.#Y,
+ observer: this,
+ }),
this.#Y?.setOptions(this.options);
}
onUnsubscribe() {
@@ -196309,4 +196301,4 @@ ${t.comment}`
},
},
]);
-//# sourceMappingURL=vendor-9a1d7ae664409493.js.map
+//# sourceMappingURL=vendor-4cb082c626a8af6f.js.map
If we wanted to then filter it down to just added the lines, then we get 13: ⇒ git diff 6d96656142aabcc862846ad7852422f0a8f14dbe:unpacked/_next/static/chunks/vendor.js 6eab9108d9ed3124b4ba757ee9f29e892082deb5:unpacked/_next/static/chunks/vendor.js | ./scripts/ast-poc/diff-minimiser.js 2>/dev/null | grep '^+' | grep -v '^+++' | wc -l
13 ⇒ git diff 6d96656142aabcc862846ad7852422f0a8f14dbe:unpacked/_next/static/chunks/vendor.js 6eab9108d9ed3124b4ba757ee9f29e892082deb5:unpacked/_next/static/chunks/vendor.js | ./scripts/ast-poc/diff-minimiser.js 2>/dev/null | grep '^+' | grep -v '^+++'
+ "&dpl=46b7f37021d2244b5b4454b927e0735229de5626"
+ return "?dpl=46b7f37021d2244b5b4454b927e0735229de5626";
+ this.#O.getQueryCache().notify({
+ type: "observerOptionsUpdated",
+ query: this.#A,
+ observer: this,
+ }),
+ this.#O.getMutationCache().notify({
+ type: "observerOptionsUpdated",
+ mutation: this.#Y,
+ observer: this,
+ }),
+//# sourceMappingURL=vendor-4cb082c626a8af6f.js.map Doing that same example for ⇒ git diff 6d96656142aabcc862846ad7852422f0a8f14dbe:unpacked/_next/static/chunks/pages/_app.js 6eab9108d9ed3124b4ba757ee9f29e892082deb5:unpacked/_next/static/chunks/pages/_app.js | wc -l
14051
⇒ git diff 6d96656142aabcc862846ad7852422f0a8f14dbe:unpacked/_next/static/chunks/pages/_app.js 6eab9108d9ed3124b4ba757ee9f29e892082deb5:unpacked/_next/static/chunks/pages/_app.js | ./scripts/ast-poc/diff-minimiser.js 2>/dev/null | wc -l
2901
⇒ git diff 6d96656142aabcc862846ad7852422f0a8f14dbe:unpacked/_next/static/chunks/pages/_app.js 6eab9108d9ed3124b4ba757ee9f29e892082deb5:unpacked/_next/static/chunks/pages/_app.js | ./scripts/ast-poc/diff-minimiser.js 2>/dev/null | grep '^+' | grep -v '^+++' | wc -l
1003 Personally, I find the 'only newly added' removes too much context and makes it harder to review; so that 2901 view is the type of one I would usually look at. It still has noise in it due to needing to refine the minimiser further; and there are some parts that are technically not noise (as I define it), that someone might find value in also tuning out, which could make it smaller again. But (aside from where huge blocks of code have moved, and currently lose their The diff I chose there was just the latest one though, and I have often seen |
Here's a fun little hack I just thought up (while the diff minimiser script still breaks
Currently this doesn't work fully, as I think I'm not properly updating the line metadata in the diff hunks when I've removed sections from them in the diff minimiser.. but something like this (in theory): ⇒ git diff 6d96656142aabcc862846ad7852422f0a8f14dbe:unpacked/_next/static/chunks/pages/_app.js 6eab9108d9ed3124b4ba757ee9f29e892082deb5:unpacked/_next/static/chunks/pages/_app.js | ./scripts/ast-poc/diff-minimiser.js 2>/dev/null | \
sed 's|a/unpacked/_next/static/chunks/pages/_app.js|a/_app.original.js|g' | \
sed 's|b/unpacked/_next/static/chunks/pages/_app.js|b/_app.patched.js|g' \
> _app.minimised.diff
⇒ git show 6d96656142aabcc862846ad7852422f0a8f14dbe:unpacked/_next/static/chunks/pages/_app.js > _app.original.js
⇒ git apply _app.minimised.diff
⇒ git -c color.diff.oldMoved='white dim' \
-c color.diff.oldMovedAlternative='white dim' \
-c color.diff.newMoved='white dim' \
-c color.diff.newMovedAlternative='white dim' \
-c color.diff.newMovedDimmed='white dim' \
-c color.diff.newMovedAlternativeDimmed='white dim' \
diff --ignore-blank-lines --color-moved=dimmed-zebra --color-moved-ws=ignore-all-space --minimal 6d96656142aabcc862846ad7852422f0a8f14dbe:unpacked/_next/static/chunks/pages/_app.js _app.patched.js Edit: I wrote a new ⇒ ./scripts/fix-diff-headers.js _app.minimised.diff _app.fixed.diff
Reading diff from _app.minimised.diff and writing corrected diff to _app.fixed.diff
Diff successfully corrected and written to _app.fixed.diff
⇒ patch _app.original.js _app.fixed.diff -o _app.patched.js
patching file _app.original.js
6 out of 90 hunks failed--saving rejects to _app.patched.js.rej
⇒ git -c color.diff.oldMoved='white dim' \
-c color.diff.oldMovedAlternative='white dim' \
-c color.diff.newMoved='white dim' \
-c color.diff.newMovedAlternative='white dim' \
-c color.diff.newMovedDimmed='white dim' \
-c color.diff.newMovedAlternativeDimmed='white dim' \
diff --ignore-blank-lines --color-moved=dimmed-zebra --color-moved-ws=ignore-all-space --minimal 6d96656142aabcc862846ad7852422f0a8f14dbe:unpacked/_next/static/chunks/pages/_app.js _app.patched.js That way, instead of just seeing a giant chunk of 'removed' and a giant chunk of 'added' when this module moved; we can see the dimmed lines which were moved (but otherwise unchanged), and spend less time/effort having to consider them: An even better version of this would be a Though the way I was originally intending on fixing/improving the diff minimiser script was to try just maintaining the ANSI colouring of the input lines, and outputting the lines with the same colouring at the end. |
This is an idea for a type of analysis of code diffs. This issue is just for tracking notes and ideas
Example input 1
Here the extraction might be
Of course, this gives you far less information than the original, but I think it could be a good trade-off in cases where you want to look at the diff a little bit but don't have time to see everything.
Example input 2
Since common generic ones like e, t, n, "", and "." would show up frequently in any context, they would have already been seen in the past, and therefore filtered out. You'd more focus on the e.g. "Use Up and Down to choose options", with some kind of convenient way to jump back to see it in-context in the code.
For input like the following:
, none of the names or strings would probably be new, and so you wouldn't see it at all. This is intended, because I can't gleam any conclusions from looking at it, and thus would prefer not to see it
Glenn's comments
https://twitter.com/_devalias/status/1770284997385277554
For large diffs I think it'd be a lot, but strings and names are a subset of the raw diff, so it should still be less work than a full manual analysis. The idea is to just visually filter through them until you see a name/string that looks interesting on their own, which could lead to something good in-context.
Haven't worked with Babel but some relevant docs seem to be
Are there other AST parsers too? Would something like TreeSitter work? I'd generally prefer to avoid node.js if it's not required
I haven't seen enough diffs to exactly anticipate how these would look like but there might be different solutions like color-moved that could work depending on how it goes
Related:
Yeah, I want to make a prototype and see if it will kind of work. I'm still not sure on the implementation, though; the most efficient system might be to integrate with a text editor, which makes it harder to be replicable
The text was updated successfully, but these errors were encountered: