Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[archive,sqlite-] guess sqlite/tar/zip filetypes confidently #2617

Merged
merged 1 commit into from
Dec 1, 2024

Conversation

midichef
Copy link
Contributor

This increases the likelihoods guessed when using a few of the guessers that inspect binary data. Right now guess_sqlite/tar/zip give an inferred likelihood of 1:

return sorted(filetypes, key=lambda r: -r.get('_likelihood', 1))[0]

After this PR they would give a likelihood of 10, in line with guess_xls, guessurl_mimetype and, guessurl_airtable.

As Visidata operates now, I don't think this PR makes any difference to filetype determination in practice. But it could if changes are made to the guessing algorithm later.

And never guess that empty files are tarfiles.
@midichef
Copy link
Contributor Author

midichef commented Nov 29, 2024

One such change that would make this PR matter, would be lowering the priority of file extensions. Currently, file extensions override filetype guessing completely:

visidata/visidata/_open.py

Lines 121 to 124 in 7e75dee

openfuncname = 'open_' + filetype
openfunc = getattr(vd, openfuncname, vd.getGlobals().get(openfuncname))
if not openfunc:
opts = vd.guessFiletype(p)

Because of that override, guess_extension() never determines a filetype in practice. It never actually runs when there is a file extension. Its implementation details show it would assign a lower confidence to extension-based filetypes, using the relatively weak likelihood of 3:
def guess_extension(vd, path):
# try auto-detect from extension
ext = path.suffix[1:].lower()
openfunc = getattr(vd, f'open_{ext}', vd.getGlobals().get(f'open_{ext}'))
if openfunc:
return dict(filetype=ext, _likelihood=3)

But if that override is ever removed (by removing lines 121-123), then this PR would improve file guessing.

For example, let's say we have a file with a .txt extension, containing a SQLite database. Right now such a file is txt with no file guessing, because guess_extension() never runs. And it's txt with or without this PR. But in the hypothetical world where the override is removed, then without this PR, the file would be guessed as txt (likelihood: 3), not sqlite (likelihood: 1). With this PR, it would be guessed as sqlite (likelihood:10).

Copy link
Owner

@saulpw saulpw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Notably, a text file that starts with "SQLite format considerations" can't be opened. But as long as -f txt works in that case I guess it's fine. (Though I'm also wondering how much overhead these guessers will add to the startup time for every file).

@midichef
Copy link
Contributor Author

Oh, good example. False positives would be something to keep in mind.

For sqlite files in particular, once it mattered, at that point we could switch to the more specific search string 'SQLite format 3\000'. It looks like 3 is the only version number ever used with this header format, and versions 2.0 and 2.1 had entirely different formats.

But I don't see any need to switch the guessing design right now. Low overhead is good.

If visidata goes in the direction of advanced file guessing, another tool to keep in mind is python-magic.

@anjakefala anjakefala merged commit a8e754a into saulpw:develop Dec 1, 2024
14 checks passed
@midichef midichef deleted the likely_binary_filetypes branch December 1, 2024 09:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants