[archive,sqlite-] guess sqlite/tar/zip filetypes confidently #2617

midichef · 2024-11-29T20:20:54Z

This increases the likelihoods guessed when using a few of the guessers that inspect binary data. Right now guess_sqlite/tar/zip give an inferred likelihood of 1:

visidata/visidata/_open.py

Line 66 in 7e75dee

return sorted(filetypes, key=lambda r: -r.get('_likelihood', 1))[0]

After this PR they would give a likelihood of 10, in line with guess_xls, guessurl_mimetype, and guessurl_airtable.

As Visidata operates now, I don't think this PR makes any difference to filetype determination in practice. But it could if changes are made to the guessing algorithm later.

This PR also stops empty files from ever being guessed as tarfiles.

midichef · 2024-11-29T20:23:33Z

One change that would make this PR matter is lowering the priority of file extensions. Currently, file extensions override filetype guessing completely:

visidata/visidata/_open.py

Lines 121 to 124 in 7e75dee

    openfuncname = 'open_' + filetype
    openfunc = getattr(vd, openfuncname, vd.getGlobals().get(openfuncname))
    if not openfunc:
        opts = vd.guessFiletype(p)

Because of that override, guess_extension() never determines a filetype in practice: it never runs when the file extension already maps to an open function. Its implementation shows that it would assign a lower confidence to extension-based filetypes, using the relatively weak likelihood of 3:

visidata/visidata/_open.py

Lines 72 to 77 in 7e75dee

    def guess_extension(vd, path):
        # try auto-detect from extension
        ext = path.suffix[1:].lower()
        openfunc = getattr(vd, f'open_{ext}', vd.getGlobals().get(f'open_{ext}'))
        if openfunc:
            return dict(filetype=ext, _likelihood=3)

But if that override is ever removed (by removing lines 121-123), then this PR would improve file guessing.

For example, let's say we have a file with a .txt extension that contains a SQLite database. Right now such a file is opened as txt with no file guessing, because guess_extension() never runs, and that is true with or without this PR. But in the hypothetical world where the override is removed, then without this PR the file would be guessed as txt (likelihood: 3), not sqlite (likelihood: 1). With this PR, it would be guessed as sqlite (likelihood: 10).
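The four cases in this example can be checked with a small toy model. This is only a sketch, not VisiData code: `choose_filetype`, `known_exts`, and the `extension_override` flag are hypothetical names made up for illustration. The only parts taken from `_open.py` are the sort expression and the likelihood of 3 for extension guesses.

```python
def choose_filetype(ext, content_guesses, known_exts, extension_override=True):
    """Toy model of filetype selection (hypothetical, not VisiData's API)."""
    # Current behaviour: a recognized extension wins outright,
    # and the guessers are never consulted.
    if extension_override and ext in known_exts:
        return ext

    # Hypothetical behaviour: the extension is just one more guess,
    # with the weak likelihood of 3 that guess_extension() would return.
    candidates = list(content_guesses)
    if ext in known_exts:
        candidates.append(dict(filetype=ext, _likelihood=3))

    # Same rule as the quoted line from _open.py:
    # highest _likelihood wins, and a missing key counts as 1.
    return sorted(candidates, key=lambda r: -r.get('_likelihood', 1))[0]['filetype']


known_exts = {'txt', 'sqlite'}
before_pr = [dict(filetype='sqlite')]                  # inferred likelihood of 1
after_pr = [dict(filetype='sqlite', _likelihood=10)]   # likelihood after this PR

# With the override in place, the PR changes nothing.
today_before = choose_filetype('txt', before_pr, known_exts)
today_after = choose_filetype('txt', after_pr, known_exts)

# Without the override, the PR changes the result.
later_before = choose_filetype('txt', before_pr, known_exts, extension_override=False)
later_after = choose_filetype('txt', after_pr, known_exts, extension_override=False)

print(today_before, today_after, later_before, later_after)
```

Under these assumptions, only the last case picks sqlite, which matches the reasoning above.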

saulpw

Notably, a text file that starts with "SQLite format considerations" can't be opened. But as long as -f txt works in that case I guess it's fine. (Though I'm also wondering how much overhead these guessers will add to the startup time for every file).

midichef · 2024-11-29T23:53:08Z

Oh, good example. False positives would be something to keep in mind.

For sqlite files in particular, once it mattered, at that point we could switch to the more specific search string 'SQLite format 3\000'. It looks like 3 is the only version number ever used with this header format, and versions 2.0 and 2.1 had entirely different formats.
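A minimal sketch of the difference between the two checks, assuming the current guesser matches on a loose prefix such as `SQLite format` (that is an inference from the example above, not something verified against the code). The function names are hypothetical. The 16-byte string `SQLite format 3\000` is the header documented for SQLite 3 database files.

```python
import os
import sqlite3
import tempfile

SQLITE3_MAGIC = b'SQLite format 3\x00'  # first 16 bytes of a SQLite 3 database file


def looks_like_sqlite_loose(head: bytes) -> bool:
    # Assumed stand-in for a loose prefix check; prone to false positives.
    return head.startswith(b'SQLite format')


def looks_like_sqlite_strict(head: bytes) -> bool:
    # Stricter check against the full header string, including the NUL byte.
    return head.startswith(SQLITE3_MAGIC)


def read_head(path, n=16):
    with open(path, 'rb') as fp:
        return fp.read(n)


with tempfile.TemporaryDirectory() as tmpdir:
    # A real SQLite database, created with the standard library.
    db_path = os.path.join(tmpdir, 'data.txt')
    conn = sqlite3.connect(db_path)
    conn.execute('CREATE TABLE t (x INTEGER)')
    conn.commit()
    conn.close()

    # A plain text file that merely starts with similar words.
    txt_path = os.path.join(tmpdir, 'notes.txt')
    with open(txt_path, 'w') as fp:
        fp.write('SQLite format considerations\n')

    db_head = read_head(db_path)
    txt_head = read_head(txt_path)

print(looks_like_sqlite_loose(txt_head), looks_like_sqlite_strict(txt_head))
print(looks_like_sqlite_loose(db_head), looks_like_sqlite_strict(db_head))
```

Both checks read the same 16 bytes, so the stricter one should not add any overhead.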

But I don't see any need to switch the guessing design right now. Low overhead is good.

If visidata goes in the direction of advanced file guessing, another tool to keep in mind is python-magic.

[archive,sqlite-] guess sqlite/tar/zip filetypes confidently

12e8647

And never guess that empty files are tarfiles.

anjakefala added the waiting on maintainer label Nov 29, 2024

saulpw approved these changes Nov 29, 2024

saulpw approved these changes Nov 30, 2024

anjakefala merged commit a8e754a into saulpw:develop Dec 1, 2024
14 checks passed

midichef deleted the likely_binary_filetypes branch December 1, 2024 09:38
