Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for additional media types #18054

Merged
merged 14 commits into from
May 20, 2024

Conversation

arash77
Copy link
Collaborator

@arash77 arash77 commented Apr 25, 2024

This pull request adds support for Ogg, Webm, Mpeg, Mov, Avi, Wmv, and Wma media types.
It also refactors the media datatype sniffing logic in media.py using magic numbers.

How to test the changes?

  • I've included appropriate automated tests.
  • This is a refactoring of components with existing test coverage.
  • Instructions for manual testing are as follows:
    1. [add testing steps and prerequisites here if you didn't write automated tests covering all your changes]

License

  • I agree to license these and all my past contributions to the core galaxy codebase under the MIT license.

@github-actions github-actions bot added this to the 24.1 milestone Apr 25, 2024
Copy link
Member

@hexylena hexylena left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really like this PR as adding a sensible 'fallback' that can work if ffprobe isn't installed. That's potentially really useful for making life easier for small admins.

but I'm a bit wary of some of the changes which stop relying on ffprobe when it is available, I trust ffprobe's correctness far more than a (potentially) non-exhaustive list of magic numbers. In the case when it isn't available, the provide numbers are great let's use them. But since a number of the functions still rely on ffprobe, we aren't removing the potential dependency, so we don't gain much with the ffprobe+magic number combination, unless it were to be an ffprobe-less fallback.

I've made a number of comments directly on the PR, I think this will be an improvement when it's merged!

lib/galaxy/datatypes/media.py Show resolved Hide resolved
Comment on lines 201 to 290
if which("ffprobe"):
metadata, streams = ffprobe(filename)
return "mp4" in metadata["format_name"].split(",")
return False
return _get_file_format_from_magic_number(filename, "mp4")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a reason to remove fallback here? (and other places) I think this would be better if you just replaced the return False line, if we've got ffprobe, let's use it and feel certain we're detecting 100% correctly even in 'odd' cases.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, should we always look for ffprobe and if it doesn't work check with the magic number?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also when using 'ffprobe' on 'mov' and 'mp4' files, it gives back the same result. One way to tell them apart is by looking at the magic number.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, should we always look for ffprobe and if it doesn't work check with the magic number?

Yes, not everyone is going to install ffmpeg, and your magic numbesr are a good, efficient fallback. (Unless any other devs want to chime in here?)

Also when using 'ffprobe' on 'mov' and 'mp4' files, it gives back the same result. One way to tell them apart is by looking at the magic number.

Thank you for pointing this out, you are correct. When I was looking further into it, I notice this output, ffmpeg bundles a lot of formats together on that line.

{
    "format": {
        "filename": "sample-5s.mp4",
        "nb_streams": 2,
        "nb_programs": 0,
        "format_name": "mov,mp4,m4a,3gp,3g2,mj2",
        "format_long_name": "QuickTime / MOV",

which is an interesting result. Probably due to this: https://en.wikipedia.org/wiki/QuickTime_File_Format#Relation_to_MP4

Which explains why ffmpeg will decode both of those with the same decoder (since it's an mp4 stream internally with slightly different header metadata), but will encode them as separate spec compliant containers.

  E mov             QuickTime / MOV
 D  mov,mp4,m4a,3gp,3g2,mj2 QuickTime / MOV

I wonder how much of a distinction we need to make here, for some relatively ancient formats, but probably some microscope/image tooling cares I guess.

As for implementation, I'm not sure what's the optimal implementation here. magic numbers can confirm it's actually one of the datatypes, but I still have some (perhaps unfounded) concern that they are not fully exhaustive/forward compatible of all the flavours of these datatypes that can be seen in the wild, so, instead if ffprobe says it's mp4, then we'll accept that.

Given how that magic numbers are used to spell ftypisom / ftypMSNV, I think that worry is not unfounded, that they'll might eventually a new ftyp that's also somehow an mp4 file.

So I think the optimal implementation looks like:

in mp4:

  • if ffprobe says it's an mp4
    • if magic numbers say it's a mov, return false
    • otherwise it's probably an mp4
      in mov
  • if the magic numbers say it's a mov it's a mov.

and based on their ordering in the datatypes to prefer mp4 to mov/qtff (given that it's a more common format).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is the ffprobe better than the magic bits? The idea is to cut down on the number of subprocess calls on the headnodes, which is a good thing. So, should we use just magic bits if it is possible to eliminate ffprobe subprocess to fasten the process?

Copy link
Member

@hexylena hexylena May 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea is to cut down on the number of subprocess calls on the headnodes, which is a good thing

We currently only do 1 subprocess call (or we should only be doing 1) for any media file due to the caching in this function:

@lru_cache(maxsize=128)
def _ffprobe(path):
return subprocess.run(
["ffprobe", "-loglevel", "quiet", "-show_format", "-show_streams", "-of", "json", path], capture_output=True
)
def ffprobe(path):
completed_process = _ffprobe(path)
completed_process.check_returncode()
data = json.loads(completed_process.stdout.decode("utf-8"))
return data["format"], data["streams"]

so this will not achieve any improvement there. Did you have a metric which was showing a high number of subprocess calls? Perhaps the cache size needs to be larger (or configurable)

it would only speed it up if we completely eliminated the ffprobe calls in every media type, but, I think we will not since other metadata is fetched from ffprobe's output.

Comment on lines 272 to 275
vp_check = any(
stream["codec_name"] in ["vp8", "vp9"] for stream in streams if stream["codec_type"] == "video"
)
return _get_file_format_from_magic_number(filename, "mkv") and not vp_check
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this restriction makes sense for mkvs, they should be able to host vp8/9 encoded streams.

But this is also somewhat a general 'problem' with the datatypes in that we describe both containers and encodings, and we don't capture that metadata in galaxy... but on the other hand I'm not sure I've encountered many tools that are doing video processing and aren't using something like ffmpeg / libav / etc and would be able to handle different stream codecs within a container.

Do you have a reason to add this restriction here? If so, perhaps that would be best expressed as metadata of the file, and then annotate that metadata here, and whatever tools you use can make decisions based on that metadata.

Also I'm not sure of a benefit we get from this specific change; we're still doing the ffprobe so we aren't doing any less computation via using magic numbers (i.e. the usual reason for adding them), are magic numbers more accurate than checking for matroska?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is that for mkv and webm formats, they look the same to ffprobe and have the same magic number. So the way that I found is that webm video codec could only be vp8 or vp9, That is why I am using this approach. And for computation you are right it doesn't make any difference that is why I put the mkv and webm at the end of the sniffer tags in datatypes_conf.xml.sample.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you are correct, webms are mkvs, just with restricted format lists. I guess there is precedent for datatypes that are subsets (bed)

But in this case we would potentially (wrongly?) detect a user's uploaded mkv that happens to use vp8 as webm format, which is an odd scenario to encounter.

I'm not sure what the best behaviour is here and I'd be happy to hear input from other folks who deal with video data.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is also another way to find out the format in Python and it is with magic but I'm afraid it will slow the process.

>>> import magic
>>> f = magic.Magic(mime=True)
>>> f.from_file('testfile')

Comment on lines -243 to +313
return False
return _get_file_format_from_magic_number(filename, "mp3")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but yes cases like this, instead of return False, just return _get_file_format_from_magic_number(filename, "mp3") and keeping the ffprobe would potentially be good.

if which("ffprobe"):
metadata, streams = ffprobe(filename)
vp_check = any(
stream["codec_name"] in ["vp8", "vp9"] for stream in streams if stream["codec_type"] == "video"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

webm also supports av1.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, should I check av1, vp8, and vp9 for webm or is this approach entirely incorrect?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on #18054 (comment), I don't know.

Perhaps yes, check for those formats and if they're present it's a "webm". I think we could add a set_peek method which says "WebM subset of MKV" (or a similar description) to communicate that to the user that it's still actually a matroshka container.

Comment on lines 40 to 42
"66 74 79 70 69",
"66 74 79 70 6D",
"66 74 79 70 4D",
Copy link
Member

@hexylena hexylena Apr 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://en.wikipedia.org/wiki/List_of_file_signatures

66 74 79 70 69 73 6F 6D 	ftypisom 	4 	mp4 	ISO Base Media file (MPEG-4)
66 74 79 70 4D 53 4E 56 	ftypMSNV 	4 	mp4 	MPEG-4 video file 
66 74 79 70 68 65 69 63
66 74 79 70 6d 	ftypheic 	4 	heic 	High Efficiency Image Container (HEIC) 

6d is heic, not sure if you want to lump this in with mp4? (I assume you based this on files you had/have seen). Also might want to consider extending these magic numbers to include the whole string to avoid future false positives

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to grab m4v file format.
https://www.garykessler.net/library/file_sigs.html

[4 byte offset]
66 74 79 70 6D 70 34 32 [4 byte offset]
ftypmp42
M4V MPEG-4 video|QuickTime file

But, you are right it will be missed with the heic I will change it to 66 74 79 70 6D 70 34 32.
Also, I will consider extending the magic numbers.

"66 74 79 70 4D",
],
},
"flv": {"offset": 0, "hex": ["46 4C 56 01"]},
Copy link
Member

@hexylena hexylena Apr 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also by wikipedia's list, the magic should just be FLV, so 01 in the last byte might be too restrictive.

Since many of these are human readable strings (like ftypisom) it might make sense for your implementation to support something like

Suggested change
"flv": {"offset": 0, "hex": ["46 4C 56 01"]},
"flv": {"offset": 0, "string": ["FLV"]},

and then check the correct number of bytes based off of the length of the string or so, to not have to specify a 4th byte if it isn't relevant would reduce some of these long hex lists.

"00 00 01 B7",
"00 00 01 B8",
"00 00 01 B9",
"00 00 01 BA",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are these all seen in practice? Wikipedia's list suggests it should be a much reduced set

00 00 01 BA
00 00 01 B3 

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was using this source:
https://www.garykessler.net/library/file_sigs.html

00 00 01 Bx ....
MPEG, MPG MPEG video file
Trailer:
00 00 01 B7 (...·)

@mvdbeek mvdbeek requested a review from dannon May 14, 2024 14:34
@dannon dannon merged commit 967d39f into galaxyproject:dev May 20, 2024
51 of 52 checks passed
Copy link

This PR was merged without a "kind/" label, please correct.

@arash77 arash77 deleted the add_ogg_datatype_media branch May 21, 2024 07:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants