-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for additional media types #18054
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really like this PR as adding a sensible 'fallback' that can work if ffprobe isn't installed. That's potentially really useful for making life easier for small admins.
but I'm a bit wary of some of the changes which stop relying on ffprobe when it is available, I trust ffprobe's correctness far more than a (potentially) non-exhaustive list of magic numbers. In the case when it isn't available, the provide numbers are great let's use them. But since a number of the functions still rely on ffprobe, we aren't removing the potential dependency, so we don't gain much with the ffprobe+magic number combination, unless it were to be an ffprobe-less fallback.
I've made a number of comments directly on the PR, I think this will be an improvement when it's merged!
lib/galaxy/datatypes/media.py
Outdated
if which("ffprobe"): | ||
metadata, streams = ffprobe(filename) | ||
return "mp4" in metadata["format_name"].split(",") | ||
return False | ||
return _get_file_format_from_magic_number(filename, "mp4") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there a reason to remove fallback here? (and other places) I think this would be better if you just replaced the return False
line, if we've got ffprobe, let's use it and feel certain we're detecting 100% correctly even in 'odd' cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, should we always look for ffprobe
and if it doesn't work check with the magic number?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also when using 'ffprobe' on 'mov' and 'mp4' files, it gives back the same result. One way to tell them apart is by looking at the magic number.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, should we always look for ffprobe and if it doesn't work check with the magic number?
Yes, not everyone is going to install ffmpeg, and your magic numbesr are a good, efficient fallback. (Unless any other devs want to chime in here?)
Also when using 'ffprobe' on 'mov' and 'mp4' files, it gives back the same result. One way to tell them apart is by looking at the magic number.
Thank you for pointing this out, you are correct. When I was looking further into it, I notice this output, ffmpeg bundles a lot of formats together on that line.
{
"format": {
"filename": "sample-5s.mp4",
"nb_streams": 2,
"nb_programs": 0,
"format_name": "mov,mp4,m4a,3gp,3g2,mj2",
"format_long_name": "QuickTime / MOV",
which is an interesting result. Probably due to this: https://en.wikipedia.org/wiki/QuickTime_File_Format#Relation_to_MP4
Which explains why ffmpeg will decode both of those with the same decoder (since it's an mp4 stream internally with slightly different header metadata), but will encode them as separate spec compliant containers.
E mov QuickTime / MOV
D mov,mp4,m4a,3gp,3g2,mj2 QuickTime / MOV
I wonder how much of a distinction we need to make here, for some relatively ancient formats, but probably some microscope/image tooling cares I guess.
As for implementation, I'm not sure what's the optimal implementation here. magic numbers can confirm it's actually one of the datatypes, but I still have some (perhaps unfounded) concern that they are not fully exhaustive/forward compatible of all the flavours of these datatypes that can be seen in the wild, so, instead if ffprobe says it's mp4, then we'll accept that.
Given how that magic numbers are used to spell ftypisom
/ ftypMSNV
, I think that worry is not unfounded, that they'll might eventually a new ftyp
that's also somehow an mp4 file.
So I think the optimal implementation looks like:
in mp4:
- if ffprobe says it's an mp4
- if magic numbers say it's a mov, return false
- otherwise it's probably an mp4
in mov
- if the magic numbers say it's a mov it's a mov.
and based on their ordering in the datatypes to prefer mp4 to mov/qtff (given that it's a more common format).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is the ffprobe better than the magic bits? The idea is to cut down on the number of subprocess calls on the headnodes, which is a good thing. So, should we use just magic bits if it is possible to eliminate ffprobe subprocess to fasten the process?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea is to cut down on the number of subprocess calls on the headnodes, which is a good thing
We currently only do 1 subprocess call (or we should only be doing 1) for any media file due to the caching in this function:
galaxy/lib/galaxy/datatypes/media.py
Lines 22 to 33 in d7673ab
@lru_cache(maxsize=128) | |
def _ffprobe(path): | |
return subprocess.run( | |
["ffprobe", "-loglevel", "quiet", "-show_format", "-show_streams", "-of", "json", path], capture_output=True | |
) | |
def ffprobe(path): | |
completed_process = _ffprobe(path) | |
completed_process.check_returncode() | |
data = json.loads(completed_process.stdout.decode("utf-8")) | |
return data["format"], data["streams"] |
so this will not achieve any improvement there. Did you have a metric which was showing a high number of subprocess calls? Perhaps the cache size needs to be larger (or configurable)
it would only speed it up if we completely eliminated the ffprobe calls in every media type, but, I think we will not since other metadata is fetched from ffprobe's output.
lib/galaxy/datatypes/media.py
Outdated
vp_check = any( | ||
stream["codec_name"] in ["vp8", "vp9"] for stream in streams if stream["codec_type"] == "video" | ||
) | ||
return _get_file_format_from_magic_number(filename, "mkv") and not vp_check |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this restriction makes sense for mkvs, they should be able to host vp8/9 encoded streams.
But this is also somewhat a general 'problem' with the datatypes in that we describe both containers and encodings, and we don't capture that metadata in galaxy... but on the other hand I'm not sure I've encountered many tools that are doing video processing and aren't using something like ffmpeg
/ libav
/ etc and would be able to handle different stream codecs within a container.
Do you have a reason to add this restriction here? If so, perhaps that would be best expressed as metadata of the file, and then annotate that metadata here, and whatever tools you use can make decisions based on that metadata.
Also I'm not sure of a benefit we get from this specific change; we're still doing the ffprobe
so we aren't doing any less computation via using magic numbers (i.e. the usual reason for adding them), are magic numbers more accurate than checking for matroska
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem is that for mkv
and webm
formats, they look the same to ffprobe
and have the same magic number. So the way that I found is that webm
video codec could only be vp8 or vp9, That is why I am using this approach. And for computation you are right it doesn't make any difference that is why I put the mkv and webm at the end of the sniffer tags in datatypes_conf.xml.sample
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, you are correct, webms are mkvs, just with restricted format lists. I guess there is precedent for datatypes that are subsets (bed)
But in this case we would potentially (wrongly?) detect a user's uploaded mkv
that happens to use vp8 as webm
format, which is an odd scenario to encounter.
I'm not sure what the best behaviour is here and I'd be happy to hear input from other folks who deal with video data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is also another way to find out the format in Python and it is with magic but I'm afraid it will slow the process.
>>> import magic
>>> f = magic.Magic(mime=True)
>>> f.from_file('testfile')
return False | ||
return _get_file_format_from_magic_number(filename, "mp3") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but yes cases like this, instead of return False
, just return _get_file_format_from_magic_number(filename, "mp3")
and keeping the ffprobe would potentially be good.
lib/galaxy/datatypes/media.py
Outdated
if which("ffprobe"): | ||
metadata, streams = ffprobe(filename) | ||
vp_check = any( | ||
stream["codec_name"] in ["vp8", "vp9"] for stream in streams if stream["codec_type"] == "video" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
webm also supports av1.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, should I check av1
, vp8
, and vp9
for webm
or is this approach entirely incorrect?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on #18054 (comment), I don't know.
Perhaps yes, check for those formats and if they're present it's a "webm". I think we could add a set_peek
method which says "WebM subset of MKV" (or a similar description) to communicate that to the user that it's still actually a matroshka container.
lib/galaxy/datatypes/media.py
Outdated
"66 74 79 70 69", | ||
"66 74 79 70 6D", | ||
"66 74 79 70 4D", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://en.wikipedia.org/wiki/List_of_file_signatures
66 74 79 70 69 73 6F 6D ftypisom 4 mp4 ISO Base Media file (MPEG-4)
66 74 79 70 4D 53 4E 56 ftypMSNV 4 mp4 MPEG-4 video file
66 74 79 70 68 65 69 63
66 74 79 70 6d ftypheic 4 heic High Efficiency Image Container (HEIC)
6d is heic, not sure if you want to lump this in with mp4? (I assume you based this on files you had/have seen). Also might want to consider extending these magic numbers to include the whole string to avoid future false positives
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wanted to grab m4v file format.
https://www.garykessler.net/library/file_sigs.html
[4 byte offset]
66 74 79 70 6D 70 34 32 [4 byte offset]
ftypmp42
M4V MPEG-4 video|QuickTime file
But, you are right it will be missed with the heic I will change it to 66 74 79 70 6D 70 34 32
.
Also, I will consider extending the magic numbers.
lib/galaxy/datatypes/media.py
Outdated
"66 74 79 70 4D", | ||
], | ||
}, | ||
"flv": {"offset": 0, "hex": ["46 4C 56 01"]}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also by wikipedia's list, the magic should just be FLV
, so 01
in the last byte might be too restrictive.
Since many of these are human readable strings (like ftypisom
) it might make sense for your implementation to support something like
"flv": {"offset": 0, "hex": ["46 4C 56 01"]}, | |
"flv": {"offset": 0, "string": ["FLV"]}, |
and then check the correct number of bytes based off of the length of the string or so, to not have to specify a 4th byte if it isn't relevant would reduce some of these long hex lists.
"00 00 01 B7", | ||
"00 00 01 B8", | ||
"00 00 01 B9", | ||
"00 00 01 BA", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are these all seen in practice? Wikipedia's list suggests it should be a much reduced set
00 00 01 BA
00 00 01 B3
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was using this source:
https://www.garykessler.net/library/file_sigs.html
00 00 01 Bx ....
MPEG, MPG MPEG video file
Trailer:
00 00 01 B7 (...·)
This PR was merged without a "kind/" label, please correct. |
This pull request adds support for Ogg, Webm, Mpeg, Mov, Avi, Wmv, and Wma media types.
It also refactors the media datatype sniffing logic in media.py using magic numbers.
How to test the changes?
License