-
Notifications
You must be signed in to change notification settings - Fork 4
Sanitization
This is a review of how information is passed around into, inside of and out of poca and how that can fail due to
- not-sanitized user input or feed input or
- wrong assumptions about character encodings
By "not-sanitized" I here mean "possibly containing characters that are expressly forbidden in that context". An example would be using question marks in a filename on an NTFS filesystem.
Wrong assumptions about character encodings would typically be assuming the ability to en- or decode UTF-8 when no such ability exists, resulting in UnicodeDecodeErrors or wrong characters displayed. It can also be the inability to actually display unicode characters even if they are accepted and decoded correctly, see WSL.
Unicode is a good thing 🇰🇷 🇿🇦 🇬🇷 Don't restrict the user or yourself to ASCII unless you have to ⚖️
Take responsibility only for your own f🍆🍆k-ups and help the user take responsibility for their 💩
Be best 🥰✨👍
- base_dir: Based solely on user input. Must be unchanging and legal. Errors are fatal. User is responsible for error.
- sub_dir: Based solely on user input. Must be unchanging and legal. Errors are fatal to subscription. User is responsible for error.
- filename/rename: Based partly on user input and partly on feed. 3 attempts at producing legal filename. Errors are not fatal, just results in new attempt. Poca is responsible for handling errors.
- Status: Not escaping & or < results in a fatal parser error. As do control characters.
- Ideal: Same as status (though better documentation would be good)
The only escaping that needs to be done in poca.xml is what is necessary to preserve the XML itself. That means that two characters need escaping:
-
<
or less than - to insert, use the code<
-
&
or ampersand - to insert, use the code&
This only applies to entering text into poca.xml. Either character appearing in feeds (where they will also be escaped because feeds are also xml) is nothing to worry about. All other unicode characters with the exception of ascii control characters can be entered or copied in directly. This is valid:
<title>🎞️ & 🍿</title>
and produces
drwxrwxr-x 2 mads mads 4,0K May 19 13:07 '🎞️ & 🍿'
- Status: If base_dir contains characters that are filesystem illegal, mkdirs fails for each subscription.
- Ideal: base_dir is recognised as unusable once and fatally
- Solution: Create base_dir, don't wait until subupgrade.
- Policy: Demand a legal directory name (i.e. don't try to fix).
- Status: If sub_dir contains characters that are filesystem illegal, mkdirs fail for that subscription.
- Ideal: As status but better error message. Annoyingly, ntfs-3g (or whatever is handling the mkdir) just replies "Invalid argument" which is next to useless.
- Solution: Separate FileExistsError and OSError in files.py:110. Better error messages with suggestion that OSErrors are related to illegal characters.
- Policy: As base_dir.
The SubUpdate thread class joins
- base_dir (✔️ at this point) and
- title of the subscription
to get a sub_dir.
config.subs() has previously tested that
- the sub has a title element and
- the element has a .text value.
So sub.title.text can be any value that has passed the xml parser smell test. That means any unicode character, including & and < if the user has used the proper escapes. What about:
- empty strings: no, not in valid_subs
- all whitespace: no, not in valid_subs
- ascii control characters: no, causes parser errors.
- forward slashes: "could not create ...." on linux.
- *:? etc. on NTFS: "could not create ...." Note that
<
can be used with escapes but is an illegal character on NTFS.
We use whatever the user says to use and report failure when creating it.
- Status: Works fine
- Solution: Better documentation of behaviour
Any unicode character can be inserted into the metadata of
- id2.3 (mutagen using utf-16 encoding)
- id2.4 (mutagen using utf-8 encoding)
- m4a (mutagen using ??? Only tested with one file and subscription but Nautilus shows it in the audio tab of the file)
- Ogg comments (presumably - it's all utf-8, right?)
How well it works on players probably varies.
Can there be issues with characters in filters? Regex protected characters? We do tell the user to expect regex... so... we cool?
Any possible interaction between XML escapes and regex special characters? No... Neither <
nor &
have special meaning in regex.
Is Python's re
good with unicode? All of unicode? Can I search/match an emoji? Yes, apparently:
In [64]: flower_re = re.compile("🌸.+$")
In [65]: text = "This is a 🌸 cherry blossom"
In [66]: flower_re.findall(text)
Out[66]: ['🌸 cherry blossom']
Do we need to do anything other than test whether it is a valid url?
- Status: Some scrubbing, but random. Lots of potential for filename crashes.
- Ideal: Filename creation degrades gracefully and gradually until 0% risk of failure.
- Solution: Regardless of rename option, three filename candidates are created for use.
- Policy: Make it work. Don't hold user responsible for picking legal options that result in weird shit because feeds are unpredictable.
The entryinfo package goes through three stages corresponding to the Feed, Combo and Wanted classes.
- Feed: At this stage it's just the 'raw' dictionary fresh from feedparser. We still don't know if we are going to need it. Maybe we already know it (in jar) or it might get filtered or max_num'ed out. So: No point in doing anything other than counting it.
- Combo: At this stage we find out if we've seen the entry before. Those we haven't get validated, i.e. we check if they have an enclosure. Those that don't gets marked as
valid = False
. This results in a list and dictionary of viable contenders for the status of... - Wanted: This is where the wheat gets separated from the chaff. Only once the wanted entries have been found are the relevant entryinfo's expanded. The restrictions are in chronological order:
- entries noted as user deleted gets removed from the list
- invalid entries (see Combo) gets removed
- filters are applied
- finally, the remaining are max_num limited
validate
is simple: Find and enclosure, check that there is a valid url.
... And then for some reason find the filename even though it doesn't impact the validation (the valid value doesn't change even if we find nothing) Note that neither urlparse unquote or basename can really fail (a single character string passes the parsing with flying colors)
expand
gathers a number of data from the entryinfo dictionary and makes them more easily available to the rest of the processing. With regards to filename, this is what happens:
- The filename found in validate is split into a
basename
and anext
-
user_vars
is added as a dictionary, containing- original_filename (basename) (twice for some reason?)
- subscription title (again, twice)
- episode_title (from the feed if any else a "missing" string)
- the uid
- published date (YYYY-MM-DD)
- if rename is called for,
basename
is changed (by sending back and forth the entire dictionary) - is stdout does not use utf-8,
basename
gets encoded as ascii with a flag to ignore errors and decoded (implicitly as utf-8) 🤦 I guess it works, sorta. Anything that doesn't fit into ASCII just gets discarded. -
filename
gets reconstructed by joiningbasename
andext
. -
abs_path
is created as the joining ofsub_dir
andfilename
-
Uniform UIDs using uuid.uuid5(url)belongs in 2.0, not point release because of breakage. Note that abandoned branch 'filename_sanitization' contains the attempt to using uuid. - Grouping filename parsing into function: 1) from urlparse, 2) unquoting, 3) splitting base and ext
- Grouping stats (length, size) into function
- Get user_vars.
- New function to generate tuple of four names using a) the original file name and b) a rename section if any exists. The function should generate an ideal filename and then degrade it twice to get options 2 and 3.
Option 4 is the uuid. It is to be preferred over falling back on the original filename as that can hypothetically have contained only forbidden characters. Unlikely but... - In the download function we generate and set abspath dynamically and return entryinfo with the successful value.
permissive : 2021-04-21 How did sailors become pirates? (Sea of Thieves).mp3
ntfs : 2021-04-21 How did sailors become pirates (Sea of Thieves).mp3
restrictive : 2021-04-21HowdidsailorsbecomepiratesSeaofThieves.mp3
fallback : 56ea07c0.mp3
We include a fallback option: a filename composed of a randomly generated eight byte hexdecimal integer (uuid.uuid4().hex[:8]
)
I think this is needed because of the possibility that the restrictive
function reduces the filename to an empty string. Example:
name_base: 表外漢字
permissive: 表外漢字
ntfs: 表外漢字
restrictive:
fallback: 6799e898
If we assume a filesystem that only accepts ascii filenames, both permissive and ntfs will fail. Should the internals assume ascii to decode a utf-8 encoded "表外漢字", the very first byte would be e8 which is out of range(128) and so cause an error. If they ignore errors we end up with an empty string which is not a legitimate filename on any filesystem.
Reducing to 8 characters also helps if a FAT filesystem is mounted using the Bollocks to that, nobody who's on Python 3.x mounts a FAT filesystem using a driver fit for the 1980s. Fallback is used a) fot the above case and b) for autofixing things like acast.com naming all files media.mp3. the fallback option will be prefixed by a date, so files are chronologically alphabetized.msdos
driver. Filenames longer than 8 characters can be used when reading and writing, but anything beyond the 8th is cut off anyway, so 6799e898ABC and 6799e898DEF would be identical.
We are not massively overhauling the output for 1.1 (save it for 2.0 if at all). Is the sys.stdout test sufficient? No, not if we count the WSL failure. Is that a special case that we code for? If so there is a solution here: https://www.scivision.dev/python-detect-wsl/
- Status:
- Ideal:
- Solution:
- Policy: