Sanitization

Purpose

This is a review of how information is passed around into, inside of and out of poca and how that can fail due to

not-sanitized user input or feed input or
wrong assumptions about character encodings

By "not-sanitized" I here mean "possibly containing characters that are expressly forbidden in that context". An example would be using question marks in a filename on an NTFS filesystem.

Wrong assumptions about character encodings would typically be assuming the ability to en- or decode UTF-8 when no such ability exists, resulting in UnicodeDecodeErrors or wrong characters displayed. It can also be the inability to actually display unicode characters even if they are accepted and decoded correctly, see WSL.

Policy

Values

Unicode is a good thing 🇰🇷 🇿🇦 🇬🇷 Don't restrict the user or yourself to ASCII unless you have to ⚖️
Take responsibility only for your own f🍆🍆k-ups and help the user take responsibility for their 💩
Be best 🥰✨👍

Implementation

base_dir: Based solely on user input. Must be unchanging and legal. Errors are fatal. User is responsible for error.
sub_dir: Based solely on user input. Must be unchanging and legal. Errors are fatal to subscription. User is responsible for error.
filename/rename: Based partly on user input and partly on feed. 3 attempts at producing legal filename. Errors are not fatal, just results in new attempt. Poca is responsible for handling errors.

User input

poca.xml

Status: Not escaping & or < results in a fatal parser error. As do control characters.
Ideal: Same as status (though better documentation would be good)

The only escaping that needs to be done in poca.xml is what is necessary to preserve the XML itself. That means that two characters need escaping:

< or less than - to insert, use the code <
& or ampersand - to insert, use the code &

This only applies to entering text into poca.xml. Either character appearing in feeds (where they will also be escaped because feeds are also xml) is nothing to worry about. All other unicode characters with the exception of ascii control characters can be entered or copied in directly. This is valid:

<title>🎞️   &amp; 🍿</title>

and produces

drwxrwxr-x 2 mads mads 4,0K May 19 13:07 '🎞️  & 🍿'

base_dir

Status: If base_dir contains characters that are filesystem illegal, mkdirs fails for each subscription.
Ideal: base_dir is recognised as unusable once and fatally
Solution: Create base_dir, don't wait until subupgrade.
Policy: Demand a legal directory name (i.e. don't try to fix).

sub_dir (title/directory name)

Status: If sub_dir contains characters that are filesystem illegal, mkdirs fail for that subscription.
Ideal: As status but better error message. Annoyingly, ntfs-3g (or whatever is handling the mkdir) just replies "Invalid argument" which is next to useless.
Solution: Separate FileExistsError and OSError in files.py:110. Better error messages with suggestion that OSErrors are related to illegal characters.
Policy: As base_dir.

The SubUpdate thread class joins

base_dir (✔️ at this point) and
title of the subscription

to get a sub_dir.

config.subs() has previously tested that

the sub has a title element and
the element has a .text value.

So sub.title.text can be any value that has passed the xml parser smell test. That means any unicode character, including & and < if the user has used the proper escapes. What about:

empty strings: no, not in valid_subs
all whitespace: no, not in valid_subs
ascii control characters: no, causes parser errors.
forward slashes: "could not create ...." on linux.
*:? etc. on NTFS: "could not create ...." Note that < can be used with escapes but is an illegal character on NTFS.

We use whatever the user says to use and report failure when creating it.

metadata

Status: Works fine
Solution: Better documentation of behaviour

Any unicode character can be inserted into the metadata of

id2.3 (mutagen using utf-16 encoding)
id2.4 (mutagen using utf-8 encoding)
m4a (mutagen using ??? Only tested with one file and subscription but Nautilus shows it in the audio tab of the file)
Ogg comments (presumably - it's all utf-8, right?)

How well it works on players probably varies.

filters

Can there be issues with characters in filters? Regex protected characters? We do tell the user to expect regex... so... we cool?

Any possible interaction between XML escapes and regex special characters? No... Neither < nor & have special meaning in regex.

Is Python's re good with unicode? All of unicode? Can I search/match an emoji? Yes, apparently:

In [64]: flower_re = re.compile("🌸.+$")
In [65]: text = "This is a 🌸 cherry blossom"
In [66]: flower_re.findall(text)
Out[66]: ['🌸 cherry blossom']

poca-subscribe add

Do we need to do anything other than test whether it is a valid url?

Filenames

Status: Some scrubbing, but random. Lots of potential for filename crashes.
Ideal: Filename creation degrades gracefully and gradually until 0% risk of failure.
Solution: Regardless of rename option, three filename candidates are created for use.
Policy: Make it work. Don't hold user responsible for picking legal options that result in weird shit because feeds are unpredictable.

Current flow

The entryinfo package goes through three stages corresponding to the Feed, Combo and Wanted classes.

Feed: At this stage it's just the 'raw' dictionary fresh from feedparser. We still don't know if we are going to need it. Maybe we already know it (in jar) or it might get filtered or max_num'ed out. So: No point in doing anything other than counting it.
Combo: At this stage we find out if we've seen the entry before. Those we haven't get validated, i.e. we check if they have an enclosure. Those that don't gets marked as valid = False. This results in a list and dictionary of viable contenders for the status of...
Wanted: This is where the wheat gets separated from the chaff. Only once the wanted entries have been found are the relevant entryinfo's expanded. The restrictions are in chronological order:
- entries noted as user deleted gets removed from the list
- invalid entries (see Combo) gets removed
- filters are applied
- finally, the remaining are max_num limited

validate is simple: Find and enclosure, check that there is a valid url.

... And then for some reason find the filename even though it doesn't impact the validation (the valid value doesn't change even if we find nothing) Note that neither urlparse unquote or basename can really fail (a single character string passes the parsing with flying colors)

expand gathers a number of data from the entryinfo dictionary and makes them more easily available to the rest of the processing. With regards to filename, this is what happens:

The filename found in validate is split into a basename and an ext
user_vars is added as a dictionary, containing
- original_filename (basename) (twice for some reason?)
- subscription title (again, twice)
- episode_title (from the feed if any else a "missing" string)
- the uid
- published date (YYYY-MM-DD)
if rename is called for, basename is changed (by sending back and forth the entire dictionary)
is stdout does not use utf-8, basenamegets encoded as ascii with a flag to ignore errors and decoded (implicitly as utf-8) 🤦 I guess it works, sorta. Anything that doesn't fit into ASCII just gets discarded.
filenamegets reconstructed by joining basenameand ext.
abs_path is created as the joining of sub_dir and filename

Redesigned flow

~~Uniform UIDs using uuid.uuid5(url)~~ belongs in 2.0, not point release because of breakage. Note that abandoned branch 'filename_sanitization' contains the attempt to using uuid.
Grouping filename parsing into function: 1) from urlparse, 2) unquoting, 3) splitting base and ext
Grouping stats (length, size) into function
Get user_vars.
New function to generate tuple of four names using a) the original file name and b) a rename section if any exists. The function should generate an ideal filename and then degrade it twice to get options 2 and 3. ~~Option 4 is the uuid. It is to be preferred over falling back on the original filename as that can hypothetically have contained only forbidden characters. Unlikely but...~~
In the download function we generate and set abspath dynamically and return entryinfo with the successful value.

Example of filename degradation

permissive : 2021-04-21 How did sailors become pirates? (Sea of Thieves).mp3
ntfs : 2021-04-21 How did sailors become pirates (Sea of Thieves).mp3
restrictive : 2021-04-21HowdidsailorsbecomepiratesSeaofThieves.mp3
fallback : 56ea07c0.mp3

Fallback

We include a fallback option: a filename composed of a randomly generated eight byte hexdecimal integer (uuid.uuid4().hex[:8])

I think this is needed because of the possibility that the restrictive function reduces the filename to an empty string. Example:

name_base:   表外漢字
permissive:  表外漢字
ntfs:        表外漢字
restrictive: 
fallback:    6799e898

If we assume a filesystem that only accepts ascii filenames, both permissive and ntfs will fail. Should the internals assume ascii to decode a utf-8 encoded "表外漢字", the very first byte would be e8 which is out of range(128) and so cause an error. If they ignore errors we end up with an empty string which is not a legitimate filename on any filesystem.

Reducing to 8 characters also helps if a FAT filesystem is mounted using the msdos driver. Filenames longer than 8 characters can be used when reading and writing, but anything beyond the 8th is cut off anyway, so 6799e898ABC and 6799e898DEF would be identical. Bollocks to that, nobody who's on Python 3.x mounts a FAT filesystem using a driver fit for the 1980s. Fallback is used a) fot the above case and b) for autofixing things like acast.com naming all files media.mp3. the fallback option will be prefixed by a date, so files are chronologically alphabetized.

Output

We are not massively overhauling the output for 1.1 (save it for 2.0 if at all). Is the sys.stdout test sufficient? No, not if we count the WSL failure. Is that a special case that we code for? If so there is a solution here: https://www.scivision.dev/python-detect-wsl/

Template

Status:
Ideal:
Solution:
Policy:

Provide feedback

Saved searches

Use saved searches to filter your results more quickly