Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[$1000] Web - Inconsistent parsing behavior on copying and editing an invalid URL with HTTP protocol #23535

Closed
1 of 6 tasks
kbecciv opened this issue Jul 25, 2023 · 48 comments
Closed
1 of 6 tasks
Assignees
Labels
Bug Something is broken. Auto assigns a BugZero manager. Daily KSv2 Internal Requires API changes or must be handled by Expensify staff

Comments

@kbecciv
Copy link

kbecciv commented Jul 25, 2023

If you haven’t already, check out our contributing guidelines for onboarding and email contributors@expensify.com to request to join our Slack channel!


Action Performed:

  1. Open https://staging.new.expensify.com/
  2. Send this message to a chat : http://_
  3. Wait for a while and try to edit the message with "Edit comment". Examine the message parsed
  4. Copy the message with using "Copy to Clipboard" and paste it on the message textbox. Examine the message displayed in the textbox

Expected Result:

Either the URL isn't parsed at all (since it doesn't have the valid TLD and isn't a reserved name (like localhost for example)), or the behavior on both editing and upon copying should be consistent

Actual Result:

he message before edit and copied message is different (Message before edit message shown as "http://_", copied message shown as "http://_")

Workaround:

Unknown

Platforms:

Which of our officially supported platforms is this issue occurring on?

  • Android / native
  • Android / Chrome
  • iOS / native
  • iOS / Safari
  • MacOS / Chrome / Safari
  • MacOS / Desktop

Version Number: 1.3.46.7
Reproducible in staging?: n/a
Reproducible in production?: n/a
If this was caught during regression testing, add the test name, ID and link from TestRail:
Email or phone of affected tester (no customers):
Logs: https://stackoverflow.com/c/expensify/questions/4856
Notes/Photos/Videos: Any additional supporting documentation

2nd.-.Invalid.URL.Parsing.mp4

Expensify/Expensify Issue URL:
Issue reported by: @kerupuksambel
Slack conversation: https://expensify.slack.com/archives/C049HHMV9SM/p1689596021560619

View all open jobs on GitHub

Upwork Automation - Do Not Edit
  • Upwork Job URL: https://www.upwork.com/jobs/~01e96ccfaa92b9a019
  • Upwork Job ID: 1683921325168218112
  • Last Price Increase: 2023-08-15
@kbecciv kbecciv added Daily KSv2 Needs Reproduction Reproducible steps needed Bug Something is broken. Auto assigns a BugZero manager. labels Jul 25, 2023
@melvin-bot
Copy link

melvin-bot bot commented Jul 25, 2023

Triggered auto assignment to @isabelastisser (Bug), see https://stackoverflow.com/c/expensify/questions/14418 for more details.

@melvin-bot
Copy link

melvin-bot bot commented Jul 25, 2023

Bug0 Triage Checklist (Main S/O)

  • This "bug" occurs on a supported platform (ensure Platforms in OP are ✅)
  • This bug is not a duplicate report (check E/App issues and #expensify-bugs)
    • If it is, comment with a link to the original report, close the issue and add any novel details to the original issue instead
  • This bug is reproducible using the reproduction steps in the OP. S/O
    • If the reproduction steps are clear and you're unable to reproduce the bug, check with the reporter and QA first, then close the issue.
    • If the reproduction steps aren't clear and you determine the correct steps, please update the OP.
  • This issue is filled out as thoroughly and clearly as possible
    • Pay special attention to the title, results, platforms where the bug occurs, and if the bug happens on staging/production.
  • I have reviewed and subscribed to the linked Slack conversation to ensure Slack/Github stay in sync

@samh-nl
Copy link
Contributor

samh-nl commented Jul 25, 2023

Proposal

Please re-state the problem that we are trying to solve in this issue.

Inconsistent parsing behavior on copying and editing an invalid URL with HTTP protocol

What is the root cause of that problem?

The frontend recognizes the invalid URL, however subsequently it gets replaced by the backend.
There is an inconsistency in what is considered a valid URL between the frontend and backend.

What changes do you think we should make in order to solve the problem?

We should correct the regular expression for detecting URLs to make it consistent with the backend.

What alternative solutions did you explore? (Optional)

@omarirfan68
Copy link

omarirfan68 commented Jul 25, 2023

Proposal

Please re-state the problem that we are trying to solve in this issue.

Web - Inconsistent parsing behavior on copying and editing an invalid URL with HTTP protocol

What is the root cause of that problem?

The root cause of the problem is to add there regular expression like that

function ChatMessage({ message }) {
  const isValidURL = /^(?:(?:https?|ftp):\/\/)?[\w-]+(?:\.[\w-]+)+(?:[\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?$/i.test(message);

  return (
    <div>
      {isValidURL ? (
        <a href={message} target="_blank" rel="noopener noreferrer">
          {message}
        </a>
      ) : (
        message
      )}
    </div>
  );
}

What changes do you think we should make in order to solve the problem?

We can make changes in ReportActionCompose.js to make it work properly after adding a regular expression to check the link!

What alternative solutions did you explore? (Optional)

@isabelastisser isabelastisser added External Added to denote the issue can be worked on by a contributor and removed Needs Reproduction Reproducible steps needed labels Jul 25, 2023
@melvin-bot melvin-bot bot changed the title Web - Inconsistent parsing behavior on copying and editing an invalid URL with HTTP protocol [$1000] Web - Inconsistent parsing behavior on copying and editing an invalid URL with HTTP protocol Jul 25, 2023
@melvin-bot
Copy link

melvin-bot bot commented Jul 25, 2023

Job added to Upwork: https://www.upwork.com/jobs/~01e96ccfaa92b9a019

@melvin-bot melvin-bot bot added the Help Wanted Apply this label when an issue is open to proposals by contributors label Jul 25, 2023
@melvin-bot
Copy link

melvin-bot bot commented Jul 25, 2023

Current assignee @isabelastisser is eligible for the External assigner, not assigning anyone new.

@melvin-bot
Copy link

melvin-bot bot commented Jul 25, 2023

Triggered auto assignment to Contributor-plus team member for initial proposal review - @abdulrahuman5196 (External)

@melvin-bot melvin-bot bot added the Overdue label Jul 27, 2023
@jjcoffee
Copy link
Contributor

jjcoffee commented Jul 28, 2023

Proposal

Please re-state the problem that we are trying to solve in this issue.

URLs with protocol added (e.g. https://) are not autolinked consistently with URLs that do not have the protocol added.

What is the root cause of that problem?

A "loose" regex was introduced by this PR in order to handle localhost-style URLs. The PR's implementation applies the loose regex to any URL which has a protocol (e.g. https://)):

const LOOSE_URL_WEBSITE_REGEX = `${URL_PROTOCOL_REGEX}([-\\w]+(\\.[-\\w]+)*)(?:\\:${ALLOWED_PORTS}|\\b|(?=_))`;

As you can see URL_PROTOCOL_REGEX is used to require a protocol (e.g. https://), but the domain matching regex is different from URL_WEBSITE_REGEX (which is the "strict" regex used for all other URLs). This is where the incosistency stems, as the loose domain regex explicitly allows incorrect values, e.g. a starting underscore (via [-\\w]), or a starting hyphen, whereas URL_WEBSITE_REGEX explicitly disallows the same:

const URL_WEBSITE_REGEX = `${URL_PROTOCOL_REGEX}?((?:www\\.)?[a-z0-9](?:[-a-z0-9]*[a-z0-9])?\\.)+(?:${TLD_REGEX})(?:\\:${ALLOWED_PORTS}|\\b|(?=_))`;

This results in invalid URLs being briefly shown as links by FE (as it parses them as valid links), before the BE strips the URL from the link (as it is invalid).

What changes do you think we should make in order to solve the problem?

We don't really need a completely separate regex for these localhost-style URLs, as it results in quite a bit of code duplication for little benefit and makes it more difficult to maintain than one single regex.

When we're talking about localhost-style URLs, we're really just wanting to turn almost any URL with a protocol into a clickable link, regardless of TLD presence/validity. The exceptions would include invalid domains, such as those that start or end in a hyphen, or invalid characters like _ as in this issue.

We can therefore adjust the regex in URL_WEBSITE_REGEX to handle URLs that have a protocol but no TLD, which we can do by reusing our existing domain regex that matches BE behaviour well, but splitting it out into its own variable so that we can rework it for matching localhost-style URLs:

const URL_DOMAIN_REGEX = '(?:www\\.)?[a-z0-9](?:[-a-z0-9]*[a-z0-9])?';
const URL_WEBSITE_REGEX = `(?:${URL_PROTOCOL_REGEX}?(${URL_DOMAIN_REGEX}\\.)+(?:${TLD_REGEX})|${URL_PROTOCOL_REGEX}(${URL_DOMAIN_REGEX}(?:\\.[a-z0-9])?)+)(?:\\:${ALLOWED_PORTS}|\\b|(?=_))`;

The second part is the one that handles localhost-style URLs, and just adds a bit of regex to prevent ending a URL with a ..

We then just need to tweak sanitizeURL to update this line to handle matching a localhost-style URL (match[6] represents the whole match-group for the second half of the above regex):

const website = match[3] || match[6] ? match[2] : `https://${match[2]}`;

This passes all existing tests here, as well as the following additional tests:

expect(regexToTest.test('http://my.localhost.local-domain')).toBeTruthy();
expect(regexToTest.test('http://-localhost')).toBeFalsy();
expect(regexToTest.test('http://_')).toBeFalsy();
expect(regexToTest.test('http://_localhost')).toBeFalsy();
expect(regexToTest.test('http://-example.com')).toBeFalsy();
expect(regexToTest.test('http://example-.com')).toBeFalsy();
expect(regexToTest.test('http://my.localhost....local-domain:8080')).toBeFalsy();
expect(regexToTest.test('http://my.localhost.')).toBeFalsy();

What alternative solutions did you explore? (Optional)

We can also just keep the loose regex to support localhost-style URLs, but tighten it back up so that it still blocks incorrect values, such as leading (and ending) hyphens or underscores. This is easy to do by just reusing the domain part of URL_WEBSITE_REGEX, so that we get:

const LOOSE_URL_WEBSITE_REGEX = `${URL_PROTOCOL_REGEX}([a-z0-9](?:[-a-z0-9]*[a-z0-9])?\\.?)+(?:\\:${ALLOWED_PORTS}|\\b|(?=_))`;

@melvin-bot
Copy link

melvin-bot bot commented Jul 28, 2023

@isabelastisser, @abdulrahuman5196 Uh oh! This issue is overdue by 2 days. Don't forget to update your issues!

@abdulrahuman5196
Copy link
Contributor

Reviewing today

@melvin-bot melvin-bot bot removed the Overdue label Jul 31, 2023
@Antasel
Copy link
Contributor

Antasel commented Jul 31, 2023

Proposal

Please re-state the problem that we are trying to solve in this issue.

The string (http://_) is parsed as Url, but it doesn't have the valid TLD.

What is the root cause of that problem?

LOOSE_URL_WEBSITE_REGEX is allowing underscore(_) which is not valid on domain name. Additionally for domain label, starting and ending with hyphen is not allowed, but the regex validates those cases

What changes do you think we should make in order to solve the problem?

To fix the root cause, we should tweak a bit to block underscore, and starting and ending with hyphen on LOOSE_URL_WEBSITE_REGEX.

while doing that, we can combine URL_WEBSITE_REGEX and LOOSE_URL_WEBSITE_REGEX into one regex pattern, so it makes maintaining consistency rather than maintaining 2 regex of URL validation.
we can do it simply using OR( | ) , and introduce a variable(DOMAIN_LABEL_REGEX ) to make regex pattern easy readable.

a more tweak is to add total/each label 's length limitation. According to RFC1034, each label is zero to 63 octets in length, and the total number of octets that represent a domain name is limited to 255.

const DOMAIN_LABEL_REGEX = '[a-z0-9](?:[-a-z0-9]{0,61}[a-z0-9])?';
const URL_WEBSITE_REGEX = `(?:${URL_PROTOCOL_REGEX}(?:(?![a-z0-9-.]{256})(${DOMAIN_LABEL_REGEX}\\.)*(?:(${DOMAIN_LABEL_REGEX})))|${URL_PROTOCOL_REGEX}?(?:(?![a-z0-9-.]{256})(${DOMAIN_LABEL_REGEX}\\.)+(?:${TLD_REGEX})))(?:\\:${ALLOWED_PORTS}|\\b|(?=_))`

and remove capturing parenthesis on MARKDOWN_URL_REGEX. or we can remove variable MARKDOWN_URL_REGEX and directly use URL_REGEX.

const MARKDOWN_URL_REGEX = `${URL_REGEX}`;

it's tested with following cases

expect(regexToTest.test('http://my.localhost.local-domain')).toBeTruthy();
expect(regexToTest.test('http://-localhost')).toBeFalsy();
expect(regexToTest.test('http://_')).toBeFalsy();
expect(regexToTest.test('http://_localhost')).toBeFalsy();
expect(regexToTest.test('http://-example.com')).toBeFalsy();
expect(regexToTest.test('http://example-.com')).toBeFalsy();
expect(regexToTest.test('http://my.localhost....local-domain:8080')).toBeFalsy();
expect(regexToTest.test('http://my.localhost.')).toBeFalsy();
expect(regexToTest.test('http://mylocalhostaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa63chars')).toBeTruthy();
expect(regexToTest.test('http://mylocalhostaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa64chars')).toBeFalsy();
expect(regexToTest.test('http://mylocalhostaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa63chars.mylocalhostaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa63chars.mylocalhostaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa63chars.mylocalhostaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa.total255length')).toBeTruthy();
expect(regexToTest.test('http://mylocalhostaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa63chars.mylocalhostaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa63chars.mylocalhostaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa63chars.mylocalhostaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa.total256length')).toBeFalsy();

and does not make original autotest cases failed.

@abdullah1111
Copy link

Proposal

Please re-state the problem that we are trying to solve in this issue.

String "http://_" get parsed as Url but it doesn't have the valid TLD so it should be processed as the text

What is the root cause of that problem?

In the lib expensify-common has expensifyMark.js which process the comment based on regex rules on these rules "autolink" rules regex is accepting this as valid url and processing it
const regex = new RegExp(`(?![^<]*>|[^<>]*<\\/)([_*~]*?)${MARKDOWN_URL_REGEX}\\1(?!((?:(?!<a).)+)?<\\/a>|[^<]*(<\\/pre>|<\\/code>))`,'gi',);

What changes do you think we should make in order to solve the problem?

We have update the regular expression to check for the TLD as well
so we have to make following changes
Replace this (((((ht|f)tps?:\\/\\/)([\\w]+(\\.[-\\w]+)*) with (((ht|f)tps?:\/\/)(([-\w]+(\.[-\w]+)*\.[\w]*)|localhost)
I have also add the localhost validation as well

@melvin-bot
Copy link

melvin-bot bot commented Aug 1, 2023

📣 @abdullah1111! 📣
Hey, it seems we don’t have your contributor details yet! You'll only have to do this once, and this is how we'll hire you on Upwork.
Please follow these steps:

  1. Get the email address used to login to your Expensify account. If you don't already have an Expensify account, create one here. If you have multiple accounts (e.g. one for testing), please use your main account email.
  2. Get the link to your Upwork profile. It's necessary because we only pay via Upwork. You can access it by logging in, and then clicking on your name. It'll look like this. If you don't already have an account, sign up for one here.
  3. Copy the format below and paste it in a comment on this issue. Replace the placeholder text with your actual details.
    Screen Shot 2022-11-16 at 4 42 54 PM
    Format:
Contributor details
Your Expensify account email: <REPLACE EMAIL HERE>
Upwork Profile Link: <REPLACE LINK HERE>

@abdullah1111

This comment was marked as off-topic.

@melvin-bot
Copy link

melvin-bot bot commented Aug 1, 2023

✅ Contributor details stored successfully. Thank you for contributing to Expensify!

@melvin-bot
Copy link

melvin-bot bot commented Aug 1, 2023

📣 It's been a week! Do we have any satisfactory proposals yet? Do we need to adjust the bounty for this issue? 💸

@isabelastisser
Copy link
Contributor

Hey @abdulrahuman5196, can you please review the proposals? Thanks!

@isabelastisser
Copy link
Contributor

Hi @abdulrahuman5196, please review the proposals so we can move this along.

@melvin-bot melvin-bot bot added Overdue and removed Overdue labels Aug 16, 2023
@isabelastisser
Copy link
Contributor

Hey @abdulrahuman5196, can you please review the proposal so we can move this along? Thanks!

@melvin-bot melvin-bot bot removed the Overdue label Aug 21, 2023
@melvin-bot
Copy link

melvin-bot bot commented Aug 22, 2023

@isabelastisser @abdulrahuman5196 this issue is now 4 weeks old and preventing us from maintaining WAQ, can you:

  • Decide whether any proposals currently meet our guidelines and can be approved as-is today
  • If no proposals meet that standard, please take this issue internal and treat it as one of your highest priorities
  • If you have any questions, don't hesitate to start a discussion in #expensify-open-source

Thanks!

@melvin-bot melvin-bot bot added Internal Requires API changes or must be handled by Expensify staff and removed External Added to denote the issue can be worked on by a contributor Help Wanted Apply this label when an issue is open to proposals by contributors labels Aug 22, 2023
@melvin-bot
Copy link

melvin-bot bot commented Aug 22, 2023

Current assignee @abdulrahuman5196 is eligible for the Internal assigner, not assigning anyone new.

@isabelastisser
Copy link
Contributor

Decide whether any proposals currently meet our guidelines and can be approved as-is today

@abdulrahuman5196, can you please review the updated proposal today? Thanks!

@abdulrahuman5196
Copy link
Contributor

Reviewing now

@abdulrahuman5196
Copy link
Contributor

I need to spend more time reviewing if this doesn't bring back the original localhost issue and keeping the new one strict. I am little worried on this, so will test and update here @jjcoffee

The new regex is in two halves (split by the OR (|) operator), each of which handles the two types of validation

@Antasel I not sure if I could consider the proposal as new one, its almost similar to @jjcoffee 's proposal, added length limitation according to [RFC1034](https://datatracker.ietf.org/doc/html/rfc1034) this could be added advantage but that is not the core issue.

@jjcoffee
Copy link
Contributor

jjcoffee commented Aug 24, 2023

@abdulrahuman5196 Take your time! Let me know if you need anything more from me 🙂

Side-note: BE doesn't have any length limitations when it filters URLs so it's more of a nice-to-have to add it on FE. Localhost URLs also don't have length limitations so the test case in @Antasel's proposal is technically incorrect, so this URL:

http://mylocalhostaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa63chars.mylocalhostaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa63chars.mylocalhostaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa63chars.mylocalhostaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa.total256length

works in Chrome.

@abdullah1111
Copy link

Hi @jjcoffee Im not sure what you are taking about, have you taged me by mistake?

@Antasel
Copy link
Contributor

Antasel commented Aug 24, 2023

Side-note: BE doesn't have any length limitations when it filters URLs so it's more of a nice-to-have to add it on FE. Localhost URLs also don't have length limitations so the test case in @Antasel's proposal is technically incorrect, so this URL:

http://mylocalhostaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa63chars.mylocalhostaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa63chars.mylocalhostaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa63chars.mylocalhostaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa.total256length

works in Chrome.

@isabelastisser @jjcoffee @abdulrahuman5196 it's because Backend did not use FILTER_VALIDATE_URL to filter url . (more precisely, FILTER_VALIDATE_URL is limiting total lengh to 253 )

@abdullah1111 Good question! I've had another look into this, and I think what we really want to do is use exactly the same validation in the existing URL_REGEX, but make the TLD part optional if a protocol is used. This matches with the BE's validation, which looks like it uses FILTER_VALIDATE_URL, though I've asked on Slack if anyone can give any insights! Updated proposal with this new approach.

@jjcoffee didn't you test FILTER_VALIDATE_URL when commenting it ?

@jjcoffee
Copy link
Contributor

@abdullah1111 Damn sorry, meant to tag @abdulrahuman5196 - typing too fast 😅

@melvin-bot melvin-bot bot added the Overdue label Aug 28, 2023
@isabelastisser
Copy link
Contributor

@abdulrahuman5196, what are the next steps here? Do we need to choose a proposal? This issue was changed to Internal. Should we have an Expensify engineer take over?

@melvin-bot melvin-bot bot removed the Overdue label Aug 28, 2023
@abdulrahuman5196
Copy link
Contributor

abdulrahuman5196 commented Aug 28, 2023

No need for internal engineer yet. I have to re-evaluate the proposals

@melvin-bot melvin-bot bot added the Overdue label Aug 30, 2023
@isabelastisser
Copy link
Contributor

Bump @abdulrahuman5196

@melvin-bot melvin-bot bot removed the Overdue label Aug 30, 2023
@abdulrahuman5196
Copy link
Contributor

Will be working on proposal review today

@abdulrahuman5196
Copy link
Contributor

@isabelastisser I am not seeing this issue anymore on staging. Can you check again?

@isabelastisser
Copy link
Contributor

Hey @abdulrahuman5196, thanks for the heads up; I just tried and can no longer reproduce this in staging.

I will process your payment in Upwork for your time reviewing the proposals and @kerupuksambel for reporting the issue.

@isabelastisser
Copy link
Contributor

Payment summary:

@abdulrahuman5196 $500 for reviewing the proposals.
@kerupuksambel $250 for reporting the issue.

The offers were sent in Upwork, please accept them, and I will process the payments. Thanks!

@kerupuksambel
Copy link

@isabelastisser Offer accepted. Thank you!

@Antasel
Copy link
Contributor

Antasel commented Sep 5, 2023

@isabelastisser @abdulrahuman5196
we still have inconsistency issue when send/edit http://_ on staging app

Screencast.from.6-9-23.04.31.00.webm

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Something is broken. Auto assigns a BugZero manager. Daily KSv2 Internal Requires API changes or must be handled by Expensify staff
Projects
None yet
Development

No branches or pull requests

9 participants