Better URL extraction logic #548

Closed
MrOrz opened this issue Sep 24, 2023 · 10 comments · Fixed by #569

Comments

@MrOrz
Member

MrOrz commented Sep 24, 2023

In https://www.facebook.com/groups/cofacts/posts/3648747782023691/?__cft__[0]=AZWNmv5K_H7F-skP4SOkIgZkb_Zv2i6ot3SXeHigYKawA2MnWSlDmycGq3hfNilD_slvYWqz1M-TriCfusgM7iiguSYqfbf8hBuuDN7Jx98GrMObD8796wLbjw5EJpsyuCzpU12KXm3U_jICgBIgX7KrpyAMBof29c6JJxdT4fDSyBQGDqA4okhj5v4I9uaQN3U&__tn__=%2CO%2CP-R , Cofacts collaborators have pointed out that LINE can actually separate URLs from adjacent Mandarin characters as expected.

In contrast, our current URL matching mechanism simply matches all non-space characters following http:// and https://. We should improve this so that URLs followed by Mandarin characters don't break on the website.
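To make the failure concrete, here is a minimal sketch of the naive approach described above. The regex below is illustrative, not necessarily the exact pattern in the codebase:

```javascript
// Illustrative naive pattern: grab every non-space character
// after http:// or https://.
const NAIVE_URL_REGEX = /https?:\/\/\S+/g;

const text = '請參考http://google.com。謝謝';
console.log(text.match(NAIVE_URL_REGEX));
// The fullwidth 。 and the trailing Mandarin characters are not
// whitespace, so they are swallowed into the matched URL:
// [ 'http://google.com。謝謝' ]
```

Since `\S` matches any non-whitespace character, including all CJK characters and fullwidth punctuation, the match only stops at an actual space or the end of the string.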

@weii41392

Hey, I'd like to send a PR. Do you think linkify-react would be a good solution for this?

@MrOrz
Member Author

MrOrz commented Nov 13, 2023

Hi @weii41392 thanks for taking a look at the issue!
The currently implemented linkify also shortens URLs in a way that still lets the user copy-paste the full URL by selecting the text. Introducing a whole library and trying to integrate it while retaining the original behavior may be quite challenging.

On the other hand, simply fixing the regexp used in the current implementation may be a more straightforward solution -- we can focus on how to correctly split the URLs rather than worry about integrating anything.

@weii41392

I understand that introducing a new library for a single function would be a pain, but when I was searching for URL regex patterns, I found that each of them makes sacrifices in some way.

I find linkify a good choice as it was built exactly for linkifying plaintext and is implemented with a parser, which I believe can take care of context more than regex.

P.S. While writing this comment I found that linkify also fails to handle fullwidth (全形) parentheses, so I guess we can either choose some regex from the above list or implement a parser for this ourselves.

@MrOrz
Member Author

MrOrz commented Nov 13, 2023

Thanks for the investigation! The URL regex patterns look cool.

Per the original request, we should not include fullwidth (全形) characters in the URL: https://www.facebook.com/groups/cofacts/posts/3648747782023691/

When fact checkers copy-paste URLs from the browser URL bar, the URLs should always be URL encoded (with %XX). However, if the user copy-pastes text from another reply on Cofacts, they may copy the decoded URL, since we display all URLs in decoded format (it makes URLs more human-readable). Dropping all fullwidth characters may therefore break URLs with decoded fullwidth characters inside.

Personally, I think that when users provide URLs, they may write things like 「請參考http://google.com。」 or 「(請參考http://google.com)」. Therefore, maybe we can cut URLs short when encountering punctuation (both halfwidth 半形 and fullwidth 全形).

I suddenly recall that we extract hyperlinks in rumors-api as well, because we scrape the URL content when new messages and replies are created. It seems that url-regex is used there. Would you test whether it satisfies the need?

@weii41392

weii41392 commented Nov 17, 2023

url-regex seems to be one of the regex patterns in the above list.
Below I test both url-regex and linkify in the presence of halfwidth/fullwidth parentheses.
In a nutshell, url-regex handles neither halfwidth nor fullwidth parentheses, while linkify handles halfwidth parentheses but not fullwidth ones.
I guess this is because the halfwidth parenthesis is defined as a non-accepting symbol here while the fullwidth parenthesis isn't. Note that the same issue may also apply to braces ({}), brackets ([]), and angle brackets (<>).

I would lean towards using linkify and sanitizing the results (say, removing trailing closing symbols).

import urlRegex from 'url-regex';
import { tokenize } from "linkifyjs";

const links = [
    "http://foo.com/blah_blah",
    "http://foo.com/blah_blah_(wikipedia)_(again)"
];

const texts = [
    `${links[0]} ${links[1]}`,
    `網址1(${links[0]}) 網址2(${links[1]})`,
    `網址1（${links[0]}） 網址2（${links[1]}）`,
];

for (const text of texts) {
    const urls = text.match(urlRegex()) || [];
    urls.forEach((url) => console.log(`"${url}"`));
}

// url-regex
// texts[0]: succeed without parentheses
// "http://foo.com/blah_blah"
// "http://foo.com/blah_blah_(wikipedia)_(again)"

// texts[1]: fail to handle halfwidth parentheses
// "http://foo.com/blah_blah)"
// "http://foo.com/blah_blah_(wikipedia)_(again))"

// texts[2]: fail to handle fullwidth parentheses
// "http://foo.com/blah_blah）"
// "http://foo.com/blah_blah_(wikipedia)_(again)）"

for (const text of texts) {
    const tokens = tokenize(text);
    tokens.filter(token => token.isLink).forEach((token) => console.log(`"${token.v}"`));
}

// linkifyjs
// texts[0]: succeed without parentheses
// "http://foo.com/blah_blah"
// "http://foo.com/blah_blah_(wikipedia)_(again)"

// texts[1]: succeed with halfwidth parentheses
// "http://foo.com/blah_blah"
// "http://foo.com/blah_blah_(wikipedia)_(again)"

// texts[2]: fail to handle fullwidth parentheses
// "http://foo.com/blah_blah）"
// "http://foo.com/blah_blah_(wikipedia)_(again)）"

P.S. I added a reference to this issue in linkify: Hypercontext/linkifyjs#460

@weii41392

> Introducing a whole library and trying to integrate it while retaining the original behavior may be quite challenging.

Also, code changes to linkify() would not be an issue. (Sadly, this implementation doesn't work with fullwidth symbols...)

import { tokenize } from 'linkifyjs';

// Inside the existing linkify() helper; `str`, `maxLength`, `props`,
// and shortenUrl() come from the surrounding implementation.
const tokenized = tokenize(str).map((token, i) =>
  token.isLink ? (
    <a key={`link${i}`} href={token.v} {...props}>
      {shortenUrl(token.v, maxLength)}
    </a>
  ) : (
    token.toString()
  )
);

@MrOrz
Member Author

MrOrz commented Nov 18, 2023

Thank you for your thorough investigation and demonstration! My previous comment was about linkify-react. tokenize() from linkifyjs looks very promising! Let's proceed with linkifyjs 👍

@Tucchhaa
Contributor

Tucchhaa commented Apr 5, 2024

I tried typing the link from the Facebook post into LINE and Telegram. Both messengers included the fullwidth parenthesis in the URL:
[image]
[image]

So I suppose that is the standard way to handle fullwidth parentheses in URLs. Since neither the regexes above nor linkifyjs support them, there should be no need for us to handle them.

However, halfwidth brackets aren't handled correctly on the website:
[image]

The issue in linkifyjs is still open, but it's about fullwidth brackets inside URLs, which doesn't seem to be standard.

So what I suggest is to use linkifyjs to handle halfwidth brackets correctly, and forget about fullwidth brackets, since nobody supports them. @MrOrz What do you think?

@MrOrz
Member Author

MrOrz commented Apr 8, 2024

Thanks for the analysis. I think we can proceed with linkifyjs 👍

@Tucchhaa
Contributor

Tucchhaa commented Apr 9, 2024

OK, I will create a PR by the end of this week.
