Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HTML string containing underscores gets escaped and shown in output markdown #63

Open
jiggneshhgohel opened this issue Feb 9, 2016 · 11 comments

Comments

@jiggneshhgohel
Copy link

Ruby 2.3.0

Rails 4.2.5

Rails Console output:

2.3.0 :005 > html_str = "<p><strong>Username</strong> : %{user_name}</p>"
 => "<p><strong>Username</strong> : %{user_name}</p>" 
2.3.0 :006 > md = ReverseMarkdown.convert(html_str)
 => " **Username** : %{user\\_name}\n\n" 
2.3.0 :007 > puts md
 **Username** : %{user\_name}
 => nil 

As can be seen when my HTML string contains words separated by underscore(s), the underscore(s) gets escaped which is correct. But is there any way we can hide those in the output string? I have a use-case wherein I want to convert the HTML string to its Markdown version and render the markdown version as it is.

@xijo
Copy link
Owner

xijo commented Feb 11, 2016

Hello @jiggneshhgohel

I'm not quite sure how to help you there. We could introduce an option to disable the escaping, but I think the use cases for that are quite rare.

On the other hand could you replace escaped key chars with the non-escaped versions whenever you need it. Wouldn't that be the easier solution?

Let me know what you think!
Jo

@jiggneshhgohel
Copy link
Author

@xijo

On the other hand could you replace escaped key chars with the non-escaped versions whenever you need it. Wouldn't that be the easier solution?

That's the most obvious option when there is no provision for desired feature in the library. But my expectancy of using a library like yours is to expect raw original markdown from the HTML I need to convert. My HTML didn't contained any escape characters thus I wasn't expecting it in my converted to markdown.

Thanks.

@anujbiyani
Copy link

I think the issue is that the md = ReverseMarkdown.convert(html_str) is double-escaping the underscore, no? I would expect

2.3.0 :006 > md = ReverseMarkdown.convert(html_str)
 => " **Username** : %{user\\_name}\n\n" 

to read as

2.3.0 :006 > md = ReverseMarkdown.convert(html_str)
 => " **Username** : %{user\_name}\n\n" # only \_ here instead of \\_

The issue I'm having is with a string like

"https://github.com/xijo/reverse_markdown"

ReverseMarkdown converts it to

"https://github.com/xijo/reverse\\_markdown\n\n"

and then the marked library converts it to

"<p><a href="https://github.com/xijo/reverse\_markdown">https://github.com/xijo/reverse\_markdown</a></p>
"

which is correct from marked's perspective because it's processing only one of the escapes, but incorrect from the user's perspective because the original string had no escapes and the outputted markdown has escapes.

(I'm using ReverseMarkdown because sometimes the input string contains actual markup, sometimes it's just plain text.)

@anujbiyani
Copy link

https://www.markdownguide.org/basic-syntax/#italic

To italicize text, add one asterisk or underscore before and after a word or phrase. To italicize the middle of a word for emphasis, add one asterisk without spaces around the letters.

So I think the actual problem is that reverse markdown should never escape underscores that are in the middle of a word.

@xijo
Copy link
Owner

xijo commented Jan 4, 2019

@anujbiyani Thanks for your input on that issue.

I like the idea to leave underscored in words unmodified. It won't solve all issues though, because something like foo, _bar = something will still raise problems, but it's a good solution for most of the cases.

Would you like to open a PR?

@anujbiyani
Copy link

Sure, happy to try and tackle this!

One question: why are * and _ escaped at all? (Referring to this function.) Aren't unmatched * and _ valid without the escaping, and won't most markdown->HTML parsers simply ignore them?

@xijo
Copy link
Owner

xijo commented Jan 5, 2019

@anujbiyani Good question! I did some digging in the past and I think initial the description mentions this case: https://daringfireball.net/projects/markdown/syntax#backslash

The example sounds pretty reasonable to me so I wouldn't ditch the whole escaping. What about:

  1. don't escape in the middle of a word
  2. add an configuration flag that disables escaping

This way it should be save for the majority and doesn't change the default behavior all to drastically. But everyone is free to opt out of escaping if it fits their case.

What do you think?

@anujbiyani
Copy link

anujbiyani commented Jan 7, 2019

Happy to do it via configuration flag to maintain backwards compatibility! Follow-up question, though: I believe the link you provided is referring to user-provided backslashes whereas the code I linked in ReverseMarkdown is escaping underscores and asterisks that don't have specifically two backslashes.

I think that method should either:

  1. be removed (if it doesn't have to exist)
  2. be a no-op (if it does have to exist)
  3. add a second backslash if there already is one backslash (if we need the double-backslash instead of just the single)

I'm trying to understand what the right behavior should be for when the flag is enabled 😄 let me know what you think!

@xijo
Copy link
Owner

xijo commented Jan 8, 2019

You're right, the backslash escape is used for user provided escapes. But, if we complete a full life cycle (HTML - MD - HTML) and we would not escape the initial *, then it would be treated as if it were markdown and therefore the information would be lost. It's a little confusing, so let me give you an example:

With escaping

<div>I _emphasize_ that</div>   # Original HTML (user input)
I \_emphasize\_ that            # MD after conversion
<p>I _emphasize_ that</p>       # After using random MD interpreter

Without escaping (as proposed)

<div>I _emphasize_ that</div>     # Original HTML (user input)
I _emphasize_ that                # MD after conversion
<p>I <em>emphasize</em> that</p>  # Now there is an additional em tag

Does this example clarify my problem with just skipping the whole escaping?

In my opinion it is correct to treat the original HTML entities as user input and therefore escape them.

@anujbiyani
Copy link

anujbiyani commented Jan 8, 2019

Ohh that helps a ton, thanks! So now I agree that removing the escaping entirely isn't actually a good idea.

What if we just go with option 1 from your post above and skip the disabling escaping via config?

  1. don't escape in the middle of a word
  2. add an configuration flag that disables escaping

Or did you mean that the configuration flag should apply to whether or not we escape in the middle of the word?

Sorry for all the back-and-forth, I find escaping/not-escaping incredibly confusing from a keeping-it-straight-in-my-head perspective 😖 .

@bsbodden
Copy link

bsbodden commented Feb 13, 2020

Also, don't convert/parse tag attributes, for example:

one_tag = "<img src=\"https://someimageserver.com/img__1fu9uz__.png\" alt=\"foo\" />"
ReverseMarkdown.convert(one_tag)

results in:

=> " ![foo](https://someimageserver.com/img __1fu9uz__.png)"

notice the space

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants