This is a project I'm using to migrate content - it may or may not do exactly what you want for your content, but hopefully it's useful.
Reverse engineer markdown from an HTML page, including:
- Re-linking and downloading of images
- Front Matter metadata generation (Currently only YAML is supported)
dotnet tool install dotnet-html2md -g
Usage:
html2md --uri|-u <URI> [--uri|-u <URI> [ ... ]] --output|-o <OUTPUT LOCATION>
Options:
--image-output|-i <IMAGE OUTPUT LOCATION>
If no image output location is specified then they will be written to the same folder as the markdown file.
--include-tags|--it|-t <TAG|XPATH,[TAG|XPATH[,...]]>
If unspecified the entire body tag will be processed, otherwise only text contained in the specified tags will be processed.
--exclude-tags|--et|-e <TAG|XPATH,[TAG|XPATH[,...]]>
Allows for specific tags to be ignored.
--image-path-prefix|--ipp <IMAGE PATH PREFIX>
The prefix to apply to all rendered image URLs - helpful when you're going to be serving images from a
different location, relative or absolute.
--default-code-language <LANGUAGE>
The default language to use on code blocks converted from pre tags - defaults to csharp
--code-language-class-map <CLASSNAME:LANGUAGE,CLASSNAME:LANGUAGE,...>
Map between a pre tag's class names and languages. E.g. you might map the class name "sh_csharp" to "csharp"
and "sh_powershell" to "powershell".
--front-matter-data <PROPERTY:[XPATH|{{MACRO}}|{{'CONSTANT'}}]>
Allows for configuration of information to be extracted to a Front Matter property. This can be an XPath to an element
or attribute in the HTML page, a string constant or a supported macro.
Supported macros:
RelativeUriPath: The relative path of the page being converted. e.g. for https://example.com/pages/page-1 the macro would
return /pages/page-1
--front-matter-data-list <PROPERTY:XPATH[:Date]>
Allows for configuration of list-based information to be extracted to a Front Matter property. You can optionally specify
that the data should be formatted as a date. (Values not convertable to dates will be rendered as-is.)
--logging <None|Trace|Debug|Information|Warning|Error|Critical>
By default no logging takes place - you can turn on logging at different levels with this flag.
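As a rough illustration of how these options combine, the sketch below converts a single page and writes images to a separate folder. Only the flag names come from the usage text above; the URL, folder names, tag names and XPath values are placeholders made up for the example. The check that prints the command when html2md is not on the PATH is a convenience for dry runs, not something the tool requires.

```shell
#!/bin/sh
# Placeholder values (assumptions): the URL, folders, tag names and XPaths.
# Only the flag names are taken from the usage text.
set -- \
  --uri "https://example.com/pages/page-1" \
  --output "./content" \
  --image-output "./static/images" \
  --image-path-prefix "/static/images/" \
  --include-tags "article" \
  --exclude-tags "nav,aside" \
  --code-language-class-map "sh_csharp:csharp,sh_powershell:powershell" \
  --front-matter-data "Title://h1" \
  --front-matter-data-list "Tags://p[@class='tags']/a" \
  --logging Information

# Run the tool when it is installed; otherwise print the command for review.
if command -v html2md >/dev/null 2>&1; then
  html2md "$@"
else
  printf 'html2md %s\n' "$*"
fi
```

Building the argument list with `set --` keeps each value quoted exactly once, which matters for XPaths that contain brackets and quotes.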
Install-Package Html2md.Core
var converter = new MarkdownConverter(new ConversionOptions());
ConversionResult converted = await converter.ConvertAsync("https://goatly.net/some-article");
// Alternatively you can convert multiple pages at once:
ConversionResult converted = await converter.ConvertAsync(
new[]
{
"https://goatly.net/some-article",
"https://goatly.net/some-other-article",
});
You can also extract Front Matter metadata:
var options = new ConversionOptions
{
FrontMatter =
{
Enabled = true,
SingleValueProperties =
{
{ "Title", "//h1" },
{ "Author", "{{'Mike Goatly'}}" },
{ "RedirectFrom", @"{{RelativeUriPath}}" }
},
ArrayValueProperties =
{
{ "Tags", @"//p[@class='tags']/a" }
}
}
};
var converter = new MarkdownConverter(options);
ConversionResult converted = await converter.ConvertAsync("https://goatly.net/some-article");
Where the resulting markdown would be:
---
Title: Article Title
Author: Mike Goatly
RedirectFrom: /some-article
Tags:
- Help
- Coding
---
ConversionResult is the result of a conversion process, containing:
- Documents: The markdown representations of all the converted pages.
- Images: A collection of images referenced in the documents. Each image includes the downloaded raw data as a byte array.
In ConversionOptions you can specify:
- ImagePathPrefix: The prefix to apply to all rendered image URLs - helpful when you're going to be serving images from a different location, relative or absolute.
- DefaultCodeLanguage: The default code language to apply to code blocks mapped from pre tags. The default is csharp.
- IncludeTags: The set of tags or XPaths for tags to include in the conversion process. If this is empty then all elements will be processed.
- ExcludeTags: The set of tags or XPaths for tags to exclude from the conversion process. You can use this if there are certain parts of a document you don't want translated to markdown, e.g. aside, nav, etc.
- CodeLanguageClassMap: A dictionary mapping between class names that can appear on pre tags and the language they map to. E.g. you might map the class name "sh_csharp" to "csharp" and "sh_powershell" to "powershell".
- FrontMatter: Configuration for how Front Matter metadata should be emitted into a converted document.
  - Enabled: Whether Front Matter metadata should be emitted. Defaults to false.
  - SingleValueProperties: Configuration of information to be extracted to a Front Matter property. This can be an XPath to an element or attribute in the HTML page, a string constant or a supported macro. Supported macros:
    - RelativeUriPath: The relative path of the page being converted. e.g. for https://example.com/pages/page-1 the macro would return /pages/page-1
  - ArrayValueProperties: Configuration of list-based information to be extracted to a Front Matter property.
<em>italic</em>
becomes *italic*
<strong>bold</strong>
becomes **bold**
Linked images from the same domain (relative or absolute) are downloaded and returned in the Images collection of the ConversionResult. Images from a different domain are not downloaded and their URLs are left untouched.
With ConversionOptions.ImagePathPrefix of "":
<img src="/images/img.png" alt="My image">
becomes ![My image](img.png)
With ConversionOptions.ImagePathPrefix of "/static/images/":
<img src="/images/img.png" alt="My image">
becomes ![My image](/static/images/img.png)
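The two cases above suggest a simple rule: keep the image's file name and put the prefix in front of it. Here is a minimal shell sketch of that rule, inferred only from these two examples. `relink_image` is a hypothetical helper, not part of html2md, and the tool's real logic may well handle cases this does not.

```shell
#!/bin/sh
# relink_image PREFIX SRC
# Hypothetical helper: mimics the documented examples by dropping the
# directory part of SRC and prepending PREFIX to the file name.
relink_image() {
  prefix=$1
  src=$2
  printf '%s%s\n' "$prefix" "${src##*/}"
}

relink_image "" "/images/img.png"                 # img.png
relink_image "/static/images/" "/images/img.png"  # /static/images/img.png
```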
<a href="https://goatly.net">Some blog</a>
becomes [Some blog](https://goatly.net)
If the link is to an image, then the image is downloaded and the link's URL is updated as with images.
Paragraph tags cause an additional new line to be inserted after the paragraph's text.
<p>para 1</p><p>para 2</p>
becomes:
para 1
para 2
<blockquote>quoted text</blockquote>
becomes:
> quoted text
Nested styling is also supported, though you'll currently get additional lines if you use multiple paragraphs. This doesn't seem to bother any renderers I've seen so far:
<blockquote>
<p>Para 1</p>
<p>Para 2</p>
</blockquote>
becomes:
> Para 1
>
> Para 2
>
>
Header tags get converted to the markdown equivalent:
<h2>Header 2</h2><h3>Header 3</h3>
becomes:
## Header 2
### Header 3
Tables are converted, though markdown tables are much more limited than HTML tables.
Where a header row is present in the source it is used as the markdown table header:
<table>
<thead>
<tr>
<th>Header 1</th>
<th>Header 2</th>
</tr>
</thead>
<tbody>
<tr>
<td>1-1</td>
<td>1-2</td>
</tr>
<tr>
<td>2-1</td>
<td>2-2</td>
</tr>
</tbody>
</table>
Becomes:
| Header 1 | Header 2 |
|-|-|
| 1-1 | 1-2 |
| 2-1 | 2-2 |
If no header row is found, the first row of the table is assumed to be the header:
<table>
<tbody>
<tr>
<td>1-1</td>
<td>1-2</td>
</tr>
<tr>
<td>2-1</td>
<td>2-2</td>
</tr>
</tbody>
</table>
Becomes:
| 1-1 | 1-2 |
|-|-|
| 2-1 | 2-2 |
<pre>content</pre>
becomes:
```
content
```
However, if the pre tag has the class name "code", the DefaultCodeLanguage from the ConversionOptions is applied to it:
<pre class="code">content</pre>
with a DefaultCodeLanguage of csharp becomes:
``` csharp
content
```
Additionally, if you have configured the CodeLanguageClassMap to map lang_ps to powershell:
<pre class="lang_ps">content</pre>
becomes:
``` powershell
content
```
As would <pre class="code lang_ps">content</pre>, because the class name map is consulted before falling back to the default code language.
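The lookup order described above can be sketched as a small shell function. This illustrates the documented behaviour only and is not the library's code: `resolve_language` is a hypothetical name, the map is hard-coded to the lang_ps example, and csharp stands in for DefaultCodeLanguage.

```shell
#!/bin/sh
# resolve_language "CLASS NAMES"
# Hypothetical helper mirroring the documented order:
#   1. any class found in the class map wins;
#   2. otherwise the class "code" selects the default language;
#   3. otherwise no language is emitted.
DEFAULT_CODE_LANGUAGE="csharp"

resolve_language() {
  has_code_class=0
  for class in $1; do
    case $class in
      lang_ps) echo "powershell"; return ;;   # stand-in for CodeLanguageClassMap
      code)    has_code_class=1 ;;
    esac
  done
  if [ "$has_code_class" -eq 1 ]; then
    echo "$DEFAULT_CODE_LANGUAGE"
  fi
}

resolve_language "lang_ps"       # powershell
resolve_language "code lang_ps"  # powershell
resolve_language "code"          # csharp
resolve_language ""              # prints nothing
```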
<ul>
<li>Item 1</li>
<li>Item 2</li>
<li>Item 3</li>
</ul>
becomes:
- Item 1
- Item 2
- Item 3
<ol>
<li>Item 1</li>
<li>Item 2</li>
<li>Item 3</li>
</ol>
becomes:
1. Item 1
1. Item 2
1. Item 3
Markdown renderers should automatically apply the correct numbering to lists like this.