GitHub - akaalias/obsidian-extract-pdf: Extract PDFs to Markdown within Obsidian

Extract PDF text to Markdown

Allows you to extract the basic textual content of a PDF into a Markdown file. Works well with headings, paragraphs and lists.

Demo

How to use this plugin

After you've installed and activated the plugin:

Drag your PDF into Obsidian
Open the PDF within Obsidian
Make sure the pane with your PDF is focused
Click the "PDF to Markdown" button in the sidebar
Edit the generated markdown file to your needs

Tips & Tricks for editing the generated markdown file

I just went ahead and turned a 500-page PDF into Markdown and found that it worked better and faster than I expected.

Bulk-removing page footers

The book I used had the same footer on every page. That means the footer got copied into the Markdown file over and over, too.

For bulk search-and-replace I use the Atom editor (https://atom.io):

Copy the footer text into your clipboard
Download and install Atom
Open Atom and open the Markdown file inside
Use "Find -> Find in Buffer" and paste the footer text
Leave the replacement field empty and use the button "Replace" or "Replace All" to remove the footer text
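
If you'd rather script this than click through an editor, here is a minimal Python sketch of the same idea. It is not part of the plugin. It assumes the footer sits on a line of its own, so it drops whole lines instead of doing a plain substring replace like the editor does. The names remove_footer and remove_footer_in_file are made up for this example.

```python
from pathlib import Path


def remove_footer(markdown: str, footer: str) -> str:
    """Return the text without any line that consists only of the footer.

    Assumption: the footer occupies a line of its own in the generated file.
    """
    target = footer.strip()
    if not target:
        # An empty footer would match every blank line, so do nothing.
        return markdown
    kept = [
        line
        for line in markdown.splitlines(keepends=True)
        if line.strip() != target
    ]
    return "".join(kept)


def remove_footer_in_file(path: Path, footer: str) -> None:
    """Rewrite the file at `path` in place with the footer lines removed."""
    text = path.read_text(encoding="utf-8")
    path.write_text(remove_footer(text, footer), encoding="utf-8")
```

Keep a copy of the generated file before running something like this, because remove_footer_in_file overwrites it.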

Remove a single space before a new line of text

Weirdly, some new lines of text had a space in front of them, such as:

 Some text

...which resulted in Obsidian treating the line as a sub-block of the preceding line.

To remove the space for those lines, I used a regular expression search-and-replace:

In "Find in Buffer", activate "Regex Search" (the .* icon)
Enter ^([ ]|\t)+ into the search field
Use the button "Replace" or "Replace All" to remove the space
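
The same search-and-replace can be sketched in Python, using the regular expression from the steps above. The function name strip_leading_whitespace is made up for this example. The MULTILINE flag is needed so that ^ matches at the start of every line rather than only at the start of the file.

```python
import re

# Same pattern as in the editor: one or more spaces or tabs at line start.
LEADING_WHITESPACE = re.compile(r"^([ ]|\t)+", flags=re.MULTILINE)


def strip_leading_whitespace(markdown: str) -> str:
    """Remove spaces and tabs at the beginning of every line."""
    return LEADING_WHITESPACE.sub("", markdown)
```

Note that this removes all leading indentation, so intentionally nested list items and indented code blocks are flattened as well.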

Known issues

First-time use

If you had a PDF open in Obsidian before you installed and activated the plugin, hitting the button may not work. I've had this issue with other plugins as well. The code just doesn't hook up to already-open files.

The solution is to simply close the PDF note and re-open it. That will allow the plugin to hook into it.

Limited PDF parsing

Please understand that this is a basic, best-effort tool to get basic text and headings from a PDF. It really just gets the text from a PDF and turns it into Markdown. The plugin doesn't handle anything more complex, such as tables, images, or annotations:

Does not turn PDF highlights and annotations into Markdown highlights
Does not retain PDF numbered lists
Does not skip text in headers and footers
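
Since headers and footers end up in the output, it can help to find out which lines repeat before cleaning up. The following Python sketch is not something the plugin does. It counts identical lines in the generated file and lists the ones that occur often. The function name repeated_lines and the default threshold of 5 are assumptions for this example.

```python
from collections import Counter


def repeated_lines(markdown: str, min_count: int = 5) -> list[tuple[str, int]]:
    """List non-empty lines that occur at least `min_count` times.

    Lines are compared after trimming surrounding whitespace and are
    returned with their counts, most frequent first.
    """
    counts = Counter(
        line.strip() for line in markdown.splitlines() if line.strip()
    )
    return [(line, n) for line, n in counts.most_common() if n >= min_count]
```

Lines that show up roughly once per page are likely headers or footers, but check the candidates by hand, since legitimately repeated text is counted the same way.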
