Feature request: Pandoc integration #2043

Closed
opened 2026-02-05 02:42:02 +03:00 by OVERLORD · 10 comments
Owner

Originally created by @maggie44 on GitHub (Jan 16, 2021).

Hi @ssddanbrown,

I was thinking of Pandoc integration as an optional module. It would add some efficiencies to the various exports by keeping the assets separate as discussed above (and potentially resolve some other outstanding issues), but also provide a bunch of additional options, such as EPUB (#1949), Word doc, video export support (#883; #2412) and a bunch more.

Here are a few shortcuts to try it out:

  1. Here is Pandoc: https://pandoc.org
  2. It's in most repositories, so `apt-get install pandoc` or `brew install pandoc` should do the trick (if installing in a Docker container, you may need to install build-essential and/or curl).
  3. An example Markdown file I have tested with:

test.md

```
# Test file
Test MD File.

[![Build Status](https://cdn.vox-cdn.com/thumbor/zEZJzZFEXm23z-Iw9ESls2jYFYA=/89x0:1511x800/1600x900/cdn.vox-cdn.com/uploads/chorus_image/image/55717463/google_ai_photography_street_view_2.0.jpg)](https://travis-ci.org/joemccann/dillinger)
Dillinger is a cloud-enabled, mobile-ready, offline-storage, AngularJS powered HTML5 Markdown editor.

  - Type some Markdown
  - Convert some Markdown

![](https://www.learningcontainer.com/wp-content/uploads/2020/05/sample-mp4-file.mp4)

# New Features!

  - sdfsdf
  - sdfsdvldkvnc

You can also:
  - send
```

Execute the command:

`pandoc test.md -o example2.html --extract-media ./assets`

More info relating to this was originally discussed in: https://github.com/BookStackApp/BookStack/issues/2412
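For anyone wanting to script the command above rather than type it, here is a minimal Python sketch. The function names `build_pandoc_command` and `convert` are made up for illustration; only the `pandoc` arguments themselves come from the command in this post.

```python
from __future__ import annotations

import shutil
import subprocess
from pathlib import Path


def build_pandoc_command(source: Path, output: Path, media_dir: Path) -> list[str]:
    """Build the argument list for the pandoc call shown in the post."""
    return [
        "pandoc",
        str(source),
        "-o", str(output),
        # Write referenced media to media_dir instead of embedding it.
        "--extract-media", str(media_dir),
    ]


def convert(source: Path, output: Path, media_dir: Path) -> None:
    """Run pandoc, failing loudly if it is missing or exits non-zero."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc was not found on PATH")
    subprocess.run(build_pandoc_command(source, output, media_dir), check=True)
```

Calling `convert(Path("test.md"), Path("example2.html"), Path("assets"))` should then be equivalent to running the command by hand.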
OVERLORD added the 🔨 Feature Request label 2026-02-05 02:42:02 +03:00

@maggie44 commented on GitHub (Jan 16, 2021):

@ssddanbrown in response to the last comment over in https://github.com/BookStackApp/BookStack/issues/2412, indeed, these ostensibly simple things often get more complex very quickly.

In terms of workflow, after giving it some thought, perhaps a similar integration to WKHTMLTOPDF. The user installs Pandoc manually, using the Pandoc docs for their environment (`apt-get install pandoc` on Ubuntu, for example), then adds a `PANDOC=True` variable to the .env file, so that BookStack doesn't have any responsibility for the Pandoc install.

When PANDOC=True there could be some new fields in the export dropdown menu: EPUB; HTML Archive (or something more logically named instead of HTML Archive).

Hopefully BookStack could then pass the same content being pulled for the current export features to Pandoc on the local system, and return the output for download.

By using the same method as WKHTMLTOPDF, it isn't as mission critical to maintain and allows for some dev experimentation. Similarly, it would only add EPUB and HTML Archive rather than replacing the current PDF and HTML export processes, as I'm certainly not confident enough in it to recommend that off the bat.
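To make the opt-in idea concrete, here is a rough Python sketch of the gating logic. It is not BookStack code (BookStack is PHP), and the names `pandoc_enabled` and `export_formats`, the accepted flag spellings, and the format lists are all assumptions for illustration.

```python
import os
import shutil

# Existing exports mentioned in this thread; kept regardless of the flag.
BASE_FORMATS = ["PDF", "HTML"]
# Extra menu entries proposed above, shown only when Pandoc is opted in.
PANDOC_FORMATS = ["EPUB", "HTML Archive"]


def pandoc_enabled(env=None, which=shutil.which):
    """True only if PANDOC is set to a truthy value AND the binary is installed."""
    env = os.environ if env is None else env
    flag = str(env.get("PANDOC", "")).strip().lower()
    return flag in {"true", "1", "yes"} and which("pandoc") is not None


def export_formats(env=None, which=shutil.which):
    """Return the export menu entries for the current configuration."""
    formats = list(BASE_FORMATS)
    if pandoc_enabled(env, which):
        formats.extend(PANDOC_FORMATS)
    return formats
```

Checking for the binary as well as the flag means a mistaken `PANDOC=True` on a machine without Pandoc would simply leave the menu unchanged rather than offer exports that fail.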

I realise a lot of this is preaching to the choir, but seems you have plenty of tickets and things on your plate, so figure the more thought/detail given to a feature request and the use case considered before making the request the better.

Big thanks for the work on this, it is going to become quite a central part of our EdTech COVID response work.


@maggie44 commented on GitHub (Jan 24, 2021):

After further thought, how about simplifying this down to allowing the original Markdown that BookStack uses to be exported? If included in the API, this would allow us to use third-party processing of exported data (like Pandoc) without the extra support burden.


@ssddanbrown commented on GitHub (Jan 24, 2021):

Hi @maggie0002,
If you're using the Markdown editor to edit pages, the pages API should already provide the stored Markdown content (pages.show endpoint).
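As a rough illustration of reading a page through the API, the Python sketch below builds the request for a single page. The helper names, the example URL and the credentials are placeholders, and the `markdown` and `html` response field names are an assumption to check against the API docs on your own instance.

```python
import json
import urllib.request


def build_page_request(base_url, page_id, token_id, token_secret):
    """Build a GET request for one page using BookStack API token auth."""
    url = f"{base_url.rstrip('/')}/api/pages/{page_id}"
    headers = {"Authorization": f"Token {token_id}:{token_secret}"}
    return urllib.request.Request(url, headers=headers)


def fetch_page(base_url, page_id, token_id, token_secret):
    """Fetch the page and decode the JSON body into a dict."""
    request = build_page_request(base_url, page_id, token_id, token_secret)
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With a real instance, `fetch_page("https://bookstack.example.com", 1, token_id, token_secret).get("markdown")` would be the value to hand to a third-party tool such as Pandoc.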


@maggie44 commented on GitHub (Jan 25, 2021):

> Hi @maggie0002,
> If you're using the Markdown editor to edit pages, the pages API should already provide the stored Markdown content (pages.show endpoint).

Whoops, sorry, I thought it defaulted to Markdown. I meant an API endpoint to export the WYSIWYG content as is, rather than converting it first to HTML or PDF. I don't see that in the API docs.


@ssddanbrown commented on GitHub (Jan 26, 2021):

That (pages => read) endpoint should give you the HTML that's used when viewing a page. This is pretty much the same as the HTML loaded in the WYSIWYG editor but with a pass to remove some potentially dangerous elements.


@maggie44 commented on GitHub (Jan 26, 2021):

Helpful, and interesting, thanks. My understanding then is that the only difference is that the export -> html function takes the same HTML seen in the pages -> read endpoint and passes it to a processor that converts pictures etc. into an embedded HTML file. The API output comes without headers, though, which presumably is what the HTML processor takes care of (among other things).
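To illustrate what "embedded" could mean here, this is a small Python sketch that inlines images as base64 data URIs. It is only a guess at the general technique, not BookStack's actual export code; `embed_images` and its `load_bytes` callback are hypothetical names, and the regex handles only double-quoted `src` attributes.

```python
import base64
import mimetypes
import re

# Matches the src attribute of <img> tags (double-quoted values only).
IMG_SRC = re.compile(r'(<img\b[^>]*?\bsrc=")([^"]+)(")', re.IGNORECASE)


def embed_images(html, load_bytes):
    """Replace each image src with a data URI built from load_bytes(src)."""

    def replace(match):
        src = match.group(2)
        if src.startswith("data:"):
            return match.group(0)  # already embedded, leave untouched
        mime = mimetypes.guess_type(src)[0] or "application/octet-stream"
        payload = base64.b64encode(load_bytes(src)).decode("ascii")
        return f"{match.group(1)}data:{mime};base64,{payload}{match.group(3)}"

    return IMG_SRC.sub(replace, html)
```

Passing the loader in as a callback keeps the sketch independent of whether the images come from disk or over HTTP.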

Will experiment with that endpoint and report back anything useful.


@maggie44 commented on GitHub (Jan 26, 2021):

> Helpful, and interesting, thanks. My understanding then is that the only difference is that the export -> html function takes the same HTML seen in the pages -> read endpoint and passes it to a processor that converts pictures etc. into an embedded HTML file. The API output comes without headers, though, which presumably is what the HTML processor takes care of (among other things).
>
> Will experiment with that endpoint and report back anything useful.

Didn't get very far. Turns out the HTML the API pipes out is missing headings, CSS and all the formatting; it would be a lot of work to go from there to something usable.

Is there a way to access the HTML used by the exporter but with the original HREF to the images and/or video rather than the embedded images? It would then be fairly simple (in theory) to mirror that page and get it with the exported content. Wget, for example, has a --mirror option I could experiment with as a lightweight solution.


@ssddanbrown commented on GitHub (Jan 27, 2021):

> Is there a way to access the HTML used by the exporter but with the original HREF to the images and/or video rather than the embedded images?

There's no way to get that directly, although the main content HTML is what you'd get out of the API; the export just wraps it up in a template with some extra styles. The export uses [this template](https://github.com/BookStackApp/BookStack/blob/master/resources/views/pages/export.blade.php), with [these export styles](https://github.com/BookStackApp/BookStack/blob/release/public/dist/export-styles.css).
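As a rough idea of what "wrapping it up in a template" could look like outside BookStack, here is a Python sketch that places an API-provided HTML fragment inside a minimal document linking a stylesheet. The function name `wrap_for_export` and the markup are assumptions for illustration; the real Blade template linked above does more than this.

```python
from html import escape


def wrap_for_export(title, content_html, stylesheet_href="export-styles.css"):
    """Wrap a page's content HTML in a minimal standalone document."""
    return (
        "<!doctype html>\n"
        '<html lang="en">\n'
        "<head>\n"
        '  <meta charset="utf-8">\n'
        f"  <title>{escape(title)}</title>\n"
        # Links to a local copy of the export styles rather than embedding them.
        f'  <link rel="stylesheet" href="{escape(stylesheet_href, quote=True)}">\n'
        "</head>\n"
        "<body>\n"
        f"{content_html}\n"
        "</body>\n"
        "</html>\n"
    )
```

Because the fragment is inserted untouched, image and video URLs keep their original references, which is what a mirroring tool would need.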

@maggie44 commented on GitHub (May 27, 2021):

Having given it some more thought, how would you feel about Pandoc as an optional exporter, similar to how wkhtmltopdf is currently integrated? This wrapper is proving useful: https://github.com/ueberdosis/pandoc

Would also help resolve some other issues that I don't think we will find a way around:

linuxserver/docker-bookstack#80
#2459


@ssddanbrown commented on GitHub (May 31, 2021):

Hi @maggie0002,
Sorry for my lack of response.

To be honest, I'd not be very keen. Supporting both of the existing PDF export options has already proved a lot more challenging than hoped and has consumed a lot of my time through the various requests & issues it has generated. The range of conversion formats that Pandoc would open up would worry me, and I think it's optimistic to expect that it'll solve more issues than it'll create as an alternative PDF generator, especially since I believe Pandoc will use WKHTMLtoPDF by default anyway for HTML to PDF conversions.

Reference: starred/BookStack#2043