OCR for images pasted in pages #4980

Open
opened 2026-02-05 09:30:48 +03:00 by OVERLORD · 3 comments

Originally created by @c0shea on GitHub (Oct 1, 2024).

Describe the feature you'd like

It would be awesome if BookStack could perform some basic OCR on any images pasted into pages and store the text in the database for improving search results, similar to how the HTML is converted to just text and stored for searching.
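To make the idea concrete, here is a minimal sketch of how OCR text could be folded into a page's searchable text. It is not BookStack's actual indexing code: the function name `build_search_text` and the `ocr` callable are hypothetical, and the OCR engine itself is left as a pluggable function so any tool could be swapped in.

```python
from html.parser import HTMLParser
from typing import Callable, List


class _PageTextExtractor(HTMLParser):
    """Collects visible text and <img> sources from page HTML."""

    def __init__(self) -> None:
        super().__init__()
        self.text_parts: List[str] = []
        self.image_sources: List[str] = []

    def handle_starttag(self, tag, attrs):
        # Remember every image source so it can be passed to OCR later.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.image_sources.append(src)

    def handle_data(self, data):
        stripped = data.strip()
        if stripped:
            self.text_parts.append(stripped)


def build_search_text(page_html: str, ocr: Callable[[str], str]) -> str:
    """Return page text followed by OCR text for each embedded image.

    `ocr` is a hypothetical callable mapping an image source to its
    recognised text; it stands in for whatever OCR engine is used.
    """
    parser = _PageTextExtractor()
    parser.feed(page_html)
    parser.close()
    ocr_parts = [ocr(src).strip() for src in parser.image_sources]
    return " ".join(parser.text_parts + [part for part in ocr_parts if part])
```

With a stubbed `ocr` function, a page containing a screenshot of an error dialog would become searchable by the error text even though that text never appears in the HTML.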

Describe the benefits this would bring to existing BookStack users

This would improve search results, especially since a lot of our documentation contains screenshots whose text doesn't appear anywhere in the page's HTML content.

Can the goal of this request already be achieved via other means?

Aside from being meticulous with organization and things like tagging, not really. You could manually run OCR on the images, but there's nowhere to put the resulting text where the search engine would see it other than the page content itself.

Have you searched for an existing open/closed issue?

  • I have searched for existing issues and none cover my fundamental request

How long have you been using BookStack?

5+ years

Additional context

This is similar to #3767 but different in that I'm not looking to index attachments. This would index content that is already part of the page itself and that people would reasonably expect to be able to search for.

OVERLORD added the 🔨 Feature Request label 2026-02-05 09:30:48 +03:00

@kazyka commented on GitHub (Oct 2, 2024):

I think paperless-ng would be what you are looking for, if you haven't tried it :)


@virtadpt commented on GitHub (Oct 2, 2024):

This would be kind of handy, because I sometimes upload photographs of pages of books as references to my wiki.

https://github.com/naptha/tesseract.js would be an ideal way of implementing it.


@sam-marteau commented on GitHub (Feb 2, 2026):

As an alternative, I suggest adding an API endpoint and the relevant database attributes for updating searchable image metadata. This would enable the use of external OCR for indexing.
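As a rough illustration of that proposal, the sketch below simulates such an endpoint in memory. Everything in it is an assumption: BookStack does not necessarily expose anything like this, and the `ImageMetadataStore` class, the `searchable_text` field name, and the status codes are hypothetical placeholders for whatever the real API and schema would look like.

```python
import json
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ImageMetadataStore:
    """In-memory stand-in for the proposed database attribute (hypothetical)."""

    searchable_text: Dict[int, str] = field(default_factory=dict)

    def handle_update(self, image_id: int, body: str) -> dict:
        """Simulate the proposed endpoint: accept JSON with 'searchable_text'."""
        payload = json.loads(body)
        text = payload.get("searchable_text")
        if not isinstance(text, str):
            # Reject payloads that do not carry the expected field.
            return {"status": 422, "error": "searchable_text must be a string"}
        self.searchable_text[image_id] = text
        return {"status": 200, "image_id": image_id}

    def search(self, term: str) -> List[int]:
        """Return ids of images whose stored text contains the term."""
        needle = term.lower()
        return sorted(
            image_id
            for image_id, text in self.searchable_text.items()
            if needle in text.lower()
        )
```

An external OCR job would then run on its own schedule, post the recognised text for each image, and leave the search side to consult the stored text alongside the page content.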

Reference: starred/BookStack#4980