Search has issues with words adjacent to punctuation characters #1709

Closed
opened 2026-02-05 01:40:45 +03:00 by OVERLORD · 9 comments
Owner

Originally created by @Knaui on GitHub (May 5, 2020).

For example, it won't find "house" in "big-house"

but it will find "big".

This is the case for book and page titles as well as page content.

Tested with BookStack v0.29.0.

OVERLORD added the 🏭 Back-End label 2026-02-05 01:40:45 +03:00
Author
Owner

@kshitijsharma97 commented on GitHub (Aug 4, 2020):

I tried the same thing on my dev instance and had the same issue.
If I search for the first word, the page is found, but if I search for the whole hyphenated name, nothing comes up in the search results.

However, if you pull the latest changes and update to BookStack v0.29.3,
the issue with hyphen-separated search terms is resolved.

Author
Owner

@ssddanbrown commented on GitHub (Jul 12, 2021):

Updating the title to be more generic in the interest of merging down some issues:

  • #2843
  • #2523

Related to #1037
Author
Owner

@Wookbert commented on GitHub (Jul 23, 2021):

@ssddanbrown

I’ve just realized that searching for parts of words that are joined by hyphens doesn't work either.

Example: Searching for `historian` does not find the page on `CCU-Historian`, while searching for `ccu` does. Note that hyphens are very common in, for instance, German. You often have compound words connected by two or even three hyphens.

An English-language example would be `Remote-robot-assisted`, which IMO should be retrieved when searching for any of the three words individually, but also for e.g. `robot-assisted`, `robot assisted` or `robotassisted`. (The same applies to any spelling of the `Remote robot` combination.)

Author
Owner

@dweinerATL commented on GitHub (Aug 5, 2021):

@ssddanbrown we are running into something similar. We're running BookStack v21.05.4 for a science fiction author's book series. One of her races is called `Ke!endarian`. If you search for `Ke!endarian`, there are no results. If you search for `Kel`, you get the expected response. However, we have found that the search works if you search for `"Ke!endarian"`.

Author
Owner

@ssddanbrown commented on GitHub (Nov 13, 2021):

As part of #3043 I've made a change to auto-convert any search terms that would experience this issue into exact-match terms, which run a direct, although less efficient, content match. This doesn't directly solve the issue but should provide a much better user experience in such situations. It will be part of the next feature release.
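
To illustrate the idea, here is a minimal Python sketch of that kind of conversion. It is not BookStack's actual PHP code: the `SPLIT_CHARS` set and the `classify_terms` and `exact_match` names are assumptions made up for this example.

```python
# Assumed characters that would break a term apart during indexing.
# Illustrative only; BookStack's real delimiter list may differ.
SPLIT_CHARS = set(".,!?:;()[]{}<>`'\"")


def classify_terms(query):
    """Split a query on whitespace into (plain, exact) term lists.

    A term containing any split character cannot match a single indexed
    term, so it is treated as an exact-match term instead.
    """
    plain, exact = [], []
    for term in query.split():
        if any(ch in SPLIT_CHARS for ch in term):
            exact.append(term)
        else:
            plain.append(term)
    return plain, exact


def exact_match(term, content):
    """Direct, case-insensitive content match (no index involved)."""
    return term.lower() in content.lower()


plain, exact = classify_terms("Ke!endarian history")
print(plain)  # ['history']
print(exact)  # ['Ke!endarian']
print(exact_match("Ke!endarian", "The Ke!endarian people"))  # True
```

Plain terms would still go through the indexed lookup, while the exact terms fall back to the slower content scan, which matches the trade-off described above.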

Author
Owner

@caius-martinus commented on GitHub (Oct 18, 2023):

Hello @ssddanbrown,
I think this issue isn't solved, at least in 23.08.2. Here is how to reproduce it: create a page with the content `/abc123` on a single line. Now search for `abc1` and you should observe that it doesn't match. However, `/abc1` would.
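
A rough Python sketch of why this could happen, assuming that `/` is not an index delimiter and that search terms are matched as prefixes of indexed terms. The `DELIMITERS` string and both function names are hypothetical and not taken from BookStack's code.

```python
# Hypothetical delimiter set that does not include "/".
DELIMITERS = " \t\n.,!?:;()"


def tokenize(text, delimiters=DELIMITERS):
    """Split text into lowercase terms on any delimiter character."""
    terms, current = [], []
    for ch in text:
        if ch in delimiters:
            if current:
                terms.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        terms.append("".join(current))
    return [t.lower() for t in terms]


def prefix_search(query, indexed_terms):
    """Return the indexed terms that start with the query."""
    q = query.lower()
    return [t for t in indexed_terms if t.startswith(q)]


index = tokenize("/abc123")
print(index)                          # ['/abc123']
print(prefix_search("abc1", index))   # []
print(prefix_search("/abc1", index))  # ['/abc123']
```

Under those assumptions the leading `/` becomes part of the indexed term, so only queries that start with `/` can match it by prefix.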

Author
Owner

@sNiXx commented on GitHub (Nov 21, 2023):

I can confirm this issue is still present on 23.10.2. I also just verified on the demo instance (currently 23.10.4) that hyphenated words are not found correctly. For instance, the pages prod-linode-sparkjet and dev-internal-sparklebike on the demo instance cannot be found when searching for the last term (i.e. sparkjet or sparklebike).

Author
Owner

@watschi commented on GitHub (Jan 8, 2025):

Facing the same issue with hyphenated words, which are pretty common in German text.
Quick and dirty solution (needs to be reapplied after any update):

  • Edit `app/Search/SearchIndex.php` and add a hyphen (`-`) to `$delimiters` (at https://github.com/BookStackApp/BookStack/blob/33b46882f39106697a704dadd25f221f0dc2ff53/app/Search/SearchIndex.php#L19C1-L19C66)
  • Run `php artisan bookstack:regenerate-search`
  • For the word `Test-Word`, searching for `Test`, `Word` or `Test-Word` will return the desired content

@ssddanbrown Any reason to exclude `-` from the delimiters? Feels like this should be included by default, maybe it's an oversight, maybe I'm missing something 🙂
Author
Owner

@ssddanbrown commented on GitHub (Feb 14, 2025):

@watschi

> Any reason to exclude `-` from the delimiters? Feels like this should be included by default, maybe it's an oversight, maybe I'm missing something 🙂

Really it was because hyphens felt more like part of a term than something to split terms by, but I can see the issue that would result.

I spent some time on this today to change up the indexing a bit via #5488.
I've tried to come to a compromise to help address some of the most problematic areas, in addition to adding `-` as a delimiter.
Now, for the text `cat-dog`, BookStack will index that as `cat`, `dog` and `cat-dog`.
That way, searching for either word will work, but the full term will also work via our proper indexed term system.
The same is done for dots/periods (which I thought could be important for numbering, among other things).
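
As a rough illustration of the indexing behaviour described above (not the actual change in #5488, which is PHP), a Python sketch might look like this. The `index_terms` name and the choice to split words on whitespace first are assumptions for the example.

```python
import re


def index_terms(text):
    """Index each whitespace-separated word by its hyphen/dot-separated
    parts, plus the full word when it actually contained such a separator."""
    terms = []
    for word in text.lower().split():
        # Split on hyphens and dots, dropping empty pieces (e.g. a trailing dot).
        parts = [p for p in re.split(r"[-.]", word) if p]
        terms.extend(parts)
        if len(parts) > 1:
            terms.append(word)
    return terms


print(index_terms("cat-dog"))  # ['cat', 'dog', 'cat-dog']
```

With terms indexed this way, a prefix search for `cat`, `dog` or `cat-dog` can each find the same content.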

There will still be gaps and limitations in search due to the nature of trying to keep content indexed, using prefix matching, and the use of custom tokenization, but this should solve some of the most common issues reported here about hyphenated words.
Therefore I'm going to close this off, but new focus areas can be raised as needed (if not already open).

The mentioned changes will be part of the next feature release.
Note that you'd need to regenerate the search index (https://www.bookstackapp.com/docs/admin/commands/#regenerate-the-search-index) after updating to gain these index improvements.

Thanks all for your input!

Reference: starred/BookStack#1709