Mirror of https://github.com/BookStackApp/BookStack.git (synced 2026-02-13 11:19:37 +03:00)
Chinese search cannot find words in the middle of a sentence. #624
Open · 23 comments · opened 2026-02-04 21:28:46 +03:00 by OVERLORD
Originally created by @jasoncheng7115 on GitHub (Mar 31, 2018).
When the word I'm looking for is the first word, or there is a space in front of it, it can be found.

But if the word is in the middle of a sentence, it cannot be found.

Is this an issue related to full-text retrieval?
Thanks!
@alexwyl commented on GitHub (Mar 4, 2019):
The same problem in v0.25.1; I have just tried BookStack.
@lotustalk commented on GitHub (Mar 13, 2019):
The same problem in v0.24.3; I'm running it in Docker.
@derky1202 commented on GitHub (Sep 2, 2019):
Still the same problem in v0.26.4. Hope it can get solved. Thanks.
@sosize commented on GitHub (Sep 25, 2019):
You can wrap the term in quotes to search, e.g. "成功". Maybe the word segmentation has a bug; I hope it gets fixed.
@LeonLiuY commented on GitHub (Nov 7, 2019):
Confirmed this issue still exists in v0.27.5.
One of my team members is hesitant because of this. Would like to see it fixed.
@hlj commented on GitHub (Dec 11, 2019):
Hope this issue gets fixed soon.
@ssddanbrown commented on GitHub (Dec 12, 2019):
Sorry about this issue. It essentially stems from my unfamiliarity with non-English text.
At the moment BookStack splits page content into terms on certain characters, such as spaces and some punctuation. Those terms are stored in the database for indexing, and a normal search then checks a "Starts With" match against them.
As @sosize has mentioned, you can wrap a search in quotes, at which point BookStack will perform a "Contains" match against the content directly instead of the above "Starts With". This is not the default simply due to performance ("Starts With" searches can use indexes much more effectively than "Contains").
I'm not really sure how we could utilise the "Starts With" system for such characters. Perhaps the search should default to a "Contains" search if such characters are found in a term?
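As a rough sketch of the splitting behaviour described above (this is not BookStack's actual implementation, just an illustration of space/punctuation-based term splitting in PHP):
// Illustrative only: split content into terms on whitespace and a few punctuation characters.
$text  = 'The orange cat sat down. 橘色的貓坐下了。';
$terms = preg_split('/[\s,.!?。]+/u', $text, -1, PREG_SPLIT_NO_EMPTY);
print_r($terms);
// English words become separate terms ("The", "orange", "cat", "sat", "down"),
// but the whole Chinese sentence stays as a single term ("橘色的貓坐下了"),
// so a "Starts With" match on 貓 cannot find it, while a "Contains" match can.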
@sosize commented on GitHub (Dec 28, 2019):
@ssddanbrown Could this be made a config option, to select "Starts With" or "Contains" as the search type? Even better would be full-text search.
Or, how could the code be quickly modified?
@lishuai199502 commented on GitHub (Apr 2, 2020):
Can I replace all of the "starts with" matching with "contains", or how should I modify the source code? Sorry, I'm a noob.
@lishuai199502 commented on GitHub (Apr 2, 2020):
Hi all, I fixed this problem in v0.28.3 by just adding a '%' in SearchService.php.
In detail: in app/Entities/SearchService.php, around line 196, modify
$query->orWhere('term', 'like', $inputTerm . '%');
to
$query->orWhere('term', 'like', '%'.$inputTerm . '%');
Just try it.
@0x9394 commented on GitHub (Aug 18, 2020):
@ssddanbrown Hi, can the above fix be merged into the source?
After modifying SearchService.php I can now search both Chinese and English in the text body.
@chimin-roh commented on GitHub (Aug 19, 2022):
(I'm Korean and the same problem occurs.)
I know this issue is closed, but I'll post some info in the hope it will help others in the future.
My BookStack version: v22.07.3
In app/Entities/Tools/SearchRunner.php, around lines 222 and 281:
※ To find terms in the middle of a sentence, change
$query->orWhere('term', 'like', $inputTerm . '%');
to
$query->orWhere('term', 'like', '%'.$inputTerm . '%');
※ To sort results correctly, change
$termQuery->orWhere('term', 'like', $term . '%');
to
$termQuery->orWhere('term', 'like', '%'.$term . '%');
@derky1202 commented on GitHub (Sep 17, 2022):
Nice job, thanks.
@charlietag commented on GitHub (Jul 23, 2023):
I've made a PR to make it configurable in .env.
Hope I'm doing it the right way:
#4393
@ssddanbrown commented on GitHub (Jul 23, 2023):
For me to properly look at addressing this, it would be useful if people could help me a little in understanding how the languages in question work. Apologies for my naivety on the subject.
For example, if I was searching for "orange cat" in English, would the equivalent Chinese search query contain a space?
@charlietag commented on GitHub (Jul 23, 2023):
Hi @ssddanbrown, thanks for helping to solve this for non-English languages.
I hope the following helps you understand what I'm trying to solve. Assume a scenario like this:
Pages: [screenshot of an example page]; in Chinese, it would be: [screenshot]
Database table (search_terms): [screenshot]
In normal search mode, the query is designed to be "starts with", because each value in the term column stores only one word. So it's OK in English.
In Chinese, it would be stored in search_terms like this: [screenshot]. As you can see, the term column stores multiple words in one value.
English vs Chinese: [screenshot]
What we actually prefer: [screenshot]
But I'm not sure this is a good design at the indexing level.
Re-design: I'm not good at the indexing area. I have a question: why not just search the pages table using LIKE '%term%', and let the database deal with the indexing?
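To make the point above concrete (the original screenshots are not reproduced here, so this is a hypothetical reconstruction of the stored data, not BookStack's actual rows):
// Hypothetical contents of the `term` column in search_terms:
$englishPageTerms = ['The', 'orange', 'cat']; // one word per row
$chinesePageTerms = ['橘色的貓'];              // the whole phrase in one row
// A prefix match such as  term LIKE '貓%'  finds nothing for the Chinese page,
// while a contains match such as  term LIKE '%貓%'  would match.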
@charlietag commented on GitHub (Jul 23, 2023):
Normal search
So if we search for "orange cat", in Chinese it would be 橘子 貓. And since the table search_terms contains nothing like 橘子 貓, I will get nothing.
If I search for the following, it will also fail:
English (failed): range, at (my users like to copy-paste fragments when searching)
Chinese (failed): 橘貓
What I hope for: I hope I can search for things like the failed cases above.
Exact search
I can use exact search to achieve the purpose above:
English (success): "range", "at"
Chinese (success): "橘", "貓"
But general users will not remember to add quotes (") when searching for things.
@ssddanbrown commented on GitHub (Jul 23, 2023):
Thanks for the info @charlietag.
The database won't use indexes for queries like that. The search index is specifically built so that prefix-based matching can be performed while making use of database indexes. Additionally, "contains" matching, in the context of how things are currently built, would significantly increase accidental matches on partially included terms, and therefore impact the scoring.
Databases do often have full-text indexes for "contains" search (which BookStack used to use), but those have their own complications and there's a reason we moved away from them.
My intention has been to alter how we split the terms for indexing and search, for different character ranges, much like you've suggested, but I just want to better understand how searches and words translate in different languages, hence my last comment.
I would still like to invite others, particularly those using other Asian languages, to answer my previous comment.
@10935336 commented on GitHub (Jul 24, 2023):
I'm not a language expert, so this answer may not be entirely accurate.
There are some cases where a single Chinese character maps to a single Latin word. A search for a single Chinese character usually does not return useful results, but sometimes people still search for a single character, like "cat" / "猫". Here are some searches recorded by Google Analytics on my website: [screenshot]
Words are not separated by spaces in Chinese, Japanese, and Korean.
Unlike most languages, Chinese does not use spaces to separate characters into words.
When searching in Chinese, you would not use spaces to separate terms in a query. Instead, you would enter the characters for each term next to each other without spaces.
So search engines usually use a tokenizer to break a sentence into words. In the example of "orange cat", it could be 橘猫, 橘色猫, or 橘色的猫 ("orange-colored cat"), so there seems to be no easy way to segment words.
To be honest, it is very difficult to search Sino-Tibetan languages well, so many applications I have seen choose Elasticsearch as their search engine.
Even in Elasticsearch, many people are not satisfied with the official tokenizer, and many other tokenizers have been created.
Update:
This may be the solution you want. Jieba is a popular (32.7k-star) Chinese word segmentation component, and it has a PHP port. But it seems that Jieba consumes a fair amount of memory; there is also a more lightweight module.
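For reference, a minimal sketch of segmenting Chinese text with a Jieba PHP port (assuming the fukuball/jieba-php package installed via Composer; the package name and the API shown are assumptions based on its README and may differ between versions):
require 'vendor/autoload.php';
use Fukuball\Jieba\Jieba;
use Fukuball\Jieba\Finalseg;
ini_set('memory_limit', '1024M'); // Jieba's dictionaries are memory-hungry
Jieba::init();
Finalseg::init();
// Segment a sentence into words, which could then be indexed as individual search terms.
$words = Jieba::cut('橘色的貓坐在沙發上');
print_r($words); // e.g. something like ["橘色", "的", "貓", "坐在", "沙發", "上"]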
@matteotw commented on GitHub (Apr 23, 2024):
I also couldn't search Chinese words successfully (English keywords are OK). I have no experience with this; I just guess it could be optimised through something like an Asian-language parser, for example:
https://docs-develop.pleroma.social/backend/configuration/howto_search_cjk/
https://pgroonga.github.io/
@kernelry commented on GitHub (Jul 1, 2024):
Version: v24.02.2
I think I solved the problem. Modify the code on line 213 of /var/www/BookStack/app/Search/SearchRunner.php.
Before the modification: only one result... [screenshot]
After the modification: seven results! [screenshot]
@charlietag commented on GitHub (Jul 27, 2024):
Hi @kernelry
Actually, that's what I proposed to the author, but he has his own considerations.
For now we can only work around it.
Let's hope it will be fixed in a future version.
https://github.com/BookStackApp/BookStack/pull/4393
@johnroyer commented on GitHub (Jan 31, 2025):
I took a look at the search functionality. Parsing text into tokens works well for English-like languages, but it is not a good approach for CJK languages (Chinese, Japanese, Korean), because they do not use spaces to separate words and phrases.
MySQL supports full-text indexing (MATCH ... AGAINST), but it does not perform really well for search. I would like to suggest Meilisearch as a full-text indexing engine. Meilisearch uses n-grams to generate tokens, which works better than splitting on spaces, and by using n-grams it supports CJK languages.
I created a demo project with BookStack and Meilisearch: https://github.com/johnroyer/BookStack-Meilisearch . It uses Meilisearch to show search suggestions.
@ssddanbrown: a search engine is hard work. I hope you can put your time into implementing features, rather than a search engine. Thanks a lot.
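As a rough illustration of the n-gram idea (this is not Meilisearch's actual tokenizer, just a character-bigram sketch in plain PHP showing how CJK text could be split into indexable terms):
// Generate character bigrams for a CJK string so that fragments from the
// middle of a sentence become indexable terms of their own.
function cjkBigrams(string $text): array
{
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
    $grams = [];
    for ($i = 0; $i < count($chars) - 1; $i++) {
        $grams[] = $chars[$i] . $chars[$i + 1];
    }
    return $grams;
}
print_r(cjkBigrams('橘色的貓坐下了'));
// ["橘色", "色的", "的貓", "貓坐", "坐下", "下了"]
// A query is split the same way, so searching for 的貓 (a fragment from the
// middle of the sentence) matches the indexed bigram directly, with no
// leading-wildcard LIKE needed.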