Invalid markup from page includes #3453

Closed
opened 2026-02-05 06:45:24 +03:00 by OVERLORD · 6 comments
Owner

Originally created by @gaufde on GitHub (Jan 14, 2023).

Describe the Bug

The page include system searches for tags of the form {{@\s?([0-9].*?)}}.

This string is then directly replaced with a new string (see the parsePageIncludes and fetchSectionOfPage methods in BookStack\Entities\Tools\PageContent.

If the new string contains HTML, then the resulting markup will be invalid.

For example, if the replacement string is "<h2 id="bkmrk-heading">this is a heading</h2>" markup such as <p id="bkmrk-example-id">{{@18#bkmrk-another-id}}</p> gets converted into <p id="bkmrk-example-id"><h2 id="bkmrk-heading">this is a heading</h2></p>, which is not valid.

This causes unexpected results such as blank paragraphs being created by the browser.

Steps to Reproduce

See the description of the bug.

If you want to see the side-effect, simply force Bookstack to include some HTML into a page include. This is done when including a table from one page into another page. The page with the embedded table will likely have an unexpected empty paragraph appear.

Expected Behaviour

If the replacement string does not include any HTML markup, then it should work exactly as it does right now.

However, if the replacement string does include some HTML, then Bookstack needs to do something to avoid nesting the HTML and avoid creating empty paragraphs. One solution would be to replace the entire block with the new content.

From the example a couple sections above, <p id="bkmrk-example-id">{{@18#bkmrk-another-id}}</p> would become <h2 id="bkmrk-heading">this is a heading</h2>.

This does mean that the ID bkmrk-example-id would not appear on the page. However, I don't think that causes any issues, and it may even be the preferred behavior.

Screenshots or Additional Context

No response

Browser Details

No response

Exact BookStack Version

BookStack v22.11.1

PHP Version

No response

Hosting Environment

N/A

Originally created by @gaufde on GitHub (Jan 14, 2023). ### Describe the Bug The page include system searches for tags of the form `{{@\s?([0-9].*?)}}`. This string is then directly replaced with a new string (see the `parsePageIncludes` and `fetchSectionOfPage` methods in `BookStack\Entities\Tools\PageContent`. If the new string contains HTML, then the resulting markup will be invalid. For example, if the replacement string is `"<h2 id="bkmrk-heading">this is a heading</h2>"` markup such as `<p id="bkmrk-example-id">{{@18#bkmrk-another-id}}</p>` gets converted into `<p id="bkmrk-example-id"><h2 id="bkmrk-heading">this is a heading</h2></p>`, which is not valid. This causes unexpected results such as blank paragraphs being created by the browser. ### Steps to Reproduce See the description of the bug. If you want to see the side-effect, simply force Bookstack to include some HTML into a page include. This is done when including a table from one page into another page. The page with the embedded table will likely have an unexpected empty paragraph appear. ### Expected Behaviour If the replacement string does not include any HTML markup, then it should work exactly as it does right now. However, if the replacement string does include some HTML, then Bookstack needs to do something to avoid nesting the HTML and avoid creating empty paragraphs. One solution would be to replace the entire block with the new content. From the example a couple sections above, `<p id="bkmrk-example-id">{{@18#bkmrk-another-id}}</p>` would become `<h2 id="bkmrk-heading">this is a heading</h2>`. This does mean that the ID `bkmrk-example-id` would not appear on the page. However, I don't think that causes any issues, and it may even be the preferred behavior. ### Screenshots or Additional Context _No response_ ### Browser Details _No response_ ### Exact BookStack Version BookStack v22.11.1 ### PHP Version _No response_ ### Hosting Environment N/A
OVERLORD added the 🐛 Bug:cat2:🐈 Possible duplicate labels 2026-02-05 06:45:24 +03:00
Author
Owner

@ssddanbrown commented on GitHub (Jan 14, 2023):

Thanks @gaufde for reporting. The fundamental issue has already been discovered/reported within issue #3385 (which was opened based upon using the whole page contents, but it's the same underlying issue at play) therefore I'm going to close this off as a duplicate.

@ssddanbrown commented on GitHub (Jan 14, 2023): Thanks @gaufde for reporting. The fundamental issue has already been discovered/reported within issue #3385 (which was opened based upon using the whole page contents, but it's the same underlying issue at play) therefore I'm going to close this off as a duplicate.
Author
Owner

@gaufde commented on GitHub (Jan 14, 2023):

@ssddanbrown Okay that makes sense, it does seem like this is a duplicate then.

Out of curiosity, is a fix for this issue planned for any upcoming releases? I just looked at that other thread and it seems ambiguous since you say it may be difficult to fix.

@gaufde commented on GitHub (Jan 14, 2023): @ssddanbrown Okay that makes sense, it does seem like this is a duplicate then. Out of curiosity, is a fix for this issue planned for any upcoming releases? I just looked at that other thread and it seems ambiguous since you say it may be difficult to fix.
Author
Owner

@ssddanbrown commented on GitHub (Jan 14, 2023):

Out of curiosity, is a fix for this issue planned for any upcoming releases?

@gaufde I do recognize it as a bug, so would like to address at some point, but a fix is not currently in progress and I'm not rushing to it too fast since:

  • It's for a lesser-used power feature.
  • A workaround can be used if problematic (As touched in my my message in the other thread)
  • This logic has existed since the start of the include features, so is not a break in previous functionality.
  • It's a tricky one to address, that may require opinionated logic that may go ahead expectations of the user.
@ssddanbrown commented on GitHub (Jan 14, 2023): > Out of curiosity, is a fix for this issue planned for any upcoming releases? @gaufde I do recognize it as a bug, so would like to address at some point, but a fix is not currently in progress and I'm not rushing to it too fast since: - It's for a lesser-used power feature. - A workaround can be used if problematic (As touched in my my message in the other thread) - This logic has existed since the start of the include features, so is not a break in previous functionality. - It's a tricky one to address, that may require opinionated logic that may go ahead expectations of the user.
Author
Owner

@gaufde commented on GitHub (Jan 14, 2023):

may require opinionated logic that may go ahead expectations of the user.

@ssddanbrown What are your concerns here?

I was thinking through a fix based on the code I wrote for #3854. Seems like the page included tags could be searched for using DOMXpath rather than regex. Then, once the content is retrieved from the other page, it could be used to replace the node(s) found by the XPath query.

To handle cases where the incoming string is not HTML, there might need to be a regex based function to search for starting and ending html tags. If the replacement string doesn’t start with an HTML tag (besides strong or em), then the string could be used to replace the content of the node rather than replacing the entire node itself. At lest that is what it should do in essence. I believe the function would need to make a duplicate node with the new content and replace the old node.

lastly, the function would need to take the new DOM tree and convert it back to an HTML string since that is what the current function returns.

I don’t have much experience with the DOM functions in PHP, but based on what I’ve seen so far this seems like a more robust way to modify HTML strings.

I understand that this isn’t a high priority bug, but I think it is important if you want the “Page Include Parse” logical theme event to be used to its full potential. I’m willing to help as I can since this is a bug that is currently somewhat important to fix for my workflow. Let me know what you think!

@gaufde commented on GitHub (Jan 14, 2023): > may require opinionated logic that may go ahead expectations of the user. @ssddanbrown What are your concerns here? I was thinking through a fix based on the code I wrote for #3854. Seems like the page included tags could be searched for using DOMXpath rather than regex. Then, once the content is retrieved from the other page, it could be used to replace the node(s) found by the XPath query. To handle cases where the incoming string is not HTML, there might need to be a regex based function to search for starting and ending html tags. If the replacement string doesn’t start with an HTML tag (besides strong or em), then the string could be used to replace the content of the node rather than replacing the entire node itself. At lest that is what it should do in essence. I believe the function would need to make a duplicate node with the new content and replace the old node. lastly, the function would need to take the new DOM tree and convert it back to an HTML string since that is what the current function returns. I don’t have much experience with the DOM functions in PHP, but based on what I’ve seen so far this seems like a more robust way to modify HTML strings. I understand that this isn’t a high priority bug, but I think it is important if you want the “Page Include Parse” logical theme event to be used to its full potential. I’m willing to help as I can since this is a bug that is currently somewhat important to fix for my workflow. Let me know what you think!
Author
Owner

@ssddanbrown commented on GitHub (Jan 14, 2023):

Yeah, revising to a DOMDocument approach was my expectation to handle cases properly.

The complications arise in regards to context of inserted nodes. Dealing with the content just based upon containing non-strong/em tags is simplified. The range of valid inline elements is much wider, and I'm sure there are cases where insertion of block level HTML can be valid, so the parent would also need to be inspected.

Additionally, the action to replace the old node is an assumption of intended behavior. An include tag could be mixed alongside content which complicates matters. While replace would makes sense if the context of it being the only content, it may otherwise be more practical to insert afterwards instead in some cases. Would maybe need matching to the browser's current auto-rejigging of content.

@ssddanbrown commented on GitHub (Jan 14, 2023): Yeah, revising to a DOMDocument approach was my expectation to handle cases properly. The complications arise in regards to context of inserted nodes. Dealing with the content just based upon containing non-strong/em tags is simplified. The range of valid inline elements is much wider, and I'm sure there are cases where insertion of block level HTML can be valid, so the parent would also need to be inspected. Additionally, the action to replace the old node is an assumption of intended behavior. An include tag could be mixed alongside content which complicates matters. While replace would makes sense if the context of it being the only content, it may otherwise be more practical to insert afterwards instead in some cases. Would maybe need matching to the browser's current auto-rejigging of content.
Author
Owner

@gaufde commented on GitHub (Jan 30, 2023):

Okay, this gives me a good sense of all the cases you are thinking about. You are right my description was somewhat simplified. I had not considered that people might nest content along side an include tag.

I tried to get some code started for this project. However, I ran into some complications I wasn't expecting. Turns out getting the proper DOM nodes can't be done with regex and DOMDocument/Xpath easily. You can of course use contains in an XPath expression, but then you would only be able to match for {{@ rather than the entire page include tag.

I'll chip away at it next chance I get to devote some time to this project, but it won't be as immediate as I had hoped.

@gaufde commented on GitHub (Jan 30, 2023): Okay, this gives me a good sense of all the cases you are thinking about. You are right my description was somewhat simplified. I had not considered that people might nest content along side an include tag. I tried to get some code started for this project. However, I ran into some complications I wasn't expecting. Turns out getting the proper DOM nodes can't be done with regex and DOMDocument/Xpath easily. You can of course use contains in an XPath expression, but then you would only be able to match for `{{@` rather than the entire page include tag. I'll chip away at it next chance I get to devote some time to this project, but it won't be as immediate as I had hoped.
Sign in to join this conversation.
1 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: starred/BookStack#3453