Skip to content

Conversation

untitaker
Copy link
Owner

No description provided.

@untitaker
Copy link
Owner Author

seems that lychee handles those cases by excluding the script tag from scraping, but that's actually a bug in html5gum. the contents of script tags shouldn't be interpreted as HTML at all.

https://github.com/lycheeverse/lychee/blob/eff77d634283610dbdd481db2d9b1d1162cdea84/lychee-lib/src/extract/html/mod.rs#L44

cc @mre

@untitaker
Copy link
Owner Author

untitaker/html5gum#98

@mre
Copy link

mre commented Oct 29, 2024

Thanks for the info. Could you remind me once this is merged or maybe even update the code in lychee? 🙃

@untitaker
Copy link
Owner Author

sure

@untitaker
Copy link
Owner Author

html5gum 0.6.1 is out, and can be bumped in lychee. I guess dependabot will do it, there shouldn't be any code changes required. You may want to check out CallbackEmitter if you want to simplify some code, though it has a bit of overhead (still way faster than most other things)

For some reason it's not possible to hit this bug in lychee. I don't know why, but the code I linked earlier is not related. I guess it doesn't matter.

@untitaker untitaker marked this pull request as ready for review October 29, 2024 21:33
@untitaker untitaker merged commit 055a110 into main Oct 29, 2024
11 checks passed
@untitaker untitaker deleted the json-script branch October 29, 2024 21:33
@mre
Copy link

mre commented Oct 29, 2024

Cool. Let's wait until the update comes through, then. 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants