URLInTextProcessor: Test against the WHATWG URL test suite #179
Conversation
We may want to reject these as well, even though the invalid port looks well-formed. Why? Because the original URL could never have pointed to a valid resource, so there is no point trying to rewrite it: the result will still point to an invalid resource.
Yes, and it's worth pointing out that the WHATWG parser and tests attempt to validate entire strings that are potentially complete URLs. That is a major source of difference I've found when pursuing a goal like yours: determining which substrings inside a text are valid URLs. I would love a more standardized approach to this, but it may not exist.
I rephrased the example to:
I wonder what your take on that is.
\/\/                          # The protocol does not have to be there, but when
                              # it is, it must be followed by \/\/
(?<scheme>[a-z0-9\+]+?:)?     #
(?:\/|\/\/)?                  # The protocol may optionally be followed by one or two slashes
which supported URLs do we have where there’s only a single slash?
WHATWG surprised me by how lenient it is on slashes:
> (new URL("https://example.com/")).toString()
'https://example.com/'
> (new URL("https:/example.com/")).toString()
'https://example.com/'
> (new URL("https:example.com/")).toString()
'https://example.com/'
🤨
Safari's URL bar accepts a single slash, but falls back to the search engine when there is no slash at all.
This is annoying.
it is!
I've adjusted it to match zero or more slashes after all, so that it captures the entire file:///path URL and rejects it as a whole (as opposed to matching just the //path subset and assuming it's a protocol-relative URL).
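For illustration, here's a minimal PCRE sketch of that idea (the pattern, group names, and example input are placeholders, not the production regex): an optional scheme followed by a run of zero or more slashes is captured as a single candidate, so file:///path surfaces as one match whose scheme can then be inspected and rejected.

<?php
// Hypothetical sketch: capture an optional scheme plus any run of slashes,
// so "file:///path" is matched as one candidate and can be rejected by its
// scheme, instead of "//path" being mistaken for a protocol-relative URL.
$pattern = '~
    (?<scheme>[a-z0-9+]+?:)?   # optional scheme, e.g. "https:" or "file:"
    (?<slashes>/*)             # zero or more slashes after the scheme
    (?<rest>\S+)               # the remainder of the candidate URL
~ix';

preg_match( $pattern, 'file:///path/to/asset.zip', $m );
var_dump( $m['scheme'] ); // string(5) "file:" -> non-HTTP(S), reject the whole match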
Force-pushed from 9125fdf to c2e05d1.
I'll go ahead and merge this PR. Let's keep this conversation going separately. I'm happy to start rejecting those URLs if we conclude that's a better trade-off than not rejecting them.
What this does
This PR vets our URL text matcher against the WHATWG URL test suite and tightens the parsing to make smarter, pragmatic matches in real-world prose. The goal isn’t to be a full, spec-compliant parser; it’s to extract migration-relevant URLs from text with high precision and minimal false positives.
Why not strictly WHATWG?
The WHATWG parser assumes a base URL, a controlled environment, and full parse/serialize semantics. Inline detection in user content has different constraints: no base URL, lots of punctuation and whitespace noise, and a strong bias toward HTTP(S) links used for site migrations. The matcher therefore makes deliberate trade-offs.
Deliberate deviations from WHATWG
Reject URLs with embedded credentials
https://user:[email protected] is not matched. Rewriting URLs that presume transferable credentials is hazardous and rarely correct for migrations.

Reject non-HTTP(S) schemes
Out of scope for site moves; all of these are rejected: gopher://site.com, blob:afgh2-48189d, ahttp://site.com, mailto:[email protected], file://asset.zip. If we need additional schemes later, we can add them intentionally.

Reject non-absolute-looking references
While we do rely on a base URL, inputs like ::, /index.html, and ?query are still ignored. Bare-domain forms like mysite.org/?query are still matched.

Handle trailing punctuation sensibly
https://mysite.com/path/.. is interpreted as https://mysite.com/path/ rather than collapsing to the origin. A final period is far more likely sentence punctuation than ../. If a user truly writes https://mysite.com/path/../, we parse it as expected.

Fuzzy matching for malformed ports
"http://w.org:100000 plugins are in the directory" → failure
."http://w.org:100000 plugins are in the directory" → "http://wp.org/"
(truncate at the invalid port).This is a best‑effort extraction of the valid prefix rather than an all‑or‑nothing rejection.
Whitespace handling
"http://example\t.\norg" → "http://example.org/"
."http://example/"
.This reflects how URLs actually appear in text blocks where whitespace often terminates a link.
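To make the trade-offs above easy to scan, here is a minimal sketch of them as input → expected pairs. The array simply mirrors the examples given in each item and is illustrative only, not the actual test fixture.

<?php
// Illustrative input => expected pairs mirroring the deviations above.
// null means the candidate is rejected / not matched.
$examples = array(
    'https://user:[email protected]' => null,                       // embedded credentials
    'mailto:[email protected]'       => null,                       // non-HTTP(S) scheme
    '/index.html'                 => null,                       // not absolute-looking
    'https://mysite.com/path/..'  => 'https://mysite.com/path/', // trailing period is punctuation
    'http://w.org:100000 plugins are in the directory'
                                  => 'http://w.org/',            // truncate at the invalid port
    "http://example\t.\norg"      => 'http://example/',          // whitespace ends the match
);

foreach ( $examples as $input => $expected ) {
    printf( "%-60s => %s\n", json_encode( $input ), $expected ?? '(rejected)' );
}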
Tests
The WHATWG URL test suite is copied and adapted for “URL-in-text” matching, with additional cases covering credentials, schemes, punctuation, malformed ports, and whitespace. CI exercises the full set.
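As a rough illustration of how such an adaptation can be wired up (the fixture path, the skip rules, and the stand-in matcher below are placeholders for this sketch, not the component's actual test code):

<?php
// Iterate the WHATWG urltestdata.json fixture and compare each case against
// a stand-in matcher. The real assertions live in the PHPUnit test; this only
// sketches the adaptation, with deliberate deviations skipped up front.

// Placeholder matcher: grabs the first http(s)-looking substring, if any.
function find_first_url( string $text ): ?string {
    return preg_match( '~https?://\S+~i', $text, $m ) ? $m[0] : null;
}

$cases   = json_decode( file_get_contents( __DIR__ . '/urltestdata.json' ), true );
$checked = 0;
$skipped = 0;

foreach ( $cases as $case ) {
    if ( ! is_array( $case ) ) {
        continue; // the fixture interleaves comment strings with test objects
    }
    $input = $case['input'] ?? '';

    // Deliberate deviations: credentials and non-HTTP(S) schemes are out of scope.
    if ( str_contains( $input, '@' ) || ! preg_match( '~^https?:~i', $input ) ) {
        $skipped++;
        continue;
    }

    $expected = ! empty( $case['failure'] ) ? null : ( $case['href'] ?? null );
    if ( find_first_url( $input ) !== $expected ) {
        printf( "deviates from WHATWG: %s\n", $input );
    }
    $checked++;
}
printf( "checked %d cases, skipped %d deliberate deviations\n", $checked, $skipped );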