URLInTextProcessor: Test against the WHATWG URL test suite #179
Conversation
We may want to reject these as well, even though the invalid port looks well-formed. Why? Because the original URL could never have pointed to a valid resource, so there is no point trying to rewrite it: the result will still point to an invalid resource.
Yes, and it's worth pointing out that the WHATWG parser and tests attempt to validate entire strings that are potentially complete URLs. That is a major source of difference I've found when pursuing a goal like yours: determining which substrings inside a text are valid URLs. I would love a more standardized approach to this, but it may not exist.
I rephrased the example to:
I wonder what your take on that is.
\/\/                          # The protocol does not have to be there, but when
                              # it is, it must be followed by \/\/
(?<scheme>[a-z0-9\+]+?:)?     #
(?:\/|\/\/)?                  # The protocol may optionally be followed by one or two slashes
which supported URLs do we have where there’s only a single slash?
WHATWG surprised me by how lenient it is on slashes:
> (new URL("https://example.com/")).toString()
'https://example.com/'
> (new URL("https:/example.com/")).toString()
'https://example.com/'
> (new URL("https:example.com/")).toString()
'https://example.com/'
🤨
Safari's URL bar accepts a single slash, but falls back to the search engine when there is no slash at all.
This is annoying.
it is!
I've adjusted it to match zero or more slashes after all, so that it captures the entire file:///path URL and rejects it as a whole (as opposed to matching just the //path subset and assuming it's a protocol-relative URL).
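For illustration, here's a minimal PCRE sketch of that idea (the pattern, group names, and example input are placeholders, not the production regex): an optional scheme followed by a run of zero or more slashes is captured as a single candidate, so file:///path surfaces as one match whose scheme can then be inspected and rejected.

<?php
// Hypothetical sketch: capture an optional scheme plus any run of slashes,
// so "file:///path" is matched as one candidate and can be rejected by its
// scheme, instead of "//path" being mistaken for a protocol-relative URL.
$pattern = '~
    (?<scheme>[a-z0-9+]+?:)?   # optional scheme, e.g. "https:" or "file:"
    (?<slashes>/*)             # zero or more slashes after the scheme
    (?<rest>\S+)               # the remainder of the candidate URL
~ix';

preg_match( $pattern, 'file:///path/to/asset.zip', $m );
var_dump( $m['scheme'] ); // string(5) "file:" -> non-HTTP(S), reject the whole match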
Force-pushed from 9125fdf to c2e05d1.
I'll go ahead and merge this PR. Let's keep this conversation going separately. I'm happy to start rejecting those URLs if we conclude that's a better trade-off than not rejecting them.
What this does
This PR vets our URL text matcher against the WHATWG URL test suite and tightens the parsing to make smarter, pragmatic matches in real-world prose. The goal isn’t to be a full, spec-compliant parser; it’s to extract migration-relevant URLs from text with high precision and minimal false positives.
Why not strictly WHATWG?
The WHATWG parser assumes a base URL, a controlled environment, and full parse/serialize semantics. Inline detection in user content has different constraints: no base URL, lots of punctuation and whitespace noise, and a strong bias toward HTTP(S) links used for site migrations. The matcher therefore makes deliberate trade-offs.
Deliberate deviations from WHATWG
Reject URLs with embedded credentials
https://user:[email protected] is not matched. Rewriting URLs that presume transferable credentials is hazardous and rarely correct for migrations.

Reject non-HTTP(S) schemes
Out of scope for site moves; all of these are rejected: gopher://site.com, blob:afgh2-48189d, ahttp://site.com, mailto:[email protected], file://asset.zip. If we need additional schemes later, we can add them intentionally.

Reject non-absolute-looking references
While we do rely on a base URL, inputs like ::, /index.html, and ?query are still ignored. Bare-domain forms like mysite.org/?query are still matched.

Handle trailing punctuation sensibly
https://mysite.com/path/.. is interpreted as https://mysite.com/path/ rather than collapsing to the origin. A final period is far more likely sentence punctuation than ../. If a user truly writes https://mysite.com/path/../, we parse it as expected.

Fuzzy matching for malformed ports
"http://w.org:100000 plugins are in the directory" → failure
."http://w.org:100000 plugins are in the directory" → "http://wp.org/"
(truncate at the invalid port).This is a best‑effort extraction of the valid prefix rather than an all‑or‑nothing rejection.
Whitespace handling
"http://example\t.\norg" → "http://example.org/"
."http://example/"
.This reflects how URLs actually appear in text blocks where whitespace often terminates a link.
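To make the trade-offs above easy to scan, here is a minimal sketch of them as input → expected pairs. The array simply mirrors the examples given in each item and is illustrative only, not the actual test fixture.

<?php
// Illustrative input => expected pairs mirroring the deviations above.
// null means the candidate is rejected / not matched.
$examples = array(
    'https://user:[email protected]' => null,                       // embedded credentials
    'mailto:[email protected]'       => null,                       // non-HTTP(S) scheme
    '/index.html'                 => null,                       // not absolute-looking
    'https://mysite.com/path/..'  => 'https://mysite.com/path/', // trailing period is punctuation
    'http://w.org:100000 plugins are in the directory'
                                  => 'http://w.org/',            // truncate at the invalid port
    "http://example\t.\norg"      => 'http://example/',          // whitespace ends the match
);

foreach ( $examples as $input => $expected ) {
    printf( "%-60s => %s\n", json_encode( $input ), $expected ?? '(rejected)' );
}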
Tests
The WHATWG URL test suite is copied and adapted for “URL-in-text” matching, with additional cases covering credentials, schemes, punctuation, malformed ports, and whitespace. CI exercises the full set.
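As a rough illustration of how such an adaptation can be wired up (the fixture path, the skip rules, and the stand-in matcher below are placeholders for this sketch, not the component's actual test code):

<?php
// Iterate the WHATWG urltestdata.json fixture and compare each case against
// a stand-in matcher. The real assertions live in the PHPUnit test; this only
// sketches the adaptation, with deliberate deviations skipped up front.

// Placeholder matcher: grabs the first http(s)-looking substring, if any.
function find_first_url( string $text ): ?string {
    return preg_match( '~https?://\S+~i', $text, $m ) ? $m[0] : null;
}

$cases   = json_decode( file_get_contents( __DIR__ . '/urltestdata.json' ), true );
$checked = 0;
$skipped = 0;

foreach ( $cases as $case ) {
    if ( ! is_array( $case ) ) {
        continue; // the fixture interleaves comment strings with test objects
    }
    $input = $case['input'] ?? '';

    // Deliberate deviations: credentials and non-HTTP(S) schemes are out of scope.
    if ( str_contains( $input, '@' ) || ! preg_match( '~^https?:~i', $input ) ) {
        $skipped++;
        continue;
    }

    $expected = ! empty( $case['failure'] ) ? null : ( $case['href'] ?? null );
    if ( find_first_url( $input ) !== $expected ) {
        printf( "deviates from WHATWG: %s\n", $input );
    }
    $checked++;
}
printf( "checked %d cases, skipped %d deliberate deviations\n", $checked, $skipped );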