Skip to content

Conversation

adamziel
Copy link
Collaborator

@adamziel adamziel commented Sep 10, 2025

@dmsnell raised a good question in WordPress/wordpress-importer#202:

Have we audited against the WHATWG spec to see if we accept things we should reject, or reject things we should accept? Off the top of my head I don’t remember how auth in URLs is handled in the spec, but I doubt there’s a strong case for rewriting URLs with embedded auth.

What this does

This PR vets our URL text matcher against the WHATWG URL test suite and tightens the parsing to make smarter, pragmatic matches in real-world prose. The goal isn’t to be a full, spec-compliant parser; it’s to extract migration-relevant URLs from text with high precision and minimal false positives.

Why not strictly WHATWG?

The WHATWG parser assumes a base URL, a controlled environment, and full parse/serialize semantics. Inline detection in user content has different constraints: no base URL, lots of punctuation and whitespace noise, and a strong bias toward HTTP(S) links used for site migrations. The matcher therefore makes deliberate trade-offs.

Deliberate deviations from WHATWG

Reject URLs with embedded credentials
https://user:[email protected] is not matched. Rewriting URLs that presume transferable credentials is hazardous and rarely correct for migrations.

Reject non-HTTP(S) schemes
Out of scope for site moves; all of these are rejected:
gopher://site.com, blob:afgh2-48189d, ahttp://site.com, mailto:[email protected], file://asset.zip.
If we need additional schemes later, we can add them intentionally.

Reject non‑absolute‑looking references
While we do rely on a base URL, inputs like ::, /index.html, ?query are still ignored. Bare-domain forms like mysite.org/?query are still matched.

Handle trailing punctuation sensibly
https://mysite.com/path/.. is interpreted as https://mysite.com/path/ rather than collapsing to the origin. A final period is far more likely sentence punctuation than ../. If a user truly writes https://mysite.com/path/../, we parse it as expected.

Fuzzy matching for malformed ports

  • WHATWG: "http://w.org:100000 plugins are in the directory" → failure.
  • Inline detection: "http://w.org:100000 plugins are in the directory" → "http://wp.org/" (truncate at the invalid port).

This is a best‑effort extraction of the valid prefix rather than an all‑or‑nothing rejection.

Whitespace handling

  • WHATWG: "http://example\t.\norg" → "http://example.org/".
  • Inline detection: stops at the first whitespace, yielding "http://example/".

This reflects how URLs actually appear in text blocks where whitespace often terminates a link.

Tests

The WHATWG URL test suite is copied and adapted for “URL-in-text” matching, with additional cases covering credentials, schemes, punctuation, malformed ports, and whitespace. CI exercises the full set.

@dmsnell
Copy link
Member

dmsnell commented Sep 10, 2025

Fuzzy matching for malformed ports

We may want to reject these as well, given that the invalid port looks well-formed. Why? Because we know that the original URL could not have pointed to a valid resource, so we don’t have to try and rewrite it, given that we know it will still point to an invalid resource.

This reflects how URLs actually appear in text blocks where whitespace often terminates a link.

Yes, and worth pointing out that the WHATWG parser and tests are attempting to validate entire strings which are potentially complete URLs. This is a major source of difference that I have found when trying to fulfill a purpose like you are doing, which is determining which substrings inside a text are valid URLs.

I would love to have a more standardized approach to this, but it may not exist.

@adamziel
Copy link
Collaborator Author

adamziel commented Sep 10, 2025

Refined GitHub highlights a URL we'd reject :D

CleanShot 2025-09-10 at 16 35 11@2x

@adamziel
Copy link
Collaborator Author

adamziel commented Sep 10, 2025

We may want to reject these as well, given that the invalid port looks well-formed. Why? Because we know that the original URL could not have pointed to a valid resource, so we don’t have to try and rewrite it, given that we know it will still point to an invalid resource.

I rephrased the example to:

  • WHATWG: "http://w.org:100000 plugins are in the directory" → failure.
  • Inline detection: "http://w.org:100000 plugins are in the directory" → "http://wp.org/" (truncate at the invalid port).

I wonder what's your take on that

\/\/ # The protocol does not have to be there, but when
# it is, is must be followed by \/\/
(?<scheme>[a-z0-9\+]+?:)? #
(?:\/|\/\/)? # The protocol may optionally be followed by one or two slashes
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

which supported URLs do we have where there’s only a single slash?

Copy link
Collaborator Author

@adamziel adamziel Sep 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WHATWG surprised me by how lenient it is on slashes:

> (new URL("https://example.com/")).toString()
'https://example.com/'

> (new URL("https:/example.com/")).toString()
'https://example.com/'

> (new URL("https:example.com/")).toString()
'https://example.com/'

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤨

Safari’s URL bar works with one, but diverts to the search engine if no slash is there.

This is annoying.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is!

Copy link
Collaborator Author

@adamziel adamziel Sep 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've adjusted it to match zero or more slashes after all to make it capture the entire file:///path URL and reject it as a whole (as opposed to matching just the //path subset and assuming it's a protocol-relative URL)

@adamziel adamziel force-pushed the test-inline-url-matcher-against-whatwg-url-test-suite branch from 9125fdf to c2e05d1 Compare September 10, 2025 15:58
@adamziel
Copy link
Collaborator Author

adamziel commented Sep 10, 2025

We may want to reject these as well, given that the invalid port looks well-formed. Why? Because we know that the original URL could not have pointed to a valid resource, so we don’t have to try and rewrite it, given that we know it will still point to an invalid resource.

I rephrased the example to:

  • WHATWG: "http://w.org:100000 plugins are in the directory" → failure.
  • Inline detection: "http://w.org:100000 plugins are in the directory" → "http://wp.org" (truncate at the invalid port).

I wonder what's your take on that

I'll go ahead and merge this PR. Let's keep this conversation going separately. I'm happy to start rejecting those URLs if we'll conclude it's a better trade-off than not rejecting them.

@adamziel adamziel merged commit 93bc672 into trunk Sep 10, 2025
21 of 22 checks passed
@adamziel adamziel deleted the test-inline-url-matcher-against-whatwg-url-test-suite branch September 10, 2025 16:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants