Prevent misconfigured clients from unknowingly generating extra traffic #8605

Open
@dstufft

Currently in PyPI's API we have a number of "helper" redirects that canonicalize a URL, ensuring that each cached object lives under a single URL. Off the top of my head, those redirects are:

  • Redirect pypi.python.org to pypi.org
  • Add a trailing slash if one is missing
  • Normalize the name of a project

Most of our clients have "internalized" these rules and are careful to request URLs that avoid these redirects. The rules behind the redirects were designed so that they can be implemented on the client side rather than needing to be implemented in the server.
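As a rough illustration of what applying these rules client side can look like, here is a minimal Python sketch. The normalization regex is the one from PEP 503. The `/simple/` path and the helper names `normalize_name` and `simple_url` are assumptions made for the example, not something this issue specifies.

```python
import re

# Rule 1: always talk to the canonical host, never pypi.python.org.
CANONICAL_HOST = "pypi.org"


def normalize_name(name: str) -> str:
    """Rule 3: PEP 503 normalization (runs of -, _ and . become one -, lowercased)."""
    return re.sub(r"[-_.]+", "-", name).lower()


def simple_url(project: str) -> str:
    """Build a project URL that should need none of the helper redirects.

    The /simple/ path is an assumption for illustration.
    """
    # Rule 2: the trailing slash is part of the canonical URL.
    return f"https://{CANONICAL_HOST}/simple/{normalize_name(project)}/"


print(simple_url("Django"))          # https://pypi.org/simple/django/
print(simple_url("zope.interface"))  # https://pypi.org/simple/zope-interface/
```

A client that builds every URL this way should never trigger any of the three redirects listed above.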

Because most of our clients implement these rules themselves, these redirects are rarely cached, and they give a misconfigured client a way to unknowingly generate a large amount of traffic that hits our origin servers directly. Everything will appear to work, since most clients transparently follow the redirect, but if such a client ever starts hitting PyPI hard, it could cause us issues. It also just generates unneeded traffic; eliminating it would provide a better overall experience and be kinder to our donated infrastructure.

I see two real options we could go down:

  1. Move the redirect handling into Fastly, and add some kind of logging so we can monitor for large increases in redirect traffic to see the overall "health" and look for possible problems.

    This keeps things working as they do now, but means that a misconfigured or less optimal client is less likely to generate origin traffic, and Fastly shouldn't have any problems keeping up with the traffic. It does make it harder to have visibility into the issue. Honestly, though, I don't think we have good visibility now: we only noticed this because traffic spiked so high that it caused problems handling the scale, so the added logging would probably give us better visibility than we have today.
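    To make the monitoring idea concrete, here is a small Python sketch of how logged request URLs could be classified by which redirect rule they would trigger. It is not Fastly configuration; the reason labels, the `/simple/<name>/` path shape, and the function names `redirect_reasons` and `tally` are all hypothetical.

```python
import re
from collections import Counter
from urllib.parse import urlsplit


def redirect_reasons(url: str) -> list[str]:
    """Return the (hypothetical) rule labels a request URL would trip."""
    parts = urlsplit(url)
    reasons = []
    if parts.hostname == "pypi.python.org":
        reasons.append("legacy-host")
    if not parts.path.endswith("/"):
        reasons.append("missing-trailing-slash")
    # Assumed path shape for illustration: /simple/<project>/
    segments = [s for s in parts.path.split("/") if s]
    if len(segments) == 2 and segments[0] == "simple":
        name = segments[1]
        if name != re.sub(r"[-_.]+", "-", name).lower():
            reasons.append("unnormalized-name")
    return reasons


def tally(urls) -> Counter:
    """Count how often each rule fires across a batch of logged URLs."""
    counts = Counter()
    for url in urls:
        counts.update(redirect_reasons(url))
    return counts
```

    Watching these per-rule counts over time is one way to spot a large increase in redirect traffic before it becomes a scaling problem.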

  2. Remove the redirects completely, and require that clients correctly generate URLs instead of allowing them to fall back to using additional traffic to paper over incorrectly generated URLs.

    This makes it glaringly obvious to clients that they have not correctly implemented URL generation and need to adjust things, since requests simply won't work unless the URL is generated correctly. We already have precedent for this: in our API routes we do not redirect HTTP to HTTPS, and instead return an error, forcing clients to "get it right" from the start. The biggest downside is that it is a breaking change for all clients, particularly older versions of pip that predate the point where the URL generation rules were stable and could be applied entirely client side. That being said, those versions of pip would not be wholly out of luck: their users would just have to manually apply the third rule when specifying names to install (e.g., don't do pip install Django, do pip install django, because those versions of pip won't normalize Django to django).

    To be clear, this would only apply to "API" URLs. URLs that are intended for humans to access should retain these redirects, as they're an important part of the UX, but automated tooling can reasonably be expected to generate correct URLs and not rely on redirects.
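    Client authors who want this failure mode today, ahead of any server-side change, can stop following redirects so that a badly generated URL fails loudly. A minimal sketch using Python's standard library; the names `NoRedirect` and `build_strict_opener` are made up for the example.

```python
import urllib.request


class NoRedirect(urllib.request.HTTPRedirectHandler):
    """Refuse to follow any 3xx response."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Returning None declines the redirect; urllib then surfaces the
        # 3xx response as an HTTPError instead of silently following it.
        return None


def build_strict_opener() -> urllib.request.OpenerDirector:
    # Passing the subclass replaces the default redirect handler.
    return urllib.request.build_opener(NoRedirect)
```

    With this opener, a request for a non-canonical URL should raise `urllib.error.HTTPError` carrying the 3xx status code, so the mistake shows up in testing rather than as silent extra traffic.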

Personally, I think I'm in favor of option 2. We would probably want to see how many requests rely on those redirects today and, if it's a lot, phase the change in gently (brownouts, etc.). I think it provides the best overall outcome, both by reducing the amount of traffic being generated and by removing the human factor of having PyPI operators monitor 3xx responses and reach out to or block people who are doing things incorrectly.

    Labels

    CDN/network: Issues related to our CDN, users having problems connecting to PyPI
    needs discussion: a product management/policy issue maintainers and users should discuss
