
Conversation

duckinator
Contributor

@duckinator duckinator commented Oct 26, 2020

TODO:

The pip cache purge/pip cache remove commands now:

  • remove wheel cache folders without .whl files
  • clear the HTTP cache
  • remove selfcheck.json, which pip does not use anymore.

This PR does not account for having pip cache purge remove things in the pip/selfchecks/ directory when appropriate, which was also discussed in #7372.

@pradyunsg
Member

Heads up: The MacOS workers on Azure Pipelines are failing right now -- #9030.

@duckinator
Contributor Author

duckinator commented Oct 26, 2020

Thanks for the heads up. I also just discovered that os.scandir() didn't arrive until Python 3.5, so will need to use something else.

@pradyunsg
Member

You could wait for a couple of weeks, and we'll drop Python 2 support from master. :)

@duckinator
Contributor Author

Making it work without using os.scandir() is apparently pretty annoying.

Do y'all support Python 3 versions before 3.5? If you don't then, screw it, I'll just wait for you to drop Python 2 support lmao

@pradyunsg
Member

Do y'all support Python 3 versions before 3.5?

Nope. Python 3.5 and Python 2 get dropped after 20.3. :)

@duckinator
Contributor Author

Okay, in that case let's leave this for after Py2 is dropped. 👍

@duckinator
Contributor Author

duckinator commented Dec 4, 2020

Rebased off of master (ab7ff0a).

(It seems Py2 is still in the list of tests; just wanted to keep the PR from getting stale.)

@duckinator
Contributor Author

duckinator commented Feb 3, 2021

Rebased off master (d108e49).

I'm not sure what the reason for the failing CI tasks is. I do see this error in the failed tasks, however:

ERROR: InvocationError for command /Users/runner/work/pip/pip/.tox/lint/bin/pre-commit run --all-files --show-diff-on-failure --hook-stage=manual (exited with code 1)

@duckinator
Contributor Author

duckinator commented Feb 3, 2021

As a note, anything requiring more than rebasing this PR will probably have to wait a month or two at least. Pretty busy with trying to get a house and such. 😅

@uranusjr
Member

uranusjr commented Feb 3, 2021

This seems to be the specific error:

flake8...................................................................Failed
- hook id: flake8
- exit code: 1

src/pip/_internal/commands/cache.py:190:17: E127 continuation line over-indented for visual indent
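
For anyone unfamiliar with E127: flake8 raises it when a continuation line is indented further than the opening bracket it is supposed to line up with. A minimal illustration follows, using a made-up `label` helper rather than the actual code at cache.py:190 (which isn't shown in this thread):

```python
def label(name, size):
    """Hypothetical helper, only here to give the calls something to do."""
    return "{}: {}".format(name, size)


# Flagged by E127: the second argument is indented past the opening bracket.
over_indented = label("wheels",
                          3)

# Accepted: the continuation line lines up with the first argument.
aligned = label("wheels",
                3)

print(over_indented == aligned)
```

Both calls behave identically at runtime; the check is purely about layout.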

@duckinator
Contributor Author

ah, thank you @uranusjr. Had trouble finding that in the CI output. 😅

I fixed that problem (or at least, that's what tox -e lint says locally), but it still ends with this and has exit code 1, for some reason:

isort....................................................................Passed
mypy.....................................................................Passed                            
use logger.warning(......................................................Passed
check for eval().........................................................Passed
rst ``code`` is two backticks............................................Passed
NEWS fragment........................................(no files to check)Skipped
check-manifest...........................................................Passed
ERROR: InvocationError for command /home/puppy/dev/python/duckinator/pip/.tox/lint/bin/pre-commit run --all-files --show-diff-on-failure --hook-stage=manual (exited with code 1)
_______________________________________________________ summary _______________________________________________________
ERROR:   lint: commands failed

Not sure what's up with that.

@sbidoul
Member

sbidoul commented Feb 3, 2021

@duckinator the flake8 error seems to be still present.

@BrownTruck
Contributor

Hello!

I am an automated bot and I have noticed that this pull request is not currently able to be merged. If you are able to either merge the master branch into this pull request or rebase this pull request against master then it will be eligible for code review and hopefully merging!

@BrownTruck BrownTruck added the needs rebase or merge PR has conflicts with current master label Jun 11, 2021
@pradyunsg
Member

pradyunsg commented Sep 18, 2021

Hiya @duckinator -- would you be interested in updating this PR? :)

@pradyunsg pradyunsg added the S: awaiting response Waiting for a response/more information label Sep 18, 2021
@duckinator
Contributor Author

@pradyunsg I'll rebase this sometime in the next few days. If it's not done by Wednesday (September 22nd) feel free to ping me here again. 👍

@duckinator
Contributor Author

Running a bit behind on things this week, but still hoping to get to it in the next few days. Sorry for the delay!

@pradyunsg pradyunsg added this to the 21.3 milestone Sep 22, 2021
@pradyunsg
Member

No worries and thanks for the update! Please don’t feel pressured to update this ASAP. :)

I’m gonna put this into the release milestone to remind myself to keep an eye out for updates here — it’s totally fine if this gets pushed down the road.

@pypa-bot pypa-bot removed the needs rebase or merge PR has conflicts with current master label Sep 24, 2021
@duckinator duckinator force-pushed the issue-7372 branch 2 times, most recently from c75cae3 to 11db9a5 Compare September 24, 2021 06:34
@duckinator
Contributor Author

Rebased, and simplified(!) because some of the changes in the last ~year let me simplify my code. ^.^

@duckinator duckinator changed the title Have pip cache purge and pip cache remove handle directories without .whl files Have pip cache purge and pip cache remove handle directories that only have non-.whl files Aug 6, 2025
@duckinator duckinator changed the title Have pip cache purge and pip cache remove handle directories that only have non-.whl files Have pip cache purge and pip cache remove remove directories that only have non-.whl files Aug 6, 2025
@duckinator duckinator force-pushed the issue-7372 branch 3 times, most recently from 8284080 to 50beb9b Compare August 6, 2025 17:32
@duckinator duckinator force-pushed the issue-7372 branch 2 times, most recently from c229662 to a807060 Compare August 6, 2025 17:37
@duckinator
Contributor Author

duckinator commented Aug 6, 2025

Things sure did happen over the last 5 years.

Thank you all so much for being patient with this and making sure I was kept in the loop on changing requirements!

@pradyunsg this is finally ready for review, after nearly half a freaking decade.

With these changes, pip cache purge/pip cache remove will now:

  • remove wheel cache folders without .whl files (even if they have other contents, as mentioned by @sbidoul)
  • clear the HTTP cache
  • remove selfcheck.json, which pip does not use anymore

In addition, this PR:

  • has been rebased off main (c46141c)
  • is fully typed
  • has no lint errors
  • has a news entry

@duckinator duckinator changed the title Have pip cache purge and pip cache remove remove directories that only have non-.whl files Make pip cache purge and pip cache remove delete additional unneeded files. Aug 6, 2025
@duckinator

This comment was marked as outdated.

@duckinator duckinator force-pushed the issue-7372 branch 3 times, most recently from 1816296 to cd26c8d Compare August 6, 2025 19:02
@duckinator
Contributor Author

duckinator commented Aug 6, 2025

EDIT: My concept for how to fix it worked, I just had the initial implementation wrong.

I figured out a fix, but I'm not sure how to test it.

In practice, both subdirs_without_wheels() and subdirs_without_files() are implemented as directory tree walkers which try to collect as much as possible in a single pass. The problem is this collects directories multiple times.

I fixed it by making the final returned value be a reverse-sorted set instead of a reverse-sorted list. This means that parent directories are only removed once, and they get removed after everything they contain.

This is an example directory structure which reproduces the problem:

/home/puppy/.cache/pip/http/4/6/e/1/8
/home/puppy/.cache/pip/http/4/6/e/1
/home/puppy/.cache/pip/http/4/6/e
/home/puppy/.cache/pip/http/4/6
/home/puppy/.cache/pip/http/4/4/e/5/b
/home/puppy/.cache/pip/http/4/4/e/5
/home/puppy/.cache/pip/http/4/4/e
/home/puppy/.cache/pip/http/4/4
/home/puppy/.cache/pip/http/4
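
A rough sketch of the idea described above, not the PR's actual code. `removal_order` is a hypothetical helper and the paths are shortened stand-ins for the listing; it only shows why deduplicating and then reverse-sorting yields each directory once, with children ahead of their parents:

```python
def removal_order(collected):
    """Return each collected directory once, deepest paths first."""
    # A parent path is a strict prefix of its children, so it sorts before
    # them; reversing the sort puts every child ahead of its parent.
    return sorted(set(collected), reverse=True)


# A single-pass walker may report the same parent more than once.
collected = [
    "/cache/http/4/6/e",
    "/cache/http/4/6",
    "/cache/http/4",
    "/cache/http/4/4/e",
    "/cache/http/4/4",
    "/cache/http/4",
]

for directory in removal_order(collected):
    print(directory)
```

Removing directories in that order means a parent is only touched after everything inside it is gone, and never twice.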

@ichard26
Member

ichard26 commented Aug 7, 2025

@duckinator glad to see that you aren't (too?) frustrated that we left this PR out to dry so long it probably fossilized. Just as a FYI, there is (surprise!) still limited review capacity so it may take some time for us to review your PR. However, I do want to see your work be merged at some point, so when I come back from my OSS "break", I'll make sure to review your PR!

@duckinator
Contributor Author

@ichard26 honestly, I was worried I was annoying y'all because I kept disappearing for months or years at a time then coming back and doing a barrage of work only to disappear again. 😅

Looking forward to getting this wrapped up whenever you get to it. Enjoy your break. 🙂

Member

@ichard26 ichard26 left a comment


I finally got around to doing a review. The idea looks good, but I think the implementation can be simplified. Sorry for making you do more work :)

Also, where are the unit tests? The discussion history seems to suggest that they were written at some point. Did they just never get committed?

Thanks a lot for your persistence!

return format_size(directory_size(path))


def subdirs_without_files(path: str) -> Generator[Path]:
Member

@ichard26 ichard26 Aug 13, 2025


I hate to break it to you, but this seems to be buggy still. The root of the problem is that when it discards a subtree because a file is present, we need to discard the entire chain of parent directories. This logic seems to instead discard (skip) the inner subdirectories of the non-empty directory.

Imagine we have this structure:

root
├── d
│   ├── d
│   │   └── 3
│   │       └── uh.txt
│   └── f
│       └── d

When the code reaches root/d/f/d and discovers it's empty, it will return all of its parents, but we should only delete root/d/f/d and root/d/f. root and root/d still contain a file (at root/d/d/3/uh.txt).

I spent some time working on an alternative implementation. I haven't thoroughly tested this, but it seems to be more robust:

def subdirs_without_files(path: str) -> Generator[Path]:
    """Yields every subdirectory of +path+ that has no files under it."""

    directories = []
    non_empty = set()

    for root, _, filenames in os.walk(path, topdown=False):
        root_path = Path(root)
        if filenames:
            # This directory contains a file, mark it and its parent
            # directories (but not ".") as non empty.
            non_empty.update(root_path.parents[:-1])
            non_empty.add(root_path)

        directories.append(root_path)

    for d in directories:
        if d not in non_empty:
            yield d

The gist is that we walk the entire directory structure looking for directories which contain a file, all while building a list of all directories. If a file is found, then the containing directory and its parents are marked as non-empty. Afterwards, we go through the directories and yield any we haven't marked earlier. And since we're walking bottom-up, they will naturally be yielded in reverse order.

Contributor Author


I'm honestly not very surprised. Your approach makes sense. I'm not sure why I didn't use os.walk().

I'll need to add tests (see my other comment) and then I should be able to test if your approach fixes the issue.

Member


Looking at my code again, one thing you may need to fix is root_path.parents[:-1]. path.parents includes . as the final parent for relative paths. I was working solely with relative paths while working on this code. I'd imagine the actual pip code is passing absolute paths, in which case the last parent should not be skipped.

In practice, this probably won't affect anything since the final parent is not going to be under the pip cache directory, but if you see weird behaviour that's why.
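
To make the relative-versus-absolute difference concrete, here is a small check with `PurePosixPath` (the paths are invented for illustration). Slicing `parents` directly, as in `parents[:-1]`, needs Python 3.10 or newer, so this sketch slices a list copy instead:

```python
from pathlib import PurePosixPath

relative = PurePosixPath("cache/http/4")
absolute = PurePosixPath("/cache/http/4")

# For a relative path the last parent is ".", which [:-1] drops.
relative_parents = list(relative.parents)
print(relative_parents)
print(relative_parents[:-1])

# For an absolute path the last parent is the filesystem root instead,
# so the same slice drops "/" rather than ".".
absolute_parents = list(absolute.parents)
print(absolute_parents)
print(absolute_parents[:-1])
```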

I suppose a better design would be a mix of your approach and my approach where we manually walk the tree, using a stack to keep track of the parents we care about, and then add those to the exclusion list, but it probably doesn't make a difference?
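
One way to sidestep `parents` entirely, offered as an assumption rather than anything from the PR: walk bottom-up and let non-emptiness propagate from each directory to its parent. This is not the stack-based design suggested above, just a simpler variant. The function name is reused for readability, and the temporary tree mirrors the example structure from earlier in the review:

```python
import os
import tempfile
from pathlib import Path
from typing import Iterator


def subdirs_without_files(path: str) -> Iterator[Path]:
    """Yield directories at or under path that have no files below them."""
    has_files = set()
    # Bottom-up, so every subdirectory is classified before its parent.
    for root, dirnames, filenames in os.walk(path, topdown=False):
        if filenames or any(os.path.join(root, d) in has_files for d in dirnames):
            has_files.add(root)
        else:
            yield Path(root)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp, "root")
    (root / "d" / "d" / "3").mkdir(parents=True)
    (root / "d" / "d" / "3" / "uh.txt").write_text("x")
    (root / "d" / "f" / "d").mkdir(parents=True)

    empty = [
        p.relative_to(root).as_posix()
        for p in subdirs_without_files(str(root))
    ]
    print(empty)
```

Because a directory is only ever compared against its own children, nothing above the starting path gets inspected, so the relative-versus-absolute question doesn't come up. Symlinked directories are not followed by `os.walk` by default and would be treated as empty here, which a real implementation would want to think about.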

@ichard26 ichard26 added this to the 25.3 milestone Aug 13, 2025
@duckinator
Copy link
Contributor Author

duckinator commented Aug 13, 2025

Thanks for the review! I've responded to the things you mentioned (& accepted your commit suggestion).

I'll try to work on this over the next few days. Please feel free to ping me if you don't hear anything by August 17th — I don't want it to slip through the cracks again! 🙂

I finally got around to doing a review. The idea looks good, but I think the implementation can be simplified. Sorry for making you do more work :)

No worries! I could tell it was rough, and expected it to need a second pass.

Also, where are the unit tests? The discussion history seems to suggest that they were written at some point. Did they just never get committed?

I'm not sure what happened here. If I did write them, I suspect they got lost when I switched OSes at some point in the last 5 years.

Thanks a lot for your persistence!

Glad to help! I'm really hoping I can get it wrapped up before the half-decade mark rolls around in October. 😂

duckinator and others added 2 commits August 22, 2025 18:18
…files.

These commands now remove:
- wheel cache folders without `.whl` files.
- empty folders in the HTTP cache.
- `selfcheck.json`, which pip does not use anymore.
@duckinator
Contributor Author

Haven't forgotten about this, but it's taking me a bit longer than expected to get to it.

For now, I went ahead and rebased off main (d52011f) so it at least won't get stale.

@notatallshaw notatallshaw modified the milestones: 25.3, 26.0 Oct 14, 2025
@notatallshaw
Member

I'm moving this to 26.0 but ping me this week if you think it's ready for review.


Labels

bot:chronographer:provided C: cache Dealing with cache and files in it state: up for grabs (PR) Good idea, but needs a new champion as the PR author is busy or unreachable. type: enhancement Improvements to functionality


8 participants