BorisQuanLi

This PR addresses part of Issue #5 (Add Support for Contributor Content).

A new documentation file, docs/contributor-datasets.md, provides a step-by-step guide for discovering and downloading datasets from the /contrib/ and /projects/ directories in the Common Crawl public data bucket.

  • The new file is placed in the docs/ folder, separate from the main README.md, to avoid disrupting the existing documentation structure.
  • The guide includes AWS CLI commands for listing and downloading dataset files, and a Python script for downloading, decompressing, and inspecting .paths.gz files.
  • These examples are intended to help users discover available datasets and preview their contents before using cc-downloader.
  • Notes on current limitations are provided.
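For reviewers who want a feel for the download, decompress, and inspect step mentioned above, here is a minimal sketch. It is not the script from docs/contributor-datasets.md. It assumes the bucket is reachable over HTTPS at https://data.commoncrawl.org/ (not stated in this PR), and the names `fetch` and `preview_paths` and the example key are hypothetical.

```python
import gzip
import io
import urllib.request

# Assumption: public HTTPS endpoint fronting the Common Crawl data bucket.
BASE_URL = "https://data.commoncrawl.org/"


def fetch(key):
    """Download one object from the bucket and return its raw bytes."""
    with urllib.request.urlopen(BASE_URL + key) as resp:
        return resp.read()


def preview_paths(gz_bytes, limit=5):
    """Decompress a .paths.gz payload and return its first `limit` non-empty lines."""
    lines = []
    with gzip.open(io.BytesIO(gz_bytes), "rt", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                lines.append(line)
            if len(lines) >= limit:
                break
    return lines


# Usage (hypothetical key; substitute a real .paths.gz key found by listing the bucket):
# for path in preview_paths(fetch("contrib/<dataset>/<name>.paths.gz")):
#     print(path)
```

Keeping the decompression step separate from the network call means the preview logic can be checked against local bytes without touching the bucket.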

Note:
Whether this documentation should remain as a standalone file or be merged into the main README.md or another existing documentation file is open for discussion.

Breaking Changes

None.

@BorisQuanLi
Author

Hi @commoncrawl team, I just wanted to check if there's any feedback or updates on this PR. Let me know if there are any changes you'd like me to make!

@pjox self-assigned this on Sep 23, 2025
@pjox added the "documentation" label (Improvements or additions to documentation) on Sep 23, 2025
@pjox self-requested a review on September 28, 2025, 20:54
