A Python package to crawl solved problems from Baekjoon Online Judge (BOJ) for one or multiple users.
- Crawls the BOJ status page for a given user
- Extracts information about solved problems, including:
  - Submission ID
  - Problem ID
  - Problem title (from the `title` attribute)
  - Programming language
  - Submission time (from the `title` attribute)
- Saves the results to a JSON file in a user-specific folder
- Includes rate limiting (2 seconds between requests) to avoid overloading the server
- Optional date filtering to get solutions from a specific month or date range
- Batch crawling support for multiple users
- Proxy server support for HTTP, HTTPS, and SOCKS proxies
- Retry logic for handling 403 Forbidden errors (see the sketch after this list)
- Monthly reporting and statistics generation
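For context, here is a minimal sketch of the throttle-and-retry pattern described above, written as a hypothetical standalone helper with `requests`. It is not the package's actual internals, and the retry count and delays are illustrative:

```python
import time
import requests

def fetch_with_retry(url: str, retries: int = 3, delay: float = 2.0,
                     backoff: float = 5.0) -> requests.Response:
    """Hypothetical helper: GET a page politely, retrying on 403."""
    session = requests.Session()
    for attempt in range(1, retries + 1):
        response = session.get(url, timeout=10)
        if response.status_code == 403:
            # BOJ may answer 403 when requests arrive too quickly;
            # back off progressively before trying again.
            time.sleep(backoff * attempt)
            continue
        response.raise_for_status()
        time.sleep(delay)  # 2-second pause between successive requests
        return response
    raise RuntimeError(f"Gave up after {retries} attempts: {url}")
```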
Install from PyPI:

```bash
pip install boj-crawler
```

Or install from source:
- Clone the repository:

  ```bash
  git clone https://github.com/yourusername/boj-crawler.git
  cd boj-crawler
  ```

- Install the package:

  ```bash
  pip install -e .
  ```
If you need SOCKS proxy support, install the additional dependency (quoted so the brackets survive shells like zsh):

```bash
pip install "requests[socks]"
```
You can use the crawler in your Python code:

```python
from boj_crawler import BOJCrawler

# Create a crawler instance
crawler = BOJCrawler("username")

# Get solved problems
problems = crawler.get_solved_problems()

# Save to JSON
crawler.save_to_json(problems)
```
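Continuing from the snippet above, and assuming `get_solved_problems()` returns the same fields that end up in the JSON file (see the output structure below), you can inspect results directly:

```python
for p in problems:
    print(p["problem_id"], p["problem_title"], p["submission_time"])
```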
To route the crawler's requests through a proxy:

```python
from boj_crawler import BOJCrawler

# Using a single proxy for both HTTP and HTTPS
proxies = {
    'http': 'http://proxy.example.com:8080',
    'https': 'http://proxy.example.com:8080'
}
crawler = BOJCrawler("username", proxies=proxies)
problems = crawler.get_solved_problems()
crawler.save_to_json(problems)

# Using different proxies for HTTP and HTTPS
proxies = {
    'http': 'http://proxy.example.com:8080',
    'https': 'https://secure-proxy.example.com:8443'
}
crawler = BOJCrawler("username", proxies=proxies)

# With date filtering and a proxy
crawler = BOJCrawler("username", start_date="240101", end_date="240331", proxies=proxies)
```
The package also installs a `boj-crawler` command-line tool:

```bash
# Get all solved problems
boj-crawler -u username

# Get problems solved in January 2024
boj-crawler -u username -m 202401

# Get problems in a date range
boj-crawler -u username -s 240315 -e 240415

# Using proxy servers
boj-crawler -u username --proxy-all http://proxy.example.com:8080
boj-crawler -u username --proxy-http http://proxy.example.com:8080 --proxy-https https://secure-proxy.example.com:8443

# Combine a proxy with date filtering
boj-crawler -u username --proxy-all http://proxy.example.com:8080 -s 240101 -e 240331
```
- Create a text file with usernames (one per line), for example `usernames.txt`:

  ```text
  user1
  user2
  user3
  ```
- Run the batch crawler:

  ```bash
  # Get all solved problems for all users
  boj-batch-crawler -f usernames.txt

  # Get problems solved in January 2024 for all users
  boj-batch-crawler -f usernames.txt -m 202401

  # Get problems in a date range for all users
  boj-batch-crawler -f usernames.txt -s 240315 -e 240415

  # Using a proxy with batch crawling
  boj-batch-crawler -f usernames.txt --proxy-all http://proxy.example.com:8080

  # Combine a proxy with filtering and disable reports
  boj-batch-crawler -f usernames.txt --proxy-all http://proxy.example.com:8080 -m 202401 --no-report
  ```
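If you prefer to drive batch crawling from Python rather than the CLI, here is a minimal sketch using only the documented `BOJCrawler` API (the 5-second pause mirrors the batch crawler's per-user delay):

```python
import time
from boj_crawler import BOJCrawler

with open("usernames.txt", encoding="utf-8") as f:
    usernames = [line.strip() for line in f if line.strip()]

for username in usernames:
    crawler = BOJCrawler(username)
    crawler.save_to_json(crawler.get_solved_problems())
    time.sleep(5)  # be respectful to the server between users
```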
Supported proxy URL formats:

- HTTP proxies: `http://proxy.example.com:8080`
- HTTPS proxies: `https://proxy.example.com:8080`
- SOCKS proxies: `socks5://proxy.example.com:1080` (requires `requests[socks]`)
- Authenticated proxies: `http://username:[email protected]:8080`
Proxy-related command-line options:

- `--proxy-all URL`: Use the same proxy for both HTTP and HTTPS
- `--proxy-http URL`: Use a proxy only for HTTP requests
- `--proxy-https URL`: Use a proxy only for HTTPS requests
```bash
# Local debugging proxy (Charles, Fiddler, etc.)
boj-crawler -u username --proxy-all http://localhost:8888

# Corporate proxy with authentication
boj-crawler -u username --proxy-all http://user:[email protected]:8080

# SOCKS5 proxy
boj-crawler -u username --proxy-all socks5://proxy.example.com:1080

# Different proxies for HTTP and HTTPS
boj-crawler -u username --proxy-http http://proxy1.example.com:8080 --proxy-https https://proxy2.example.com:8443
```
The script creates a folder named after each user and generates a JSON file (`solved_problems.json`) inside that folder. The JSON file contains an array of solved problems with the following structure:
```json
[
  {
    "submission_id": "...",
    "problem_id": "...",
    "problem_title": "...",
    "language": "...",
    "submission_time": "..."
  },
  ...
]
```
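Since the output is plain JSON, it is straightforward to load for further analysis. The folder name below assumes the crawled user is `username`:

```python
import json
from pathlib import Path

path = Path("username") / "solved_problems.json"
problems = json.loads(path.read_text(encoding="utf-8"))

print(f"{len(problems)} solved problems")
for p in problems[:5]:
    print(p["problem_id"], p["problem_title"])
```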
- The script uses a 2-second delay between requests to be respectful to the BOJ servers
- Problem titles and submission times are extracted from the title attributes of the respective elements
- The script handles pagination automatically to collect all solved problems
- Each user's data is stored in a separate folder to keep the data organized
- When using date filtering (see the parsing sketch after this list):
  - `-m/--month`: the month must be in `YYYYMM` format (e.g., `202401` for January 2024)
  - `-s/--start-date` and `-e/--end-date`: dates must be in `YYMMDD` format (e.g., `240315` for March 15, 2024)
- The batch crawler adds a 5-second delay between users to be respectful to the server
- Proxy settings are applied to all HTTP requests made by the crawler
- The crawler includes retry logic for 403 Forbidden errors with configurable delays
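The two date formats map directly onto `datetime.strptime` patterns; a minimal validation sketch (the helper names are hypothetical, not part of the package):

```python
from datetime import datetime

def parse_month(value: str) -> datetime:
    """Validate a -m/--month value; invalid input raises ValueError."""
    return datetime.strptime(value, "%Y%m")

def parse_short_date(value: str) -> datetime:
    """Validate a -s/-e value; invalid input raises ValueError."""
    return datetime.strptime(value, "%y%m%d")

parse_month("202401")       # -> datetime(2024, 1, 1)
parse_short_date("240315")  # -> datetime(2024, 3, 15)
```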
Project structure:

```text
boj-crawler/
├── boj_crawler/
│   ├── __init__.py
│   ├── crawler.py
│   ├── cli.py
│   └── batch.py
├── setup.py
├── README.md
└── requirements.txt
```
Run the tests with:

```bash
python -m pytest tests/
```
This project is licensed under the MIT License - see the LICENSE file for details.