- Overview
- Features
- System Requirements
- Installation
- Project Structure
- Usage Guide
- Visualization Examples
- Troubleshooting
- Contributing
- License
This toolkit covers web data extraction, text preprocessing, and visualization. It is designed specifically for analyzing job market data, with built-in mechanisms to handle anti-scraping measures, apply natural language processing to job descriptions, and summarize the results in visualizations.
- Python 3.8 or higher
- 4GB+ RAM (8GB+ recommended for larger datasets)
- Active internet connection for data scraping
- IDE: Visual Studio Code or IntelliJ IDEA (recommended)
# Clone the repository
git clone https://github.com/rayxiang03/Indeed-Job-Scraping.git
cd Indeed-Job-Scraping
# Create and activate virtual environment (optional but recommended)
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install required packages
pip install beautifulsoup4 cloudscraper requests pandas matplotlib fake-useragent nltk \
torch sentencepiece transformers pyspellchecker wordcloud numpy
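
To confirm that the install succeeded, a short check along these lines can report which packages are still missing. It is a minimal sketch and not part of the repository: the function name `missing_modules` is hypothetical, and the list maps each pip package above to the module name it is normally imported under (for example, beautifulsoup4 is imported as `bs4` and pyspellchecker as `spellchecker`).

```python
from importlib.util import find_spec

# Import names for the packages installed above (a few differ from the pip name).
REQUIRED_MODULES = [
    "bs4", "cloudscraper", "requests", "pandas", "matplotlib",
    "fake_useragent", "nltk", "torch", "sentencepiece", "transformers",
    "spellchecker", "wordcloud", "numpy",
]


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are importable.")
```

Running it inside the activated virtual environment checks that environment rather than the system Python.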
├── code_WebScraping.py # Web scraping and data preprocessing script
├── code2_Analysis.py # Data visualization and analysis script
└── indeed_job.csv # Generated dataset (after running code_WebScraping.py)
code_WebScraping.py handles data extraction from targeted websites, text preprocessing, and dataset creation:
- Open the script in your preferred IDE (VS Code or IntelliJ IDEA)
- Verify your internet connection
- Execute the script:
python code_WebScraping.py
The script will:
- Connect to specified job websites
- Extract job listings data
- Clean and preprocess text content
- Create a structured DataFrame
- Export the data to indeed_job.csv
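
The cleaning and export steps above can be sketched roughly as follows. This is an illustration only, not the code in code_WebScraping.py: it uses the standard library instead of pandas so that it runs without extra installs, and the names `clean_text` and `jobs_to_csv`, as well as the `title` and `description` columns, are assumptions made for the example.

```python
import csv
import html
import io
import re


def clean_text(raw):
    """Strip HTML tags, decode entities, collapse whitespace, and lowercase."""
    no_tags = re.sub(r"<[^>]+>", " ", raw)
    decoded = html.unescape(no_tags)
    return re.sub(r"\s+", " ", decoded).strip().lower()


def jobs_to_csv(jobs, columns=("title", "description")):
    """Render a list of job dicts as CSV text, cleaning every field first."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(columns))
    writer.writeheader()
    for job in jobs:
        writer.writerow({col: clean_text(job.get(col, "")) for col in columns})
    return buffer.getvalue()


if __name__ == "__main__":
    sample = [{"title": "<b>Data&nbsp;Engineer</b>",
               "description": "Builds   ETL pipelines."}]
    print(jobs_to_csv(sample))
```

The returned text can be written to a file opened with `newline=""`; with pandas, the equivalent last step would be `DataFrame.to_csv`.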
code2_Analysis.py loads the previously scraped data and generates various visualizations:
- Ensure indeed_job.csv is in the same directory
- Run the script:
python code2_Analysis.py
The script will generate visualizations for:
- Job category distributions
- Geographic job distribution
- Salary range analysis
- Keyword frequency analysis
- Word clouds of most common terms
- Other insightful data visualizations
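
As an example of what the keyword frequency step involves, the sketch below counts the most common terms across a list of job descriptions. It is not taken from code2_Analysis.py: `top_keywords` and the tiny `STOP_WORDS` set are hypothetical, and a fuller analysis would likely use a complete stop-word list such as the one shipped with nltk.

```python
import re
from collections import Counter

# Deliberately small stop-word list, for illustration only.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "the", "to", "with"}


def top_keywords(descriptions, n=10):
    """Return the n most frequent non-stop-word tokens as (word, count) pairs."""
    counts = Counter()
    for text in descriptions:
        tokens = re.findall(r"[a-z]+", text.lower())
        counts.update(token for token in tokens if token not in STOP_WORDS)
    return counts.most_common(n)


if __name__ == "__main__":
    sample = ["Python and SQL required", "Python developer with SQL", "python"]
    print(top_keywords(sample, n=2))
```

The same (word, count) pairs could feed either a bar chart or a word cloud.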
Issue | Solution |
---|---|
403 Forbidden Errors | Use a VPN to change your IP address; connect to a mobile hotspot; switch to a different Wi-Fi network; increase the delay between requests |
Missing Dependencies | Install all required packages using the pip command in the installation section |
Memory Errors | Reduce batch size in data processing or use a machine with more RAM |
Visualization Errors | Ensure the matplotlib backend suits your environment (for example, the non-interactive Agg backend on a machine without a display) |
CSV Loading Errors | Verify indeed_job.csv exists and has proper formatting |
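
For the CSV loading row in the table above, a quick pre-flight check along these lines can show whether the file exists, has the expected header, and contains data. It is a sketch under assumptions: `check_csv` is a hypothetical helper, and the column names passed to it are placeholders, since the actual columns of indeed_job.csv are not listed here.

```python
import csv
import os


def check_csv(path, required_columns):
    """Return a list of problems found; an empty list means the file looks usable."""
    if not os.path.isfile(path):
        return [f"{path} does not exist"]
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        header = reader.fieldnames or []
        problems = [f"missing column: {col}"
                    for col in required_columns if col not in header]
        # An absent first record means the file has a header at most.
        if next(reader, None) is None:
            problems.append("file contains no data rows")
    return problems


if __name__ == "__main__":
    # "title" is a placeholder column name, not confirmed by the project.
    print(check_csv("indeed_job.csv", ["title"]))
```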
For persistent scraping issues, consider implementing:
- Proxy rotation services
- Tor network integration
- Cloud-based scraping with IP rotation
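
Before moving to proxies, the simplest item from the table, increasing the delay between requests, can be sketched as a retry loop with exponential backoff. This is an assumption-laden example rather than the project's own logic: `fetch_with_backoff` and its parameters are hypothetical, and `fetch` stands for any callable that returns an object with a `status_code` attribute, such as `requests.get` or the `get` method of a cloudscraper session.

```python
import random
import time


def fetch_with_backoff(fetch, url, max_attempts=4, base_delay=2.0, sleep=time.sleep):
    """Call fetch(url) until the status is not 403 or the attempts run out.

    Returns the last response received, whatever its status.
    """
    response = None
    for attempt in range(max_attempts):
        response = fetch(url)
        if response.status_code != 403:
            return response
        if attempt < max_attempts - 1:
            # Wait base_delay * 2^attempt seconds plus up to 1 s of random jitter.
            sleep(base_delay * (2 ** attempt) + random.uniform(0, 1))
    return response
```

With the defaults, the waits are roughly 2, 4, and 8 seconds. The `sleep` parameter exists so the delay can be replaced in tests; in normal use it can be left at its default.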
Contributions are welcome! Here's how you can help:
- Fork the repository
- Create your feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add some amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
Please ensure your code follows the project's coding style and includes appropriate tests.
This project is licensed under the MIT License - see the LICENSE file for details.
Built with ❤️ by rayxiang03