Python toolkit for scraping Indeed job listings, preprocessing data, and generating visualizations for market analysis.

rayxiang03/Indeed-Job-Scraping

🔍 Indeed Job Scraping & Analysis Tool

🚀 Overview

This toolkit scrapes Indeed job listings, preprocesses the extracted text, and visualizes the results for job market analysis. It includes mechanisms for handling anti-scraping measures, applies natural language processing to job descriptions, and produces charts and word clouds that summarize market trends.

✨ Features


  • Advanced Scraping: Bypasses common anti-scraping protections
  • Data Cleaning: Automated text normalization & correction
  • NLP Integration: Transformer models for text analysis
  • Data Visualization: Multiple chart types & word clouds
  • Insight Generation: Extract actionable job market trends

💻 System Requirements

  • Python 3.8 or higher
  • 4GB+ RAM (8GB+ recommended for larger datasets)
  • Active internet connection for data scraping
  • IDE: Visual Studio Code or IntelliJ IDEA (recommended)

📦 Installation

# Clone the repository
git clone https://github.com/rayxiang03/Indeed-Job-Scraping.git
cd Indeed-Job-Scraping

# Create and activate virtual environment (optional but recommended)
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install required packages
pip install beautifulsoup4 cloudscraper requests pandas matplotlib fake-useragent nltk \
            torch sentencepiece transformers pyspellchecker wordcloud numpy

📁 Project Files

├── code_WebScraping.py      # Web scraping and data preprocessing script
├── code2_Analysis.py        # Data visualization and analysis script
└── indeed_job.csv           # Generated dataset (after running code_WebScraping.py)

🔧 Usage Guide

Web Scraping Module (code_WebScraping.py)

This script handles data extraction from targeted websites, text preprocessing, and dataset creation:

  1. Open the script in your preferred IDE (VS Code or IntelliJ IDEA)
  2. Verify your internet connection
  3. Execute the script:
python code_WebScraping.py

The script will:

  • Connect to specified job websites
  • Extract job listings data
  • Clean and preprocess text content
  • Create a structured DataFrame
  • Export the data to indeed_job.csv
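
The cleaning and export steps listed above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the code in code_WebScraping.py, which builds a DataFrame before exporting. The column names (`title`, `company`, `description`) and the helper names `clean_text` and `export_rows` are assumptions made for this example.

```python
import csv
import html
import re


def clean_text(raw: str) -> str:
    """Unescape HTML entities, drop tags, and collapse whitespace."""
    text = html.unescape(raw)
    text = re.sub(r"<[^>]+>", " ", text)  # strip leftover HTML tags
    text = re.sub(r"\s+", " ", text)      # collapse runs of whitespace
    return text.strip()


def export_rows(rows, path="indeed_job.csv"):
    """Write cleaned job records to a CSV file (hypothetical column names)."""
    fieldnames = ["title", "company", "description"]
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        for row in rows:
            # Missing fields are written as empty strings.
            writer.writerow({name: clean_text(row.get(name, "")) for name in fieldnames})
```

Each record passed to `export_rows` is a plain dictionary, so the same cleaning function can be reused if the records are later loaded into a DataFrame instead.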

Visualization Module (code2_Analysis.py)

This script loads the previously scraped data and generates various visualizations:

  1. Ensure indeed_job.csv is in the same directory
  2. Run the script:
python code2_Analysis.py
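
Before running the analysis, it can help to confirm that the dataset is present and readable. The check below is a small sketch, not part of code2_Analysis.py; the function name `check_dataset` is hypothetical, and it only verifies that the file exists and has a header row.

```python
import csv
from pathlib import Path


def check_dataset(path="indeed_job.csv"):
    """Return the CSV header, or raise a descriptive error if the file is unusable."""
    csv_path = Path(path)
    if not csv_path.is_file():
        raise FileNotFoundError(f"{csv_path} not found; run code_WebScraping.py first")
    with csv_path.open(newline="", encoding="utf-8") as handle:
        header = next(csv.reader(handle), None)  # None when the file is empty
    if not header:
        raise ValueError(f"{csv_path} is empty or has no header row")
    return header
```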

The script will generate visualizations for:

  • Job category distributions
  • Geographic job distribution
  • Salary range analysis
  • Keyword frequency analysis
  • Word clouds of most common terms
  • Other insightful data visualizations
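
As an illustration of the keyword frequency step, the sketch below counts the most common words in one text column of the dataset. It is an assumption-laden example rather than the logic in code2_Analysis.py: the column name `description`, the function name `keyword_counts`, and the short stop-word list are all hypothetical.

```python
import csv
import re
from collections import Counter

# Small illustrative stop-word list; a real analysis would likely use a fuller one.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "or", "the", "to", "with"}


def keyword_counts(path="indeed_job.csv", column="description", top_n=10):
    """Count the most frequent words in one text column of the dataset."""
    counts = Counter()
    with open(path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            # Lowercase the text and keep tokens such as "python", "c++", "c#".
            words = re.findall(r"[a-z][a-z+#]*", (row.get(column) or "").lower())
            counts.update(word for word in words if word not in STOP_WORDS)
    return counts.most_common(top_n)
```

The returned list of `(word, count)` pairs can be passed to a bar chart or word cloud in a later step.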

📊 Visualization Examples

[Figures: Job Categories Chart, Location Distribution, Salary Distribution, Skills Word Cloud]

⚠️ Troubleshooting

Common Issues and Solutions

Issue | Solution
--- | ---
403 Forbidden Errors | Use a VPN to change your IP address; connect to a mobile hotspot; switch to a different WiFi network; increase the delay between requests
Missing Dependencies | Install all required packages using the pip command in the Installation section
Memory Errors | Reduce the batch size in data processing or use a machine with more RAM
Visualization Errors | Ensure the matplotlib backend is properly configured for your environment
CSV Loading Errors | Verify that indeed_job.csv exists and is properly formatted
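
The suggestion to increase the delay between requests after a 403 response could be implemented along these lines. This is a hedged sketch, not the project's own retry logic: `fetch_with_backoff` and its parameters are hypothetical, and `fetch` stands for any callable that returns an object with a `status_code` attribute, such as the `get` method of a requests or cloudscraper session.

```python
import time


def fetch_with_backoff(fetch, url, max_attempts=4, base_delay=2.0, sleep=time.sleep):
    """Call fetch(url) until it returns a non-403 status, doubling the delay each retry."""
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        response = fetch(url)
        if response.status_code != 403:
            return response
        if attempt < max_attempts:
            sleep(delay)  # wait before the next attempt
            delay *= 2    # exponential backoff
    return response       # still 403 after all attempts; the caller decides what to do
```

Passing `sleep` as a parameter keeps the function easy to test without real waiting.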

Advanced IP Rotation Techniques

For persistent scraping issues, consider implementing:

  • Proxy rotation services
  • Tor network integration
  • Cloud-based scraping with IP rotation
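
As a rough sketch of the proxy rotation idea, the generator below cycles through a list of proxies in round-robin order and yields mappings in the format accepted by the `proxies` argument of requests. The proxy addresses are placeholders and the function name `proxy_rotation` is hypothetical; none of this is taken from the project's scripts.

```python
from itertools import cycle

# Placeholder proxy addresses; replace with proxies you are authorized to use.
PROXY_URLS = [
    "http://proxy1.example.invalid:8080",
    "http://proxy2.example.invalid:8080",
]


def proxy_rotation(proxy_urls):
    """Yield requests-style proxy mappings in round-robin order."""
    for proxy_url in cycle(proxy_urls):
        yield {"http": proxy_url, "https": proxy_url}
```

A caller would create the generator once and pass `next(proxies)` as the `proxies` argument of each request.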

🤝 Contributing

Contributions are welcome! Here's how you can help:

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add some amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

Please ensure your code follows the project's coding style and includes appropriate tests.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.


Built with ❤️ by rayxiang03
