Skip to content

amolsr/web-scrapping

Repository files navigation

Web Scraping Collection 🕷️

A comprehensive collection of web scraping scripts for extracting data from popular websites. This project demonstrates various web scraping techniques using Python and provides ready-to-use scripts for data extraction.

🌟 Features

  • Multiple Website Support: Scrape data from 10+ popular websites
  • CSV Output: All scrapers export data in CSV format for easy analysis
  • Easy to Use: Simple Python scripts with clear documentation
  • Educational: Perfect for learning web scraping techniques
  • Open Source: Contribute and improve the collection

📋 Available Scrapers

Scraper Description Output
Flipkart (1. flipkart.py) Extract Nokia smartphone data (name, rating, price, description) flipkart.csv
YouTube (2. youtube.py) Scrape YouTube video information youtube.csv
YouTube Links (3. youtube_links.py) Extract YouTube video links youtube_links.csv
IMDB (4. imdb.py) Get top-rated movies with rankings, ratings, and director info imdb.csv
Amazon (5. Amazon.py) Extract Amazon product data Amazon.csv
GitHub (6. Github.py) Scrape GitHub repository information github.csv
Udemy (7. Udemy.py) Extract Udemy course data udemy.csv
College Notices (8. college_notice_scrapper.py) Scrape college notice board notice.csv
Sanfoundry (9. Sanfoundry.py) Extract educational content sanfoundry.csv
Hacker News (10. HackNews.py) Scrape GitHub-related posts from Hacker News hacknews.csv
Weather (Weather.py) Extract weather information weather.csv

🚀 Quick Start

Prerequisites

pip install requests beautifulsoup4 lxml

Installation

  1. Clone the repository

    git clone https://github.com/amolsr/web-scrapping.git
    cd web-scrapping
  2. Run any scraper

    python "1. flipkart.py"
  3. Check the output

    ls output/

📊 Sample Output

IMDB Top Movies

Rank,Name,Year,Rating,Link,Director
1,The Shawshank Redemption,1994,9.2,https://www.imdb.com/title/tt0111161/,Frank Darabont
2,The Godfather,1972,9.2,https://www.imdb.com/title/tt0068646/,Francis Ford Coppola

Flipkart Smartphones

Mobile Name,Ratings,Pricing,Description
Nokia 8.1,4.3,₹15,999,6GB RAM | 128GB Storage
Nokia 6.1 Plus,4.2,₹12,999,4GB RAM | 64GB Storage

🛠️ Usage Examples

Basic Usage

# Run a specific scraper
python "4. imdb.py"

# The script will automatically:
# 1. Fetch data from the website
# 2. Parse the HTML content
# 3. Extract relevant information
# 4. Save to CSV file in the output/ directory

Customization

Each script can be easily modified to:

  • Change the target URL
  • Extract different data fields
  • Modify the output format
  • Add error handling

📁 Project Structure

web-scrapping/
├── 1. flipkart.py          # Flipkart smartphone scraper
├── 2. youtube.py           # YouTube video scraper
├── 3. youtube_links.py     # YouTube links extractor
├── 4. imdb.py              # IMDB top movies scraper
├── 5. Amazon.py            # Amazon product scraper
├── 6. Github.py            # GitHub repository scraper
├── 7. Udemy.py             # Udemy course scraper
├── 8. college_notice_scrapper.py  # College notices scraper
├── 9. Sanfoundry.py        # Sanfoundry educational content
├── 10. HackNews.py         # Hacker News GitHub posts
├── Weather.py              # Weather information scraper
├── output/                 # Generated CSV files
│   ├── flipkart.csv
│   ├── imdb.csv
│   ├── github.csv
│   └── ...
└── README.md               # This file

🔧 Dependencies

  • requests: HTTP library for making web requests
  • beautifulsoup4: HTML/XML parsing library
  • lxml: XML and HTML processing library
  • csv: Built-in CSV module for data export

🤝 Contributing

We welcome contributions! Here's how you can help:

  1. Fork the repository
  2. Create a new scraper or improve existing ones
  3. Add proper documentation and comments
  4. Test your changes
  5. Submit a pull request

Contribution Ideas

  • Add new website scrapers
  • Improve error handling
  • Add data validation
  • Create web interface
  • Add support for different output formats (JSON, XML)
  • Implement rate limiting and respect robots.txt

⚠️ Important Notes

  • Respect robots.txt: Always check the website's robots.txt file
  • Rate Limiting: Add delays between requests to be respectful
  • Terms of Service: Ensure you comply with each website's terms
  • Data Usage: Use scraped data responsibly and ethically

📝 License

This project is open source and available under the MIT License.

🙏 Acknowledgments

  • Beautiful Soup for HTML parsing
  • Requests library for HTTP handling
  • All contributors who help improve this collection

📞 Support

If you have questions or need help:

  • Open an issue on GitHub
  • Check the code comments for implementation details
  • Review the output files for expected data format

Happy Scraping! 🕷️✨

Stargazers over time

About

Web Scrapping Examples using Beautiful Soup in Python.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 8

Languages