A comprehensive collection of web scraping scripts for extracting data from popular websites. This project demonstrates various web scraping techniques using Python and provides ready-to-use scripts for data extraction.
- Multiple Website Support: Scrape data from 10+ popular websites
- CSV Output: All scrapers export data in CSV format for easy analysis
- Easy to Use: Simple Python scripts with clear documentation
- Educational: Perfect for learning web scraping techniques
- Open Source: Contribute and improve the collection
| Scraper | Description | Output |
|---|---|---|
| Flipkart (`1. flipkart.py`) | Extract Nokia smartphone data (name, rating, price, description) | `flipkart.csv` |
| YouTube (`2. youtube.py`) | Scrape YouTube video information | `youtube.csv` |
| YouTube Links (`3. youtube_links.py`) | Extract YouTube video links | `youtube_links.csv` |
| IMDB (`4. imdb.py`) | Get top-rated movies with rankings, ratings, and director info | `imdb.csv` |
| Amazon (`5. Amazon.py`) | Extract Amazon product data | `Amazon.csv` |
| GitHub (`6. Github.py`) | Scrape GitHub repository information | `github.csv` |
| Udemy (`7. Udemy.py`) | Extract Udemy course data | `udemy.csv` |
| College Notices (`8. college_notice_scrapper.py`) | Scrape a college notice board | `notice.csv` |
| Sanfoundry (`9. Sanfoundry.py`) | Extract educational content | `sanfoundry.csv` |
| Hacker News (`10. HackNews.py`) | Scrape GitHub-related posts from Hacker News | `hacknews.csv` |
| Weather (`Weather.py`) | Extract weather information | `weather.csv` |
Install the dependencies:

```bash
pip install requests beautifulsoup4 lxml
```

Then:

1. Clone the repository

   ```bash
   git clone https://github.com/amolsr/web-scrapping.git
   cd web-scrapping
   ```

2. Run any scraper

   ```bash
   python "1. flipkart.py"
   ```

3. Check the output

   ```bash
   ls output/
   ```
Sample `imdb.csv` output:

```csv
Rank,Name,Year,Rating,Link,Director
1,The Shawshank Redemption,1994,9.2,https://www.imdb.com/title/tt0111161/,Frank Darabont
2,The Godfather,1972,9.2,https://www.imdb.com/title/tt0068646/,Francis Ford Coppola
```

Sample `flipkart.csv` output (prices containing commas are quoted so the CSV stays valid):

```csv
Mobile Name,Ratings,Pricing,Description
Nokia 8.1,4.3,"₹15,999",6GB RAM | 128GB Storage
Nokia 6.1 Plus,4.2,"₹12,999",4GB RAM | 64GB Storage
```
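Because everything is plain CSV, the output can be loaded directly with Python's built-in `csv` module for analysis. A minimal sketch, assuming the `flipkart.csv` columns shown above:

```python
import csv

# Read the Flipkart scraper output; column names match the sample header above.
with open("output/flipkart.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        print(f"{row['Mobile Name']}: {row['Pricing']} (rated {row['Ratings']})")
```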
```bash
# Run a specific scraper
python "4. imdb.py"

# The script will automatically:
# 1. Fetch data from the website
# 2. Parse the HTML content
# 3. Extract relevant information
# 4. Save to a CSV file in the output/ directory
```
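Here is a minimal sketch of that fetch, parse, extract, save pipeline. The URL, CSS selectors, and output filename are illustrative placeholders, not taken from any particular script in this collection:

```python
import csv
import requests
from bs4 import BeautifulSoup

URL = "https://example.com/products"  # placeholder: change to the target site

# 1. Fetch data from the website
response = requests.get(URL, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
response.raise_for_status()

# 2. Parse the HTML content
soup = BeautifulSoup(response.text, "lxml")

# 3. Extract relevant information (hypothetical selectors)
rows = []
for item in soup.select("div.product"):
    name = item.select_one("h2")
    price = item.select_one("span.price")
    if name and price:
        rows.append([name.get_text(strip=True), price.get_text(strip=True)])

# 4. Save to a CSV file in the output/ directory
with open("output/example.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["Name", "Price"])
    writer.writerows(rows)
```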
Each script follows this same pattern and can be easily modified, using the sketch above as a template, to:
- Change the target URL
- Extract different data fields
- Modify the output format
- Add error handling
```
web-scrapping/
├── 1. flipkart.py                 # Flipkart smartphone scraper
├── 2. youtube.py                  # YouTube video scraper
├── 3. youtube_links.py            # YouTube links extractor
├── 4. imdb.py                     # IMDB top movies scraper
├── 5. Amazon.py                   # Amazon product scraper
├── 6. Github.py                   # GitHub repository scraper
├── 7. Udemy.py                    # Udemy course scraper
├── 8. college_notice_scrapper.py  # College notices scraper
├── 9. Sanfoundry.py               # Sanfoundry educational content
├── 10. HackNews.py                # Hacker News GitHub posts
├── Weather.py                     # Weather information scraper
├── output/                        # Generated CSV files
│   ├── flipkart.csv
│   ├── imdb.csv
│   ├── github.csv
│   └── ...
└── README.md                      # This file
```
- requests: HTTP library for making web requests
- beautifulsoup4: HTML/XML parsing library
- lxml: XML and HTML processing library
- csv: Built-in CSV module for data export
We welcome contributions! Here's how you can help:
1. Fork the repository
2. Create a new scraper or improve existing ones
3. Add proper documentation and comments
4. Test your changes
5. Submit a pull request
Ideas for contributions:
- Add new website scrapers
- Improve error handling
- Add data validation
- Create a web interface
- Add support for different output formats (JSON, XML); a conversion sketch follows this list
- Implement rate limiting and respect robots.txt
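For the JSON idea, one low-effort approach is to convert a scraper's existing CSV output after the fact. A minimal sketch using only the standard library, with `imdb.csv` as the example input:

```python
import csv
import json

# Convert a scraper's CSV output to JSON without touching the scraper itself.
with open("output/imdb.csv", newline="", encoding="utf-8") as f:
    records = list(csv.DictReader(f))

with open("output/imdb.json", "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)
```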
Keep these legal and ethical considerations in mind:
- Respect robots.txt: Always check the website's robots.txt file
- Rate Limiting: Add delays between requests to be respectful (see the sketch after this list)
- Terms of Service: Ensure you comply with each website's terms
- Data Usage: Use scraped data responsibly and ethically
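A minimal sketch of what a robots.txt check plus rate limiting can look like, using the standard library's `urllib.robotparser`; the base URL, paths, and delay are illustrative:

```python
import time
from urllib.robotparser import RobotFileParser

import requests

BASE = "https://example.com"  # illustrative target
DELAY_SECONDS = 2             # illustrative pause between requests

# Fetch and parse the site's robots.txt before scraping anything.
robots = RobotFileParser(BASE + "/robots.txt")
robots.read()

for path in ["/page/1", "/page/2"]:
    url = BASE + path
    if not robots.can_fetch("*", url):
        print(f"Skipping {url}: disallowed by robots.txt")
        continue
    response = requests.get(url, timeout=10)
    print(url, response.status_code)
    time.sleep(DELAY_SECONDS)  # rate limiting: wait between requests
```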
This project is open source and available under the MIT License.
Thanks to:
- Beautiful Soup for HTML parsing
- Requests library for HTTP handling
- All contributors who help improve this collection
If you have questions or need help:
- Open an issue on GitHub
- Check the code comments for implementation details
- Review the output files for expected data format
Happy Scraping! 🕷️✨