This project showcases a comprehensive Exploratory Data Analysis (EDA) performed on the Google Play Store dataset. The analysis delves into key metrics such as app categories, reviews, ratings, installs, and pricing structures to uncover insights into app trends and user behaviors.
- Dataset Overview
- Project Objectives
- Key Insights and Findings
- Data Cleaning & Preprocessing
- Exploratory Data Analysis
- Visualizations
- Conclusion
- Tools & Libraries Used
- How to Run the Project
- Contact
The Google Play Store dataset contains essential information about various apps, including:
- App Name: Name of the application.
- Category: App category (e.g., Games, Social, Tools).
- Rating: User rating on a scale of 1 to 5.
- Reviews: Total number of user reviews.
- Size: App size in MB or KB.
- Installs: Total number of installs.
- Type: Whether the app is Free or Paid.
- Price: Price of the app (for paid apps).
- Content Rating: Targeted age group (e.g., Everyone, Teen, Mature 17+).
- Genres: Genre classification (e.g., Arcade, Puzzle).
- Last Updated: Last update date.
- Current Version: The latest version available.
- Android Version: Minimum Android version required.
The primary objectives of this project are to:
- Analyze the distribution of apps across categories.
- Identify trends in ratings, reviews, and installs.
- Understand how factors such as app size, pricing, and updates impact user engagement.
- Compare free vs. paid apps.
- Highlight the top-performing app genres and categories.
- Category Dominance: Categories such as "Family" and "Games" have the highest number of apps, while niche categories like "Comics" and "Beauty" have far fewer apps.
- Ratings Distribution: Most apps have ratings between 4.0 and 4.5, with fewer apps achieving very high or very low ratings.
- Review Trends: There is a positive correlation between the number of reviews and the rating score, with high-rated apps generally receiving more user feedback.
- Impact of Size on Installs: Large apps tend to have fewer installs compared to medium-sized apps, possibly due to storage limitations on user devices.
- Free vs Paid Apps: Free apps dominate the Google Play Store, but paid apps often have better average ratings.
- Content Rating: Most apps are suitable for all audiences, with a smaller proportion targeting mature users.
Before diving into the analysis, the following preprocessing steps were taken:
- Handling Missing Values:
  - Columns such as `Rating`, `Size`, and `Type` had missing values. Missing values were either imputed or removed based on their context.
- Data Type Conversions:
  - The `Installs` column, which had values with commas and "+" signs (e.g., "1,000,000+"), was cleaned and converted to integers for analysis.
  - The `Size` column, which had mixed units (MB and KB), was normalized to MB for consistency.
- Outlier Detection:
  - Outliers in columns like `Price` and `Reviews` were analyzed and handled appropriately.
- Feature Engineering:
  - Additional columns like `Price Category` (Free, Low, High) and `Size Category` (Small, Medium, Large) were derived for better insights.
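The type conversions and derived columns described above can be sketched with a few plain-Python helpers. This is a minimal sketch, not the notebook's actual code: it assumes raw sizes are strings such as "19M" or "512k", treats 1 MB as 1024 KB, and uses hypothetical cutoffs (5.0 for price, 10 and 50 MB for size) that would need tuning against the data. All function names are illustrative.

```python
def parse_installs(raw):
    """Turn a string such as "1,000,000+" into an int; return None if it is not numeric."""
    if raw is None:
        return None
    text = str(raw).replace(",", "").replace("+", "").strip()
    return int(text) if text.isdigit() else None


def parse_size_mb(raw):
    """Turn "19M" or "512k" into megabytes; return None for anything else."""
    if raw is None:
        return None
    text = str(raw).strip()
    if not text:
        return None
    unit = text[-1].upper()
    try:
        value = float(text[:-1])
    except ValueError:
        # Non-numeric entries (assumed to exist in the raw data) are treated as missing.
        return None
    if unit == "M":
        return value
    if unit == "K":
        return value / 1024  # assumption: 1 MB = 1024 KB
    return None


def price_category(price, low_max=5.0):
    """Bucket a numeric price into Free / Low / High (low_max is a hypothetical cutoff)."""
    if price <= 0:
        return "Free"
    return "Low" if price <= low_max else "High"


def size_category(size_mb, small_max=10.0, medium_max=50.0):
    """Bucket a size in MB into Small / Medium / Large (cutoffs are hypothetical)."""
    if size_mb < small_max:
        return "Small"
    if size_mb < medium_max:
        return "Medium"
    return "Large"
```

With pandas, each helper can be applied column-wise, for example `df["Installs"].map(parse_installs)`, where `df` is assumed to hold the raw dataset.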
- App Distribution by Category:
  - Bar charts were used to visualize the number of apps in each category, showing which categories dominate the Play Store.
- Rating Analysis:
  - Distribution of app ratings across categories was analyzed to identify which categories have the highest-rated apps.
- Correlation Analysis:
  - We explored relationships between features like `Reviews`, `Size`, and `Installs` using scatter plots and correlation matrices.
- Free vs. Paid Apps:
  - We compared ratings, installs, and reviews between free and paid apps to highlight key differences in user engagement and performance.
- Price Impact on Installs:
  - Analyzed how the price of paid apps affects the number of installs and the average rating.
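The free vs. paid comparison and the correlation matrix described above might look roughly like this in pandas. The sketch assumes the columns have already been cleaned to numeric types; the small `apps` frame is made-up sample data standing in for the real dataset, and the summary column names are illustrative.

```python
import pandas as pd

# Made-up sample standing in for the cleaned dataset (values are illustrative only).
apps = pd.DataFrame({
    "Type": ["Free", "Free", "Paid", "Paid", "Free"],
    "Rating": [4.1, 3.9, 4.6, 4.4, 4.3],
    "Reviews": [1200, 300, 150, 90, 5000],
    "Installs": [100000, 10000, 5000, 1000, 1000000],
    "Size": [12.0, 48.0, 25.0, 8.5, 30.0],
})

# Free vs. paid: app count plus typical rating, installs, and reviews per group.
type_summary = apps.groupby("Type").agg(
    apps=("Rating", "size"),
    mean_rating=("Rating", "mean"),
    median_installs=("Installs", "median"),
    median_reviews=("Reviews", "median"),
)

# Pairwise Pearson correlations (the pandas default) between the numeric features.
corr = apps[["Reviews", "Size", "Installs"]].corr()

print(type_summary)
print(corr)
```

Medians are used for installs and reviews here because such counts tend to be heavily skewed. That is a design choice of this sketch, not something stated in the analysis.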
The following visualizations were created to enhance understanding:
- App Categories: Bar plot showing the distribution of apps across different categories.
- Rating Distribution: Histograms to visualize the spread of ratings.
- Installs vs. Reviews: Scatter plot to explore the correlation between reviews and installs.
- Paid vs. Free Apps: Box plots comparing paid and free apps in terms of reviews, installs, and ratings.
- Content Rating: Pie chart showing the proportion of apps targeting different age groups.
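A few of the plots listed above can be sketched as follows. This is a minimal sketch that uses Matplotlib only (the project also uses Seaborn), with a hypothetical `plot_overview` helper and made-up sample data. It covers the category bar plot, the rating histogram, and a free vs. paid box plot of ratings.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import pandas as pd


def plot_overview(apps):
    """Draw three overview panels from a DataFrame with Category, Rating, and Type columns."""
    fig, (ax_cat, ax_rating, ax_type) = plt.subplots(1, 3, figsize=(15, 4))

    # Bar plot: number of apps in each category.
    apps["Category"].value_counts().plot.bar(ax=ax_cat)
    ax_cat.set_title("Apps per category")
    ax_cat.set_ylabel("Number of apps")

    # Histogram: spread of ratings.
    ax_rating.hist(apps["Rating"].dropna(), bins=10)
    ax_rating.set_title("Rating distribution")
    ax_rating.set_xlabel("Rating")

    # Box plot: ratings split by app type (Free vs. Paid).
    types = sorted(apps["Type"].dropna().unique())
    ratings_by_type = [
        apps.loc[apps["Type"] == t, "Rating"].dropna().to_numpy() for t in types
    ]
    ax_type.boxplot(ratings_by_type)
    ax_type.set_xticklabels(types)
    ax_type.set_title("Rating by app type")

    fig.tight_layout()
    return fig


# Made-up sample data for illustration only.
apps = pd.DataFrame({
    "Category": ["FAMILY", "GAME", "FAMILY", "TOOLS", "GAME", "FAMILY"],
    "Rating": [4.2, 4.5, 3.8, 4.0, 4.7, 4.1],
    "Type": ["Free", "Free", "Paid", "Free", "Paid", "Free"],
})
fig = plot_overview(apps)
```

Calling `fig.savefig("overview.png")` writes the figure to disk. The file name is arbitrary.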
From this EDA, we conclude that:
- Most users prefer free apps, but paid apps generally have higher user satisfaction (measured by ratings).
- Categories like "Games" and "Family" are dominant in terms of app numbers, but niche categories also have high ratings.
- App size can impact the number of installs, with medium-sized apps generally performing better.
- Frequent updates and newer versions correlate positively with app ratings and user engagement.
The following tools and libraries were used for data analysis and visualization:
- Python 🐍
- Pandas: For data manipulation.
- NumPy: For numerical operations.
- Matplotlib & Seaborn: For visualizations.
- Jupyter Notebook: For organizing the analysis and presenting results.