In the first part of the project, the following actions were performed for data exploration and visualization:
- Data Loading: The dataset of 1.6 million headlines from The Irish Times news site was downloaded from Kaggle and loaded.
- Visualization:
  - Word Clouds: Word clouds were generated to visualize the most frequent words in the headlines, providing an overview of the key themes and topics.
  - Bar Plots: Bar plots displayed the distribution of headlines across categories, showing the class imbalance and the prevalence of each category. We also visualized the sentiment and sentiment polarity for each category.
  - Histograms: Histograms of headline length were created to identify patterns or trends in how long headlines are.
  - Count Plots: Count plots were used to visualize the distribution of headlines by specific criteria, such as the source or publication date, allowing for a deeper exploration of the data.
  - Pie Charts: Pie charts were used to present the proportion of headlines in each category, providing a visual representation of the class distribution.
- Statistical Analysis:
  - Descriptive Statistics: Descriptive statistics were calculated to summarize headline lengths, category frequencies, and other relevant metrics.
  - Correlation Analysis: Correlation analysis was performed to identify relationships or dependencies between headline features, providing insight into potential associations between variables.
- Data Preprocessing: Preprocessing steps were undertaken to clean and prepare the data for subsequent modeling, including text normalization, handling missing values, and encoding categorical variables.
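
The preprocessing and descriptive-statistics steps listed above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code. It uses only the Python standard library, and the sample rows, the category labels, and the helper names (`normalize`, `preprocess`, `encode_categories`, `length_stats`) are all hypothetical. The normalization rule shown (lowercasing and keeping only ASCII letters and digits) is one possible choice and would drop accented characters.

```python
import re
import statistics
from collections import Counter

# Hypothetical sample rows (headline, category); the real dataset has ~1.6M rows.
SAMPLE = [
    ("Government announces new housing plan", "news"),
    ("  Dublin win the All-Ireland FINAL ", "sport"),
    ("Markets fall as rates rise", "business"),
    (None, "news"),  # a row with a missing headline
    ("Leinster name squad for Europe", "sport"),
]


def normalize(text):
    """Lowercase, replace non-alphanumeric ASCII characters, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def preprocess(rows):
    """Drop rows whose headline is missing or empty and normalize the rest."""
    return [(normalize(h), c) for h, c in rows if h]


def encode_categories(rows):
    """Map each category label to an integer id (sorted for determinism)."""
    labels = sorted({c for _, c in rows})
    return {label: i for i, label in enumerate(labels)}


def length_stats(rows):
    """Descriptive statistics of headline length, measured in words."""
    lengths = [len(h.split()) for h, _ in rows]
    return {
        "count": len(lengths),
        "mean": statistics.mean(lengths),
        "median": statistics.median(lengths),
        "min": min(lengths),
        "max": max(lengths),
    }


clean = preprocess(SAMPLE)
category_counts = Counter(c for _, c in clean)            # input for bar/pie charts
word_freq = Counter(w for h, _ in clean for w in h.split())  # input for a word cloud
category_ids = encode_categories(clean)
stats = length_stats(clean)

print(category_counts.most_common())
print(stats)
```

The category counts are the kind of numbers a bar or pie chart of the class distribution is built from, and the word frequencies are what a word cloud scales its words by. On the full dataset a data-frame library would be the more practical choice.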

These exploratory data analysis techniques gave us a clear picture of the dataset's characteristics and distributions, and laid the foundation for model development and optimization in the second part of the project.
