Dummy Data for Data Science Classes

User Knowledge Modeling Dataset

1-user_knowledge.csv: This dataset has 403 rows and 6 columns. It describes the learning efforts and knowledge levels of participants in a corporate training course.

#	Name	Definition	Data Type
1	STG	Study Time on Goal Material – How much time a participant spends studying the test material.	Quantitative
2	SCG	Study Count on Goal Material – How often a participant reviews or repeats the test material.	Quantitative
3	STR	Study Time on Related Material – How much time a participant spends learning content related to, but not part of, the test material.	Quantitative
4	LPR	Performance on Related Practice – How well a participant performs on quizzes related to the subject but not on the main goal content.	Quantitative
5	PEG	Performance on Exam Goals – How well a participant scores on questions that assess the main learning objectives.	Quantitative
6	UNS	User Knowledge Level – A label that reflects the participant's overall understanding.	Qualitative
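
As a quick sanity check, the column list above can be compared against the header of the CSV file. The sketch below uses only Python's standard library and assumes that the header row uses exactly the names from the table. The helper name `missing_columns` and the inline sample are illustrative and not part of the repository.

```python
import csv
import io

# Column names taken from the table above (assumed to match the CSV header exactly).
EXPECTED_COLUMNS = ["STG", "SCG", "STR", "LPR", "PEG", "UNS"]


def missing_columns(csv_file, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent from the CSV header."""
    reader = csv.reader(csv_file)
    header = [name.strip() for name in next(reader)]
    return [name for name in expected if name not in header]


# Illustrative in-memory sample; the values are made up, not taken from the dataset.
sample = io.StringIO("STG,SCG,STR,LPR,PEG,UNS\n0.1,0.2,0.3,0.4,0.5,Low\n")
print(missing_columns(sample))  # an empty list means every expected column is present
```

For the real file, `open("1-user_knowledge.csv", newline="")` could be passed in place of the sample.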

Car Evaluation Dataset

2-car_acceptance.csv: This dataset has 1728 rows and 7 columns. It describes cars in terms of price and technical characteristics across 6 attributes, such as buying price, maintenance, and safety, plus an acceptability rating (class).

#	Name	Definition	Data Type
1	buying	Buying price of the car (thousand USD)	Quantitative
2	maint	Maintenance price of the car (thousand USD)	Quantitative
3	doors	Number of doors (2, 3, 4, 5-more)	Qualitative
4	persons	Capacity in terms of persons to carry (2, 4, more)	Qualitative
5	lug_boot	Indicator whether the car has a large luggage boot (TRUE, FALSE)	Qualitative
6	safety	Estimated safety of the car (low, med, high)	Qualitative
7	class	Car acceptability (0: unacceptable, 10: very good)	Quantitative
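
Several of the attributes above are ordered categories. One common way to prepare them for modeling is ordinal encoding, sketched below with plain Python. The level orders are read off the table. The names `ORDINAL_LEVELS` and `encode_row` are hypothetical, and the assumption that the raw values appear exactly as written in the table should be checked against the file.

```python
# Level orders read from the table above; assumed to match the raw CSV values.
ORDINAL_LEVELS = {
    "doors": ["2", "3", "4", "5-more"],
    "persons": ["2", "4", "more"],
    "safety": ["low", "med", "high"],
}


def encode_row(row):
    """Return a copy of row with ordered categories replaced by their rank (0 = lowest)."""
    encoded = dict(row)
    for column, levels in ORDINAL_LEVELS.items():
        encoded[column] = levels.index(row[column])
    return encoded


# Hypothetical example row, not taken from the dataset.
example = {"doors": "4", "persons": "more", "safety": "med"}
print(encode_row(example))  # {'doors': 2, 'persons': 2, 'safety': 1}
```

A value outside the listed levels raises a `ValueError`, which makes unexpected codes visible early instead of silently mapping them.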

Online News Popularity Dataset

3-online_news_popularity.csv: This dataset has 39644 rows and 61 columns. It summarizes a heterogeneous set of features about articles published by Mashable over a period of two years.

#	Name	Definition	Data Type
1	URL	URL of the Article	Qualitative
2	Timedelta	Days Between the Article Publication and the Dataset Acquisition	Quantitative
3	N_Tokens_Title	Number of Words in the Title	Quantitative
4	N_Tokens_Content	Number of Words in the Content	Quantitative
5	N_Unique_Tokens	Rate of Unique Words in the Content	Quantitative
6	N_Non_Stop_Words	Rate of Non-Stop Words in the Content	Quantitative
7	N_Non_Stop_Unique_Tokens	Rate of Unique Non-Stop Words in the Content	Quantitative
8	Num_Hrefs	Number of Links	Quantitative
9	Num_Self_Hrefs	Number of Links to Other Articles Published by Mashable	Quantitative
10	Num_Imgs	Number of Images	Quantitative
11	Num_Videos	Number of Videos	Quantitative
12	Average_Token_Length	Average Length of the Words in the Content	Quantitative
13	Num_Keywords	Number of Keywords in the Metadata	Quantitative
14	Data_Channel_Is_Lifestyle	Is Data Channel 'Lifestyle'?	Quantitative
15	Data_Channel_Is_Entertainment	Is Data Channel 'Entertainment'?	Quantitative
16	Data_Channel_Is_Bus	Is Data Channel 'Business'?	Quantitative
17	Data_Channel_Is_Socmed	Is Data Channel 'Social Media'?	Quantitative
18	Data_Channel_Is_Tech	Is Data Channel 'Tech'?	Quantitative
19	Data_Channel_Is_World	Is Data Channel 'World'?	Quantitative
20	Kw_Min_Min	Worst Keyword (Min. Shares)	Quantitative
21	Kw_Max_Min	Worst Keyword (Max. Shares)	Quantitative
22	Kw_Avg_Min	Worst Keyword (Avg. Shares)	Quantitative
23	Kw_Min_Max	Best Keyword (Min. Shares)	Quantitative
24	Kw_Max_Max	Best Keyword (Max. Shares)	Quantitative
25	Kw_Avg_Max	Best Keyword (Avg. Shares)	Quantitative
26	Kw_Min_Avg	Avg. Keyword (Min. Shares)	Quantitative
27	Kw_Max_Avg	Avg. Keyword (Max. Shares)	Quantitative
28	Kw_Avg_Avg	Avg. Keyword (Avg. Shares)	Quantitative
29	Self_Reference_Min_Shares	Min. Shares of Referenced Articles in Mashable	Quantitative
30	Self_Reference_Max_Shares	Max. Shares of Referenced Articles in Mashable	Quantitative
31	Self_Reference_Avg_Sharess	Avg. Shares of Referenced Articles in Mashable	Quantitative
32	Weekday_Is_Monday	Was the Article Published on a Monday?	Quantitative
33	Weekday_Is_Tuesday	Was the Article Published on a Tuesday?	Quantitative
34	Weekday_Is_Wednesday	Was the Article Published on a Wednesday?	Quantitative
35	Weekday_Is_Thursday	Was the Article Published on a Thursday?	Quantitative
36	Weekday_Is_Friday	Was the Article Published on a Friday?	Quantitative
37	Weekday_Is_Saturday	Was the Article Published on a Saturday?	Quantitative
38	Weekday_Is_Sunday	Was the Article Published on a Sunday?	Quantitative
39	Is_Weekend	Was the Article Published on the Weekend?	Quantitative
40	Lda_00	Closeness to LDA Topic 0	Quantitative
41	Lda_01	Closeness to LDA Topic 1	Quantitative
42	Lda_02	Closeness to LDA Topic 2	Quantitative
43	Lda_03	Closeness to LDA Topic 3	Quantitative
44	Lda_04	Closeness to LDA Topic 4	Quantitative
45	Global_Subjectivity	Text Subjectivity	Quantitative
46	Global_Sentiment_Polarity	Text Sentiment Polarity	Quantitative
47	Global_Rate_Positive_Words	Rate of Positive Words in the Content	Quantitative
48	Global_Rate_Negative_Words	Rate of Negative Words in the Content	Quantitative
49	Rate_Positive_Words	Rate of Positive Words Among Non-Neutral Tokens	Quantitative
50	Rate_Negative_Words	Rate of Negative Words Among Non-Neutral Tokens	Quantitative
51	Avg_Positive_Polarity	Avg. Polarity of Positive Words	Quantitative
52	Min_Positive_Polarity	Min. Polarity of Positive Words	Quantitative
53	Max_Positive_Polarity	Max. Polarity of Positive Words	Quantitative
54	Avg_Negative_Polarity	Avg. Polarity of Negative Words	Quantitative
55	Min_Negative_Polarity	Min. Polarity of Negative Words	Quantitative
56	Max_Negative_Polarity	Max. Polarity of Negative Words	Quantitative
57	Title_Subjectivity	Title Subjectivity	Quantitative
58	Title_Sentiment_Polarity	Title Polarity	Quantitative
59	Abs_Title_Subjectivity	Absolute Subjectivity Level	Quantitative
60	Abs_Title_Sentiment_Polarity	Absolute Polarity Level	Quantitative
61	Shares	Number of Shares	Quantitative
62	High_Shares	Dummy indicator whether the article has been highly shared	Qualitative
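
The seven `Weekday_Is_*` columns encode a single categorical fact as indicator variables. For grouping or plotting it can be handy to collapse them back into one label. The following is a minimal sketch, assuming the indicators are stored as 0/1 values (possibly as strings or floats such as `1.0`) and that the column names in the file match the table. `weekday_of` is a hypothetical helper.

```python
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]


def weekday_of(row):
    """Return the weekday whose indicator column is set, or None if none is set."""
    for day in WEEKDAYS:
        # float() accepts 1, 1.0, "1" and "1.0" alike.
        if float(row[f"Weekday_Is_{day}"]) == 1.0:
            return day
    return None


# Hypothetical row with only the Wednesday indicator set.
row = {f"Weekday_Is_{day}": "0.0" for day in WEEKDAYS}
row["Weekday_Is_Wednesday"] = "1.0"
print(weekday_of(row))  # Wednesday
```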

Repurchase Likelihood Dataset

4-repurchase_likelihood.csv: This dataset has 891 rows and 10 columns. It builds on the Titanic Survival dataset, with columns renamed to represent a set of factors that might predict the likelihood of a repeat purchase from a retail store chain. The data is prepared as if it were taken from a loyalty card programme.

#	Name	Definition	Data Type
1	customer_id	Customer ID	Qualitative
2	repurchase	Indicator whether a repurchase occurred (0: No, 1: Yes)	Qualitative
3	customer_tier	Loyalty tier to which a customer was assigned (1: Frequent, 2: Repeated, 3: Seldom)	Qualitative
4	customer_name	Name of the customer	Qualitative
5	customer_sex	Sex of the customer (female, male)	Qualitative
6	customer_age	Age of the customer (parents sometimes created loyalty cards for their children)	Quantitative
7	customer_siblings	Number of siblings that also have a loyalty card	Quantitative
8	customer_parents	Number of parents/grandparents that also have a loyalty card	Quantitative
9	recent_purchase	Amount spent on the most recent purchase	Quantitative
10	customer_location	Preferred location of the customer (C: Cherbourg, Q: Queenstown, S: Southampton)	Qualitative
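
The coded columns can be translated into readable labels using the mappings given in the table. The sketch below assumes the codes appear in the file exactly as listed. The names `CODE_LABELS` and `decode_customer` are illustrative.

```python
# Code-to-label mappings copied from the table above.
CODE_LABELS = {
    "repurchase": {"0": "No", "1": "Yes"},
    "customer_tier": {"1": "Frequent", "2": "Repeated", "3": "Seldom"},
    "customer_location": {"C": "Cherbourg", "Q": "Queenstown", "S": "Southampton"},
}


def decode_customer(row):
    """Return a copy of row with coded fields replaced by their labels.

    Codes that are not in the table are left unchanged rather than guessed.
    """
    decoded = dict(row)
    for column, labels in CODE_LABELS.items():
        if column in row:
            decoded[column] = labels.get(str(row[column]), row[column])
    return decoded


# Hypothetical record, not taken from the dataset.
example = {"repurchase": "1", "customer_tier": "3", "customer_location": "S"}
print(decode_customer(example))
# {'repurchase': 'Yes', 'customer_tier': 'Seldom', 'customer_location': 'Southampton'}
```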

Subsidiary Income Dataset

5-subsidiary_income.csv: This dataset has 10,159 rows and 14 columns. It contains income data for foreign subsidiaries across multiple countries and years, together with anonymized numeric and categorical features.

#	Name	Definition	Data Type
1	Year	Year of the record	Integer
2	CountryCode	ISO code representing the country	String
3	DomesticId	Unique identifier for domestic entities	Integer
4	ForeignId	Unique identifier for foreign entities	Integer
5	Income	Income value, possibly in local currency	Float
6	NumFeature1	Numeric feature 1	Float
7	NumFeature2	Numeric feature 2	Float
8	NumFeature3	Numeric feature 3	Float
9	NumFeature4	Numeric feature 4	Float
10	NumFeature5	Numeric feature 5	Float
11	NumFeature6	Numeric feature 6	Float
12	FactorFeature1	Categorical or factor feature 1	String
13	FactorFeature2	Categorical or factor feature 2	String
14	FactorFeature3	Categorical or factor feature 3	String
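
Because each record is identified by year and country, a natural first summary is the total income per country and year. The sketch below is one way to compute it with the standard library. It assumes the column names from the table, and the function name and sample records are made up for illustration.

```python
from collections import defaultdict


def total_income_by_country_year(rows):
    """Sum Income over all records sharing the same (CountryCode, Year)."""
    totals = defaultdict(float)
    for row in rows:
        key = (row["CountryCode"], int(row["Year"]))
        totals[key] += float(row["Income"])
    return dict(totals)


# Hypothetical records; country codes and values are invented for illustration.
rows = [
    {"Year": "2020", "CountryCode": "AA", "Income": "1.5"},
    {"Year": "2020", "CountryCode": "AA", "Income": "2.5"},
    {"Year": "2021", "CountryCode": "BB", "Income": "4.0"},
]
print(total_income_by_country_year(rows))  # {('AA', 2020): 4.0, ('BB', 2021): 4.0}
```

Since Income may be recorded in local currency, totals are kept per country here and are not added up across countries.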

Acknowledgements

Datasets 1-4 were sourced from Data Science Dojo. Dataset 5 is synthetic data based on information from the Austrian National Bank.

User Knowledge Data

This data set was originally sourced from the Machine Learning Repository of the University of California, Irvine: User Knowledge Modeling Data Set (UC Irvine). The UCI page mentions the following publication as the original source of the data set: H. T. Kahraman, S. Sagiroglu, I. Colak, Developing intuitive knowledge classifier and modeling of users' domain dependent data in web, Knowledge Based Systems, vol. 37, pp. 283-295, 2013.

Car Acceptance Data

This data set has been sourced from the Machine Learning Repository of the University of California, Irvine: Car Evaluation Data Set (UC Irvine). The UCI page mentions the following as the donors of the dataset: Marko Bohanec (marko.bohanec '@' ijs.si) and Blaz Zupan (blaz.zupan '@' ijs.si).

Online News Popularity

This data set has been sourced from the Machine Learning Repository of the University of California, Irvine: Online News Popularity Data Set (UC Irvine). The UCI page mentions the following publication as the original source of the data set: K. Fernandes, P. Vinagre and P. Cortez. A Proactive Intelligent Decision Support System for Predicting the Popularity of Online News. Proceedings of the 17th EPIA 2015 - Portuguese Conference on Artificial Intelligence, September, Coimbra, Portugal.
