I have always loved reading, so I thought: why not use R and data science to help me decide which book I should read next? After poking around on Kaggle, I found two datasets containing a myriad of information on many titles. The first dataset, by Soumik, was scraped from the Goodreads API using ISBNs and contains the variables ‘bookID’ (a unique integer associated with each book), ‘title’, ‘authors’, ‘average_rating’ (0-5 scale), ‘isbn’, ‘isbn13’, ‘language_code’ (eng or other languages), ‘num_pages’, ‘ratings_count’ (total number of ratings), and ‘text_reviews_count’ (total number of written text reviews). Among these variables, ‘ratings_count’ and ‘average_rating’ are closely related, since the latter is simply the total rating score divided by the former. However, we wish to compare average ratings so that book suggestions will not be biased by the number of ratings. Furthermore, I would expect ‘ratings_count’ and ‘text_reviews_count’ to be highly correlated: if a book has many ratings, it must be somewhat popular (or notorious), so it probably also has many written reviews.
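To make that relationship concrete, here is a toy illustration (the numbers are made up) of how an average rating relates to its ratings count:
# a book that received three ratings: 5, 4, and 3 stars
scores <- c(5, 4, 3)
ratings_count <- length(scores)  # 3
average_rating <- sum(scores)/ratings_count  # 4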
The second dataset, authored by Dylan Castillo, was originally intended as an extension of Soumik’s dataset, this time built from the Google Books API (crawler) by ISBN. It contains the variables ‘isbn13’, ‘isbn10’, ‘title’, ‘subtitle’, ‘authors’, ‘categories’ (genre information), ‘thumbnail’ (URL of the book cover), ‘description’ (short summary), ‘published_year’, and ‘average_rating’ (again 0-5). As one would expect, this dataset largely overlaps with the first, so when we join we can afford an inner join by ISBN (10 and 13), as both datasets contain essentially the same information for the overlapping titles. For this dataset, I would again expect an association between ‘ratings_count’ and ‘average_rating’, where the latter is the total rating score normalized by the former. I also speculate that the more ratings a particular book has, the higher its average rating will be: people cared enough to bestow a rating, and a larger pool of ratings drowns out extreme reviews.
Finally, the datasets can be found at:
https://www.kaggle.com/jealousleopard/goodreadsbooks
https://www.kaggle.com/dylanjcastillo/7k-books-with-metadata
library(tidyverse)
select <- dplyr::select # resolve conflict with MASS
# read in datasets
books <- read.csv("books.csv")
books2 <- read.csv("books2.csv")
# first look at dataset
books %>% glimpse()
## Rows: 11,131
## Columns: 12
## $ bookID <chr> "1", "2", "4", "5", "8", "9", "10", "12", "13", "14~
## $ title <chr> "Harry Potter and the Half-Blood Prince (Harry Pott~
## $ authors <chr> "J.K. Rowling/Mary GrandPré", "J.K. Rowling/Mary G~
## $ average_rating <chr> "4.57", "4.49", "4.42", "4.56", "4.78", "3.74", "4.~
## $ isbn <chr> "0439785960", "0439358078", "0439554896", "04396554~
## $ isbn13 <chr> "9780439785969", "9780439358071", "9780439554893", ~
## $ language_code <chr> "eng", "eng", "eng", "eng", "eng", "en-US", "eng", ~
## $ num_pages <chr> "652", "870", "352", "435", "2690", "152", "3342", ~
## $ ratings_count <int> 2095690, 2153167, 6333, 2339585, 41428, 19, 28242, ~
## $ text_reviews_count <int> 27591, 29221, 244, 36325, 164, 1, 808, 254, 4080, 4~
## $ publication_date <chr> "9/16/2006", "9/1/2004", "11/1/2003", "5/1/2004", "~
## $ publisher <chr> "Scholastic Inc.", "Scholastic Inc.", "Scholastic",~
books2 %>% glimpse()
## Rows: 6,810
## Columns: 12
## $ isbn13 <dbl> 9.780002e+12, 9.780002e+12, 9.780006e+12, 9.780006e+12,~
## $ isbn10 <chr> "0002005883", "0002261987", "0006163831", "0006178731",~
## $ title <chr> "Gilead", "Spider's Web", "The One Tree", "Rage of ange~
## $ subtitle <chr> "", "A Novel", "", "", "", "", "", "A History of the In~
## $ authors <chr> "Marilynne Robinson", "Charles Osborne;Agatha Christie"~
## $ categories <chr> "Fiction", "Detective and mystery stories", "American f~
## $ thumbnail <chr> "http://books.google.com/books/content?id=KQZCPgAACAAJ&~
## $ description <chr> "A NOVEL THAT READERS and critics have been eagerly ant~
## $ published_year <int> 2004, 2000, 1982, 1993, 2002, 2002, 1977, 1998, 1994, 1~
## $ average_rating <dbl> 3.85, 3.83, 3.97, 3.93, 4.15, 4.09, 4.27, 4.41, 4.15, 4~
## $ num_pages <int> 247, 241, 479, 512, 170, 176, 560, 608, 743, 489, 501, ~
## $ ratings_count <int> 361, 5164, 172, 29532, 33684, 37569, 3975, 65, 103, 435~
The datasets appear tidy and do not require pivot functions, which will instead be used to clean up summary statistics later. However, after some more in-depth visual inspection, I noticed a few problems. In particular, some observations (books) have information keyed in wrong, some have their field order scrambled (leaving NAs in one variable and obviously false information in another), some columns share the same information, and some variables are not the correct type. We fix these small issues below; when all is done, we have lost 5 observations from the first dataset.
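Before dropping anything, a quick sketch (using newer dplyr verbs than the rest of this analysis) to surface the suspect rows:
# count NAs per column to locate the scrambled observations
books %>% summarise(across(everything(), ~sum(is.na(.x))))
# eyeball the offending rows before dropping them
books %>% filter(if_any(everything(), is.na))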
# fix some obvious data issues: rows with NAs were keyed in
# wrong, so for simplicity let's drop them; we also don't
# need the bookID field
books <- books %>% na.omit() %>% select(-bookID)
# in books, average_rating and num_pages should be numeric
# rather than character
books <- books %>% mutate(average_rating = as.numeric(average_rating),
num_pages = as.numeric(num_pages))
# keep only the primary author (i.e. drop everything after
# the first '/')
books <- books %>% separate(authors, into = "authors", sep = "/")
# we need to make isbn13 in books2 a character vector
books2 <- books2 %>% mutate(isbn13 = as.character(isbn13))
# drop the columns books2 shares with books: there is no
# good way to combine them
books2 <- books2 %>% select(-c("title", "authors", "published_year"))
# the datasets seem tidy now, but let's make sure: let's
# investigate whether each book (observation) is uniquely
# identified by its ISBN10 and ISBN13 in each dataset
books %>% select(isbn, isbn13) %>% summarize_all(n_distinct)
## isbn isbn13
## 1 11126 11127
books2 %>% select(isbn10, isbn13) %>% summarize_all(n_distinct)
## isbn10 isbn13
## 1 6810 6810
# books2 is indeed tidy: every book has a unique isbn10 and
# isbn13. However, there seems to be a duplicate isbn in
# books, so we simply exclude it
books <- books %>% filter(!duplicated(isbn))
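As a sanity check (a quick sketch), the row count should now match the distinct ISBN counts:
# every remaining row should have a unique isbn and isbn13
books %>% summarize(n = n(), distinct_isbn = n_distinct(isbn),
    distinct_isbn13 = n_distinct(isbn13))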
After contemplation, I decided to perform an inner join. For one, the second dataset is an extension of the first, which means we should retain a fairly large number of shared observations (5686 in total). This way, we also don’t have to worry about managing possible NAs in our merged dataset. The tricky part is deciding which columns to merge on, given that the two datasets share several similar variables (i.e. authors, isbn, title, average_rating, num_pages, ratings_count). In fact, the values in these shared columns aren’t exactly the same, so I selected only the ISBN columns (isbn = isbn10, isbn13) to merge on, since ISBNs are the universal convention for uniquely identifying a book.
# we can now join our dataset!
merged_books <- books %>% inner_join(books2, by = c(isbn = "isbn10",
    isbn13 = "isbn13"))
merged_books %>% glimpse()
## Rows: 5,686
## Columns: 18
## $ title <chr> "Harry Potter and the Half-Blood Prince (Harry Pott~
## $ authors <chr> "J.K. Rowling", "J.K. Rowling", "J.K. Rowling", "J.~
## $ average_rating.x <dbl> 4.57, 4.49, 4.42, 4.56, 4.78, 3.74, 4.73, 4.38, 4.2~
## $ isbn <chr> "0439785960", "0439358078", "0439554896", "04396554~
## $ isbn13 <chr> "9780439785969", "9780439358071", "9780439554893", ~
## $ language_code <chr> "eng", "eng", "eng", "eng", "eng", "en-US", "eng", ~
## $ num_pages.x <dbl> 652, 870, 352, 435, 2690, 152, 3342, 815, 215, 815,~
## $ ratings_count.x <int> 2095690, 2153167, 6333, 2339585, 41428, 19, 28242, ~
## $ text_reviews_count <int> 27591, 29221, 244, 36325, 164, 1, 808, 4080, 460, 1~
## $ publication_date <chr> "9/16/2006", "9/1/2004", "11/1/2003", "5/1/2004", "~
## $ publisher <chr> "Scholastic Inc.", "Scholastic Inc.", "Scholastic",~
## $ subtitle <chr> "", "", "", "", "5 Years of Magic, Adventure, and M~
## $ categories <chr> "Juvenile Fiction", "Juvenile Fiction", "Juvenile F~
## $ thumbnail <chr> "http://books.google.com/books/content?id=QzI0BgAAQ~
## $ description <chr> "When Harry Potter and the Half-Blood Prince opens,~
## $ average_rating.y <dbl> 4.56, 4.49, 4.41, 4.55, 4.78, 3.69, 4.73, 4.38, 4.2~
## $ num_pages.y <int> 652, 870, 352, 435, 2690, 152, 3342, 815, 215, 815,~
## $ ratings_count.y <int> 1944099, 1996446, 6267, 2149872, 38872, 18, 27410, ~
After joining, we lost 5440 observations (about half). This could mean that we have lost some levels of our categorical variables (e.g. we no longer have non-English books), but it turns out not to be a big problem, given that our categorical variables have many levels.
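If we wanted to inspect exactly what the join discarded, a sketch using anti_join (its output is not shown in the original analysis):
# books dropped by the inner join, tallied by language
books %>% anti_join(books2, by = c(isbn = "isbn10", isbn13 = "isbn13")) %>%
    count(language_code, sort = TRUE)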
To simplify the merged data, for the shared character variables (i.e. title, authors, and published_year, which can be derived from publication_date) I simply kept the first dataset’s columns and dropped the duplicates from the second, since they describe the same information and there is no good way to combine them. For the shared numerical variables (i.e. average_rating, num_pages, ratings_count), I decided to average the two versions after joining and then delete both original columns.
# create new columns as the ceiled average of each pair of
# .x and .y columns, except average_rating, which should
# stay continuous anyway
merged_books <- merged_books %>% mutate(average_rating = (average_rating.x +
average_rating.y)/2, num_pages = ceiling((num_pages.x + num_pages.y)/2),
ratings_count = ceiling((ratings_count.x + ratings_count.y)/2))
# we can now drop the .x and .y columns altogether
merged_books <- merged_books %>% select(-(contains(".")))
# finally, move some columns around for easy viewing!
merged_books <- merged_books %>% relocate(title, authors, average_rating,
ratings_count, num_pages, language_code, categories, isbn,
isbn13)
merged_books %>% glimpse()
## Rows: 5,686
## Columns: 15
## $ title <chr> "Harry Potter and the Half-Blood Prince (Harry Pott~
## $ authors <chr> "J.K. Rowling", "J.K. Rowling", "J.K. Rowling", "J.~
## $ average_rating <dbl> 4.565, 4.490, 4.415, 4.555, 4.780, 3.715, 4.730, 4.~
## $ ratings_count <dbl> 2019895, 2074807, 6300, 2244729, 40150, 19, 27826, ~
## $ num_pages <dbl> 652, 870, 352, 435, 2690, 152, 3342, 815, 215, 815,~
## $ language_code <chr> "eng", "eng", "eng", "eng", "eng", "en-US", "eng", ~
## $ categories <chr> "Juvenile Fiction", "Juvenile Fiction", "Juvenile F~
## $ isbn <chr> "0439785960", "0439358078", "0439554896", "04396554~
## $ isbn13 <chr> "9780439785969", "9780439358071", "9780439554893", ~
## $ text_reviews_count <int> 27591, 29221, 244, 36325, 164, 1, 808, 4080, 460, 1~
## $ publication_date <chr> "9/16/2006", "9/1/2004", "11/1/2003", "5/1/2004", "~
## $ publisher <chr> "Scholastic Inc.", "Scholastic Inc.", "Scholastic",~
## $ subtitle <chr> "", "", "", "", "5 Years of Magic, Adventure, and M~
## $ thumbnail <chr> "http://books.google.com/books/content?id=QzI0BgAAQ~
## $ description <chr> "When Harry Potter and the Half-Blood Prince opens,~
library(gt) # tables
# time for some exploration! We can first separate
# publication date into new month, day, and year columns
merged_books <- merged_books %>% separate(publication_date, into = c("month",
"day", "year"))
merged_books %>% glimpse()
## Rows: 5,686
## Columns: 17
## $ title <chr> "Harry Potter and the Half-Blood Prince (Harry Pott~
## $ authors <chr> "J.K. Rowling", "J.K. Rowling", "J.K. Rowling", "J.~
## $ average_rating <dbl> 4.565, 4.490, 4.415, 4.555, 4.780, 3.715, 4.730, 4.~
## $ ratings_count <dbl> 2019895, 2074807, 6300, 2244729, 40150, 19, 27826, ~
## $ num_pages <dbl> 652, 870, 352, 435, 2690, 152, 3342, 815, 215, 815,~
## $ language_code <chr> "eng", "eng", "eng", "eng", "eng", "en-US", "eng", ~
## $ categories <chr> "Juvenile Fiction", "Juvenile Fiction", "Juvenile F~
## $ isbn <chr> "0439785960", "0439358078", "0439554896", "04396554~
## $ isbn13 <chr> "9780439785969", "9780439358071", "9780439554893", ~
## $ text_reviews_count <int> 27591, 29221, 244, 36325, 164, 1, 808, 4080, 460, 1~
## $ month <chr> "9", "9", "11", "5", "9", "4", "9", "4", "8", "1", ~
## $ day <chr> "16", "1", "1", "1", "13", "26", "12", "30", "3", "~
## $ year <chr> "2006", "2004", "2003", "2004", "2004", "2005", "20~
## $ publisher <chr> "Scholastic Inc.", "Scholastic Inc.", "Scholastic",~
## $ subtitle <chr> "", "", "", "", "5 Years of Magic, Adventure, and M~
## $ thumbnail <chr> "http://books.google.com/books/content?id=QzI0BgAAQ~
## $ description <chr> "When Harry Potter and the Half-Blood Prince opens,~
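As an aside, a more robust alternative to separate() would be lubridate’s parsers; a minimal sketch (assuming all dates follow the month/day/year layout seen above):
library(lubridate)  # hedged alternative; not used elsewhere in this analysis
# parse raw strings directly; anything malformed becomes NA with a warning
mdy(c("9/16/2006", "11/1/2003", "not a date"))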
# which book has the highest average_rating among those with
# at least 100 ratings? Calvin and Hobbes!
merged_books %>% filter(ratings_count >= 100) %>% arrange(desc(average_rating)) %>%
select(title) %>% slice(1) %>% .$title
## [1] "The Complete Calvin and Hobbes"
# which book from J.K. Rowling has the highest rating? A
# boxed set of the Harry Potter series!
merged_books %>% filter(str_detect(authors, "J.K. Rowling")) %>%
arrange(desc(average_rating)) %>% slice(1) %>% .$title
## [1] "Harry Potter Boxed Set Books 1-5 (Harry Potter #1-5)"
The Complete Calvin and Hobbes is the best-rated book with at least 100 ratings (a childhood classic!). The highest-rated title from J.K. Rowling is a Harry Potter boxed set; no surprises here!
# show the correlation matrix between numerical variables
# Interpretation can be found under the visualization section
cor_matrix <- merged_books %>% select(where(is.numeric)) %>% cor() %>%
round(3)
colnames(cor_matrix) %>% cbind(cor_matrix) %>% as.data.frame %>%
gt()
variable | average_rating | ratings_count | num_pages | text_reviews_count |
---|---|---|---|---|
average_rating | 1 | 0.043 | 0.18 | 0.037 |
ratings_count | 0.043 | 1 | 0.041 | 0.876 |
num_pages | 0.18 | 0.041 | 1 | 0.04 |
text_reviews_count | 0.037 | 0.876 | 0.04 | 1 |
# show the number of distinct values for non-numerical variables
merged_books %>% select(where(is.character)) %>% dplyr::summarise_all(.funs = list(.distinct = n_distinct)) %>%
pivot_longer(everything(), names_to = "variables", values_to = "distinct_count") %>%
separate(variables, into = c("variable"), sep = "_\\.") %>%
gt()
variable | distinct_count |
---|---|
title | 5528 |
authors | 2843 |
language_code | 5 |
categories | 406 |
isbn | 5686 |
isbn13 | 5686 |
month | 12 |
day | 31 |
year | 77 |
publisher | 1373 |
subtitle | 1767 |
thumbnail | 5474 |
description | 5478 |
Expectedly, we have 12 unique values for month (1-12) and 31 for day (1-31). However, we see that even though each book has a unique isbn and isbn13, there are some duplicated titles (most likely different versions/reprints). We have only 5 language codes, which indicates that many levels were lost when we dropped the NA entries and performed the inner join. The thumbnails and descriptions are largely unique, while there is also a surprising number of unique publishers and categories.
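To confirm that the duplicated titles are indeed different editions, a quick sketch (output not shown):
# titles appearing more than once, with how many editions each has
merged_books %>% group_by(title) %>% filter(n() > 1) %>%
    summarize(editions = n()) %>% arrange(desc(editions))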
# calculate summary stats, no groups (mean, sd, var,
# quantile, min, max)
summary_stat <- merged_books %>% select(where(is.numeric)) %>% summarize_all(.funs = list(.mean = mean,
.sd = sd, .var = var, .median = median, .min = min, .max = max),
na.rm = T)
# we now clean up summary_stat using pivot functions: first,
# stuff all stats and their values into long format; then
# separate the statistic name from the variable name into
# its own column; finally, spread each stat function into
# its own column for table viewing
summary_stat %>% round(3) %>% pivot_longer(everything(), names_to = "variables",
values_to = "values") %>% separate(variables, into = c("variable",
"stat"), sep = "_\\.") %>% pivot_wider(names_from = "stat",
values_from = "values") %>% gt()
variable | mean | sd | var | median | min | max |
---|---|---|---|---|---|---|
average_rating | 3.935 | 0.329 | 1.080000e-01 | 3.96 | 0 | 5 |
ratings_count | 20125.570 | 128461.947 | 1.650247e+10 | 1031.50 | 0 | 4482504 |
num_pages | 343.568 | 230.518 | 5.313871e+04 | 304.00 | 0 | 3342 |
text_reviews_count | 617.817 | 2889.346 | 8.348322e+06 | 60.00 | 0 | 94265 |
On average, our merged dataset has a relatively high average_rating of ~3.9, and given that sd = 0.329, roughly 95% of our titles have an average_rating within two standard deviations of the mean, about [3.3, 4.6], an “above average” rating. The median of average_rating is very close to the mean, which indicates there are no extreme outliers pulling the mean (helped by the restricted 0-5 range). The min and max follow logically as 0 and 5. The other metrics, in contrast, have much larger spreads and conflicting mean and median values, indicative of outliers (i.e. very long or very popular books), so we cannot generalize about them as easily.
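The two-standard-deviation claim is easy to verify empirically; a sketch (the resulting proportion is not part of the original output):
m <- mean(merged_books$average_rating)
s <- sd(merged_books$average_rating)
# proportion of titles within mean +/- 2 sd; ~0.95 if roughly normal
mean(abs(merged_books$average_rating - m) <= 2 * s)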
# calculate summary stats, group by language_code
summary_stat_lang <- merged_books %>% group_by(language_code) %>%
select(where(is.numeric)) %>% summarize_all(.funs = list(.mean = mean,
.sd = sd, .var = var, .median = median, .min = min, .max = max))
# we should now clean up summary_stat_lang
summary_stat_lang %>% pivot_longer(where(is.numeric), names_to = "variables",
values_to = "values") %>% separate(variables, into = c("variable",
"stat"), sep = "_\\.") %>% pivot_wider(names_from = "stat",
values_from = "values") %>% mutate_if(is.numeric, round,
3) %>% gt()
language_code | variable | mean | sd | var | median | min | max |
---|---|---|---|---|---|---|---|
en-CA | average_rating | 4.091 | 0.120 | 1.400000e-02 | 4.105 | 3.945 | 4.210 |
en-CA | ratings_count | 4119.250 | 1434.519 | 2.057846e+06 | 4661.500 | 2043.000 | 5111.000 |
en-CA | num_pages | 223.750 | 43.531 | 1.894917e+03 | 239.500 | 160.000 | 256.000 |
en-CA | text_reviews_count | 298.750 | 163.732 | 2.680825e+04 | 267.000 | 155.000 | 506.000 |
en-GB | average_rating | 3.915 | 0.299 | 9.000000e-02 | 3.940 | 2.670 | 4.535 |
en-GB | ratings_count | 2838.202 | 11426.034 | 1.305543e+08 | 203.500 | 3.000 | 100729.000 |
en-GB | num_pages | 323.070 | 211.776 | 4.484897e+04 | 284.500 | 32.000 | 1262.000 |
en-GB | text_reviews_count | 121.746 | 396.289 | 1.570449e+05 | 15.000 | 0.000 | 3838.000 |
en-US | average_rating | 3.919 | 0.281 | 7.900000e-02 | 3.950 | 2.670 | 4.880 |
en-US | ratings_count | 3997.590 | 12032.907 | 1.447909e+08 | 618.000 | 2.000 | 189065.000 |
en-US | num_pages | 336.759 | 225.169 | 5.070120e+04 | 307.000 | 24.000 | 2480.000 |
en-US | text_reviews_count | 174.386 | 475.223 | 2.258366e+05 | 38.000 | 0.000 | 6599.000 |
eng | average_rating | 3.938 | 0.337 | 1.140000e-01 | 3.960 | 0.000 | 5.000 |
eng | ratings_count | 23255.991 | 139957.894 | 1.958821e+10 | 1183.000 | 0.000 | 4482504.000 |
eng | num_pages | 345.233 | 231.921 | 5.378729e+04 | 304.000 | 0.000 | 3342.000 |
eng | text_reviews_count | 704.307 | 3141.016 | 9.865984e+06 | 67.000 | 0.000 | 94265.000 |
mul | average_rating | 4.235 | 0.332 | 1.100000e-01 | 4.235 | 4.000 | 4.470 |
mul | ratings_count | 29.000 | 1.414 | 2.000000e+00 | 29.000 | 28.000 | 30.000 |
mul | num_pages | 496.000 | 45.255 | 2.048000e+03 | 496.000 | 464.000 | 528.000 |
mul | text_reviews_count | 3.000 | 1.414 | 2.000000e+00 | 3.000 | 2.000 | 4.000 |
When categorized by language code, our metrics follow the same trends as before. However, we note that the multilingual books in particular have a higher average_rating and much smaller ratings_count/text_reviews_count than the English books. This is probably due first to their smaller sample size, and also to the fact that reading and reviewing a book in another language requires substantial knowledge; yet such books are often of high quality, given the substantial translation work behind them.
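The sample-size point can be checked directly; a one-line sketch:
# how many titles fall under each language code
merged_books %>% count(language_code, sort = TRUE)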
# calculate summary stats, grouped by authors and language
# codes. Notice there are too many authors, so let's just
# choose 3 of my favorite authors!
summary_stat_auth_lang <- merged_books %>% filter(authors %in%
    c("Stephen King", "J.K. Rowling", "Ovid")) %>% group_by(authors,
    language_code) %>% select(where(is.numeric)) %>% summarize_all(.funs = list(.mean = mean,
.sd = sd, .var = var, .median = median, .min = min, .max = max))
# we should now clean up summary_stat_auth_lang
summary_stat_auth_lang %>% pivot_longer(where(is.numeric), names_to = "variables",
values_to = "values") %>% separate(variables, into = c("variable",
"stat"), sep = "_\\.") %>% pivot_wider(names_from = "stat",
values_from = "values") %>% mutate_if(is.numeric, round,
3) %>% gt()
authors | language_code | variable | mean | sd | var | median | min | max |
---|---|---|---|---|---|---|---|---|
J.K. Rowling | eng | average_rating | 4.546 | 0.128 | 1.600000e-02 | 4.555 | 4.400 | 4.780 |
J.K. Rowling | eng | ratings_count | 865151.200 | 1095578.454 | 1.200292e+12 | 33988.000 | 3129.000 | 2244729.000 |
J.K. Rowling | eng | num_pages | 1003.800 | 1086.934 | 1.181426e+06 | 558.000 | 240.000 | 3342.000 |
J.K. Rowling | eng | text_reviews_count | 13023.000 | 16478.216 | 2.715316e+08 | 857.000 | 139.000 | 36325.000 |
Ovid | eng | average_rating | 4.222 | 0.194 | 3.800000e-02 | 4.190 | 4.020 | 4.545 |
Ovid | eng | ratings_count | 6213.250 | 16639.188 | 2.768626e+08 | 323.500 | 32.000 | 47387.000 |
Ovid | eng | num_pages | 534.000 | 97.860 | 9.576571e+03 | 529.000 | 400.000 | 723.000 |
Ovid | eng | text_reviews_count | 110.500 | 264.793 | 7.011543e+04 | 14.000 | 3.000 | 764.000 |
Stephen King | en-US | average_rating | 3.947 | 0.043 | 2.000000e-03 | 3.930 | 3.915 | 3.995 |
Stephen King | en-US | ratings_count | 1470.000 | 1332.980 | 1.776837e+06 | 1282.000 | 241.000 | 2887.000 |
Stephen King | en-US | num_pages | 360.667 | 30.089 | 9.053330e+02 | 376.000 | 326.000 | 380.000 |
Stephen King | en-US | text_reviews_count | 127.667 | 140.678 | 1.979033e+04 | 84.000 | 14.000 | 285.000 |
Stephen King | eng | average_rating | 4.081 | 0.266 | 7.100000e-02 | 4.110 | 3.585 | 4.535 |
Stephen King | eng | ratings_count | 83279.095 | 204672.983 | 4.189103e+10 | 12729.000 | 281.000 | 944903.000 |
Stephen King | eng | num_pages | 458.238 | 231.374 | 5.353379e+04 | 463.000 | 66.000 | 1096.000 |
Stephen King | eng | text_reviews_count | 1609.286 | 3374.593 | 1.138788e+07 | 227.000 | 7.000 | 15105.000 |
The general trend again holds: average_rating has a much smaller spread than the other variables. The table is short because the selected authors publish under only one or two language codes. Nevertheless, we can conclude that J.K. Rowling’s titles are on average rated higher than Stephen King’s and Ovid’s, though this may be confounded by the fact that her titles have many more ratings than the other authors’ titles.
cor_matrix %>% as.data.frame() %>% rownames_to_column("var1") %>%
pivot_longer(-1, "var2", values_to = "correlation") %>% ggplot(aes(var1,
var2, fill = correlation)) + geom_tile() + scale_fill_gradient2(low = "red",
mid = "white", high = "blue") + geom_text(aes(label = round(correlation,
2)), color = "black", size = 4) + theme(axis.text.x = element_text(angle = 90,
hjust = 1)) + coord_fixed() + xlab("") + ylab("")
The correlation heatmap reveals that, among our numerical variables, only ratings_count and text_reviews_count are strongly correlated (r = 0.88), which makes sense: the more people have read and rated a particular title, the more written reviews it tends to accumulate. On the other hand, all other pairs of variables are weakly correlated, which is somewhat surprising, since one might expect heavily rated titles to skew particularly high (or low) in average rating.
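If we wanted to back “strongly correlated” with a formal test, a sketch (the p-value is not shown in the original analysis):
# Pearson correlation test between the two count variables
cor.test(~ratings_count + text_reviews_count, data = merged_books)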
library(ggExtra)
p1 <- ggplot(merged_books) + geom_point(aes(x = num_pages, y = average_rating,
color = language_code, shape = 1)) + scale_shape_identity() +
scale_color_brewer(palette = "Set1", label = c("eng-Canada",
"eng-Great Britain", "eng-US", "eng", "Multilingual")) +
scale_y_continuous(breaks = seq(0, 5, 0.2)) + scale_x_continuous(breaks = seq(0,
3500, 500)) + labs(color = "Language Codes", x = "Total Pages",
y = "Average Rating", title = "Book Average Rating vs Total Pages") +
theme_minimal() + theme(plot.title = element_text(hjust = 0.5),
legend.position = c(0.8, 0.35))
ggMarginal(p1, type = "densigram")
The marginal distribution of average ratings is roughly normal (symmetric) with a mean around 4.0, which implies that on average the books in this merged dataset are rated rather highly (they are good books!). The marginal distribution of total pages is right-skewed with a median around 300 pages, which makes sense given that books (novels especially, unlike textbooks) are typically short and rarely exceed 1000 pages. The scatterplot does not indicate an apparent relationship between average_rating and total pages, which also makes sense, since quantity does not imply quality. When colored by language code, we see that almost all books are in English (especially the longer ones, since the effort to translate them would be monumental), with each sub-category spanning widely varying ratings and no significant patterns.
months <- c("January", "February", "March", "April", "May", "June",
"July", "August", "September", "October", "Novemember", "December")
# Let's filter down to just a few categories for clearer
# visualization
merged_books %>% filter(categories %in% c("Fiction", "Science",
"Business & Economics", "Computers", "Religion")) %>% mutate(months = factor(months[as.numeric(month)],
levels = months)) %>% ggplot(aes(x = as.numeric(day), y = average_rating,
fill = categories)) + scale_fill_brewer(palette = "Dark2") +
geom_bar(stat = "summary", fun = mean) + facet_wrap(~months) +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
plot.title = element_text(hjust = 0.5)) + labs(title = "Book Ratings per Day of each Month",
x = "Day", y = "Mean Average rating", fill = "Genre") + scale_x_continuous(breaks = c(1,
seq(5, 31, 5)))
Based on the categories selected, it appears that publishers like to release books in these genres on the first day of each month (most likely to capture a full month of sales), given that the tallest stacks of bars usually appear on day 1 of each month, except in August (which happens to be around the start of the school year in the fall). In our dataset, the purple bars appear most often, meaning fiction is the most common category, being published the most. Across each month and day, the mean average_rating of each genre appears roughly similar, which makes sense, since the quality of a category should have no relation to a title’s publication date. The frequency of each category also remains relatively constant throughout each month and day, with the exception of March, which curiously has no computer publications.
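The day-1 release claim can be tallied directly; a sketch (output not shown):
# publications per day of the month, most common first
merged_books %>% count(day = as.numeric(day), sort = TRUE) %>% head()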
library(cluster)
# we should first see how many clusters we should create
pam_dat <- merged_books %>% select(where(is.numeric)) %>% scale %>%
as.data.frame
sil_width <- vector()
for (i in 2:10) {
pam_fit <- pam(pam_dat, k = i)
sil_width[i] <- pam_fit$silinfo$avg.width
}
# visualize our results to select k
ggplot() + geom_line(aes(x = 1:10, y = sil_width)) + scale_x_continuous(name = "Number of Clusters (k)",
breaks = 1:10) + labs(title = "Average Silhouette Width per Number of Clusters",
y = "Average Silhouette Width") + theme(plot.title = element_text(hjust = 0.5))
Based on this silhouette width plot, we should select three clusters, since that is the number of clusters yielding the highest average silhouette width, meaning the resulting clusters should be more cohesive, distinct, and parsimonious than any other choice of k.
pam_res <- pam_dat %>% pam(3)  # avoid masking cluster::pam with the result
# now that we ran PAM, save the cluster assignment
pam_dat <- pam_dat %>% mutate(cluster = as.factor(pam_res$clustering))
library(GGally)
ggpairs(pam_dat, columns = 1:4, aes(color = cluster))
In the first column, particularly the num_pages vs average_rating scatterplot, we see three distinct clusters: the red cluster consists of titles with relatively many pages, while the blue and green clusters separate the shorter titles into low and high average_ratings respectively (split around z = 0, the overall mean). The same trend appears in the other panels: the blue cluster mainly consists of low-rated, short books, the red cluster of long books, and the green cluster of highly rated, short books. The clusters overlap considerably along text_reviews_count and ratings_count, which are highly correlated with each other and thus hard to differentiate on.
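To characterize the clusters numerically, a sketch summarizing each cluster’s centroid in z-score units (cluster labels may differ between runs):
# mean z-score of each variable within each cluster
pam_dat %>% group_by(cluster) %>% summarize(across(where(is.numeric), mean))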
# Interpreting Average Silhouette Width
plot(silhouette(pam_res), col = 1:3, border = NA)
Our average silhouette width was 0.27, which falls within the 0.26-0.50 band of the standard cutoffs, indicating that the cluster structure is weak and could be artificial.
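For reference, the same number can be pulled straight from the fitted object:
# average silhouette width across all observations (~0.27)
pam_res$silinfo$avg.width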