Introduction

I have always loved reading, so I thought: why not use R and data science to help me decide which book I should read next! After poking around on Kaggle, I found two datasets containing a myriad of information on many titles. The first dataset, by Soumik, originated from scraping the Goodreads API by ISBN and contains the variables ‘bookID’ (a unique integer associated with each book), ‘title’, ‘authors’, ‘average_rating’ (0-5 scale), ‘isbn’, ‘isbn13’, ‘language_code’ (eng or other languages), ‘num_pages’, ‘ratings_count’ (total number of ratings), and ‘text_reviews_count’ (total number of written text reviews). Out of these variables, ‘ratings_count’ and ‘average_rating’ are closely related, since the latter is simply the total ratings score divided by the former. However, we wish to compare average ratings so that book suggestions will not be biased by the number of ratings. Furthermore, I would expect ‘ratings_count’ and ‘text_reviews_count’ to be highly correlated: if a book has many ratings, it must be somewhat popular (or notorious), so it probably also has many written reviews.
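
To make that relationship concrete, here is a minimal sketch (with made-up ratings for a single hypothetical book) of how an average rating is derived from a ratings count:

# hypothetical individual ratings for one book on a 0-5 scale
ratings <- c(5, 4, 4, 3, 5)
ratings_count <- length(ratings)                 # 5
average_rating <- sum(ratings) / ratings_count   # 4.2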

The second dataset, authored by Dylan Castillo, was originally intended to be an extension of Soumik’s dataset, but this time using the Google Books API (crawler) by ISBN. This dataset contains the variables ‘isbn13’, ‘isbn10’, ‘title’, ‘subtitle’, ‘authors’, ‘categories’ (genre information), ‘thumbnail’ (URL of the book cover), ‘description’ (short summary), ‘published_year’, and ‘average_rating’ (0-5 again). As one would expect, this dataset largely overlaps with the first, so when we join we can afford an inner join by ISBN (10 and 13), as both datasets contain essentially the same information for the same titles. For this dataset, I would again expect an association between ‘ratings_count’ and ‘average_rating’, where the latter is the total ratings score normalized by the former. I also speculate that the more ratings a particular book has, the higher its rating will be, given that people cared enough to bestow a rating and a large rating pool can wash out extreme reviews.

Finally, the datasets can be found at:
https://www.kaggle.com/jealousleopard/goodreadsbooks
https://www.kaggle.com/dylanjcastillo/7k-books-with-metadata

Tidying

library(tidyverse)
select <- dplyr::select  # resolve conflict with MASS

# read in datasets
books <- read.csv("books.csv")
books2 <- read.csv("books2.csv")

# first look at dataset
books %>% glimpse()
## Rows: 11,131
## Columns: 12
## $ bookID             <chr> "1", "2", "4", "5", "8", "9", "10", "12", "13", "14~
## $ title              <chr> "Harry Potter and the Half-Blood Prince (Harry Pott~
## $ authors            <chr> "J.K. Rowling/Mary GrandPré", "J.K. Rowling/Mary G~
## $ average_rating     <chr> "4.57", "4.49", "4.42", "4.56", "4.78", "3.74", "4.~
## $ isbn               <chr> "0439785960", "0439358078", "0439554896", "04396554~
## $ isbn13             <chr> "9780439785969", "9780439358071", "9780439554893", ~
## $ language_code      <chr> "eng", "eng", "eng", "eng", "eng", "en-US", "eng", ~
## $ num_pages          <chr> "652", "870", "352", "435", "2690", "152", "3342", ~
## $ ratings_count      <int> 2095690, 2153167, 6333, 2339585, 41428, 19, 28242, ~
## $ text_reviews_count <int> 27591, 29221, 244, 36325, 164, 1, 808, 254, 4080, 4~
## $ publication_date   <chr> "9/16/2006", "9/1/2004", "11/1/2003", "5/1/2004", "~
## $ publisher          <chr> "Scholastic Inc.", "Scholastic Inc.", "Scholastic",~
books2 %>% glimpse()
## Rows: 6,810
## Columns: 12
## $ isbn13         <dbl> 9.780002e+12, 9.780002e+12, 9.780006e+12, 9.780006e+12,~
## $ isbn10         <chr> "0002005883", "0002261987", "0006163831", "0006178731",~
## $ title          <chr> "Gilead", "Spider's Web", "The One Tree", "Rage of ange~
## $ subtitle       <chr> "", "A Novel", "", "", "", "", "", "A History of the In~
## $ authors        <chr> "Marilynne Robinson", "Charles Osborne;Agatha Christie"~
## $ categories     <chr> "Fiction", "Detective and mystery stories", "American f~
## $ thumbnail      <chr> "http://books.google.com/books/content?id=KQZCPgAACAAJ&~
## $ description    <chr> "A NOVEL THAT READERS and critics have been eagerly ant~
## $ published_year <int> 2004, 2000, 1982, 1993, 2002, 2002, 1977, 1998, 1994, 1~
## $ average_rating <dbl> 3.85, 3.83, 3.97, 3.93, 4.15, 4.09, 4.27, 4.41, 4.15, 4~
## $ num_pages      <int> 247, 241, 479, 512, 170, 176, 560, 608, 743, 489, 501, ~
## $ ratings_count  <int> 361, 5164, 172, 29532, 33684, 37569, 3975, 65, 103, 435~

The datasets appear tidy and do not require pivot functions, which will instead be used to clean up summary statistics later. However, after some more in-depth visual inspection, I noticed a few issues. In particular, some observations (books) seem to have information keyed in wrong, some field orders got scrambled (leaving NAs in one variable and obviously false information in another), some columns across the two datasets carry the same information, and some variables are not the correct type. We fix these small issues below; when all is done, we have lost 5 observations from the first dataset.

# fix some obvious data issues: rows with NAs were keyed in wrong, so
# for simplicity let's drop them; we also don't need the bookID field
books <- books %>% na.omit() %>% select(-bookID)

# in books, average_rating and num_pages were read in as character;
# both should be numeric
books <- books %>% mutate(average_rating = as.numeric(average_rating), 
    num_pages = as.numeric(num_pages))
# keep only the primary author (i.e. drop everything after the first '/')
books <- books %>% separate(authors, into = "authors", sep = "/", extra = "drop")

# isbn13 in books2 was read as a double; make it a character vector
books2 <- books2 %>% mutate(isbn13 = as.character(isbn13))
# drop the columns books2 shares with books: no good way to combine them
books2 <- books2 %>% select(-c("title", "authors", "published_year"))
# the datasets seem tidy now, but let's make sure: investigate whether
# each book (observation) is uniquely identified by its ISBN10 and
# ISBN13 in each dataset
books %>% select(isbn, isbn13) %>% summarize_all(n_distinct)
##    isbn isbn13
## 1 11126  11127
books2 %>% select(isbn10, isbn13) %>% summarize_all(n_distinct)
##   isbn10 isbn13
## 1   6810   6810
# books2 is tidy: every book has a unique isbn10 and isbn13. However,
# there seems to be a duplicate isbn in books, so we simply exclude it
books <- books %>% filter(!duplicated(isbn))

Joining/Merging

After contemplation, I decided to perform an inner join. For one, the second dataset is an extension of the first, which means we should retain a fairly large number of shared observations (5,686 in total). This way, we also don’t have to worry about managing possible NAs in our merged dataset. The problem comes when deciding which columns to join on, given that the two datasets share a decent number of similar variables (i.e. authors, isbn, title, average_rating, num_pages, ratings_count). In fact, values in these shared columns aren’t exactly the same, so I selected only the ISBN columns (isbn = isbn10, isbn13) to join on, since that is the universal convention for uniquely identifying a book.

# we can now join our dataset!
merged_books <- books %>% inner_join(books2, by = c(isbn = "isbn10", 
    isbn13 = "isbn13"))
merged_books %>% glimpse()
## Rows: 5,686
## Columns: 18
## $ title              <chr> "Harry Potter and the Half-Blood Prince (Harry Pott~
## $ authors            <chr> "J.K. Rowling", "J.K. Rowling", "J.K. Rowling", "J.~
## $ average_rating.x   <dbl> 4.57, 4.49, 4.42, 4.56, 4.78, 3.74, 4.73, 4.38, 4.2~
## $ isbn               <chr> "0439785960", "0439358078", "0439554896", "04396554~
## $ isbn13             <chr> "9780439785969", "9780439358071", "9780439554893", ~
## $ language_code      <chr> "eng", "eng", "eng", "eng", "eng", "en-US", "eng", ~
## $ num_pages.x        <dbl> 652, 870, 352, 435, 2690, 152, 3342, 815, 215, 815,~
## $ ratings_count.x    <int> 2095690, 2153167, 6333, 2339585, 41428, 19, 28242, ~
## $ text_reviews_count <int> 27591, 29221, 244, 36325, 164, 1, 808, 4080, 460, 1~
## $ publication_date   <chr> "9/16/2006", "9/1/2004", "11/1/2003", "5/1/2004", "~
## $ publisher          <chr> "Scholastic Inc.", "Scholastic Inc.", "Scholastic",~
## $ subtitle           <chr> "", "", "", "", "5 Years of Magic, Adventure, and M~
## $ categories         <chr> "Juvenile Fiction", "Juvenile Fiction", "Juvenile F~
## $ thumbnail          <chr> "http://books.google.com/books/content?id=QzI0BgAAQ~
## $ description        <chr> "When Harry Potter and the Half-Blood Prince opens,~
## $ average_rating.y   <dbl> 4.56, 4.49, 4.41, 4.55, 4.78, 3.69, 4.73, 4.38, 4.2~
## $ num_pages.y        <int> 652, 870, 352, 435, 2690, 152, 3342, 815, 215, 815,~
## $ ratings_count.y    <int> 1944099, 1996446, 6267, 2149872, 38872, 18, 27410, ~

After joining, we lost 5,440 observations (about half). This could mean that we have lost some levels of our categorical variables (e.g. we may no longer have non-English books), but it turns out not to be a big problem given that our categorical data has many levels.
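
If we wanted to confirm which levels were actually dropped, a quick check like the following sketch (not part of the pipeline above) would show the language codes that did not survive the join:

# language codes present before the join but absent after it
setdiff(unique(books$language_code), unique(merged_books$language_code))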

To simplify information, I decided that for the shared character variables (i.e. title, authors, and published_year, which can be derived from publication_date), we simply keep the first dataset’s columns and drop the corresponding columns from the second, since they describe the same information and there is no good way to combine them. For the shared numerical variables (i.e. average_rating, num_pages, ratings_count), I decided to take the average of the two versions after joining and then delete both original columns.

# create combined columns as the average of each .x/.y pair, taking the
# ceiling for the count variables; average_rating stays continuous
merged_books <- merged_books %>% mutate(average_rating = (average_rating.x + 
    average_rating.y)/2, num_pages = ceiling((num_pages.x + num_pages.y)/2), 
    ratings_count = ceiling((ratings_count.x + ratings_count.y)/2))

# we can now drop the .x and .y columns altogether
merged_books <- merged_books %>% select(-(contains(".")))

# finally, move some columns around for easy viewing!
merged_books <- merged_books %>% relocate(title, authors, average_rating, 
    ratings_count, num_pages, language_code, categories, isbn, 
    isbn13)

merged_books %>% glimpse()
## Rows: 5,686
## Columns: 15
## $ title              <chr> "Harry Potter and the Half-Blood Prince (Harry Pott~
## $ authors            <chr> "J.K. Rowling", "J.K. Rowling", "J.K. Rowling", "J.~
## $ average_rating     <dbl> 4.565, 4.490, 4.415, 4.555, 4.780, 3.715, 4.730, 4.~
## $ ratings_count      <dbl> 2019895, 2074807, 6300, 2244729, 40150, 19, 27826, ~
## $ num_pages          <dbl> 652, 870, 352, 435, 2690, 152, 3342, 815, 215, 815,~
## $ language_code      <chr> "eng", "eng", "eng", "eng", "eng", "en-US", "eng", ~
## $ categories         <chr> "Juvenile Fiction", "Juvenile Fiction", "Juvenile F~
## $ isbn               <chr> "0439785960", "0439358078", "0439554896", "04396554~
## $ isbn13             <chr> "9780439785969", "9780439358071", "9780439554893", ~
## $ text_reviews_count <int> 27591, 29221, 244, 36325, 164, 1, 808, 4080, 460, 1~
## $ publication_date   <chr> "9/16/2006", "9/1/2004", "11/1/2003", "5/1/2004", "~
## $ publisher          <chr> "Scholastic Inc.", "Scholastic Inc.", "Scholastic",~
## $ subtitle           <chr> "", "", "", "", "5 Years of Magic, Adventure, and M~
## $ thumbnail          <chr> "http://books.google.com/books/content?id=QzI0BgAAQ~
## $ description        <chr> "When Harry Potter and the Half-Blood Prince opens,~

Wrangling

library(gt)  # tables
# time for some exploration! first, let's separate publication_date
# into new month, day, and year columns
merged_books <- merged_books %>% separate(publication_date, into = c("month", 
    "day", "year"))
merged_books %>% glimpse()
## Rows: 5,686
## Columns: 17
## $ title              <chr> "Harry Potter and the Half-Blood Prince (Harry Pott~
## $ authors            <chr> "J.K. Rowling", "J.K. Rowling", "J.K. Rowling", "J.~
## $ average_rating     <dbl> 4.565, 4.490, 4.415, 4.555, 4.780, 3.715, 4.730, 4.~
## $ ratings_count      <dbl> 2019895, 2074807, 6300, 2244729, 40150, 19, 27826, ~
## $ num_pages          <dbl> 652, 870, 352, 435, 2690, 152, 3342, 815, 215, 815,~
## $ language_code      <chr> "eng", "eng", "eng", "eng", "eng", "en-US", "eng", ~
## $ categories         <chr> "Juvenile Fiction", "Juvenile Fiction", "Juvenile F~
## $ isbn               <chr> "0439785960", "0439358078", "0439554896", "04396554~
## $ isbn13             <chr> "9780439785969", "9780439358071", "9780439554893", ~
## $ text_reviews_count <int> 27591, 29221, 244, 36325, 164, 1, 808, 4080, 460, 1~
## $ month              <chr> "9", "9", "11", "5", "9", "4", "9", "4", "8", "1", ~
## $ day                <chr> "16", "1", "1", "1", "13", "26", "12", "30", "3", "~
## $ year               <chr> "2006", "2004", "2003", "2004", "2004", "2005", "20~
## $ publisher          <chr> "Scholastic Inc.", "Scholastic Inc.", "Scholastic",~
## $ subtitle           <chr> "", "", "", "", "5 Years of Magic, Adventure, and M~
## $ thumbnail          <chr> "http://books.google.com/books/content?id=QzI0BgAAQ~
## $ description        <chr> "When Harry Potter and the Half-Blood Prince opens,~
# which book has the highest average_rating with at least 100 ratings?
# The Complete Calvin and Hobbes!
merged_books %>% filter(ratings_count >= 100) %>% arrange(desc(average_rating)) %>% 
    select(title) %>% slice(1) %>% .$title
## [1] "The Complete Calvin and Hobbes"
# which J.K. Rowling book has the highest rating? A Harry Potter box set!
merged_books %>% filter(str_detect(authors, "J.K. Rowling")) %>% 
    arrange(desc(average_rating)) %>% slice(1) %>% .$title
## [1] "Harry Potter Boxed Set  Books 1-5 (Harry Potter  #1-5)"

The Complete Calvin and Hobbes appears to be the best-rated book with at least 100 ratings (a childhood classic!). The highest-rated title from J.K. Rowling is a Harry Potter box set, no surprise there!

# show the correlation matrix between the numerical variables;
# interpretation can be found under the Visualizations section
cor_matrix <- merged_books %>% select(is.numeric) %>% cor() %>% 
    round(3)
colnames(cor_matrix) %>% cbind(cor_matrix) %>% as.data.frame %>% 
    gt()
                    average_rating  ratings_count  num_pages  text_reviews_count
average_rating               1.000          0.043      0.180               0.037
ratings_count                0.043          1.000      0.041               0.876
num_pages                    0.180          0.041      1.000               0.040
text_reviews_count           0.037          0.876      0.040               1.000
# show the number of distinct values for each non-numerical variable
merged_books %>% select(is.character) %>% dplyr::summarise_all(.funs = list(.distinct = n_distinct)) %>% 
    pivot_longer(everything(), names_to = "variables", values_to = "distinct_count") %>% 
    separate(variables, into = c("variable"), sep = "_\\.") %>% 
    gt()
variable       distinct_count
title                    5528
authors                  2843
language_code               5
categories                406
isbn                     5686
isbn13                   5686
month                      12
day                        31
year                       77
publisher                1373
subtitle                 1767
thumbnail                5474
description              5478

Expectedly, we have 12 unique values for month (1-12) and 31 for day (1-31). However, we see that even though each book has a unique isbn and isbn13, there are some duplicated titles (most likely different versions/reprints). We have only 5 language codes, which indicates that many levels were lost when we dropped the NA entries and performed the inner join. The thumbnails and descriptions are largely unique, while there is also a surprising number of unique publishers and categories.
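
To sanity-check the reprint explanation, we could peek at a few of the repeated titles; a small sketch:

# titles that appear more than once, with their counts
merged_books %>% count(title, sort = TRUE) %>% filter(n > 1) %>% head()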

# calculate ungrouped summary stats (mean, sd, var, median, min, max)
summary_stat <- merged_books %>% select(is.numeric) %>% summarize_all(.funs = list(.mean = mean, 
    .sd = sd, .var = var, .median = median, .min = min, .max = max), 
    na.rm = T)

# clean up summary_stat using pivot functions: first stuff all stats
# and their values into long format, then separate the statistic name
# from the variable name, and finally spread each stat function into
# its own column for table viewing
summary_stat %>% round(3) %>% pivot_longer(everything(), names_to = "variables", 
    values_to = "values") %>% separate(variables, into = c("variable", 
    "stat"), sep = "_\\.") %>% pivot_wider(names_from = "stat", 
    values_from = "values") %>% gt()
variable                 mean          sd           var   median  min      max
average_rating          3.935       0.329  1.080000e-01     3.96    0        5
ratings_count       20125.570  128461.947  1.650247e+10  1031.50    0  4482504
num_pages             343.568     230.518  5.313871e+04   304.00    0     3342
text_reviews_count    617.817    2889.346  8.348322e+06    60.00    0    94265

On average, our merged dataset has a relatively high average_rating at ~4, and given that sd = 0.329, roughly 95% of our titles should have an average_rating within [3.3, 4.6], an “above average” rating. The median of average_rating is very similar to the mean, which indicates there are no extreme outliers pulling the mean (given the restricted 0-5 range). The min and max logically follow as 0 and 5. The other metrics, in contrast, have much larger spreads and conflicting mean and median values, indicative of outliers (i.e. very long or very popular books), so we cannot generalize about them as easily.
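
The interval above is the usual mean ± 2·sd rule of thumb (assuming approximate normality); a quick sketch to reproduce it:

# two-standard-deviation interval for average_rating
m <- mean(merged_books$average_rating)
s <- sd(merged_books$average_rating)
c(lower = m - 2 * s, upper = m + 2 * s)  # roughly 3.3 to 4.6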

# calculate summary stats, group by language_code
summary_stat_lang <- merged_books %>% group_by(language_code) %>% 
    select(is.numeric) %>% summarize_all(.funs = list(.mean = mean, 
    .sd = sd, .var = var, .median = median, .min = min, .max = max))

# we should now clean up summary_stat_lang
summary_stat_lang %>% pivot_longer(is.numeric, names_to = "variables", 
    values_to = "values") %>% separate(variables, into = c("variable", 
    "stat"), sep = "_\\.") %>% pivot_wider(names_from = "stat", 
    values_from = "values") %>% mutate_if(is.numeric, round, 
    3) %>% gt()
language_code  variable                  mean          sd           var    median       min          max
en-CA          average_rating           4.091       0.120  1.400000e-02     4.105     3.945        4.210
en-CA          ratings_count         4119.250    1434.519  2.057846e+06  4661.500  2043.000     5111.000
en-CA          num_pages              223.750      43.531  1.894917e+03   239.500   160.000      256.000
en-CA          text_reviews_count     298.750     163.732  2.680825e+04   267.000   155.000      506.000
en-GB          average_rating           3.915       0.299  9.000000e-02     3.940     2.670        4.535
en-GB          ratings_count         2838.202   11426.034  1.305543e+08   203.500     3.000   100729.000
en-GB          num_pages              323.070     211.776  4.484897e+04   284.500    32.000     1262.000
en-GB          text_reviews_count     121.746     396.289  1.570449e+05    15.000     0.000     3838.000
en-US          average_rating           3.919       0.281  7.900000e-02     3.950     2.670        4.880
en-US          ratings_count         3997.590   12032.907  1.447909e+08   618.000     2.000   189065.000
en-US          num_pages              336.759     225.169  5.070120e+04   307.000    24.000     2480.000
en-US          text_reviews_count     174.386     475.223  2.258366e+05    38.000     0.000     6599.000
eng            average_rating           3.938       0.337  1.140000e-01     3.960     0.000        5.000
eng            ratings_count        23255.991  139957.894  1.958821e+10  1183.000     0.000  4482504.000
eng            num_pages              345.233     231.921  5.378729e+04   304.000     0.000     3342.000
eng            text_reviews_count     704.307    3141.016  9.865984e+06    67.000     0.000    94265.000
mul            average_rating           4.235       0.332  1.100000e-01     4.235     4.000        4.470
mul            ratings_count           29.000       1.414  2.000000e+00    29.000    28.000       30.000
mul            num_pages              496.000      45.255  2.048000e+03   496.000   464.000      528.000
mul            text_reviews_count       3.000       1.414  2.000000e+00     3.000     2.000        4.000

When categorized by language code, the overall trends still hold. However, we note that the multilingual (‘mul’) books in particular have a higher average_rating and much smaller ratings_count/text_reviews_count than the English books. This is probably due first to their much smaller sample size, and also to the fact that it requires substantial knowledge to read and review books in a different language; yet such books are often of high quality given the substantial translation work behind them.
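
The small-sample-size point is easy to verify with a quick tally (a sketch, not run above):

# how many books fall under each language code?
merged_books %>% count(language_code, sort = TRUE)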

# calculate summary stats, grouped by authors and language_code.
# there are too many authors, so let's just pick 3 of my favorites!
summary_stat_auth_lang <- merged_books %>% filter(authors %in% 
    c("Stephen King", "J.K. Rowling", "Ovid")) %>% group_by(authors, 
    language_code) %>% select(is.numeric) %>% summarize_all(.funs = list(.mean = mean, 
    .sd = sd, .var = var, .median = median, .min = min, .max = max))

# we should now clean up summary_stat_auth_lang
summary_stat_auth_lang %>% pivot_longer(is.numeric, names_to = "variables", 
    values_to = "values") %>% separate(variables, into = c("variable", 
    "stat"), sep = "_\\.") %>% pivot_wider(names_from = "stat", 
    values_from = "values") %>% mutate_if(is.numeric, round, 
    3) %>% gt()
language_code  variable                  mean           sd           var     median       min          max
J.K. Rowling
eng            average_rating           4.546        0.128  1.600000e-02      4.555     4.400        4.780
eng            ratings_count       865151.200  1095578.454  1.200292e+12  33988.000  3129.000  2244729.000
eng            num_pages             1003.800     1086.934  1.181426e+06    558.000   240.000     3342.000
eng            text_reviews_count   13023.000    16478.216  2.715316e+08    857.000   139.000    36325.000
Ovid
eng            average_rating           4.222        0.194  3.800000e-02      4.190     4.020        4.545
eng            ratings_count         6213.250    16639.188  2.768626e+08    323.500    32.000    47387.000
eng            num_pages              534.000       97.860  9.576571e+03    529.000   400.000      723.000
eng            text_reviews_count     110.500      264.793  7.011543e+04     14.000     3.000      764.000
Stephen King
en-US          average_rating           3.947        0.043  2.000000e-03      3.930     3.915        3.995
en-US          ratings_count         1470.000     1332.980  1.776837e+06   1282.000   241.000     2887.000
en-US          num_pages              360.667       30.089  9.053330e+02    376.000   326.000      380.000
en-US          text_reviews_count     127.667      140.678  1.979033e+04     84.000    14.000      285.000
eng            average_rating           4.081        0.266  7.100000e-02      4.110     3.585        4.535
eng            ratings_count        83279.095   204672.983  4.189103e+10  12729.000   281.000   944903.000
eng            num_pages              458.238      231.374  5.353379e+04    463.000    66.000     1096.000
eng            text_reviews_count    1609.286     3374.593  1.138788e+07    227.000     7.000    15105.000

The general trend again holds: average_rating has a much smaller spread than the other variables. The table is short because the selected authors each publish in only one or two language codes. Nevertheless, we can conclude that J.K. Rowling’s titles are on average rated higher than Stephen King’s and Ovid’s, though this may be confounded by the fact that her titles have many more ratings than those of the other authors.

Visualizations

cor_matrix %>% as.data.frame() %>% rownames_to_column("var1") %>% 
    pivot_longer(-1, names_to = "var2", values_to = "correlation") %>% ggplot(aes(var1, 
    var2, fill = correlation)) + geom_tile() + scale_fill_gradient2(low = "red", 
    mid = "white", high = "blue") + geom_text(aes(label = round(correlation, 
    2)), color = "black", size = 4) + theme(axis.text.x = element_text(angle = 90, 
    hjust = 1)) + coord_fixed() + xlab("") + ylab("")

The correlation heatmap reveals that among our numerical variables, only ratings_count and text_reviews_count are strongly correlated, which makes sense logically: the more people have read and rated a particular title, the more written reviews one would expect it to attract. On the other hand, the remaining pairs of variables are only weakly correlated, which is somewhat surprising, since one might expect widely rated titles to skew particularly high or low in rating.

library(ggExtra)
p1 <- ggplot(merged_books) + geom_point(aes(x = num_pages, y = average_rating, 
    color = language_code, shape = 1)) + scale_shape_identity() + 
    scale_color_brewer(palette = "Set1", label = c("eng-Canada", 
        "eng-Great Britain", "eng-US", "eng", "Multilingual")) + 
    scale_y_continuous(breaks = seq(0, 5, 0.2)) + scale_x_continuous(breaks = seq(0, 
    3500, 500)) + labs(color = "Language Codes", x = "Total Pages", 
    y = "Average Rating", title = "Book Average Rating vs Total Pages") + 
    theme_minimal() + theme(plot.title = element_text(hjust = 0.5), 
    legend.position = c(0.8, 0.35))
ggMarginal(p1, type = "densigram")

The marginal distribution of average ratings is roughly normal (symmetric) with a mean around 4.0, which implies that on average the books in this merged dataset are rated relatively highly (they are good books!). The marginal distribution of total pages is right-skewed with a mean around 300 pages, which makes sense given that books, especially novels (textbooks aside), are typically short and rarely exceed 1,000 pages. The scatterplot does not indicate an apparent relationship between average_rating and total pages, which also makes sense, since quantity does not imply quality. When colored by language code, we see that almost all books are in English (especially the longer ones, since the effort to translate them would be monumental), with each sub-category showing widely varying ratings and no significant patterns.

months <- c("January", "February", "March", "April", "May", "June", 
    "July", "August", "September", "October", "Novemember", "December")
# Let's filter down to just a few categories for clearer
# visualization
merged_books %>% filter(categories %in% c("Fiction", "Science", 
    "Business & Economics", "Computers", "Religion")) %>% mutate(months = factor(months[as.numeric(month)], 
    levels = months)) %>% ggplot(aes(x = as.numeric(day), y = average_rating, 
    fill = categories)) + scale_fill_brewer(palette = "Dark2") + 
    geom_bar(stat = "summary", fun = mean) + facet_wrap(~months) + 
    theme(axis.text.x = element_text(angle = 45, hjust = 1), 
        plot.title = element_text(hjust = 0.5)) + labs(title = "Book Ratings per Day of each Month", 
    x = "Day", y = "Mean Average rating", fill = "Genre") + scale_x_continuous(breaks = c(1, 
    seq(5, 31, 5)))

Based on the categories of books selected, it appears that authors like to release books in these genres on the first day of each month (most likely to maximize first-month sales), given that the tallest stacks of bars usually fall on day 1 of each month, except in August (which happens to be around the start of the school year). In our dataset, the purple bars appear most often, meaning fiction is the most commonly published category. Across each month and day, the mean average_rating of each genre appears roughly similar, which makes sense, since the quality of a particular category should bear no relation to a title’s publication date. The frequency of each category also remains relatively constant throughout each month and day, with the exception of March, which curiously has no Computers publications.
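
The “day 1” claim can also be checked directly with a quick tally (a sketch, separate from the plot above):

# number of books published on each day of the month, most common first
merged_books %>% count(day, sort = TRUE) %>% head()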

Dimensionality Reduction

library(cluster)
# we should first see how many clusters we should create
pam_dat <- merged_books %>% select(is.numeric) %>% scale %>% 
    as.data.frame
sil_width <- vector()
for (i in 2:10) {
    pam_fit <- pam(pam_dat, k = i)
    sil_width[i] <- pam_fit$silinfo$avg.width
}

# visualize our results to select k
ggplot() + geom_line(aes(x = 1:10, y = sil_width)) + scale_x_continuous(name = "Number of Clusters (k)", 
    breaks = 1:10) + labs(title = "Average Silhouette Width per Number of Clusters", 
    y = "Average Silhouette Width") + theme(plot.title = element_text(hjust = 0.5))

Based on this silhouette width plot, we should select three clusters, since that is the number that yields the highest average silhouette width, meaning our 3 clusters should be more cohesive and distinct (while staying parsimonious) than any other number of clusters.

pam <- pam_dat %>% pam(3)
# now that we ran PAM, save the cluster assignment
pam_dat <- pam_dat %>% mutate(cluster = as.factor(pam$clustering))

library(GGally)
ggpairs(pam_dat, columns = 1:4, aes(color = cluster))

In the first column, particularly the num_pages vs average_rating scatterplot, we see three distinct clusters: the red cluster seems to consist of titles with relatively large page counts, while the blue and green clusters separate out titles with low page counts but low and high average ratings, respectively (around z = 0, the overall mean). The same trend appears in the other panels, where the blue cluster mainly consists of low-rated, short books, the red cluster of long books, and the green cluster of highly rated, short books. The clusters overlap heavily with respect to text_reviews_count and ratings_count, which are highly correlated with each other and thus hard to differentiate.
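
Since PAM medoids are actual observations, we can also look up which books “represent” each cluster; a hedged sketch using the fitted pam object:

# the rows of merged_books corresponding to the three cluster medoids
merged_books[pam$id.med, ] %>% select(title, average_rating, num_pages, 
    ratings_count, text_reviews_count)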

# Interpreting Average Silhouette Width
plot(silhouette(pam), col = 1:3, border = NA)

Our average silhouette width was 0.27, which falls within the 0.26-0.50 band of the common interpretation cutoffs, indicating that the cluster structure is weak and possibly artificial.
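
For reference, the 0.27 figure can be pulled straight from the fitted object:

# average silhouette width stored by cluster::pam
pam$silinfo$avg.width  # ~0.27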