Introduction

I have always loved reading, so I thought: why not use R and data science to help me decide which book I should read next! After poking around on Kaggle, I found two datasets containing a myriad of information on many titles. The first dataset, by Soumik, originated from scraping the Goodreads API by ISBN and contains the variables ‘bookID’ (a unique integer associated with each book), ‘title’, ‘authors’, ‘average_rating’ (0-5 scale), ‘isbn’, ‘isbn13’, ‘language_code’ (eng or other languages), ‘num_pages’, ‘ratings_count’ (total number of ratings), and ‘text_reviews_count’ (total number of written text reviews). Out of these variables, ‘ratings_count’ and ‘average_rating’ are closely related, since the latter is simply the total ratings score divided by the former. However, we wish to compare average ratings so that book suggestions will not be biased by the number of ratings. Furthermore, I would expect ‘ratings_count’ and ‘text_reviews_count’ to be highly correlated: if a book has many ratings, it must be somewhat popular (or notorious), so it probably also has many written reviews.
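
To make that relationship concrete, here is a minimal sketch (with made-up ratings for a single hypothetical book) of how an average rating is derived from a ratings count:

# hypothetical individual ratings for one book on a 0-5 scale
ratings <- c(5, 4, 4, 3, 5)
ratings_count <- length(ratings)                 # 5
average_rating <- sum(ratings) / ratings_count   # 4.2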

The second dataset, authored by Dylan Castillo, was originally intended to be an extension of Soumik’s dataset, but this time using the Google Books API (crawler) by ISBN. This dataset contains the variables ‘isbn13’, ‘isbn10’, ‘title’, ‘subtitle’, ‘authors’, ‘categories’ (genre information), ‘thumbnail’ (URL of the book cover), ‘description’ (short summary), ‘published_year’, and ‘average_rating’ (0-5 again). As one would expect, this dataset largely overlaps with the first, so when we join we can afford an inner join by ISBN (10 and 13), as both datasets contain essentially the same information for the same titles. For this dataset, I would again expect an association between ‘ratings_count’ and ‘average_rating’, where the latter is the total ratings score normalized by the former. I also speculate that the more ratings a particular book has, the higher its rating will be, given that people cared enough to bestow a rating and a large rating pool can wash out extreme reviews.

Finally, the datasets can be found at:
https://www.kaggle.com/jealousleopard/goodreadsbooks
https://www.kaggle.com/dylanjcastillo/7k-books-with-metadata

Tidying

library(tidyverse)
select <- dplyr::select  # resolve conflict with MASS

# read in datasets
books <- read.csv("books.csv")
books2 <- read.csv("books2.csv")

# first look at dataset
books %>% glimpse()
## Rows: 11,131
## Columns: 12
## $ bookID             <chr> "1", "2", "4", "5", "8", "9", "10", "12", "13", "14~
## $ title              <chr> "Harry Potter and the Half-Blood Prince (Harry Pott~
## $ authors            <chr> "J.K. Rowling/Mary GrandPré", "J.K. Rowling/Mary G~
## $ average_rating     <chr> "4.57", "4.49", "4.42", "4.56", "4.78", "3.74", "4.~
## $ isbn               <chr> "0439785960", "0439358078", "0439554896", "04396554~
## $ isbn13             <chr> "9780439785969", "9780439358071", "9780439554893", ~
## $ language_code      <chr> "eng", "eng", "eng", "eng", "eng", "en-US", "eng", ~
## $ num_pages          <chr> "652", "870", "352", "435", "2690", "152", "3342", ~
## $ ratings_count      <int> 2095690, 2153167, 6333, 2339585, 41428, 19, 28242, ~
## $ text_reviews_count <int> 27591, 29221, 244, 36325, 164, 1, 808, 254, 4080, 4~
## $ publication_date   <chr> "9/16/2006", "9/1/2004", "11/1/2003", "5/1/2004", "~
## $ publisher          <chr> "Scholastic Inc.", "Scholastic Inc.", "Scholastic",~
books2 %>% glimpse()
## Rows: 6,810
## Columns: 12
## $ isbn13         <dbl> 9.780002e+12, 9.780002e+12, 9.780006e+12, 9.780006e+12,~
## $ isbn10         <chr> "0002005883", "0002261987", "0006163831", "0006178731",~
## $ title          <chr> "Gilead", "Spider's Web", "The One Tree", "Rage of ange~
## $ subtitle       <chr> "", "A Novel", "", "", "", "", "", "A History of the In~
## $ authors        <chr> "Marilynne Robinson", "Charles Osborne;Agatha Christie"~
## $ categories     <chr> "Fiction", "Detective and mystery stories", "American f~
## $ thumbnail      <chr> "http://books.google.com/books/content?id=KQZCPgAACAAJ&~
## $ description    <chr> "A NOVEL THAT READERS and critics have been eagerly ant~
## $ published_year <int> 2004, 2000, 1982, 1993, 2002, 2002, 1977, 1998, 1994, 1~
## $ average_rating <dbl> 3.85, 3.83, 3.97, 3.93, 4.15, 4.09, 4.27, 4.41, 4.15, 4~
## $ num_pages      <int> 247, 241, 479, 512, 170, 176, 560, 608, 743, 489, 501, ~
## $ ratings_count  <int> 361, 5164, 172, 29532, 33684, 37569, 3975, 65, 103, 435~

The datasets appear tidy and do not require pivot functions, which will instead be used to clean up summary statistics later. However, after some more in-depth visual inspection, I noticed a few issues. In particular, some observations (books) seem to have information keyed in wrong, some field orders got scrambled (leaving NAs in one variable and obviously false information in another), some columns across the two datasets carry the same information, and some variables are not the correct type. We fix these small issues below; when all is done, we have lost 5 observations from the first dataset.

# fix some obvious data issues: rows with NAs were keyed in wrong, so
# for simplicity let's drop them; we also don't need the bookID field
books <- books %>% na.omit() %>% select(-bookID)

# in books, average_rating and num_pages were read in as character;
# both should be numeric
books <- books %>% mutate(average_rating = as.numeric(average_rating), 
    num_pages = as.numeric(num_pages))
# keep only the primary author (i.e. drop everything after the first '/')
books <- books %>% separate(authors, into = "authors", sep = "/", extra = "drop")

# isbn13 in books2 was read as a double; make it a character vector
books2 <- books2 %>% mutate(isbn13 = as.character(isbn13))
# drop the columns books2 shares with books: no good way to combine them
books2 <- books2 %>% select(-c("title", "authors", "published_year"))
# the datasets seem tidy now, but let's make sure: investigate whether
# each book (observation) is uniquely identified by its ISBN10 and
# ISBN13 in each dataset
books %>% select(isbn, isbn13) %>% summarize_all(n_distinct)
##    isbn isbn13
## 1 11126  11127
books2 %>% select(isbn10, isbn13) %>% summarize_all(n_distinct)
##   isbn10 isbn13
## 1   6810   6810
# books2 is tidy: every book has a unique isbn10 and isbn13. However,
# there seems to be a duplicate isbn in books, so we simply exclude it
books <- books %>% filter(!duplicated(isbn))

Joining/Merging

After contemplation, I decided to perform an inner join. For one, the second dataset is an extension of the first, which means we should retain a fairly large number of shared observations (5,686 in total). This way, we also don’t have to worry about managing possible NAs in our merged dataset. The problem comes when deciding which columns to join on, given that the two datasets share a decent number of similar variables (i.e. authors, isbn, title, average_rating, num_pages, ratings_count). In fact, values in these shared columns aren’t exactly the same, so I selected only the ISBN columns (isbn = isbn10, isbn13) to join on, since that is the universal convention for uniquely identifying a book.

# we can now join our dataset!
merged_books <- books %>% inner_join(books2, by = c(isbn = "isbn10", 
    isbn13 = "isbn13"))
merged_books %>% glimpse()
## Rows: 5,686
## Columns: 18
## $ title              <chr> "Harry Potter and the Half-Blood Prince (Harry Pott~
## $ authors            <chr> "J.K. Rowling", "J.K. Rowling", "J.K. Rowling", "J.~
## $ average_rating.x   <dbl> 4.57, 4.49, 4.42, 4.56, 4.78, 3.74, 4.73, 4.38, 4.2~
## $ isbn               <chr> "0439785960", "0439358078", "0439554896", "04396554~
## $ isbn13             <chr> "9780439785969", "9780439358071", "9780439554893", ~
## $ language_code      <chr> "eng", "eng", "eng", "eng", "eng", "en-US", "eng", ~
## $ num_pages.x        <dbl> 652, 870, 352, 435, 2690, 152, 3342, 815, 215, 815,~
## $ ratings_count.x    <int> 2095690, 2153167, 6333, 2339585, 41428, 19, 28242, ~
## $ text_reviews_count <int> 27591, 29221, 244, 36325, 164, 1, 808, 4080, 460, 1~
## $ publication_date   <chr> "9/16/2006", "9/1/2004", "11/1/2003", "5/1/2004", "~
## $ publisher          <chr> "Scholastic Inc.", "Scholastic Inc.", "Scholastic",~
## $ subtitle           <chr> "", "", "", "", "5 Years of Magic, Adventure, and M~
## $ categories         <chr> "Juvenile Fiction", "Juvenile Fiction", "Juvenile F~
## $ thumbnail          <chr> "http://books.google.com/books/content?id=QzI0BgAAQ~
## $ description        <chr> "When Harry Potter and the Half-Blood Prince opens,~
## $ average_rating.y   <dbl> 4.56, 4.49, 4.41, 4.55, 4.78, 3.69, 4.73, 4.38, 4.2~
## $ num_pages.y        <int> 652, 870, 352, 435, 2690, 152, 3342, 815, 215, 815,~
## $ ratings_count.y    <int> 1944099, 1996446, 6267, 2149872, 38872, 18, 27410, ~

After joining, we lost 5,440 observations (about half). This could mean that we have lost some levels of our categorical variables (e.g. we may no longer have non-English books), but it turns out not to be a big problem given that our categorical data has many levels.
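
If we wanted to confirm which levels were actually dropped, a quick check like the following sketch (not part of the pipeline above) would show the language codes that did not survive the join:

# language codes present before the join but absent after it
setdiff(unique(books$language_code), unique(merged_books$language_code))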

To simplify information, I decided that for the shared character variables (i.e. title, authors, and published_year, which can be derived from publication_date), we simply keep the first dataset’s columns and drop the corresponding columns from the second, since they describe the same information and there is no good way to combine them. For the shared numerical variables (i.e. average_rating, num_pages, ratings_count), I decided to take the average of the two versions after joining and then delete both original columns.

# create combined columns as the average of each .x/.y pair, taking the
# ceiling for the count variables; average_rating stays continuous
merged_books <- merged_books %>% mutate(average_rating = (average_rating.x + 
    average_rating.y)/2, num_pages = ceiling((num_pages.x + num_pages.y)/2), 
    ratings_count = ceiling((ratings_count.x + ratings_count.y)/2))

# we can now drop the .x and .y columns altogether
merged_books <- merged_books %>% select(-(contains(".")))

# finally, move some columns around for easy viewing!
merged_books <- merged_books %>% relocate(title, authors, average_rating, 
    ratings_count, num_pages, language_code, categories, isbn, 
    isbn13)

merged_books %>% glimpse()
## Rows: 5,686
## Columns: 15
## $ title              <chr> "Harry Potter and the Half-Blood Prince (Harry Pott~
## $ authors            <chr> "J.K. Rowling", "J.K. Rowling", "J.K. Rowling", "J.~
## $ average_rating     <dbl> 4.565, 4.490, 4.415, 4.555, 4.780, 3.715, 4.730, 4.~
## $ ratings_count      <dbl> 2019895, 2074807, 6300, 2244729, 40150, 19, 27826, ~
## $ num_pages          <dbl> 652, 870, 352, 435, 2690, 152, 3342, 815, 215, 815,~
## $ language_code      <chr> "eng", "eng", "eng", "eng", "eng", "en-US", "eng", ~
## $ categories         <chr> "Juvenile Fiction", "Juvenile Fiction", "Juvenile F~
## $ isbn               <chr> "0439785960", "0439358078", "0439554896", "04396554~
## $ isbn13             <chr> "9780439785969", "9780439358071", "9780439554893", ~
## $ text_reviews_count <int> 27591, 29221, 244, 36325, 164, 1, 808, 4080, 460, 1~
## $ publication_date   <chr> "9/16/2006", "9/1/2004", "11/1/2003", "5/1/2004", "~
## $ publisher          <chr> "Scholastic Inc.", "Scholastic Inc.", "Scholastic",~
## $ subtitle           <chr> "", "", "", "", "5 Years of Magic, Adventure, and M~
## $ thumbnail          <chr> "http://books.google.com/books/content?id=QzI0BgAAQ~
## $ description        <chr> "When Harry Potter and the Half-Blood Prince opens,~

Wrangling

library(gt)  # tables
# time for some exploration! first, let's separate publication_date
# into new month, day, and year columns
merged_books <- merged_books %>% separate(publication_date, into = c("month", 
    "day", "year"))
merged_books %>% glimpse()
## Rows: 5,686
## Columns: 17
## $ title              <chr> "Harry Potter and the Half-Blood Prince (Harry Pott~
## $ authors            <chr> "J.K. Rowling", "J.K. Rowling", "J.K. Rowling", "J.~
## $ average_rating     <dbl> 4.565, 4.490, 4.415, 4.555, 4.780, 3.715, 4.730, 4.~
## $ ratings_count      <dbl> 2019895, 2074807, 6300, 2244729, 40150, 19, 27826, ~
## $ num_pages          <dbl> 652, 870, 352, 435, 2690, 152, 3342, 815, 215, 815,~
## $ language_code      <chr> "eng", "eng", "eng", "eng", "eng", "en-US", "eng", ~
## $ categories         <chr> "Juvenile Fiction", "Juvenile Fiction", "Juvenile F~
## $ isbn               <chr> "0439785960", "0439358078", "0439554896", "04396554~
## $ isbn13             <chr> "9780439785969", "9780439358071", "9780439554893", ~
## $ text_reviews_count <int> 27591, 29221, 244, 36325, 164, 1, 808, 4080, 460, 1~
## $ month              <chr> "9", "9", "11", "5", "9", "4", "9", "4", "8", "1", ~
## $ day                <chr> "16", "1", "1", "1", "13", "26", "12", "30", "3", "~
## $ year               <chr> "2006", "2004", "2003", "2004", "2004", "2005", "20~
## $ publisher          <chr> "Scholastic Inc.", "Scholastic Inc.", "Scholastic",~
## $ subtitle           <chr> "", "", "", "", "5 Years of Magic, Adventure, and M~
## $ thumbnail          <chr> "http://books.google.com/books/content?id=QzI0BgAAQ~
## $ description        <chr> "When Harry Potter and the Half-Blood Prince opens,~
# which book has the highest average_rating with at least 100 ratings?
# The Complete Calvin and Hobbes!
merged_books %>% filter(ratings_count >= 100) %>% arrange(desc(average_rating)) %>% 
    select(title) %>% slice(1) %>% .$title
## [1] "The Complete Calvin and Hobbes"
# which J.K. Rowling book has the highest rating? A Harry Potter box set!
merged_books %>% filter(str_detect(authors, "J.K. Rowling")) %>% 
    arrange(desc(average_rating)) %>% slice(1) %>% .$title
## [1] "Harry Potter Boxed Set  Books 1-5 (Harry Potter  #1-5)"

The Complete Calvin and Hobbes appears to be the best-rated book with at least 100 ratings (a childhood classic!). The highest-rated title from J.K. Rowling is a Harry Potter box set, no surprise there!

# show the correlation matrix between the numerical variables;
# interpretation can be found under the Visualizations section
cor_matrix <- merged_books %>% select(is.numeric) %>% cor() %>% 
    round(3)
colnames(cor_matrix) %>% cbind(cor_matrix) %>% as.data.frame %>% 
    gt()
                    average_rating  ratings_count  num_pages  text_reviews_count
average_rating               1.000          0.043      0.180               0.037
ratings_count                0.043          1.000      0.041               0.876
num_pages                    0.180          0.041      1.000               0.040
text_reviews_count           0.037          0.876      0.040               1.000
# show the number of distinct values for each non-numerical variable
merged_books %>% select(is.character) %>% dplyr::summarise_all(.funs = list(.distinct = n_distinct)) %>% 
    pivot_longer(everything(), names_to = "variables", values_to = "distinct_count") %>% 
    separate(variables, into = c("variable"), sep = "_\\.") %>% 
    gt()
variable       distinct_count
title                    5528
authors                  2843
language_code               5
categories                406
isbn                     5686
isbn13                   5686
month                      12
day                        31
year                       77
publisher                1373
subtitle                 1767
thumbnail                5474
description              5478

Expectedly, we have 12 unique values for month (1-12) and 31 for day (1-31). However, we see that even though each book has a unique isbn and isbn13, there are some duplicated titles (most likely different versions/reprints). We have only 5 language codes, which indicates that many levels were lost when we dropped the NA entries and performed the inner join. The thumbnails and descriptions are largely unique, while there is also a surprising number of unique publishers and categories.
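
To sanity-check the reprint explanation, we could peek at a few of the repeated titles; a small sketch:

# titles that appear more than once, with their counts
merged_books %>% count(title, sort = TRUE) %>% filter(n > 1) %>% head()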

# calculate ungrouped summary stats (mean, sd, var, median, min, max)
summary_stat <- merged_books %>% select(is.numeric) %>% summarize_all(.funs = list(.mean = mean, 
    .sd = sd, .var = var, .median = median, .min = min, .max = max), 
    na.rm = T)

# clean up summary_stat using pivot functions: first stuff all stats
# and their values into long format, then separate the statistic name
# from the variable name, and finally spread each stat function into
# its own column for table viewing
summary_stat %>% round(3) %>% pivot_longer(everything(), names_to = "variables", 
    values_to = "values") %>% separate(variables, into = c("variable", 
    "stat"), sep = "_\\.") %>% pivot_wider(names_from = "stat", 
    values_from = "values") %>% gt()
variable                 mean          sd           var   median  min      max
average_rating          3.935       0.329  1.080000e-01     3.96    0        5
ratings_count       20125.570  128461.947  1.650247e+10  1031.50    0  4482504
num_pages             343.568     230.518  5.313871e+04   304.00    0     3342
text_reviews_count    617.817    2889.346  8.348322e+06    60.00    0    94265

On average, our merged dataset has a relatively high average_rating at ~4, and given that sd = 0.329, roughly 95% of our titles should have an average_rating within [3.3, 4.6], an “above average” rating. The median of average_rating is very similar to the mean, which indicates there are no extreme outliers pulling the mean (given the restricted 0-5 range). The min and max logically follow as 0 and 5. The other metrics, in contrast, have much larger spreads and conflicting mean and median values, indicative of outliers (i.e. very long or very popular books), so we cannot generalize about them as easily.
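
The interval above is the usual mean ± 2·sd rule of thumb (assuming approximate normality); a quick sketch to reproduce it:

# two-standard-deviation interval for average_rating
m <- mean(merged_books$average_rating)
s <- sd(merged_books$average_rating)
c(lower = m - 2 * s, upper = m + 2 * s)  # roughly 3.3 to 4.6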

# calculate summary stats, group by language_code
summary_stat_lang <- merged_books %>% group_by(language_code) %>% 
    select(is.numeric) %>% summarize_all(.funs = list(.mean = mean, 
    .sd = sd, .var = var, .median = median, .min = min, .max = max))

# we should now clean up summary_stat_lang
summary_stat_lang %>% pivot_longer(is.numeric, names_to = "variables", 
    values_to = "values") %>% separate(variables, into = c("variable", 
    "stat"), sep = "_\\.") %>% pivot_wider(names_from = "stat", 
    values_from = "values") %>% mutate_if(is.numeric, round, 
    3) %>% gt()
language_code  variable                  mean          sd           var    median       min          max
en-CA          average_rating           4.091       0.120  1.400000e-02     4.105     3.945        4.210
en-CA          ratings_count         4119.250    1434.519  2.057846e+06  4661.500  2043.000     5111.000
en-CA          num_pages              223.750      43.531  1.894917e+03   239.500   160.000      256.000
en-CA          text_reviews_count     298.750     163.732  2.680825e+04   267.000   155.000      506.000
en-GB          average_rating           3.915       0.299  9.000000e-02     3.940     2.670        4.535
en-GB          ratings_count         2838.202   11426.034  1.305543e+08   203.500     3.000   100729.000
en-GB          num_pages              323.070     211.776  4.484897e+04   284.500    32.000     1262.000
en-GB          text_reviews_count     121.746     396.289  1.570449e+05    15.000     0.000     3838.000
en-US          average_rating           3.919       0.281  7.900000e-02     3.950     2.670        4.880
en-US          ratings_count         3997.590   12032.907  1.447909e+08   618.000     2.000   189065.000
en-US          num_pages              336.759     225.169  5.070120e+04   307.000    24.000     2480.000
en-US          text_reviews_count     174.386     475.223  2.258366e+05    38.000     0.000     6599.000
eng            average_rating           3.938       0.337  1.140000e-01     3.960     0.000        5.000
eng            ratings_count        23255.991  139957.894  1.958821e+10  1183.000     0.000  4482504.000
eng            num_pages              345.233     231.921  5.378729e+04   304.000     0.000     3342.000
eng            text_reviews_count     704.307    3141.016  9.865984e+06    67.000     0.000    94265.000
mul            average_rating           4.235       0.332  1.100000e-01     4.235     4.000        4.470
mul            ratings_count           29.000       1.414  2.000000e+00    29.000    28.000       30.000
mul            num_pages              496.000      45.255  2.048000e+03   496.000   464.000      528.000
mul            text_reviews_count       3.000       1.414  2.000000e+00     3.000     2.000        4.000

When categorized by language code, the overall trends still hold. However, we note that the multilingual (‘mul’) books in particular have a higher average_rating and much smaller ratings_count/text_reviews_count than the English books. This is probably due first to their much smaller sample size, and also to the fact that it requires substantial knowledge to read and review books in a different language; yet such books are often of high quality given the substantial translation work behind them.
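
The small-sample-size point is easy to verify with a quick tally (a sketch, not run above):

# how many books fall under each language code?
merged_books %>% count(language_code, sort = TRUE)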

# calculate summary stats, grouped by authors and language_code.
# there are too many authors, so let's just pick 3 of my favorites!
summary_stat_auth_lang <- merged_books %>% filter(authors %in% 
    c("Stephen King", "J.K. Rowling", "Ovid")) %>% group_by(authors, 
    language_code) %>% select(is.numeric) %>% summarize_all(.funs = list(.mean = mean, 
    .sd = sd, .var = var, .median = median, .min = min, .max = max))

# we should now clean up summary_stat_auth_lang
summary_stat_auth_lang %>% pivot_longer(is.numeric, names_to = "variables", 
    values_to = "values") %>% separate(variables, into = c("variable", 
    "stat"), sep = "_\\.") %>% pivot_wider(names_from = "stat", 
    values_from = "values") %>% mutate_if(is.numeric, round, 
    3) %>% gt()
language_code  variable                  mean           sd           var     median       min          max
J.K. Rowling
eng            average_rating           4.546        0.128  1.600000e-02      4.555     4.400        4.780
eng            ratings_count       865151.200  1095578.454  1.200292e+12  33988.000  3129.000  2244729.000
eng            num_pages             1003.800     1086.934  1.181426e+06    558.000   240.000     3342.000
eng            text_reviews_count   13023.000    16478.216  2.715316e+08    857.000   139.000    36325.000
Ovid
eng            average_rating           4.222        0.194  3.800000e-02      4.190     4.020        4.545
eng            ratings_count         6213.250    16639.188  2.768626e+08    323.500    32.000    47387.000
eng            num_pages              534.000       97.860  9.576571e+03    529.000   400.000      723.000
eng            text_reviews_count     110.500      264.793  7.011543e+04     14.000     3.000      764.000
Stephen King
en-US          average_rating           3.947        0.043  2.000000e-03      3.930     3.915        3.995
en-US          ratings_count         1470.000     1332.980  1.776837e+06   1282.000   241.000     2887.000
en-US          num_pages              360.667       30.089  9.053330e+02    376.000   326.000      380.000
en-US          text_reviews_count     127.667      140.678  1.979033e+04     84.000    14.000      285.000
eng            average_rating           4.081        0.266  7.100000e-02      4.110     3.585        4.535
eng            ratings_count        83279.095   204672.983  4.189103e+10  12729.000   281.000   944903.000
eng            num_pages              458.238      231.374  5.353379e+04    463.000    66.000     1096.000
eng            text_reviews_count    1609.286     3374.593  1.138788e+07    227.000     7.000    15105.000

The general trend again holds: average_rating has a much smaller spread than the other variables. The table is short because the selected authors each publish in only one or two language codes. Nevertheless, we can conclude that J.K. Rowling’s titles are on average rated higher than Stephen King’s and Ovid’s, though this may be confounded by the fact that her titles have many more ratings than those of the other authors.

Visualizations

cor_matrix %>% as.data.frame() %>% rownames_to_column("var1") %>% 
    pivot_longer(-1, names_to = "var2", values_to = "correlation") %>% ggplot(aes(var1, 
    var2, fill = correlation)) + geom_tile() + scale_fill_gradient2(low = "red", 
    mid = "white", high = "blue") + geom_text(aes(label = round(correlation, 
    2)), color = "black", size = 4) + theme(axis.text.x = element_text(angle = 90, 
    hjust = 1)) + coord_fixed() + xlab("") + ylab("")

The correlation heatmap reveals that among our numerical variables, only ratings_count and text_reviews_count are strongly correlated, which makes sense logically: the more people have read and rated a particular title, the more written reviews one would expect it to attract. On the other hand, the remaining pairs of variables are only weakly correlated, which is somewhat surprising, since one might expect widely rated titles to skew particularly high or low in rating.

library(ggExtra)
p1 <- ggplot(merged_books) + geom_point(aes(x = num_pages, y = average_rating, 
    color = language_code, shape = 1)) + scale_shape_identity() + 
    scale_color_brewer(palette = "Set1", label = c("eng-Canada", 
        "eng-Great Britain", "eng-US", "eng", "Multilingual")) + 
    scale_y_continuous(breaks = seq(0, 5, 0.2)) + scale_x_continuous(breaks = seq(0, 
    3500, 500)) + labs(color = "Language Codes", x = "Total Pages", 
    y = "Average Rating", title = "Book Average Rating vs Total Pages") + 
    theme_minimal() + theme(plot.title = element_text(hjust = 0.5), 
    legend.position = c(0.8, 0.35))
ggMarginal(p1, type = "densigram")

The marginal distribution of average ratings is roughly normal (symmetric) with a mean around 4.0, which implies that on average the books in this merged dataset are rated relatively highly (they are good books!). The marginal distribution of total pages is right-skewed with a mean around 300 pages, which makes sense given that books, especially novels (textbooks aside), are typically short and rarely exceed 1,000 pages. The scatterplot does not indicate an apparent relationship between average_rating and total pages, which also makes sense, since quantity does not imply quality. When colored by language code, we see that almost all books are in English (especially the longer ones, since the effort to translate them would be monumental), with each sub-category showing widely varying ratings and no significant patterns.

months <- c("January", "February", "March", "April", "May", "June", 
    "July", "August", "September", "October", "Novemember", "December")
# Let's filter down to just a few categories for clearer
# visualization
merged_books %>% filter(categories %in% c("Fiction", "Science", 
    "Business & Economics", "Computers", "Religion")) %>% mutate(months = factor(months[as.numeric(month)], 
    levels = months)) %>% ggplot(aes(x = as.numeric(day), y = average_rating, 
    fill = categories)) + scale_fill_brewer(palette = "Dark2") + 
    geom_bar(stat = "summary", fun = mean) + facet_wrap(~months) + 
    theme(axis.text.x = element_text(angle = 45, hjust = 1), 
        plot.title = element_text(hjust = 0.5)) + labs(title = "Book Ratings per Day of each Month", 
    x = "Day", y = "Mean Average rating", fill = "Genre") + scale_x_continuous(breaks = c(1, 
    seq(5, 31, 5)))

Based on the categories of books selected, it appears that authors like to release books in these genres on the first day of each month (most likely to maximize first-month sales), given that the tallest stacks of bars usually fall on day 1 of each month, except in August (which happens to be around the start of the school year). In our dataset, the purple bars appear most often, meaning fiction is the most commonly published category. Across each month and day, the mean average_rating of each genre appears roughly similar, which makes sense, since the quality of a particular category should bear no relation to a title’s publication date. The frequency of each category also remains relatively constant throughout each month and day, with the exception of March, which curiously has no Computers publications.
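
The “day 1” claim can also be checked directly with a quick tally (a sketch, separate from the plot above):

# number of books published on each day of the month, most common first
merged_books %>% count(day, sort = TRUE) %>% head()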

Dimensionality Reduction

library(cluster)
# we should first see how many clusters we should create
pam_dat <- merged_books %>% select(is.numeric) %>% scale %>% 
    as.data.frame
sil_width <- vector()
for (i in 2:10) {
    pam_fit <- pam(pam_dat, k = i)
    sil_width[i] <- pam_fit$silinfo$avg.width
}

# visualize our results to select k
ggplot() + geom_line(aes(x = 1:10, y = sil_width)) + scale_x_continuous(name = "Number of Clusters (k)", 
    breaks = 1:10) + labs(title = "Average Silhouette Width per Number of Clusters", 
    y = "Average Silhouette Width") + theme(plot.title = element_text(hjust = 0.5))

Based on this silhouette width plot, we should select three clusters, since that is the number that yields the highest average silhouette width, meaning our 3 clusters should be more cohesive and distinct (while staying parsimonious) than any other number of clusters.

pam <- pam_dat %>% pam(3)
# now that we ran PAM, save the cluster assignment
pam_dat <- pam_dat %>% mutate(cluster = as.factor(pam$clustering))

library(GGally)
ggpairs(pam_dat, columns = 1:4, aes(color = cluster))

In the first column, particularly the num_pages vs average_rating scatterplot, we see three distinct clusters: the red cluster seems to consist of titles with relatively large page counts, while the blue and green clusters separate out titles with low page counts but low and high average ratings, respectively (around z = 0, the overall mean). The same trend appears in the other panels, where the blue cluster mainly consists of low-rated, short books, the red cluster of long books, and the green cluster of highly rated, short books. The clusters overlap heavily with respect to text_reviews_count and ratings_count, which are highly correlated with each other and thus hard to differentiate.
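
Since PAM medoids are actual observations, we can also look up which books “represent” each cluster; a hedged sketch using the fitted pam object:

# the rows of merged_books corresponding to the three cluster medoids
merged_books[pam$id.med, ] %>% select(title, average_rating, num_pages, 
    ratings_count, text_reviews_count)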

# Interpreting Average Silhouette Width
plot(silhouette(pam), col = 1:3, border = NA)

Our average silhouette width was 0.27, which falls within the 0.26-0.50 band of the common interpretation cutoffs, indicating that the cluster structure is weak and possibly artificial.
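
For reference, the 0.27 figure can be pulled straight from the fitted object:

# average silhouette width stored by cluster::pam
pam$silinfo$avg.width  # ~0.27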