Technical Analytical Assessment for Field Data Science team
The data contains events from the months of January 2019 and January 2021; the 2019 can be used as a basis of comparison for the activity in 2021.
The destinations contained in the data are selected top-tier cities in Germany, UK, France, Italy, and Spain
The table below shows the definition of attributes:
Attributes | Definition |
---|---|
event_date | The date of the search or booking activity. This is different from the departure_checkin_date, which is the actual date of travel activity: if a customer makes a booking today for check-in on 1 October, today’s date would be the event date, and 1 October would be the “departure_checkin_date” |
activity_group | An indicator of the type of event on the hotel website. Classified as “Searching” (a general search for a room) or “Booking” (an actual reservation) |
reason_for_travel_detailed | this is the Reason for Travel, categorized as either “Business”, “Leisure-Non-Family” (leisure travel consisting of 1 or 2 people), and “Leisure-Family” (leisure travel consisting of 3 or more people) |
origin_country_code | 2-letter ISO country code indicating the origin country from which the booking or search activity was made |
hotel_city,hotel_state,hotel_country | raw information specifying the destination in question. Note that this data is in an unclean state, provided straight from the source system. |
traveler_value_group | Indicator of traveler value (overall spend + loyalty status), grouped into “Low”, “Medium”, “High” |
departure_checkin_date | Start date of hotel stay; see event_date to determine how the date information differs. |
total_number_events | Aggregated count of total number of events |
Install and Launch R Packages
packages = c('plotly',
'lubridate',
'DT',
'vtable',
'tidyverse')
for(p in packages){
if(!require(p, character.only = T)){
install.packages(p)
}
library(p, character.only = T)
}
plotly - interactive data visualization
lubridate - date wrangling
DT - interactive data tables
vtable - summary statistics
tidyverse - set of packages for data science workflow to import, tidy, transform, explore and visualize the data
Import Data
raw_data <- read_csv("data/sample_data.csv")
glimpse(raw_data)
Rows: 724,795
Columns: 10
$ event_date <date> 2019-01-07, 2019-01-16, 2021-01-~
$ activity_group <chr> "Booking", "Searching", "Searchin~
$ reason_for_travel_detailed <chr> "Leisure - Non-Family", "Business~
$ origin_country_code <chr> "DE", "GB", "DE", "US", "DE", "DE~
$ hotel_city <chr> "dusseldorf", "dusseldorf", "duss~
$ hotel_state <chr> NA, NA, NA, NA, NA, NA, NA, NA, N~
$ hotel_country <chr> "de", "de", "de", "de", "de", "de~
$ traveler_value_group <chr> "Med", "Med", "Low", "Low", "Low"~
$ departure_checkin_date <date> 2019-01-25, 2019-02-05, 2021-01-~
$ total_number_events <dbl> 1, 1, 1, 2, 2, 1, 1, 1, 3, 1, 1, ~
Summary Statistics
st(raw_data)
Variable | N | Mean | Std. Dev. | Min | Pctl. 25 | Pctl. 75 | Max |
---|---|---|---|---|---|---|---|
activity_group | 724795 | ||||||
… Booking | 89118 | 12.3% | |||||
… Searching | 635677 | 87.7% | |||||
reason_for_travel_detailed | 724795 | ||||||
… Business | 313401 | 43.2% | |||||
… Leisure - Family | 58859 | 8.1% | |||||
… Leisure - Non-Family | 352535 | 48.6% | |||||
traveler_value_group | 724795 | ||||||
… High | 96496 | 13.3% | |||||
… Low | 382542 | 52.8% | |||||
… Med | 245172 | 33.8% | |||||
… None | 585 | 0.1% | |||||
total_number_events | 724795 | 2.295 | 4.651 | 1 | 1 | 2 | 390 |
Data Wrangling - Date & Traveler Value Group
raw_data$event_wday <- wday(raw_data$event_date,
label = TRUE,
abbr = TRUE)
raw_data$event_year <- factor(year(raw_data$event_date))
raw_data$event_mday <- factor(paste(month(raw_data$event_date,
label = TRUE,
abbr = TRUE),
str_pad(mday(raw_data$event_date),2,side="left", pad=0),
sep = " "))
raw_data$checkin_wday <- factor(wday(raw_data$departure_checkin_date,
label = TRUE,
abbr = TRUE))
raw_data$checkin_year <- factor(year(raw_data$departure_checkin_date))
raw_data$checkin_month <- factor(month(raw_data$departure_checkin_date,
label = TRUE,
abbr = TRUE))
raw_data$traveler_value_group <- factor(raw_data$traveler_value_group, order = TRUE,
levels = c("Low", "Med", "High"))
glimpse(raw_data)
Rows: 724,795
Columns: 16
$ event_date <date> 2019-01-07, 2019-01-16, 2021-01-~
$ activity_group <chr> "Booking", "Searching", "Searchin~
$ reason_for_travel_detailed <chr> "Leisure - Non-Family", "Business~
$ origin_country_code <chr> "DE", "GB", "DE", "US", "DE", "DE~
$ hotel_city <chr> "dusseldorf", "dusseldorf", "duss~
$ hotel_state <chr> NA, NA, NA, NA, NA, NA, NA, NA, N~
$ hotel_country <chr> "de", "de", "de", "de", "de", "de~
$ traveler_value_group <ord> Med, Med, Low, Low, Low, Low, Hig~
$ departure_checkin_date <date> 2019-01-25, 2019-02-05, 2021-01-~
$ total_number_events <dbl> 1, 1, 1, 2, 2, 1, 1, 1, 3, 1, 1, ~
$ event_wday <ord> Mon, Wed, Thu, Thu, Thu, Sun, Thu~
$ event_year <fct> 2019, 2019, 2021, 2021, 2019, 202~
$ event_mday <fct> Jan 07, Jan 16, Jan 28, Jan 28, J~
$ checkin_wday <ord> Fri, Tue, Fri, Sun, Thu, Fri, Mon~
$ checkin_year <fct> 2019, 2019, 2021, 2021, 2019, 202~
$ checkin_month <ord> Jan, Feb, Jan, Jul, Jan, Mar, Mar~
Added event_wday, event_year, event_mday, checkin_wday, checkin_year, checkin_month attribues for a better understanding of total_number_events according to dates.
Transformed traveler_value_group into ordinal modeling type according to ascending order of Low, Med, and High.
Event Year EDA
color_2019 <- "darkgray"
color_2021 <- "purple"
color_bg <- "snow"
# event_year_cnt <- raw_data %>%
# count(event_year) %>%
# plot_ly(x=~event_year,
# y=~n,
# type= 'bar',
# marker = list(color = color_2021),
# hovertemplate = paste('Event Year: %{x}<br>',
# 'Count: %{y}',
# '<extra></extra>')) %>%
# layout(xaxis = list(title = ""),
# yaxis = list(title = "Total Number of Events"),
# hoverlabel=list(bgcolor=color_bg))
#event_year_cnt
event_year_data <- raw_data %>%
group_by(event_year) %>%
summarise(event_year_sum = sum(total_number_events))
event_year_total <- event_year_data %>%
plot_ly(x=~event_year,
y=~event_year_sum,
type= 'bar',
marker = list(color = c(color_2019,color_2021)),
hovertemplate = paste('Event Year: %{x}<br>',
'Total Number of Events: %{y}',
'<extra></extra>')) %>%
layout(title = "Event Year EDA",
xaxis = list(title = ""),
yaxis = list(title = "Total Number of Events"),
hoverlabel=list(bgcolor=color_bg))
event_year_total
event_year_data$pct_total <- prop.table(event_year_data$event_year_sum)
DT::datatable(data = event_year_data,
options = list(pageLength = 2),
colnames = c("Event Year", "Total Number of Events", "Percent of Total"),
rownames = FALSE,
filter = "none",
autoHideNavigation = TRUE) %>%
formatRound('event_year_sum', digits = 0) %>%
formatPercentage('pct_total', 2)
Event Month Day EDA
event_mday_2019 <- raw_data %>%
filter(event_year == 2019) %>%
group_by(event_mday) %>%
summarise(mday_2019_sum = sum(total_number_events)) %>%
arrange(event_mday)
event_mday_2021 <- raw_data %>%
filter(event_year == 2021) %>%
group_by(event_mday) %>%
summarise(mday_2021_sum = sum(total_number_events)) %>%
arrange(event_mday)
event_mday_data <- left_join(event_mday_2019,event_mday_2021)
event_mday_total <- event_mday_data %>%
plot_ly(x=~event_mday,
y=~mday_2019_sum ,
type = 'scatter',
mode = 'lines+markers',
line = list(color = color_2019),
marker = list(color = color_2019),
name = '2019',
hovertemplate = paste('Event Month Day: %{x}, 2019<br>',
'Total Number of Events: %{y}',
'<extra></extra>')) %>%
add_trace(y =~mday_2021_sum,
mode = 'lines+markers',
line = list(color = color_2021),
marker = list(color = color_2021),
name = '2021',
hovertemplate = paste('Event Month Day: %{x}, 2021<br>',
'Total Number of Events: %{y}',
'<extra></extra>')) %>%
layout(title = "Event Month Day EDA",
xaxis = list(title = ""),
yaxis = list(title = "Total Number of Events"),
hoverlabel=list(bgcolor=color_bg))
event_mday_total
event_mday_data$diff_val <- event_mday_data$mday_2021_sum - event_mday_data$mday_2019_sum
event_mday_data$diff_pct <- event_mday_data$diff_val/event_mday_data$mday_2019_sum
DT::datatable(data = event_mday_data,
options = list(pageLength = 31),
colnames = c("Event Day of Month", "2019 Total Events", "2021 Total Events", "Difference (Values)" ,"Difference (%)"),
rownames = FALSE,
filter = "none",
autoHideNavigation = TRUE) %>%
formatRound('mday_2019_sum', digits = 0) %>%
formatRound('mday_2021_sum', digits = 0) %>%
formatRound('diff_val', digits = 0) %>%
formatPercentage('diff_pct', 2)
Event Weekday EDA
event_wday_2019 <- raw_data %>%
filter(event_year == 2019) %>%
group_by(event_wday) %>%
summarise(wday_2019_sum = sum(total_number_events))
event_wday_2021 <- raw_data %>%
filter(event_year == 2021) %>%
group_by(event_wday) %>%
summarise(wday_2021_sum = sum(total_number_events))
event_wday_data <- left_join(event_wday_2019,event_wday_2021)
event_wday_total <- event_wday_data %>%
plot_ly(x=~event_wday,
y=~wday_2019_sum ,
type= 'bar',
marker = list(color = color_2019),
name = '2019',
hovertemplate = paste('Event Week Day: %{x} 2019<br>',
'Total Number of Events: %{y}',
'<extra></extra>')) %>%
add_trace(y =~wday_2021_sum,
marker = list(color = color_2021),
name = '2021',
hovertemplate = paste('Event Week Day: %{x} 2021<br>',
'Total Number of Events: %{y}',
'<extra></extra>')) %>%
layout(title = "Event Weekday EDA",
xaxis = list(title = ""),
yaxis = list(title = "Total Number of Events"),
hoverlabel=list(bgcolor=color_bg))
event_wday_total
event_wday_data$diff <- event_wday_data$wday_2021_sum - event_wday_data$wday_2019_sum
event_wday_data$pct_diff <- event_wday_data$diff/event_wday_data$wday_2019_sum
DT::datatable(data = event_wday_data,
options = list(pageLength = 7),
colnames = c("Event Weekday", "2019 Total Events", "2021 Total Events", "Difference (Values)" ,"Difference (%)"),
rownames = FALSE,
filter = "none",
autoHideNavigation = TRUE) %>%
formatRound('wday_2019_sum', digits = 0) %>%
formatRound('wday_2021_sum', digits = 0) %>%
formatRound('diff', digits = 0) %>%
formatPercentage('pct_diff', 2)
Activity Group EDA
activity_data <- raw_data %>%
group_by(activity_group) %>%
summarise(activity_sum = sum(total_number_events))
activity_total <- activity_data %>%
plot_ly(x=~activity_group,
y=~activity_sum,
type= 'bar',
marker = list(color = c(color_2021,color_2019)),
hovertemplate = paste('Activity Group: %{x}<br>',
'Total Number of Events: %{y}',
'<extra></extra>')) %>%
layout(title = "Activity Group EDA",
xaxis = list(title = ""),
yaxis = list(title = "Total Number of Events"),
hoverlabel=list(bgcolor=color_bg))
activity_total
activity_data$pct_total <- prop.table(activity_data$activity_sum)
DT::datatable(data = activity_data,
options = list(pageLength = 2),
colnames = c("Activity Group", "Total Number of Events", "Percent of Total"),
rownames = FALSE,
filter = "none",
autoHideNavigation = TRUE) %>%
formatRound('activity_sum', digits = 0) %>%
formatPercentage('pct_total', 2)
From the total number of events, only less than 8% are doing actual reservation. It would be interesting to see what the are the factors that affect the conversion from ‘Searching’ to ‘Booking’.
An ‘conversion’ attribute can be use as a target variable for a prediction model.
Reason for Travel Detailed EDA
color_biz <- "orchid" #business
color_lfam <- "darkorchid" #leisure-family
color_lnf <- "purple" #leisure-non-family
reason_travel_data <- raw_data %>%
group_by(reason_for_travel_detailed) %>%
summarise(reason_travel_sum = sum(total_number_events)) %>%
arrange(desc(reason_travel_sum))
activity_total <- reason_travel_data %>%
plot_ly(x=~reorder(reason_for_travel_detailed,desc(reason_travel_sum)),
y=~reason_travel_sum,
type= 'bar',
marker = list(color = c(color_biz,color_lfam, color_lnf)),
hovertemplate = paste('Reason For Travel Detailed: %{x}<br>',
'Total Number of Events: %{y}',
'<extra></extra>')) %>%
layout(title = "Reason for Travel Detailed EDA",
xaxis = list(title = ""),
yaxis = list(title = "Total Number of Events"),
hoverlabel=list(bgcolor=color_bg))
activity_total
reason_travel_data$pct_total <- prop.table(reason_travel_data$reason_travel_sum)
DT::datatable(data = reason_travel_data,
options = list(pageLength = 4),
colnames = c("Reason for Travel Detailed", "Total Number of Events", "Percent of Total"),
rownames = FALSE,
filter = "none",
autoHideNavigation = TRUE) %>%
formatRound('reason_travel_sum', digits = 0) %>%
formatPercentage('pct_total', 2)
Almost 95% of the total events are from Leisure - Non-Family and Business. Although Leisure - Family accounts for more than 5%, it would be good to know the spending patterns.
Intuitively, leisure travel consisting of 3 or more people for Leisure-Family category will spend more in the hotel services like rooms and food than Leisure - Non-Family and Business.
Origin Country Code
origin_country_data <- raw_data %>%
group_by(origin_country_code) %>%
summarise(origin_country_sum = sum(total_number_events)) %>%
arrange(desc(origin_country_sum))
origin_country_total <- origin_country_data %>%
slice_head(n = 10) %>%
plot_ly(x=~origin_country_sum,
y=~reorder(origin_country_code,origin_country_sum),
type= 'bar',
marker = list(color = color_2021),
hovertemplate = paste('Origin Country: %{y}<br>',
'Total Number of Events: %{x}',
'<extra></extra>')) %>%
layout(title = "Top 10 Origin Country Code EDA",
yaxis = list(title = ""),
xaxis = list(title = "Total Number of Events"),
hoverlabel=list(bgcolor=color_bg))
origin_country_total
origin_country_data$pct_total <- prop.table(origin_country_data$origin_country_sum)
DT::datatable(data = origin_country_data,
options = list(pageLength = 10),
colnames = c("Origin Country Code", "Total Number of Events", "Percent of Total"),
#rownames = FALSE,
filter = "none",
autoHideNavigation = TRUE) %>%
formatRound('origin_country_sum', digits = 0) %>%
formatPercentage('pct_total', 2)
Hotel City
hotel_city_data <- raw_data %>%
group_by(hotel_city) %>%
summarise(hotel_city_sum = sum(total_number_events)) %>%
arrange(desc(hotel_city_sum))
hotel_city_total <- hotel_city_data %>%
slice_head(n = 10) %>%
plot_ly(x=~hotel_city_sum,
y=~reorder(hotel_city,hotel_city_sum),
type= 'bar',
marker = list(color = color_2021),
hovertemplate = paste('Hotel City: %{y}<br>',
'Total Number of Events: %{x}',
'<extra></extra>')) %>%
layout(title = "Top 10 Hotel City EDA",
yaxis = list(title = ""),
xaxis = list(title = "Total Number of Events"),
hoverlabel=list(bgcolor=color_bg))
hotel_city_total
hotel_city_data$pct_total <- prop.table(hotel_city_data$hotel_city_sum)
DT::datatable(data = hotel_city_data,
options = list(pageLength = 10),
colnames = c("Hotel City", "Total Number of Events", "Percent of Total"),
#rownames = FALSE,
filter = "none",
autoHideNavigation = TRUE) %>%
formatRound('hotel_city_sum', digits = 0) %>%
formatPercentage('pct_total', 2)
Hotel State
hotel_state_data <- raw_data %>%
group_by(hotel_state) %>%
summarise(hotel_state_sum = sum(total_number_events)) %>%
arrange(desc(hotel_state_sum))
hotel_state_total <- hotel_state_data %>%
slice_head(n = 10) %>%
plot_ly(x=~hotel_state_sum,
y=~reorder(hotel_state,hotel_state_sum),
type= 'bar',
marker = list(color = color_2021),
hovertemplate = paste('Hotel State: %{y}<br>',
'Total Number of Events: %{x}',
'<extra></extra>')) %>%
layout(title = "Top 10 Hotel State EDA",
yaxis = list(title = ""),
xaxis = list(title = "Total Number of Events"),
hoverlabel=list(bgcolor=color_bg))
hotel_state_total
hotel_state_data$pct_total <- prop.table(hotel_state_data$hotel_state_sum)
DT::datatable(data = hotel_state_data,
options = list(pageLength = 10),
colnames = c("Hotel State", "Total Number of Events", "Percent of Total"),
#rownames = FALSE,
filter = "none",
autoHideNavigation = TRUE) %>%
formatRound('hotel_state_sum', digits = 0) %>%
formatPercentage('pct_total', 2)
Hotel Country
hotel_country_data <- raw_data %>%
group_by(hotel_country) %>%
summarise(hotel_country_sum = sum(total_number_events)) %>%
arrange(desc(hotel_country_sum))
hotel_country_total <- hotel_country_data %>%
slice_head(n = 10) %>%
plot_ly(x=~hotel_country_sum,
y=~reorder(hotel_country,hotel_country_sum),
type= 'bar',
marker = list(color = color_2021),
hovertemplate = paste('Hotel Country: %{y}<br>',
'Total Number of Events: %{x}',
'<extra></extra>')) %>%
layout(title = "Top 10 Hotel Country EDA",
yaxis = list(title = ""),
xaxis = list(title = "Total Number of Events"),
hoverlabel=list(bgcolor=color_bg))
hotel_country_total
hotel_country_data$pct_total <- prop.table(hotel_country_data$hotel_country_sum)
DT::datatable(data = hotel_country_data,
options = list(pageLength = 10),
colnames = c("Hotel Country", "Total Number of Events", "Percent of Total"),
#rownames = FALSE,
filter = "none",
autoHideNavigation = TRUE) %>%
formatRound('hotel_country_sum', digits = 0) %>%
formatPercentage('pct_total', 2)
London, GB accounts to almost 20% of the total number of events
As noted in the Data Dictionary, hotel_city, hotel_state, and hotel_country are raw data from the source systems and requires data cleaning to be useable.
Traveler Value Group EDA
color_low <- "orchid" #business
color_med <- "darkorchid" #leisure-family
color_high <- "purple" #leisure-non-family
traveler_value_data <- raw_data %>%
group_by(traveler_value_group) %>%
summarise(traveler_value_sum = sum(total_number_events))%>%
arrange(desc(traveler_value_sum))
traveler_value_total <- traveler_value_data %>%
plot_ly(x=~reorder(traveler_value_group,desc(traveler_value_sum)),
y=~traveler_value_sum,
type= 'bar',
marker = list(color = c(color_low,color_med, color_high)),
hovertemplate = paste('Traveler Value Group: %{x}<br>',
'Total Number of Events: %{y}',
'<extra></extra>')) %>%
layout(title = "Traveler Value Group EDA",
xaxis = list(title = ""),
yaxis = list(title = "Total Number of Events"),
hoverlabel=list(bgcolor=color_bg))
traveler_value_total
traveler_value_data$pct_total <- prop.table(traveler_value_data$traveler_value_sum)
DT::datatable(data = traveler_value_data,
options = list(pageLength = 4),
colnames = c("Traveler Value Group", "Total Number of Events", "Percent of Total"),
rownames = FALSE,
filter = "none",
autoHideNavigation = TRUE) %>%
formatRound('traveler_value_sum', digits = 0) %>%
formatPercentage('pct_total', 2)
Similar to reason_for_travel_detailed, it would be good to know the spending patterns of different Traveler Value Group.
Intuitively, “High” will spend more in the hotel services like rooms, food, and loyalty. Therefore, they require the best quality of customer care.
Checkin Year EDA
checkin_year_data <- raw_data %>%
group_by(checkin_year) %>%
summarise(checkin_year_sum = sum(total_number_events))
checkin_year_total <- checkin_year_data %>%
plot_ly(x=~checkin_year,
y=~checkin_year_sum,
type= 'bar',
marker = list(color = c(color_2019,color_low, color_2021, color_low, color_low, color_low)),
hovertemplate = paste('Checkin Year: %{x}<br>',
'Total Number of Events: %{y}',
'<extra></extra>')) %>%
layout(title = "Checkin Year EDA",
xaxis = list(title = ""),
yaxis = list(title = "Total Number of Events"),
hoverlabel=list(bgcolor=color_bg))
checkin_year_total
checkin_year_data$pct_total <- prop.table(checkin_year_data$checkin_year_sum)
DT::datatable(data = checkin_year_data,
options = list(pageLength = 10),
colnames = c("Checkin Year", "Total Number of Events", "Percent of Total"),
rownames = FALSE,
filter = "none",
autoHideNavigation = TRUE) %>%
formatRound('checkin_year_sum', digits = 0) %>%
formatPercentage('pct_total', 2)
Checkin Month Day EDA
checkin_month_data <- raw_data %>%
group_by(checkin_month) %>%
summarise(checkin_month_sum = sum(total_number_events))
event_month_total <- checkin_month_data %>%
plot_ly(x=~checkin_month,
y=~checkin_month_sum,
type= 'bar',
marker = list(color = color_2021),
hovertemplate = paste('Checkin Month: %{x}<br>',
'Total Number of Events: %{y}',
'<extra></extra>')) %>%
layout(title = "Checkin Month EDA",
xaxis = list(title = ""),
yaxis = list(title = "Total Number of Events"),
hoverlabel=list(bgcolor=color_bg))
event_month_total
checkin_month_data$pct_total <- prop.table(checkin_month_data$checkin_month_sum)
DT::datatable(data = checkin_month_data,
options = list(pageLength = 12),
colnames = c("Checkin Month", "Total Number of Events", "Percent of Total"),
rownames = FALSE,
filter = "none",
autoHideNavigation = TRUE) %>%
formatRound('checkin_month_sum', digits = 0) %>%
formatPercentage('pct_total', 2)
Total Number of Events Histogram
total_event_data <- raw_data %>%
count(total_number_events) %>%
rename(count = n) %>%
arrange(desc(count))
total_event_hist <- total_event_data %>%
slice_head(n = 10) %>%
plot_ly(x=~count,
y=~reorder(total_number_events,count),
type= 'bar',
marker = list(color = color_2021),
hovertemplate = paste('Total Number of Events: %{y} <br>',
'Count: %{x}',
'<extra></extra>')) %>%
layout(title = "Top 10 Total Number of Events Histogram",
xaxis = list(title = "Count"),
yaxis = list(title = "Total Number of Events"),
hoverlabel=list(bgcolor=color_bg))
total_event_hist
total_event_bp <- raw_data %>%
select(total_number_events) %>%
plot_ly(y= ~total_number_events,
type= 'box',
name = "Suspected Outliers",
boxpoints = 'suspectedoutliers',
marker = list(color = color_2021,
outliercolor = color_2019,
line = list(outliercolor = color_2019,
outlierwidth = 5)),
line = list(color = color_2021)) %>%
layout(title = "Total Number of Events Boxplot",
xaxis = list(title = ""),
yaxis = list(title = "Total Number of Events"),
hoverlabel=list(bgcolor=color_bg))
total_event_bp
total_event_data$pct_total <- prop.table(total_event_data$count)
DT::datatable(data = total_event_data,
options = list(pageLength = 10),
colnames = c("Total Number of Events", "Count", "Percent of Total"),
#rownames = FALSE,
filter = "none",
autoHideNavigation = TRUE) %>%
formatRound('count', digits = 0) %>%
formatPercentage('pct_total', 2)
Looking at the histogram, boxplot and table, there are extreme outliers like the maximum value of 390 total number of events.
This may suggest that there are some issues with the data acquisition or incorrect data entries.
Clean Hotel City
clean_data <- raw_data
#Replace Hotel City
clean_data <- clean_data %>%
mutate(hotel_city = ifelse(str_detect(hotel_city, "dusseldorf"), "duesseldorf", hotel_city)) %>%
mutate(hotel_city = ifelse(str_detect(hotel_city, "firenze"), "florence", hotel_city)) %>%
mutate(hotel_city = ifelse(str_detect(hotel_city, "kln"), "koeln", hotel_city)) %>%
mutate(hotel_city = ifelse(str_detect(hotel_city, "mnchen"), "munich", hotel_city)) %>%
mutate(hotel_city = ifelse(str_detect(hotel_city, "seville"), "sevilla", hotel_city)) %>%
mutate(hotel_city = ifelse(str_detect(hotel_city, "venezia"), "venice", hotel_city))
hotel_city_clean_data <- clean_data %>%
group_by(hotel_city) %>%
summarise(hotel_city_sum = sum(total_number_events)) %>%
arrange(desc(hotel_city_sum))
hotel_city_clean_data$pct_total <- prop.table(hotel_city_clean_data$hotel_city_sum)
DT::datatable(data = hotel_city_clean_data,
options = list(pageLength = 10),
colnames = c("Hotel City", "Total Number of Events", "Percent of Total"),
#rownames = FALSE,
filter = "none",
autoHideNavigation = TRUE) %>%
formatRound('hotel_city_sum', digits = 0) %>%
formatPercentage('pct_total', 2)
Clean Hotel Country
#Replace Hotel Country
clean_data <- clean_data %>%
mutate(hotel_country = ifelse(str_detect(hotel_country, "alemania"), "de", hotel_country)) %>%
mutate(hotel_country = ifelse(str_detect(hotel_country, "allemagne"), "de", hotel_country)) %>%
mutate(hotel_country = ifelse(str_detect(hotel_country, "germania"), "de", hotel_country)) %>%
mutate(hotel_country = ifelse(str_detect(hotel_country, "espaa"), "es", hotel_country)) %>%
mutate(hotel_country = ifelse(str_detect(hotel_country, "catalunya"), "es", hotel_country)) %>%
mutate(hotel_country = ifelse(str_detect(hotel_country, "gran bretaa"), "gb", hotel_country)) %>%
mutate(hotel_country = ifelse(str_detect(hotel_country, "gran bretagna"), "gb", hotel_country)) %>%
mutate(hotel_country = ifelse(str_detect(hotel_country, "grande bretagne"), "gb", hotel_country)) %>%
mutate(hotel_country = ifelse(str_detect(hotel_country, "grobritannien"), "gb", hotel_country)) %>%
mutate(hotel_country = ifelse(str_detect(hotel_country, "regno unito"), "gb", hotel_country)) %>%
mutate(hotel_country = ifelse(str_detect(hotel_country, "reino unido"), "gb", hotel_country)) %>%
mutate(hotel_country = ifelse(str_detect(hotel_country, "royaume-uni"), "gb", hotel_country)) %>%
mutate(hotel_country = ifelse(str_detect(hotel_country, "sct"), "gb", hotel_country)) %>%
mutate(hotel_country = ifelse(str_detect(hotel_country, "on"), "gb", hotel_country)) %>% #london
mutate(hotel_country = ifelse(str_detect(hotel_country, "ns"), "gb", hotel_country)) %>% #liverpool
mutate(hotel_country = ifelse(str_detect(hotel_country, "jp"), "fr", hotel_country)) %>% #bordeaux
mutate(hotel_country = ifelse(str_detect(hotel_country, "br"), "es", hotel_country)) %>% #barcelona
mutate(hotel_country = ifelse(str_detect(hotel_country, "co"), "es", hotel_country)) #madrid
hotel_country_clean_data <- clean_data %>%
group_by(hotel_country) %>%
summarise(hotel_country_sum = sum(total_number_events)) %>%
arrange(desc(hotel_country_sum))
hotel_country_clean_data$pct_total <- prop.table(hotel_country_clean_data$hotel_country_sum)
DT::datatable(data = hotel_country_clean_data,
options = list(pageLength = 11),
colnames = c("Hotel Country", "Total Number of Events", "Percent of Total"),
#rownames = FALSE,
filter = "none",
autoHideNavigation = TRUE) %>%
formatRound('hotel_country_sum', digits = 0) %>%
formatPercentage('pct_total', 2)
Issues:
Merged and renamed countries. However US, CA, AU, and RU are still incorrect (does not match correctly with hotel_city)
There are missing countries and (not set) entries
Assuming the hotel_city is correct, a better approach is to map the hotel_country from the hotel_city
Additionally, hotel_state is unusable because of 93.37% of the data are missing. If this attribute is needed, hotel_state can be mapped from the hotel_city
clean_data <- clean_data %>%
select(-c(hotel_state,hotel_country))
write_csv(clean_data,"./data/clean_data.csv")
glimpse(clean_data)
Rows: 724,795
Columns: 14
$ event_date <date> 2019-01-07, 2019-01-16, 2021-01-~
$ activity_group <chr> "Booking", "Searching", "Searchin~
$ reason_for_travel_detailed <chr> "Leisure - Non-Family", "Business~
$ origin_country_code <chr> "DE", "GB", "DE", "US", "DE", "DE~
$ hotel_city <chr> "duesseldorf", "duesseldorf", "du~
$ traveler_value_group <ord> Med, Med, Low, Low, Low, Low, Hig~
$ departure_checkin_date <date> 2019-01-25, 2019-02-05, 2021-01-~
$ total_number_events <dbl> 1, 1, 1, 2, 2, 1, 1, 1, 3, 1, 1, ~
$ event_wday <ord> Mon, Wed, Thu, Thu, Thu, Sun, Thu~
$ event_year <fct> 2019, 2019, 2021, 2021, 2019, 202~
$ event_mday <fct> Jan 07, Jan 16, Jan 28, Jan 28, J~
$ checkin_wday <ord> Fri, Tue, Fri, Sun, Thu, Fri, Mon~
$ checkin_year <fct> 2019, 2019, 2021, 2021, 2019, 202~
$ checkin_month <ord> Jan, Feb, Jan, Jul, Jan, Mar, Mar~
For the final data to be used for analysis, hotel_state and hotel_country will be excluded since the final report only requires top destination per city.
Additionally, the outliers in the total_number_events and missing values from other fields will be kept as is. The analysis assumes that the data inputs are correct.
booking_2019_data <- clean_data %>%
filter(activity_group == "Booking") %>%
filter(event_year == 2019) %>%
group_by(hotel_city) %>%
summarise(booking_2019_sum = sum(total_number_events))
booking_2021_data <- clean_data %>%
filter(activity_group == "Booking") %>%
filter(event_year == 2021) %>%
group_by(hotel_city) %>%
summarise(booking_2021_sum = sum(total_number_events))
booking_city_data <- left_join(booking_2019_data, booking_2021_data)
booking_city_total <- booking_city_data %>%
plot_ly(x=~booking_2021_sum,
y=~reorder(hotel_city,booking_2021_sum),
type= 'bar',
marker = list(color = color_2021),
name = '2021',
hovertemplate = paste('Hotel City: %{y}<br>',
'2021 Total Number of Events: %{x}',
'<extra></extra>')) %>%
add_trace(x=~booking_2019_sum,
marker = list(color = color_2019),
name = '2019',
hovertemplate = paste('Hotel City: %{y}<br>',
'2019 Total Number of Events: %{x}',
'<extra></extra>')) %>%
layout(title = "Booking Top Destinations by City",
yaxis = list(title = ""),
xaxis = list(title = "Total Number of Events"),
hoverlabel=list(bgcolor=color_bg))
booking_city_total
booking_city_data$diff <- booking_city_data$booking_2021_sum - booking_city_data$booking_2019_sum
booking_city_data$pct_diff <- booking_city_data$diff / booking_city_data$booking_2019_sum
DT::datatable(data = booking_city_data %>%
arrange(desc(pct_diff)),
options = list(pageLength = 10),
colnames = c("Booking Destination by City", "2019 Total Events", "2021 Total Events", "Difference (Values)" ,"Difference (%)"),
rownames = FALSE,
filter = "none",
autoHideNavigation = TRUE) %>%
formatRound('booking_2019_sum', digits = 0) %>%
formatRound('booking_2021_sum', digits = 0) %>%
formatRound('diff', digits = 0) %>%
formatPercentage('pct_diff', 2)
booking_city_diff <- booking_city_data %>%
plot_ly(x=~pct_diff,
y=~reorder(hotel_city,pct_diff),
type= 'bar',
marker = list(color = color_2021),
hovertemplate = paste('Hotel City: %{y}<br>',
'2021-2019 Difference(%): %{x}',
'<extra></extra>')) %>%
layout(title = "Destinations by City Percentage Change (2021 vs. 2019)",
yaxis = list(title = ""),
xaxis = list(title = "Percentage Change(%)",tickformat = "%"),
hoverlabel=list(bgcolor=color_bg))
booking_city_diff
booking_country_2019_data <- clean_data %>%
filter(activity_group == "Booking") %>%
filter(event_year == 2019) %>%
group_by(origin_country_code) %>%
summarise(booking_country_2019_sum = sum(total_number_events))
booking_country_2021_data <- clean_data %>%
filter(activity_group == "Booking") %>%
filter(event_year == 2021) %>%
group_by(origin_country_code) %>%
summarise(booking_country_2021_sum = sum(total_number_events))
booking_country_data <- full_join(booking_country_2019_data, booking_country_2021_data)
booking_country_total <- booking_country_data %>%
slice_head(n = 10) %>%
plot_ly(x=~booking_country_2021_sum,
y=~reorder(origin_country_code,booking_country_2021_sum),
type= 'bar',
marker = list(color = color_2021),
name = '2021',
hovertemplate = paste('Origin Country: %{y}<br>',
'2021 Total Number of Events: %{x}',
'<extra></extra>')) %>%
add_trace(x=~booking_country_2019_sum,
marker = list(color = color_2019),
name = '2019',
hovertemplate = paste('Origin Country: %{y}<br>',
'2019 Total Number of Events: %{x}',
'<extra></extra>')) %>%
layout(title = "Top 10 Origin Country",
yaxis = list(title = ""),
xaxis = list(title = "Total Number of Events"),
hoverlabel=list(bgcolor=color_bg))
booking_country_total
booking_country_data$diff <- booking_country_data$booking_country_2021_sum - booking_country_data$booking_country_2019_sum
booking_country_data$pct_diff <- booking_country_data$diff / booking_country_data$booking_country_2019_sum
DT::datatable(data = booking_country_data %>%
arrange(desc(booking_country_2021_sum)),
options = list(pageLength = 10),
colnames = c("Booking Origin Country", "2019 Total Events", "2021 Total Events", "Difference (Values)" ,"Difference (%)"),
rownames = FALSE,
filter = "none",
autoHideNavigation = TRUE) %>%
formatRound('booking_country_2019_sum', digits = 0) %>%
formatRound('booking_country_2021_sum', digits = 0) %>%
formatRound('diff', digits = 0) %>%
formatPercentage('pct_diff', 2)
In terms of percent change of activity (2021 vs 2019), only florence has positive growth of 203.92%.
The rest of the destinations experienced negative percent change from venice (-29.69%) up to the worst hit cities which are leeds (-93.82%), frankfurt(-93.84%), koeln (-93.85%).
In terms of total number of events, birmingham has the highest events with 2,650 followed by london with 2,600.
The bottom 3 destination cities have less than 100 events which are malaga with 83, cologne with 52, and palma de mallorca with a mere 38.
The top origin countries from which bookings were made are US, FR, GB, DE, and ES.
searching_2019_data <- clean_data %>%
filter(activity_group == "Searching") %>%
filter(event_year == 2019) %>%
group_by(hotel_city) %>%
summarise(searching_2019_sum = sum(total_number_events))
searching_2021_data <- clean_data %>%
filter(activity_group == "Searching") %>%
filter(event_year == 2021) %>%
group_by(hotel_city) %>%
summarise(searching_2021_sum = sum(total_number_events))
searching_city_data <- left_join(searching_2019_data, searching_2021_data)
searching_city_total <- searching_city_data %>%
plot_ly(x=~searching_2021_sum,
y=~reorder(hotel_city,searching_2021_sum),
type= 'bar',
marker = list(color = color_2021),
name = '2021',
hovertemplate = paste('Hotel City: %{y}<br>',
'2021 Total Number of Events: %{x}',
'<extra></extra>')) %>%
add_trace(x=~searching_2019_sum,
marker = list(color = color_2019),
name = '2019',
hovertemplate = paste('Hotel City: %{y}<br>',
'2019 Total Number of Events: %{x}',
'<extra></extra>')) %>%
layout(title = "Searching Top Destinations by City",
yaxis = list(title = ""),
xaxis = list(title = "Total Number of Events"),
hoverlabel=list(bgcolor=color_bg))
searching_city_total
searching_city_data$diff <- searching_city_data$searching_2021_sum - searching_city_data$searching_2019_sum
searching_city_data$pct_diff <- searching_city_data$diff / searching_city_data$searching_2019_sum
DT::datatable(data = searching_city_data %>%
arrange(desc(pct_diff)),
options = list(pageLength = 10),
colnames = c("Searching Destination by City", "2019 Total Events", "2021 Total Events", "Difference (Values)" ,"Difference (%)"),
rownames = FALSE,
filter = "none",
autoHideNavigation = TRUE) %>%
formatRound('searching_2019_sum', digits = 0) %>%
formatRound('searching_2021_sum', digits = 0) %>%
formatRound('diff', digits = 0) %>%
formatPercentage('pct_diff', 2)
searching_city_diff <- searching_city_data %>%
plot_ly(x=~pct_diff,
y=~reorder(hotel_city,pct_diff),
type= 'bar',
marker = list(color = color_2021),
hovertemplate = paste('Hotel City: %{y}<br>',
'2021-2019 Difference(%): %{x}',
'<extra></extra>')) %>%
layout(title = "Destinations by City Percentage Change (2021 vs. 2019)",
yaxis = list(title = ""),
xaxis = list(title = "Percentage Change(%)",tickformat = "%"),
hoverlabel=list(bgcolor=color_bg))
searching_city_diff
searching_country_2019_data <- clean_data %>%
filter(activity_group == "Searching") %>%
filter(event_year == 2019) %>%
group_by(origin_country_code) %>%
summarise(searching_country_2019_sum = sum(total_number_events))
searching_country_2021_data <- clean_data %>%
filter(activity_group == "Searching") %>%
filter(event_year == 2021) %>%
group_by(origin_country_code) %>%
summarise(searching_country_2021_sum = sum(total_number_events))
searching_country_data <- full_join(searching_country_2019_data, searching_country_2021_data)
searching_country_total <- searching_country_data %>%
slice_head(n = 10) %>%
plot_ly(x=~searching_country_2021_sum,
y=~reorder(origin_country_code,searching_country_2021_sum),
type= 'bar',
marker = list(color = color_2021),
name = '2021',
hovertemplate = paste('Origin Country: %{y}<br>',
'2021 Total Number of Events: %{x}',
'<extra></extra>')) %>%
add_trace(x=~searching_country_2019_sum,
marker = list(color = color_2019),
name = '2019',
hovertemplate = paste('Origin Country: %{y}<br>',
'2019 Total Number of Events: %{x}',
'<extra></extra>')) %>%
layout(title = "Top 10 Origin Country",
yaxis = list(title = ""),
xaxis = list(title = "Total Number of Events"),
hoverlabel=list(bgcolor=color_bg))
searching_country_total
searching_country_data$diff <- searching_country_data$searching_country_2021_sum - searching_country_data$searching_country_2019_sum
searching_country_data$pct_diff <- searching_country_data$diff / searching_country_data$searching_country_2019_sum
DT::datatable(data = searching_country_data %>%
arrange(desc(searching_country_2021_sum)),
options = list(pageLength = 10),
colnames = c("Booking Origin Country", "2019 Total Events", "2021 Total Events", "Difference (Values)" ,"Difference (%)"),
rownames = FALSE,
filter = "none",
autoHideNavigation = TRUE) %>%
formatRound('searching_country_2019_sum', digits = 0) %>%
formatRound('searching_country_2021_sum', digits = 0) %>%
formatRound('diff', digits = 0) %>%
formatPercentage('pct_diff', 2)
In terms of percent change of activity (2021 vs 2019), florence, venince, and birmingham have positive percent change with values of 181.83%, 27.73% and 1.51% respectively.
The rest of the destinations experienced negative percent change from palma de mallorca (-33.84%) up to the worst hit cities which are koeln (-92.54%) and leeds (-93.70%)
In terms of total number of events, london has the highest events with 54,812 and consistently at the top for 2019 and 2021.
The bottom 4 destination cities have less than 2000 events which are malaga (1,836), cologne (1,817), hannover (1,573), and leeds (1,210).
The top origin countries from which bookings were made are US, FR, GB, ES, and DE.
Add new attribute like a flag to understand the factors that affect the conversion from ‘Searching’ to ‘Booking’. An ‘conversion’ attribute can be use as a target variable for a prediction model.
Add spending patterns information to know how ‘Reason for Travel Detailed’ and ‘Traveler Value Group’ contribute to the bottom line.
Add user demographics information, time of event, and campaigns used to advertise the hotel to better plan and invest marketing and operational resources.
Suggest to focus the marketing and operational resources to the top origin countries from which bookings were made such as US, FR, GB, DE, and ES. Additionally, may need to investigate why there are no bookings or even search from US for 2019. This may be a data source issue.
Finally, aside from looking at the City Hotel, perform additional interactive visualization and data analysis to other attributes (reason_for_travel_detailed, traveler_value_group, and departure_checkin_date).
For attribution, please cite this work as
Dolit (2021, Sept. 11). FinTech & Analytics: Adara Challenge Analysis. Retrieved from https://adolit.github.io/posts/2021-09-11-adara-challenge-analysis/
BibTeX citation
@misc{dolit2021adara, author = {Dolit, Archie}, title = {FinTech & Analytics: Adara Challenge Analysis}, url = {https://adolit.github.io/posts/2021-09-11-adara-challenge-analysis/}, year = {2021} }