Adara Challenge Analysis

R Adara Analytical Assessment

Technical Analytical Assessment for Field Data Science team

Archie Dolit https://www.linkedin.com/in/adolit/ (School of Computing and Information Systems, Singapore Management University)https://scis.smu.edu.sg/
09-11-2021

1. Exploratory Data Analysis

1.1 Data Dictionary

The data contains events from the months of January 2019 and January 2021; the 2019 can be used as a basis of comparison for the activity in 2021.

The destinations contained in the data are selected top-tier cities in Germany, UK, France, Italy, and Spain

The table below shows the definition of attributes:

Attributes Definition
event_date The date of the search or booking activity. This is different from the departure_checkin_date, which is the actual date of travel activity: if a customer makes a booking today for check-in on 1 October, today’s date would be the event date, and 1 October would be the “departure_checkin_date”
activity_group An indicator of the type of event on the hotel website. Classified as “Searching” (a general search for a room) or “Booking” (an actual reservation)
reason_for_travel_detailed this is the Reason for Travel, categorized as either “Business”, “Leisure-Non-Family” (leisure travel consisting of 1 or 2 people), and “Leisure-Family” (leisure travel consisting of 3 or more people)
origin_country_code 2-letter ISO country code indicating the origin country from which the booking or search activity was made
hotel_city,hotel_state,hotel_country raw information specifying the destination in question.
Note that this data is in an unclean state, provided straight from the source system.
traveler_value_group Indicator of traveler value (overall spend + loyalty status), grouped into “Low”, “Medium”, “High”
departure_checkin_date Start date of hotel stay; see event_date to determine how the date information differs.
total_number_events Aggregated count of total number of events

1.2 Data Exploration

Install and Launch R Packages

hide
packages = c('plotly', 
             'lubridate',
             'DT',
             'vtable',
             'tidyverse')

for(p in packages){
  if(!require(p, character.only = T)){
    install.packages(p)
  }
  library(p, character.only = T)
}

Import Data

hide
raw_data <- read_csv("data/sample_data.csv")
glimpse(raw_data)
Rows: 724,795
Columns: 10
$ event_date                 <date> 2019-01-07, 2019-01-16, 2021-01-~
$ activity_group             <chr> "Booking", "Searching", "Searchin~
$ reason_for_travel_detailed <chr> "Leisure - Non-Family", "Business~
$ origin_country_code        <chr> "DE", "GB", "DE", "US", "DE", "DE~
$ hotel_city                 <chr> "dusseldorf", "dusseldorf", "duss~
$ hotel_state                <chr> NA, NA, NA, NA, NA, NA, NA, NA, N~
$ hotel_country              <chr> "de", "de", "de", "de", "de", "de~
$ traveler_value_group       <chr> "Med", "Med", "Low", "Low", "Low"~
$ departure_checkin_date     <date> 2019-01-25, 2019-02-05, 2021-01-~
$ total_number_events        <dbl> 1, 1, 1, 2, 2, 1, 1, 1, 3, 1, 1, ~

Summary Statistics

hide
st(raw_data)
(#tab:summary 2 csv)Summary Statistics
Variable N Mean Std. Dev. Min Pctl. 25 Pctl. 75 Max
activity_group 724795
… Booking 89118 12.3%
… Searching 635677 87.7%
reason_for_travel_detailed 724795
… Business 313401 43.2%
… Leisure - Family 58859 8.1%
… Leisure - Non-Family 352535 48.6%
traveler_value_group 724795
… High 96496 13.3%
… Low 382542 52.8%
… Med 245172 33.8%
… None 585 0.1%
total_number_events 724795 2.295 4.651 1 1 2 390

Data Wrangling - Date & Traveler Value Group

hide
raw_data$event_wday <- wday(raw_data$event_date,
                          label = TRUE,
                          abbr = TRUE)

raw_data$event_year <- factor(year(raw_data$event_date))

raw_data$event_mday <- factor(paste(month(raw_data$event_date,
                             label = TRUE,
                             abbr = TRUE),
                             str_pad(mday(raw_data$event_date),2,side="left", pad=0),
                             sep = " "))

raw_data$checkin_wday <- factor(wday(raw_data$departure_checkin_date,
                          label = TRUE,
                          abbr = TRUE))

raw_data$checkin_year <- factor(year(raw_data$departure_checkin_date))

raw_data$checkin_month <- factor(month(raw_data$departure_checkin_date,
                             label = TRUE,
                             abbr = TRUE))

raw_data$traveler_value_group <- factor(raw_data$traveler_value_group, order = TRUE,
                                        levels = c("Low", "Med", "High"))

glimpse(raw_data)
Rows: 724,795
Columns: 16
$ event_date                 <date> 2019-01-07, 2019-01-16, 2021-01-~
$ activity_group             <chr> "Booking", "Searching", "Searchin~
$ reason_for_travel_detailed <chr> "Leisure - Non-Family", "Business~
$ origin_country_code        <chr> "DE", "GB", "DE", "US", "DE", "DE~
$ hotel_city                 <chr> "dusseldorf", "dusseldorf", "duss~
$ hotel_state                <chr> NA, NA, NA, NA, NA, NA, NA, NA, N~
$ hotel_country              <chr> "de", "de", "de", "de", "de", "de~
$ traveler_value_group       <ord> Med, Med, Low, Low, Low, Low, Hig~
$ departure_checkin_date     <date> 2019-01-25, 2019-02-05, 2021-01-~
$ total_number_events        <dbl> 1, 1, 1, 2, 2, 1, 1, 1, 3, 1, 1, ~
$ event_wday                 <ord> Mon, Wed, Thu, Thu, Thu, Sun, Thu~
$ event_year                 <fct> 2019, 2019, 2021, 2021, 2019, 202~
$ event_mday                 <fct> Jan 07, Jan 16, Jan 28, Jan 28, J~
$ checkin_wday               <ord> Fri, Tue, Fri, Sun, Thu, Fri, Mon~
$ checkin_year               <fct> 2019, 2019, 2021, 2021, 2019, 202~
$ checkin_month              <ord> Jan, Feb, Jan, Jul, Jan, Mar, Mar~

Event Year EDA

hide
color_2019 <- "darkgray"
color_2021 <- "purple"
color_bg <- "snow"

# event_year_cnt <- raw_data %>%
#   count(event_year) %>%
#   plot_ly(x=~event_year,
#           y=~n,
#           type= 'bar',
#           marker = list(color = color_2021),
#           hovertemplate = paste('Event Year: %{x}<br>',
#                                 'Count: %{y}',
#                                 '<extra></extra>')) %>%
#   layout(xaxis = list(title = ""),
#          yaxis = list(title = "Total Number of Events"),
#          hoverlabel=list(bgcolor=color_bg))
#event_year_cnt

event_year_data <- raw_data %>%
  group_by(event_year) %>%
  summarise(event_year_sum = sum(total_number_events))

event_year_total <- event_year_data %>%
  plot_ly(x=~event_year,
          y=~event_year_sum,
          type= 'bar',
          marker = list(color = c(color_2019,color_2021)),
          hovertemplate = paste('Event Year: %{x}<br>',
                                'Total Number of Events: %{y}',
                                '<extra></extra>')) %>%
  layout(title = "Event Year EDA",
         xaxis = list(title = ""),
         yaxis = list(title = "Total Number of Events"),
         hoverlabel=list(bgcolor=color_bg))

event_year_total
hide
event_year_data$pct_total <- prop.table(event_year_data$event_year_sum) 

DT::datatable(data = event_year_data,
              options = list(pageLength = 2),
              colnames = c("Event Year", "Total Number of Events", "Percent of Total"),
              rownames = FALSE,
              filter = "none",
              autoHideNavigation = TRUE) %>%
  formatRound('event_year_sum', digits = 0) %>%
  formatPercentage('pct_total', 2)

Event Month Day EDA

hide
event_mday_2019 <- raw_data %>%
  filter(event_year == 2019) %>%
  group_by(event_mday) %>%
  summarise(mday_2019_sum = sum(total_number_events)) %>%
  arrange(event_mday)

event_mday_2021 <- raw_data %>%
  filter(event_year == 2021) %>%
  group_by(event_mday) %>%
  summarise(mday_2021_sum = sum(total_number_events)) %>%
  arrange(event_mday)

event_mday_data <- left_join(event_mday_2019,event_mday_2021)
  
event_mday_total <- event_mday_data %>%
  plot_ly(x=~event_mday,
          y=~mday_2019_sum ,
          type = 'scatter',
          mode = 'lines+markers',
          line = list(color = color_2019),
          marker = list(color = color_2019),
          name = '2019',
          hovertemplate = paste('Event Month Day: %{x}, 2019<br>',
                                'Total Number of Events: %{y}',
                                '<extra></extra>')) %>%
  add_trace(y =~mday_2021_sum,
            mode = 'lines+markers',
            line = list(color = color_2021),
            marker = list(color = color_2021),
            name = '2021',
            hovertemplate = paste('Event Month Day: %{x}, 2021<br>',
                                  'Total Number of Events: %{y}',
                                  '<extra></extra>')) %>%
  layout(title = "Event Month Day EDA",
         xaxis = list(title = ""),
         yaxis = list(title = "Total Number of Events"),
         hoverlabel=list(bgcolor=color_bg))

event_mday_total
hide
event_mday_data$diff_val <- event_mday_data$mday_2021_sum - event_mday_data$mday_2019_sum
event_mday_data$diff_pct <- event_mday_data$diff_val/event_mday_data$mday_2019_sum

DT::datatable(data = event_mday_data,
              options = list(pageLength = 31),
              colnames = c("Event Day of Month", "2019 Total Events", "2021 Total Events", "Difference (Values)" ,"Difference (%)"),
              rownames = FALSE,
              filter = "none",
              autoHideNavigation = TRUE) %>%
  formatRound('mday_2019_sum', digits = 0) %>%
  formatRound('mday_2021_sum', digits = 0) %>%
  formatRound('diff_val', digits = 0) %>%
  formatPercentage('diff_pct', 2)

Event Weekday EDA

hide
event_wday_2019 <- raw_data %>%
  filter(event_year == 2019) %>%
  group_by(event_wday) %>%
  summarise(wday_2019_sum = sum(total_number_events))

event_wday_2021 <- raw_data %>%
  filter(event_year == 2021) %>%
  group_by(event_wday) %>%
  summarise(wday_2021_sum = sum(total_number_events))

event_wday_data <- left_join(event_wday_2019,event_wday_2021)
  
event_wday_total <- event_wday_data %>%
  plot_ly(x=~event_wday,
          y=~wday_2019_sum ,
          type= 'bar',
          marker = list(color = color_2019),
          name = '2019',
          hovertemplate = paste('Event Week Day: %{x} 2019<br>',
                                'Total Number of Events: %{y}',
                                '<extra></extra>')) %>%
  add_trace(y =~wday_2021_sum,
            marker = list(color = color_2021),
            name = '2021',
            hovertemplate = paste('Event Week Day: %{x} 2021<br>',
                                  'Total Number of Events: %{y}',
                                  '<extra></extra>')) %>%
  layout(title = "Event Weekday EDA",
         xaxis = list(title = ""),
         yaxis = list(title = "Total Number of Events"),
         hoverlabel=list(bgcolor=color_bg))

event_wday_total
hide
event_wday_data$diff <- event_wday_data$wday_2021_sum - event_wday_data$wday_2019_sum
event_wday_data$pct_diff <- event_wday_data$diff/event_wday_data$wday_2019_sum

DT::datatable(data = event_wday_data,
              options = list(pageLength = 7),
              colnames = c("Event Weekday", "2019 Total Events", "2021 Total Events", "Difference (Values)" ,"Difference (%)"),
              rownames = FALSE,
              filter = "none",
              autoHideNavigation = TRUE) %>%
  formatRound('wday_2019_sum', digits = 0) %>%
  formatRound('wday_2021_sum', digits = 0) %>%
  formatRound('diff', digits = 0) %>%
  formatPercentage('pct_diff', 2)

Activity Group EDA

hide
activity_data <- raw_data %>%
  group_by(activity_group) %>%
  summarise(activity_sum = sum(total_number_events))

activity_total <- activity_data %>%
  plot_ly(x=~activity_group,
          y=~activity_sum,
          type= 'bar',
          marker = list(color = c(color_2021,color_2019)),
          hovertemplate = paste('Activity Group: %{x}<br>',
                                'Total Number of Events: %{y}',
                                '<extra></extra>')) %>%
  layout(title = "Activity Group EDA",
         xaxis = list(title = ""),
         yaxis = list(title = "Total Number of Events"),
         hoverlabel=list(bgcolor=color_bg))

activity_total
hide
activity_data$pct_total <- prop.table(activity_data$activity_sum) 

DT::datatable(data = activity_data,
              options = list(pageLength = 2),
              colnames = c("Activity Group", "Total Number of Events", "Percent of Total"),
              rownames = FALSE,
              filter = "none",
              autoHideNavigation = TRUE) %>%
  formatRound('activity_sum', digits = 0) %>%
  formatPercentage('pct_total', 2)

Reason for Travel Detailed EDA

hide
color_biz <- "orchid" #business
color_lfam <- "darkorchid"  #leisure-family
color_lnf  <- "purple"  #leisure-non-family

reason_travel_data <- raw_data %>%
  group_by(reason_for_travel_detailed) %>%
  summarise(reason_travel_sum = sum(total_number_events)) %>%
  arrange(desc(reason_travel_sum))

activity_total <- reason_travel_data %>%
  plot_ly(x=~reorder(reason_for_travel_detailed,desc(reason_travel_sum)),
          y=~reason_travel_sum,
          type= 'bar',
          marker = list(color = c(color_biz,color_lfam, color_lnf)),
          hovertemplate = paste('Reason For Travel Detailed: %{x}<br>',
                                'Total Number of Events: %{y}',
                                '<extra></extra>')) %>%
  layout(title = "Reason for Travel Detailed EDA",
         xaxis = list(title = ""),
         yaxis = list(title = "Total Number of Events"),
         hoverlabel=list(bgcolor=color_bg))

activity_total
hide
reason_travel_data$pct_total <- prop.table(reason_travel_data$reason_travel_sum)

DT::datatable(data = reason_travel_data,
              options = list(pageLength = 4),
              colnames = c("Reason for Travel Detailed", "Total Number of Events", "Percent of Total"),
              rownames = FALSE,
              filter = "none",
              autoHideNavigation = TRUE) %>%
  formatRound('reason_travel_sum', digits = 0) %>%
  formatPercentage('pct_total', 2)

Origin Country Code

hide
origin_country_data <- raw_data %>%
  group_by(origin_country_code) %>%
  summarise(origin_country_sum = sum(total_number_events)) %>%
  arrange(desc(origin_country_sum))

origin_country_total <- origin_country_data %>%
  slice_head(n = 10) %>%
  plot_ly(x=~origin_country_sum,
          y=~reorder(origin_country_code,origin_country_sum),
          type= 'bar',
          marker = list(color = color_2021),
          hovertemplate = paste('Origin Country: %{y}<br>',
                                'Total Number of Events: %{x}',
                                '<extra></extra>')) %>%
  layout(title = "Top 10 Origin Country Code EDA",
         yaxis = list(title = ""),
         xaxis = list(title = "Total Number of Events"),
         hoverlabel=list(bgcolor=color_bg))

origin_country_total
hide
origin_country_data$pct_total <- prop.table(origin_country_data$origin_country_sum)

DT::datatable(data = origin_country_data,
              options = list(pageLength = 10),
              colnames = c("Origin Country Code", "Total Number of Events", "Percent of Total"),
              #rownames = FALSE,
              filter = "none",
              autoHideNavigation = TRUE) %>%
  formatRound('origin_country_sum', digits = 0) %>%
  formatPercentage('pct_total', 2)

Hotel City

hide
hotel_city_data <- raw_data %>%
  group_by(hotel_city) %>%
  summarise(hotel_city_sum = sum(total_number_events)) %>%
  arrange(desc(hotel_city_sum))

hotel_city_total <- hotel_city_data %>%
  slice_head(n = 10) %>%
  plot_ly(x=~hotel_city_sum,
          y=~reorder(hotel_city,hotel_city_sum),
          type= 'bar',
          marker = list(color = color_2021),
          hovertemplate = paste('Hotel City: %{y}<br>',
                                'Total Number of Events: %{x}',
                                '<extra></extra>')) %>%
  layout(title = "Top 10 Hotel City EDA",
         yaxis = list(title = ""),
         xaxis = list(title = "Total Number of Events"),
         hoverlabel=list(bgcolor=color_bg))

hotel_city_total
hide
hotel_city_data$pct_total <- prop.table(hotel_city_data$hotel_city_sum)

DT::datatable(data = hotel_city_data,
              options = list(pageLength = 10),
              colnames = c("Hotel City", "Total Number of Events", "Percent of Total"),
              #rownames = FALSE,
              filter = "none",
              autoHideNavigation = TRUE) %>%
  formatRound('hotel_city_sum', digits = 0) %>%
  formatPercentage('pct_total', 2)

Hotel State

hide
hotel_state_data <- raw_data %>%
  group_by(hotel_state) %>%
  summarise(hotel_state_sum = sum(total_number_events)) %>%
  arrange(desc(hotel_state_sum))

hotel_state_total <- hotel_state_data %>%
  slice_head(n = 10) %>%
  plot_ly(x=~hotel_state_sum,
          y=~reorder(hotel_state,hotel_state_sum),
          type= 'bar',
          marker = list(color = color_2021),
          hovertemplate = paste('Hotel State: %{y}<br>',
                                'Total Number of Events: %{x}',
                                '<extra></extra>')) %>%
  layout(title = "Top 10 Hotel State EDA",
         yaxis = list(title = ""),
         xaxis = list(title = "Total Number of Events"),
         hoverlabel=list(bgcolor=color_bg))

hotel_state_total
hide
hotel_state_data$pct_total <- prop.table(hotel_state_data$hotel_state_sum)

DT::datatable(data = hotel_state_data,
              options = list(pageLength = 10),
              colnames = c("Hotel State", "Total Number of Events", "Percent of Total"),
              #rownames = FALSE,
              filter = "none",
              autoHideNavigation = TRUE) %>%
  formatRound('hotel_state_sum', digits = 0) %>%
  formatPercentage('pct_total', 2)

Hotel Country

hide
hotel_country_data <- raw_data %>%
  group_by(hotel_country) %>%
  summarise(hotel_country_sum = sum(total_number_events)) %>%
  arrange(desc(hotel_country_sum))

hotel_country_total <- hotel_country_data %>%
  slice_head(n = 10) %>%
  plot_ly(x=~hotel_country_sum,
          y=~reorder(hotel_country,hotel_country_sum),
          type= 'bar',
          marker = list(color = color_2021),
          hovertemplate = paste('Hotel Country: %{y}<br>',
                                'Total Number of Events: %{x}',
                                '<extra></extra>')) %>%
  layout(title = "Top 10 Hotel Country EDA",
         yaxis = list(title = ""),
         xaxis = list(title = "Total Number of Events"),
         hoverlabel=list(bgcolor=color_bg))

hotel_country_total
hide
hotel_country_data$pct_total <- prop.table(hotel_country_data$hotel_country_sum)

DT::datatable(data = hotel_country_data,
              options = list(pageLength = 10),
              colnames = c("Hotel Country", "Total Number of Events", "Percent of Total"),
              #rownames = FALSE,
              filter = "none",
              autoHideNavigation = TRUE) %>%
  formatRound('hotel_country_sum', digits = 0) %>%
  formatPercentage('pct_total', 2)

Traveler Value Group EDA

hide
color_low <- "orchid" #business
color_med <- "darkorchid"  #leisure-family
color_high  <- "purple"  #leisure-non-family

traveler_value_data <- raw_data %>%
  group_by(traveler_value_group) %>%
  summarise(traveler_value_sum = sum(total_number_events))%>%
  arrange(desc(traveler_value_sum))

traveler_value_total <- traveler_value_data %>%
  plot_ly(x=~reorder(traveler_value_group,desc(traveler_value_sum)),
          y=~traveler_value_sum,
          type= 'bar',
          marker = list(color = c(color_low,color_med, color_high)),
          hovertemplate = paste('Traveler Value Group: %{x}<br>',
                                'Total Number of Events: %{y}',
                                '<extra></extra>')) %>%
  layout(title = "Traveler Value Group EDA",
         xaxis = list(title = ""),
         yaxis = list(title = "Total Number of Events"),
         hoverlabel=list(bgcolor=color_bg))

traveler_value_total
hide
traveler_value_data$pct_total <- prop.table(traveler_value_data$traveler_value_sum)

DT::datatable(data = traveler_value_data,
              options = list(pageLength = 4),
              colnames = c("Traveler Value Group", "Total Number of Events", "Percent of Total"),
              rownames = FALSE,
              filter = "none",
              autoHideNavigation = TRUE) %>%
  formatRound('traveler_value_sum', digits = 0) %>%
  formatPercentage('pct_total', 2)

Checkin Year EDA

hide
checkin_year_data <- raw_data %>%
  group_by(checkin_year) %>%
  summarise(checkin_year_sum = sum(total_number_events))

checkin_year_total <- checkin_year_data %>%
  plot_ly(x=~checkin_year,
          y=~checkin_year_sum,
          type= 'bar',
          marker = list(color = c(color_2019,color_low, color_2021, color_low, color_low, color_low)),
          hovertemplate = paste('Checkin Year: %{x}<br>',
                                'Total Number of Events: %{y}',
                                '<extra></extra>')) %>%
  layout(title = "Checkin Year EDA",
         xaxis = list(title = ""),
         yaxis = list(title = "Total Number of Events"),
         hoverlabel=list(bgcolor=color_bg))

checkin_year_total
hide
checkin_year_data$pct_total <- prop.table(checkin_year_data$checkin_year_sum) 

DT::datatable(data = checkin_year_data,
              options = list(pageLength = 10),
              colnames = c("Checkin Year", "Total Number of Events", "Percent of Total"),
              rownames = FALSE,
              filter = "none",
              autoHideNavigation = TRUE) %>%
  formatRound('checkin_year_sum', digits = 0) %>%
  formatPercentage('pct_total', 2)

Checkin Month Day EDA

hide
checkin_month_data <- raw_data %>%
  group_by(checkin_month) %>%
  summarise(checkin_month_sum = sum(total_number_events)) 

event_month_total <- checkin_month_data %>%
  plot_ly(x=~checkin_month,
          y=~checkin_month_sum,
          type= 'bar',
          marker = list(color = color_2021),
          hovertemplate = paste('Checkin Month: %{x}<br>',
                                'Total Number of Events: %{y}',
                                '<extra></extra>')) %>%
  layout(title = "Checkin Month EDA",
         xaxis = list(title = ""),
         yaxis = list(title = "Total Number of Events"),
         hoverlabel=list(bgcolor=color_bg))

event_month_total
hide
checkin_month_data$pct_total <- prop.table(checkin_month_data$checkin_month_sum) 

DT::datatable(data = checkin_month_data,
              options = list(pageLength = 12),
              colnames = c("Checkin Month", "Total Number of Events", "Percent of Total"),
              rownames = FALSE,
              filter = "none",
              autoHideNavigation = TRUE) %>%
  formatRound('checkin_month_sum', digits = 0) %>%
  formatPercentage('pct_total', 2)

Total Number of Events Histogram

hide
total_event_data <- raw_data %>%
  count(total_number_events) %>%
  rename(count = n) %>%
  arrange(desc(count))

total_event_hist <- total_event_data %>%
  slice_head(n = 10) %>%
  plot_ly(x=~count,
          y=~reorder(total_number_events,count),
          type= 'bar',
          marker = list(color = color_2021),
          hovertemplate = paste('Total Number of Events: %{y} <br>',
                                'Count: %{x}',
                                '<extra></extra>')) %>%
  layout(title = "Top 10 Total Number of Events Histogram",
         xaxis = list(title = "Count"),
         yaxis = list(title = "Total Number of Events"),
         hoverlabel=list(bgcolor=color_bg))

total_event_hist
hide
total_event_bp <- raw_data %>%
  select(total_number_events) %>%
  plot_ly(y= ~total_number_events,
          type= 'box',
          name = "Suspected Outliers",
          boxpoints = 'suspectedoutliers',
          marker = list(color = color_2021,
                        outliercolor = color_2019,
                        line = list(outliercolor = color_2019,
                                    outlierwidth = 5)),
          line = list(color = color_2021)) %>%
  layout(title = "Total Number of Events Boxplot",
         xaxis = list(title = ""),
         yaxis = list(title = "Total Number of Events"),
         hoverlabel=list(bgcolor=color_bg))

total_event_bp
hide
total_event_data$pct_total <- prop.table(total_event_data$count)

DT::datatable(data = total_event_data,
              options = list(pageLength = 10),
              colnames = c("Total Number of Events", "Count", "Percent of Total"),
              #rownames = FALSE,
              filter = "none",
              autoHideNavigation = TRUE) %>%
  formatRound('count', digits = 0) %>%
  formatPercentage('pct_total', 2)

1.3 Data Cleaning

Clean Hotel City

hide
clean_data <- raw_data

#Replace Hotel City
clean_data <- clean_data %>%
  mutate(hotel_city = ifelse(str_detect(hotel_city, "dusseldorf"), "duesseldorf", hotel_city)) %>%
  mutate(hotel_city = ifelse(str_detect(hotel_city, "firenze"), "florence", hotel_city)) %>%
  mutate(hotel_city = ifelse(str_detect(hotel_city, "kln"), "koeln", hotel_city)) %>%
  mutate(hotel_city = ifelse(str_detect(hotel_city, "mnchen"), "munich", hotel_city)) %>%
  mutate(hotel_city = ifelse(str_detect(hotel_city, "seville"), "sevilla", hotel_city)) %>%
  mutate(hotel_city = ifelse(str_detect(hotel_city, "venezia"), "venice", hotel_city))


hotel_city_clean_data <- clean_data %>%
  group_by(hotel_city) %>%
  summarise(hotel_city_sum = sum(total_number_events)) %>%
  arrange(desc(hotel_city_sum))

hotel_city_clean_data$pct_total <- prop.table(hotel_city_clean_data$hotel_city_sum)

DT::datatable(data = hotel_city_clean_data,
              options = list(pageLength = 10),
              colnames = c("Hotel City", "Total Number of Events", "Percent of Total"),
              #rownames = FALSE,
              filter = "none",
              autoHideNavigation = TRUE) %>%
  formatRound('hotel_city_sum', digits = 0) %>%
  formatPercentage('pct_total', 2)

Clean Hotel Country

hide
#Replace Hotel Country
clean_data <- clean_data %>%
  mutate(hotel_country = ifelse(str_detect(hotel_country, "alemania"), "de", hotel_country)) %>%
  mutate(hotel_country = ifelse(str_detect(hotel_country, "allemagne"), "de", hotel_country)) %>%
  mutate(hotel_country = ifelse(str_detect(hotel_country, "germania"), "de", hotel_country)) %>%
  mutate(hotel_country = ifelse(str_detect(hotel_country, "espaa"), "es", hotel_country)) %>%
  mutate(hotel_country = ifelse(str_detect(hotel_country, "catalunya"), "es", hotel_country)) %>%
  mutate(hotel_country = ifelse(str_detect(hotel_country, "gran bretaa"), "gb", hotel_country)) %>%
  mutate(hotel_country = ifelse(str_detect(hotel_country, "gran bretagna"), "gb", hotel_country)) %>%
  mutate(hotel_country = ifelse(str_detect(hotel_country, "grande bretagne"), "gb", hotel_country)) %>%
  mutate(hotel_country = ifelse(str_detect(hotel_country, "grobritannien"), "gb", hotel_country)) %>%
  mutate(hotel_country = ifelse(str_detect(hotel_country, "regno unito"), "gb", hotel_country)) %>%    
  mutate(hotel_country = ifelse(str_detect(hotel_country, "reino unido"), "gb", hotel_country)) %>%
  mutate(hotel_country = ifelse(str_detect(hotel_country, "royaume-uni"), "gb", hotel_country)) %>%
  mutate(hotel_country = ifelse(str_detect(hotel_country, "sct"), "gb", hotel_country)) %>%      
  mutate(hotel_country = ifelse(str_detect(hotel_country, "on"), "gb", hotel_country)) %>%   #london
  mutate(hotel_country = ifelse(str_detect(hotel_country, "ns"), "gb", hotel_country)) %>%   #liverpool
  mutate(hotel_country = ifelse(str_detect(hotel_country, "jp"), "fr", hotel_country)) %>%   #bordeaux
  mutate(hotel_country = ifelse(str_detect(hotel_country, "br"), "es", hotel_country)) %>%   #barcelona
  mutate(hotel_country = ifelse(str_detect(hotel_country, "co"), "es", hotel_country)) #madrid

hotel_country_clean_data <- clean_data %>%
  group_by(hotel_country) %>%
  summarise(hotel_country_sum = sum(total_number_events)) %>%
  arrange(desc(hotel_country_sum))

hotel_country_clean_data$pct_total <- prop.table(hotel_country_clean_data$hotel_country_sum)

DT::datatable(data = hotel_country_clean_data,
              options = list(pageLength = 11),
              colnames = c("Hotel Country", "Total Number of Events", "Percent of Total"),
              #rownames = FALSE,
              filter = "none",
              autoHideNavigation = TRUE) %>%
  formatRound('hotel_country_sum', digits = 0) %>%
  formatPercentage('pct_total', 2)

Issues:

Final Clean Data
hide
clean_data <- clean_data %>%
  select(-c(hotel_state,hotel_country))

write_csv(clean_data,"./data/clean_data.csv")

glimpse(clean_data)
Rows: 724,795
Columns: 14
$ event_date                 <date> 2019-01-07, 2019-01-16, 2021-01-~
$ activity_group             <chr> "Booking", "Searching", "Searchin~
$ reason_for_travel_detailed <chr> "Leisure - Non-Family", "Business~
$ origin_country_code        <chr> "DE", "GB", "DE", "US", "DE", "DE~
$ hotel_city                 <chr> "duesseldorf", "duesseldorf", "du~
$ traveler_value_group       <ord> Med, Med, Low, Low, Low, Low, Hig~
$ departure_checkin_date     <date> 2019-01-25, 2019-02-05, 2021-01-~
$ total_number_events        <dbl> 1, 1, 1, 2, 2, 1, 1, 1, 3, 1, 1, ~
$ event_wday                 <ord> Mon, Wed, Thu, Thu, Thu, Sun, Thu~
$ event_year                 <fct> 2019, 2019, 2021, 2021, 2019, 202~
$ event_mday                 <fct> Jan 07, Jan 16, Jan 28, Jan 28, J~
$ checkin_wday               <ord> Fri, Tue, Fri, Sun, Thu, Fri, Mon~
$ checkin_year               <fct> 2019, 2019, 2021, 2021, 2019, 202~
$ checkin_month              <ord> Jan, Feb, Jan, Jul, Jan, Mar, Mar~

2. Insights Discovery

2.1 Booking

hide
booking_2019_data <- clean_data %>%
  filter(activity_group == "Booking") %>%
  filter(event_year == 2019) %>%
  group_by(hotel_city) %>%
  summarise(booking_2019_sum = sum(total_number_events))

booking_2021_data <- clean_data %>%
  filter(activity_group == "Booking") %>%
  filter(event_year == 2021) %>%
  group_by(hotel_city) %>%
  summarise(booking_2021_sum = sum(total_number_events))

booking_city_data <- left_join(booking_2019_data, booking_2021_data)

booking_city_total <- booking_city_data %>%
 plot_ly(x=~booking_2021_sum,
         y=~reorder(hotel_city,booking_2021_sum),
         type= 'bar',
         marker = list(color = color_2021),
         name = '2021',
         hovertemplate = paste('Hotel City: %{y}<br>',
                               '2021 Total Number of Events: %{x}',
                               '<extra></extra>')) %>%
  add_trace(x=~booking_2019_sum,
            marker = list(color = color_2019),
            name = '2019',
            hovertemplate = paste('Hotel City: %{y}<br>',
                               '2019 Total Number of Events: %{x}',
                               '<extra></extra>')) %>%
  layout(title = "Booking Top Destinations by City",
         yaxis = list(title = ""),
         xaxis = list(title = "Total Number of Events"),
         hoverlabel=list(bgcolor=color_bg))

booking_city_total
hide
booking_city_data$diff <- booking_city_data$booking_2021_sum - booking_city_data$booking_2019_sum
booking_city_data$pct_diff <- booking_city_data$diff / booking_city_data$booking_2019_sum

DT::datatable(data = booking_city_data %>%
                arrange(desc(pct_diff)),
              options = list(pageLength = 10),
              colnames = c("Booking Destination by City", "2019 Total Events", "2021 Total Events", "Difference (Values)" ,"Difference (%)"),
              rownames = FALSE,
              filter = "none",
              autoHideNavigation = TRUE) %>%
  formatRound('booking_2019_sum', digits = 0) %>%
  formatRound('booking_2021_sum', digits = 0) %>%
  formatRound('diff', digits = 0) %>%
  formatPercentage('pct_diff', 2)
hide
booking_city_diff <- booking_city_data %>%
 plot_ly(x=~pct_diff,
         y=~reorder(hotel_city,pct_diff),
         type= 'bar',
         marker = list(color = color_2021),
         hovertemplate = paste('Hotel City: %{y}<br>',
                               '2021-2019 Difference(%): %{x}',
                               '<extra></extra>')) %>%
  layout(title = "Destinations by City Percentage Change (2021 vs. 2019)",
         yaxis = list(title = ""),
         xaxis = list(title = "Percentage Change(%)",tickformat = "%"),
         hoverlabel=list(bgcolor=color_bg))

booking_city_diff
hide
booking_country_2019_data <- clean_data %>%
  filter(activity_group == "Booking") %>%
  filter(event_year == 2019) %>%
  group_by(origin_country_code) %>%
  summarise(booking_country_2019_sum = sum(total_number_events))

booking_country_2021_data <- clean_data %>%
  filter(activity_group == "Booking") %>%
  filter(event_year == 2021) %>%
  group_by(origin_country_code) %>%
  summarise(booking_country_2021_sum = sum(total_number_events))

booking_country_data <- full_join(booking_country_2019_data, booking_country_2021_data)

booking_country_total <- booking_country_data %>%
  slice_head(n = 10) %>%
  plot_ly(x=~booking_country_2021_sum,
         y=~reorder(origin_country_code,booking_country_2021_sum),
         type= 'bar',
         marker = list(color = color_2021),
         name = '2021',
         hovertemplate = paste('Origin Country: %{y}<br>',
                               '2021 Total Number of Events: %{x}',
                               '<extra></extra>')) %>%
  add_trace(x=~booking_country_2019_sum,
            marker = list(color = color_2019),
            name = '2019',
            hovertemplate = paste('Origin Country: %{y}<br>',
                               '2019 Total Number of Events: %{x}',
                               '<extra></extra>')) %>%
  layout(title = "Top 10 Origin Country",
         yaxis = list(title = ""),
         xaxis = list(title = "Total Number of Events"),
         hoverlabel=list(bgcolor=color_bg))

booking_country_total
hide
booking_country_data$diff <- booking_country_data$booking_country_2021_sum - booking_country_data$booking_country_2019_sum
booking_country_data$pct_diff <- booking_country_data$diff / booking_country_data$booking_country_2019_sum

DT::datatable(data = booking_country_data %>%
                arrange(desc(booking_country_2021_sum)),
              options = list(pageLength = 10),
              colnames = c("Booking Origin Country", "2019 Total Events", "2021 Total Events", "Difference (Values)" ,"Difference (%)"),
              rownames = FALSE,
              filter = "none",
              autoHideNavigation = TRUE) %>%
  formatRound('booking_country_2019_sum', digits = 0) %>%
  formatRound('booking_country_2021_sum', digits = 0) %>%
  formatRound('diff', digits = 0) %>%
  formatPercentage('pct_diff', 2)

2.2 Searching

hide
searching_2019_data <- clean_data %>%
  filter(activity_group == "Searching") %>%
  filter(event_year == 2019) %>%
  group_by(hotel_city) %>%
  summarise(searching_2019_sum = sum(total_number_events))

searching_2021_data <- clean_data %>%
  filter(activity_group == "Searching") %>%
  filter(event_year == 2021) %>%
  group_by(hotel_city) %>%
  summarise(searching_2021_sum = sum(total_number_events))

searching_city_data <- left_join(searching_2019_data, searching_2021_data)

searching_city_total <- searching_city_data %>%
 plot_ly(x=~searching_2021_sum,
         y=~reorder(hotel_city,searching_2021_sum),
         type= 'bar',
         marker = list(color = color_2021),
         name = '2021',
         hovertemplate = paste('Hotel City: %{y}<br>',
                               '2021 Total Number of Events: %{x}',
                               '<extra></extra>')) %>%
  add_trace(x=~searching_2019_sum,
            marker = list(color = color_2019),
            name = '2019',
            hovertemplate = paste('Hotel City: %{y}<br>',
                               '2019 Total Number of Events: %{x}',
                               '<extra></extra>')) %>%
  layout(title = "Searching Top Destinations by City",
         yaxis = list(title = ""),
         xaxis = list(title = "Total Number of Events"),
         hoverlabel=list(bgcolor=color_bg))

searching_city_total
hide
searching_city_data$diff <- searching_city_data$searching_2021_sum - searching_city_data$searching_2019_sum
searching_city_data$pct_diff <- searching_city_data$diff / searching_city_data$searching_2019_sum

DT::datatable(data = searching_city_data %>%
                arrange(desc(pct_diff)),
              options = list(pageLength = 10),
              colnames = c("Searching Destination by City", "2019 Total Events", "2021 Total Events", "Difference (Values)" ,"Difference (%)"),
              rownames = FALSE,
              filter = "none",
              autoHideNavigation = TRUE) %>%
  formatRound('searching_2019_sum', digits = 0) %>%
  formatRound('searching_2021_sum', digits = 0) %>%
  formatRound('diff', digits = 0) %>%
  formatPercentage('pct_diff', 2)
hide
searching_city_diff <- searching_city_data %>%
 plot_ly(x=~pct_diff,
         y=~reorder(hotel_city,pct_diff),
         type= 'bar',
         marker = list(color = color_2021),
         hovertemplate = paste('Hotel City: %{y}<br>',
                               '2021-2019 Difference(%): %{x}',
                               '<extra></extra>')) %>%
  layout(title = "Destinations by City Percentage Change (2021 vs. 2019)",
         yaxis = list(title = ""),
         xaxis = list(title = "Percentage Change(%)",tickformat = "%"),
         hoverlabel=list(bgcolor=color_bg))

searching_city_diff
hide
searching_country_2019_data <- clean_data %>%
  filter(activity_group == "Searching") %>%
  filter(event_year == 2019) %>%
  group_by(origin_country_code) %>%
  summarise(searching_country_2019_sum = sum(total_number_events))

searching_country_2021_data <- clean_data %>%
  filter(activity_group == "Searching") %>%
  filter(event_year == 2021) %>%
  group_by(origin_country_code) %>%
  summarise(searching_country_2021_sum = sum(total_number_events))

searching_country_data <- full_join(searching_country_2019_data, searching_country_2021_data)

searching_country_total <- searching_country_data %>%
  slice_head(n = 10) %>%
  plot_ly(x=~searching_country_2021_sum,
         y=~reorder(origin_country_code,searching_country_2021_sum),
         type= 'bar',
         marker = list(color = color_2021),
         name = '2021',
         hovertemplate = paste('Origin Country: %{y}<br>',
                               '2021 Total Number of Events: %{x}',
                               '<extra></extra>')) %>%
  add_trace(x=~searching_country_2019_sum,
            marker = list(color = color_2019),
            name = '2019',
            hovertemplate = paste('Origin Country: %{y}<br>',
                               '2019 Total Number of Events: %{x}',
                               '<extra></extra>')) %>%
  layout(title = "Top 10 Origin Country",
         yaxis = list(title = ""),
         xaxis = list(title = "Total Number of Events"),
         hoverlabel=list(bgcolor=color_bg))

searching_country_total
hide
searching_country_data$diff <- searching_country_data$searching_country_2021_sum - searching_country_data$searching_country_2019_sum
searching_country_data$pct_diff <- searching_country_data$diff / searching_country_data$searching_country_2019_sum

DT::datatable(data = searching_country_data %>%
                arrange(desc(searching_country_2021_sum)),
              options = list(pageLength = 10),
              colnames = c("Booking Origin Country", "2019 Total Events", "2021 Total Events", "Difference (Values)" ,"Difference (%)"),
              rownames = FALSE,
              filter = "none",
              autoHideNavigation = TRUE) %>%
  formatRound('searching_country_2019_sum', digits = 0) %>%
  formatRound('searching_country_2021_sum', digits = 0) %>%
  formatRound('diff', digits = 0) %>%
  formatPercentage('pct_diff', 2)

3. Actionable Recommendations

Citation

For attribution, please cite this work as

Dolit (2021, Sept. 11). FinTech & Analytics: Adara Challenge Analysis. Retrieved from https://adolit.github.io/posts/2021-09-11-adara-challenge-analysis/

BibTeX citation

@misc{dolit2021adara,
  author = {Dolit, Archie},
  title = {FinTech & Analytics: Adara Challenge Analysis},
  url = {https://adolit.github.io/posts/2021-09-11-adara-challenge-analysis/},
  year = {2021}
}