Case Study Description: To predict the trip duration of a New York taxi cab ride, we can build different types of features and evaluate them. We will start by describing what a feature is in this context; then we will develop some elementary features and add more using the featuretools software package. Finally, we will assess how these features perform in predicting trip duration.
1. Setup
Note:
This case study uses featuretoolsR, an R alternative to Python's featuretools package.
- To install featuretoolsR you need to have "devtools" installed.
- Then you can run: install_github("magnusfurugard/featuretoolsR") (see the commands below).
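For reference, the installation steps above can be run once as follows (assuming devtools is not yet installed):
install.packages("devtools")
devtools::install_github("magnusfurugard/featuretoolsR")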
# You will need to define (if not already) your Python path.
# reticulate::py_discover_config()
library("reticulate")
use_python("~/Library/r-miniconda/envs/r-reticulate/bin/python")
library("featuretoolsR")
library("dplyr")
library("gbm")
1.1 Functions
- featuretoolsR does not include the original encode_features Python function, so a custom one called custom_encode_features was written; it returns a one-hot encoding of the selected categorical variables.
- Additionally, custom functions were written to replace Python's SimpleImputer class and to compute R^2 scores.
# R^2 Calculation
r_squared <- function(predicted,actual) {
r_squared <- 1 - (sum((actual - predicted)^2) / sum((actual - mean(actual))^2))
return(r_squared)
}
# Custom encode_features to replace the missing function in featuretoolsR (one-hot encoding)
custom_encode_features <- function(feature_matrix, to_encode, include_unknown, top_n) {
temp <- feature_matrix
for(feature in to_encode) {
# We generate the list of top_n unique values for feature to encode
if(include_unknown == TRUE) {
encoded_feature <- names(head(sort(table(feature_matrix[[feature]],exclude = NULL),decreasing=TRUE),top_n))
} else {
encoded_feature <- names(head(sort(table(feature_matrix[[feature]]),decreasing=TRUE),top_n))
}
for(i in 1:length(encoded_feature)) {
temp <- temp %>%
mutate(
!!paste(feature,"=",encoded_feature[i]) := ifelse(feature_matrix[[feature]] == encoded_feature[i] , 1, 0)
)
}
}
return(temp[,!names(temp) %in% to_encode])
}
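As a quick illustration of what custom_encode_features produces, here is a toy example (not part of the case-study pipeline; it only assumes dplyr is loaded as above):
# Toy data frame with one categorical column
toy_colors <- data.frame(id = 1:4, color = c("red","blue","red","green"), stringsAsFactors = FALSE)
# Keep the two most frequent values of `color`; the original column is dropped
custom_encode_features(toy_colors, to_encode = c("color"), include_unknown = FALSE, top_n = 2)
# Result has columns id, `color = red`, plus one of the tied values (e.g. `color = blue`)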
# Compute features function based on dsx-cs6 utils
compute_features <- function(features,entities) {
# We create feature matrix using featuretools calculate_feature_matrix function
feature_matrix <- calculate_feature_matrix(entities,features,
approximate='36d',
verbose=TRUE)
# Since the encode_features function is missing, we one-hot encode pickup_neighborhood and dropoff_neighborhood
# with the custom function created above: custom_encode_features
print("Finishing computing...")
feature_matrix <- custom_encode_features(feature_matrix, to_encode=c("pickup_neighborhood","dropoff_neighborhood"), include_unknown=FALSE, top_n=1000)
# Return Output
return(feature_matrix)
}
# Logical to integer
logical_to_integer <- function(feature_matrix) {
# Convert logical values to integers
logical_features <- names(feature_matrix[,unlist(lapply(feature_matrix, is.logical))])
for(feature in logical_features) {
feature_matrix[[feature]] <- as.numeric(as.integer(feature_matrix[[feature]]))
}
return(feature_matrix)
}
logical_to_factor <- function(feature_matrix) {
# Convert logical values to factors
logical_features <- names(feature_matrix[,unlist(lapply(feature_matrix, is.logical))])
for(feature in logical_features) {
feature_matrix[[feature]] <- as.factor(feature_matrix[[feature]])
}
return(feature_matrix)
}
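# Custom fit: compute per-column means of the numeric features (emulating Python's SimpleImputer fit)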
custom_fit <- function(data) {
out <- list()
numeric_features <- names(data[,unlist(lapply(data, is.numeric))])
for(feature in numeric_features) {
out[[feature]] <- mean(data[[feature]] ,na.rm = TRUE)
}
return(out)
}
# Custom transform: impute missing numeric values with the fitted means
custom_transform <- function(data,fit) {
numeric_features <- names(fit)
for(feature in numeric_features) {
data[[feature]][which(is.na(data[[feature]]))] <-fit[[feature]]
}
return(data)
}
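Together, custom_fit and custom_transform emulate a mean-strategy imputer (the fitted means come from the training split only). A toy example, not part of the pipeline:
toy_num <- data.frame(a = c(1, NA, 3), b = c(10, 20, NA))
fit <- custom_fit(toy_num)        # list of column means: a = 2, b = 15
custom_transform(toy_num, fit)    # NAs replaced by the corresponding means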
# get_train_test_fm based on dsx-cs6 utils
get_train_test_fm <- function(feature_matrix, original_data, percentage) {
out <- list()
nrows <- nrow(feature_matrix)
head <- nrows * percentage
tail <- nrows - head
X_train <- head(feature_matrix,head)
y_train <- head(original_data$trip_duration,head)
# Emulating fit_transform
fit <- custom_fit(X_train)
X_train = custom_transform(X_train,fit)
X_test = tail(feature_matrix,tail)
y_test = tail(original_data$trip_duration,tail)
X_test = custom_transform(X_test,fit)
# Return values
out$X_train <- X_train
out$y_train <- y_train
out$X_test <- X_test
out$y_test <- y_test
return(out)
}
1.2 Data Load
# Needed to load the pickle file
source_python("read_pickle.py")
- For additional info on opening a pickle file in R, see the read_pickle.py helper sourced above (it is exposed to R via reticulate).
# We load data
trips <- read_pickle_file("./trips.pkl")
dropoff_neighborhoods <- read.csv("./dropoff_neighborhoods.csv")
pickup_neighborhoods <- read.csv("./pickup_neighborhoods.csv")
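If the read_pickle.py helper is unavailable, a roughly equivalent load can be done directly through reticulate's pandas binding (a sketch, assuming pandas is installed in the active Python environment):
pd <- import("pandas")
trips <- pd$read_pickle("./trips.pkl")  # converted to an R data.frame by reticulate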
# We set type
trips$pickup_neighborhood <- as.character(trips$pickup_neighborhood)
trips$dropoff_neighborhood <- as.character(trips$dropoff_neighborhood)
dropoff_neighborhoods$neighborhood_id <- as.character(dropoff_neighborhoods$neighborhood_id)
pickup_neighborhoods$neighborhood_id <- as.character(pickup_neighborhoods$neighborhood_id)
# Histogram
hist(trips$trip_duration, main="Trip Duration Distribution",
xlab="Trip Duration in Seconds",
ylab="Number of Trips",
col="lightblue")
# Tells us how many trips are in the dataset
paste("Trips in dataset:",nrow(trips))
As seen in the output above, there are 10,000 trips in the dataset. Additionally, the histogram is right-skewed, with a median around 616 seconds (about 10.3 minutes) and a mean around 746 seconds (about 12.4 minutes); the quick summary below confirms these figures.
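An optional numeric check of the distribution (mean greater than median is consistent with the right skew):
summary(trips$trip_duration)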
1.3 Entities and Relationships
# Create entityset
entities <- as_entityset(
trips,
index = "id",
entity_id = "trips",
id = "trips",
time_index = "pickup_datetime"
) %>%
add_entity(
pickup_neighborhoods,
entity_id = "pickup_neighborhoods",
index = "neighborhood_id"
) %>%
add_entity(
dropoff_neighborhoods,
entity_id = "dropoff_neighborhoods",
index = "neighborhood_id"
)
# Add index relationships
entities <- entities %>%
add_relationship(
parent_set = "pickup_neighborhoods",
child_set = "trips",
parent_idx = "neighborhood_id",
child_idx = "pickup_neighborhood"
) %>%
add_relationship(
parent_set = "dropoff_neighborhoods",
child_set = "trips",
parent_idx = "neighborhood_id",
child_idx = "dropoff_neighborhood"
)
2. First Model
2.1 Transform Primitives
trans_primitives <- c("is_weekend")
features <- dfs(
entityset=entities,
target_entity="trips",
trans_primitives=trans_primitives,
agg_primitives=NULL,
ignore_variables= list(trips=c("pickup_latitude", "pickup_longitude",
"dropoff_latitude", "dropoff_longitude", "trip_duration")),
features_only=TRUE
)
print(paste("Number of features:",length(features)))
Even though some variables were excluded, several new transformed/generated features were created:
- Pickup neighborhood was transformed (hot encoded) from a categorical variable into many binary features, one for each neighborhood.
- A completely new transformed feature containing whether the trip was made on a weekend was generated.
With the addition of these features we expect improved model performance, especially from the engineered IS_WEEKEND feature: traffic flow and travel behavior typically differ between working days and weekends, so this feature is likely to influence trip_duration and lead to better predictions. The sketch below shows roughly what IS_WEEKEND computes.
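For intuition, the IS_WEEKEND transform primitive is roughly equivalent to the following base-R computation on the trips table (a sketch, assuming pickup_datetime was converted to POSIXct by reticulate; this is not the featuretools implementation):
# Saturday/Sunday flag derived from the pickup timestamp (%u: Monday = 1 ... Sunday = 7)
is_weekend_manual <- format(trips$pickup_datetime, "%u") %in% c("6", "7")
table(is_weekend_manual)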
2.2 Dataset Creation
feature_matrix <- compute_features(features, entities)
head(feature_matrix,5)
2.3 Split Dataset
train_test = get_train_test_fm(feature_matrix, trips, .75)
y_train = log(train_test$y_train + 1)
y_test = log(train_test$y_test + 1)
X_train <- logical_to_integer(train_test$X_train)
X_test <- logical_to_integer(train_test$X_test)
trainingDataset = cbind(y_train,X_train)
print('Data split successful!')
2.4 Fitting Generalized Boosted Regression Model
gbm <- gbm(y_train ~ ., data = trainingDataset,verbose = TRUE,
n.trees = 100,
distribution="gaussian",
interaction.depth = 3,
n.minobsinnode = 1,
shrinkage = 0.1)
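An optional check on whether 100 trees are enough is gbm's out-of-bag estimate of the optimal number of boosting iterations (the OOB method tends to be conservative):
# Estimate the best iteration from the out-of-bag improvement
best_iter <- gbm.perf(gbm, method = "OOB", plot.it = TRUE)
print(best_iter)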
2.5 Results from predictions
R^2
predicted_gbm <- predict(gbm,X_test,n.trees = 100)
paste("R Squared:",round(r_squared(y_test,predicted_gbm),4))
2.6 Feature Influence
head(round(relative.influence(gbm,sort= TRUE)/sum(relative.influence(gbm)),4),25)
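An equivalent view of feature influence is gbm's summary method, which normalizes relative influence to sum to 100 and can optionally draw a bar chart:
# Top 25 features by relative influence (percent)
head(summary(gbm, n.trees = 100, plotit = FALSE), 25)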
3. Second Model
3.1 Transform Primitives
trans_primitives <- c("minute","hour","day","week","month","weekday","is_weekend")
features <- dfs(
entityset=entities,
target_entity="trips",
trans_primitives=trans_primitives,
agg_primitives=NULL,
ignore_variables= list(trips=c("pickup_latitude", "pickup_longitude",
"dropoff_latitude", "dropoff_longitude", "trip_duration")),
features_only=TRUE
)
print(paste("Number of features:",length(features)))
3.2 Dataset Creation
feature_matrix <- compute_features(features, entities)
head(feature_matrix,5)
3.3 Split Dataset
train_test = get_train_test_fm(feature_matrix, trips, .75)
y_train = log(train_test$y_train + 1)
y_test = log(train_test$y_test + 1)
X_train <- logical_to_integer(train_test$X_train)
X_test <- logical_to_integer(train_test$X_test)
trainingDataset = cbind(y_train,X_train)
print('Data split successful!')
3.4 Fitting Generalized Boosted Regression Model
gbm <- gbm(y_train ~ ., data = trainingDataset,verbose = TRUE,
n.trees = 100,
distribution="gaussian",
interaction.depth = 3,
n.minobsinnode = 1,
shrinkage = 0.1)
3.5 Results from predictions
R^2
predicted_gbm <- predict(gbm,X_test,n.trees = 100)
paste("R Squared:",round(r_squared(y_test,predicted_gbm),4))
- New transform features (MINUTE, HOUR, DAY, WEEK, MONTH, WEEKDAY) were generated in addition to the IS_WEEKEND feature from the previous model. These transform features appear to improve model performance, as R^2 increased from 0.755 to 0.80.
- The cost in training time was small (about 5 seconds).
3.6 Feature Influence
head(round(relative.influence(gbm,sort= TRUE)/sum(relative.influence(gbm)),4),25)
4. Third Model
4.1 Transform Primitives
trans_primitives <- c("minute","hour","day","week","month","weekday","is_weekend")
aggregation_primitives = c("count","sum","mean","median","std","max","min")
features <- dfs(
entityset=entities,
target_entity="trips",
trans_primitives=trans_primitives,
agg_primitives=aggregation_primitives,
ignore_variables= list(trips=c("pickup_latitude", "pickup_longitude",
"dropoff_latitude", "dropoff_longitude", "trip_duration")),
features_only=TRUE
)
print(paste("Number of features:",length(features)))
4.2 Dataset Creation
feature_matrix <- compute_features(features, entities)
head(feature_matrix,5)
4.3 Split Dataset
train_test = get_train_test_fm(feature_matrix, trips, .75)
y_train = log(train_test$y_train + 1)
y_test = log(train_test$y_test + 1)
X_train <- logical_to_integer(train_test$X_train)
X_test <- logical_to_integer(train_test$X_test)
trainingDataset = cbind(y_train,X_train)
print('Data split successful!')
4.4 Fitting Generalized Boosted Regression Model
gbm <- gbm(y_train ~ ., data = trainingDataset,verbose = TRUE,
n.trees = 100,
distribution="gaussian",
interaction.depth = 3,
n.minobsinnode = 1,
shrinkage = 0.1)
4.5 Results from predictions
R^2
predicted_gbm <- predict(gbm,X_test,n.trees = 100)
paste("R Squared:",round(r_squared(y_test,predicted_gbm),4))
4.6 Feature Influence
head(round(relative.influence(gbm,sort= TRUE)/sum(relative.influence(gbm)),4),25)
It is coherent that trip_distance, pickup/dropoff hour, and dropoff latitude/longitude are among the features that most influence trip_duration, since they capture the effects of distance, peak-traffic hours, and location on how long a trip takes.
5. Evaluation & Final thoughts
y_pred <- predict(gbm,X_test,n.trees = 100)
y_pred <- exp(y_pred) - 1 # undo the log we took earlier
y_test <- train_test$y_test
print('y_pred and y_test computation successful!')
# Print the first 10 predictions and real values
head(round(y_pred),10)
head(round(y_test),10)
# Histogram of y_test
hist(y_test, main="Real Trip Duration Distribution",
xlab="Trip Duration in Seconds",
ylab="Number of Trips",
col="lightblue")
summary(y_test)
# Histogram of y_pred
hist(y_pred, main="Predicted Trip Duration Distribution",
xlab="Trip Duration in Seconds",
ylab="Number of Trips",
col="lightgreen")
summary(y_pred)
- The green histogram from our latest model looks coherent, since it is broadly similar to the original histogram in blue.
- The prediction distribution is also right-skewed, with a mean of 724 seconds (original: 746) and a median of 594 seconds (original: 616), so the model predictions roughly follow the distribution of the original data.
- The new aggregated features seem to slightly increase model R^2, from 0.677 to 0.713, with only a minimal increase in training time (about 1 second). An additional check on the original scale is sketched below.
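For completeness, the R^2 helper defined earlier can also be applied on the original scale (seconds) after undoing the log transform, using the y_pred and y_test objects computed above:
# R^2 of predictions vs. actual trip durations in seconds
paste("R Squared (original scale):", round(r_squared(y_pred, y_test), 4))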
With more time, it would be worthwhile to explore different feature stacks and DFS depths. Additionally, domain-specific knowledge could be used to engineer new features or to bring in external data sources to build on.