This is my final project for our Machine Learning class. We were tasked with building a machine learning model using a dataset of our choice.

Motivation

Now that most of us are working from home, one thing we surely don't miss is commuting. Prior to the pandemic, most of us experienced the carmageddon along EDSA. Traffic is one of the biggest crises in our country, and several administrations have failed to solve it. One of the major causes of heavy traffic in our country is the poor public transportation system [1]. With the constant breakdown of the country's main transit system [2] and the lack of options due to the insanely long queues and chaos of using public transportation, many Filipinos resort to alternatives. In particular, ride-sharing services, specifically Grab, have grown rapidly over the last decade.

In 2018 alone, Grab users traveled a cumulative total of 920 million kilometers [3] with over 35,000 partner drivers. Grab reported cutting travel time by 70%, but customers still demand better service [4]. Since Grab receives around 600,000 bookings per day [5], it experiences an undersupply of vehicles relative to passenger demand. Hence, Filipinos booking a Grab commonly face the following frustrations: (1) waiting a long time to book a ride to no avail, (2) costly rides, and (3) drivers cancelling their rides.

This study focuses on the non-allocation of Grab Taxi bookings. It aims to answer the question: how can we predict whether a Grab Taxi booking will be allocated?

Business Value

Predicting the status of a Grab booking will be beneficial to different stakeholders, such as Grab passengers, Grab partner drivers, Grab itself, and local government units.

  • Grab passengers: Passengers can use this system to check whether a planned booking is likely to be completed or cancelled given the booking information. In this way, they can plan ahead.
  • Grab and its partner drivers: Grab can use this system to understand why its partner drivers cancel their allocated Grab bookings. Through this, it can better implement its demand and capacity planning.
  • Government: By understanding location hotspots, local government units can improve public transportation by pushing initiatives such as the P2P bus service.

Methodology

Exploring the Grab Dataset

The dataset was downloaded and extracted from Tableau Public. Prior to filtering and data processing, the Grab dataset is composed of 197,188 rows and 21 columns. Each row corresponds to one Grab Taxi booking.

The target variable in this study is whether a Grab booking was allocated (regardless of whether it was later cancelled by the driver or the passenger) or unallocated. Here we inspect the distribution of each class in our dataset. This tells us which metrics and methodology are appropriate for arriving at a relevant machine learning model.

Figure 1. Percentage of Allocation and Non-Allocation of Grab Taxi Bookings
Although the target variable is not heavily imbalanced, it is still important to consider metrics other than accuracy, such as recall and precision, when evaluating the machine learning model.

Both recall and precision are relevant to Grab because they help with capacity planning. These metrics are also relevant to users, who can better anticipate and plan their day ahead.
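
To make the difference between these metrics concrete, here is a minimal sketch that computes accuracy, precision, and recall from true and predicted labels. It assumes that "unallocated" is encoded as the positive class `1`; that encoding, the function names, and the toy labels are illustrative only and are not taken from the project code or the Grab dataset.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for one positive class."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, and recall as fractions in [0, 1]."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        # Of the bookings predicted unallocated, how many really were?
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        # Of the bookings that were unallocated, how many did we catch?
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Toy labels (assumption: 1 = unallocated, 0 = allocated).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Low precision would mean passengers are warned about bookings that would have gone through, while low recall would mean many unallocated bookings go unflagged.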

Figure 2. Daily Percentage of Unallocated Bookings
On a daily basis, the percentage of unallocated bookings tends to be higher on weekdays, particularly on Fridays.
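
A weekday breakdown like the one in Figure 2 could be computed roughly as follows. This is a sketch under assumptions: each booking is represented as a hypothetical `(created_at, is_allocated)` pair with an ISO-formatted timestamp, and the sample rows are made up; the real dataset's column names and values may differ.

```python
from collections import defaultdict
from datetime import datetime

# datetime.weekday() returns 0 for Monday through 6 for Sunday.
DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def unallocated_share_by_weekday(bookings):
    """Return {weekday name: percentage of bookings that were unallocated}."""
    totals = defaultdict(int)
    unallocated = defaultdict(int)
    for created_at, is_allocated in bookings:
        day = DAY_NAMES[datetime.fromisoformat(created_at).weekday()]
        totals[day] += 1
        if not is_allocated:
            unallocated[day] += 1
    return {day: 100.0 * unallocated[day] / totals[day] for day in totals}


# Made-up bookings: (timestamp, was the booking allocated?)
sample_bookings = [
    ("2019-03-01T18:05:00", False),  # a Friday
    ("2019-03-01T18:20:00", False),
    ("2019-03-01T09:00:00", True),
    ("2019-03-03T10:00:00", True),   # a Sunday
    ("2019-03-03T11:00:00", False),
    ("2019-03-04T08:00:00", True),   # a Monday
]
shares = unallocated_share_by_weekday(sample_bookings)
for day, pct in shares.items():
    print(f"{day}: {pct:.1f}% unallocated")
```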

Figure 3. Percentage of Unallocated Bookings per Pick-up Geohash
There are some areas where no bookings were allocated.
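
One way to list such areas is sketched below. It assumes each booking is a hypothetical `(pickup_geohash, is_allocated)` pair; the geohash strings are invented for illustration. The `min_bookings` parameter is an addition of this sketch, not part of the original analysis: it lets you ignore areas with so few bookings that a 0% allocation rate may be down to chance.

```python
from collections import Counter


def never_allocated_geohashes(bookings, min_bookings=1):
    """Return sorted pick-up geohashes where no booking was ever allocated."""
    totals = Counter()
    allocated = Counter()
    for geohash, is_allocated in bookings:
        totals[geohash] += 1
        if is_allocated:
            allocated[geohash] += 1
    return sorted(
        geohash
        for geohash, count in totals.items()
        if count >= min_bookings and allocated[geohash] == 0
    )


# Made-up bookings: (pick-up geohash, was the booking allocated?)
sample_bookings = [
    ("wdw4f8", True),
    ("wdw4f8", False),
    ("wdw4f9", False),
    ("wdw4f9", False),
    ("wdw4fb", False),
]
print(never_allocated_geohashes(sample_bookings))
print(never_allocated_geohashes(sample_bookings, min_bookings=2))
```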

Results

  • With only a few features available in the dataset, feature engineering was a crucial step for extracting valuable insights from the limited data and improving the performance of the machine learning model.
  • The machine learning model achieved $69\%$ accuracy, which is above the $63\%$ threshold set by $PCC_{1.25}$, that is, 1.25 times the proportional chance criterion. LightGBM was used for its speed and efficiency in handling a large dataset.
  • The top features were hour_18, day_of_week_Friday, and day_of_week_Sunday, ranked by information gain.
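
For readers unfamiliar with the $PCC_{1.25}$ baseline, the sketch below shows how it is typically computed: the proportional chance criterion is the sum of squared class proportions, and the threshold is 1.25 times that value. The class counts are illustrative only and are not the actual split of the Grab dataset; the function names are also made up for this example.

```python
def proportional_chance_criterion(class_counts):
    """Sum of squared class proportions: the accuracy expected by chance."""
    total = sum(class_counts.values())
    if total == 0:
        raise ValueError("class_counts must contain at least one sample")
    return sum((count / total) ** 2 for count in class_counts.values())


def pcc_threshold(class_counts, factor=1.25):
    """Accuracy a model should exceed to beat chance by the given factor."""
    return factor * proportional_chance_criterion(class_counts)


# Illustrative counts only (assumption, not the real class distribution).
counts = {"allocated": 55, "unallocated": 45}
pcc = proportional_chance_criterion(counts)
threshold = pcc_threshold(counts)
model_accuracy = 0.69  # accuracy reported in the results above

print(f"PCC: {pcc:.3f}")
print(f"1.25 x PCC: {threshold:.3f}")
print(f"Model beats threshold: {model_accuracy > threshold}")
```

With nearly balanced classes, the PCC sits close to 50%, so the 1.25 multiplier lands in the low 60s, which is consistent with the reported threshold.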