Anomaly Detection with New York City taxi data

Unsupervised Machine Learning R

In this entry, I use anomaly detection to identify anomalous points in New York City taxi passenger counts, recorded at half-hourly intervals from July 2014 to January 2015.


Tarid Wongvorachan (University of Alberta, https://www.ualberta.ca)
2021-12-04

Introduction to Anomaly Detection

Dataset

library(tidyverse) # for data manipulation
library(lubridate) # for date/time data management
library(zoo)       # for moving averages
library(anomalize) # for anomaly detection

data <- read_csv("nyc_taxi.csv")
head(data)
# A tibble: 6 x 2
  timestamp           value
  <dttm>              <dbl>
1 2014-07-01 00:00:00 10844
2 2014-07-01 00:30:00  8127
3 2014-07-01 01:00:00  6210
4 2014-07-01 01:30:00  4656
5 2014-07-01 02:00:00  3820
6 2014-07-01 02:30:00  2873
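
The moving-average windows and the seasonal decomposition used later in this post assume evenly spaced observations. A quick way to confirm the half-hourly spacing is sketched below on a toy vector that mimics the six timestamps above; with the real data, `data$timestamp` would take its place. The variable names are illustrative only.

```r
# Toy stand-in for the six timestamps printed above (not read from the CSV)
timestamps <- seq(as.POSIXct("2014-07-01 00:00:00", tz = "UTC"),
                  by = "30 min", length.out = 6)

# Gaps between consecutive observations, expressed in minutes
gaps <- as.numeric(diff(timestamps), units = "mins")

# TRUE when every gap is exactly half an hour
evenly_spaced <- all(gaps == 30)
print(evenly_spaced)
```
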
# Plot the raw half-hourly passenger counts
ggplot(data, aes(x = timestamp, y = value)) + 
  geom_point(shape = 1, alpha = 0.5) +
  labs(x = "Time", y = "Count")

Investigating the Moving Average

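# Centred rolling means: 48 half-hour steps = 1 day, 336 = 1 week
data <- data %>% 
  dplyr::mutate(MA48 = zoo::rollmean(value, k = 48, fill = NA),
                MA336 = zoo::rollmean(value, k = 336, fill = NA))

# Plot the moving average line chart
# (the raw `value` column is dropped first so the reshaped column can reuse its name)
data %>%
  select(timestamp, MA48, MA336) %>%
  pivot_longer(c(MA48, MA336), names_to = "metric", values_to = "value") %>%
  ggplot(aes(timestamp, value, color = metric)) +
  geom_line() +
  ggtitle("Rolling average of NYC taxi passenger count") +
  theme(plot.title = element_text(hjust = 0.5))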
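
Because each row covers half an hour, the window sizes above map onto calendar spans: 48 rows make one day and 336 rows make one week. The sketch below shows that arithmetic and what a centred rolling mean with `fill = NA` does on a toy vector. It assumes the zoo package is installed; `x` is invented for illustration and is not the taxi data.

```r
library(zoo)

# Each observation covers 30 minutes, so window sizes translate to time spans
rows_per_day  <- 24 * 60 / 30      # half-hour steps in one day
rows_per_week <- rows_per_day * 7  # half-hour steps in one week

# Toy series (not the taxi data)
x <- c(1, 2, 3, 4, 5)

# Centred 3-point rolling mean; positions without a full window are padded with NA
ma3 <- rollmean(x, k = 3, fill = NA)
print(ma3)
```

The same padding explains why the MA48 and MA336 lines start later and end earlier than the raw series: roughly half a window is lost at each end.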

Time Series Decomposition with Anomalies

# Decompose with STL, flag anomalies in the remainder with GESD, and plot each component
data %>% 
  time_decompose(value, method = "stl", frequency = "auto", trend = "auto") %>%
  anomalize(remainder, method = "gesd", alpha = 0.05, max_anoms = 0.2) %>%
  plot_anomaly_decomposition()
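
The pipeline above splits the series into seasonal, trend, and remainder parts and then looks for anomalies in the remainder. The sketch below illustrates that idea with base R's `stl()` on a synthetic half-hourly series. It is not the anomalize internals, and the simulated series and all of its parameters are invented for illustration.

```r
set.seed(42)

# Synthetic half-hourly series: 14 days, 48 observations per day (illustrative only)
period <- 48
n_days <- 14
t <- seq_len(n_days * period)
y <- 100 + 0.01 * t + 20 * sin(2 * pi * t / period) + rnorm(length(t), sd = 2)

# stl() needs a ts object whose frequency is the seasonal period
y_ts <- ts(y, frequency = period)
fit <- stl(y_ts, s.window = "periodic", robust = TRUE)

# The fitted components are stored as columns: seasonal, trend, remainder
components <- fit$time.series

# Adding the three components back together recovers the observed series
reconstructed <- as.numeric(components[, "seasonal"] +
                            components[, "trend"] +
                            components[, "remainder"])
print(max(abs(reconstructed - y)))
```

Since the seasonal and trend parts absorb the regular daily pattern and the slow drift, unusually large values in the remainder are the natural candidates for anomalies.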

Anomaly Detection

# Default settings: STL decomposition, then method = "iqr", alpha = 0.05, max_anoms = 0.2
data %>% 
  time_decompose(value) %>%
  anomalize(remainder) %>%
  time_recompose() %>%
  plot_anomalies(time_recomposed = TRUE, ncol = 3, alpha_dots = 0.5)

# Keep only the observations flagged as anomalous
data_anomalous <- data %>% 
  time_decompose(value) %>%
  anomalize(remainder) %>%
  time_recompose() %>%
  filter(anomaly == "Yes")

head(data_anomalous)
# A time tibble: 6 x 10
# Index: timestamp
  timestamp           observed season  trend remainder remainder_l1
  <dttm>                 <dbl>  <dbl>  <dbl>     <dbl>        <dbl>
1 2014-07-04 07:30:00     4926  2817. 14875.   -12767.       -9534.
2 2014-07-04 08:00:00     5165  3903. 14875.   -13612.       -9534.
3 2014-07-04 08:30:00     5776  4346. 14874.   -13445.       -9534.
4 2014-07-04 09:00:00     7338  3304. 14874.   -10840.       -9534.
5 2014-07-05 07:00:00     3658  -735. 14848.   -10454.       -9534.
6 2014-07-05 07:30:00     4345  2817. 14847.   -13319.       -9534.
# ... with 4 more variables: remainder_l2 <dbl>, anomaly <chr>,
#   recomposed_l1 <dbl>, recomposed_l2 <dbl>
# Tune the detection: smaller alpha (wider limits) and a lower cap on the share of anomalies
data %>% 
  time_decompose(value) %>%
  anomalize(remainder, alpha = 0.025, max_anoms = 0.05) %>%
  time_recompose() %>%
  plot_anomalies(time_recomposed = TRUE, ncol = 3, alpha_dots = 0.5)

# Keep only the observations flagged as anomalous under the tuned settings
data_anomalous_tuned <- data %>% 
  time_decompose(value) %>%
  anomalize(remainder, alpha = 0.025, max_anoms = 0.05) %>%
  time_recompose() %>%
  filter(anomaly == "Yes")

head(data_anomalous_tuned)
# A time tibble: 6 x 10
# Index: timestamp
  timestamp           observed  season  trend remainder remainder_l1
  <dttm>                 <dbl>   <dbl>  <dbl>     <dbl>        <dbl>
1 2014-09-21 01:00:00    25371  -8034. 14994.    18411.      -17691.
2 2014-10-19 01:00:00    25610  -8034. 15905.    17739.      -17691.
3 2014-11-01 01:30:00    23736  -9536. 15566.    17707.      -17691.
4 2014-11-01 02:00:00    23245 -10442. 15565.    18121.      -17691.
5 2014-11-02 01:00:00    39197  -8034. 15597.    31634.      -17691.
6 2014-11-02 01:30:00    35212  -9536. 15598.    29151.      -17691.
# ... with 4 more variables: remainder_l2 <dbl>, anomaly <chr>,
#   recomposed_l1 <dbl>, recomposed_l2 <dbl>
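
In the tuned run, the lower limit `remainder_l1` widens from about -9534 to about -17691. That direction matches the IQR rule described in the anomalize documentation for its default method: the limits sit at the 25% and 75% quantiles, pushed outward by (0.15 / alpha) times the interquartile range, so halving alpha from 0.05 to 0.025 doubles the multiplier from 3 to 6. The sketch below reproduces that rule in base R under this reading of the documentation. `iqr_limits` is a hypothetical helper, not a function from the package, and `x` is toy data standing in for the remainder.

```r
# Hypothetical helper: IQR-based limits, following the rule described
# in the anomalize documentation (an assumption, not the package's own code)
iqr_limits <- function(x, alpha = 0.05) {
  q <- unname(quantile(x, probs = c(0.25, 0.75), na.rm = TRUE))
  iqr_factor <- 0.15 / alpha   # 3 when alpha = 0.05, 6 when alpha = 0.025
  spread <- q[2] - q[1]        # interquartile range
  c(lower = q[1] - iqr_factor * spread,
    upper = q[2] + iqr_factor * spread)
}

# Toy values standing in for the remainder column
x <- 1:100

default_limits <- iqr_limits(x, alpha = 0.05)
tuned_limits   <- iqr_limits(x, alpha = 0.025)

print(default_limits)
print(tuned_limits)

# Points outside the limits would be flagged as anomalies
flagged <- x < tuned_limits["lower"] | x > tuned_limits["upper"]
print(sum(flagged))
```

A smaller alpha therefore makes the detector more conservative, while `max_anoms` separately caps the share of observations that can be flagged.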

Application of Anomaly Detection

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Wongvorachan (2021, Dec. 4). Tarid Wongvorachan: Anomaly Detection with New York City taxi data. Retrieved from https://taridwong.github.io/posts/2021-12-04-anomaly-detection-with-new-york-city-taxi-data/

BibTeX citation

@misc{wongvorachan2021anomaly,
  author = {Wongvorachan, Tarid},
  title = {Tarid Wongvorachan: Anomaly Detection with New York City taxi data},
  url = {https://taridwong.github.io/posts/2021-12-04-anomaly-detection-with-new-york-city-taxi-data/},
  year = {2021}
}