Anomaly Detection with New York City taxi data

Unsupervised Machine Learning R

In this entry, I use anomaly detection to identify anomalous points in New York City taxi passenger counts, recorded at half-hourly intervals from July 2014 to January 2015.

Tarid Wongvorachan (University of Alberta)

Introduction to Anomaly Detection


library(tidyverse) #for data manipulation
library(lubridate) #for date/time data management
library(zoo) # moving averages  
library(anomalize) #for anomaly detection

data <- read_csv("nyc_taxi.csv")
head(data)
# A tibble: 6 x 2
  timestamp           value
  <dttm>              <dbl>
1 2014-07-01 00:00:00 10844
2 2014-07-01 00:30:00  8127
3 2014-07-01 01:00:00  6210
4 2014-07-01 01:30:00  4656
5 2014-07-01 02:00:00  3820
6 2014-07-01 02:30:00  2873

# Scatter plot of the raw half-hourly passenger counts
ggplot(data, aes(x = timestamp, y = value)) + 
  geom_point(shape = 1, alpha = 0.5) +
  labs(x = "Time", y = "Count")

Investigating the Moving Average

# Centered rolling means: 48 half-hours = 1 day, 336 half-hours = 1 week
data <- data %>% 
  dplyr::mutate(MA48 = zoo::rollmean(value, k = 48, fill = NA),
                MA336 = zoo::rollmean(value, k = 336, fill = NA))

# Plot the moving average line chart

data %>%
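  # Drop the raw count first so the reshaped "value" column does not clash with it
  select(timestamp, MA48, MA336) %>%
  pivot_longer(c(MA48, MA336), names_to = "metric", values_to = "value") %>%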
  ggplot(aes(timestamp, value, color = metric)) +
  geom_line() +
  ggtitle("Rolling average of NYC taxi passenger count")+
  theme(plot.title = element_text(hjust = 0.5))
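
The window sizes follow from the half-hourly sampling: 48 observations span one day and 336 span one week. As a minimal sketch of what a centered rolling mean computes, the toy example below uses base R's stats::filter() instead of zoo::rollmean(); the vector x and the window k are made-up values for illustration only.

```r
# Toy series and window width (hypothetical values, for illustration only)
x <- c(2, 4, 6, 8, 10, 12)
k <- 3

# Centered moving average: each point is the mean of itself and its neighbours.
# Positions without a full window on both sides come back as NA.
ma <- as.numeric(stats::filter(x, rep(1 / k, k), sides = 2))
ma

# Window widths used for MA48 and MA336, expressed in hours of half-hourly data
window_hours <- c(MA48 = 48, MA336 = 336) * 0.5
window_hours  # 24 hours (one day) and 168 hours (one week)
```

The NA values at both ends are the same gaps that fill = NA leaves in MA48 and MA336, which is why the plotted lines start and stop short of the full date range.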

Time Series Decomposition with Anomalies

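# Decompose the series, flag anomalies in the remainder, and plot each component
data %>% 
  time_decompose(value, method = "stl", frequency = "auto", trend = "auto") %>%
  anomalize(remainder, method = "gesd", alpha = 0.05, max_anoms = 0.2) %>%
  plot_anomaly_decomposition()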
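
To make the decomposition step more concrete, here is a rough sketch using base R's stats::stl() on a synthetic half-hourly series. It is not the anomalize internals, and the package's automatic frequency and trend settings may differ; the series y and every parameter below are assumptions chosen for illustration. The point is that the observed values split into seasonal, trend, and remainder parts that add back up to the original.

```r
set.seed(1)

# Synthetic half-hourly series: 14 days, 48 observations per day (hypothetical data)
period <- 48
t <- seq_len(14 * period)
y <- 100 + 0.01 * t + 20 * sin(2 * pi * t / period) + rnorm(length(t))

# STL needs a ts object whose frequency is the length of one seasonal cycle
y_ts <- ts(y, frequency = period)
fit <- stats::stl(y_ts, s.window = "periodic")

# fit$time.series has three columns: seasonal, trend, remainder
parts <- fit$time.series
colnames(parts)

# The three components sum back to the observed series
recomposed <- rowSums(parts)
isTRUE(all.equal(as.numeric(recomposed), y))
```

Anomaly detection then operates on the remainder column only, since the daily cycle and the slow trend have already been removed from it.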

Anomaly Detection

data %>% 
  time_decompose(value) %>%
  anomalize(remainder) %>%
  time_recompose() %>%
  plot_anomalies(time_recomposed = TRUE, ncol = 3, alpha_dots = 0.5)

data_anomalous <- data %>% 
  time_decompose(value) %>%
  anomalize(remainder) %>%
  time_recompose() %>%
  filter(anomaly == 'Yes')

head(data_anomalous)

# A time tibble: 6 x 10
# Index: timestamp
  timestamp           observed season  trend remainder remainder_l1
  <dttm>                 <dbl>  <dbl>  <dbl>     <dbl>        <dbl>
1 2014-07-04 07:30:00     4926  2817. 14875.   -12767.       -9534.
2 2014-07-04 08:00:00     5165  3903. 14875.   -13612.       -9534.
3 2014-07-04 08:30:00     5776  4346. 14874.   -13445.       -9534.
4 2014-07-04 09:00:00     7338  3304. 14874.   -10840.       -9534.
5 2014-07-05 07:00:00     3658  -735. 14848.   -10454.       -9534.
6 2014-07-05 07:30:00     4345  2817. 14847.   -13319.       -9534.
# ... with 4 more variables: remainder_l2 <dbl>, anomaly <chr>,
#   recomposed_l1 <dbl>, recomposed_l2 <dbl>

# Lower alpha and max_anoms so that fewer, more extreme points are flagged
data %>% 
  time_decompose(value) %>%
  anomalize(remainder, alpha = 0.025, max_anoms = 0.05) %>%
  time_recompose() %>%
  plot_anomalies(time_recomposed = TRUE, ncol = 3, alpha_dots = 0.5)

data_anomalous_tuned <- data %>% 
  time_decompose(value) %>%
  anomalize(remainder, alpha = 0.025, max_anoms = 0.05) %>%
  time_recompose() %>%
  filter(anomaly == 'Yes')

head(data_anomalous_tuned)

# A time tibble: 6 x 10
# Index: timestamp
  timestamp           observed  season  trend remainder remainder_l1
  <dttm>                 <dbl>   <dbl>  <dbl>     <dbl>        <dbl>
1 2014-09-21 01:00:00    25371  -8034. 14994.    18411.      -17691.
2 2014-10-19 01:00:00    25610  -8034. 15905.    17739.      -17691.
3 2014-11-01 01:30:00    23736  -9536. 15566.    17707.      -17691.
4 2014-11-01 02:00:00    23245 -10442. 15565.    18121.      -17691.
5 2014-11-02 01:00:00    39197  -8034. 15597.    31634.      -17691.
6 2014-11-02 01:30:00    35212  -9536. 15598.    29151.      -17691.
# ... with 4 more variables: remainder_l2 <dbl>, anomaly <chr>,
#   recomposed_l1 <dbl>, recomposed_l2 <dbl>
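
Both calls above rely on the default method = "iqr". As I understand the anomalize documentation, the limits sit a multiple of the interquartile range beyond the quartiles, and that multiple is 0.15 / alpha, so alpha = 0.05 gives a 3x factor and alpha = 0.025 gives a 6x factor. This fits the wider remainder_l1 in the tuned output (about -17691 versus about -9534). The sketch below is a simplified base-R illustration of that rule, not the package's implementation: iqr_limits(), flag(), and the toy remainder are hypothetical, and the max_anoms cap is ignored.

```r
# Lower and upper limits from the IQR rule (assumed factor: 0.15 / alpha)
iqr_limits <- function(x, alpha = 0.05) {
  q <- stats::quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  iqr_factor <- 0.15 / alpha
  c(lower = q[1] - iqr_factor * iqr, upper = q[2] + iqr_factor * iqr)
}

# TRUE where a value falls outside the limits
flag <- function(x, alpha) {
  lim <- iqr_limits(x, alpha)
  x < lim[["lower"]] | x > lim[["upper"]]
}

# Toy remainder with one moderately large value (hypothetical data)
remainder <- c(-2, -1, 0, 1, 2, 12)

iqr_limits(remainder, alpha = 0.05)   # narrower limits
iqr_limits(remainder, alpha = 0.025)  # wider limits

which(flag(remainder, alpha = 0.05))   # the value 12 is flagged
which(flag(remainder, alpha = 0.025))  # nothing is flagged
```

Halving alpha doubles the distance of each limit from its quartile, which is why the tuned run keeps only the most extreme points.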

Application of Anomaly Detection


