Time Series Data Transformations for Better Forecasting
When working with time series data, it helps to simplify the pattern by removing variation that comes from known sources and is unrelated to the behaviour you want to forecast. A more consistent pattern is easier to model and tends to produce more accurate forecasts.
Calendar variation must also be considered, because months differ in length: a 31-day month will tend to show a higher total than a 28-day month even when the underlying daily rate is unchanged. For instance, in the graph of historical monthly crime counts in Baltimore, the grey line shows monthly totals, while the red series shows the average number of crimes per day in each month. Dividing each monthly total by the number of days in that month removes the calendar effect, so the adjusted series is more uniform and gives a clearer view of the underlying trend. This matters for forecasting because the model is no longer misled by irregularities that come only from the calendar.
library(readr)
library(dplyr)
library(lubridate)
library(tsibble)

crime <- readr::read_csv("../data/baltimore_crime.csv")

# Aggregate by days: parse the dates first so sorting and filtering are chronological
tb_crime <- crime %>%
  select(CrimeDate) %>%
  mutate(CrimeDate = as_date(CrimeDate, format = "%m/%d/%Y")) %>%
  filter(between(year(CrimeDate), 2011, 2015)) %>%
  group_by(CrimeDate) %>%
  summarise(total = n()) %>%
  arrange(CrimeDate)

ts_crime <- tb_crime %>% as_tsibble(index = CrimeDate)

######## Monthly Average ########
ts_crime_monthly_avg <- ts_crime %>%
  index_by(Month = ~ floor_date(.x, "month")) %>%
  summarise(Monthly_Total = sum(total)) %>%
  # Divide each monthly total by the number of days in that month
  mutate(Avg_perDay = Monthly_Total / as.numeric(days_in_month(Month)))
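To see why the division by days matters, here is a minimal base R sketch on made-up data (the constant rate of 100 incidents per day is an assumption for illustration, not a figure from the Baltimore dataset). The monthly totals move up and down purely because of month length, while the per-day average stays flat.

```r
# Hypothetical series: exactly 100 incidents on every day of 2015
days <- seq(as.Date("2015-01-01"), as.Date("2015-12-31"), by = "day")
daily <- data.frame(date = days, total = 100)

# Group the days by calendar month
month_key <- format(daily$date, "%Y-%m")
monthly_total <- tapply(daily$total, month_key, sum)
days_in_each_month <- tapply(daily$total, month_key, length)

# Calendar adjustment: monthly total divided by the number of days in the month
avg_per_day <- monthly_total / days_in_each_month

print(monthly_total)  # varies with month length
print(avg_per_day)    # constant once the calendar effect is removed
```

The only variation in `monthly_total` here is the calendar effect, which is exactly what the adjustment is meant to strip out.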
Adjusting time series data for population allows for per-capita figures, providing normalized data that enables more accurate comparisons across time or geographic regions. This adjustment helps reveal genuine trends by considering population growth, as increases in raw numbers may merely indicate population growth rather than actual consumption increases. Such normalization is vital for informed decision-making in policy, economic assessments, and understanding societal trends.
library(readr)
library(dplyr)
library(tsibble)

auto <- readr::read_csv('../data/us_car_reg.csv', col_names = c('year', 'total'))
us_pop <- readr::read_csv('../data/us_pop.csv', col_names = c('year', 'total'))

# Convert both yearly tables to tsibbles indexed by year
auto_tsb <- as_tsibble(auto, index = year)
us_pop_tsb <- as_tsibble(us_pop, index = year)

# Both tables have a `total` column, so the join suffixes them with .x and .y
auto_pop <- left_join(auto_tsb, us_pop_tsb, by = "year")

auto_pop <- auto_pop %>%
  rename(car_regs_total = total.x, population = total.y) %>%
  mutate(cars_per_1000 = (car_regs_total / population) * 1000)
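As a quick illustration of how raw totals can mislead, the base R sketch below uses invented numbers (they are assumptions, not values from the Statista or IMF data) in which registrations and population grow at the same rate. The raw count rises every year, but the per-capita figure does not move.

```r
# Hypothetical figures for illustration only
regs <- data.frame(year = 2000:2002, car_regs_total = c(200e6, 210e6, 220e6))
pop  <- data.frame(year = 2000:2002, population     = c(280e6, 294e6, 308e6))

# Join on year and compute registrations per 1,000 people
both <- merge(regs, pop, by = "year")
both$cars_per_1000 <- both$car_regs_total / both$population * 1000

print(both)
```

In this made-up case the whole increase in registrations is explained by population growth, which is the situation the per-capita adjustment is designed to expose.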
For financial time series analysis involving dollar amounts, adjusting for inflation is crucial to examine patterns in real terms rather than nominal terms. By removing the inflationary effects, one can achieve a clearer view of economic trends.
To adjust for inflation, select a Consumer Price Index (CPI) and gather its historical data. The Bureau of Labor Statistics in the U.S. provides this information, and the FRED database is a reliable source for economic indicators.
Revenue adjusted to the 2017 price level is calculated by dividing each period's nominal revenue by the CPI for that period and multiplying the result by the CPI for 2017:
library(readr)
library(dplyr)

# Load dataset
watch_sales <- readr::read_csv("../data/watch_sales.csv")

# Adjust for inflation using CPI index from 2017 (240.01 is the base value used here)
watch_sales_adjusted <- watch_sales %>%
  mutate(revenue_in_2017_prices = (nominal / cpi) * 240.01)
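A small worked example may make the effect easier to see. The base R sketch below uses invented revenue and CPI values (assumptions for illustration, not the watch sales data or real CPI readings) and takes the base-year CPI from the data itself instead of hard-coding it.

```r
# Hypothetical figures, not from the watch sales dataset
sales <- data.frame(
  year    = c(2015, 2016, 2017),
  nominal = c(100, 110, 121),    # nominal revenue grows 10% per year
  cpi     = c(200, 210, 220.5)   # price index grows 5% per year
)

# CPI of the base year whose prices we want to express everything in
cpi_base <- sales$cpi[sales$year == 2017]
sales$real_2017 <- sales$nominal / sales$cpi * cpi_base

# Year-over-year growth in nominal and in real (2017-price) terms
nominal_growth <- diff(sales$nominal) / head(sales$nominal, -1)
real_growth <- diff(sales$real_2017) / head(sales$real_2017, -1)

print(sales)
print(round(cbind(nominal_growth, real_growth), 4))
```

With these numbers, nominal revenue grows 10% a year, but once prices are held at their 2017 level the growth is only about 4.8%, since roughly half of the nominal increase was inflation.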
To enhance the predictive accuracy of a model, it helps if the series is stationary, which includes having a stable variance over time. The air passenger traffic dataset from Kaggle illustrates the opposite case: the variance in the number of passengers increases over time as the level of the series rises.
The Box-Cox transformation employs a parameter known as lambda to stabilize variance. For the Passengers dataset, I utilized the Guerrero method to determine an optimal lambda value. This derived lambda is then applied to transform the data, resulting in more consistent variance.
library(readr)
library(tsibble)
library(dplyr)
library(feasts)

air <- readr::read_csv("../data/air_passengers.csv")

# Parse "YYYY-MM" and store it as a yearmonth index so the tsibble is treated as monthly
air <- air %>%
  mutate(Month = yearmonth(as.Date(paste0(Month, "-01"), format = "%Y-%m-%d"))) %>%
  as_tsibble(index = Month)

# Apply Guerrero method to derive lambda
lambda <- air %>%
  features(Passengers, features = guerrero) %>%
  pull(lambda_guerrero)

# Apply Box-Cox transformation
air <- air %>%
  mutate(Passengers_bc = box_cox(Passengers, lambda))
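For intuition about what the transformation does, here is a minimal base R sketch. The helper names `box_cox_manual` and `inv_box_cox_manual` are hypothetical and written only for this illustration (the code above uses `box_cox()` from the loaded packages), the series is synthetic, and the formula shown is the standard Box-Cox definition for positive values: the natural log when lambda is 0, otherwise (y^lambda - 1) / lambda.

```r
# Hypothetical helpers: standard Box-Cox transform for positive y, and its inverse
box_cox_manual <- function(y, lambda) {
  if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
}

inv_box_cox_manual <- function(w, lambda) {
  if (abs(lambda) < 1e-8) exp(w) else (lambda * w + 1)^(1 / lambda)
}

# Synthetic monthly series: seasonal swings grow in proportion to the level
t <- 1:120
y <- 100 * exp(0.01 * t) * (1 + 0.2 * sin(2 * pi * t / 12))

# Ratio of the spread in the last two years to the spread in the first two years
spread_ratio <- function(x) sd(x[97:120]) / sd(x[1:24])

w <- box_cox_manual(y, lambda = 0)  # lambda = 0 is the log transform

print(spread_ratio(y))  # well above 1: the spread grows with the level
print(spread_ratio(w))  # close to 1: the spread is stable after transforming
```

On this synthetic series the spread of the raw values grows by a factor of roughly 2.6 between the first and last two years, while the transformed values have essentially the same spread in both windows. Real data will rarely be this clean, which is why a data-driven choice of lambda such as the Guerrero method is useful.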
For those interested in diving deeper into the Box-Cox transformation, I recommend reading the article by Egor Howell.
Links
- Access code and data on my GitHub page
- United States population data from IMF DATAMAPPER
- United States automobile registrations available on Statista
- United States CPI Index from FRED
- Air passenger data from Kaggle