.. title:: Model selection with cross-validation .. _sun_spots: ================================== An end-to-end time series analysis ================================== In this example, we're going to look at why the :func:`pmdarima.arima.auto_arima` method should not be used as a silver bullet, and why qualitative investigation of the data may reveal important characteristics about it which may, in turn, affect how you approach the problem. For this problem, we'll use the :func:`pmdarima.datasets.load_sunspots` method, which loads a seasonal time series of monthly mean relative sunspot numbers from 1749 to 1983. Here's what you'll need to run this example: * Pmdarima 1.5.2+ (many plotting utilities were added in this version) * Matplotlib - It's not a hard requirement of the package itself, but you'll need it to run these examples (if you're running this in a notebook, make sure to include ``%matplotlib inline``) * Python 3.6+. If you're running python 3.5, simply remove the `F-string `_ print statements and it will work. Imports, data loading & splitting --------------------------------- Here's the set up for our problem. We'll load the data and split it into our train/test sets: .. code-block:: python import numpy as np import pandas as pd import matplotlib.pyplot as plt # %matplotlib inline import pmdarima as pm from pmdarima.datasets import load_sunspots from pmdarima.model_selection import train_test_split print(f"Using pmdarima {pm.__version__}") # Using pmdarima 1.5.2 y = load_sunspots(True) train_len = 2750 y_train, y_test = train_test_split(y, train_size=train_len) y_train.head() ======== ======= Jan 1749 58.0 Feb 1749 62.6 Mar 1749 70.0 Apr 1749 55.7 May 1749 85.0 dtype: float64 ======== ======= Time series data is a bit different from traditional machine learning in the sense that it's temporally sensitive. That means the larger our test set, the higher we would expect our error to be for later forecasts, since our model won't be able to effectively forecast for too many periods into the future. For this example, we'll take the first 2750 samples for training, and leave the last 70 as our test set. Examining the data ------------------ Pmdarima 1.5.2 introduced a handy utility for visualizing your time series data: :func:`pmdarima.utils.vizualization.tsdisplay`. We can use this to visalize our data, take a look at the auto-correlation plot, and see the histogram of values: .. code-block:: python from pmdarima.utils import tsdisplay tsdisplay(y_train, lag_max=100) .. image:: ../img/sunspots/untransformed.png :align: center :scale: 60% :alt: Untransformed Time Series Reading this plot can give us several pieces of information: * We are looking at a seasonal time series. Our apriori knowledge of the dataset informs us that this data has a seasonal periodicity of ``m=12``. * There seems to be some significant data skew, looking at the histogram. It is very zero-inflated. A lot of the statistical techniques used in time series modeling behave better when the data is normally distributed, so this may be something to look into. Fitting a baseline ------------------ Before we start manipulating our data, let's examine what would happen if we just fit a model straight out of the box: .. code-block:: python fit1 = pm.auto_arima(y_train, m=12, trace=True, suppress_warnings=True) :: Fit ARIMA: order=(2, 1, 2) seasonal_order=(1, 0, 1, 12); AIC=22834.181, BIC=22881.533, Fit time=7.496 seconds Fit ARIMA: order=(0, 1, 0) seasonal_order=(0, 0, 0, 12); AIC=23347.880, BIC=23359.718, Fit time=0.053 seconds Fit ARIMA: order=(1, 1, 0) seasonal_order=(1, 0, 0, 12); AIC=23087.391, BIC=23111.067, Fit time=0.458 seconds Fit ARIMA: order=(0, 1, 1) seasonal_order=(0, 0, 1, 12); AIC=22931.725, BIC=22955.401, Fit time=1.314 seconds Fit ARIMA: order=(0, 1, 0) seasonal_order=(0, 0, 0, 12); AIC=23345.881, BIC=23351.800, Fit time=0.030 seconds Fit ARIMA: order=(2, 1, 2) seasonal_order=(0, 0, 1, 12); AIC=22834.415, BIC=22875.848, Fit time=6.067 seconds Fit ARIMA: order=(2, 1, 2) seasonal_order=(1, 0, 0, 12); AIC=22834.465, BIC=22875.898, Fit time=8.268 seconds Fit ARIMA: order=(2, 1, 2) seasonal_order=(2, 0, 1, 12); AIC=22842.107, BIC=22895.378, Fit time=34.407 seconds Near non-invertible roots for order (2, 1, 2)(2, 0, 1, 12); setting score to inf (at least one inverse root too close to the border of the unit circle: 0.995) Fit ARIMA: order=(2, 1, 2) seasonal_order=(1, 0, 2, 12); AIC=22844.771, BIC=22898.042, Fit time=28.426 seconds Near non-invertible roots for order (2, 1, 2)(1, 0, 2, 12); setting score to inf (at least one inverse root too close to the border of the unit circle: 0.997) Fit ARIMA: order=(2, 1, 2) seasonal_order=(0, 0, 0, 12); AIC=22832.862, BIC=22868.375, Fit time=2.334 seconds Fit ARIMA: order=(1, 1, 2) seasonal_order=(0, 0, 0, 12); AIC=22896.004, BIC=22925.599, Fit time=0.658 seconds Fit ARIMA: order=(2, 1, 1) seasonal_order=(0, 0, 0, 12); AIC=22895.198, BIC=22924.793, Fit time=0.567 seconds Fit ARIMA: order=(3, 1, 2) seasonal_order=(0, 0, 0, 12); AIC=22830.683, BIC=22872.115, Fit time=2.810 seconds Fit ARIMA: order=(3, 1, 2) seasonal_order=(1, 0, 0, 12); AIC=22832.671, BIC=22880.023, Fit time=8.964 seconds Fit ARIMA: order=(3, 1, 2) seasonal_order=(0, 0, 1, 12); AIC=22832.590, BIC=22879.942, Fit time=8.747 seconds Fit ARIMA: order=(3, 1, 2) seasonal_order=(1, 0, 1, 12); AIC=22834.543, BIC=22887.814, Fit time=8.893 seconds Near non-invertible roots for order (3, 1, 2)(1, 0, 1, 12); setting score to inf (at least one inverse root too close to the border of the unit circle: 0.998) Fit ARIMA: order=(3, 1, 1) seasonal_order=(0, 0, 0, 12); AIC=22895.344, BIC=22930.857, Fit time=0.997 seconds Fit ARIMA: order=(4, 1, 2) seasonal_order=(0, 0, 0, 12); AIC=22829.697, BIC=22877.049, Fit time=3.447 seconds Fit ARIMA: order=(4, 1, 2) seasonal_order=(1, 0, 0, 12); AIC=22831.750, BIC=22885.021, Fit time=11.401 seconds Fit ARIMA: order=(4, 1, 2) seasonal_order=(0, 0, 1, 12); AIC=22831.704, BIC=22884.975, Fit time=9.325 seconds Fit ARIMA: order=(4, 1, 2) seasonal_order=(1, 0, 1, 12); AIC=22832.081, BIC=22891.271, Fit time=13.537 seconds Near non-invertible roots for order (4, 1, 2)(1, 0, 1, 12); setting score to inf (at least one inverse root too close to the border of the unit circle: 0.999) Fit ARIMA: order=(4, 1, 1) seasonal_order=(0, 0, 0, 12); AIC=22897.185, BIC=22938.618, Fit time=1.096 seconds Fit ARIMA: order=(5, 1, 2) seasonal_order=(0, 0, 0, 12); AIC=22899.855, BIC=22953.126, Fit time=2.716 seconds Fit ARIMA: order=(4, 1, 3) seasonal_order=(0, 0, 0, 12); AIC=22834.222, BIC=22887.493, Fit time=4.503 seconds Fit ARIMA: order=(3, 1, 3) seasonal_order=(0, 0, 0, 12); AIC=22890.085, BIC=22937.437, Fit time=3.098 seconds Fit ARIMA: order=(5, 1, 1) seasonal_order=(0, 0, 0, 12); AIC=22898.389, BIC=22945.741, Fit time=1.600 seconds Fit ARIMA: order=(5, 1, 3) seasonal_order=(0, 0, 0, 12); AIC=22894.744, BIC=22953.934, Fit time=4.333 seconds Examining the summary gives us: .. code-block:: python fit1.summary() :: Statespace Model Results ============================================================================== Dep. Variable: y No. Observations: 2750 Model: SARIMAX(4, 1, 2) Log Likelihood -11406.849 Date: Fri, 13 Dec 2019 AIC 22829.697 Time: 07:46:18 BIC 22877.049 Sample: 0 HQIC 22846.806 - 2750 Covariance Type: opg ============================================================================== coef std err z P>|z| [0.025 0.975] ------------------------------------------------------------------------------ intercept 0.0012 0.012 0.104 0.917 -0.022 0.024 ar.L1 1.3845 0.025 55.356 0.000 1.335 1.433 ar.L2 -0.3867 0.027 -14.352 0.000 -0.440 -0.334 ar.L3 -0.0085 0.025 -0.340 0.734 -0.058 0.040 ar.L4 -0.0394 0.017 -2.264 0.024 -0.073 -0.005 ma.L1 -1.8193 0.020 -91.747 0.000 -1.858 -1.780 ma.L2 0.8570 0.020 42.443 0.000 0.817 0.897 sigma2 235.3266 4.006 58.739 0.000 227.474 243.179 =================================================================================== Ljung-Box (Q): 79.17 Jarque-Bera (JB): 1261.11 Prob(Q): 0.00 Prob(JB): 0.00 Heteroskedasticity (H): 1.21 Skew: 0.31 Prob(H) (two-sided): 0.00 Kurtosis: 6.26 =================================================================================== Warnings: [1] Covariance matrix calculated using the outer product of gradients (complex-step). Transforming our data --------------------- Since we expect our model to perform better over more normal data, let's experiment with log transformations and the `Box-Cox transformation `_, each of which is provided as an endogenous transformer in the Pmdarima package. .. code-block:: python from pmdarima.preprocessing import LogEndogTransformer y_train_log, _ = LogEndogTransformer(lmbda=1e-6).fit_transform(y_train) tsdisplay(y_train_log, lag_max=100) .. image:: ../img/sunspots/log-transformed.png :align: center :scale: 60% :alt: Log-transformed Time Series Hmm... The log transformation didn't seem to help too much. In fact, it seems like it just shifted the skew to the other tail. Let's try the Box-Cox transformation. When ``.fit()`` is called, it will learn the lambda transformation parameter: .. code-block:: python from pmdarima.preprocessing import BoxCoxEndogTransformer y_train_bc, _ = BoxCoxEndogTransformer(lmbda2=1e-6).fit_transform(y_train) tsdisplay(y_train_bc, lag_max=100) .. image:: ../img/sunspots/bc-transformed.png :align: center :scale: 60% :alt: Box-Cox transformed Time Series However, the Box-Cox transformation seems to work very well as a means to normalize the data! In fact, a test of normality shows overwhelmingly that this is a normal distribution: .. code-block:: python from scipy.stats import normaltest normaltest(y_train_bc)[1] # 3.751017646057429e-14 Fitting the transformed data ---------------------------- Pmdarima implements a scikit-learn-esque pipeline used to chain transformations and estimators together. Using this, we can centralize the entire transformer/model fit into one metaestimator: .. code-block:: python from pmdarima.pipeline import Pipeline fit2 = Pipeline([ ('boxcox', BoxCoxEndogTransformer(lmbda2=1e-6)), ('arima', pm.AutoARIMA(trace=True, suppress_warnings=True, m=12)) ]) fit2.fit(y_train) :: Fit ARIMA: order=(2, 1, 2) seasonal_order=(1, 0, 1, 12); AIC=9944.949, BIC=9992.301, Fit time=9.227 seconds Fit ARIMA: order=(0, 1, 0) seasonal_order=(0, 0, 0, 12); AIC=10560.255, BIC=10572.093, Fit time=0.049 seconds Fit ARIMA: order=(1, 1, 0) seasonal_order=(1, 0, 0, 12); AIC=10190.842, BIC=10214.518, Fit time=0.541 seconds Fit ARIMA: order=(0, 1, 1) seasonal_order=(0, 0, 1, 12); AIC=10000.189, BIC=10023.865, Fit time=1.834 seconds Fit ARIMA: order=(0, 1, 0) seasonal_order=(0, 0, 0, 12); AIC=10558.255, BIC=10564.174, Fit time=0.057 seconds Fit ARIMA: order=(2, 1, 2) seasonal_order=(0, 0, 1, 12); AIC=9944.067, BIC=9985.500, Fit time=7.050 seconds Fit ARIMA: order=(2, 1, 2) seasonal_order=(0, 0, 0, 12); AIC=9942.145, BIC=9977.659, Fit time=2.284 seconds Fit ARIMA: order=(2, 1, 2) seasonal_order=(1, 0, 0, 12); AIC=9944.073, BIC=9985.506, Fit time=7.021 seconds Fit ARIMA: order=(1, 1, 2) seasonal_order=(0, 0, 0, 12); AIC=9987.000, BIC=10016.595, Fit time=0.626 seconds Fit ARIMA: order=(2, 1, 1) seasonal_order=(0, 0, 0, 12); AIC=9986.300, BIC=10015.895, Fit time=0.697 seconds Fit ARIMA: order=(3, 1, 2) seasonal_order=(0, 0, 0, 12); AIC=9938.391, BIC=9979.824, Fit time=2.610 seconds Fit ARIMA: order=(3, 1, 2) seasonal_order=(1, 0, 0, 12); AIC=9939.926, BIC=9987.278, Fit time=12.276 seconds Fit ARIMA: order=(3, 1, 2) seasonal_order=(0, 0, 1, 12); AIC=9939.653, BIC=9987.005, Fit time=8.903 seconds Fit ARIMA: order=(3, 1, 2) seasonal_order=(1, 0, 1, 12); AIC=9941.966, BIC=9995.237, Fit time=12.801 seconds Near non-invertible roots for order (3, 1, 2)(1, 0, 1, 12); setting score to inf (at least one inverse root too close to the border of the unit circle: 0.996) Fit ARIMA: order=(3, 1, 1) seasonal_order=(0, 0, 0, 12); AIC=9982.150, BIC=10017.664, Fit time=0.734 seconds Fit ARIMA: order=(4, 1, 2) seasonal_order=(0, 0, 0, 12); AIC=9985.360, BIC=10032.711, Fit time=2.084 seconds Fit ARIMA: order=(3, 1, 3) seasonal_order=(0, 0, 0, 12); AIC=9982.528, BIC=10029.880, Fit time=3.890 seconds Fit ARIMA: order=(2, 1, 3) seasonal_order=(0, 0, 0, 12); AIC=9981.766, BIC=10023.199, Fit time=2.940 seconds Fit ARIMA: order=(4, 1, 1) seasonal_order=(0, 0, 0, 12); AIC=9983.484, BIC=10024.917, Fit time=1.505 seconds Fit ARIMA: order=(4, 1, 3) seasonal_order=(0, 0, 0, 12); AIC=9987.366, BIC=10040.637, Fit time=1.602 seconds Pipeline(steps=[('boxcox', BoxCoxEndogTransformer(floor=1e-16, lmbda=None, lmbda2=1e-06, neg_action='raise')), ('arima', AutoARIMA(D=None, alpha=0.05, d=None, error_action='warn', information_criterion='aic', m=12, max_D=1, max_P=2, max_Q=2, max_d=2, max_order=5, max_p=5, max_q=5, maxiter=50, method='lbfgs', n_fits=10, n_jobs=1, offset_test_args=None, out_of_sample_size=0, random=False, random_state=None, scoring='mse', scoring_args=None, seasonal=True, seasonal_test='ocsb', seasonal_test_args=None, start_P=1, start_Q=1, start_p=2, start_params=None, ...))]) And the model summary: .. code-block:: python fit2.summary() :: Statespace Model Results ============================================================================== Dep. Variable: y No. Observations: 2750 Model: SARIMAX(3, 1, 2) Log Likelihood -4962.196 Date: Fri, 13 Dec 2019 AIC 9938.391 Time: 08:17:33 BIC 9979.824 Sample: 0 HQIC 9953.361 - 2750 Covariance Type: opg ============================================================================== coef std err z P>|z| [0.025 0.975] ------------------------------------------------------------------------------ intercept 9.133e-05 0.001 0.076 0.940 -0.002 0.002 ar.L1 1.2561 0.026 47.432 0.000 1.204 1.308 ar.L2 -0.2495 0.025 -10.071 0.000 -0.298 -0.201 ar.L3 -0.0645 0.021 -3.140 0.002 -0.105 -0.024 ma.L1 -1.7521 0.022 -78.615 0.000 -1.796 -1.708 ma.L2 0.7935 0.021 36.997 0.000 0.752 0.836 sigma2 2.1643 0.044 49.641 0.000 2.079 2.250 =================================================================================== Ljung-Box (Q): 54.00 Jarque-Bera (JB): 336.44 Prob(Q): 0.07 Prob(JB): 0.00 Heteroskedasticity (H): 0.71 Skew: -0.11 Prob(H) (two-sided): 0.00 Kurtosis: 4.70 =================================================================================== Warnings: [1] Covariance matrix calculated using the outer product of gradients (complex-step). Notice that not only have our model parameters (predictably) changed, but the AIC is significantly lower! *But*, you say, *the data is on a different scale. Is AIC really a good measure?* Not per se. Let's look at an apples-to-apples comparison. Examining forecasts ------------------- Let's write a function that examines the forecasts over the next 70 periods and plots the residuals: .. code-block:: python from sklearn.metrics import mean_squared_error as mse def plot_forecasts(forecasts, title, figsize=(8, 12)): x = np.arange(y_train.shape[0] + forecasts.shape[0]) fig, axes = plt.subplots(2, 1, sharex=False, figsize=figsize) # Plot the forecasts axes[0].plot(x[:y_train.shape[0]], y_train, c='b') axes[0].plot(x[y_train.shape[0]:], forecasts, c='g') axes[0].set_xlabel(f'Sunspots (RMSE={np.sqrt(mse(y_test, forecasts)):.3f})') axes[0].set_title(title) # Plot the residuals resid = y_test - forecasts _, p = normaltest(resid) axes[1].hist(resid, bins=15) axes[1].axvline(0, linestyle='--', c='r') axes[1].set_title(f'Residuals (p={p:.3f})') plt.tight_layout() plt.show() Here's what the model on the untransformed series produces: .. image:: ../img/sunspots/model-1.png :align: center :scale: 60% :alt: Forecasts on the untransformed Time Series Notice that we are *not* using the SMAPE (symmetric mean absolute percentage error) as we normally might, because our time series contains zeros. When the actual or forecasted value is zero, SMAPE is known to produce misleadingly large error terms. Instead, we're using the RMSE. The second image shows the distribution of the residuals. The normal test shows that they are not normally distributed (which was kind of expected). Here's what the model over the transformed series produces: .. image:: ../img/sunspots/model-2.png :align: center :scale: 60% :alt: Forecasts on the Box-Cox transformed Time Series A few things to note: * These forecasts are on the same scale as the original data. This is because the pipeline is smart enough to inverse transform the forecasts back to the original scale when a transformer is used. * There is a lower RMSE in the forecasts (also expected, given our hypothesis was that normal data will perform better) * The residuals are not *quite* normally distributed, though they are closer than the untransformed model. Final thoughts -------------- By simply transforming our data prior to fitting our model, we were able to produce better forecasts (lower RMSE), a more simple model (fewer parameters) that fit significantly faster, and one that also had a much lower AIC. The key to take away from this is that, although convenient, the :func:`pmdarima.arima.auto_arima` method should not be the first thing you throw at a time series. Take the time to understand your data, clean it up, and make necessary transformations before you begin training models.