Some data is naturally seasonal: it varies in predictable ways over time (or across some dimension that can be modeled as time). These data are often things that we care about and would like to examine. For example, how do water and energy consumption vary across seasons in different geographic locations? What is the trend in changes of traffic levels (either growth or decline) for a particular web site over a six month period? How about for a particular resource on a particular website?
Having tools that help us answer these and other questions is important to understanding raw data, which in turn enables us to make informed decisions. Decision makers in business rely on this type of information to determine where to invest resources; governments and policy makers use it to decide how to direct public funds in order to serve their constituents more effectively.
So what sort of tools can we apply? As it turns out, there is already a mature subfield of Statistics, called Time Series Analysis, that offers exactly what we need. This post will serve as an introduction to a small subset of time series analysis through a short series of definitions and exploration of two relevant topics: the Linear Least Squares technique for finding the parameters of a linear equation that most closely matches a set of data points, and the lag difference operator that is useful for detecting sudden changes in a series of data.
Note that, for reasons of convenience, I represent vectors using the braces of set notation instead of the more common parenthesis notation. That is, $\bar{v} = \{v_1, v_2, \dots, v_n\}$ instead of $\bar{v} = (v_1, v_2, \dots, v_n)$.
We represent time series data as points $v_k = (x, y)$ in the plane, where the $x$ value represents time, and the $y$ value represents a sample at time $x$. A sequence of $n$ samples, taken in order, forms a series, which I represent as a vector $\bar{v} = \{v_1, v_2, \dots, v_n\}$.
For expository purposes, we make the simplifying assumption of a constant, uniform sampling rate. That is, the same amount of time elapses between consecutive samples, and we do not skip samples. This enables us to define a bijection between time and the natural numbers.
By representing data in this manner, we can begin to answer questions about it. How does the sampled quantity vary over time? What is the range of variation? Is there some discernible pattern to the variation; that is, does it change in some predictable way? Is it seasonal? That is, does the variation over time form some sort of repeating pattern?
Suppose we have a time series represented as a vector $\bar{v} = \{v_1, v_2, v_3, \dots, v_n\}$ of length $n$. Then we define the first order lag operator $B$ on an element of $\bar{v}$ as:
\[
Bv_k = v_{k - 1}
\text{ for } v_k \in \bar{v}
\text{ where } k \in \mathbb{N}
\text{ and } 1 \lt k \leq n
\]
and we define the first order lag difference operator $\nabla\bar{v}$ so that:
\[
\begin{align*}
\nabla\bar{v}
&= \{(1 - B)v_k\,|\,k \in \mathbb{N}
\text{ and } 1 \lt k \leq n\} \\
&= \{v_k - Bv_k\,|\,\mathbb{N}
\backslash \{1\} \ni k \leq n\} \\
&= \{v_k - v_{k - 1}\,|\,\mathbb{N}
\backslash \{1\} \ni k \leq n\} \\
&= \{v_2 - v_1, v_3 - v_2, \dots, v_n - v_{n - 1}\}
\end{align*}
\]
which produces a vector $\bar{v}^\prime$ of length $n - 1$.
Application of the first order lag difference operator to a time series is very roughly equivalent to finding the first derivative of an analytic function over the time domain that coincides with the points in the series. The salient observation is that calculating the lag difference of a time series with slow seasonal variation will result in a vector of very small values; a significant spike in the original data can be identified simply by looking for values in $\bar{v}^\prime$ that exceed some reasonable threshold.
Common Lisp code for $\nabla\bar{v}$ is:
(defun backwards-difference (v)
  (mapcar #'- (cdr v) v))
where v is a time series represented as a Lisp list. Note that, if the time intervals in the series are periodic, one can draw a bijection between sample times and $\mathbb{N}$, thus simplifying things considerably when, e.g., computing least squares. This allows for a simple list or vector representation of the series versus ordered sets of $(time, value)$ pairs.
It is possible to extend this technique nearly arbitrarily by defining the $k$th order lag operator:
\[
B^k v_i = v_{i - k}\qquad\text{where }
i \in \mathbb{N}, k \lt i \leq n
\]
and then defining the $k$th order lag difference operator in the obvious way; this can account for a gradual spike that accumulates over some number of successive elements in the series, allowing one to find differences that grow relatively slowly over time. Note that the notation in much of the literature here can be confusing. In particular, the $k$th order lag difference operator for a vector element $v_i$ such that $i, k \in \mathbb{N}, k \lt i \leq n$ is $(1 - B)^k v_i$ which corresponds to the value $v_i - v_{i - k}$; i.e., one does not do a binomial expansion here but rather distributes. $B$ is a unary operator.
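As a sketch of the $k$th order operator under the convention just described (the name lag-difference is my own, not standard):

```lisp
(defun lag-difference (v k)
  "Kth order lag difference (1 - B^k) of the series V: each element
is v_i - v_{i-k}. With K = 1 this is the first order lag difference."
  (mapcar #'- (nthcdr k v) v))
```

With k elements consumed at the front, the result has length $n - k$; for $k = 1$ this coincides with backwards-difference above.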
Note also that this technique can be applied several times; successive iterations roughly correspond to computing derivatives of higher order. For example, $\nabla(\nabla\bar{v}) = \nabla^2\bar{v}$ is nominally equivalent to finding the second derivative with respect to time of some $f(t)$, analytic in time, that corresponds with the points of $\bar{v}$; $\nabla(\nabla(\nabla\bar{v}))$ would be analogous to the third order time derivative, and so on.
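Repeated application is a one-line recursion on top of the backwards-difference function defined earlier (nth-difference is an illustrative name of my own):

```lisp
(defun backwards-difference (v)
  (mapcar #'- (cdr v) v))

(defun nth-difference (v n)
  "Apply the first order lag difference operator to V, N times."
  (if (zerop n)
      v
      (nth-difference (backwards-difference v) (- n 1))))
```

For example, the second differences of the squares $\{1, 4, 9, 16, 25\}$ come out constant: (nth-difference '(1 4 9 16 25) 2) yields (2 2 2).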
The obvious deficiency in this technique is that small spikes can give false positives. This can be handled in several ways; for example, one can check whether a seemingly meaningful difference is followed shortly by an equally significant difference of the opposite sign, which suggests the spike was probably not interesting.
Other techniques are more powerful and may be helpful for eliminating noise. For instance, one may wish to apply some smoothing function to the data before analysis; something like the running median of 3 over the original time series will eliminate statistical outliers while preserving the overall structure of the series. This is basically the set:
\[
\mathrm{med}_{[3]}\bar{v}
= \left\{v_1\right\}
\cup \left\{\mathrm{med}(v_{i - 1},
v_i, v_{i + 1})\,|\, i \in \mathbb{N},
1 \lt i \lt n\right\}
\cup \left\{v_n\right\}
\]
(If the first and/or last element of the vector is itself an outlier, it may simply be discarded, provided you otherwise have enough elements in the series.)
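A minimal sketch of the running median of 3, keeping the endpoints as-is per the definition above (both function names are my own):

```lisp
(defun median-of-3 (a b c)
  "Median of three values: the middle element after sorting."
  (second (sort (list a b c) #'<)))

(defun running-median-3 (v)
  "Running median of 3 over the series V, preserving the endpoints.
The mapcar walks the three staggered copies of V in lockstep."
  (append (list (first v))
          (mapcar #'median-of-3 v (cdr v) (cddr v))
          (last v)))
```

For example, (running-median-3 '(1 2 9 3 4)) yields (1 2 3 4 4): the outlier 9 is removed while the overall upward structure survives.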
I do not provide code for this, but it is straightforward to implement. It can also be dangerous: before applying a smoothing function, one should be sure one understands its effects and whether it is actually safe to discard outliers.
Note that, just like the lag difference operator, smoothing functions can be applied multiple times, or the median window can be widened; that is, instead of computing the median of three elements, one can take the median of five, and so on.
Other techniques include low- and high-pass filtering, as in, e.g., discrete signal analysis.
One may wish to find a linear approximation that roughly corresponds with a series or some function thereof (e.g., the first order lag difference of said series). Geometrically speaking, this means finding the parameters of the best-fit line $y = mx + b$ through the points $v_k$ of the time series vector $\bar{v}$, where $m$ is the slope and $b$ is the $y$ intercept of this line. Suppose we represent the vector $\bar{v} = \{v_1, v_2, \dots, v_n\}$ as a sequence of $(x_i, y_i)$ pairs corresponding to points in the plane, where $v_i = (x_i, y_i)$, each $x_i$ is a time value, and each $y_i$ is an observed sample; thus, the $x$-axis represents time and the $y$-axis represents the sample domain. Then we may use the technique of linear least squares to find a pair of equations that recover the parameters $m$ and $b$. The essential idea is to minimize the vertical distance between points in the data set and points on the line; the derivation is beyond the scope of this document, but it is accessible to anyone with a decent working knowledge of linear algebra.
The equations for $m$ and $b$ are:
\[
\begin{align*}
m &= \frac{n(\sum{xy}) - (\sum{y})(\sum{x})}
{n(\sum{x^2}) - (\sum{x})^2} \\
b &= \frac{(\sum{y})(\sum{x^2}) - (\sum{x})(\sum{xy})}
{n(\sum{x^2}) - (\sum{x})^2}
\end{align*}
\]
The sums are written here without bounds; each is taken over all $n$ elements of the vector.
Supposing, as mentioned before, that we can define a bijection between time values and $\mathbb{N}$, the time series becomes directly representable as a simple vector of length $n$, which we can represent as a list of sample values only. This also simplifies some of the sums in the calculations of $m$ and $b$; in particular, we can use the closed form for the sum of the integers from $1$ to $n$, due to Gauss, and the closed form for the $n$th square pyramidal number:
\[
\begin{align*}
\sum_{k = 1}^n k &= \frac{n(n + 1)}{2} \\
\sum_{k = 1}^n k^2 &= \frac{n(n + 1)(2n + 1)}{6}
\end{align*}
\]
Given that, Common Lisp code for computation over a vector is as follows:
(defun sum-1-to (n)
  "Sum of the integers from 1 to N (Gauss's closed form)."
  (/ (* n (+ n 1)) 2))

(defun sum-squares-1-to (n)
  "Sum of the squares from 1 to N (the Nth square pyramidal number)."
  (/ (* n (+ n 1) (+ (* n 2) 1)) 6))

(defun sum-vector (v)
  (reduce #'+ v))

(defun linear-least-squares (v)
  (let* ((n (length v))
         (xs (loop for k from 1 to n collect k))
         (s (sum-1-to n))                    ; sum of x
         (r (sum-squares-1-to n))            ; sum of x^2
         (u (sum-vector (mapcar #'* xs v)))  ; sum of xy
         (w (sum-vector v))                  ; sum of y
         (m (/ (- (* n u) (* s w))
               (- (* n r) (* s s))))
         (b (/ (- (* w r) (* s u))
               (- (* n r) (* s s)))))
    (list m b)))
The linear-least-squares function returns a Lisp list containing a pair of values representing $m$ and $b$, respectively. One may wish to pull the computation of the denominator into a separate binding, as it is the same in both equations; I leave it as is for parity with the mathematical notation.
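As a quick sanity check, assuming the definitions above, a series sampled exactly from the line $y = 2x$ recovers its parameters:

```lisp
(linear-least-squares '(2 4 6 8))
;; => (2 0), i.e., slope m = 2 and intercept b = 0
```

Because the code uses exact rational arithmetic throughout, there is no floating-point error to worry about here.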
Also, it should be noted that the technique of least squares generalizes to higher dimensions and to higher-degree models; one can find quadratic and cubic least squares fits if so desired.
Recall that in the smoothing section, we investigated a function that operated on contiguous three-element subsequences of the original time series. This technique is generally useful for other things as well, for example the computation of least squares, where it may uncover more interesting phenomena than a single computation over the entire vector. When applying an operation to contiguous subsequences within the time series vector, we may think of sliding a ``window'' across the larger sequence. In the case of least squares, we note that the smaller the window, the closer we get to an approximation of the continuous rate of change of the underlying function. If we apply this technique to a vector $\bar{v}$ of $n$ elements using a window of size $k$, we end up with a vector $\bar{w}$ of $n - k + 1$ elements corresponding to pairs $w_i = (m_i, b_i)$. One may think of this as a heavily smoothed sequence in which a $b_i$ that exceeds some application-specific threshold almost certainly represents a significant event.
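A sketch of the windowed computation, reusing the linear-least-squares function defined above (sliding-least-squares is my own name):

```lisp
(defun sliding-least-squares (v k)
  "Fit a least squares line over each window of K contiguous
elements of V, yielding a list of (m b) pairs of length n - k + 1."
  (loop for tail on v
        while (>= (length tail) k)
        collect (linear-least-squares (subseq tail 0 k))))
```

Note that, as written, each window's $x$ values restart at $1$, so each $b_i$ is the intercept relative to the start of its own window rather than relative to the start of the whole series; one should keep that in mind when choosing a threshold.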
This is a very rough, very concise treatment of two topics in time series analysis. I hope you enjoyed them and found them useful. The field is much larger than just this, of course, and there is an even larger field of statistics beyond that and an even larger field of applied mathematics beyond that.
In lieu of a formal bibliography, let me say that the Wikipedia articles on linear least squares, the lag difference operator, time series and sequences are decent.