Introduction To LSTM For Forecasting With TensorFlow And Neon

What you will learn

• LSTM

• forecasting with LSTM

• Neon and TensorFlow overview and comparison

Notebook available on github:

Forecasting with LSTM

#### Introduction

Image recognition has recently taken a giant leap forward, mainly thanks to the development of **deeper Neural Networks**: when the task is to recognise and classify an image, some specific topologies have demonstrated their efficacy. Most recent algorithms are based on deep stacks of convolutional and pooling layers that form well-proven, high-performance networks such as Inception-V1, used in GoogLeNet, or the more recent Inception V3 and Inception V4.

**Convolutional Neural Networks** are effective at recognising hierarchically organised spatial patterns like those found in an image, but when it comes to recognising temporal patterns, Recurrent Neural Networks have proven to be much more effective than other solutions.

Since their first introduction, Recurrent Neural Networks (RNNs) were deemed difficult to train, so their application was limited in the early years.

One of the main drawbacks of RNNs was the so-called “Vanishing Gradient” problem. An RNN remembers the past to predict the future (or whatever quantity it has been trained on); the Vanishing Gradient amounts to a premature loss of memory. In other words, the RNN cannot remember things that happened in the distant past.

A prominent variation of the RNN that has recently gained wide adoption is called **Long Short-Term Memory (LSTM)**. This kind of network manages the Vanishing Gradient problem very well, and today it is increasingly used in dozens of different applications such as text, genomes, handwriting, speech recognition, or stock market forecasting.

Our goal is to predict what value a certain quantity will take a few steps in the future. This setting implies the use of a regression algorithm to predict the numerical outcome of the time series.

The system we are studying is the **thermodynamics of a large building** and we will **forecast the internal temperature** based on the measurements of a collection of sensors. The temperature itself is recorded by a sensor, so its past values are available to the model. Other measurements include the status of the cooling and heating systems, as well as other environmental quantities. We will use both the temperature and the other exogenous inputs in the model; the past values of most signals will be available, and we will see how to integrate information from control variables as well. Control variables are variables that are under our control and whose values we know at an arbitrary point in time. For example, suppose we want to know the value that the temperature will take in the next hour. We know all past values of the temperature, the energy of the cooling system and the other signals up to now. However, we can use future information as well, because we know all future values of our control variables, for example when the cooling system starts and stops.

We need to construct a model that takes into account the nonlinearity of the system and relates past values to future values of temperature. **In order to perform well, the model should see enough past values** to predict the temperature, but not so many that it gets disoriented. In general, ML models are stateless: they see a point in space and decide the output based on the features of that single point. If we use a Multi Layer Perceptron (MLP), for example, we must provide it with lagged values in order to force it to take past history into account. Lagged values are portions of past history provided to the algorithm as a set of features. For example, if we consider the past 3 values of the temperature to be meaningful for the algorithm, we can reshape the dataset so that the features at time t are the values of the temperature at times t-1, t-2 and t-3. The use of lagged values, however, poses two kinds of problems. First, the number of past values to include is domain specific and could be different for each feature (for example, one feature may need 3 lagged values while another needs more); it is thus another hyperparameter to tune. Second, using many lagged values increases the dimensionality of the problem and consequently the difficulty of finding a model with the desired accuracy. For these reasons we use a different kind of neural network, one with loops that allow information about past values to persist inside the nodes.
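To make the lagged-value reshaping concrete, here is a minimal sketch in Python with NumPy. The function name `make_lagged_features` and the temperature readings are made up for illustration; they are not taken from the notebook.

```python
import numpy as np

def make_lagged_features(series, n_lags):
    """Return (X, y) where row i of X holds the n_lags values preceding y[i].

    Column k of X contains the value at time t-(k+1), so with n_lags=3
    each row is [value(t-1), value(t-2), value(t-3)].
    """
    series = np.asarray(series, dtype=float)
    columns = [
        series[n_lags - k - 1 : len(series) - k - 1] for k in range(n_lags)
    ]
    X = np.column_stack(columns)
    y = series[n_lags:]  # the value at time t that the row should predict
    return X, y

# Made-up temperature readings, only to show the resulting shapes.
temperature = [20.0, 20.5, 21.0, 21.5, 22.0, 22.5]
X, y = make_lagged_features(temperature, n_lags=3)
print(X)
print(y)
```

Note that the first `n_lags` samples are lost as targets, and that each extra lag adds one column per feature, which is the growth in dimensionality mentioned above.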

### Recurrent Neural Networks

Networks of this kind are called Recurrent Neural Networks (RNNs) and they are particularly useful for modeling and generating sequences. They take time into account and they don’t need to be manually fed with lagged values. One can think of an RNN as multiple copies of the same neural network, each receiving values from its predecessor. One of the most intriguing features of RNNs is that they can potentially recover dependencies of the current value on past data. Sometimes RNNs need only recent information to recover the present; however, if we need more context, that is, if the gap between the relevant information and the point where it is needed is large, RNNs become unable to learn. Let’s look at an RNN in more detail. In the picture below we can see a simple RNN with a single input, output and recurrent unit.

RNN with a single input, output and recurrent unit.

If we unroll the network as in the picture below, the RNN looks like a series of copies of the same simple NN unit. Each copy updates its state using the state of the previous cell as well as the current input. We can see that inputs travel across time; a past input (blue node) is related to the current input (red node). However, if the recurrent weight is less than one, the effect of the past input (blue node) diminishes at each time step, as a function of the time interval between the two. Conversely, if the weight is greater than one, the output will explode.
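The effect of the recurrent weight can be illustrated with a deliberately simplified sketch: a single linear recurrent unit with a scalar weight and no activation function, so the contribution of a unit input after n steps is just the weight raised to the n-th power. This is an assumption made for illustration; a real RNN has weight matrices and nonlinearities, but the same repeated multiplication is at work.

```python
def propagated_effect(weight, n_steps):
    """Contribution of a unit input after n_steps through a scalar linear recurrence."""
    return weight ** n_steps

# Compare a weight below one, exactly one, and above one.
for w in (0.9, 1.0, 1.1):
    effects = [propagated_effect(w, n) for n in (1, 10, 50)]
    print(w, effects)
```

With a weight of 0.9 the contribution shrinks towards zero as the number of steps grows, with 1.1 it grows without bound, and only a weight of exactly 1 leaves it unchanged.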

These problems are referred to as **Vanishing and Exploding Gradient** and they are a serious obstacle in training recurrent neural networks. Many solutions have been proposed, based on some sort of regularization or on techniques such as Truncated Backpropagation Through Time (TBPTT). However, the most successful RNN architecture that overcomes the problem is LSTM. We will explain how it works in the next section.

#### Long Short Term Memory (LSTM)

LSTM networks look like ordinary RNNs, but they have an “internal state” with a self-connected recurrent edge of weight 1. This edge spans adjacent time steps, ensuring that the error can pass through time steps without vanishing or exploding. The input node behaves like an ordinary neuron: it takes the current input and concatenates it with the output of the network at the preceding time step. There are three other cells, called **gates, that control how information flows through the network and persists in the internal state.** The input gate controls how information enters the system (modifying the internal state) and is combined with the input node. The output gate controls how the internal state affects the output. Finally, the forget gate controls how past values of the internal state are combined with the result from the input gate and input node. All gates and the input node take as input the combination of the current input and the past network output.

Intuitively, in the forward pass the LSTM can learn when to let input activation into the internal state. If the input gate has value 0, no input activation can enter the system. Conversely, if the input gate saturates to 1, the input activation modifies the internal state. Similarly, the output gate learns when to let values out. **When both gates are closed, the internal state is not affected and persists across time steps, neither increasing nor decreasing. In the backward pass the error can propagate back many steps unchanged (because the weight is fixed at 1), neither exploding nor vanishing.** The function of the gates is to learn when to let error in or out. The forget gate allows the network to forget, i.e., to remove irrelevant or misleading past information from the cell state.
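The gate mechanics described above can be sketched as a single forward step in NumPy. This follows the commonly used LSTM formulation with a forget gate; the function name `lstm_step`, the ordering of the four blocks inside `W`, and the hand-picked weights are assumptions for illustration, and neither library necessarily lays out its parameters this way.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """One forward step of an LSTM cell.

    W has shape (4 * n_hidden, n_input + n_hidden), b has shape (4 * n_hidden,).
    All gates and the input node see the current input concatenated with the
    previous output.
    """
    n_hidden = h_prev.shape[0]
    z = W @ np.concatenate([x, h_prev]) + b
    i = sigmoid(z[0:n_hidden])                 # input gate
    f = sigmoid(z[n_hidden:2 * n_hidden])      # forget gate
    o = sigmoid(z[2 * n_hidden:3 * n_hidden])  # output gate
    g = np.tanh(z[3 * n_hidden:])              # input node
    c = f * c_prev + i * g                     # new internal state
    h = o * np.tanh(c)                         # new output
    return h, c

# Hand-picked parameters: input gate closed (bias -50), forget gate open (bias +50).
n_input, n_hidden = 3, 4
W = np.zeros((4 * n_hidden, n_input + n_hidden))
b = np.zeros(4 * n_hidden)
b[0:n_hidden] = -50.0
b[n_hidden:2 * n_hidden] = 50.0

x = np.ones(n_input)
h_prev = np.zeros(n_hidden)
c_prev = np.array([0.5, -0.2, 0.1, 0.9])
h, c = lstm_step(x, h_prev, c_prev, W, b)
print(c)
```

With the input gate closed and the forget gate fully open, the internal state is carried over essentially unchanged, which is the persistence behaviour described in the paragraph above.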

We choose to forecast a value; a possible approach is to shift the target n steps into the future, where n is the prediction horizon. LSTMs take care of remembering the relevant part of past history while fitting the data. However, it is possible to improve the model with future knowledge. As suggested in the introduction, we know all the future values of some features, the control variables. We can take advantage of this information and use lagged future values. In practice, the vector of regressors is extended with future values of some features. Let’s say we have features X1, X2, X3, X4 at time t, and that the future values of X4 can be determined. Our new input at time t will then be X1(t), X2(t), X3(t), X4(t), X4(t+1), X4(t+2), X4(t+3), where 3 is the time horizon and Y(t+3) is the target. Given the peculiar nature of LSTM, it is even possible to use only a portion of the future values, to both ease the computation and improve performance. In this case we could use only X4(t) and X4(t+3), for example. This setting is depicted in the following picture.
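A minimal sketch of this input construction in NumPy follows. The function name `build_inputs`, its argument names and the toy arrays are hypothetical; the control variable plays the role of X4 and is assumed to be known for future time steps.

```python
import numpy as np

def build_inputs(features, control, target, horizon, future_steps):
    """Extend each input row with future values of a known control variable.

    features:     array (T, n_features) of values measured at time t
                  (this may already include the control value at time t)
    control:      array (T,) with the control variable, known in advance
    target:       array (T,) with the quantity to forecast
    horizon:      number of steps ahead to predict
    future_steps: offsets s (each <= horizon) whose control(t+s) is appended
    """
    features = np.asarray(features, dtype=float)
    control = np.asarray(control, dtype=float)
    target = np.asarray(target, dtype=float)
    n_rows = len(target) - horizon  # rows for which a target at t+horizon exists
    future = np.column_stack([control[s : s + n_rows] for s in future_steps])
    X = np.hstack([features[:n_rows], future])
    y = target[horizon : horizon + n_rows]  # target shifted by the horizon
    return X, y

# Toy data: six time steps, features X1..X4 where X4 is the control variable.
T = 6
x4 = np.arange(T) * 10.0
features = np.column_stack([np.zeros(T), np.ones(T), np.full(T, 2.0), x4])
target = np.arange(100.0, 100.0 + T)

X, y = build_inputs(features, x4, target, horizon=3, future_steps=(1, 2, 3))
print(X.shape)
print(X[0])
print(y)
```

Passing `future_steps=(3,)` instead would reproduce the reduced variant that keeps only X4(t) and X4(t+3).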

In our case the time horizon is one hour.

Now that we have the input and an understanding of LSTM internals, in the next section we will look at two libraries to implement this model.

### TensorFlow And Neon

In this section we will briefly review the two libraries that we will use for this exercise: Google TensorFlow and Nervana Neon. These two libraries are both open source.

#### TensorFlow

TensorFlow was released by Google in November 2015. It is a library for numerical computation on data flow graphs. Each node in the graph represents a mathematical operation, while the edges between nodes represent the multidimensional arrays (tensors) that flow between operations. TensorFlow is particularly suited to ML algorithms and it was designed to run on heterogeneous (and possibly distributed) resources.

**Since it allows the computation to be expressed starting from low-level details, it is possible to construct any kind of NN architecture.** Over time, however, some high-level implementations of NNs have been included in the library. In particular, we will look at the former SkFlow library, which has been moved into the contrib/learn package of TensorFlow. SkFlow is a higher-level library in the style of scikit-learn, and allows seamless interaction with it. It is possible to specify the NN topology at a higher level; for example, one can use an LSTM layer without the cost of building all the gates and operations from scratch. We will use SkFlow in the example below.

#### Neon

Neon is Nervana’s Python-based library for Deep Learning. It was developed directly with NNs in mind; for this reason it is easier to use and provides higher-level functions for creating complex NN architectures. **One of the goals of Neon is to provide great performance along with ease of use.** For this reason it works on selected NVIDIA hardware or Nervana’s hardware. It is also possible to use their Cloud service to run the code. There is no need for other libraries if we want to explore its Deep Learning features.

#### The Model

We implemented the model with both libraries to compare their performance, accuracy and usage. This is not intended to be a rigorous benchmarking test, but an overview of a particular use case (for detailed machine learning benchmarks). **The NN consists of a single hidden layer with 128 LSTM cells, and a single linear neuron as the output layer.** We chose a batch size of 256 and trained the network for 100 epochs; we used Neon version 1.5.4 and TensorFlow version 0.10. There are a few differences in the implementation of the two models due to the libraries; for example, the optimization algorithm is different. We made these choices because we want to compare both speed and accuracy: some parameters are kept identical so that the speed comparison is fair (namely network size, batch size and number of epochs), while others differ so that the two models reach similar accuracy (the optimization algorithm).
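To get a feel for the size of this network, the number of trainable parameters can be estimated with the usual LSTM formula: four blocks (the input node plus three gates), each with one weight per input, one per recurrent connection and one bias. The number of input features is not stated here, so `N_INPUTS` below is purely hypothetical, and the exact count in either library may differ depending on implementation details.

```python
def lstm_regressor_param_count(n_inputs, n_hidden=128, n_outputs=1):
    """Parameter count for one LSTM layer followed by a linear output layer."""
    # Four blocks: input node, input gate, forget gate, output gate.
    lstm = 4 * (n_hidden * (n_inputs + n_hidden) + n_hidden)
    # Linear output neuron(s): one weight per hidden unit plus a bias.
    dense = n_outputs * (n_hidden + 1)
    return lstm + dense

N_INPUTS = 7  # hypothetical number of input features, not from the article
total = lstm_regressor_param_count(N_INPUTS)
print(total)
```

The recurrent weights dominate the count, so the total changes only moderately with the number of input features.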

#### Results

The next graph shows the overall time series. The blue line is the true value while the orange line is the predicted value. As can be seen, the model is able to follow the actual value closely and thus predicts it with a low error.

This becomes clearer if we zoom in on the picture. Of course there are parts of the time series that the predicted value does not follow perfectly, because the dynamics of the system vary rapidly or because its cyclic behavior is not respected (that is, the shape of a portion of the data does not resemble any previously encountered shape). **Sometimes, where there is a turning point and the distribution changes its overall behavior, the network takes some time to adapt, but as soon as new values arrive it is able to follow the series regularly again.** In the next section we will review the two libraries and comment on their results.

#### Comments

**TensorFlow with skflow is simple to use**, like other high-level libraries (namely scikit-learn). If the model is non-standard it is possible to modify it in several different ways; **however, for complex models it is better to go back to plain TensorFlow**. Execution time is not TensorFlow’s strongest feature, but it has the advantage that it can be seamlessly distributed over resources.

**Neon, on the contrary, was developed directly with NNs in mind and thus its interface is already a high-level one.** It allows you to specify the backend, the model, how to initialize the weights, and what kind of optimization to use. The interface is simple, as said, but allows for great modularity. Each module offers most, if not all, of the modern NN algorithms. The Neon website also hosts a “zoo” of pretrained models.

Comparing the precision of the two libraries shows that they reach the same accuracy. It’s worth noting that for skflow we used mostly default parameters and still obtained good accuracy. For Neon we had to perform a brief hyperparameter optimization to reach the same results. Execution time in this case is not intended to be a full benchmark, because the model is too simple. However, we can see that even in this case **Neon’s execution is 30% faster than that of TensorFlow.**

One more thing to note is that **TensorFlow, since its release, has grown in popularity and it is now easy to find tutorials and high-level code examples** (as SkFlow suggests). **Neon, on the contrary, has fair documentation, but few use case examples and not many tutorials are available.** These will probably increase once it gains popularity.