# The Chicken or the Beef? Why Everyone Loves Neural Networks, Part II (Using Keras and TensorFlow)

April 21, 2019 in Python Articles

Written by Ra Inta

## Constructing a Conceptual Artificial Neural Network (continued)

In the previous article, we saw how we could construct a mathematical framework to encode our satisfaction with meal choices on a plane. We took care and time to hand-craft the weights for the neurons ourselves. Now, we are going to take a different tack.

What if we just rolled the dice and took random values for the weights of each neuron? This is akin to playing a game of pick-up sticks on the decision plane. On the face of it this sounds like a recipe for disaster. The attendants just randomly guess what to serve, given your input. At this point, you may wonder if we'll end up being offered the antediluvian spam found wedged behind seat 67C!

Don't worry; we'll protect your exquisite palate by using a mathematical expression of how wrong our initial approximation is. This comparison of how disparate a result of our randomized guess is to our desired outcome is known as a cost function.

### Cost and loss functions

The objective functions used for ANNs are referred to as cost or loss functions. These measures of error are the metrics by which we determine the success of our algorithm, and the goal of a machine learning algorithm is to optimize such a function. The two terms are often used interchangeably; when a distinction is drawn, the cost function usually refers to the error over the entire training set, while the loss function refers to the error for a single training example.

You may already be familiar with one of the most common cost functions, the Mean Squared Error (MSE). This is the routine used to obtain an Ordinary Least Squares (OLS) linear regression fit. Explicitly, we minimize the function:

$$J(x) = \frac{1}{2n} \sum_j \left[ f'_j(x) - f_j(x) \right]^2$$

where f'(x) denotes the function to be approximated and f(x) represents the functional form used by the algorithm (here, the activation function used in each layer). The sum is taken over each of the n nodes, labeled with j.

One problem with using the MSE cost function is that it can be very slow to learn (we'll define what this means below) with large errors for a sigmoid activation function. In other words, when we're really off-base, it takes ages to get closer to our desired answer. This is the exact opposite of how we want an optimizer to behave! Hence the introduction of another cost function, popularly used for binary classification: the cross-entropy (or Bernoulli negative log-likelihood or Binary Cross-Entropy):

$$J(x) = -\sum_j \left[ f'_j(x) \ln f_j(x) + \left(1 - f'_j(x)\right) \ln\left(1 - f_j(x)\right) \right]$$
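To make the "slow to learn" remark concrete, here is a minimal sketch comparing the gradient that each cost function sends back through a single sigmoid neuron. The function names and the sample values of `z` and `target` are assumptions chosen for illustration; the point is only that the MSE gradient carries an extra factor of a(1 - a), which becomes tiny when the neuron is confidently wrong.

```python
import numpy as np

def sigmoid(z):
    """Logistic activation."""
    return 1.0 / (1.0 + np.exp(-z))

def mse_gradient(z, target):
    """Derivative with respect to z of 0.5 * (sigmoid(z) - target)**2."""
    a = sigmoid(z)
    return (a - target) * a * (1.0 - a)

def cross_entropy_gradient(z, target):
    """Derivative with respect to z of the binary cross-entropy for a sigmoid output.

    The a * (1 - a) factor cancels, leaving just the error.
    """
    return sigmoid(z) - target

# A badly wrong neuron: it outputs almost 1 when the target is 0
z, target = 5.0, 0.0
print(mse_gradient(z, target))            # small, so the weights barely move
print(cross_entropy_gradient(z, target))  # close to 1, so the correction is large
```

With these assumed values, the cross-entropy gradient is more than a hundred times larger than the MSE gradient, which is why the former recovers from large errors so much faster.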

There are a number of other cost functions, such as the Mean Absolute Error (MAE):

$$J(x) = \frac{1}{n} \sum_j \left| f'_j(x) - f_j(x) \right|$$

which, as the name implies, averages the absolute values of the differences.
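For readers who prefer code to symbols, the three cost functions can be sketched in a few lines of NumPy. This is an illustration rather than what Keras does internally: the function names, the `eps` clipping constant and the sample arrays are all assumptions. Here `target` plays the role of f'(x) and `pred` the role of f(x), and the cross-entropy is written in its standard form, with the target outside the logarithm and the prediction inside.

```python
import numpy as np

def mse_cost(target, pred):
    """Mean squared error with the conventional factor of 1/2."""
    target = np.asarray(target, dtype=float)
    pred = np.asarray(pred, dtype=float)
    return np.sum((target - pred) ** 2) / (2 * target.size)

def binary_cross_entropy(target, pred, eps=1e-12):
    """Binary cross-entropy summed over j; eps (assumed) guards against log(0)."""
    target = np.asarray(target, dtype=float)
    pred = np.clip(np.asarray(pred, dtype=float), eps, 1 - eps)
    return -np.sum(target * np.log(pred) + (1 - target) * np.log(1 - pred))

def mae_cost(target, pred):
    """Mean absolute error."""
    target = np.asarray(target, dtype=float)
    pred = np.asarray(pred, dtype=float)
    return np.mean(np.abs(target - pred))

# Made-up targets and predictions, purely for illustration
target = np.array([0.0, 1.0, 1.0, 0.0])
pred = np.array([0.1, 0.8, 0.6, 0.3])

print(mse_cost(target, pred))
print(binary_cross_entropy(target, pred))
print(mae_cost(target, pred))
```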

### Optimizers

OK, it's great to know how wrong the approximation is. But we need to give the flight attendants some feedback, otherwise they'll stay wrong, and we'll remain unsatisfied! We need to train them, moving them closer to an optimal value of the cost function by using an optimizer.

In practice, we do not usually have access to the analytic form of the cost or loss functions, and hence do not have an explicit expression for the optimal parameter values. We then have to rely on optimization schemes. Probably the best known are iterative methods in the spirit of the Newton-Raphson family, which 'descend' to the optimal point based on the steepness, or gradient, of the cost function in the region of interest. This is known as gradient descent.

Specifically, perhaps the most widely used optimizers for ANNs are based on Stochastic Gradient Descent (SGD). The goal here is to minimize the cost function, J(x); in other words, to find the point where ∇J(x) = 0.[1]

As an iterative process, we update the weights w(r), beginning from the initial w(0). We wish to find ∇J(x); this may be well approximated by taking a random sample of training inputs and computing the (discrete) gradient by taking a group of random nodes (a mini-batch), j ∊ m. In other words, we assume ∇J(x) ≈ ∇Jm(x); the gradient is approximated by a representative sample of nodes in the local region of parameter space.

The rth iteration of a weight is then updated according to: w(r) = w(r-1) - η∇Jm(w). The new weight is the old weight, minus a small correction, which is proportional to the local gradient. Each full pass through the training data, updating the weights of the whole network along the way, is referred to as an epoch.
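The update rule can be sketched on a problem far simpler than a neural network: fitting a straight line. Everything in this example is an assumption made for illustration (the toy data y = 2x + 1, the learning rate, the mini-batch size and the number of epochs); only the structure of the loop mirrors the rule above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (assumed): points on the line y = 2x + 1
x = rng.uniform(-1.0, 1.0, size=200)
y = 2.0 * x + 1.0

w, b = 0.0, 0.0      # initial weights, w(0)
eta = 0.1            # learning rate
batch_size = 20

for epoch in range(100):
    order = rng.permutation(x.size)              # reshuffle the data each epoch
    for start in range(0, x.size, batch_size):
        idx = order[start:start + batch_size]    # the mini-batch, m
        error = (w * x[idx] + b) - y[idx]
        grad_w = np.mean(error * x[idx])         # gradient of the mini-batch MSE
        grad_b = np.mean(error)
        w -= eta * grad_w                        # new weight = old weight - eta * gradient
        b -= eta * grad_b

print(w, b)
```

Because the toy data contain no noise, the weights should settle very close to the true slope and intercept.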

Once again, the biological analogy becomes crystal clear. It is no coincidence that this process of iteratively refining the weights of each neuron, in response to the desired outcome, is referred to as training the model. Admittedly, our equivalent to a Pavlovian dog snack is the more abstract concept of proximity to the optimal point of a cost function. Indeed, the constant tuning parameter, η > 0, is known as the learning rate, as it determines the rate at which the weights are updated. You may be tempted to set this to a large value. After all, I know you're a very busy person and have plenty of other things you could be doing. However, this is likely to overshoot the global minimum of the cost function, leading to a numerically unstable optimizer.

This would be a great metal band name!
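The effect of an over-eager learning rate is easy to see on a toy problem. The sketch below uses plain (not stochastic) gradient descent on the assumed cost J(w) = w², whose gradient is 2w; the function name and both learning rates are illustrative choices, not values from this article.

```python
def gradient_descent(learning_rate, w0=1.0, num_steps=20):
    """Minimize the toy cost J(w) = w**2 and return the path taken by w."""
    path = [w0]
    for _ in range(num_steps):
        w = path[-1]
        path.append(w - learning_rate * 2.0 * w)  # w - eta * gradient
    return path

small = gradient_descent(learning_rate=0.1)   # creeps steadily towards w = 0
large = gradient_descent(learning_rate=1.1)   # jumps over w = 0 on every step

print(abs(small[-1]))
print(abs(large[-1]))
```

With the small rate, each step shrinks w by a constant factor. With the large rate, each step lands on the opposite side of the minimum and further away than before, so the iterates grow without bound.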

To help with numerical stability, the concept of momentum has been introduced. This reduces oscillations, and overshoot of the global minimum, by adding to each update a term proportional to the previous update. Another useful parameter within enhancements to SGD optimization is the decay parameter. This reduces the learning rate if the loss does not decrease after a set number of epochs.
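As a rough sketch, the classical form of the momentum update keeps a running 'velocity' that blends the previous update with the current gradient. The function name, the default values and the toy cost J(w) = w² are assumptions for illustration, and other formulations of momentum exist.

```python
def momentum_step(w, velocity, grad, learning_rate=0.1, momentum=0.9):
    """One SGD-with-momentum update; returns the new weight and velocity."""
    velocity = momentum * velocity - learning_rate * grad
    return w + velocity, velocity

w, velocity = 1.0, 0.0
history = [w]
for _ in range(2):
    grad = 2.0 * w                      # gradient of the toy cost J(w) = w**2
    w, velocity = momentum_step(w, velocity, grad)
    history.append(w)

print(history)
```

The first step is identical to plain gradient descent, since the velocity starts at zero. From the second step onward, part of the previous update is carried over into the new one.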

Even more sophisticated variants of SGD automatically adapt the learning rate for each parameter, depending on how important a particular feature is (Adagrad, Adadelta). In effect, this means that we do not have to tune the learning rate by hand; this is done automatically.
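To give a flavor of how such per-parameter adaptation works, here is a minimal sketch of a single Adagrad-style update: each parameter's step is divided by the square root of its own accumulated squared gradients. The function name, default values and sample gradients are assumptions for illustration.

```python
import numpy as np

def adagrad_step(w, grad, accum, learning_rate=0.01, eps=1e-8):
    """One Adagrad-style update; accum holds the running sum of squared gradients."""
    accum = accum + grad ** 2
    w = w - learning_rate * grad / (np.sqrt(accum) + eps)
    return w, accum

w = np.array([1.0, 1.0])
accum = np.zeros(2)
grad = np.array([0.1, 10.0])   # two parameters with very different gradient scales

w, accum = adagrad_step(w, grad, accum)
print(w)
```

Despite gradients that differ by a factor of 100, both parameters move by almost exactly the same amount on this first step, because each is rescaled by its own gradient history.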

For more details on variants of optimizers used in ANNs, I highly recommend Sebastian Ruder's "An overview of gradient descent optimization algorithms".

So much for the theory. How do we code these things?

### Introduction to Keras and TensorFlow

Both of these open-source software packages are libraries for the popular Python programming language. We recommend installing them via conda, from within the open-source Anaconda distribution. However, you can also use pip, or your favorite Python package manager. For a more detailed explanation of how to install these, and to follow the code in this article, check out the associated Jupyter notebook on GitHub.

#### TensorFlow

TensorFlow was developed by the Google Brain team and released as open source under the Apache 2.0 license in late 2015. It is a symbolic, high-performance math library with specialized and generalized math objects, particularly tensors, a generalization of vector arithmetic and calculus (hence the name). The mental model for TensorFlow computations is a computational graph, defined by tensors. It is designed to be seamlessly applied to a range of hardware types (including GPGPUs and a specialized ASIC, the TPU, or Tensor Processing Unit).

TensorFlow documentation: https://www.tensorflow.org/

#### Keras

Keras is a high-level API to the neural network libraries CNTK, Theano and TensorFlow. Its high level of abstraction allows rapid prototyping of neural networks, with both convolutional and recurrent network architectures. Its guiding principles are user-friendliness, modularity and easy extensibility. Because it's written in Python, configuration and extension of functionality are relatively seamless within the Python eco-system.

Keras is Greek for 'horn,' a reference to the vision-inducing spirits in the Odyssey.

Keras documentation: https://keras.io/

## Tensors and TensorFlow

The fundamental unit of computation within TensorFlow is the tensor. Tensors are generalizations of vector arithmetic and calculus, allowing linear operations on higher rank objects. These are used to partially define a computation, in the form of a data-flow graph, that will, when executed, produce an output value. TensorFlow constructs a graph based on tensor objects (tf.Tensor).

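This graph is then executed within a TensorFlow session (tf.Session()) instance. Note that tf.Session() belongs to the TensorFlow 1.x API used throughout this article; in TensorFlow 2.x it is available as tf.compat.v1.Session().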

As with all Python libraries, we will have to import them before their first use:

```python
import tensorflow as tf
from tensorflow import keras
```

We can simply generate a tensor object using tf.Variable:

```python
odd_nums = tf.Variable([1, 3, 5, 7, 9, 11])  # Rank 1 tensor is a vector
```

It has the usual .dtype (data type) and .shape (rows, columns etc.) attributes:

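```python
odd_nums.dtype   # tf.int32_ref
odd_nums.shape   # TensorShape([Dimension(6)])

rank1 = tf.rank(odd_nums)
rank1            # <tf.Tensor 'Rank:0' shape=() dtype=int32>

# Stand-in definition with assumed values: a rank 3 tensor with the
# shape and dtype shown in the output below
weird_hypercube = tf.Variable([[[1.0], [2.0]], [[3.0], [4.0]]], dtype=tf.float64)
weird_hypercube  # <tf.Variable 'Variable_1:0' shape=(2, 2, 1) dtype=float64_ref>

rank2 = tf.rank(weird_hypercube)
rank2            # <tf.Tensor 'Rank_1:0' shape=() dtype=int32>
```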

Operations are not run until we have specified that our computation graph is complete. Note that the tf.rank() operations did not return an actual rank value. To execute operations on the tensors we just produced, we require a tf.Session() connection:

```python
with tf.Session() as sess:
    print(sess.run(rank1))
    print(sess.run(rank2))
```

```
1
3
```
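If the idea of defining a computation first and running it later feels unfamiliar, the toy sketch below mimics it in plain Python. This is emphatically not how TensorFlow is implemented; the `Node` class and `constant` helper are hypothetical names invented for this illustration.

```python
class Node:
    """A node in a toy computation graph: it stores an operation and its inputs, not a value."""

    def __init__(self, op, *inputs):
        self.op = op
        self.inputs = inputs

    def run(self):
        # Evaluate the inputs recursively, then apply this node's operation
        args = [i.run() if isinstance(i, Node) else i for i in self.inputs]
        return self.op(*args)

def constant(value):
    """A node with no inputs that simply returns its value when run."""
    return Node(lambda: value)

a = constant(3)
b = constant(4)
total = Node(lambda x, y: x + y, a, b)        # nothing is computed yet
product = Node(lambda x, y: x * y, total, b)  # still just a description

print(product.run())  # the arithmetic only happens here
```

Building `total` and `product` only records what should be done, much as tf.rank() returned a tensor object rather than a number. Calling `run()` plays the role of the session.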

Defining the details of your computation this way, and generating the necessary graphs and sessions, requires quite some thought and a lot of boiler-plate code (which is not particularly Pythonic!).

There are so many commonly used operations and architectures that a higher level API would be very useful! This is where Keras comes in.

### Using Keras and TensorFlow to Solve the Chicken-or-Beef Problem

As programming puzzles go, this is like smashing a walnut with a hydraulic press. You will hopefully see that setting up very sophisticated network architectures is made very easy using Keras. However, we're already attached to this airplane meal puzzle, so we'll see it through to its nutritious end.

We wish to construct a binary classifier (we only have two outputs: 'sated' and 'not sated') based on the conceptual MLP model we considered above. A tabular expression of our puzzle[2] looks like:

| Chicken | Beef | Sated |
|---------|------|-------|
| No      | No   | No    |
| Yes     | No   | Yes   |
| No      | Yes  | Yes   |
| Yes     | Yes  | No    |

We can encode this easily enough in Python. We'll take our inputs, x, as a matrix, and the desired outcome, y, as a vector. In the Python eco-system, these are represented as NumPy arrays:

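```python
import numpy as np

# The four possible input pairs and whether each one leaves us sated
x_unit = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_unit = np.array([0, 1, 1, 0])
x = x_unit
y = y_unit

# Stack 6,000 further copies of the input/output pairs
for __ in range(6000):
    x = np.vstack([x, x_unit])
    y = np.hstack([y, y_unit])
```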

Here, we have replicated the input/output pairs 6,000 times.[3]
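As an aside, the same arrays can be built without a loop. The sketch below uses `np.tile`, which repeats an array a given number of times; the names `num_copies`, `x_tiled` and `y_tiled` are my own. Note that the loop stacks 6,000 copies onto the original pairs, giving 6,001 copies in total.

```python
import numpy as np

x_unit = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_unit = np.array([0, 1, 1, 0])

num_copies = 6001  # the original pairs plus 6,000 stacked copies
x_tiled = np.tile(x_unit, (num_copies, 1))  # repeat along the rows only
y_tiled = np.tile(y_unit, num_copies)

print(x_tiled.shape, y_tiled.shape)
```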

How will we know our model is performing accurately, and we're not sitting in a circle of yes-attendants, telling us our predictions are correct? We need to hold out some portion of the data (a validation data-set) to test our model against:

```python
data_len = x.shape[0]  # Number of rows (samples)
split_Idx = np.random.choice(range(data_len), int(0.75*data_len), replace=False)

# Boolean mask that is True for every row NOT chosen for training
# Note the bit-wise logical negation, ~
test_mask = ~np.isin(np.arange(data_len), split_Idx)

x_train = x[split_Idx]
x_test = x[test_mask]
y_train = y[split_Idx]
y_test = y[test_mask]
```

We took a 75/25 train/test split of the data (hence the int(0.75*data_len) parameter).[4] We should always check that the dimensions of the resulting matrices are what we think they should be:

```python
x_train.shape
y_train.shape
x_test.shape
y_test.shape
```

```
(18003, 2)
(18003,)
(6001, 2)
(6001,)
```

Indeed: 18,003 is roughly 75% of the total (18,003 + 6,001).
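For reuse, the same kind of split can be wrapped in a small helper based on a random permutation of the row indices. This is a sketch under my own naming (`split_train_test`, `train_fraction`, `seed`), not a Keras or NumPy function; the demo arrays simply rebuild the replicated data so the block runs on its own.

```python
import numpy as np

def split_train_test(x, y, train_fraction=0.75, seed=None):
    """Shuffle the row indices, then cut them into a training and a test portion."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(x))
    n_train = int(train_fraction * len(x))
    train_idx, test_idx = order[:n_train], order[n_train:]
    return x[train_idx], x[test_idx], y[train_idx], y[test_idx]

# Rebuild the replicated data so that this example is self-contained
x_demo = np.tile(np.array([[0, 0], [0, 1], [1, 0], [1, 1]]), (6001, 1))
y_demo = np.tile(np.array([0, 1, 1, 0]), 6001)

x_tr, x_te, y_tr, y_te = split_train_test(x_demo, y_demo, seed=0)
print(x_tr.shape, x_te.shape)
```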

Finally, we need to tell Keras that our outputs are categories, rather than actual numerical quantities:

```python
num_classes = 2  # Binary classifier
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)
```
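To see what this conversion produces, here is a NumPy sketch of the same idea, usually called one-hot encoding: each class label becomes a row with a single 1 in the column for that class. The `one_hot` helper is a hypothetical stand-in written for illustration, not the Keras implementation.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Turn integer class labels into rows containing a single 1."""
    labels = np.asarray(labels, dtype=int)
    return np.eye(num_classes, dtype="float32")[labels]

encoded = one_hot([0, 1, 1, 0], 2)
print(encoded)
```

A label of 0 ('not sated') becomes the row [1, 0] and a label of 1 ('sated') becomes [0, 1], so the network needs one output neuron per class.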

Alright, alright, alright!

Let's warm up our Keras hydraulic walnut-press by performing some additional imports of modules:

```python
# Some additional imports
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import RMSprop
```

Respectively, these imports brought in particular classes for a model, a layer type and an optimizer. Keras currently has two types of model: Sequential and the functional Model. The former is simpler, representing a linear chain of layers, and is the model type we want.

Each model type shares the following attributes (among many others):

.layers: the layers of the model

.inputs: the input tensors of the model

.outputs: the output tensors

They also have a .summary() method, giving a summary of the model.

The layers defined here will be dense; in other words, each neuron's output in one layer will be fed as input to each neuron in the next layer. Finally, the optimizer we will use here is recommended as a great all-purpose default: RMSprop(), a close relative of Adadelta (mentioned above).

Time to call our first model!

Create an instance of the Sequential() model class:

`model0 = Sequential()`

How easy was that? Well, our work's not quite done. We still need to define the network architecture. Define the initial (input) layer with two neurons:

```python
input_size = 2

# Two units, as implied by the model summary below;
# the sigmoid activation for this layer is an assumption
model0.add(Dense(input_size, activation='sigmoid', input_shape=(input_size,)))
```

And define the final (output) layer:

```python
model0.add(Dense(num_classes, activation='sigmoid'))

# Summarize the model
model0.summary()
```

```
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
dense_14 (Dense)             (None, 2)                 6
_________________________________________________________________
dense_15 (Dense)             (None, 2)                 6
=================================================================
Total params: 12
Trainable params: 12
Non-trainable params: 0
_________________________________________________________________
```

We have thus defined our model's architecture. Note that this would normally be considered a ridiculously small number of neurons! But this is concordant with our sketch in Figure 3 above. We'll relax this constraint later. First, we must now define how the model determines it has found a suitable approximation.
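To demystify what these two dense layers compute, and where the 12 parameters come from, here is a NumPy sketch of a forward pass through a network of the same shape. The random weights, the helper names and the use of a sigmoid on the first layer are assumptions for illustration; Keras chooses its own initial weights.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dense(inputs, weights, biases):
    """One dense layer: every input feeds every unit, followed by a sigmoid."""
    return sigmoid(inputs @ weights + biases)

rng = np.random.default_rng(0)
w1, b1 = rng.normal(size=(2, 2)), np.zeros(2)  # first layer: 2 inputs -> 2 units
w2, b2 = rng.normal(size=(2, 2)), np.zeros(2)  # output layer: 2 units -> 2 classes

x_unit = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
outputs = dense(dense(x_unit, w1, b1), w2, b2)

# Each layer has 2 x 2 weights plus 2 biases, so 6 parameters per layer
num_params = sum(a.size for a in (w1, b1, w2, b2))
print(outputs.shape, num_params)
```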

### Tracking model performance: metrics

It is vital to quantify how our models perform. Keras makes it simple to track a number of off-the-shelf metrics, which are not used to update or train the model, but may elucidate its behavior. These range from the simple accuracy (the fraction of predictions that match the actual 'ground truth' values) to mean absolute error (mae) or categorical_accuracy.
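As a concrete illustration of the accuracy metric for one-hot encoded outputs, the sketch below picks the most probable class in each row and counts how often it matches the truth. The function and the sample predictions are made up for this example and are not the Keras implementation.

```python
import numpy as np

def categorical_accuracy(y_true, y_pred):
    """Fraction of rows where the predicted class matches the true class."""
    return np.mean(np.argmax(y_true, axis=1) == np.argmax(y_pred, axis=1))

y_true = np.array([[1, 0], [0, 1], [0, 1], [1, 0]])
y_pred = np.array([[0.9, 0.2], [0.3, 0.7], [0.6, 0.4], [0.8, 0.1]])  # made-up outputs

accuracy = categorical_accuracy(y_true, y_pred)
print(accuracy)
```

Three of the four made-up predictions pick the right class, so the accuracy here is 0.75.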

Within Keras, it is a simple matter to define the loss and optimizer functions, and performance metric to track for our MLP model. These are specified at the compile stage of the computation:

```python
model0.compile(loss='categorical_crossentropy',
               optimizer=RMSprop(),
               metrics=['accuracy'])
```

I did mention we'd be using categorical cross-entropy as a cost function way back in the cost function section. Truth is, it's very easy to choose here between a rich range of loss functions and optimizers.

We can now train the model. What does this anthropomorphic term mean? Simply, it is the process of finding the appropriate weights such that the outputs we have in our training data-set correspond to their respective inputs, while minimizing our chosen error metric. Again, this is one line of Keras code:

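```python
batch_size = 128  # Assumed value; the original setting is not shown
num_epochs = 20   # Assumed value; the original setting is not shown

history0 = model0.fit(x_train, y_train,
                      batch_size=batch_size,
                      epochs=num_epochs,
                      verbose=0,
                      validation_data=(x_test, y_test))
```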

If you want to see how the model is performing during training, set verbose=1 to see metrics for each epoch.

Let's evaluate the model:

```python
score0 = model0.evaluate(x_test, y_test, verbose=0)
print('Test loss: {0},      Test accuracy: {1}'.format(score0[0], score0[1]))
```

```
Test loss: 1.1920928955078125e-07,      Test accuracy: 1.0
```

Yes!!! We have trained a feed-forward artificial neural network on our chicken-or-beef puzzle and achieved 100% accuracy!

This should not be too surprising; I already admitted the function itself is quite simple. However, let's exploit this simplicity by examining how the algorithm chose the weights of the model:

[Figure: three panels, labeled W1, W2 and b.] These graphs show the evolution of the weights, w1, w2 and b, from our original equation for this problem, over training epochs, for the two nodes (0 and 1). The weights are roughly equal but opposite. Each starts off comparatively small but gets larger during training. Note that, because we seeded the initial values of the weights randomly, these results will almost surely be different each time this is run.

In this article, we looked at how to construct a very simple artificial neural network to model our chicken-or-beef meal decision. However, this is overly simplistic for most 'real-world' problems. In Part III, the final part of this article, we'll look at how to extend this basic network, and how to deal with that ever-present bugbear in machine learning: over-fitting.

[1] Note that here we have adopted the gradient operator, ∇ ('nabla,' meaning 'harp'); although we have written the above as a function of a single variable, ∇f(x) := ∂f(x)/∂x, ∇ is the derivative across all variables in the space.

[2] Of course, this is the truth table for the XOR function.

[3] This replication is for a frankly embarrassing reason. For the initial architecture, we heavily restrict the number of input neurons. In order to achieve acceptable accuracy for this model, we'd need a lot more data than we strictly have. A better solution is to slightly increase the width of the model, which we do later.

[4] I have decided to keep strictly within Keras here; scikit-learn has a handy built-in function for this: train_test_split().

Written by Ra Inta. Ra is originally from New Zealand, has a PhD in physics, is a data scientist, and has taught for Accelebrate in the US and in Africa. His specialties are R Programming and Python.