In the previous article, we saw how we could construct a mathematical framework to encode our satisfaction with meal choices on a plane. We took care and time to handcraft the weights for the neurons ourselves. Now, we are going to take a different tack.
What if we just rolled the dice and took random values for the weights of each neuron? This is akin to playing a game of pickup sticks on the decision plane. On the face of it this sounds like a recipe for disaster. The attendants just randomly guess what to serve, given your input. At this point, you may wonder if we'll end up being offered the antediluvian spam found wedged behind seat 67C!
Don't worry; we'll protect your exquisite palate by using a mathematical expression of how wrong our initial approximation is. This measure of how far the result of our randomized guess lies from our desired outcome is known as a cost function.
The important objective functions for ANNs are referred to as cost or loss functions. These functional measures of error are the metrics with which we determine the success of our algorithm. The goal of machine learning algorithms is to optimize such a function. The former term (cost function) is often reserved for the entire training set, in which case the loss function is defined as the loss for a single training example.
You may already be familiar with one of the most common cost functions, the Mean Squared Error (MSE). This is the routine used to obtain an Ordinary Least Squares (OLS) linear regression fit. Explicitly, we minimize the function:
J(x) = (1/2n) Σ_{j} [f'_{j}(x) - f_{j}(x)]^{2}
Where f'(x) denotes the function to be approximated and f(x) represents the functional form used by the algorithm (here, the activation function used in each layer). Here, the sum is taken over each of the n nodes, labeled with j.
One problem with using the MSE cost function is that it can be very slow to learn (we'll define what this means below) with large errors for a sigmoid activation function. In other words, when we're really off-base, it takes ages to get closer to our desired answer. This is the exact opposite of how we want an optimizer to behave! Hence the introduction of another cost function, popularly used for binary classification: the cross-entropy (or Bernoulli negative log-likelihood or Binary Cross-Entropy):
J(x) = -Σ_{j} [f'_{j}(x) ln(f_{j}(x)) + (1 - f'_{j}(x)) ln(1 - f_{j}(x))]
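To make the "slow to learn" complaint concrete, here is a minimal sketch for a single sigmoid neuron. The setup is an assumption for illustration: one pre-activation value z, one target y, and hypothetical helper names. It compares the gradient each cost function passes back when the neuron is badly wrong.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mse_grad(z, y):
    """Derivative with respect to z of 0.5 * (sigmoid(z) - y)**2."""
    a = sigmoid(z)
    return (a - y) * a * (1.0 - a)

def cross_entropy_grad(z, y):
    """Derivative with respect to z of -[y*ln(a) + (1 - y)*ln(1 - a)], a = sigmoid(z)."""
    return sigmoid(z) - y

# The neuron is confidently wrong: the target is 1, but sigmoid(-10) is nearly 0
z, y = -10.0, 1.0
print(mse_grad(z, y))            # tiny gradient: learning crawls
print(cross_entropy_grad(z, y))  # gradient close to -1: learning is brisk
```

The squared-error gradient carries an extra factor a(1 - a), which vanishes when the sigmoid saturates; with the cross-entropy that factor cancels.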
There are a number of other cost functions, such as the Mean Absolute Error (MAE):
J(x) = (1/n) Σ_{j} |f'_{j}(x) - f_{j}(x)|
Which, as the name implies, averages the absolute values of the differences.
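As a rough sketch, the three cost functions above fit in a few lines of NumPy. The function names and sample values are illustrative assumptions: `target` holds the desired outputs and `pred` the network's approximations.

```python
import numpy as np

def mse_cost(target, pred):
    """Mean squared error, with the 1/(2n) convention."""
    target, pred = np.asarray(target, dtype=float), np.asarray(pred, dtype=float)
    return np.sum((target - pred) ** 2) / (2 * target.size)

def cross_entropy_cost(target, pred, eps=1e-12):
    """Binary cross-entropy; predictions are clipped to avoid log(0)."""
    target = np.asarray(target, dtype=float)
    pred = np.clip(np.asarray(pred, dtype=float), eps, 1 - eps)
    return -np.sum(target * np.log(pred) + (1 - target) * np.log(1 - pred))

def mae_cost(target, pred):
    """Mean absolute error."""
    target, pred = np.asarray(target, dtype=float), np.asarray(pred, dtype=float)
    return np.mean(np.abs(target - pred))

target = np.array([0.0, 1.0, 1.0, 0.0])
pred = np.array([0.1, 0.8, 0.6, 0.3])

print(mse_cost(target, pred))
print(cross_entropy_cost(target, pred))
print(mae_cost(target, pred))
```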
OK, well that's great to know how wrong the approximation is. But we need to give the flight attendant some feedback, otherwise they'll stay wrong. And we'll remain unsatisfied! We need to train them, using an optimizer, to get them closer to an optimal value of the cost function.
In practice, we do not usually have access to the analytic form of the cost or loss functions, and hence do not have an explicit expression for the optimal parameter values. We have to then rely on optimization schemes. Probably the best known are the Newton-Raphson family of optimization functions, which 'descend' to the optimal point, based on the steepness, or gradient, of the cost function in the region of interest. This is known as gradient descent.
Specifically, perhaps the most widely used optimizers for ANNs are based on Stochastic Gradient Descent (SGD). Consider that the goal here is to minimize the cost function, J(x). In other words[1], we seek the point where ∇J(x) = 0.
As an iterative process, we update the weights w(r), beginning from the initial w(0). We wish to find ∇J(x); this may be well approximated by taking a random sample of training inputs and computing the (discrete) gradient by taking a group of random nodes (a minibatch), j ∊ m. In other words, we assume ∇J(x) ≈ ∇J_{m}(x); the gradient is approximated by a representative sample of nodes in the local region of parameter space.
The rth iteration of a weight is then updated according to: w(r) = w(r-1) - η∇J_{m}(w). The new weight is the old weight, minus a small correction, which is proportional to the local gradient. Each complete pass over the training set, during which the weights of the whole network are updated, is referred to as an epoch.
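The update rule can be sketched with plain NumPy. Everything here is an illustrative assumption rather than the network used later: a toy linear model, noiseless data, a squared-error cost, and hand-picked values for the learning rate and mini-batch size.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: y = 2*x1 - 3*x2, with no noise
X = rng.normal(size=(1000, 2))
true_w = np.array([2.0, -3.0])
y = X @ true_w

w = np.zeros(2)       # initial weights, w(0)
eta = 0.1             # learning rate
batch_size = 32

for r in range(500):
    # Random mini-batch m
    idx = rng.choice(len(X), size=batch_size, replace=False)
    residual = X[idx] @ w - y[idx]
    # Gradient of the (1/2n) squared-error cost, estimated on the mini-batch only
    grad = X[idx].T @ residual / batch_size
    # w(r) = w(r-1) - eta * grad
    w = w - eta * grad

print(w)  # should land very close to true_w
```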
Once again, the biological analogy becomes crystal clear. There is no coincidence that this process of iteratively refining the weights of each neuron, in response to the desired outcome, is referred to as training the model. Admittedly, our equivalent to a Pavlovian dog snack is the more abstract concept of proximity to the optimal point of a cost function. Indeed, the constant tuning parameter, η > 0, is known as the learning rate, as it determines the rate at which the weights are updated. You may be tempted to set this to a large value. After all, I know you're a very busy person and have plenty of other things you could be doing. However, this is likely to overshoot the global minimum of the cost function and lead to a numerically unstable optimizer.
This would be a great metal band name!
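To see the overshoot in action, here is a minimal sketch of gradient descent on the one-dimensional cost J(w) = w^2. The cost function and both learning rates are assumptions chosen for illustration.

```python
def descend(eta, w0=1.0, steps=50):
    """Gradient descent on J(w) = w**2, whose gradient is 2*w."""
    w = w0
    for _ in range(steps):
        w = w - eta * 2.0 * w
    return w

w_small = descend(eta=0.1)   # each step multiplies w by 0.8: steady convergence
w_large = descend(eta=1.1)   # each step multiplies w by -1.2: oscillating blow-up

print(abs(w_small))
print(abs(w_large))
```

With the small learning rate the weight creeps toward the minimum at zero; with the large one each step jumps past the minimum and lands further away than it started.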
To help with numerical stability, the concept of momentum has been introduced. This reduces oscillations, and overshoot of the global minimum, by adding a term proportional to the previous update of the weights. Another useful parameter within enhancements to SGD optimization is the decay parameter. This reduces the learning rate if the loss does not decrease after a set number of epochs.
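A minimal sketch of a momentum update, assuming a toy quadratic cost that is much steeper in one direction than the other. The cost, the learning rate and the momentum coefficient `beta` are all illustrative choices, not values used elsewhere in this article.

```python
import numpy as np

def grad(w):
    """Gradient of the toy cost J(w) = 0.5 * (w[0]**2 + 25 * w[1]**2)."""
    return np.array([1.0, 25.0]) * w

eta, beta = 0.03, 0.9            # learning rate and momentum coefficient
w = np.array([1.0, 1.0])
velocity = np.zeros_like(w)

for _ in range(300):
    # Keep a fraction of the previous update, then add the new gradient step
    velocity = beta * velocity - eta * grad(w)
    w = w + velocity

print(np.linalg.norm(w))  # distance from the minimum at the origin
```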
Even more sophisticated variants of SGD involve the automatic update of learning rates depending on how important a particular feature parameter is (Adagrad, Adadelta). In effect, this means that we do not have to attempt to tune the learning rate; this is done automatically.
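As a rough illustration of a per-parameter learning rate, here is a sketch of a single Adagrad-style step. The function name, default values and gradients are assumptions; library implementations differ in their details.

```python
import numpy as np

def adagrad_step(w, grad, accum, eta=0.1, eps=1e-8):
    """One Adagrad update: each parameter is scaled by its own gradient history."""
    accum = accum + grad ** 2                      # running sum of squared gradients
    w = w - eta * grad / (np.sqrt(accum) + eps)
    return w, accum

w = np.array([1.0, 1.0])
accum = np.zeros_like(w)
grad = np.array([0.01, 100.0])    # two gradients of wildly different scale

w_new, accum = adagrad_step(w, grad, accum)
print(w - w_new)   # the step taken along each parameter
```

Despite gradients that differ by four orders of magnitude, the first step has nearly the same size along both parameters, because each gradient is divided by its own accumulated magnitude.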
For more details on variants of optimizers used in ANNs, I highly recommend Sebastian Ruder's "An overview of gradient descent optimization algorithms":
http://ruder.io/optimizing-gradient-descent/index.html
So much for the theory. How do we code these things?
I thought you'd never ask.
We'll use two open-source software packages, TensorFlow and Keras. Both are libraries for the popular Python programming language. We recommend installation of these via conda, from within the open-source Anaconda distribution. However, you can also use pip, or your favorite Python package manager. For a more detailed explanation of how to download these, and follow the code in this article, check out the associated Jupyter notebook on GitHub.
TensorFlow
TensorFlow was developed by the Google Brain team, and released under the open-source Apache 2.0 license in late 2015. It is a symbolic, high-performance math library with specialized and generalized math objects, particularly tensors, a generalization of vector arithmetic and calculus (hence the name). The mental model for TensorFlow computations is a computational graph, defined by tensors. It is designed to be seamlessly applied to a range of hardware types (including GPGPUs and a specialized ASIC, the Tensor Processing Unit, or TPU).
TensorFlow documentation: https://www.tensorflow.org/
Keras
Keras is a high-level API to the neural network libraries CNTK, Theano and TensorFlow. Its high level of abstraction allows rapid prototyping of neural networks, with both convolutional and recurrent network architectures. Its guiding principles are user-friendliness, modularity and easy extensibility. Because it's written in Python, configuration and extension of functionality are relatively seamless within the Python ecosystem.
Keras is Greek for 'horn,' a reference to the vision-inducing spirits in the Odyssey.
Keras documentation: https://keras.io/
The fundamental unit of computation within TensorFlow is the tensor. Tensors are generalizations of vector arithmetic and calculus, allowing linear operations on higher rank objects. These are used to partially define a computation, in the form of a dataflow graph, that will, when executed, produce an output value. TensorFlow constructs a graph based on tensor objects (tf.Tensor).
This graph is then executed within a TensorFlow session (tf.Session()) instance.
As with all Python libraries, we will have to import them before their first use:
import tensorflow as tf
from tensorflow import keras
We can simply generate a tensor object using tf.Variable:
odd_nums = tf.Variable([1, 3, 5, 7, 9, 11]) # Rank 1 tensor is a vector
It has the usual .dtype (data type) and .shape (rows, columns etc.) attributes:
odd_nums.dtype
odd_nums.shape
tf.int32_ref
TensorShape([Dimension(6)])
rank1 = tf.rank(odd_nums)
rank1
<tf.Tensor 'Rank:0' shape=() dtype=int32>
rank2 = tf.rank(weird_hypercube)
rank2
<tf.Tensor 'Rank_1:0' shape=() dtype=int32>
weird_hypercube
<tf.Variable 'Variable_1:0' shape=(2, 2, 1) dtype=float64_ref>
Operations are not run until we have specified that our computation graph is complete. Note that the tf.rank() operations did not return an actual rank value. To execute operations on the tensors we just produced, we require a tf.Session() connection:
with tf.Session() as sess:
    # Running the graph is what actually computes the rank values
    print(sess.run(rank1))
    print(sess.run(rank2))
1
3
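If the idea of rank is unfamiliar, NumPy arrays offer a rough analogy. This sketch is not TensorFlow code: it only mirrors the two shapes used above, and filling the rank 3 array with zeros is an assumption.

```python
import numpy as np

odd_nums_np = np.array([1, 3, 5, 7, 9, 11])   # rank 1: a vector
cube_np = np.zeros((2, 2, 1))                 # rank 3, same shape as weird_hypercube

# ndim plays the role of the tensor rank; shape lists the size of each axis
print(odd_nums_np.ndim, odd_nums_np.shape)
print(cube_np.ndim, cube_np.shape)
```

Unlike the graph-based tensors above, NumPy evaluates eagerly, so no session is needed to read off the values.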
Defining the details of your computation this way, and generating the necessary graphs and sessions, requires quite some thought and a lot of boilerplate code (which is not particularly Pythonic!).
There are so many commonly used operations and architectures, that a higher level API would be very useful! This is where Keras comes in.
As programming puzzles go, this is like smashing a walnut with a hydraulic press. You will hopefully see that setting up very sophisticated network architectures is made very easy using Keras. However, we're already attached to this airplane meal puzzle, so we'll see it through to its nutritious end.
We wish to construct a binary classifier (we only have two outputs: 'sated' and 'not sated') based on the conceptual MLP model we considered above. A tabular expression of our puzzle[2] looks like:
Chicken | Beef | Sated
No      | No   | No
Yes     | No   | Yes
No      | Yes  | Yes
Yes     | Yes  | No
We can encode this easily enough in Python. We'll take our inputs, x, as a matrix, and the desired outcome, y, as a vector. In the Python ecosystem, these are represented as NumPy arrays:
import numpy as np

x_unit = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_unit = np.array([0, 1, 1, 0])

x = x_unit
y = y_unit
for __ in range(6000):
    x = np.vstack([x, x_unit])
    y = np.hstack([y, y_unit])
Here, we have replicated the input/output pairs 6,000 times[3].
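As a quick sanity check, the sketch below confirms that the encoded outputs really are the XOR of the two inputs, and shows np.tile as an alternative way of building the replicated arrays. Using np.tile is a suggestion, not what the code above does.

```python
import numpy as np

x_unit = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_unit = np.array([0, 1, 1, 0])

# XOR of the two input columns should reproduce the 'sated' column
xor_of_inputs = np.logical_xor(x_unit[:, 0], x_unit[:, 1]).astype(int)
print(np.array_equal(xor_of_inputs, y_unit))

# The loop above stacks 6,000 copies onto the original, giving 6,001 in total;
# np.tile builds the same arrays in one call
x = np.tile(x_unit, (6001, 1))
y = np.tile(y_unit, 6001)
print(x.shape, y.shape)
```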
How will we know our model is performing accurately, and we're not sitting in a circle of yes-attendants, telling us our predictions are correct? We need to hold out some portion of the data (a validation dataset) to test our model against:
data_len = x.shape[0]
split_Idx = np.random.choice(range(data_len), int(0.75*data_len), replace=False)
x_train = x[split_Idx]
x_test = x[~np.in1d(range(data_len), split_Idx)]  # Note the bitwise logical negation, ~
y_train = y[split_Idx]
y_test = y[~np.in1d(range(data_len), split_Idx)]
We took a 75/25 train/test split of the data (hence the int(0.75*data_len) parameter)[4]. We should always check the dimensions of the resulting matrices are what we think they should be:
x_train.shape
y_train.shape
x_test.shape
y_test.shape
(18003, 2)
(18003,)
(6001, 2)
(6001,)
Indeed: 18,003 is roughly 75% of the total (18,003 + 6,001).
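An alternative way to make the same kind of split is to shuffle the row indices once and slice them, which avoids the membership test. This is a sketch under assumptions: the fixed seed and the variable names are illustrative, and the data are rebuilt here so the block runs on its own.

```python
import numpy as np

rng = np.random.default_rng(42)   # fixed seed, only so the split is repeatable

x = np.tile(np.array([[0, 0], [0, 1], [1, 0], [1, 1]]), (6001, 1))
y = np.tile(np.array([0, 1, 1, 0]), 6001)

# Shuffle the row indices, then cut them at the 75% mark
shuffled = rng.permutation(len(x))
n_train = int(0.75 * len(x))
train_idx, test_idx = shuffled[:n_train], shuffled[n_train:]

x_train, y_train = x[train_idx], y[train_idx]
x_test, y_test = x[test_idx], y[test_idx]

print(x_train.shape, x_test.shape)
```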
Finally, we need to tell Keras that our outputs are categories, rather than actual numerical quantities:
num_classes = 2  # Binary classifier
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)
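For integer class labels, this categorical encoding amounts to a one-hot encoding: one column per class, with a single 1 in each row. A rough NumPy equivalent, using a hypothetical helper and not the Keras implementation itself, might look like:

```python
import numpy as np

def one_hot(labels, num_classes):
    """Return a (len(labels), num_classes) array with a single 1 per row."""
    return np.eye(num_classes)[np.asarray(labels)]

encoded = one_hot([0, 1, 1, 0], 2)
print(encoded)
```

Each 'not sated' label (0) becomes the row [1, 0] and each 'sated' label (1) becomes [0, 1].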
Alright, alright, alright!
Let's warm up our Keras hydraulic walnut press by performing some additional imports of modules:
# Some additional imports
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import RMSprop
Respectively, these imports brought down particular classes of a model, a layer type and an optimizer. Keras currently has two types of model: Sequential and the functional Model. The former is simpler, representing a linear chain of layers, and is the model type we want.
Each model type shares the following attributes (among many others):
.layers: the layers of the model
.inputs: the input tensors of the model
.outputs: the output tensors
They also have a .summary() method, giving a summary of the model.
The layers defined here will be dense; in other words, each neuron's output in one layer will be fed as input to each neuron in the next layer. Finally, the optimizer we will use here, RMSprop(), is recommended as a great all-purpose default; it is a close relative of Adadelta (mentioned above).
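What a dense layer computes can be sketched in NumPy as a matrix product plus a bias, passed through an activation. The helper names and the random starting weights are assumptions for illustration; this is not how Keras is implemented internally.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def dense(x, W, b, activation):
    """Fully connected layer: every input feeds every neuron."""
    return activation(x @ W + b)

rng = np.random.default_rng(0)
W = rng.normal(size=(2, 2))    # 2 inputs x 2 neurons, random initial weights
b = np.zeros(2)                # one bias per neuron

x_batch = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
hidden = dense(x_batch, W, b, relu)

print(hidden.shape)            # one row per input, one column per neuron
print(W.size + b.size)         # trainable parameters in this layer
```

A layer with two inputs and two neurons therefore carries 2 x 2 weights plus 2 biases, or six trainable parameters.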
Time to call our first model!
Instantiate an instance of the Sequential() model class:
model0 = Sequential()
How easy was that? Well, our work's not quite done. We still need to define the network architecture. Define the initial (input) layer with two neurons:
input_size = 2
model0.add(Dense(input_size, activation='relu', input_shape=(2,)))
And define the final (output) layer:
model0.add(Dense(num_classes, activation='sigmoid'))

# Summarize the model
model0.summary()
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
dense_14 (Dense)             (None, 2)                 6
_________________________________________________________________
dense_15 (Dense)             (None, 2)                 6
=================================================================
Total params: 12
Trainable params: 12
Non-trainable params: 0
_________________________________________________________________
We have thus defined our model's architecture. Note that this would normally be considered a ridiculously small number of neurons! But this is concordant with our sketch in Figure 3 above. We'll relax this constraint later. First, we must now define how the model determines it has found a suitable approximation.
It is vital to quantify how our models perform. Keras makes it simple to track a number of off-the-shelf metrics, which are not used to update or train the model, but may elucidate its behavior. These range from the simple accuracy (the fraction of predictions that match the actual 'ground truth' values) to mean absolute error (mae) or categorical_accuracy.
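As a rough sketch of what two of these metrics measure, here they are computed by hand in NumPy. The predicted probabilities are made up for illustration.

```python
import numpy as np

y_true = np.array([[1, 0], [0, 1], [0, 1], [1, 0]])   # one-hot ground truth
y_prob = np.array([[0.9, 0.1],
                   [0.2, 0.8],
                   [0.6, 0.4],
                   [0.7, 0.3]])                       # made-up predictions

# Accuracy: fraction of rows whose most probable class is the true class
accuracy = np.mean(np.argmax(y_prob, axis=1) == np.argmax(y_true, axis=1))

# Mean absolute error between the predicted probabilities and the targets
mae = np.mean(np.abs(y_true - y_prob))

print(accuracy, mae)
```

Only the third row is misclassified, so the accuracy is three out of four.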
Within Keras, it is a simple matter to define the loss and optimizer functions, and performance metric to track for our MLP model. These are specified at the compile stage of the computation:
model0.compile(loss='categorical_crossentropy', optimizer=RMSprop(), metrics=['accuracy'])
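To give a feel for the loss named in the compile step, here is a hand-rolled sketch of categorical cross-entropy for one-hot targets. The function and the clipping constant are assumptions for illustration and may differ from what Keras does internally.

```python
import numpy as np

def categorical_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean over samples of -sum_k y_true[k] * ln(y_prob[k])."""
    y_prob = np.clip(y_prob, eps, 1.0)   # avoid taking log(0)
    return float(np.mean(-np.sum(y_true * np.log(y_prob), axis=1)))

y_true = np.array([[1.0, 0.0], [0.0, 1.0]])

confident = np.array([[1.0, 0.0], [0.0, 1.0]])   # perfect predictions
undecided = np.array([[0.5, 0.5], [0.5, 0.5]])   # a coin flip

loss_confident = categorical_crossentropy(y_true, confident)
loss_undecided = categorical_crossentropy(y_true, undecided)
print(loss_confident, loss_undecided)
```

Perfect predictions give a loss of essentially zero, while a coin flip between two classes costs ln(2) per sample.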
I did mention we'd be using categorical cross-entropy as a cost function way back in the cost function section. Truth is, it's very easy to choose here between a rich range of loss functions and optimizers.
We can now train the model. What does this anthropomorphic term mean? Simply, it is the process of finding the appropriate weights such that the outputs we have in our training dataset correspond to their respective inputs, while minimizing our chosen error metric. Again, this is one line of Keras code:
# batch_size and num_epochs are assumed to have been defined beforehand
history0 = model0.fit(x_train, y_train,
                      batch_size=batch_size,
                      epochs=num_epochs,
                      verbose=0,
                      validation_data=(x_test, y_test))
If you want to see how the model is performing during training, set verbose=1 to see metrics for each epoch.
Let's evaluate the model:
score0 = model0.evaluate(x_test, y_test, verbose=0)
print('Test loss: {0}, Test accuracy: {1}'.format(score0[0], score0[1]))
Test loss: 1.1920928955078125e-07, Test accuracy: 1.0
Yes!!! We have trained a feedforward artificial neural network on our chicken-or-beef puzzle and achieved 100% accuracy!
This should not be too surprising; I already admitted the function itself is quite simple. Still, let's exploit this simplicity by examining how the algorithm chose the weights of the model:
[Figure: evolution of the weights W1, W2 and b over the training epochs]
These graphs show the evolution of the weights, w1, w2 and b, from our original equation for this problem, over training epochs, for the two nodes (0 and 1). The weights are roughly equal but opposite. Each starts off comparatively small but gets larger during training. Note that, because we seeded the initial values of the weights randomly, these results will almost surely be different each time this is run.
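The run-to-run variation comes purely from the random starting point. As a small sketch, assuming a NumPy random generator standing in for the initializer Keras actually uses, fixing the seed makes the starting weights repeatable:

```python
import numpy as np

def initial_weights(seed=None):
    """Draw a 2x2 matrix of small random starting weights."""
    rng = np.random.default_rng(seed)
    return rng.normal(scale=0.1, size=(2, 2))

# Same seed, same starting weights
same = np.array_equal(initial_weights(seed=1), initial_weights(seed=1))
print(same)

# No seed: a fresh draw each time, so two calls will almost surely differ
print(initial_weights())
print(initial_weights())
```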
In this article, we looked at how to construct a very simple artificial neural network to model our chicken-or-beef meal decision. However, this is overly simplistic for most 'real-world' problems. In Part III, the final part of this article, we'll look at how to extend this basic network, and how to deal with that ever-present bugbear in machine learning: overfitting.
[1] Note that here we have adopted the gradient operator, ∇ ('nabla,' meaning 'harp'); although we have written the above as a function of a single variable ∇f(x) := ∂f(x)/∂x, ∇ is the derivative across all variables in the space.
[2] Of course, this is the truth table for the XOR function.
[3] This replication is for a frankly embarrassing reason. For the initial architecture, we heavily restrict the number of input neurons. In order to achieve acceptable accuracy for this model, we'd need a lot more data than we strictly have. A better solution is to slightly increase the width of the model, which we do later.
[4] I have decided to keep strictly within Keras here; scikit-learn has a handy built-in function for this: train_test_split()
Written by Ra Inta