BackPropagation Networks: Answers

Copyright © Devin McAuley, 1997.

Exercise 1: Output activations before training

Name Jets Sharks Classification
Robin 0.49 0.43 Jets
Margaret 0.49 0.42 Jets
Bill 0.48 0.44 Jets
Janet 0.49 0.43 Jets
Mike 0.49 0.43 Jets
Alfred 0.48 0.45 Jets
Joan 0.49 0.42 Jets
Gerry 0.48 0.45 Jets
Catherine 0.48 0.44 Jets
Brett 0.49 0.43 Jets
John 0.49 0.42 Jets
Sandra 0.49 0.42 Jets
Joshua 0.48 0.43 Jets
Beth 0.49 0.44 Jets
Bert 0.49 0.44 Jets
Maria 0.49 0.44 Jets

Exercise 2: For the simulations used to generate the answers to these exercises, the network had a slight Jets bias (see the table above). Depending on the initial weights, your network may instead be biased toward classifying gang members as Sharks.

Exercise 3: After 40 epochs of learning, the total summed squared error in our simulation was 4.33.

Exercise 4: Output activations after 40 epochs of learning. At this stage, the network can correctly classify all of the gang members.

Name Jets Sharks Classification
Robin 0.66 0.36 Jets
Margaret 0.32 0.66 Sharks
Bill 0.62 0.39 Jets
Janet 0.36 0.62 Sharks
Mike 0.66 0.36 Jets
Alfred 0.32 0.66 Sharks
Joan 0.62 0.39 Jets
Gerry 0.36 0.63 Sharks
Catherine 0.63 0.39 Jets
Brett 0.36 0.63 Sharks
John 0.66 0.35 Jets
Sandra 0.32 0.66 Sharks
Joshua 0.60 0.40 Jets
Beth 0.38 0.61 Sharks
Bert 0.64 0.37 Jets
Maria 0.32 0.66 Sharks

Exercise 5: The total error on the training set was reduced to 0.4 after 120 epochs. This will vary depending on the initial weights of the network. For the simulation reported here, learning reduced the total error gradually at first, then more rapidly, before the error flattened out over the last 40 epochs.

Exercise 6: With initial weights of -0.5 and 0.5, and a threshold of 1.5, the perceptron solves the AND problem in six steps.

Input1 Input2 Target Output Weight1 Weight2 Threshold
1 1 1 0 -0.5 0.5 1.5
1 0 0 0 0.5 1.5 0.5
0 1 0 1 0.5 1.5 0.5
0 0 0 0 0.5 0.5 1.5
1 1 1 0 0.5 0.5 1.5
1 0 0 1 1.5 1.5 0.5
0 1 0 0 0.5 1.5 1.5
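
The table can be reproduced with a few lines of Python. This is a minimal sketch (the helper name is ours): it assumes a learning rate of 1.0, an output rule that fires only when the net input strictly exceeds the threshold, and the convention that the threshold is adjusted in the direction opposite to the weights, all inferred from the numbers in the table.

    # Sketch of the perceptron updates for AND (assumed learning rate of 1.0).
    def trace_and():
        w1, w2, theta = -0.5, 0.5, 1.5                      # initial weights and threshold
        patterns = [(1, 1, 1), (1, 0, 0), (0, 1, 0), (0, 0, 0),
                    (1, 1, 1), (1, 0, 0), (0, 1, 0)]        # (input1, input2, target), table order
        for x1, x2, target in patterns:
            output = 1 if x1 * w1 + x2 * w2 > theta else 0  # fire only above threshold
            print(x1, x2, target, output, w1, w2, theta)    # values in effect before the update
            delta = target - output                         # +1, -1, or 0 (no change)
            w1 += delta * x1
            w2 += delta * x2
            theta -= delta                                  # threshold moves opposite to the weights

    trace_and()

Each printed row matches a row of the table; after the sixth presentation the weights (0.5, 1.5) and threshold 1.5 classify all four AND patterns correctly, and the seventh presentation produces no further change.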

Exercise 7: Assuming a threshold of 0.0, w1 = 1.0, w2 = 1.0, and w3 = -2.0 is a solution to the 3D version of XOR. Since the third input is the AND of the first two, w3 turns the output off when input1 and input2 are both 1.
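
This is easy to verify by hand or with a few lines of Python, assuming the unit fires only when its net input strictly exceeds the threshold:

    # Check that w1 = 1.0, w2 = 1.0, w3 = -2.0 with a threshold of 0.0 computes XOR
    # when the third input is the AND of the first two.
    w1, w2, w3, theta = 1.0, 1.0, -2.0, 0.0
    for x1 in (0, 1):
        for x2 in (0, 1):
            x3 = x1 and x2                                   # third input is AND(input1, input2)
            net = x1 * w1 + x2 * w2 + x3 * w3
            print(x1, x2, "->", 1 if net > theta else 0)     # prints 0, 1, 1, 0 (XOR)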

Exercise 8: When the net input and bias are zero, the output of the sigmoid function is 0.5.
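
Assuming the standard logistic form of the sigmoid, sigmoid(net) = 1 / (1 + e^(-net)), so sigmoid(0) = 1 / (1 + 1) = 0.5. A one-line check:

    import math

    def sigmoid(net):
        return 1.0 / (1.0 + math.exp(-net))

    print(sigmoid(0.0))   # 0.5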

Exercise 9:

Exercise 10: The 1:1:1 Network with global values attached to the weights to monitor their values.

Exercise 11: The 1:1:1 Network with the training set and a global value to monitor the total error.

Exercises 12-14: Table of results for the 1:1:1 Network (local and global minima).

Simulation Minimum Error w1 w2 bias1 bias2
1. Global 0.01 4.75 5.9 -2.31 -2.89
2. Global 0.01 -4.75 -5.95 1.90 2.65
3. Local 0.46 2.99 0.73 0.0 0.0
4. Global 0.27 -5.07 -4.47 0.0 0.0

In simulation 1, the connections between the units are positive and the biases are negative. When the input is 0, the hidden unit will turn off because of the large negative bias. The output unit will also then turn off because it doesn't receive any input and its bias is also negative. When the input is 1, the strong positive connections will turn both the hidden unit and output unit on.

In simulation 2, the connections between the units are negative and the biases are positive. When the input is 0, the hidden unit will turn on because its large positive bias drives it on even in the absence of input. Since the hidden unit has a strong negative connection to the output unit, it will turn the output unit off. When the input is 1, the strong negative connection from the input to the hidden unit will turn the hidden unit off. With no input arriving from the hidden unit, the positive bias on the output unit will turn it on.

In simulations 3 and 4, the solutions are similar to simulations 1 and 2. However, because the biases are fixed at zero, the network in both cases is unable to find an errorless solution. Simulation 3 settled into a local minimum, while simulation 4 found the global minimum for this restricted network.
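
Plugging the simulation 1 values into the forward pass confirms the description above. This is a minimal sketch, assuming (as the text describes) that the task is to copy the input to the output:

    # Forward pass of the 1:1:1 network with the simulation 1 values from the table.
    import math

    def sigmoid(net):
        return 1.0 / (1.0 + math.exp(-net))

    w1, w2, bias1, bias2 = 4.75, 5.9, -2.31, -2.89
    for x in (0, 1):
        hidden = sigmoid(x * w1 + bias1)
        output = sigmoid(hidden * w2 + bias2)
        print(x, round(hidden, 2), round(output, 2))
    # input 0 -> hidden ~0.09, output ~0.09 (both off)
    # input 1 -> hidden ~0.92, output ~0.93 (both on)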

Exercise 15: For very small weights and biases, the net input to all of the units will be close to zero (independent of the inputs). Recall that at a net input of zero, the output of the sigmoid function is 0.5. Thus, with small weights and biases, all of the unit activations will be approximately 0.5.

Exercise 16: The error surface for the XOR problem is very flat, so it takes a long while before the total error reduces appreciably. Here are the results from one simulation. After 100 epochs, the total error was 1.022 (down from an initial error of about 1.025). After 200 and 300 epochs, the total error was 1.020 and 1.01 respectively (almost no change). However, between 300 and 400 epochs of learning, the total error plunged rapidly. By the 400th epoch, the total error was 0.697, and by the 500th epoch it was 0.074. The total error dropped below 0.05 after 521 epochs.
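
Readers who want to watch this behaviour can try a minimal sketch of training a 2-2-1 sigmoid network on XOR with plain batch gradient descent. The learning rate (0.5), the range of the random initial weights, and the random seed are assumptions, so the epoch counts will differ from the run reported above, and some initial weight sets will get stuck in a local minimum instead:

    # Minimal batch gradient descent on XOR with a 2-2-1 sigmoid network.
    import numpy as np

    rng = np.random.default_rng(0)
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)           # inputs
    T = np.array([[0], [1], [1], [0]], dtype=float)                       # XOR targets

    W1 = rng.uniform(-0.5, 0.5, (2, 2)); b1 = rng.uniform(-0.5, 0.5, 2)   # input -> hidden
    W2 = rng.uniform(-0.5, 0.5, (2, 1)); b2 = rng.uniform(-0.5, 0.5, 1)   # hidden -> output

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    lr = 0.5
    for epoch in range(1, 10001):
        H = sigmoid(X @ W1 + b1)                 # hidden activations
        Y = sigmoid(H @ W2 + b2)                 # output activations
        error = np.sum((T - Y) ** 2)             # total summed squared error
        dY = (Y - T) * Y * (1 - Y)               # output deltas
        dH = (dY @ W2.T) * H * (1 - H)           # backpropagated hidden deltas
        W2 -= lr * (H.T @ dY); b2 -= lr * dY.sum(axis=0)
        W1 -= lr * (X.T @ dH); b1 -= lr * dH.sum(axis=0)
        if error < 0.05:
            break
    print("total error", round(error, 3), "after", epoch, "epochs")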

Exercise 17:

The above table reports the hidden unit activations after 521 epochs of training. In this simulation, the network has mapped both the 01 and 10 input patterns to the same point in hidden-unit space, so that the mapping from hidden units to outputs is linearly separable (see graphical illustration below).

Exercise 18: Learning is much faster with a learning rate of 1.0. However, if the learning rate is too large, the weight updates may oscillate. If the global minimum lies in a very narrow ravine, large steps in weight space may miss the ravine entirely and send the network into a local minimum.
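
The oscillation can be seen even in one dimension. Here is a small illustrative sketch (not taken from the exercises) of plain gradient descent on the error surface E(w) = w^2, whose gradient is 2w:

    # Gradient descent on E(w) = w**2 with a small and a large learning rate.
    def descend(lr, w=1.0, steps=5):
        trace = [w]
        for _ in range(steps):
            w = w - lr * 2 * w          # step against the gradient 2w
            trace.append(round(w, 3))
        return trace

    print(descend(lr=0.1))   # [1.0, 0.8, 0.64, 0.512, 0.41, 0.328]    steady approach to the minimum
    print(descend(lr=0.9))   # [1.0, -0.8, 0.64, -0.512, 0.41, -0.328] overshoots the minimum each step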

Exercise 19: The BackProp algorithm is very slow to learn a solution to the XOR problem without momentum, and is more prone to getting stuck in a local minimum. Two features that would explain this behaviour are a long flat error surface and numerous "craters" along the way. Momentum drastically speeds learning along long flat surfaces, and can help the network roll in and out of craters.
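
A sketch of the momentum update makes the first point concrete. Assuming the standard form delta_w(t) = -lr * gradient + momentum * delta_w(t-1) (the helper name and parameter values below are ours), repeated small but consistent gradients, as on a long flat stretch of the error surface, build the step size up toward 1/(1 - momentum) times a single plain gradient step:

    # Momentum weight update (illustrative helper; lr and momentum values are assumptions).
    def momentum_step(w, gradient, prev_delta, lr=0.5, momentum=0.9):
        delta = -lr * gradient + momentum * prev_delta
        return w + delta, delta

    w, delta = 0.0, 0.0
    for _ in range(20):                       # 20 steps over a constant small gradient
        w, delta = momentum_step(w, gradient=0.01, prev_delta=delta)
    print(round(delta, 4))                    # about -0.044, approaching -lr*g/(1-momentum) = -0.05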

Exercise 21: Response of the trained network to novel inputs: George, Linda, Bob, and Michelle.

Name Age Education Marital Status Occupation Jets Unit Sharks Unit Predicted Gang
George 20's College Single Pusher 0.979 0.018 Strong Jet
Linda 40's J.H. Married Bookie 0.027 0.973 Strong Shark
Bob 30's H.S. Divorced Burglar 0.564 0.405 Weak Jet
Michelle 40's College Married Pusher 0.478 0.521 Weak Shark

Exercise 22: Response of the trained network to each characteristic.

Characteristic Jets Unit Sharks Unit Predicted Gang
20's 0.879 0.128 Strong Jet
30's 0.566 0.427 Weak Jet
40's 0.169 0.833 Strong Shark
Junior High 0.173 0.827 Strong Shark
High School 0.583 0.412 Weak Jet
College 0.885 0.115 Strong Jet
Single 0.887 0.113 Strong Jet
Married 0.176 0.822 Strong Shark
Divorced 0.595 0.394 Weak Jet
Pusher 0.879 0.118 Strong Jet
Bookie 0.169 0.832 Strong Shark
Burglar 0.568 0.419 Weak Jet

Exercise 23: The network predicts that George is a Jet because all of George's characteristics are strong Jets predictors. Similarly, Linda is predicted to be a Shark because all of her characteristics are strongly predictive of Sharks gang members. George and Linda fit the stereotypic profiles of Jets and Sharks gang members respectively. If we later learn that George is a Shark (counter to the model's predictions), we could then train the network to incorporate this extra piece of information into its knowledge base, although learning that George is a Shark will be hard for the model to do. With luck, George is simply an exception to the Sharks stereotype. But if there turn out to be many more Sharks like George, we shouldn't place too much faith in our model.

Exercise 24: The network makes weak predictions for both Bob and Michelle, but for different reasons. If we take a close look at Bob's characteristics (30's, High School, Divorced, and Burglar), we notice that they are all weak Jets predictors. That is, unlike George, none of Bob's characteristics is stereotypical; Bob does not fit either the Jets or the Sharks profile. The network weakly predicts Jets only because of its initial Jets bias; your network may weakly predict that Bob is a Shark, depending on its initial weights. Michelle, on the other hand, sends mixed signals. Two of Michelle's characteristics (40's and Married) are strong Sharks predictors, whereas the other two (College and Pusher) are strong Jets predictors. In the network, the conflicting predictions compete, and the Sharks category is a slim winner. Again, how the conflict is resolved may vary for your network depending on the initial weights, so the prediction for Michelle may be either Weak Jet or Weak Shark.
