The Hebbian Network: The Distributed Representation of Facts

Janet Wiles, Simon Dennis and Rachael Gibson

Introduction

In this chapter, we introduce two of the fundamental concepts in the field of connectionism: distributed representations and learning. In the networks we have examined thus far, each unit has represented a different concept. For instance, in the Jets and Sharks example, a unit could represent a name, an age category, a gang and so on. This style of coding is called a local representation. The alternative is to have entire patterns of activation represent each concept. For example, Art might be represented by the pattern (100100111) while Ralph is represented by the pattern (110010110). Each component has no meaning on its own; it is the pattern that represents the concept. This style of coding is called a distributed representation. In this chapter, we will discuss some of the advantages and disadvantages of distributed encodings within artificial neural networks.

Another property of the networks we have examined so far is that the weights have been hand-crafted to implement the appropriate function. In the Jets and Sharks network, for instance, interpool connections were positive and intrapool connections were negative by design. However, this process is at best tedious and at worst completely impractical for large, poorly defined problems. In order for neural networks to be useful, it is necessary to devise learning algorithms that select the weights automatically. In this chapter, we outline the Hebbian learning rule and investigate some of its properties.

Distributed Representations

Local and distributed coding schemes tend to have complementary strengths and weaknesses. In this section, we will begin by looking at the advantages of local codes and then examine the advantages of distributed codes.

Network Understanding: Local codes make a network easier to understand. Because each node corresponds to a single concept, one can interpret the activation of each node as a "degree of belief" in that concept or microhypothesis. In contrast, distributed codes are typically more difficult to understand and often require the application of additional techniques to help with the reading-off process.

Variable Binding: A common requirement of many cognitive processes is the ability to bind a variable to a value. In mental arithmetic, we might be told that x=3 and then later be asked the value of x*2. Clearly, to complete this task successfully one must retrieve the fact that x was bound to 3. The same sort of operation is a ubiquitous requirement of language processing. When we comprehend the sentence "John loves Mary" we must bind John to the role of lover and Mary to the role of lovee.

Local codes provide a straightforward mechanism for variable binding. One simply creates a connection between the node representing the variable and the node representing the value. When one wishes to access the value of the variable, one locates the variable and follows the connection to the value. Distributed codes make variable binding somewhat more difficult. Because both the variable and the value are patterns over the entire set of nodes, the patterns must be chosen carefully so that they can be simultaneously active without interfering with each other. Furthermore, it is not as obvious how one can create connections that will allow one pattern of activity to be retrieved when another pattern of activity becomes active. The exercises within this chapter will describe some possible mechanisms.

Representing Constituent Structure: A corollary to the previous observations about the representation of variable bindings is that local codes provide a more straightforward mechanism for representing constituent structure. For instance, if a system is to represent the fact that a leg is part of a body, the concept body must be bound to the concept leg with a "part-of" connection. While this is easy with a local scheme, it requires a more sophisticated mechanism when dealing with distributed representations.

Similarity and Generalization: While local codes have a number of advantages in the representation of structured data, distributed codes have several properties that are suggestive of human cognitive function. In particular, distributed codes for different concepts can be more or less similar to each other. As a consequence, distributed codes support automatic generalization. Suppose, for instance, we employ distributed codes to represent animals. We might expect that chimpanzees and apes would correspond to similar patterns, while the pattern representing mice might be somewhat different and that representing rocks would be in a different part of the space entirely (example from Hinton, McClelland & Rumelhart 1986). Now suppose we learn a new piece of information about chimpanzees, perhaps that they like onions. By virtue of the similarity of the distributed patterns representing apes and chimpanzees, the act of learning this information about chimpanzees will also code it for apes. That is, because chimpanzees are similar to apes, the system will assume that apes also like onions. Later information may override this new learning, indicating that in fact apes do not like onions, but in the first instance the system will assume that the inference is sound. In a system employing local codes, however, this inference would not be a natural consequence of the representational scheme: some subsystem would have to independently establish a connection between apes and onions.
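This generalization effect can be sketched with a toy example. The patterns below are hypothetical codes invented for illustration; the point is only that a fact learned from one pattern automatically transfers to similar patterns.

```python
# Hypothetical distributed codes; the overlap between CHIMP and APE
# carries the shared information.
chimp = [1, 1, 1, -1, 1]
ape   = [1, 1, 1, -1, -1]   # differs from CHIMP in one component
mouse = [-1, 1, -1, 1, -1]  # much less similar

def dot(v, w):
    return sum(vi * wi for vi, wi in zip(v, w))

# Hebbian learning of "likes onions" (output activation 1) from the CHIMP
# pattern: each weight change is input activation times output activation.
weights = [a * 1 for a in chimp]

print(dot(chimp, weights))  # 5: strong match
print(dot(ape, weights))    # 3: the new fact generalizes to APE
print(dot(mouse, weights))  # -3: little transfer to MOUSE
```

The match values fall off with similarity to the trained pattern, which is exactly the automatic generalization described above.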

Type/Instance Hierarchies: A related point is that distributed representations are useful for storing type/instance hierarchies. If we suppose that the part of the ape and chimpanzee patterns that overlaps corresponds to type information (all these things are primates), then the part of the pattern that differentiates apes and chimpanzees is the instance information. That is, the one distributed pattern contains both the type and the instance. Now when we learn something about an instance we are also learning about the type and, vice versa, if we learn something about a type we are automatically learning about the instances within that type.

Creation of New Concepts: The final advantage of distributed representations relates to the formation of new concepts. Within a local coding scheme the formation of a new concept implies the creation of a new unit, that is, the creation of an entirely new piece of hardware. At least on the face of it, this does not seem to be the way the neurophysiology works. A more subtle, but perhaps more profound, point is that the need to create new units means that one must identify when one is dealing with a new concept and when one is dealing with a variation of an already existing concept. In a distributed scheme patterns can be more or less similar, and there is no need to define where one concept stops and the next begins.

In the exercises in the remainder of the chapter you will be working with distributed patterns and exploring some of their properties. First, however, we need to introduce the concept of learning.

Hebbian Learning

The simplest form of weight selection mechanism is known as Hebbian learning. This form of learning is a mathematical abstraction of the principle of synaptic modulation first articulated by Hebb (1949). Hebb's law says that if one neuron repeatedly stimulates another while the receiving neuron is firing, the connection between the two cells is strengthened. Mathematically this is expressed as:
Δwij = ai aj

That is, the change in a weight is equal to the product of the activations of the units which it connects. So if both units are on (ai = aj = 1) or both units are off (ai = aj = -1) the strength of the weight is increased; otherwise it is decreased. Note that the mathematical expression goes beyond Hebb's law in that it allows a weight to increase when both units are off and to decrease when the activations of its units are 1 and -1. These properties are probably not physiologically justified and can be modified; however, the system is easiest to understand if the simple Hebbian rule is applied.
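The rule above can be sketched in a few lines of Python. This is a minimal illustration assuming activations of -1 and 1 and a single output unit; the function name and array shapes are choices made for this example.

```python
# Hebbian update: the change in each weight is the product of the
# activations of the two units it connects.
def hebbian_update(weights, pre, post):
    # weights[i][j] connects input unit j to output unit i
    return [[w_ij + post[i] * pre[j] for j, w_ij in enumerate(row)]
            for i, row in enumerate(weights)]

pre = [1, -1]           # input activations
post = [1]              # output activation
weights = [[0.0, 0.0]]  # one output unit, two input units

weights = hebbian_update(weights, pre, post)
print(weights)  # [[1.0, -1.0]]: both units on -> increase; 1 and -1 -> decrease
```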

In this exposition, we have described the learning rule in terms of the interactions of individual units. More generally, however, Hebbian learning can be expressed in the language of vector, matrix and tensor algebra. Thinking of learning in these terms allows us to take advantage of a long mathematical tradition and to use what has already been learned. For that reason, we will now take a quick refresher course on vector, matrix and tensor algebra and describe how these algebraic systems correspond to neural network architectures.

Tensors Explained

Tensors are convenient ways of collecting together numbers. For instance, a vector, which is also known as a rank one tensor, could be used to describe the age, gender and salary of an employee. If Jeff is 50 years old, is male (where male = 1 and female = 0) and earns $56000 per annum, then we could describe Jeff with the vector [50, 1, 56000] (see figure 1). Note that vectors (and tensors in general) are ordered. The vector [56000, 1, 50] would describe someone who was 56000 years old and made a total of $50 per annum!





Figure 1: Examples of vectors. (a) a row vector describing Jeff, (b) a column vector, (c) a vector with N components.

The rank one tensor described above has a dimension of three because it contains three components. There is no reason that vectors need be restricted to three dimensions, however. We could have added shoe size, for instance, to increase the dimension to four. Similarly, there is no reason that we need to restrict ourselves to a single row of numbers. A tensor with N rows and M columns is known as an NxM matrix and has a rank of two, indicating that the array of numbers extends in two directions (see figure 2).

Figure 2: Examples of matrices. (a) a 2x2 matrix, (b) a 3x2 matrix, (c) an NxN matrix.

The process of extending the number of directions in which the array extends can theoretically continue indefinitely, creating tensors of rank three, four, five and so on. In the following sections, we will look at vectors, matrices and tensors of rank three (see figure 3), as they are critical to understanding how associations of distributed representations can be stored. Models such as the STAR model of analogical reasoning (Halford, Wiles, Humphreys and Wilson 1992) employ tensors of higher rank.









Figure 3: A rank three tensor (NxNxN).

Vectors - Rank One Tensors

Tensors, and in particular vectors, can be represented in many different forms including:
  1. Cartesian form in which the components are enumerated explicitly. Figure 1 depicts vectors represented in Cartesian form.
  2. Geometric form in which the vector is plotted in N dimensional space. For instance, figure 4 shows the vector representing Jeff plotted in three dimensional space.
    Figure 4: The vector representing Jeff plotted in three dimensional space (Geometric form).
  3. Algebraic form in which a vector is represented as a bolded lower case letter (e.g. v). Algebraic form is a particularly concise form of representation, which makes it easy to talk about the operations that can be performed on vectors such as addition (e.g. w = v + t).
  4. Neural network form, in which a neural network architecture is diagrammed such that either a set of units or a set of weights contains the elements of the vector. For instance, a vector can be mapped to a two-layer network (one input and one output layer) as depicted in Figure 5. The number of units in the input layer corresponds to the number of dimensions in the original vector, while the output layer contains only one unit. Each input unit is connected to the output unit. The input units represent one vector and the weights represent a second vector.

    Figure 5: The network corresponding to a vector memory.

    The output of this network is defined to be the dot product (or inner product) of the input and weight vectors. A dot product is calculated by multiplying together the values in the same position within the two vectors, and then adding the results of these multiplications together to get a scalar (see Figure 6a). In the case of the neural network, this involves multiplying each input unit activation by the corresponding weight value and then adding. The dot product of two vectors represents the level of similarity between them, and the operation can be extended to higher rank tensors (see figure 6b).

    Figure 6: The Dot Product.

    The dot product is expressed algebraically as a dot; that is, the dot product of the vectors v and w is written v.w. If the dot product of two vectors is zero (v.w = 0) then the vectors are orthogonal. If the dot product of a vector and itself is 1 (v.v = 1) then the vector is normal, that is, it has unit length. More generally, the dot product of a vector and itself is the square of the length of the vector:

    x.x = |x|²
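These definitions can be checked with a short sketch; the example vectors here are, of course, arbitrary.

```python
import math

# Dot product: multiply matching components, then sum.
def dot(v, w):
    return sum(vi * wi for vi, wi in zip(v, w))

v = [1, 0, 0]
w = [0, 1, 0]
x = [3.0, 4.0]

print(dot(v, w))             # 0: v and w are orthogonal
print(dot(v, v))             # 1: v is normal (unit length)
print(dot(x, x))             # 25.0: the length of x squared
print(math.sqrt(dot(x, x)))  # 5.0: |x| itself
```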

    Learning occurs in this network by adding each input vector into the weight vector. Vector addition superimposes vectors of the same dimension. It is calculated by adding together the elements in a particular position in each vector (see Figure 7a). In this way, multiple memories can be stored within the same weight vector.

    Figure 7: (a) Vector Addition, (b) Matrix Addition.

    Again vector addition can be extended to tensors of arbitrary rank (see figure 7b). Vector addition is expressed algebraically as a plus sign (+). So if we wanted to talk about the dot product of v with the addition of w and x we would write v.(w + x). Another useful property to keep in mind is that the dot product distributes over addition. That is:

    v.(w + x) = v.w + v.x

    Suppose we have three items represented by the vectors v, w, and x. We can form a memory by adding these:

    m = v + w + x

    To determine if a given item has been stored, we calculate the dot product of the item with the memory vector. Assuming that v, w and x are orthonormal (each has unit length and their pairwise dot products are zero):

    w.m = w.(v + w + x)

    = w.v + w.w + w.x

    = 0 + 1 + 0

    = 1

    Alternately, if w had not been stored the dot product would be 0.
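The storage and retrieval steps above can be sketched in Python. The particular vectors are illustrative choices that happen to be orthonormal, including an extra vector y standing in for an item that was never stored.

```python
# Vector memory: store items by addition, probe by dot product.
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def add(a, b):
    return [ai + bi for ai, bi in zip(a, b)]

# Four orthonormal vectors; y is a probe item that is never stored.
v, w, x, y = [1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]

m = add(add(v, w), x)  # m = v + w + x

print(dot(w, m))  # 1: w was stored
print(dot(y, m))  # 0: y was not stored
```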

In the following exercises, you will build a vector network that learns to discriminate between stored items and new items (see figure 8).

Figure 8: Build your vector memory network here.

Follow the instructions below to create the network and then work through the exercises.
  1. Set up the vector memory network:
    • Create two input units and one output unit.
    • Connect both input units to the output unit.
    • Set up VALUE objects for the units and weights.
  2. Create the data sets:
    • Create an Input Set containing the two input units
    • Create a Test Set containing the two input units
    • Create an Output Set containing the output unit
    • The items in this exercise are FROG [0.95, 0.32], TOAD [0.49, 0.87], KOALA [0.32, -0.95]. Add the FROG pattern to the input set.
    • Add a pattern containing a 1 to the output set.
    • Add all three items, FROG, TOAD and KOALA, to the test set.
  3. Add the buttons:
    • Create a Feedforward button (which cycles activation forward from inputs to outputs).
    • Set the input and output sets of the Feedforward button
    • Create a HebbianLearn button (which implements training)
    • Set the input and output sets of the HebbianLearn button
    • Create a ZeroWeights button.

Exercise 1: Train the network for one epoch and record the weights in the TRAIN FROG row of the following table. How have the weights changed?

Weight 1 Weight 2
TRAIN FROG
TRAIN FROG & KOALA
TRAIN FROG & TOAD

Exercise 2: Test each of the items, FROG, TOAD and KOALA, and record the match values (the activation of the output unit) in the second table. Explain the match values.

TEST FROG TEST TOAD TEST KOALA
TRAIN FROG
TRAIN FROG & KOALA
TRAIN FROG & TOAD

Exercise 3: Train the network for one more epoch and test again. What happens to the match values after a second training trial? Why?

Exercise 4: Add KOALA to the input set and an output value of 1 to the output set. Zero the weights and retrain the network on the updated input set. Test the network as before, recording the values in the table in the TRAIN FROG & KOALA row. Explain the new values.

Exercise 5: Delete KOALA from the input set and add TOAD. Zero the weights, retrain and test as above, recording the values in the TRAIN FROG & TOAD row. You should have six weight values and nine match values for each training trial. Draw a graph of the match values after the first training trial: plot three lines, one for each test item, against the three training conditions. Explain the shape of each line on the graph.

Exercise 6: For each of the three training conditions (FROG alone, FROG & KOALA, FROG & TOAD):

  1. Draw the geometric (graphical) representation of the weights,
  2. Provide the algebraic representation of the weights.

[Next Section:] Matrices - Rank Two Tensors