The Self-organizing Map

information from multidimensional primary signals, and to represent it as a location, say, in a two-dimensional network. Although this is already a step towards generalization and symbolism, it must be admitted that the extraction of features from geometrically or physically relatable data elements is still a very concrete task, in principle at least. The operation of the brain at the higher levels relies heavily on abstract concepts, symbolism, and language. It is an old notion that the deepest semantic elements of any language should also be physiologically represented in the neural realms. There is now new physiological evidence for linguistic units being locatable in the human brain [6], [15]. In attempting to devise Neural Network models for linguistic representations, the first difficulty is encountered when trying to find metric distance relations between symbolic items. Unlike primary sensory signal patterns, for which similarity is easily derivable from mutual distances in the vector spaces in which they are represented, encodings of symbols cannot in general be assumed to have any relationship with the observable characteristics of the corresponding items. How could it then be possible to represent the "logical similarity" of pairs of items, and to map such items topographically? The answer lies in the fact that the symbol, during the learning process, is presented in context, i.e., in conjunction with the encodings of a set of other concurrent items. In linguistic representations, context might mean a few adjacent words. Similarity between items would then be reflected through the similarity of their contexts. Note that for ordered sets of arbitrary encodings, invariant similarity can be expressed, e.g., in terms of the number of items they have in common.
On the other hand, it may be evident that the meaning (semantics) of a symbolic encoding is only derivable from the conditional probabilities of its occurrences with other encodings, independent of the type of encoding [68]. However, in the learning process, the literal encodings of the symbols must be memorized, too. Let the vector x_s represent the symbolic expression of an item, and x_c the representation of its context. The simplest neural model then assumes that x_s and x_c are connected to the same neural units, i.e., the representation (pattern) vector x of the item is formed as a concatenation of x_s and x_c:

x = [x_s^T, x_c^T]^T.

In other words, the symbol part and the context part form a vectorial sum of two orthogonal components. The core idea underlying symbol maps is that the two parts are weighted properly, such that the norm of the context part predominates over that of the symbol part during the self-organizing process; the topographical mapping is then determined primarily by the context part.

PROCEEDINGS OF THE IEEE, VOL. 78, NO. 9, SEPTEMBER 1990
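The weighting of the two parts can be sketched as follows. This is an illustrative construction, not code from the paper; the dimensionalities, the scale factor `eps`, and the random codes are all arbitrary choices made for the example.

```python
import numpy as np

# Sketch: forming the input vector for a "symbol map" as a weighted
# concatenation of a symbol part x_s and a context part x_c.  Scaling
# the symbol part by eps < 1 makes the context part dominate the
# Euclidean metric during self-organization.
rng = np.random.default_rng(0)

def make_pattern(x_s, x_c, eps=0.2):
    """Concatenate symbol and context codes into one pattern vector."""
    return np.concatenate([eps * x_s, x_c])

# Two items with unrelated random symbol codes but nearly identical contexts:
x_s1, x_s2 = rng.normal(size=7), rng.normal(size=7)
ctx = rng.normal(size=21)
x1 = make_pattern(x_s1, ctx)
x2 = make_pattern(x_s2, ctx + 0.1 * rng.normal(size=21))

# The distance between the full patterns is governed mainly by the
# contexts, not by the unrelated symbol codes.
d_full = np.linalg.norm(x1 - x2)
d_symbols = np.linalg.norm(x_s1 - x_s2)
```

With this weighting, two items that occur in similar contexts are close in the pattern space even though their symbol codes are arbitrary, which is exactly what lets the map order them by "logical similarity."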


I. INTRODUCTION
A. On the Role of the Self-Organizing Map Among Neural Network Models

The network architectures and signal processes used to model nervous systems can roughly be divided into three categories, each based on a different philosophy. Feedforward networks [94] transform sets of input signals into sets of output signals. The desired input-output transformation is usually determined by external, supervised adjustment of the system parameters. In feedback networks [27], the input information defines the initial activity state of a feedback system, and after state transitions the asymptotic final state is identified as the outcome of the computation. In the third category, neighboring cells in a neural network compete in their activities by means of mutual lateral interactions, and develop adaptively into specific detectors of different signal patterns. In this category, learning is called competitive, unsupervised, or self-organizing. The Self-Organizing Map discussed in this paper belongs to the last category. It is a sheet-like artificial neural network, the cells of which become specifically tuned to various input signal patterns or classes of patterns through an unsupervised learning process. In the basic version, only one cell or local group of cells at a time gives the active response to the current input. The locations of the responses tend to become ordered, as if some meaningful coordinate system for different input features were being created over the network. The spatial location or coordinates of a cell in the network then correspond to a particular domain of input signal patterns. Each cell or local cell group acts like a separate decoder for the same input. It is thus the presence or absence of an active response at that location, and not so much the exact input-output signal transformation or magnitude of the response, that provides an interpretation of the input information.
The Self-Organizing Map was intended as a viable alternative to more traditional neural network architectures. It is possible to ask just how "neural" the map is. Its analytical description has already been developed further in the technical than in the biological direction. But the learning results achieved seem very natural, at least indicating that the adaptive processes at work in the map may be similar to those encountered in the brain. There may therefore be sufficient justification for calling these maps "neural networks" in the same sense as their traditional rivals.
Self-Organizing Maps, or systems consisting of several map modules, have been used for tasks similar to those to which other, more traditional neural networks have been applied: pattern recognition, robotics, process control, and even processing of semantic information. The spatial segregation of different responses and their organization into topologically related subsets result in a high degree of efficiency in typical neural network operations.
Although the largest map we have used in practical applications has contained only about 1000 cells, its learning speed, especially when using computational shortcuts, can be increased to orders of magnitude greater than that of many other neural networks. Thus much larger maps than those used so far are quite feasible, although it also seems that practical applications favor hierarchical systems made up of many smaller maps.
It may be appropriate to observe here that if the maps are used for pattern recognition, their classification accuracy can be multiplied if the cells are fine-tuned using supervised learning principles (cf. Sec. III).
Although the Self-Organizing Map principle was introduced in early 1981, no complete review has appeared in compact form, except perhaps in [44], which does not contain the latest results. I have therefore tried to collect a variety of basic material in the present paper.

B. Brain Maps

As much as a hundred years ago, a quite detailed topographical organization of the brain, and especially of the cerebral cortex, could be deduced from functional deficits and behavioral impairments induced by various kinds of lesion, or by hemorrhages, tumors, or malformations. Different regions in the brain thereby seemed to be dedicated to specific tasks. One modern systematic technique for causing controllable, reversible simulated lesions is to stimulate a particular site with small electric currents, thereby eventually inducing both excitatory and inhibitory effects and disturbing the assumed local function [75]. If such a spatially confined stimulus then disrupts a specific cognitive ability, such as the naming of objects, it gives at least some indication that this site is essential to that task.
One straightforward method for locating a response is to record the electric potential or train of neural impulses associated with it. Many detailed mappings, especially from the primary sensory and associative areas of the brain, have been made using various electrophysiological recording techniques.
Direct evidence for any localization of brain functions can also be obtained using modern imaging techniques that display the strength and spatial distribution of neural responses simultaneously over a large area, with a spatial resolution of a few millimeters. The two principal methods which use radioactive tracers are positron emission tomography (PET) [80] and autoradiography of the brain through very narrow collimators (gamma camera). PET reveals changes in oxygen uptake and phosphate metabolism. The gamma-camera method directly detects changes in cerebral blood flow. Both phenomena correlate with local neural activity, but these methods are unable to monitor rapid phenomena. In magnetoencephalography (MEG), the low magnetic field caused by electrical neural responses is detected, and by computing its sources, quite rapid neural responses can be directly analyzed, with a spatial resolution of a few millimeters. The main drawback of MEG is that only current dipoles parallel to the surface of the skull are detectable; and since the dipoles are oriented perpendicular to the cortex, only the sulci can be studied with this method. A review of experimental techniques and results relating to these studies can be found in [32].
After a large number of such observations, a fairly detailed organizational view of the brain has evolved [32]. Especially in higher animals, the various cortices in the cell mass seem to contain many kinds of "map" [33], such that a particular location of the neural response in the map often directly corresponds to a specific modality and quality of sensory signal. The field of vision is mapped "quasiconformally" onto the primary visual cortex. Some of the maps, especially those in the primary sensory areas, are ordered according to some feature dimensions of the sensory signals; for instance, in the visual areas, there are line-orientation and color maps. It thus seems as if the internal representations of information in the brain are generally organized spatially.
Although there is only partial biological evidence for this, enough data are already available to justify further theoretical studies of this principle. Artificial self-organizing maps and brain maps thus have many features in common, and, what is even more intriguing, we now fully understand the processes by which such artificial maps can be formed adaptively and completely automatically.

C. Early Work on Competitive Learning
The basic idea underlying what is called competitive learning is roughly as follows. Assume a sequence of statistical samples of a vectorial observable x = x(t) ∈ R^n, where t is the time coordinate, and a set of variable reference vectors {m_i(t): m_i ∈ R^n, i = 1, 2, ..., k}. Assume that the m_i(0) have been initialized in some proper way; random selection will often suffice. If x(t) can somehow be simultaneously compared with each m_i(t) at each successive instant of time, taken here to be an integer t = 1, 2, 3, ..., then the best-matching m_c(t) is to be updated to match the current x(t) even more closely. If the comparison is based on some distance measure d(x, m_i), altering m_c must be such that, if c is the index of the best-matching reference vector, then d(x, m_c) is decreased, while all the other reference vectors m_i, with i ≠ c, are left intact. In this way the different reference vectors tend to become specifically "tuned" to different domains of the input variable x. It will be shown below that if p is the probability density function of the samples x, then the m_i tend to be located in the input space R^n in such a way that they approximate to p in the sense of some minimal residual error.
Vector Quantization (VQ) (cf., e.g., [19], [54], [58]) is a classical method that produces an approximation to a continuous probability density function p(x) of the vectorial input variable x using a finite number of codebook vectors m_i, i = 1, 2, ..., k. Once the "codebook" is chosen, the approximation of x involves finding the reference vector m_c closest to x. One kind of optimal placement of the m_i minimizes E, the expected rth power of the reconstruction error:

E = ∫ ||x − m_c||^r p(x) dx,   (1)

where dx is the volume differential in the x space, and the index c = c(x) of the best-matching codebook vector ("winner") is a function of the input vector x:

||x − m_c|| = min_i {||x − m_i||}.   (2)

In general, no closed-form solution for the optimal placement of the m_i is possible, and iterative approximation schemes must be used.
It has been pointed out in [14], [64], and [115] that (1) defines a placement of the codebook vectors into the signal space such that their point density function is an approximation to [p(x)]^(n/(n+r)), where n is the dimensionality of x and of the m_i. We usually consider the case r = 2. In most practical applications n >> r, and then the optimal VQ can be shown to approximate p(x).
Using the square-error criterion (r = 2), it can also be shown that the following stepwise "delta rule," in the discrete-time formalism (t = 0, 1, 2, ...), defines the optimal values asymptotically. Let m_c = m_c(t) be the closest codebook vector to x = x(t) in the Euclidean metric. The steepest-descent gradient-step optimization of E in the m_i space yields the sequence

m_c(t + 1) = m_c(t) + α(t)[x(t) − m_c(t)],
m_i(t + 1) = m_i(t)   for i ≠ c,   (3)

with α(t) a suitable, monotonically decreasing sequence of scalar-valued gain coefficients, 0 < α(t) < 1. This is then the simplest analytical description of competitive learning.
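The delta rule (3) can be sketched in a few lines. This is a minimal illustration, not the paper's code; the codebook size, the input density, and the linearly decreasing gain schedule are arbitrary choices.

```python
import numpy as np

# Competitive learning "delta rule": at each step only the best-matching
# (winning) codebook vector moves toward the current sample.
rng = np.random.default_rng(1)
k, n, steps = 8, 2, 5000
m = rng.uniform(size=(k, n))              # m_i(0): random initialization

x_samples = rng.uniform(size=(steps, n))  # uniform density p(x) on the unit square
for t, x in enumerate(x_samples):
    alpha = 0.5 * (1.0 - t / steps)       # monotonically decreasing gain
    c = np.argmin(np.linalg.norm(x - m, axis=1))  # winner index, eq. (2)
    m[c] += alpha * (x - m[c])            # eq. (3): move the winner only
```

After training, the codebook vectors spread over the support of p(x), so the average distance from a sample to its nearest m_i becomes small.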
In general, if we express the dissimilarity of x and m_i in terms of a general distance function d(x, m_i), we have first to identify the "winner" m_c such that

d(x, m_c) = min_i {d(x, m_i)}.   (4)

After that, an updating rule should be used such that d decreases monotonically: the correction δm_c of m_c must be such that

d(x, m_c + δm_c) ≤ d(x, m_c).   (5)

If (1) is used for signal approximation, it often turns out to be more economical to first observe a number of training samples x(t), which are "classified" (labeled) on the basis of (2) according to the closest codebook vectors m_i, and then to perform the updating operation in a single step. The new codebook vector m_i is the average of those x(t) that were identified with codebook vector i. This algorithm, termed the k-means algorithm, is widely used in digital telecommunications engineering [58].
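The batch (k-means) variant just described can be sketched as follows; the data, the codebook size, and the number of iterations are illustrative.

```python
import numpy as np

# Batch "k-means" update: label every sample with its closest codebook
# vector, then replace each codebook vector by the mean of the samples
# assigned to it, in a single step per iteration.
rng = np.random.default_rng(2)
x = rng.normal(size=(1000, 3))
m = rng.normal(size=(5, 3))

def quantization_error(x, m):
    """Mean distance from each sample to its nearest codebook vector."""
    d = np.linalg.norm(x[:, None, :] - m[None, :, :], axis=-1)
    return d.min(axis=1).mean()

err_before = quantization_error(x, m)
for _ in range(20):                       # a few batch iterations
    labels = np.argmin(np.linalg.norm(x[:, None, :] - m[None, :, :], axis=-1), axis=1)
    for i in range(len(m)):
        if np.any(labels == i):           # leave empty cells unchanged
            m[i] = x[labels == i].mean(axis=0)
err_after = quantization_error(x, m)
```

Each iteration is guaranteed not to increase the quantization error, which is why the batch form converges quickly from a rough initialization.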
The m_i(t) in the above processes actually develop into a set of feature-sensitive detectors. Feature-sensitive cells are also known to be common in the brain. Neural modelers like Nass and Cooper [72], Perez et al. [79], and Grossberg [21] have been able to suggest how such feature-sensitive cells can emerge from simplified membrane equations of model neurons.
In the above process and its biophysical counterparts, all the cells act independently. Therefore the order in which they are assigned to the different domains of input signals is more or less haphazard, depending most strongly on the initial values m_i(0). In fact, in 1973, v.d. Malsburg [59] had already published a computer simulation in which he demonstrated local ordering of feature-sensitive cells, such that in small subsets of cells, roughly corresponding to the so-called columns of the cortex, the cells were tuned more closely than were more remote cells. Later, Amari [1] formulated and analyzed the corresponding system of differential equations, relating them to spatially continuous two-dimensional media. Such continuous layers interacted in the lateral direction; the arrangement was called a nerve field. The above studies are of great theoretical importance because they involve a self-organizing tendency. The ordering power they demonstrated was, however, still weak, as nerve-field-type equations only describe this tendency as a marginal effect. In spite of numerous attempts, no "maps" of practical importance could be produced; ordering was either restricted to a one-dimensional case, or confined to small parcelled areas of the network [60], [77], [78], [99], [100], [110], [111].
Indeed, it later transpired that the system equations have to involve much stronger, idealized self-organizing effects, and that the organizing effect has to be maximized in every possible way before useful global maps can be created. The present author, in early 1981, was experimenting with various architectures and system equations, and found a process description [34]-[36] that seemed generally to produce globally well-organized maps. Because all the other system models known at that time only yielded results that were significantly more "brittle" with respect to the selection of parameters and to success in achieving the desired results, we may skip them here and concentrate on the computationally optimized algorithm known as the Self-Organizing Map algorithm.

II. AN ALGORITHM THAT ORDERS RESPONSES SPATIALLY
Readers who are not yet familiar with Self-Organizing Maps may benefit from a quick look at Figs. 5 and 6, or Fig. 9, to find out what spatial ordering of output responses means.
The Self-Organizing Map algorithm that I shall now describe has evolved during a long series of computer experiments. The background to this research has been expounded in [44]. While the purpose of each detail of the final equations may be clear in concrete simulations, it has proved extremely difficult, in spite of numerous attempts, to express the dynamic properties of this process in mathematical theorems. Strict mathematical analysis exists only for simplified cases, and even these are too lengthy to be reviewed here: cf. [7], [8], [24], [57], [83], [86], [89], [90]. It is therefore hoped that the simulation experiments and practical applications reported below in Secs. II-C, II-E, IV, V, and VI will suffice to convince the reader of the utility of this algorithm.
It may also be necessary to emphasize again that for practical purposes we are trying to extract, or explain, the self-organizing function in its purest, most effective form, whereas in genuine biological networks this tendency may be more or less disguised by other functions. It is thus conceivable, as has been verified by numerous simulation experiments, that the two essential effects leading to spatially organized maps are: 1) spatial concentration of the network activity on the cell (or its neighborhood) that is best tuned to the present input, and 2) further sensitization or tuning of the best-matching cell and its topological neighbors to the present input.

A. Selection of the Best-Matching Cell
Consider the two-dimensional network of cells depicted in Fig. 1. Their arrangement can be hexagonal, rectangular, etc. Let (in matrix notation) x = [x_1, x_2, ..., x_n]^T ∈ R^n be the input vector that, for simplicity and computational efficiency, is assumed to be connected in parallel to all the neurons i. The simplest analytical measure for the match of x with the m_i may be the inner product x^T m_i. If, however, the self-organizing algorithm is to be used for, say, natural signal patterns relating to metric vector spaces, a better and more convenient (cf. the adaptation law below) matching criterion may be used, based on the Euclidean distances between x and the m_i. The minimum distance defines the "winner" m_c (cf. (2)). A shortcut algorithm to find m_c has been presented in [49].
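The two matching criteria can be compared in a short sketch. The random weights and input are placeholders for a trained network; the point of the example is that for unit-norm weight vectors the two criteria select the same winner, while in general they may differ.

```python
import numpy as np

# Winner selection by the two matching criteria mentioned above:
# inner product x^T m_i (maximize) vs. Euclidean distance (minimize).
rng = np.random.default_rng(3)
m = rng.normal(size=(16, 5))              # weight vectors m_i
x = rng.normal(size=5)                    # input vector

c_dot = int(np.argmax(m @ x))             # winner by inner product
c_euc = int(np.argmin(np.linalg.norm(x - m, axis=1)))  # winner by eq. (2)

# With all m_i normalized to unit length the two criteria coincide,
# since ||x - m||^2 = ||x||^2 - 2 x^T m + 1 for ||m|| = 1.
m_unit = m / np.linalg.norm(m, axis=1, keepdims=True)
c_dot_u = int(np.argmax(m_unit @ x))
c_euc_u = int(np.argmin(np.linalg.norm(x - m_unit, axis=1)))
```

This equivalence under normalization is one reason the inner-product variant (Sec. II-F) keeps its weight vectors normalized.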
Comment: Definition of the input vector x as an ordered set of signal values is only possible if the interrelation between the signals is simple. In many practical problems, such as image analysis (cf. Discussion, Sec. VII), it will generally be necessary to use some kind of preprocessing to extract a set of invariant features for the components of x.

B. Adaptation (Updating) of the Weight Vectors
It is crucial to the formation of ordered maps that the cells doing the learning are not affected independently of each other (cf. competitive learning in Sec. I-C), but as topologically related subsets, on each of which a similar kind of correction is imposed. During the process, such selected subsets will come to encompass different cells. The net corrections at each cell will thus tend to be smoothed out in the long run. An even more intriguing result of this sort of spatially correlated learning is that the weight vectors tend to attain values that are ordered along the axes of the network.
In biophysically inspired neural network models, correlated learning by spatially neighboring cells can be implemented using various kinds of lateral feedback connections and other lateral interactions. In the present process we want to enforce lateral interaction directly, in a general form, for arbitrary underlying network structures, by defining a neighborhood set N_c around cell c. At each learning step, all the cells within N_c are updated, whereas cells outside N_c are left intact. This neighborhood is centered around that cell for which the best match with input x is found:

||x − m_c|| = min_i {||x − m_i||}.   (2')

The width or radius of N_c can be time-variable; in fact, for good global ordering, it has experimentally turned out to be advantageous to let N_c be very wide in the beginning and shrink monotonically with time (Fig. 2). The explanation for this may be that a wide initial N_c, corresponding to a coarse spatial resolution in the learning process, first induces a rough global order in the m_i values, after which narrowing of N_c improves the spatial resolution of the map; the acquired global order, however, is not destroyed later on. It is even possible to end the process with N_c = {c}, that is, finally updating the best-matching unit ("winner") only, in which case the process is reduced to simple competitive learning. Before this, however, the "topological order" of the map would have to be formed. The updating process (in discrete-time notation) may read

m_i(t + 1) = m_i(t) + α(t)[x(t) − m_i(t)]   for i ∈ N_c,
m_i(t + 1) = m_i(t)   for i ∉ N_c,   (6)

where α(t) is a scalar-valued "adaptation gain," 0 < α(t) < 1. It is related to a similar gain used in stochastic approximation processes [49], [92], and as in those methods, α(t) should decrease with time. An alternative notation is to introduce a scalar "kernel" h_ci(t), whereby, above, h_ci(t) = α(t) within N_c and h_ci(t) = 0 outside N_c. On the other hand, the definition of h_ci can also be more general; a biological lateral interaction often has the shape of a "bell curve." Denoting the coordinates of cells c and i by the vectors r_c and r_i, respectively, a proper form for h_ci might be h_ci = h_0 exp(−||r_c − r_i||²/σ²), with h_0 = h_0(t) and σ = σ(t) as suitable decreasing functions of time.

C. Demonstrations of the Ordering Process

The first computer simulations presented here are intended to illustrate the effect that the weight vectors tend to approximate to the density function of the input vectors in an orderly fashion. In these examples, the input vectors were chosen to be two-dimensional for visual display purposes, and their probability density function was arbitrarily selected to be uniform over the area demarcated by the borderlines (square or triangle). Outside the frame the density was zero. The vectors x(t) were drawn from this density function independently and at random, after which they caused adaptive changes in the weight vectors m_i.

The m_i vectors appear as points in the same coordinate system as that in which the x(t) are represented; in order to indicate to which unit each m_i value belongs, the points corresponding to topologically neighboring units have been connected by lines, so that the resulting lattice visualizes the ordering of the map. As stated above, in Fig. 3 the array was two-dimensional. The results, however, are particularly interesting if the distribution and the array have different dimensionalities: Fig. 4 illustrates a case in which the distribution of x is two-dimensional, but the array is one-dimensional (a linear row of cells). The weight vectors of linear arrays tend to approximate higher-dimensional distributions by Peano curves.
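The ordering process can be demonstrated in a deliberately scaled-down sketch: a linear chain of cells with scalar inputs, so that the emerging order is easy to verify. The chain length, schedules for the gain, and the shrinking neighborhood radius are all illustrative choices, not values from the paper.

```python
import numpy as np

# One-dimensional Self-Organizing Map on scalar inputs drawn uniformly
# from [0, 1].  A wide neighborhood that shrinks over time, together
# with a decreasing gain, typically leaves the weights monotonically
# ordered along the chain (the 1-D analogue of the ordering in Fig. 3).
rng = np.random.default_rng(4)
k, steps = 20, 20000
m = rng.uniform(size=k)                   # m_i(0): random scalars

for t in range(steps):
    x = rng.uniform()
    frac = t / steps
    alpha = 0.5 * (1.0 - frac)            # decreasing adaptation gain
    radius = max(1, int((k // 2) * (1.0 - frac)))  # shrinking N_c
    c = int(np.argmin(np.abs(x - m)))     # best-matching cell, eq. (2')
    lo, hi = max(0, c - radius), min(k, c + radius + 1)
    m[lo:hi] += alpha * (x - m[lo:hi])    # eq. (6): update the whole N_c
```

The final weights also spread over most of [0, 1], illustrating the approximation of the (uniform) input density.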
A two-dimensional network representing three-dimensional "bodies" (uniform density function) is shown in Fig. 5.
In practical applications, the input vectors, and thus the weight vectors, are usually high-dimensional; e.g., in speech recognition the dimensionality n may be 15 to 100. Since no factor is present that defines a particular orientation in the output map, the map can be realized in the process in any mirror-image or point-symmetric inversion, depending mainly on the initial values m_i(0). If a particular orientation is to be favored, the easiest way to achieve this is by an asymmetric choice of the initial values m_i(0).

D. Some Practical Hints for the Application of the Algorithm
When applying the map algorithm, (2) or (2') and (6) alternate. Input x is usually a random variable with a density function p(x), from which the successive values x(t) are drawn. In real-world observations, such as speech recognition, the x(t) can simply be successive samples of the input observables in their natural order of occurrence.
The process may be started by choosing arbitrary, even random, initial values for the m_i(0), the only restriction being that they should be different.
We shall give numerical examples of efficient process parameters along with the simulation examples. It may also be helpful to emphasize the following general conditions. 1) Since learning is a stochastic process, the final statistical accuracy of the mapping depends on the number of steps, which must be reasonably large; there is no way to circumvent this requirement. A rule of thumb is that, for good statistical accuracy, the number of steps must be at least 500 times the number of network units. On the other hand, the number of components in x has no effect on the number of iteration steps, and if hardware neural computers are used, a very high input dimensionality is allowed. Typically we have used up to 100 000 steps in our simulations, but for "fast learning," e.g., in speech recognition, 10 000 steps, and sometimes even fewer, may be enough. Note that the algorithm is computationally extremely light. If only a small number of samples are available, they must be recycled for the desired number of steps.
2) For approximately the first 1000 steps, α(t) should start with a value close to unity, thereafter decreasing monotonically. An exact rule is not important: α = α(t) can be linear, exponential, or inversely proportional to t.

E. Maps of Abstract Data

Although the more practical applications of the Self-Organizing Maps lie, for example, in pattern recognition and robotics, it may be interesting to apply this principle first to abstract data vectors consisting of hypothetical attributes or characteristics. We will look at an example with implicitly defined (hierarchical) structures in the primary data, which the map algorithm is then able to reveal. Although this system is a single-level network, it can produce a hierarchical representation of the relations between the primary data.

Table 1. Input Data Matrix.
The central result of self-organization is that if the input signals have a well-defined probability density function, then the weight vectors of the cells try to imitate it, however complex its form. It is even possible to perform a kind of numerical taxonomy with this model. Because there are no restrictions on the semantic content of the input signals, they can be regarded as arbitrary attributes, with discrete or continuous values. In Table 1, 32 items, each with five hypothetical attributes, are recorded in a data matrix. (This example is completely artificial.) Each of the columns represents one item, and for later inspection the items are labeled "A" through "6", although these labels were not referred to during the learning.
The attribute values (a_1, a_2, ..., a_5) constitute the pattern vector x, which acts as the set of signal values at the inputs of a network of the type shown in Fig. 1. During training, the vectors x were selected from Table 1 at random. Sampling and adaptation were continued iteratively until the asymptotic state could be considered stationary. Such a "learned" network was then calibrated using the items from Table 1, labeling the best-matching map cells according to the different calibration items. Such a labeled map is shown in Fig. 6. It can be seen that the "images" of different items are grouped according to the similarity relations implicit in Table 1 (Fig. 7 shows the minimal spanning tree corresponding to Table 1).
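The calibration step can be sketched as follows. The tiny three-cell "map" and the two-attribute items are hypothetical stand-ins for the trained network and for Table 1; only the labeling procedure itself is the point.

```python
import numpy as np

# Calibration: present each known item once and label the best-matching
# cell with that item's name.
m = np.array([[0.0, 0.0],
              [1.0, 0.0],
              [1.0, 1.0]])               # pretend these are trained weights

items = {"A": np.array([0.1, -0.1]),
         "B": np.array([0.9, 0.1]),
         "C": np.array([1.1, 0.9])}

labels = {}
for name, x in items.items():
    c = int(np.argmin(np.linalg.norm(x - m, axis=1)))  # best-matching cell
    labels.setdefault(c, []).append(name)              # label that cell
```

A cell can collect several labels if distinct items map to it; inspecting which labels land on neighboring cells is what reveals the similarity structure of the data.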

[Table 1 (Input Data Matrix): columns = items A through Z and 1 through 6; rows = attributes a_1, ..., a_5. The numerical entries are not reproduced here.]

F. Another Variant of the Algorithm
One further remark may be necessary. It has sometimes been suggested that x be normalized before it is used in the algorithm. Normalization is not necessary in principle, but it may improve numerical accuracy, because the resulting reference vectors then tend to have the same dynamic range.
Another aspect, as mentioned above, is that it is also possible to apply a general distance measure in the matching; then, however, the matching and updating laws should be mutually compatible with respect to the same metric. For instance, if the inner-product measure of similarity were applied, the learning equations should read:

x^T(t) m_c(t) = max_i {x^T(t) m_i(t)},   (7)

m_i(t + 1) = [m_i(t) + α(t)x(t)] / ||m_i(t) + α(t)x(t)||   for i ∈ N_c,   (8)
m_i(t + 1) = m_i(t)   for i ∉ N_c.   (9)

III. LEARNING VECTOR QUANTIZATION

If the map is to be used for pattern classification, it is further possible to use supervised learning to define effective values for the codebook vectors such that they directly define near-optimal decision borders between the classes, even in the sense of classical Bayesian decision theory. These strategies and learning algorithms were introduced by the present author [38], [43], [45] and called Learning Vector Quantization (LVQ).

A. Type One Learning Vector Quantization (LVQ1)
If several codebook vectors m_i are assigned to each class, and each of them is labeled with the corresponding class symbol, the class regions in the x space are defined by a simple nearest-neighbor comparison of x with the m_i; the label of the closest m_i defines the classification of x.
To define the optimal placement of the m_i in an iterative learning process, initial values for them must first be set using any classical VQ method or the Self-Organizing Map algorithm. In both cases, the initial values roughly correspond to the overall statistical density function p(x) of the input. The next phase is to determine the labels of the codebook vectors by presenting a number of input vectors with known classification, and assigning the cells to the different classes by majority voting, according to the frequency with which each m_i is closest to the calibration vectors of a particular class.
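The LVQ1 fine-tuning phase can be sketched for two overlapping one-dimensional Gaussian classes; the class means, the initial codebook, and the gain schedule are illustrative choices, not from the paper.

```python
import numpy as np

# LVQ1 fine-tuning (labels already assigned): the nearest codebook
# vector moves toward a correctly classified sample and away from an
# incorrectly classified one.
rng = np.random.default_rng(5)
m = np.array([-2.0, 2.0])                 # one codebook vector per class
m_labels = np.array([0, 1])               # class labels of the codebook vectors

for t in range(5000):
    y = int(rng.integers(2))              # true class of the sample
    x = rng.normal(loc=(-1.0, 1.0)[y], scale=1.0)
    alpha = 0.05 * (1.0 - t / 5000)       # decreasing gain
    c = int(np.argmin(np.abs(x - m)))     # nearest codebook vector
    sign = 1.0 if m_labels[c] == y else -1.0
    m[c] += sign * alpha * (x - m[c])     # LVQ1 rule
```

For this symmetric two-class problem with equal priors, the midpoint of the two codebook vectors should settle near the Bayes border at 0.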
The learning rule may then read (with c the index of the codebook vector closest to x, cf. (2)):

m_c(t + 1) = m_c(t) + α(t)[x(t) − m_c(t)]   if x is classified correctly,   (10)
m_c(t + 1) = m_c(t) − α(t)[x(t) − m_c(t)]   if x is classified incorrectly,   (11)
m_i(t + 1) = m_i(t)   for i ≠ c.

In other words, we would perform a classical Vector Quantization of the function |p(x|C_i)P(C_i) − p(x|C_j)P(C_j)|, where C_i and C_j are the neighboring classes, p(x|C_i) is the conditional probability density function of the samples x belonging to class C_i, and P(C_i) is the a priori probability of occurrence of samples from class C_i. The difference between the density functions of the neighboring classes, by definition, drops to zero at the Bayes border, inducing a "depletion layer" of codebook vectors near that border.
After training, the m_i will have acquired values such that classification using the nearest-neighbor principle, by comparing x with the m_i, already coincides rather closely with that of the Bayes classifier. Figure 8 represents an illustrative example in which x is two-dimensional and the probability density functions of the classes overlap substantially. The decision surface defined by this classifier seems to be near-optimal, although piecewise linear, and the classification accuracy in this rather difficult example is within a fraction of a percent of that achieved with the Bayes classifier. For practical applications of the LVQ1, cf.

B. Type Two Learning Vector Quantization (LVQ2)
The previous algorithm can easily be modified to comply even better with the Bayes decision-making philosophy [43]-[45]. Assume that two codebook vectors m_i and m_j, which belong to different classes and are closest neighbors in the vector space, are initially in a wrong position. The (incorrect) discrimination surface, however, is always defined as the midplane of m_i and m_j. Let us define a symmetric window of nonzero width around the midplane, and stipulate that corrections to m_i and m_j shall only be made if x falls into the window, on the wrong side of the midplane (cf. Fig. 9). The corrections then read

m_i(t + 1) = m_i(t) − α(t)[x(t) − m_i(t)],
m_j(t + 1) = m_j(t) + α(t)[x(t) − m_j(t)],   (12)

if C_i is the nearest class, but x belongs to C_j ≠ C_i, where C_j is the next-to-nearest class ("runner-up"); furthermore, x must fall into the "window." In all the other cases, m_k(t + 1) = m_k(t). The optimal width of the window must be determined experimentally, and it depends on the number of available samples. With a relatively small number of training samples, a width of 10 to 20% of the distance between m_i and m_j seems proper.
One question concerns the practical definition of the "window". If we are working in a high-dimensional signal space, it seems reasonable to define the "window" in terms of the relative distances d_i and d_j from m_i and m_j, respectively, having constant ratios. In this way the borders of the "window" are Apollonian hyperspheres. The vector x is defined to lie in the "window" if

    min(d_i/d_j, d_j/d_i) > s.    (13)
If w is the relative width of the window at its narrowest point, then s = (1 - w)/(1 + w). The optimal size of the window depends on the number of available training samples. If we had a large number of samples, a narrow window would guarantee the most accurate location of the border; but for good statistical accuracy the number of samples falling into the window must be sufficient too, so a 20% window seems a good compromise, at least in the experiments reported below.
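The window test can be written directly from these definitions. A small sketch, with w = 0.2 per the compromise suggested above:

```python
def in_window(d_i, d_j, w=0.2):
    """True iff a sample with distances d_i, d_j to the two nearest codebook
    vectors lies inside the relative window of width w, i.e.,
    min(d_i/d_j, d_j/d_i) > s with s = (1 - w)/(1 + w)."""
    s = (1 - w) / (1 + w)
    return min(d_i / d_j, d_j / d_i) > s

print(in_window(0.9, 1.0))   # near the midplane -> True
print(in_window(0.5, 1.0))   # clearly on one side -> False
```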
For reasons explained in the next section, the classification accuracy of LVQ2 first improves as the decision surface is shifted towards the Bayes limit; after that, however, the m_i continue "drifting away". Therefore this algorithm ought to be applied for a relatively short time only, say, starting with α = 0.02 and letting it decrease to zero in at most 10 000 steps.
C. Type Three Learning Vector Quantization (LVQ3)

The LVQ2 algorithm was based on the idea of differentially shifting the decision borders towards the Bayes limits, while no attention was paid to what might happen to the location of the m_i in the long run if this process were continued. Thus, although researchers have reported good results, some have had problems, too. It turns out that at least two different kinds of detrimental effect must be taken into account. First, because corrections are proportional to the difference of x and m_i, or x and m_j, the correction on m_j (correct class) is of larger magnitude than that on m_i (wrong class); this results in monotonically decreasing distances ||m_i - m_j||. One remedy is to compensate for this effect, approximately at least, by accepting all the training vectors from the "window"; the only condition is that one of m_i and m_j must belong to the correct class, and the other to the incorrect class. The second problem arises from the fact that if the process in (12) is continued, it may lead to another asymptotic equilibrium of the m_i that is no longer optimal. Therefore it seems necessary to include corrections that ensure that the m_i continue approximating the class distributions, at least roughly. Combining these ideas, we now obtain an improved algorithm that may be called LVQ3. In a series of experiments, applicable values of ε between 0.1 and 0.5 were found. The optimal value of ε seems to depend on the size of the window, being smaller for narrower windows. This algorithm seems to be self-stabilizing, i.e., the optimal placement of the m_i does not change in continued learning. Notice that whereas in LVQ1 only one of the m_i values was changed at each step, LVQ2 and LVQ3 change two codebook vectors simultaneously.
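Since the LVQ3 equations are not reproduced in this excerpt, the following sketch follows the textual description only: corrections are accepted whenever one of the two nearest codebook vectors is of the correct class and the other is not, and an ε-weighted correction keeps the m_i tracking the class distributions when both are correct. Names and parameter values are illustrative assumptions.

```python
import numpy as np

def lvq3_step(x, x_label, codebook, labels, alpha=0.02, eps=0.3, w=0.2):
    """One LVQ3 update on the two nearest codebook vectors (sketch)."""
    d = np.linalg.norm(codebook - x, axis=1)
    i, j = np.argsort(d)[:2]
    s = (1 - w) / (1 + w)
    if min(d[i] / d[j], d[j] / d[i]) <= s:
        return                                 # outside the window: no update
    ci, cj = labels[i] == x_label, labels[j] == x_label
    if ci and cj:                              # both correct: eps-weighted pull
        for k in (i, j):
            codebook[k] += eps * alpha * (x - codebook[k])
    elif ci != cj:                             # one correct, one wrong
        good, bad = (i, j) if ci else (j, i)
        codebook[good] += alpha * (x - codebook[good])   # attract
        codebook[bad] -= alpha * (x - codebook[bad])     # repel
```

The ε-term is what distinguishes LVQ3 from LVQ2 and gives the claimed self-stabilizing behavior.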

IV. APPLICATION OF THE MAP TO SPEECH RECOGNITION
When artificial neural networks are to be used for a practical pattern recognition application such as speech recognition, the first task is to decide whether it is desirable to perform the complete chain of processing operations, starting, e.g., from the pre-analysis of the microphone signal and leading to some form of linguistic encoding of speech, using "all-neural" operations, or whether "neural networks" should only be applied at the most critical stage, whereby the rest of the processing operations can be implemented on standard computing equipment. This choice mainly depends on whether the objective is commercial or academic.
Another issue is whether the aim is to demonstrate the ultimate capabilities of "neural networks" in the analysis of dynamical speech information, or whether it is only to replace some of the traditional "spectral" and "vector space" pattern recognition algorithms by highly adaptive, learning "neural network" principles.
In a speech recognizer, a proper place for artificial neural networks is the phonemic recognition stage, where exacting statistical analysis is needed. It should be remembered that if phonemes, i.e., classes of different phonological realizations of vowels and consonants, are selected as the basic phonetic units, then account has to be taken of their transformations due to coarticulation effects. In other words, the spectral properties of the phonemes are changed in the context or frame of other phonemes. In an "all-neural" speech recognizer it may not be necessary to distinguish or consider phonemes at all, because interpretation of speech is then regarded as an integral, implicit process. Introduction of the phoneme concept already implies that the system must be able to identify phonemes automatically in one form or another and to label the corresponding time intervals. Correction of coarticulation effects may then already be implemented in the acoustic analysis itself, by regarding the speech states as Markov processes and analyzing the state transitions statistically [53]. A different approach altogether is first to apply some vector quantization classification, whereby the speech waveform is only labeled by class symbols of stationary phonemes, as if no coarticulation effects were being taken into account. Corrections can then be made afterwards, in symbolic form, in a separate postprocessing stage. We have used the latter approach.
We have implemented a practical "phonetic typewriter" for unlimited speech input, using the Self-Organizing Map to spot and recognize phonemes in continuous speech (Finnish and Japanese) [42], [46], [48]. The "network" was fine-tuned for optimal decision accuracy by Learning Vector Quantization. After that, in the postprocessing stage, we applied a self-learning grammar that corrects the majority of coarticulation errors and derives its numerous transformation rules automatically from given examples. This principle, termed "Dynamically Expanding Context" [37], [40], actually belongs to the category of learning Artificial Intelligence methods, and thus falls outside the scope of this article (cf. Sec. IV-D below).

A. Acoustic Preprocessing of the Speech Signal
It is known that biological sensory organs such as the inner ear are usually able to adapt to signal transients in a fast, nonlinear way. Nonetheless, we decided to apply conventional frequency analysis to the preprocessing of speech. The main reason for this was that digital Fourier analysis is both accurate and fast, and the fundamentals of digital filtering are well understood. Deviations from physiological reality are not essential since the self-organizing neural network can accept many alternative kinds of preprocessing and can compensate for minor imperfections.
The technical details of the acoustic preprocessing stage are briefly as follows: 1) 5.3-kHz low-pass switched-capacitor filter, 2) 12-bit A/D converter with 13.02-kHz sampling rate, 3) 256-point FFT formed every 9.83 ms using a Hamming window, 4) logarithmization and smoothing of the power spectrum, 5) combination of spectral channels from the frequency range 200 Hz-5 kHz into a 15-component pattern vector, 6) subtraction of the average from the components, 7) normalization of the pattern vectors. Except for steps 1) and 2), an integrated-circuit signal processor, TMS32010, is used for the computation.
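Steps 3)-7) can be sketched in software as follows. This is an approximation: the exact smoothing and channel-grouping scheme of the original system is not specified here, so smoothing is omitted and equal-width grouping of the FFT bins is assumed.

```python
import numpy as np

def preprocess_frame(frame, fs=13020.0, n_fft=256, n_channels=15):
    """Sketch of steps 3)-7): Hamming-windowed FFT, log power spectrum,
    grouping of the 200 Hz-5 kHz bins into 15 channels, subtraction of the
    average, and normalization. Steps 1)-2) happen in analog hardware."""
    windowed = frame * np.hamming(n_fft)
    power = np.abs(np.fft.rfft(windowed)) ** 2
    log_power = np.log(power + 1e-12)                   # logarithmization
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    band = log_power[(freqs >= 200.0) & (freqs <= 5000.0)]
    groups = np.array_split(band, n_channels)           # combine channels
    v = np.array([g.mean() for g in groups])
    v -= v.mean()                                       # subtract the average
    return v / np.linalg.norm(v)                        # normalize

v = preprocess_frame(np.random.randn(256))
print(v.shape)  # (15,)
```

Each 9.83-ms frame thus yields one normalized 15-component pattern vector x(t) for the map.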

B. Phoneme Map
The simplest type of speech map formed by self-organization is the static phoneme map. There are 21 phonemes in Finnish: /u, o, a, æ, ø, y, e, i, s, m, n, t, l, r, j, v, h, d, k, p, ŋ/. For their representation we used short-time spectra as the input patterns x(t). The spectra were evaluated every 9.83 ms. They were computed by the 256-point FFT, from which a 15-component spectral vector was formed by grouping of the channels. In the present study all the spectral samples, even those from the transitory regions, were employed and presented to the algorithm in the natural order of their utterance. During learning, the spectra were not segmented or labeled in any way: any features present in the speech waveform contributed to the self-organized map. After adaptation, the map was calibrated using known stationary phonemes (Fig. 10). The map resembles the well-known formant maps used in phonetics; the main difference is that in our maps complete spectra, not just two of their resonant frequencies as in formant maps, are used to define the mapping. Recognition of discrete phonemes is a decision-making process in which the final accuracy only depends on the rate of misclassification errors. It is therefore necessary to try to minimize them using a decision-controlled (supervised) learning scheme, with a training set of speech spectra of known classification. In practice, for a new speaker, it will be sufficient to dictate 200 to 300 words, which are then analyzed by an automatic segmentation method. The latter picks up the training spectra that are applied in the supervised learning algorithm. The finite set of training spectra (of the order of 2000) must be repeated in the algorithm either cyclically or in a random permutation. LVQ1, LVQ2, or LVQ3 can be used as the fine-tuning algorithm. A map created for a typical (standard) speaker can then be modified for a new speaker very quickly, using 100 more dictated words and LVQ fine tuning only.
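Calibration of the trained map with known stationary phonemes might be sketched as a majority vote among the best-matching calibration samples for each map cell. This is an illustrative reconstruction, not necessarily the procedure actually used; the toy weights, samples, and labels are placeholders.

```python
import numpy as np
from collections import Counter

def calibrate(weights, samples, labels):
    """Label each map unit with the majority phoneme class among the known
    (stationary) calibration samples for which it is the best match."""
    votes = [Counter() for _ in range(len(weights))]
    for x, lab in zip(samples, labels):
        winner = np.argmin(np.linalg.norm(weights - x, axis=1))
        votes[winner][lab] += 1
    return [c.most_common(1)[0][0] if c else None for c in votes]

weights = np.array([[0.0, 0.0], [1.0, 1.0]])            # two toy map cells
samples = np.array([[0.1, 0.0], [0.9, 1.0], [1.0, 0.9]])
print(calibrate(weights, samples, ['a', 't', 't']))     # -> ['a', 't']
```

Cells that no calibration sample selects remain unlabeled (None), which matches the idea that only responding cells receive phoneme labels.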

C. Specific Problems with Transient Phonemes
Generally, the spectral properties of consonants behave more dynamically than those of vowels. Especially in the case of stop consonants, it seems better to pay attention to the plosive burst and the transient region between the consonant and the subsequent vowel in order to identify the consonant. In our system transient information is coded in additional "satellite" maps (called transient maps), which are trained, using transient spectral samples alone, to describe the dynamic features with higher resolution [48].
Our system was in fact developed in two versions: one for Finnish and one for Japanese. In the Japanese version, four transient maps have been constructed to distinguish the following cases: 1) voiceless stops /k, p, t/ and the glottal stop (vowel at the beginning of an utterance), 2) voiceless stops /k, p, t/ without comparison with the glottal stop, 3) voiced stops /b, d, g/, 4) nasals /m, n, ŋ/. Only one transient map has been adopted for the Finnish version, making the distinction between /k, p, t/ and the glottal stop. (/b/ and /g/ do not exist in original Finnish.)

D. Compensation for Coarticulation Effects using the "Dynamically Expanding Context"
Because of coarticulation effects, i.e., transformation of the speech spectra due to neighboring phonemes, systematic errors appear in phonemic transcriptions. For instance, the Finnish word "hauki" (meaning pike) is almost invariably recognized as the phoneme string /haouki/ by our acoustic processor. It may then be suggested that if a transformation rule /aou/ → /au/ is introduced, this error will be corrected. It might also be imagined that it is possible to list and take into account all such variations. However, there may be hundreds of different frames or contexts of neighboring phonemes in which a particular phoneme may occur, and in many cases such empirical rules are contradictory; they are only statistically correct. The frames may also be erroneous. In order to find an optimal and very large system of rules, the Dynamically Expanding Context grammar mentioned above was applied [37], [40]. Its rules, or productions, can be used to transform erroneous symbol strings into correct ones, and even into orthographic text.

Fig. 10. An example of a phoneme map. Natural Finnish speech was processed by a model of the inner ear which performs its frequency analysis. The resulting signals were then connected to an artificial neural network, the cells of which are shown in this picture as circles. The cells were tuned automatically, without any supervision or extra information given, to the acoustic units of speech known as phonemes. The cells are labeled by the symbols of those phonemes to which they "learned" to give responses; most cells give a unique answer, whereas the double labels show which cells respond to two phonemes.
Because the correction rules are made accessible in memory by a software content-addressing method (hash coding), they can be applied very quickly, such that the overall operation of the grammar, even with 15 000 rules, is almost in real time. This algorithm is able to correct up to 70% of the errors left by the phoneme map recognizer.
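The hash-coded rule memory behaves essentially like a dictionary lookup. The sketch below is purely illustrative: the single toy rule is the /aou/ → /au/ example from the text, and it ignores the context-dependence that real Dynamically Expanding Context productions carry.

```python
# Illustrative only: real DEC rules are context-dependent productions,
# selected by expanding the context until the rule becomes unambiguous.
rules = {"aou": "au"}

def apply_rules(s, rules):
    """Replace every rule key occurring in the transcription by its output.
    A dict lookup plays the role of the hash-coded rule memory."""
    for src, dst in rules.items():
        s = s.replace(src, dst)
    return s

print(apply_rules("haouki", rules))  # -> "hauki"
```

With hashing, each rule application is close to constant time, which is why even 15 000 rules run almost in real time.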
E. Performance of the "Phonetic Typewriter"

In order to get some idea of the accuracy of the map algorithm, we first show a comparative benchmarking of five different methods, namely, classification of manually selected phonemic spectra by the classical parametric Bayes classifier, the well-known k-Nearest-Neighbor method (kNN), and LVQ1, LVQ2, and LVQ3.
In this partial experiment, the spectral samples were of Finnish phonemes (divided into 19 classes and using 15 frequency channels for the spectral decomposition). There were 1550 training vectors and 1550 statistically independent vectors that were used only for testing. The error percentages are given in Table 2. Note that the parametric Bayes classifier is not even theoretically the best, because it assumes that the class samples are normally distributed. We have not been able to find any method, theoretical or heuristic, that classifies speech spectra better than LVQ1, LVQ2, or LVQ3.
In its complete form, the "Phonetic Typewriter" has been tested on several Finnish and Japanese speakers over a long period. To someone familiar with practical speech recognizers it will be clear that it would be meaningless to evaluate and compare different test runs statistically; the results obtained in each isolated test depend so much on the experimental situation and the content of the text, the status and tuning of the equipment, as well as on the physical condition of the speaker. The number of tests performed over many years is also too large to be discussed fully here. Let it suffice to mention that the accuracy of spotting and recognizing phonemes in arbitrary continuous speech typically varies between 80 and 90% (depending on the automatic segmentation and recognition of any phoneme), and this figure depends on the speaker and the text. After compensation for coarticulation effects and editing the text into orthographic form, the accuracy, in terms of the correctness of any letter, is of the order of 92 to 97%, again depending on the speaker and the text.
The Phonetic Typewriter has already been implemented in several hardware versions using signal processor chips. The latest versions operate in genuine real time with continuous dictation.
It may be of interest here to mention other results, independent of ours. McDermott and Katagiri [28], [66] have carried out experiments on all the Japanese phonemes, and report that LVQ2 gave consistently higher accuracies than Backpropagation Time Delay Neural Networks [105] and was faster in learning.

V. SEMANTIC MAP
Demonstrations such as those reported above have indicated that the Self-organizing Map is indeed able to extract abstract information from multidimensional primary signals, and to represent it as a location, say, in a two-dimensional network. Although this is already a step towards generalization and symbolism, it must be admitted that the extraction of features from geometrically or physically relatable data elements is still a very concrete task, in principle at least.
The operation of the brain at the higher levels relies heavily on abstract concepts, symbolism, and language. It is an old notion that the deepest semantic elements of any language should also be physiologically represented in the neural realms. There is now new physiological evidence for linguistic units being locatable in the human brain [6], [15].
In attempting to devise Neural Network models for linguistic representations, the first difficulty is encountered when trying to find metric distance relations between symbolic items. Unlike with primary sensory signal patterns, for which similarity is easily derivable from their mutual distances in the vector spaces in which they are represented, it cannot be assumed that encodings of symbols in general have any relationship with the observable characteristics of the corresponding items. How could it then be possible to represent the "logical similarity" of pairs of items, and to map such items topographically? The answer lies in the fact that the symbol, during the learning process, is presented in context, i.e., in conjunction with the encodings of a set of other concurrent items. In linguistic representations context might mean a few adjacent words. Similarity between items would then be reflected through the similarity of the contexts. Note that for ordered sets of arbitrary encodings, invariant similarity can be expressed, e.g., in terms of the number of items they have in common. On the other hand, it may be evident that the meaning (semantics) of a symbolic encoding is only derivable from the conditional probabilities of its occurrences with other encodings, independent of the type of encoding [68].
However, in the learning process, the literal encodings of the symbols must be memorized, too. Let vector x_s represent the symbolic expression of an item, and x_c the representation of its context. The simplest neural model then assumes that x_s and x_c are connected to the same neural units, i.e., the representation (pattern) vector x of the item is formed as a concatenation of x_s and x_c:

    x = [x_s^T, x_c^T]^T.    (14)

In other words, the symbol part and the context part form a vectorial sum of two orthogonal components.
The core idea underlying symbol maps is that the two parts are weighted properly, such that the norm of the context part predominates over that of the symbol part during the self-organizing process; the topographical mapping then mainly reflects the metric relationships of the sets of associated encodings. But since the inputs for symbolic signals are also active all the time, memory traces of them are formed at the corresponding inputs of those cells in the map that have been selected (or actually enforced) by the context part. If then, during recognition of input information, the context signals are missing or weaker, the (same) map units are selected solely on the basis of the symbol part. In this way the symbols become encoded into a spatial order reflecting their logical (or semantic) similarities.
In the following, I shall demonstrate this idea, which was originated by H. Ritter, using a simple language [84]. The simplest definition of the context of a word is to take all those words (together with their serial order) that occur in a certain "window" around the selected word. For simplicity, we shall imagine that the content of each "window" can somehow be presented to the x_c input ports of the neural system. We are not interested here in any particular means for the conversion of, say, temporal signal patterns into parallel ones (this task could be done using paths with different delays, eigenstates that depend on sequences, or any other mechanisms implementable in short-term memory).
The vocabulary used in this experiment is listed in Fig. 11(a); each word class has further categorial subdivisions, such as names of persons, animals, and inanimate objects. To study semantic relationships in their purest form, it must be stipulated that the semantic meaning not be inferable from any patterns used for the encoding of the individual words, but only from the context in which the words occur (i.e., combinations of words). To this end each word was encoded by a random vector of unit length (here, seven-dimensional). A sequence of randomly generated meaningful three-word sentences was used as the input data to the self-organizing process. Meaningful sentence patterns therefore first had to be constructed on the basis of word categories (Fig. 11(b)). Each explicit sentence was then constructed by randomly substituting the numbers in a randomly selected sentence pattern from Fig. 11(b) by words with compatible numbering in Fig. 11(a). A total of 498 different three-word sentences are possible, a few of which are exemplified in Fig. 11(c). These sentences were concatenated into a single continuous string, S.
The context of a word in this string was restricted to the pair of words formed by its immediate predecessor and successor in S (ignoring any sentence borders; i.e., words from adjacent sentences in S are uncorrelated and act like random noise in that field). The code vectors of the predecessor/successor pair forming the context of a word were concatenated into a single 14-dimensional code vector x_c.
In this simple demonstration we thus took into account only the context provided by the immediately adjacent textual environment of each word occurrence. Even this restricted context already contains interesting semantic relationships.
In our computer experiments it turned out that, instead of presenting each phrase separately to the algorithm, a much more efficient learning strategy is first to consider each word in its average context over a set of possible "windows". The (mean) context of a word was thus first defined as the average, over 70 000 sentences, of all code vectors of the predecessor/successor pairs surrounding that particular word. The resulting thirty 14-dimensional "average word contexts", normalized to unit length, assumed the role of the "context fields" x_c in (14). Each "context field" was combined with a 7-dimensional "symbol field" x_s, consisting of the code vector of the word itself, but scaled to length a. The parameter a determines the relative influence of the symbol part x_s in comparison to the context part x_c, and was set to 0.2.
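The construction of the input vector x from a scaled symbol field and an averaged context field might be sketched as follows. The tiny vocabulary and the zero context vector here are placeholders; in the experiment the context is the normalized average of the 14-dimensional predecessor/successor code vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["Mary", "likes", "meat"]                  # stand-in for the 30 words
codes = {w: rng.standard_normal(7) for w in vocab}
for w in codes:                                    # unit-length random codes
    codes[w] /= np.linalg.norm(codes[w])

def input_vector(word, avg_context, a=0.2):
    """Concatenate the symbol code scaled to length a with the 14-d
    averaged predecessor/successor context, as in (14)."""
    x_s = a * codes[word]                          # symbol part, length a
    return np.concatenate([x_s, avg_context])      # 7 + 14 = 21 components

ctx = np.zeros(14)          # dummy; accumulated over many sentences in reality
x = input_vector("Mary", ctx)
print(x.shape)  # (21,)
```

Because the context part keeps unit norm while the symbol part has norm a = 0.2, the context dominates the self-organizing process, exactly as required above.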
For the simulation, a planar, rectangular lattice of 10 by 15 cells was used. The initial weight vectors of the cells were chosen randomly, so that no initial order was present. Updating was based on (7) and (8). The learning step size was set to 0.8, and the radius σ(t) of the adjustment zone (cf. (8)) was gradually decreased from an initial value σ_i = 4 to a final value σ_f = 0.5 according to the law

    σ(t) = σ_i (σ_f/σ_i)^(t/t_max).

Here t counts the number of adaptation steps.
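The radius decay law is straightforward to evaluate; a minimal sketch with the values given above:

```python
def sigma(t, t_max=2000, s_i=4.0, s_f=0.5):
    """Neighborhood radius decay: sigma(t) = s_i * (s_f/s_i)**(t/t_max)."""
    return s_i * (s_f / s_i) ** (t / t_max)

print(sigma(0))      # 4.0
print(sigma(2000))   # 0.5
```

The exponential schedule shrinks the adjustment zone smoothly from global ordering to local fine-tuning of the map.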
After t_max = 2000 input presentations, the responses of the neurons to presentation of the symbol parts alone were tested. In Fig. 12, each symbolic label is written at the site at which the symbol signal x = [x_s^T, 0]^T gave the maximum response. We clearly see that the contexts "channel" the word items to memory positions whose arrangement reflects both grammatical and semantic relationships. Words of the same type, i.e., nouns, verbs, and adverbs, are segregated into separate, large domains. Each of these domains is further organized according to similarities on the semantic level. Adverbs with opposite meanings tend to be close to each other, because sentences differing in one word only are regarded as semantically correlated, and the words that differ then usually have opposite meanings. The groupings of the verbs correspond to differences in the ways they can co-occur with adverbs, persons, animals, and inanimate objects such as, e.g., food.
It could be argued that the structures resulting in the map were artificially created by a preplanned choice of the sentence patterns allowed as input. This is not the case, however, since it is easy to check that the categorial sentence patterns in Fig. 11(b)

VI. SURVEY OF PRACTICAL APPLICATIONS OF THE MAP
In addition to numerous more abstract simulations, theoretical developments, and "toy examples," the following practical problem areas have been suggested for the Self-Organizing Map or the LVQ algorithms. In some of them concrete work is already in progress. Of these, the application to speech recognition has the longest tradition in demonstrating the power of the map method when dealing with difficult stochastic signals. My personal expectation is, however, that the greatest industrial potential of this method may lie in process control and telecommunications.
On the other hand, it is a little surprising that so few applications of the maps to computer vision are being studied. This does not mean that the problems of vision are not important. It is rather that automatic analysis and extraction of visual features, without a heuristic or analytical approach, has turned out to be an extremely difficult problem. Biological and artificial vision probably require very complicated hierarchical systems using many stages (e.g., several different maps) [55], [56]. One unclarified problem is how the maps should be interconnected, e.g., whether special nonlinear interfaces are needed [82]; and in hierarchical systems, adaptive normalization of input (cf. [31]) also seems necessary. Only a few isolated problems, such as the texture analysis under study in our laboratory, might be amenable to the basic method as such.

VII. DISCUSSION
It was stated in Secs. I and III that it is not advisable to use the Self-organizing Map for classification problems, because decision accuracy can be significantly increased if fine tuning such as LVQ is used. Another important notion, which concerns not only the maps but most of the other neural network models as well, is that it would often be absurd to use primary signal elements, such as temporal samples of the speech waveform or pixels of an image, directly as the components of x. This is especially true if the input patterns are fine-structured, like line drawings. It is not possible to achieve any invariances in perception unless the primary information is first transformed, using, e.g., various convolutions with, say, Gabor functions [13], or other, possibly nonlinear functionals of the image field [81], as components of x. Which particular choice of functionals should be used for preprocessing in a particular task is a very difficult and delicate question, and cannot be discussed here.
One question concerns the maximum capacity achievable in the maps. Is it possible to increase their size, to eventually use them for data storage in large knowledge databases? It can at least be stated that the brain's maps are not particularly extensive; they mainly seem to provide for efficient encoding of a particular subset of signals to enhance the operation and capacity of associative memory [44]. If more extensive systems are required, it might be more efficient to develop hierarchical structures of abstract representations.
The hardware used for the maps has so far only consisted of co-processor boards (cf., e.g., [42], [48]). If the simple algorithm is to be built directly into special hardware, one of its essential operations will be the global extremum selector, for which conventional parallel computing hardware is available [39]. Analog "winner-take-all" circuits can also be used [20], [21], [52]. Another question is whether the learning operations ought to be performed on the board, or whether fixed values for the weights could be loaded into the cells. Note that in the latter case the function of the cells can be very simple, like that of the conventional formal neurons. One beneficial property of the maps is that their parameters usually stabilize into a narrow dynamic range, and the accuracy requirements are then modest. In this case even integer arithmetic operations can provide sufficient accuracy.

What is the most significant difference between the Self-Organizing Map and other contemporary neural-model approaches? Most of the latter strongly emphasize the aspect of distributed processing, and only consider the spatial organization of the processing units as a secondary aspect. The map principle, on the other hand, is in some ways complementary to this idea: it emphasizes the intrinsic potential of this particular self-organizing process for creating a localized, structured arrangement of representations in the basic network module.
Actually, we should not talk of the localization of a "function": it is only the response that is localized. I am thus not opposed to the view that neural networks are distributed systems. The massive interconnects that underlie all neural processing are certainly spread over the network; their effects, on the other hand, may be "focused" on local sites.
It seems inevitable, however, that any complex processing task requires the organization of information into separate parts. Distributed processing models in general underrate this issue. Consequently, many models that process features of input data without structuring exhibit slow convergence and poor generalization ability, usually ensuing from ignorance of the localization of the adaptive processes.
On the lower perceptual levels, the localization of responses in topographically organized maps was demonstrated long ago, and it is known that such maps need not be prespecified in detail, but can instead organize themselves on the basis of the statistics of the incoming signals. Such maps have already been applied with success in many complex pattern recognition and robot control tasks.
On the higher levels of representation, relationships between items seem to be based on more subtle roles in their occurrence, and are less apparent from their immediate intrinsic properties. Nonetheless it has also been shown recently that, even with a simple modeling assumption of semantic roles, topological self-organization of semantic data will take place. To describe the role of an item, it is sufficient that the input data are presented together with a sufficient amount of context. This then controls the adaptation process.
In the practical application that we have studied most carefully, viz. speech recognition, a statistical accuracy of phonemic recognition has been achieved that is clearly equal to or better than the results produced by more conventional methods, even when the latter are based on analysis of signal dynamics [28], [66].
It should be emphasized that the map method is not restricted to any particular form of preprocessing, such as amplitude spectra in speech recognition, or even to phonemes as basic phonological units. For instance, analogous maps may be formed for diphones, syllables, or demisyllables, and other spectral representations, such as linear predictive coding (LPC) coefficients or cepstra, may be used as the input information to the maps. Although the basic one-level map, as demonstrated in Sec. II-E, has already been shown to be capable of creating hierarchical (ultrametric) representations of structured data distributions, it might be expected that the real potential of the map lies in a genuine hierarchical or otherwise structured system that consists of several interconnected map modules. In a more natural system, such modules might also correspond to contiguous areas in a single large sheet, where each area receives a different kind of external input, as do the different areas of the cortex. In that case, the borders between the modules might be diffuse. The problem of hierarchical maps, however, has turned out to be very difficult. One particular difficulty arises if the inputs to a cell come from very different sources; it then seems inevitable that for the comparison of input patterns, an asymmetrical distance function, in which the signal components are provided with adaptive tensorial weights, must be applied [31]. Another aspect concerns the interfaces of modules in a hierarchical map system: the signals merging from different modules may have to be combined nonlinearly [82]. On the other hand, it has already been demonstrated that the map, or the LVQ algorithms, can be used as a preprocessing stage for other models [26], [69], [101].
In the Counterpropagation Network of Hecht-Nielsen [22], competitive learning is neatly integrated into a hierarchical system as a special layer. One of the strongest original motives for starting the development of (artificial) neural networks was their use as learning systems that might effectively be able to utilize the vast capacities of active circuits that can be manufactured using semiconductor or optical technologies. It is therefore a little surprising that most of the theoretical research on and simulations of neural networks have been restricted to relatively small networks containing only a few tens to a few thousands of nodes (let alone parallel networks for preprocessing images). The main problem with most circuits seems to be slow convergence of learning, which again indicates that the best learning mechanisms are yet to be found.