Harriman House | Business Books | Politicos | Financial Conferences | Glossary | Investor Education | Derivatives | Financial Gurus | Spread Betting Central |

Home |  Search |  shopping basket Shopping basket
Tel: +44 (0)1730 233870    Email: bookshop@global-investor.com  
Categories
Advertise on this site
Neural Networks for Pattern Recognition by Christopher M. Bishop
Usually ships within 3 to 5 working days

    • Product code: 16097
    • ISBN: 0198538642, ISBN13: 9780198538646, 504 pages, paperback
      Published by Oxford University Press on 1995
    Rate this book...

    Rating: 0.0/5 (0 votes cast)

    Description of Neural Networks for Pattern Recognition

    Introduction

    In recent years neural computing has emerged as a practical technology, with successful applications in many fields. The majority of these applications are concerned with problems in pattern recognition, and make use of feed-forward network architectures such as the multi-layer perceptron and the radial basis function network. Also, it has also become widely acknowledged that successful applications of neural computing require a principled, rather than ad hoc, approach. My aim in writing this book has been to provide a more focused treatment of neural networks than previously available, which reflects these developments. By deliberately concentrating on the pattern recognition aspects of neural networks, it has become possible to treat many important topics in much greater depth. For example, density estimation, error functions, parameter optimization algorithms, data pre-processing, and Bayesian methods are each the subject of an entire chapter.

    From the perspective of pattern recognition, neural networks can be regarded as an extension of the many conventional techniques which have been developed over several decades. Indeed, this book includes discussions of several concepts in conventional statistical pattern recognition which I regard as essential for a clear understanding of neural networks. More extensive treatments of these topics can be found in the many texts on statistical pattern recognition, including Duda and Hart (1973), Hand (1981), Devijer and Kittler (1982), and Fukunaga (1990). Recent review articles by Ripley (1994) and Titterington (1994) have also emphasized the statistical underpinnings of neural networks.

    Historically, many concepts in neural computing have been inspired by studies of biological networks. The perspective of statistical pattern recognition, however, offers a much more direct and principled route to many of the same concepts. For example, the sum-and-threshold model of a neuron arises naturally as the optimal discriminant function needed to distinguish two classes whose distributions are normal with equal covariance matrices. Similarly, the familiar logistic sigmoid is precisely the function needed to allow the output of a network to be interpreted as a probability, when the distribution of hidden unit activations is governed by a member of the exponential family.
    An important assumption which is made throughout the book is that the processes which give rise to the data do not themselves evolve with time. Techniques for dealing with non-stationary sources of data are not so highly developed, nor so well established, as those for static problems. Furthermore, the issues addressed within this book remain equally important in the face of the additional complication of non-stationarity. It should be noted that this restriction does not mean that applications involving the prediction of time series are excluded. The key consideration for time series is not the time variation of the signals themselves, but whether the underlying process which generates the data is itself evolving with time, as discussed in Section 8.4.

    Use as a course text

    This book is aimed at researchers in neural computing as well as those wishing to apply neural networks to practical applications. It is also intended to be used used as the primary text for a graduate-level, or advanced undergraduate-level, course on neural networks. In this case the book should be used sequentially, and care has been taken to ensure that where possible the material in any particular chapter depends only on concepts developed in earlier chapters.

    Exercises are provided at the end of each chapter, and these are intended to reinforce concepts developed in the main text, as well as to lead the reader through some extensions of these concepts. Each exercise is assigned a grading according to its complexity and the length of time needed to solve it, ranging from (*) for a short, simple exercise, to (***) for a more extensive or more complex exercise. Some of the exercises call for analytical derivations or proofs, while others require varying degrees of numerical simulation. Many of the simulations can be carried out using numerical analysis and graphical visualization packages, while others specifically require the use of neural network software. Often suitable network simulators are available as add-on tool-kits to the numerical analysis packages. No particular software system has been prescribed, and the course tutor, or the student, is free to select an appropriate package from the many available. A few of the exercises require the student to develop the necessary code in a standard language such as C or C++. In this case some very useful software modules written in C, together with background information, can be found in Press et al. (1992).

    Prerequisites

    This book is intended to be largely self-contained as far as the subject of neural networks is concerned, although some prior exposure to the subject may be helpful to the reader. A clear understanding of neural networks can only be achieved with the use of a certain minimum level of mathematics. It is therefore assumed that the reader has a good working knowledge of vector and matrix algebra, as well as integral and differential calculus for several variables. Some more specific results and techniques which are used at a number of places in the text are described in the appendices.

    Overview of the chapters

    The first chapter provides an introduction to the principal concepts of pattern recognition. By drawing an analogy with the problem of polynomial curve fitting, it introduces many of the central ideas, such as parameter optimization, generalization and model complexity, which will be discussed at greater length in later chapters of the book. This chapter also gives an overview of the formalism of statistical pattern recognition, including probabilities, decision criteria and Bayes' theorem.

    Chapter 2 deals with the problem of modelling the probability distribution of a set of data, and reviews conventional parametric and non-parametric methods, as well as discussing more recent techniques based on mixture distributions. Aside from being of considerable practical importance in their own right, the concepts of probability density estimation are relevant to many aspects of neural computing.

    Neural networks having a single layer of adaptive weights are introduced in Chapter 3. Although such networks have less flexibility than multi-layer networks, they can play an important role in practical applications, and they also serve to motivate several ideas and techniques which are applicable also to more general network structures.
    Chapter 4 provides a comprehensive treatment of the multi-layer perceptron, and describes the technique of error back-propagation and its significance as a general framework for evaluating derivatives in multi-layer networks. The Hessian matrix, which plays a central role in many parameter optimization algorithms as well as in Bayesian techniques, is also treated at length.

    An alternative, and complementary, approach to representing general non-linear mappings is provided by radial basis function networks, and is discussed in Chapter 5. These networks are motivated from several distinct perspectives, and hence provide a unifying framework linking a number of different approaches.

    Several different error functions can be used for training neural networks, and these are motivated, and their properties examined, in Chapter 6. The circumstances under which network outputs can be interpreted as probabilities are discussed, and the corresponding interpretation of hidden unit activations is also considered.

    Chapter 7 reviews many of the most important algorithms for optimizing the values of the parameters in a network, in other words for network training. Simple algorithms, based on gradient descent with momentum, have serious limitations, and an understanding of these helps to motivate some of the more powerful algorithms, such as conjugate gradients and quasi-Newton methods.

    One of the most important factors in determining the success of a practical application of neural networks is the form of pre-processing applied to the data. Chapter 8 covers a range of issues associated with data pre-processing, and describes several practical techniques related to dimensionality reduction and the use of prior knowledge.

    Chapter 9 provides a number of insights into the problem of generalization, and describes methods for addressing the central issue of model order selection. The key insight of the bias--variance trade-off is introduced, and several techniques for optimizing this trade-off, including regularization, are treated at length.

    The final chapter discusses the treatment of neural networks from a Bayesian perspective. As well as providing a more fundamental view of learning in neural networks, the Bayesian approach also leads to practical procedures for assigning error bars to network predictions and for optimizing the values of regularization coefficients.

    Some useful mathematical results are derived in the appendices, relating to the properties of symmetric matrices, Gaussian integration, Lagrange multipliers, calculus of variations, and principal component analysis.

    An extensive bibliography is included, which is intended to provide useful pointers to the literature rather than a complete record of the historical development of the subject.

    Contents of Neural Networks for Pattern Recognition

    1. Statistical Pattern Recognition

    Overview of pattern recognition. Non-linear mappings. The curse of dimensionality. Analogy with polynomial curve fitting. Simple examples. Generalization and model complexity. Probabilities and probability density functions. Bayes' theorem. Decision boundaries and discriminant functions. Loss matrices and rejection thresholds.

    2. Probability Density Estimation

    Parametric methods. The normal distribution. Maximum likelihood. Bayesian learning. The Robbins-Monro algorithm. Non-parametric methods. Histograms, kernel-based models and K-nearest-neighbours. Smoothing parameters. Gaussian mixture models. The EM algorithm. On-line parameter estimation.

    3. Single-Layer Networks

    Linear discriminant functions. Logistic discrimination. Linear separability. Least-squares techniques, including geometrical interpretation and pseudo-inverse solution. The perceptron convergence theorem. Fisher's discriminant and its relation to least-squares.

    4. The Multi-layer Perceptron

    Threshold and sigmoidal activation functions. Universality. Weight-space symmetries. Higher-order networks. Kolmogorov's theorem. Error back-propagation and its computational efficiency. The Jacobian matrix. The Hessian matrix including diagonal and outer product approximations and exact evaluation. Fast multiplication by the Hessian.

    5. Radial Basis Functions

    Exact interpolation. Smooth interpolation and the pseudo-inverse. Origins in regularization theory, noisy interpolation theory, potential function theory and probability density estimation. Radial basis functions for classification. Relation between radial basis functions and the multi-layer perceptron. Basis function optimization including orthogonal least squares and clustering algorithms. Links with Gaussian mixtures and density estimation.

    6. Error Functions

    Relation to maximum likelihood. Sum-of-squares error. Linear sum rules. Interpretation of network outputs as conditional averages and as conditional probabilities. Justification for outer product Hessian approximation. Inverse problems. Minkowski-R measure. Modelling conditional probability densities. Periodic variables. Sum-of-squares error for classification. Interpretation of hidden unit activations. Weighted sum-of-squares. Cross-entropy error for two classes. Hidden units as feature detectors. Multi-class cross entropy. Softmax activation function and error function. Entropy and information theory. General conditions for network outputs to represent probabilities.

    7. Parameter Optimization Algorithms

    Error surfaces. Local quadratic approximation. Use of gradient information. Weight initialization. Stopping criteria. Test error functions. Gradient descent, including the role of momentum. Batch versus sequential training. Convergence properties and limitations of gradient descent. Conjugate gradients. Scaled conjugate gradients. Newton's method. Quasi-Newton methods. Limited memory quasi-Newton methods. The Levenberg-Marquardt algorithm.

    8. Pre-Processing and Feature Extraction

    Motivation. Pre- and post-processing. Input normalization and encoding. Representation of discrete data. Missing data. Time series prediction. Feature selection criteria and search procedures. Principal component analysis and its limitations. Intrinsic dimensionality. Invariances and prior knowledge. Shared weights. Higher-order networks.

    9. Learning and Generalization

    The bias-variance decomposition. Regularization. Simple weight decay. Relation to early stopping. Training with noise. Soft weight sharing. Architecture selection. Binary classification problems. Cascade correlation. Weight saliency and pruning. Node pruning. Committees of networks. Mixture of experts. Model order selection. Cross-validation. Stacked generalization. Complexity criteria. VC dimension.

    10. Bayesian Techniques

    Bayesian learning of network weights. Choice of priors and noise model. Consistent priors. Posterior distribution of weights. A simple example of Bayesian learning. The Gaussian approximation. Distribution of network outputs. Application to classification problems. The evidence framework for hyper-parameters. Integration over hyper-parameters versus maximization. Bayesian model comparison. Committees revisited. Application of Bayesian techniques in practice. Monte-Carlo methods. Minimum description length.

    Appendices

    A. Eigenvectors and eigenvalues of real symmetric matrices. Quadratic forms.
    B. Gaussian integrals in 1 and d dimensions.
    C. Lagrange multipliers.
    D. Calculus of variations.
    E. Solution of minimization problem leading to principal component analysis.

    About Christopher M. Bishop

    After organising research on neural networks and machine learning for the Newton Institute programme, Chris Bishop joined the Microsoft Research Laboratory in Cambridge. He was elected to a Chair of Computer Science at the University of Edinburgh where he is a member of the Institute for Adaptive and Neural Computation in the Division of Informatics. He is a Fellow of Darwin College, Cambridge and a Senior Member of Robinson College, Cambridge.

    Elsevier Books Promotion

    gi bulletin sign up
    Bulk buying
    If you need bulk copies of Neural Networks for Pattern Recognition, or are interested in opening a corporate account, please contact us.