Complex Systems

Large Automatic Learning, Rule Extraction, and Generalization

John Denker
Daniel Schwartz
Ben Wittner
Sara Solla
Richard Howard
Lawrence Jackel
AT&T Bell Laboratories, Holmdel, NJ 07733, USA

John Hopfield
AT&T Bell Laboratories, Holmdel, NJ 07733, USA
and
California Institute of Technology, Pasadena, CA 91125, USA

Abstract

Since antiquity, man has dreamed of building a device that would "learn from examples," "form generalizations," and "discover the rules" behind patterns in the data. Recent work has shown that a highly connected, layered network of simple analog processing elements can be astonishingly successful at this, in some cases. In order to be precise about what has been observed, we give definitions of memorization, generalization, and rule extraction.

The most important part of this paper proposes a way to measure the entropy or information content of a learning task and the efficiency with which a network extracts information from the data.

We also argue that the way in which the networks can compactly represent a wide class of Boolean (and other) functions is analogous to the way in which polynomials or other families of functions can be "curve fit" to general data; specifically, such a representation extends the function beyond the given examples and averages out noise in the data. Alas, finding a suitable representation is generally an ill-posed and ill-conditioned problem. Even when the problem has been "regularized," what remains is a difficult combinatorial optimization problem.

When a network is given more resources than the minimum needed to solve a given task, the symmetric, low-order, local solutions that humans seem to prefer are not the ones that the network chooses from the vast number of solutions available; indeed, the generalized delta method and similar learning procedures do not usually hold the "human" solutions stable against perturbations. Fortunately, there are ways of "programming" into the network a preference for appropriately chosen symmetries.