## Large Automatic Learning, Rule Extraction, and Generalization

**John Denker, Daniel Schwartz, Ben Wittner, Sara Solla, Richard Howard, Lawrence Jackel**
*AT&T Bell Laboratories, Holmdel, NJ 07733, USA*

**John Hopfield**
*AT&T Bell Laboratories, Holmdel, NJ 07733, USA*
*and California Institute of Technology, Pasadena, CA 91125, USA*

#### Abstract

Since antiquity, man has dreamed of building a device that would "learn from examples," "form generalizations," and "discover the rules" behind patterns in the data. Recent work has shown that a highly connected, layered network of simple analog processing elements can, in some cases, be astonishingly successful at this. In order to be precise about what has been observed, we give definitions of *memorization*, *generalization*, and *rule extraction*.

The most important part of this paper proposes a way to measure the *entropy* or *information* content of a learning task and the *efficiency* with which a network extracts information from the data.
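To make the notion of a task's information content concrete, here is a small sketch (our illustration, not the paper's formulation): there are $2^{2^N}$ Boolean functions of $N$ inputs, so singling one out requires $2^N$ bits, and each labeled training example can supply at most one bit. The function names below are hypothetical.

```python
# Sketch (illustrative assumption, not the paper's exact measure):
# the entropy of the task "learn a Boolean function of n inputs".
def task_entropy_bits(n_inputs: int) -> int:
    """Bits needed to single out one of the 2**(2**n) Boolean functions."""
    return 2 ** n_inputs

def fraction_pinned_down(n_inputs: int, n_examples: int) -> float:
    """Upper bound on the fraction of the rule the examples determine,
    since each labeled example supplies at most one bit."""
    return min(1.0, n_examples / task_entropy_bits(n_inputs))

print(task_entropy_bits(4))        # 16 bits for N = 4
print(fraction_pinned_down(4, 8))  # 0.5
```

On this crude accounting, a network that generalizes correctly from few examples must be extracting more than the examples alone pin down, i.e., exploiting a built-in bias.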

We also argue that the way in which the networks can compactly represent a wide class of Boolean (and other) functions is analogous to the way in which polynomials or other families of functions can be "curve fit" to general data; specifically, they *extend the domain* and *average* noisy data. Alas, finding a suitable representation is generally an ill-posed and ill-conditioned problem. Even when the problem has been "regularized," what remains is a difficult combinatorial optimization problem.
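The curve-fitting analogy can be sketched in a few lines (a minimal illustration, not from the paper, using an ordinary least-squares line fit): the fitted function *averages* the noise in the samples and *extends the domain* by being evaluable at inputs not in the data.

```python
# Least-squares fit of a straight line to noisy samples (illustrative).
def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx  # (slope, intercept)

# Noisy samples of an underlying rule close to y = 2x + 1.
xs = [0, 1, 2, 3, 4]
ys = [1.1, 2.9, 5.2, 6.8, 9.0]
slope, intercept = fit_line(xs, ys)
print(round(slope, 2), round(intercept, 2))   # noise is averaged out
print(round(slope * 10 + intercept, 2))       # evaluated outside the data
```

The analogy also carries the warning in the text: with too many free parameters the fit is ill-posed, and many wildly different curves pass through the same points.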

When a network is given more resources than the minimum needed to solve a given task, the symmetric, low-order, local solutions that humans seem to prefer are not the ones that the network chooses from the vast number of solutions available; indeed, the generalized delta method and similar learning procedures do not usually hold the "human" solutions stable against perturbations. Fortunately, there are ways of "programming" into the network a preference for appropriately chosen symmetries.
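For readers unfamiliar with the family of learning procedures referred to here, the following toy sketch shows a plain delta rule (gradient descent on squared error) training a single sigmoid unit on Boolean AND; the setup and hyperparameters are our illustrative assumptions, not the paper's experiments, and the generalized delta method extends the same idea through multiple layers.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_and_gate(epochs=10000, lr=0.5, seed=0):
    """Delta-rule training of one sigmoid unit on AND (toy example)."""
    rng = random.Random(seed)
    w = [rng.uniform(-0.5, 0.5) for _ in range(3)]  # two weights + bias
    data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
    for _ in range(epochs):
        for (x1, x2), t in data:
            y = sigmoid(w[0] * x1 + w[1] * x2 + w[2])
            delta = (t - y) * y * (1 - y)  # d(squared error)/d(net input)
            w[0] += lr * delta * x1
            w[1] += lr * delta * x2
            w[2] += lr * delta
    return w

w = train_and_gate()
outputs = [round(sigmoid(w[0] * x1 + w[1] * x2 + w[2]))
           for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]]
print(outputs)
```

Nothing in this procedure favors a symmetric or low-order weight configuration: gradient descent settles on whichever of the many adequate solutions it reaches first, which is the instability the paragraph above describes.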