Deep Learning and Cognitive Science

Pietro Perconti, Alessio Plebe

Abstract

In recent years, the family of algorithms collected under the term "deep learning" has revolutionized artificial intelligence, enabling machines to reach human-like performance in many complex cognitive tasks. Although deep learning models are grounded in the connectionist paradigm, their recent advances were basically developed with engineering goals in mind. Despite their applied focus, deep learning models eventually seem fruitful for cognitive purposes. This can be thought of as a kind of biological exaptation, where a physiological structure becomes applicable to a function different from that for which it was selected. In this paper, it will be argued that it is time for cognitive science to seriously come to terms with deep learning, and we try to spell out the reasons why this is the case. First, the evolution of deep learning from the connectionist project is traced, demonstrating the remarkable continuity as well as the differences. Then, it will be considered how deep learning models can be useful for many cognitive topics, especially those where they have achieved performance comparable to humans, from perception to language. It will be maintained that deep learning poses questions that cognitive science should try to answer. One such question is why deep convolutional models that are disembodied, inactive, unaware of context, and static are by far the closest to the patterns of activation in the brain visual system.

Keywords: deep learning, embodied cognition, enactive vision, artificial neural networks, mental representations

1. Introduction

The family of techniques collected under the name of deep learning is responsible for the current AI Renaissance, the fast resurgence of artificial intelligence after several decades of slow and unsatisfactory advances. In 2012, the group at the University of Toronto led by Geoffrey Hinton, the inventor of deep learning, won ImageNet, the most challenging image classification competition. In 2016, the company DeepMind, founded by Demis Hassabis and soon acquired by Google, defeated the world champion of Go, the Chinese board game much more complex than chess (Silver et al., 2016). The leading Internet companies were among the first in employing deep learning on a massive scale (Hazelwood et al., 2018) and are also the largest investors in research well beyond their own applications. Thanks to this vast success, deep learning was featured on the covers of journals such as Science (July 2015), Nature (January 2016), and The Economist (May 2015).

The astonishing success of deep learning was totally unexpected (Plebe and Grasso, 2019). The most surprising aspect is that the technology contains only minor improvements over artificial neural networks, a field that was stagnating at the beginning of this century. Hinton himself was one of the protagonists of the invention of the artificial neural networks of the '80s (Hinton et al., 1986; Rumelhart et al., 1986). We deem one of the most distinctive differences between the first generation of artificial neural networks and the current deep learning enterprise to be a matter of focus. The primary motivation for the development of the early neural networks was the study of cognition.
Most of the main players in the group that gave birth to the first artificial neural networks in the '80s were psychologists: Geoffrey Hinton, Michael Jordan, James McClelland, and David Rumelhart. In contrast, the majority of modelers in deep learning are totally indifferent to cognitive studies, with a few notable exceptions like Yoshua Bengio (see for example Bengio et al., 2013; Chevalier-Boisvert et al., 2019; Ke et al., 2019). Even if several of the protagonists of deep learning are the same scientists associated with earlier artificial neural networks – Hinton included – the scope has drastically shifted towards engineering goals.

Let us provide an example. In current cognitive science the proposal of Karl Friston about the fundamental predictive activity of the brain and the related free-energy principle is well known and widely discussed. At the heart of his proposal there is a formal expression of free energy, derived from Bayesian variational inference (Friston and Stephan, 2007; Friston and Kiebel, 2009; Friston, 2010). On the deep learning side, an important advancement was achieved a few years ago with an architecture known as the "variational autoencoder", introduced independently by Kingma and Welling (2014) and by Rezende et al. (2014). Variational autoencoders in deep learning are a precise correlate of Friston's free-energy principle in the brain, and the mathematical formulations are almost the same. Curiously, Kingma & Welling glaringly neglect the connection between their new architecture and its cognitive counterpart, as do Rezende and co-authors. This striking connection is ignored in all the further refinements of the variational autoencoder in the deep learning community, and was first acknowledged only by Ofner and Stober (2018).

We are rather inclined to push the difference in scope between the earlier artificial neural network community engaged in cognitive explorations and deep learning even further. To the extent that modelers withdrew from pursuing cognitive investigations, the design of neural models was allowed much more freedom in adopting mathematical solutions alien to mental processes. However, we argue that now deep learning, despite this recent tradition, can and should have its say in cognitive science. There is at least one simple reason: the engineering objectives of deep learning have been met with such success that, for the first time, we have artificial models performing complex cognitive tasks at human performance level. The era of toy worlds in which models are restricted to highly simplified versions of cognitive capabilities is over. We now have empirical examples of algorithms solving cognitive tasks at the full scale of complexity.

The resonance of the successes of deep learning has already stirred up reflections and discussions within cognitive science and philosophy (Edelman, 2015; Lake et al., 2017; López-Rubio, 2018; Marcus, 2018; Ras et al., 2018; Cichy and Kaiser, 2019; Landgrebe and Smith, 2019; Schubbach, 2019). However, most of the focus of these reflections is on the chances and limits of deep learning in fulfilling the promises of artificial intelligence, in particular the possibility of reaching the most coveted goal, so-called "general artificial intelligence". These are important themes, but the focus of this paper is different. Our reflections are on how certain empirical achievements of deep learning may illuminate crucial debates in cognitive science.
An easy and immediate consideration is that deep learning may lead to a revision of old debates in cognitive science, in which the first generation of neural networks was engaged, such as symbolic/subsymbolic, or innate/acquired knowledge. Some of these issues are already discussed in a few of the works just cited, like Marcus (2018) and Landgrebe and Smith (2019). But, in our opinion, the results of deep learning may play a major role in debates that characterize contemporary cognitive science. There is in particular one debate that is shaking the foundations of cognitive science: the rejection of the concepts of computation and representation (Gelder, 1995; Chemero, 2009; Hutto and Myin, 2013). The antirepresentationalist and anticomputationalist stances are related to so-called "4E cognition", i.e., embodied, embedded, enactive, and extended, which encompasses a wide variety of positions, not necessarily committed to antirepresentationalism and anticomputationalism. Embodiment has taken center stage in cognitive science for several decades (Lakoff and Johnson, 1999). Taking the body as the locus of actions, embodiment has naturally implied enactive cognition (O'Regan and Noë, 2001; Noë, 2004), and since the body interacts with its environment, embodiment also contributes to the possibility of reconciling traditional cognitive science with Gibson's ecological psychology (Gibson, 1966, 1979; Heras-Escribano, 2019). The concepts of embodiment, enactivism, and embeddedness have all certainly contributed to fundamental advances in cognitive science; however, the assessment of their relative roles in cognition is currently highly controversial (Aizawa, 2015; Mahon, 2015; Goldinger et al., 2016).

The performances of deep learning are disconcerting from the perspective of all these alternative stances in cognitive science, and we are surprised that this has gone almost unnoticed. Deep neural models are entirely based on representations and computations. Mostly, their main results are achieved with computations that disregard any action, any embodiment, any dynamics, any interaction with the environment. It is certainly necessary to be cautious in drawing conclusions: as mentioned above, we have to keep in mind that deep neural models are not intended as tools for studying cognition, and are not primarily intended to be biologically plausible models. Nevertheless, we believe the achievements of deep learning to be worthy of reflection on the aforementioned debates.

The paper is organized as follows. In §2 we will recap the fundamental role of the ideas of computation and representation in cognitive science. We will then examine deep learning on the basis of an analysis of the two words of which its name is composed. In §3 we start with "learning", because it represents the continuity with the main tenet of connectionism, grounded in empiricism. Then, in §4, the concept of "deep" has its turn, as the pivot of the discontinuity between the current neural network models and their precursors. Section §5 will describe the current challenges to the foundations of cognitive science, and section §6 will discuss the impact of deep learning results on these challenges. We mainly address results in vision, for two reasons. First, vision is a paradigmatic case used in support of 4E cognition, and also the most successful field of application for deep learning. Second, there are several recent studies claiming that the specific deep learning models used for vision have some level of biological plausibility.
Finally, in §7 we draw some tentative conclusions on how deep learning results may illuminate contemporary debates in cognitive science.

2. The computational turn in psychology

Until the early '90s, the representational computational theory of mind (RCTM) was the standard in cognitive science (Fodor, 1987, 1998; Pinker, 1997), and computational psychology was considered to be a genuine scientific achievement able to solve the vexata quaestio of the mind-body problem. Nowadays, the landscape of computational psychology is deeply enriched by other perspectives, including predictive coding theories, Karl Friston's free energy, and predictive engagement (Friston, 2012; Allen and Friston, 2018; Gallagher and Allen, 2018). According to predictive processing, the brain is to be considered as an inference engine working by means of Bayesian hypotheses. In this account, organisms use predictive models of the world to shape adaptive behaviors. Active inference in the brain takes advantage of internal generative models to predict incoming sensory data, aiming at maintaining the best possible balance between the organism and its ecological niche (Constant et al., 2019) (cf. §5).

Classical computational psychology is also challenged by 4E cognition (cf. §5). In particular, radical enactivists argue for the idea that cognitive science can do without mental representation, or use a very minimal concept of what a representation is (Hutto and Myin, 2013; Gallagher, 2017). On their side, connectionists use highly distributed representations, not sentence-like representations. While classical computational psychology stored separate blocks of information in syntactically structured representations, connectionist networks process many kinds of information across their units and weights. Following this line of reasoning, one could think of ruling out the notion of "representation" itself from the scene of cognitive science. We should, however, not throw the baby out with the bathwater, that is, not rule out representation itself, because we need a more ecological notion of "representation" than the classical computational one. Gualtiero Piccinini (2008), after having considered the possibility of a kind of computation without representation, claims that cognitive computations, in a wide sense, are "any process whose function is to manipulate medium-independent vehicles according to a rule defined over the vehicles" (Piccinini and Scarantino, 2010, 239). It remains, of course, to explain how exactly this happens in human brains, and then whether it is possible to replicate it in an artificial machine. Piccinini argues that neural computation is sui generis: "Typical neural signals are neither continuous signals nor strings of digits", that is, they are graded and continuous signals, but consist of discrete functional elements, like spikes (Piccinini and Bahar, 2013, 453).

Among others, Charles Gallistel is a key figure in making computational psychology as ecological as desired. In his perspective, and in general in the view of structural representations, which will be further examined in section §6.3, a given representation is understood as an abstract rule, physically realized in neural activation patterns, which connects observable behaviors (domain) with a set of environmental inputs (co-domain) (Gallistel, 1990b; Beck, 2013). What is connected is physical, but the rule of the connection is an abstract theoretical construct.
This use of the word "representation" is even more minimal than the minimal sense of Mark Rowlands (2006, 113-14) (see also Wheeler, 2005), according to which an action-oriented representation (AOR) should be intentional, teleological, compositional, decouplable from its reference, and able to misrepresent it. Being more minimal than Rowlands' minimal representation, the use of the word "representation" adopted here is immune to its most common criticism (Gallagher, 2008). This sense of the phrase "mental representation" is available for computational purposes, but it has at the same time a naturalistic counterpart in brain activation patterns and in the environmental set of inputs able to elicit them. Thanks to this double nature, i.e., being both computational and naturalistic in kind, computational psychology is able to provide a basis for both the mathematical advances in machine learning and neuro-computational modeling.

3. The "learning" paradigm

The two most outstanding philosophical traditions within artificial intelligence are probably the rationalist and the empiricist ones. The "learning" label in "deep learning" flags unequivocally that it belongs to the empiricist party. The empiricist community in artificial intelligence obviously connects to the philosophical tradition of Locke and Hume, for whom concepts are the products of experience, and reason gets all its materials from perception. In the brave new world of computing, Alan Turing (1948) was the first to advance the idea that computers can be designed simply by letting them learn by themselves. He envisioned a machine based on distributed interconnected elements, called the B-type unorganized machine. Turing's neurons were simple NAND gates with two inputs, randomly interconnected; each NAND input could be connected or disconnected, and thus a learning method could "organize" the machine by modifying the connections. His idea of learning generic algorithms by reinforcing useful links and by cutting useless ones was the most farsighted idea of this report, anticipating the empiricist approach characteristic of deep learning. Turing made his commitment to empiricism concerning the human mind explicit (Turing, 1948, p.16):

We believe then that there are large parts of the brain, chiefly in the cortex, whose function is largely indeterminate. [. . . ] All of this suggests that the cortex of the infant is an unorganised machine, which can be organised by suitable interfering training.

Turing's employer at the National Physical Laboratory, for which the report was produced, was not as farsighted as Turing. He dismissed the work as a "schoolboy essay" (Copeland, 2004, p.401). Therefore, this report remained hidden for decades, until it was brought back to light by Copeland and Proudfoot (1996). In the meantime, under the influence of the newborn cognitive science, early artificial intelligence favored the rationalist side. The rationalist tradition, which started with René Descartes and was continued by philosophers such as Leibniz, Spinoza, and Kant, is at the heart of the formal languages in logic developed during the last century (Novaes, 2012). The rationalist soul in artificial intelligence drew heavily from formal languages, in building models of reasoning based on symbols and symbol processing (Newell et al., 1957, 1959; Newell and Simon, 1972). Although the unorganized learning machine of Turing was still unknown, there were a few attempts to create learning devices inspired by the way neurons learn.
Marvin Minsky (1954) designed SNARC (Stochastic Neural Analog Reinforcement Calculator), the first neural computer, assembling 40 "neurons", each made with six vacuum tubes and a motor to adjust its connections mechanically. From about 1960, however, the rivalry between empiricists and rationalists escalated up to the emergence of distinct sociological coteries (Boden, 2008, ch.11–12). The rationalist coalition had gained a certain dominance, prompted by the impressive initial success of symbolic computing within the newborn artificial intelligence, and even Minsky was attracted to its side. In 1958 an ambitious project led by Frank Rosenblatt (1958, 1962) threatened to shake the rationalist supremacy. It was the Perceptron project, funded by the US Office of Naval Research and carried out at Cornell Aeronautical Laboratory, resulting in a photoelectric machine with eight "neurons" and connections that could be adjusted according to a learning rule. The project received considerable and enthusiastic media coverage, with consequent irritation in the rationalist community of artificial intelligence, culminating in a stinging critique raised by Minsky and Papert (1969). They replicated the perceptron machine, with the purpose of highlighting its limitations. As remarked by Olazaran (1996) in his historical and sociological analysis of the perceptron controversy, "replication of this kind is quite unusual in science, and it occurs only when the claim under discussion is particularly important". The simplest example of the limitations of the perceptron is the XOR function, the exclusive disjunction, which is not linearly separable and cannot be learned by such a machine. The attack by Minsky and Papert achieved the desired effect, marginalizing artificial neural network research for decades.

3.1. The succeeding invention of backpropagation

It was late in the '80s that artificial neural networks found their way, with the PDP (Parallel Distributed Processing) project of Rumelhart and McClelland (1986b). The basic structure of "parallel distributed" machinery is made of simple units organized into distinct layers, with unidirectional connections between each layer and the next one. This structure, known as the feedforward network, is preserved in most deep learning models. PDP adheres to a radical empiricist account, with models that learn any possible meaningful function from scratch, just by experience. The success of PDP was largely due to an efficient mathematical rule, known as backpropagation, for adapting the connections between units, from examples of the desired function between known input and output. Let $\vec{w}$ be the vector of all learnable parameters in a network, and $L(\vec{x}, \vec{w})$ a measure of the error of the network with parameters $\vec{w}$ when applied to the sample $\vec{x}$; backpropagation updates the parameters iteratively, according to the following formula:

$$\vec{w}_{t+1} = \vec{w}_t - \eta \nabla_{\vec{w}} L(\vec{x}_t, \vec{w}_t) \qquad (1)$$

where $t$ spans over all available samples $\vec{x}_t$, and $\eta$ is the learning rate.

Backpropagation actually has a longer history. The term was used for the first time by Frank Rosenblatt (1962), one of the pioneers of artificial neural networks cited earlier. Rosenblatt attempted to generalize his perceptron architecture (Rosenblatt, 1958), based on a single layer, to multiple layers by means of back propagation. His attempt was different from equation (1) and not especially successful.
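To make the update rule of equation (1) concrete, here is a minimal sketch in Python with NumPy (the network size, the squared-error loss, and the training data are our own illustrative assumptions, not the original PDP code). It applies the per-sample gradient step of equation (1) to a tiny one-hidden-layer network trained on the XOR function, the very case that a single-layer perceptron cannot represent.

```python
import numpy as np

# Sketch of the per-sample update of equation (1):
#   w_{t+1} = w_t - eta * grad_w L(x_t, w_t)
# applied to a one-hidden-layer network learning XOR.
# All sizes and names here are illustrative choices.

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[0], [1], [1], [0]], dtype=float)

W1 = rng.normal(0, 1, (2, 8))   # input -> hidden weights
b1 = np.zeros(8)
W2 = rng.normal(0, 1, (8, 1))   # hidden -> output weights
b2 = np.zeros(1)
eta = 1.0                       # learning rate

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for epoch in range(10000):
    for x, y in zip(X, Y):                     # one sample x_t at a time
        h = sigmoid(x @ W1 + b1)               # forward pass
        o = sigmoid(h @ W2 + b2)
        delta_o = (o - y) * o * (1 - o)        # gradient of L = (o - y)^2 / 2
        delta_h = (delta_o @ W2.T) * h * (1 - h)
        W2 -= eta * np.outer(h, delta_o)       # update, as in equation (1)
        b2 -= eta * delta_o
        W1 -= eta * np.outer(x, delta_h)
        b1 -= eta * delta_h

out = sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2)
print(np.round(out, 2))   # typically close to [0, 1, 1, 0] after training
```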
Paul Werbos (1994), by titling his book The Roots of Backpropagation, claimed to be the originator of the backpropagation algorithm, developed in his PhD thesis at Harvard (Werbos, 1974), even if he did not use this name. The supervisor of Werbos was Karl Deutsch, one of the leading social scientists of the 20th century, and one of the first to introduce statistical methods and formal analysis into the political and social sciences. The novel technique developed by Werbos aimed at testing the Deutsch-Solow model of national assimilation and political mobilization (Deutsch, 1966) on real data. For this purpose, he used an iterative technique, termed dynamic feedback, in which derivatives of the error estimates with respect to the parameters were computed. Both the algorithm of Werbos and the backpropagation formulated by Rumelhart and Hinton have their roots in the gradient methods, intensively developed in the first half of the last century for several engineering applications (Levenberg, 1944; Curry, 1944; Polak, 1971). Therefore, there was a mature mathematical context in the '70s, applied to engineering problems, fertile enough to be adapted to the formulation of artificial learning.

3.2. Neural networks and cognitive science

There is little doubt that learning is the foundation of cognition for the PDP group, and the successful artificial realization of learning by backpropagation would have paved the road for the "Explorations in the Microstructure of Cognition", as the title of the book by Rumelhart and McClelland (1986b) indicates. One of the most widely cited chapters in the PDP book (Rumelhart and McClelland, 1986a) became the standard bearer of empiricism against the rationalist orthodoxy in cognitive science at that time. It presented a model of how children learn the past tense of English verbs, without any predefined rules in mind. It challenged rationalism in two specific fundamental assumptions within cognitive science: nativism and the psychological reality of processing rules. Both assumptions were strongly defended in linguistics by Noam Chomsky (1966, 1968) and in the philosophy of mind by Jerry Fodor (1981, 1983), and assumed by most cognitive scientists at the time. Rationalists certainly did not retreat at all, and their defense was a harsh criticism of the PDP approach (Pinker and Prince, 1988; Fodor and Pylyshyn, 1988). Nevertheless, connectionist models, fueled with backpropagation learning, became widespread in developmental psychology, especially in the psychology of language development (Landau et al., 1988; MacWhinney and Leinbach, 1991; Karmiloff-Smith, 1992; Elman et al., 1996; Gasser and Smith, 1998; Smith, 1999; MacWhinney, 1999).

It should be noted that the empiricist turn connected to backpropagation learning was not a menace to the computational theory of mind. The PDP paradigm proposed a form of computation, whose difference from others derives precisely from the primacy of learning. In a traditional algorithm the processing steps prescribe the function to be performed between the input and the output. In a backpropagation model there is no predefined function: the processing steps define how a general model will gradually adapt to any possible function by learning. Moreover, the PDP paradigm shares with the classical computational theory of mind an affection for the idea of mental representations. The format of representations in the PDP framework is different from classical symbolic representation, again reflecting the principle of learning.
PDP representations form gradually, are never stable, and their modifications depend on experience. PDP representations found good agreement with cognitive theories of mental concepts (Rosch, 1973; Horgan and Tienson, 1989). To sum up, neural networks of the PDP style, precursors of deep learning, have been the most radical form of empiricism in artificial intelligence. Their success derives from the ingenious invention of backpropagation, which would also leave a heavy legacy in deep models. Neural networks of the PDP generation have had an important interaction with cognitive sciences, and have preserved a fundamental role for the ideas of computation and representation.

4. From shallow to "deep"

Probably the most prominent representatives of deep learning would not disapprove of the metaphorical usage of the adjective "deep" as intellectually profound, capable of entering far into a subject. But its meaning is actually technical and rather trivial: it refers to the number of "hidden" layers in a feedforward artificial neural network. Any feedforward network should include an input layer, where data is read, and an output layer, where results are produced. Hinton called the other layers in between "hidden", inspired by the use of this adjective in hidden Markov models (Anderson and Rosenfeld, 2000, p.379). Neural models can learn increasingly complex functions by augmenting the number of units; this way, however, the number of parameters to optimize in a fully connected network increases as well, and learning becomes more difficult. Units can be added using two different options: by increasing the number of units in the existing layers, or by adding new layers. The design of efficient artificial neural network models, 40 years ago as well as today, was based more on heuristics than on theoretical assumptions (Plebe and Grasso, 2019); one such heuristic was that following the second option for augmenting the number of units is a bad idea. In fact, it was often observed that increasing the number of units by adding layers was much less efficient than increasing the width of a single hidden layer. For example, de Villiers and Barnard (1992), based on this sort of observation, concluded this way:

We have found no difference in the optimal performance of three- and four-layered networks [. . . ] four layer networks are more prone to the local minima problem during training [. . . ] The above points lead us to conclude that there seems to be no reason to use four layer networks in preference to three layers nets in all but the most esoteric applications.

4.1. Hinton again

The "deep" addition to the PDP style of feedforward network was just a revision of this long-assumed dogma of no more than one hidden layer. The point is that the difficulty in training four-layer networks, i.e. those with two hidden layers, was not due to an intrinsic advantage of having a wide single hidden layer with respect to many smaller layers. It was the standard backpropagation learning algorithm that worked quite well with models with one hidden layer – now called "shallow" – and lost efficiency with more hidden layers. Hinton and Salakhutdinov (2006) succeeded in training a model with four hidden layers by inventing a novel learning strategy, called the Deep Belief Network. The "belief" was borrowed from the Belief Networks (Pearl, 1986), popular in expert systems, which Hinton appreciated.
These networks are unrelated to neurons: the nodes are symbolic, often propositional, and the connecting arcs express conditional probabilities. But Pearl's Belief Networks do not learn, while the core of the Deep Belief Network is, once again, in the learning strategy. It derives from a neural architecture, called Boltzmann Machines (Aarts and Korst, 1989), in which neurons have binary values that can change stochastically, with probability given by the contributions of the other connected neurons. Boltzmann Machines adapt their connections in an unsupervised way, with a sort of energy minimization. This is the reason for the dedication to the great Austrian physicist. The clever trick of Hinton was to take two adjacent layers in a feedforward network, and train them as Boltzmann Machines. The procedure starts with the input and the first hidden layer, so that it is possible to use the inputs of the dataset to train the unsupervised Boltzmann Machine model. Then, this model is used to generate a new dataset, just by processing all the inputs. This new set is used to train the next pair of layers. This procedure is a sort of pre-training that gives a first shape to all the connections in the network, to be further refined by ordinary backpropagation using both the inputs and the known outputs of the dataset.

This first success roused interest in artificial neural networks from a state of relative lethargy. As a result, many of the old neural models and ideas that had been around for decades were revived, and made fresh and deep (Schmidhuber, 2015). Examples are the Neocognitron by Fukushima (1980) for vision, LSTM (Long Short-Term Memory) units (Hochreiter and Schmidhuber, 1997) for natural language processing, and the Reinforcement Learning framework (Barto and Sutton, 1982) for action selection tasks.

4.2. Backpropagation again

As odd as it may seem, even good old backpropagation has resurged, thus dispensing modelers from using the rather complex strategy devised by Hinton with the Deep Belief Network. It was found that a few modifications made backpropagation almost as efficient with deep models as with shallow ones. The main modification is in the following equation:

$$\vec{w}_{t+1} = \vec{w}_t - \eta \nabla_{\vec{w}} \frac{1}{M} \sum_{i}^{M} L(\vec{x}_i, \vec{w}_t) \qquad (2)$$

where instead of computing the gradients over a single sample $t$, a stochastic estimation is made over a random subset of size $M$ of the entire dataset, and at each iteration step $t$ a different subset, of the same size, is sampled. Despite the strong similarity between equations (1) and (2), the term "backpropagation" is now out of fashion. Techniques related to equation (2) are referred to as stochastic gradient descent. This change in name gives credit to a different mathematical context, that of stochastic approximation, established by Robbins and Monro (1951). The idea is to solve the equation $f(\vec{w}) = \vec{a}$ for a vector $\vec{w}$, in the case when the function $f$ is not observable, using samples of an auxiliary random function $g(\vec{w})$ such that $\mathbb{E}[g(\vec{w})] = f(\vec{w})$. The solution is obtained by the following iterative equation:

$$\vec{w}_{t+1} = \vec{w}_t - \frac{\alpha}{t}\left(g(\vec{x}_t) - \vec{a}\right) . \qquad (3)$$

Stochastic approximation was mostly developed in engineering domains, and has turned into an ample mathematical discipline (Kushner and Clark, 1978; Benveniste et al., 1990). This mathematical domain provided a fertile context for developing more and more efficient variations of learning techniques for deep neural networks (Bottou and LeCun, 2004; Kingma and Ba, 2014; Schmidt et al., 2017).
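As a minimal illustration of the practical difference between the per-sample update of equation (1) and the minibatch estimate of equation (2), the following sketch (plain NumPy; the quadratic loss and the synthetic data are our own illustrative assumptions) runs stochastic gradient descent on a toy linear regression problem, drawing a fresh random subset of size M at every iteration.

```python
import numpy as np

# Stochastic gradient descent as in equation (2): at each step the gradient
# is estimated on a random minibatch of size M, not on a single sample.
# Linear least squares is used here only as a toy loss.

rng = np.random.default_rng(0)
N, M, eta = 1000, 32, 0.1

X = rng.normal(size=(N, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=N)      # noisy targets

w = np.zeros(3)
for t in range(2000):
    idx = rng.choice(N, size=M, replace=False)  # random subset of size M
    xb, yb = X[idx], y[idx]
    grad = xb.T @ (xb @ w - yb) / M             # gradient of (1/M) sum_i L(x_i, w)
    w = w - eta * grad                          # w_{t+1} = w_t - eta * grad

print(np.round(w, 2))   # should end up close to [2.0, -1.0, 0.5]
```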
In summary, deep learning preserves the main philosophy of radical empiricism in all its variants and applications. Its chances of functioning depend entirely on learning from experience. There is, however, a fundamental difference in aims between the first generation of artificial neural networks and deep neural models. The former was motivated primarily by "Explorations in the Microstructure of Cognition" (Rumelhart and McClelland, 1986b), as discussed in §3. On the contrary, the development of deep neural models is mainly driven by applications; therefore the prevailing part of the deep learning community has far less ambition or interest in exploring cognition.

5. Computationalism in decline

The popularity of the Parallel Distributed Processing (PDP) approach has been based both on the idea that brains actually compute in a parallel and distributed way and on the technical results achieved by artificial neural networks. In other words, ecology and efficiency were the main reasons why neural networks appeared as such a promising El Dorado. Predictive coding models promise to explain higher level cognitive phenomena, like binocular rivalry (Churchland, 1996). As mentioned above, however, before the success of deep learning networks, the efficiency of the early artificial neural networks was low. On the other side, the ecological advantages of neuro-computationalism did not fade away while the efficiency of neural networks remained unsatisfactory (Plebe and De La Cruz, 2016).

4E cognition presents itself as a more radical option, able to change the framework within which the phenomenon of knowledge should be understood. And – a particularly interesting effect for cognitive science – this would happen in a way which seems incompatible with some key assumptions of computational psychology. Even the ecological advantages of PDP have not yet seemed enough in the eyes of the theorists who are enthusiasts of 4E cognition. They sincerely appreciate that PDP computational architectures are somehow bio-inspired and intended to be plausible from an ecological point of view. But the very idea of a static scene, in which a given subject represents something in front of him by means of a computational device inside his head, remains something to be rejected. The preferred alternative is a dynamic scenario in which the subject is not engaged in representing the world in front of him, but in interacting with it through an action which is oriented to a goal. "Action", in fact, is the new magic word in 4E cognition, instead of "representation". Today we could say that cognitive science is characterized by a sort of "Faustian turn", inspired by the celebrated statement in Goethe's Faust: "Am Anfang war die Tat", "In the beginning was the Action", instead of the Bible's "Word". Preferring the vocabulary of action to that of language is a typical trend in embodied cognitive science. Even the typical examples used in papers and talks reveal this shift. Instead of people representing an object which lies in front of them through linguistic resources, we now have somebody who typically tries to grasp an object located in their perceptual scene. According to Baggs and Chemero (2018), the future of embodied cognitive science lies in unifying the two major trends, that is, radical enactivism and ecological psychology.
In their words: "Both ecological psychology and enactivism reject the idea that cognition is defined by the computational manipulation of mental representations; both also focus on self-organization and nonlinear dynamical explanation. They differ in emphasis, however. Ecological psychology focuses on the nature of the environment that animals perceive and act in; enactivism focuses on the organism as an agent. Combining the two would seem to provide a complete picture of cognition: an enactive story of agency, and an ecological story of the environment to which the agent is coupled". The die is cast. Instead of the "New Synthesis" of evolutionary psychology, that is, the combination of the RCTM and evolutionary biology, proposed by Steven Pinker when computational psychology seemed at its peak, Baggs and Chemero argue that "an enactive story of agency" would be able to provide the new "complete picture of cognition", made up of the combination of radical enactivism and ecological psychology. This new scenario is a playground for the theoretical ambitions of the 4E approach, as can be appreciated, for instance, in the attempt to extend the "Faustian turn" into traditional "static" fields of investigation, like the machinery underlying the functioning of abstract concepts and words (Borghi and Binkofski, 2014; Borghi et al., 2019).

Deep learning networks are facing the same kind of challenge. Like the case of vision, which will be addressed in the next section as a revenge for computationalism, consciousness too is a field of investigation which has recently been stagnant. After having appreciated the hard problem of the qualitative side of consciousness, cognitive science is still in trouble with an ecological and – at the same time – computational account of what it means to be conscious. The explanatory gap between the functional vocabulary, with the corresponding capacity to design computational architectures, and the phenomenology of consciousness is still an epistemological puzzle. Analogously, the 4E cognition framework is a promising theoretical perspective, but it is focused on low level perception-action mechanisms. It is less encouraging for modeling high level cognitive processes. For these reasons, it is possible to consider the recent advances in deep learning networks as a sort of "revenge" for computationalism, insofar as they allow us to successfully deal with fields of investigation that have been relatively stagnant from the computational point of view, like consciousness and vision.

This theoretical issue is exactly where two senses of the word "phenomenology" become almost the same. On one side, there is phenomenology as the well-known philosophical trend. Some of its basic ideas, like anti-reductionism, seem incompatible with the framework of cognitive science. Other features, like the dynamic account of the interaction between the subject and the environment, conceived in an action-oriented and bodily dependent way, are used as an excellent basis for the "Faustian turn" in cognitive science. On the other side, there is phenomenology used as something to explain what we are talking about when we refer to qualia, "what it is like to be" for other individuals, and the phenomenal aspects of experience. Leading figures of 4E cognition, like Shaun Gallagher and Dan Hutto, display both features, being both phenomenologists (in the philosophical sense) and cognitive scientists engaged in the embodied turn.
Deep learning networks, in fact, promise to be useful in this attempt to address high level cognitive processes, like consciousness, both in terms of accessibility and phenomenology (Mallakin, 2019). The Consciousness Prior Hypothesis by Yoshua Bengio (2017) is a paradigmatic example of this trend. His aim is exactly to extend the achievements of deep learning beyond the field of predictive and unconscious inference over low level inputs: "Instead of making predictions in the sensory (e.g. pixel) space, one can thus make predictions in this high level abstract space, which do not have to be limited to just the next time step but can relate events far away from each other in time" (Bengio, 2017). By combining the attention mechanism, able to extract some pertinent features from the background, with the ability to make predictions about the future – two typical cognitive devices involved in consciousness – Bengio realized that machine learning and deep learning networks are the best computational architectures available to model them. As we will show in the next section, the case of vision could be seen as a revenge for computationalism for the same reason, i.e., being both a stagnant field of investigation after the classical computational achievements and a challenge for a comprehensive account of its phenomenological experience.

6. Pure vision: a renaissance?

For a long time, cognitive vision research was shaped by the computational approach outlined by the late David Marr (1982). One of the earliest yet most persuasive papers challenging this mainstream wisdom is A critique of pure vision by Churchland et al. (1994). The authors clarified that "pure vision" in the form they describe is just a caricature of the position of Marr and other vision cognitive scientists, adopted for the convenience of the argument. The caricatured theory of pure vision conforms to the following three tenets (p.25):

1. The Visual World. [. . . ] The goal of vision is to create a detailed model of the world in front of the eyes in the brain [. . . ]
2. Hierarchical Processing. [. . . ] At successive stages, the basic processing achievement consists in the extraction of increasingly specific features [. . . ]
3. Dependency Relations. Higher levels in the processing hierarchy depend on lower levels, but not, in general, vice versa.

It is difficult to find an approach to artificial vision more "shamelessly pure" than contemporary deep learning models, according to this caricatured description of "pure vision". Yet, their performances have left researchers in vision science astonished, breaking the well-established wisdom of an unbridgeable gap between artificial and natural vision. See, for example, VanRullen (2017):

For decades, perception was considered a unique ability of biological systems, little understood in its inner workings, and virtually impossible to match in artificial systems. But this status quo was upturned in recent years, with dramatic improvements in computer models of perception brought about by 'deep learning' approaches [. . . ] For as long as I can remember, we perception scientists have exploited in our papers and grant proposals the lack of human-level artificial perception systems [. . . ] But now neural networks [. . . ] routinely outperform humans in object recognition tasks [. . . ] Our excuse is gone.

Artificial vision as solved by deep learning models is the only sign of mini-singularity (Kurzweil, 2005) we have so far.

6.1. Impure visions
Before describing how "naively pure" deep learning is, let us dwell for a bit on the less naive views that have emerged. Churchland and co-workers characterized a conception alternative to "pure vision" as interactive vision, by stressing the interaction of vision with other sensory systems, and its function of guiding actions. A substantial part of their work is a review of compelling neurocognitive evidence of close and complex interactions of the visual system with non-visual signals. Soon after, Rao and Ballard (1995) dropped the "inter", launching the concept of active vision, in which the connection between vision and activity is essentially in the movements of the eyes. Rao and Ballard fostered the simulation of saccades, in order to improve machine vision. The progressive dismissal of Marr's concept of vision led to the rediscovery of James Gibson (1979), whose idea of direct perception, dispensing with heavy information processing, had been strongly criticized by Marr. The ecological approach of Gibson, and his celebrated concept of "affordances", fit well with the new directions in cognitive vision (Nakayama, 1994). Alongside the "Faustian turn" (see §5) in cognitive science, Jeannerod (1994) and Jacob and Jeannerod (2003) – among others – have placed great emphasis on the close link between vision and motor representations at the neural level. The visual processing of objects and their attributes is driven by the kind of task the subject is performing, and object affordances are transformed into specific motor representations. O'Regan and Noë (2001) and Noë (2004) push "impure" vision even further, inventing the label enactive vision, where perception is not just a process useful for action: it is a sort of action by itself. While the previous accounts of ecological and active vision still rely on the notion of mental/brain representations, enactive vision raises significant concerns against the need to postulate internal representations. This position was initially moderate, as in the words of Noë (2004, p.22):

The claim is not that there are no representations in vision. That is a strong claim that most cognitive scientists would reject. The claim rather is that the role of representations in perceptual theory needs to be reconsidered.

With the expansion of the "Faustian turn" in 4E cognition, the claim that there are no representations becomes less heretical, allowing Noë (2010) to reject visual representations without hesitation. Not surprisingly, this position is pursued even more sharply by Myin and Degenaar (2014).

The computer vision community was rather slow and reluctant in abandoning "pure vision", because of the clear advantages of the three tenets of pure vision in the design of software. However, the lesson implied in the various "impure" approaches was that trying to interpret the content of images with a static hierarchy of feature extraction is hopeless. Already Churchland et al. (1994, p.50) reported, as a marginal argument, the difficulties of machine vision systems based on "pure vision" in tasks as easy as reading bar codes. Vision became effective when treated as an interactive process oriented by the goals of the seeing agent. Therefore, attracted by the possibility of a significant gain in performances, several artificial vision systems inspired by embodiment and enactivism were designed during the '90s (Blake and Yuille, 1992).
Most of these vision systems were integrated into robots (Viéville, 1997) that offer a physical surrogate of the body, but there have also been examples of systems where "action" is just simulated (Beer, 2003; De Croon et al., 2009). Soon, it turned out that the complexity in building embodied/enactive vision systems was not compensated at all by a gain in performance, and such attempts became rare and without any marketable application. The poor results of these models would not necessarily threaten the validity of the 4E cognition on which they are based. Put simply, it might be the case that the task of vision is just too difficult for artificial approaches. This way out, however, clashes with the results of the deep models described below.

6.2. How naive is deep learning

We will now qualify our claim that deep learning fits so well with the description of the naive "pure vision" approaches. For this purpose it is useful to dig a bit into the rapid rise of deep learning in vision. Once again, it was Hinton who first pushed deep models towards unexpected results in vision, but with an architecture that was different from his earlier deep belief networks (see §4.1). Hinton and his PhD student Alex Krizhevsky (Krizhevsky et al., 2012) adapted an old model of the PDP era, called the Neocognitron (Fukushima, 1980), and transformed it into a deep model. It alternates layers of S-cell type units with C-cell type units, and those names are evocative of the classification into simple and complex cells by Hubel and Wiesel (1962, 1968). The S-units act as convolution kernels, while the C-units downsample the images resulting from the convolution by spatial averaging. The crucial difference from conventional convolution in image processing (Rosenfeld and Kak, 1982; Bracewell, 2003) is that the kernels are learned. The first version of the Neocognitron learned by unsupervised self-organization, with a winner-take-all strategy: only the weights of the maximum responding S-units, within a certain area, are modified, together with those of neighboring cells. A later version (Fukushima, 1988) used a weak form of supervision: at the beginning of the training the units to be modified in the S-layer were selected manually rather than by winner-take-all. After this first sort of seeding, training proceeded in an unsupervised way.

Hinton and his group called the deep version of the Neocognitron a Deep Convolutional Neural Network. Their first model has five layers of convolutions, each with a large number of different kernels, followed by three ordinary neural layers, with a total of 60 million parameters. The model participated in the ImageNet Large-Scale Visual Recognition Challenge, which has been the standard benchmark for large-scale object recognition since 2010 (Russakovsky et al., 2015). The model, now known colloquially as AlexNet, dominated the challenge, dropping the previous error rate from 26.0% down to 16.4%. This first success steered computer vision towards deep models, and many new designs continued to improve performance. The model VGG-16 (Simonyan and Zisserman, 2015), with thirteen convolutional layers and three ordinary layers, and kernels smaller than AlexNet's, achieved an error of 7.3% in the 2014 ImageNet challenge, further improved to 6.7% by the Inception (or GoogleNet) model (Szegedy et al., 2015). Several refinements continued to improve performance, even surpassing that of human subjects (Rawat and Wang, 2017).
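As a rough illustration of this family of architectures, the following sketch (in PyTorch; the layer sizes and names are our own toy assumptions, far smaller than AlexNet or VGG-16) stacks learned convolution kernels with spatial downsampling, followed by ordinary fully connected layers, in the strictly feedforward hierarchy just described.

```python
import torch
import torch.nn as nn

# A toy deep convolutional network in the AlexNet lineage: stacked learned
# convolutions (the "S-cell" analogue) with pooling for spatial downsampling
# (the "C-cell" analogue), followed by ordinary fully connected layers.
# Layer sizes are arbitrary illustrations, far smaller than AlexNet.

class TinyConvNet(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),   # learned kernels, low-level features
            nn.ReLU(),
            nn.MaxPool2d(2),                               # downsampling
            nn.Conv2d(16, 32, kernel_size=3, padding=1),   # higher-level features
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 8 * 8, 128),                    # assumes 32x32 input images
            nn.ReLU(),
            nn.Linear(128, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))           # strictly feedforward

model = TinyConvNet()
images = torch.randn(4, 3, 32, 32)                          # a batch of fake RGB images
logits = model(images)
print(logits.shape)                                         # torch.Size([4, 10])
```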
The first assumption of "pure vision", The Visual World, is implicit in the ImageNet benchmark. It is organized according to the hierarchy of nouns in the lexical database WordNet (Fellbaum, 1998), in which each lexical entry is associated with hundreds of still images. Deep convolutional neural networks like AlexNet meet the other two assumptions of "pure vision" precisely. Hierarchical Processing: the convolutional layers are organized in a hierarchical way, with earlier convolutions extracting low level features, which in turn become the input of other convolutions that extract features at progressively higher levels. Dependency Relations: the network is strictly feedforward, with higher levels depending on lower levels, but not vice versa. Deep convolutional neural networks also ignore all of the many factors involved in natural vision indicated by 4E cognition. These models simply learn, using stochastic gradient descent (see §4.2), from examples made of images and the lexical description of the category of objects found in each image. The model is unaware of any contextual information about each image, any conceptual relationship between categories, any information about the poses each object can assume in space or about the affordances exposed by objects, and any information about how objects can change their aspect over time. In summary, the model learns to recognize objects in a fully disembodied and inactive way.

6.3. The plausibility objection

One may raise the objection that using state-of-the-art deep convolutional neural models in discussions about natural vision is misleading, insofar as these models are engineered for applications, not intended to replicate how natural vision works. Therefore, their results are irrelevant for cognitive science. For sure, just calling units in deep learning models "neurons" does not imply any resemblance to the cells in the brain. Unjustified claims on the link between deep learning and neuroscience are now quite common, as in (Arel et al., 2010, p.13):

Recent neuroscience findings have provided insight into the principles governing information representation in the mammal brain, leading to new ideas for designing systems that represent information. [. . . ] This discovery motivated the emergence of the subfield of deep machine learning, which focuses on computational models for information representation that exhibit similar characteristics to that of the neocortex.

This claim is incorrect from a historical perspective: there is a long-standing line of research on computational models of the neocortex (Plebe, 2018), far distant from deep learning. Claims like that of Arel and co-authors are also unwarranted because not one of the improvements of deep learning over PDP models reviewed in §4 is related to recent (or not so recent) neuroscientific findings. The training regime used in the most successful deep neural models is a further source of implausibility. Models for vision are typically trained with millions of static images, and as many as a thousand images for each category (Russakovsky et al., 2015). This amount of data is probably less than the equivalent experience of an infant, but there is an important difference. Once having acquired a basic knowledge of the visual environment, humans are able to learn new categories of objects and actions with a very small number of examples, and can continue to learn all their lives.
These natural forms of learning are difficult to achieve with deep neural models (Parisi et al., 2019), and far distant from the standard training regimes used in the top vision models. Nevertheless, when limiting deep learning to convolutional models for vision, there is growing evidence of striking analogies between patterns in these models and patterns of voxels in the brain visual system. One of the first attempts to relate results of deep learning to the visual system was based on the idea of adding another layer at a given level of an artificial network model to predict the space of voxel responses, and of training this level on sets of images and the corresponding fMRI responses (Güçlü and van Gerven, 2014). Using this method, Güçlü and van Gerven (2015) compared a model very similar to AlexNet (Chatfield et al., 2014) with fMRI data. Initially, subjects were presented with 1750 natural images, and voxel responses in progressively downstream areas – from V1 up to LO (Lateral Occipital Complex) – were recorded. The same images were presented to the model, and the outputs of the convolutional layers were trained – with a simple linear predictor – to predict voxel patterns. As a result, model responses were predictive of the voxels in the visual cortex above chance, with good prediction accuracy especially in the lower visual areas. This first unexpected result was immediately followed by several other studies, using variants of the same technique (Khan and Tripp, 2017; Eickenberg et al., 2017; Tripp, 2017), finding reasonable agreement between features computed by deep learning models and fMRI data.

An alternative method for comparing deep learning models and fMRI responses was offered by representational similarity analysis, introduced by Kriegeskorte et al. (2009) and Kriegeskorte (2009). This method can be applied to any sort of distributed responses to stimuli, computing one minus the correlation between all pairs of stimuli. The resulting matrix is especially informative when the stimuli are grouped by their known categorical similarities. The whole idea is that the responses across the set of stimuli reflect an underlying space in which reciprocal relations correspond to relations between the stimuli. This is exactly the idea of structural representations, one of the fundamental concepts in cognitive science (Swoyer, 1991; Gallistel, 1990a; O'Brien and Opie, 2004; Shea, 2014; Plebe and De La Cruz, 2018). Representational similarity analysis was applied by Khaligh-Razavi and Kriegeskorte (2014) in comparing responses in the higher visual cortex – measured with fMRI in humans and with cell recordings in monkeys – with several artificial models. This study is interesting because it includes models with more biological plausibility in addition to AlexNet. In particular, it included VisNet (Wallis and Rolls, 1997; Stringer and Rolls, 2002; Rolls and Stringer, 2006; Stringer et al., 2007), a highly biologically plausible model, organized into five layers, where connectivity approximates the sizes of receptive fields in V1, V2, V4, the posterior inferior temporal cortex, and the inferior temporal cortex. This network learns by unsupervised self-organization (von der Malsburg, 1973; Willshaw and von der Malsburg, 1976) with synaptic modifications derived from the Hebb (1949) rule.
Learning includes a specific mechanism called trace memory, aimed at accounting for the natural dynamics of vision, where the invariant recognition of objects is learned by seeing them move under various different perspectives. The analysis revealed that AlexNet was significantly more similar than VisNet to the structural representation of the categorical distinction animate/inanimate in the inferior temporal cortex. This wave of studies on the analogies between deep convolutional models and the visual system has led to a broad discussion in the visual neuroscience community on the relevance of deep learning models for their scientific objective. Positions range from a mostly positive acceptance (Gauthier and Tarr, 2016; VanRullen, 2017) to a cautious interest (Lehky and Tanaka, 2016; Grill-Spector et al., 2018; Tacchetti et al., 2018), down to more skeptical stances (Olshausen, 2014; Robinson and Rolls, 2015; Rolls, 2016; Conway, 2018).

However, this ongoing discussion has so far missed our point. Whatever the degree of similarity between activation in deep model layers and in areas of the brain visual system, we have found it astounding that there is a similarity at all, given the diversity of the two systems. There is no doubt that the brain visual system is embodied, enactive, and that it has developed through continuous interaction with the environment. How is it possible, then, that the most naively "pure" model, disembodied, inactive, static, unaware of context, is by far the best in predicting patterns of activation in the brain visual system? One possible answer is that there is a part of the process involved in vision that consists of extracting features at progressive and hierarchical levels, not too far from what Marr had in mind. Curiously, the rationalist tradition was deeply connected to Marr's manifesto (Newell, 1980; Pylyshyn, 1984; Egan, 1995), while the empiricist side was more fond of Marr's critics like Churchland et al. (1994). Now it is the deep learning empiricists' turn to (implicitly) reverse the decline of Marr caused by 4E cognition. In fact, the part of Marr's theory that was especially advantageous for rationalists is his three-level distinction, understood as autonomous levels of description. The autonomy of the top – computational – level permits fully rule-based models to gain explanatory value, in cognition in general, and specifically in vision (Biederman, 1987; Ullman, 1996; Draper et al., 2004). The rationalist interpretation of Marr's level distinction is controversial (Eliasmith and Kolbeck, 2015; Shagrir and Bechtel, 2018), and most of all irrelevant for deep convolutional neural models. What these models have in common with Marr's theory of vision, and what differentiates them both from 4E cognition, is the series of features collected by Churchland et al. (1994) as "pure vision".

Vision in the brain is far distant from the "pure vision" assumption: visual areas certainly collect additional information from other sensory areas, receive rich top-down feedback, trigger attentive gazing actions, and are dynamically modulated by the ongoing action plan. The suspicion is that the role of all these aspects has been gradually amplified, to the point of discarding the "pure" computational component of the vision process altogether. A possible clue to an explanation of the similarity of "pure" deep models with brain visual areas can be found in the proposal by Joel Norman (2002) of two types of processes in the ventral and dorsal visual streams.
The dorsal stream is seen to work in a manner well described by Gibson's ecological theory, and the ventral stream in a manner that Norman categorizes as "constructivist", in the Helmholtzian tradition (von Helmholtz, 1866). In this dual-approach view the 4E instances could be confined to the dorsal stream, and the "pure" deep convolutional networks may well be a modern counterpart of the Helmholtzian constructivist–inferential process. While the independence of the two visual streams may be overstated (Schenk and McIntosh, 2010), the general idea of parallel pathways for different aspects of visual processing is appealing. If this idea is right, one should expect the similarity between deep convolutional models and the brain to be stronger in areas such as V4, the inferior temporal cortex (IT), and the Lateral Occipital Complex (LOC). The results obtained by Yamins et al. (2014) confirm this expectation, but not those of Güçlü and van Gerven (2015), who found that the deep model was more predictive of brain activity in area V1 than in LOC. In the end, pursuing a "shamelessly and naively pure" strategy paid off for deep learning. Given the evidence reviewed here of similarities between patterns in deep convolutional models and in the visual cortex, one wonders whether there is a degree of "purity" in the human visual system as well.

6.4. Beyond vision: natural language processing

Vision is at the same time the most striking success of deep learning and the case where its distance from 4E cognition is most stunning, as just discussed. There is, however, more than vision. Deep learning is gradually outperforming and replacing alternative approaches in a variety of other fields, even if at a slower pace than in vision. When deep neural models achieve state-of-the-art performance in tasks highly related to human cognition, it is natural to ask what these models can suggest to cognitive science. This question is even more compelling when the task is the highest cognitive function of humans: language. As we recalled in §3.2, artificial neural networks made their way into cognitive science with a language model, and they did not do this quietly. The model of English past tense formation by Rumelhart and McClelland (1986a) was a purely empiricist example of language learning: it takes a phonological representation of the uninflected form of an English verb as input, and predicts the phonological form of its past tense. In this model there are no explicit rules for determining a past tense morpheme or for deriving the phonological shape of that morpheme: the formation of the past tense is just learned from examples. Despite – or possibly because of – the harsh rebuttal of this model by Pinker and Prince (1988), artificial neural networks met significant success in linguistics at the beginning of the '90s, especially among developmental linguists. On the other side, for the rationalist component of cognitive science and linguistics the critique of Pinker and Prince was fully successful. One of their most compelling arguments concerned the simple phonological representation used in the model, the so-called Wickelfeature (Wickelgren, 1969), in which a central phoneme is related to both the preceding and the following ones. As shown by Pinker and Prince, this simplification is scarcely plausible and misses important aspects of the temporal order in phonological forms.
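To give a concrete sense of what this representation looks like, here is a minimal sketch, in Python, of the context-sensitive triples ("Wickelphones") on which the encoding is based; the phoneme strings and the helper name are illustrative, not taken from the original model, which further decomposed each triple into distributed Wickelfeatures.

    def wickelphones(phonemes):
        """Encode a word as the set of context-sensitive triples
        (preceding phoneme, central phoneme, following phoneme),
        with '#' marking the word boundary."""
        padded = ["#"] + list(phonemes) + ["#"]
        return {(padded[i - 1], padded[i], padded[i + 1])
                for i in range(1, len(padded) - 1)}

    # Illustrative example: /kamp/ ("camp") as an unordered set of triples.
    print(wickelphones(["k", "a", "m", "p"]))
    # {('#','k','a'), ('k','a','m'), ('a','m','p'), ('m','p','#')}

Since the triples form an unordered set, the encoding retains only local context and loses the global temporal order of the phonological form, which is one facet of the implausibility pointed out by Pinker and Prince.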
Just a couple of years after the past tense model, Jeffrey Elman (1990) proposed an alternative artificial neural structure that efficiently addressed the issue of representing temporally ordered information, by adding recurrent connections to a feedforward model. Using recurrent networks, Elman demonstrated the ability to capture some basic aspects of syntax from examples, such as the noun-verb distinction and subject-verb agreement (Elman, 1991). A limitation of Elman's recurrent networks was the difficulty of retaining memory of input events for more than 4-5 discrete time steps, which prevented the processing of complex sentences. This issue was solved by a refinement of the basic recurrent networks called Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997), with the addition of multiplicative gate units that learn to open or close the access of past signals to the recurrent units. All the work done in the '80s and '90s with artificial neural networks for human language involved small experiments using toy language examples, leaving a fundamental question open: would neural models ever be able to process language at scale? Theoretical speculations oscillated between enthusiastic positive answers and skepticism. Presumably, the ongoing progress in natural language processing based on generative grammar influenced the skeptical answers. The lack of linguistic structures like categories and trees, and of the many parameters and constraints of full grammatical systems, seemed to form a gap far too wide to be bridged by simple recurrent neural models. Now, with deep learning, this question is no longer of concern. There is no more need to speculate whether neural models will ever process aspects of full-fledged human language: they already do. The success of deep learning in language tasks has not been as rapid and surprising as in vision, and its performances are more distant from those of humans than in vision. Nevertheless, deep learning has now replaced approaches based on grammar in almost all applications of natural language processing, including speech recognition (Veselý et al., 2013; Saon et al., 2017), language comprehension (Trischler et al., 2016; Devlin et al., 2018) and translation (Zhou et al., 2016; Vaswani et al., 2017; Johnson et al., 2017). In line with the drop of cognitive interest in the deep learning community, the developers of these successful applications in natural language processing have not cared about vindicating the soundness of the intuitions of Rumelhart, McClelland, and Elman. However, in just the last couple of years, the impressive breakthroughs of deep learning in natural language processing have started to capture the attention of the linguistics community. Thus, Rumelhart & McClelland found a late rematch against Pinker & Prince, thanks to Kirov and Cotterell (2018), who reproposed the celebrated neural model of the English present-to-past tense mapping, adopting a modern encoding based on recurrent neural networks. This new model obviates most of Pinker and Prince's criticisms. Moreover, there seems to be more at stake than a mere repetition of the never dormant dispute between rationalists and empiricists in linguistics. There is a growing interest in understanding how neural networks achieve their language skills, and what kind of linguistic knowledge is embedded in these models.
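The way such grammatical knowledge is usually probed can be sketched in a few lines: a trained language model is given a sentence prefix, and one checks whether it assigns a higher probability to the grammatical than to the ungrammatical continuation. The Python snippet below only illustrates this comparison logic; the dictionary of probabilities is a made-up stand-in for a trained recurrent model, and the example sentence and function name are ours, not taken from the cited studies.

    # A trivial stand-in for a trained language model: a fixed table of
    # made-up next-word probabilities for one sentence prefix.
    toy_next_word_probs = {
        "The keys to the cabinet": {"are": 0.05, "is": 0.02},
    }

    def prefers_grammatical(prefix, grammatical, ungrammatical,
                            model=toy_next_word_probs):
        """Return True if the model assigns a higher probability to the
        grammatical continuation than to the ungrammatical one."""
        probs = model[prefix]
        return probs[grammatical] > probs[ungrammatical]

    # Long-distance subject-verb agreement: the plural subject "keys" requires
    # the plural verb "are", despite the intervening singular noun "cabinet".
    print(prefers_grammatical("The keys to the cabinet", "are", "is"))  # True here

In actual studies the probabilities come, of course, from a recurrent network trained on large corpora, and the test sentences are constructed so that simple lexical cues cannot give the answer away.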
Following this line of investigation, Gulordava et al. (2018) evaluated the ability of recurrent neural networks to learn English subject-verb agreement over long distances, a task thought to require the hierarchical structure of sentences. Above all, they tried to ascertain whether the models were able to rely on the hierarchical structure regardless of semantics. They did so by recalling – with a hint of irreverence – Chomsky himself, testing the sentence The colorless green ideas I ate with the chair sleep furiously for the agreement between ideas and sleep. Their results show how recurrent neural networks acquire deep grammatical competence without any predefined rule. Today we are witnessing a multiplication of studies aimed at assessing a variety of grammatical competences in modern recurrent neural models. The list includes auxiliary inversion in English yes/no-question formation (Fitz and Chang, 2017); negative polarity item licensing (Jumelet and Hupkes, 2018; Warstadt et al., 2019), that is, the knowledge of which contexts in a sentence license the presence of negative polarity items, words such as any; and syntactic island constraints on the filler-gap dependency (Wilcox et al., 2019a,b), where the filler is a wh-word, like who, the gap is the empty syntactic position related to the filler, and syntactic islands are positions that locally block the dependency. In a target article, Joe Pater (2019) provided a particularly useful overview of this new line of research, and fostered a fusion between the traditionally antagonistic empiricist and rationalist approaches to natural language. Is the rationalist side open to this alliance? Not much, apparently. Berent and Marcus (2019) rejected Pater's invitation, arguing that "either those connectionist models are right, and generative linguistics must be radically revised, or they must be replaced by alternatives that are compatible with the algebraic hypothesis". Corkery et al. (2019) imitated Pinker and Prince in dissecting the modern version of the English past tense learning model of Kirov and Cotterell (2018), and while acknowledging significant progress with respect to Rumelhart and McClelland's original models, they concluded that "there is still insufficient evidence to claim that neural nets are a good cognitive model for this task." So the impact of deep learning on cognitive science, in the case of language, is likely to heat up old debates. Still, there is a fundamental difference with respect to the same discussion in the '90s: the pragmatic evidence that today deep learning is the best available computational approach to language. However, the cognitive relevance of deep learning is vulnerable to the same objection that we discussed in the case of vision. Current recurrent neural models are engineered for applications such as Google Translate, and are not intended to study how language is processed in humans. Therefore, their results might be irrelevant for cognitive science. Let us recall that in the case of vision, faced with the same objection, the strategy pursued by several scholars was to find similarities between patterns in deep neural models and patterns in cortical areas. For vision we found a significant body of research on the analogies between deep convolutional models and the human visual system, with several positive results. Nothing similar can be found for recurrent neural models.
Not only is there no attempt at mapping components of linguistic neural models onto brain areas, there is not even a proposal for a correspondence between the basic recurrent units and neural circuitry in the brain. Among the few attempts in this direction, Ponte Costa et al. (2017) proposed a variation of the LSTM in which information is gated through units that are subtractive, and therefore akin to inhibitory cells. They showed a possible mapping of this structure onto known canonical excitatory-inhibitory cortical microcircuits. Compared to deep convolutional models for vision, recurrent neural models operate at a more abstract level, far away from sensory representations, and are therefore difficult to map onto cortical areas. The level of abstraction of recurrent neural models not only prevents their mapping onto brain circuits, it also introduces important cognitive differences with respect to the way humans learn language. For example, a standard practice for deep language models is to initialize the recurrent neural models with a vocabulary of known words and to feed them tokenized corpora during training. This approach is certainly valid for practical purposes, but it departs from the way humans learn language: one of the major challenges for infant learners is precisely discovering the basic constituents of linguistic structures, such as words. An important step forward was achieved recently by Hahn and Baroni (2019), who demonstrated the ability of recurrent neural networks to learn from character-level input without word boundaries. These networks learn to track word boundaries, and to solve morphological, syntactic and semantic tasks. In conclusion, even if to a lesser extent than vision, language too is a field where the breakthroughs of deep learning are so impressive that they deserve more attention in cognitive science.

7. Conclusions

What we have argued for in the above sections is that the advances achieved using deep learning models in cognitive domains are not neutral to cognitive science. We have first attempted to delineate the framework of deep learning. From a purely mathematical point of view it appears in full continuity with artificial neural networks as established in the '80s within the PDP project, and the technical innovations are surprisingly limited. There is, however, a radical difference in the interaction with cognitive science. The PDP group proposed neural models mostly as new tools for exploring cognition, with a radically empiricist perspective on how the mind works. Instead, the deep learning research community is largely driven by application and market motivations, and indifferent to cognitive studies, with a few notable exceptions. We have analyzed in depth the domain of artificial vision, where deep learning models have reached human-like performance, and language processing, where deep learning is the best available computational approach today. Even if these achievements were ultimately a quasi-secondary outcome, given that the primary goals were engineering in kind, they nevertheless raise fundamental questions for current cognitive science. We have argued that the most pressing questions concern the recent journey of cognitive science away from its earlier computational and representational principles. In principle, deep neural models are not incompatible with the new synthesis sketched here between 4E cognition, including radical enactivism, and ecological psychology.
However, it is puzzling that the impressive results of deep learning are achieved while disregarding all the theoretical indications coming from 4E cognition. The suspicion is that the role of aspects such as embodiment, enaction, dynamics, and contextual effects has been gradually amplified in 4E cognition, to the point of neglecting the contribution of basic computational processes. Contrary to the line of argument pursued here, one may still insist on the irrelevance of deep learning models for cognition, imputing the analogies between patterns in the brain and in the models to mere coincidence. But this line of reasoning is vulnerable to a sort of "no-miracle argument" (Putnam, 1978, pp. 18–19), and, at least within cognitive science, miracles are not supposed to take place.

References

Aarts, E., Korst, J., 1989. Simulated Annealing and Boltzmann Machines. John Wiley, New York.
Aizawa, K., 2015. What is this cognition that is supposed to be embodied? Philosophical Psychology 28, 755–775.
Allen, M., Friston, K. J., 2018. From cognitivism to autopoiesis: towards a computational framework for the embodied mind. Synthese 195, 2459–2482.
Anderson, J. A., Rosenfeld, E. (Eds.), 2000. Talking nets: an oral history of neural networks. MIT Press, Cambridge (MA).
Arel, I., Rose, D. C., Karnowski, T. P., 2010. Deep machine learning – a new frontier in artificial intelligence research. IEEE Computational Intelligence Magazine 5, 13–18.
Baggs, E., Chemero, A., 2018. Radical embodiment in two directions. Synthese online.
Barto, A. G., Sutton, R. S., 1982. Simulation of anticipatory responses in classical conditioning by a neuron-like adaptive element. Behavioral and Brain Science 4, 221–234.
Beck, J., 2013. Why we can't say what animals think. Philosophical Psychology 26, 520–546.
Beer, R. D., 2003. The dynamics of active categorical perception in an evolved model agent. Adaptive Behavior 11, 209–243.
Bengio, Y., 2017. The consciousness prior. CoRR abs/1709.08568.
Bengio, Y., Courville, A., Vincent, P., 2013. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence 35, 1798–1828.
Benveniste, A., Metivier, M., Priouret, P., 1990. Adaptive Algorithms and Stochastic Approximations. Springer-Verlag, Berlin.
Berent, I., Marcus, G., 2019. No integration without structured representations: Response to Pater. Language 95, e75–e86.
Biederman, I., 1987. Recognition-by-components: A theory of human image understanding. Psychological Review 94, 115–147.
Blake, A., Yuille, A. L. (Eds.), 1992. Active Vision. MIT Press, Cambridge (MA).
Boden, M., 2008. Mind as Machine: A History of Cognitive Science. Oxford University Press, Oxford (UK).
Borghi, A. M., Barca, L., Binkofski, F., Castelfranchi, C., Pezzulo, G., Tummolini, L., 2019. Words as social tools: Language, sociality and inner grounding in abstract concepts. Physics of Life Reviews 29, 120–153.
Borghi, A. M., Binkofski, F., 2014. Words as social tools: An embodied view on abstract concepts. Springer-Verlag, Berlin.
Bottou, L., LeCun, Y., 2004. Large scale online learning. In: Advances in Neural Information Processing Systems. pp. 217–224.
Bracewell, R., 2003. Fourier Analysis and Imaging. Springer-Verlag, Berlin.
Chatfield, K., Simonyan, K., Vedaldi, A., Zisserman, A., 2014. Return of the devil in the details: Delving deep into convolutional nets. CoRR abs/1405.3531.
Chemero, A., 2009. Radical embodied cognitive science. MIT Press, Cambridge (MA).
Chevalier-Boisvert, M., Bahdanau, D., Lahlou, S., Willems, L., Saharia, C., Nguyen, T. H., Bengio, Y., 2019. BabyAI: a platform to study the sample efficiency of grounded language learning. In: International Conference on Learning Representations.
Chomsky, N., 1966. Cartesian Linguistics: a Chapter in the History of Rationalist Thought. Harper and Row Pub. Inc, New York.
Chomsky, N., 1968. Language and Mind. Harcourt, Brace and World, New York, second enlarged edition, 1972.
Churchland, P. M., 1996. The engine of reason, the seat of the soul: A philosophical journey into the brain. MIT Press, Cambridge (MA).
Churchland, P. S., Ramachandran, V., Sejnowski, T., 1994. A critique of pure vision. In: Koch, C., Davis, J. (Eds.), Large-Scale Neuronal Theories of the Brain. MIT Press, Cambridge (MA).
Cichy, R. M., Kaiser, D., 2019. Deep neural networks as scientific models. Trends in Cognitive Sciences 23, 305–317.
Constant, A., Clark, A., Kirchhoff, M., Friston, K. J., 2019. Extended active inference: constructing predictive cognition beyond skulls. Mind and Language in press.
Conway, B. R., 2018. The organization and operation of inferior temporal cortex. Annual Review of Vision Science 4, 19.1–19.22.
Copeland, J. (Ed.), 2004. The Essential Turing – Seminal Writings in Computing, Logic, Philosophy, Artificial Intelligence, and Artificial Life plus The Secrets of Enigma. Oxford University Press, Oxford (UK).
Copeland, J., Proudfoot, D., 1996. On Alan Turing's anticipation of connectionism. Synthese 108, 361–377.
Corkery, M., Matusevych, Y., Goldwater, S., 2019. Are we there yet? Encoder-decoder neural networks as cognitive models of English past tense inflection. CoRR abs/1906.01280.
Curry, H. B., 1944. The method of steepest descent for non-linear minimization problems. Quarterly of Applied Mathematics 2, 258–261.
De Croon, G. C., Sprinkhuizen-Kuyper, I. G., Postma, E., 2009. Comparing active vision models. Image and Vision Computing 27, 374–384.
de Villers, J., Barnard, E., 1992. Backpropagation neural nets with one and two hidden layers. IEEE Transactions on Neural Networks 4, 136–141.
Deutsch, K. W., 1966. The Nerves of Government: Models of Political Communication and Control. Free Press, New York.
Devlin, J., Chang, M.-W., Lee, K., Toutanova, K., 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. CoRR abs/1810.04805.
Draper, B. A., Baek, K., Boody, J., 2004. Implementing the expert object recognition pathway. Machine Vision and Applications 16, 27–32.
Edelman, S., 2015. The minority report: some common assumptions to reconsider in the modelling of the brain and behaviour. Journal of Experimental & Theoretical Artificial Intelligence 28, 751–776.
Egan, F., 1995. Computation and content. The Philosophical Review 104, 181–203.
Eickenberg, M., Gramfort, A., Varoquaux, G., Thirion, B., 2017. Seeing it all: Convolutional network layers map the function of the human visual system. NeuroImage 152, 184–194.
Eliasmith, C., Kolbeck, C., 2015. Marr's attacks: On reductionism and vagueness. Topics in Cognitive Science 7, 323–335.
Elman, J. L., 1990. Finding structure in time. Cognitive Science 14, 179–221.
Elman, J. L., 1991. Distributed representations, simple recurrent networks, and grammatical structure. Machine Learning 7, 195–225.
Elman, J. L., Bates, E., Johnson, M. H., Karmiloff-Smith, A., Parisi, D., Plunkett, K., 1996. Rethinking Innateness: A Connectionist Perspective on Development. MIT Press, Cambridge (MA).
Fellbaum, C., 1998. WordNet.
Blackwell Publishing, Malden (MA).
Fitz, H., Chang, F., 2017. Meaningful questions: The acquisition of auxiliary inversion in a connectionist model of sentence production. Cognition 166, 225–250.
Fodor, J., 1981. Representations: Philosophical Essays on the Foundations of Cognitive Science. MIT Press, Cambridge (MA).
Fodor, J., 1983. The Modularity of Mind: An Essay on Faculty Psychology. MIT Press, Cambridge (MA).
Fodor, J., 1987. Psychosemantics: The Problem of Meaning in the Philosophy of Mind. MIT Press, Cambridge (MA).
Fodor, J., 1998. Concepts: Where Cognitive Science Went Wrong. Oxford University Press, Oxford (UK).
Fodor, J., Pylyshyn, Z., 1988. Connectionism and cognitive architecture: a critical analysis. Cognition 28, 3–71.
Friston, K., 2010. The free-energy principle: a unified brain theory? Nature Reviews Neuroscience 11, 127–138.
Friston, K., 2012. A free energy principle for biological systems. Entropy 14, 2100–2121.
Friston, K., Kiebel, S., 2009. Predictive coding under the free-energy principle. Philosophical Transactions of the Royal Society B 364, 1211–1221.
Friston, K., Stephan, K. E., 2007. Free-energy and the brain. Synthese 159, 417–458.
Fukushima, K., 1980. Neocognitron: a self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics 36, 193–202.
Fukushima, K., 1988. Neocognitron: a hierarchical neural network capable of visual pattern recognition. Neural Networks 1, 119–130.
Gallagher, S., 2008. Are minimal representations still representations? International Journal of Philosophical Studies 16, 351–369.
Gallagher, S., 2017. Enactivist interventions: Rethinking the mind. Oxford University Press, Oxford (UK).
Gallagher, S., Allen, M., 2018. Active inference, enactivism and the hermeneutics of social cognition. Synthese 195, 2627–2648.
Gallistel, C. R., 1990a. The Organization of Learning. MIT Press, Cambridge (MA).
Gallistel, C. R., 1990b. Representations in animal cognition: An introduction. Cognition 37, 1–22.
Gasser, M., Smith, L. B., 1998. Learning nouns and adjectives: a connectionist account. Language and Cognitive Processes 13, 269–306.
Gauthier, I., Tarr, M. J., 2016. Visual object recognition: Do we (finally) know more now than we did? Annual Review of Vision Science 2, 16.1–16.20.
Gelder, T. v., 1995. What might cognition be, if not computation? Journal of Philosophy 91, 345–381.
Gibson, J. J., 1966. The senses considered as perceptual systems. Houghton Mifflin, Boston (MA).
Gibson, J. J., 1979. The Ecological Approach to Visual Perception. Houghton Mifflin, Boston (MA).
Goldinger, S. D., Papesh, M. H., Barnhart, A. S., Hansen, W. A., Hout, M. C., 2016. The poverty of embodied cognition. Psychonomic Bulletin & Review 23, 171–182.
Grill-Spector, K., Weiner, K. S., Gomez, J., Stigliani, A., Natu, V. S., 2018. The functional neuroanatomy of face perception: from brain measurements to deep neural networks. Interface Focus 8, 20180013.
Güçlü, U., van Gerven, M. A. J., 2014. Unsupervised feature learning improves prediction of human brain activity in response to natural images. PLoS Computational Biology 10, 1–16.
Güçlü, U., van Gerven, M. A. J., 2015. Deep neural networks reveal a gradient in the complexity of neural representations across the ventral stream. Journal of Neuroscience 35, 10005–10014.
Gulordava, K., Bojanowski, P., Grave, E., Linzen, T., Baroni, M., 2018. Colorless green recurrent networks dream hierarchically.
In: Proceedings of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, pp. 1195–1205.
Hahn, M., Baroni, M., 2019. Tabula nearly rasa: Probing the linguistic knowledge of character-level neural language models trained on unsegmented text. Transactions of the Association for Computational Linguistics 7, 467–484.
Hazelwood, K., Bird, S., Brooks, D., Chintala, S., Diril, U., Dzhulgakov, D., Fawzy, M., Jia, B., Jia, Y., Kalro, A., Law, J., Lee, K., Lu, J., Noordhuis, P., Smelyanskiy, M., Xiong, L., Wang, X., 2018. Applied machine learning at Facebook: A datacenter infrastructure perspective. In: IEEE International Symposium on High Performance Computer Architecture (HPCA). pp. 620–629.
Hebb, D. O., 1949. The Organization of Behavior. John Wiley, New York.
Heras-Escribano, M., 2019. The Philosophy of Affordances. Palgrave Macmillan, London.
Hinton, G. E., McClelland, J. L., Rumelhart, D. E., 1986. Distributed representations. In: Rumelhart and McClelland (1986b), pp. 77–109.
Hinton, G. E., Salakhutdinov, R. R., 2006. Reducing the dimensionality of data with neural networks. Science 313, 504–507.
Hochreiter, S., Schmidhuber, J., 1997. Long short-term memory. Neural Computation 9, 1735–1780.
Horgan, T., Tienson, J., 1989. Representations without rules. Philosophical Topics 17, 147–174.
Hubel, D., Wiesel, T., 1962. Receptive fields, binocular interaction, and functional architecture in the cat's visual cortex. Journal of Physiology 160, 106–154.
Hubel, D., Wiesel, T., 1968. Receptive fields and functional architecture of monkey striate cortex. Journal of Physiology 195, 215–243.
Hutto, D. D., Myin, E., 2013. Radicalizing enactivism: basic minds without content. MIT Press, Cambridge (MA).
Jacob, P., Jeannerod, M., 2003. Ways of Seeing – The Scope and Limits of Visual Cognition. Oxford University Press, Oxford (UK).
Jeannerod, M., 1994. The representing brain: Neural correlates of motor intention and imagery. Behavioral and Brain Science 17, 187–245.
Johnson, M., Schuster, M., Le, Q. V., Krikun, M., Wu, Y., Chen, Z., Thorat, N., Viégas, F., Wattenberg, M., Corrado, G., Hughes, M., Dean, J., 2017. Google's multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics 5, 339–351.
Jumelet, J., Hupkes, D., 2018. Do language models understand anything? On the ability of LSTMs to understand negative polarity items. CoRR abs/1808.10627.
Karmiloff-Smith, A., 1992. Beyond modularity: A developmental perspective on cognitive science. MIT Press, Cambridge (MA).
Ke, N. R., Bilaniuk, O., Goyal, A., Bauer, S., Larochelle, H., Pal, C., Bengio, Y., 2019. Learning neural causal models from unknown interventions. CoRR abs/1906.01280.
Khaligh-Razavi, S.-M., Kriegeskorte, N., 2014. Deep supervised, but not unsupervised, models may explain IT cortical representation. PLoS Computational Biology 10, e1003915.
Khan, S., Tripp, B. P., 2017. One model to learn them all. CoRR abs/1706.05137.
Kingma, D. P., Ba, J., 2014. Adam: A method for stochastic optimization. In: Proceedings of International Conference on Learning Representations.
Kingma, D. P., Welling, M., 2014. Auto-encoding variational bayes. In: Proceedings of International Conference on Learning Representations.
Kirov, C., Cotterell, R., 2018. Recurrent neural networks in linguistic theory: Revisiting Pinker and Prince (1988) and the past tense debate.
Transactions of the Association for Computational Linguistics 6, 651–666.
Kriegeskorte, N., 2009. Relating population-code representations between man, monkey, and computational models. Frontiers in Neuroscience 3, 363–373.
Kriegeskorte, N., Mur, M., Bandettini, P., 2009. Representational similarity analysis – connecting the branches of systems neuroscience. Frontiers in Systems Neuroscience 2, 4.
Krizhevsky, A., Sutskever, I., Hinton, G. E., 2012. ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems. pp. 1090–1098.
Kurzweil, R., 2005. The singularity is near: when humans transcend biology. Viking, New York.
Kushner, H. J., Clark, D., 1978. Stochastic Approximation Methods for Constrained and Unconstrained Systems. Springer-Verlag, Berlin.
Lake, B. M., Ullman, T. D., Tenenbaum, J. B., Gershman, S. J., 2017. Building machines that learn and think like people. Behavioral and Brain Science 40, 1–72.
Lakoff, G., Johnson, M., 1999. Philosophy in the Flesh. The Embodied Mind and its Challenge to Western Thought. Basic Books, New York.
Landau, B., Smith, L. B., Jones, S., 1988. The importance of shape in early lexical learning. Cognitive Development 3, 299–321.
Landgrebe, J., Smith, B., 2019. Making AI meaningful again. Synthese https://doi.org/10.1007/s11229-019-02192-y, 1–21.
Lehky, S. R., Tanaka, K., 2016. Neural representation for object recognition in inferotemporal cortex. Current Opinion in Neurobiology 37, 23–35.
Levenberg, K., 1944. A method for solution of certain non-linear problems in least squares. Quarterly of Applied Mathematics 2, 164–168.
López-Rubio, E., 2018. Computational functionalism for the deep learning era. Minds and Machines 28, 667–688.
MacWhinney, B. (Ed.), 1999. The Emergence of Language, 2nd Edition. Lawrence Erlbaum Associates, Mahwah (NJ).
MacWhinney, B., Leinbach, J., 1991. Implementations are not conceptualizations: Revising the verb learning model. Cognition 29, 121–157.
Mahon, B. Z., 2015. What is embodied about cognition? Language and Cognitive Neuroscience 30, 420–429.
Mallakin, A., 2019. An integration of deep learning and neuroscience for machine consciousness. Global Journal of Computer Science and Technology 19, 1–10.
Marcus, G., 2018. Deep learning: A critical appraisal. CoRR abs/1801.00631.
Marr, D., 1982. Vision: A Computational Investigation into the Human Representation and Processing of Visual Information. W. H. Freeman, San Francisco (CA).
Minsky, M., 1954. Neural nets and the brain-model problem. Ph.D. thesis, Princeton University.
Minsky, M., Papert, S., 1969. Perceptrons. MIT Press, Cambridge (MA).
Myin, E., Degenaar, J., 2014. Enactive vision. In: Shapiro, L. (Ed.), The Routledge Handbook of Embodied Cognition. Routledge, London, pp. 90–107.
Nakayama, K., 1994. James J. Gibson – an appreciation. Psychological Review 101, 329–335.
Newell, A., 1980. Physical symbol systems. Cognitive Science 4, 135–183.
Newell, A., Shaw, C., Simon, H. A., 1957. Empirical explorations of the logic theory machine: A case study in heuristic. In: Western Joint Computer Conference Proceedings. ACM, New York, pp. 218–230.
Newell, A., Shaw, C., Simon, H. A., 1959. Report on a general problem-solving program. Scientific Report P-1584, RAND Corporation, Santa Monica (CA).
Newell, A., Simon, H. A., 1972. Human problem solving. Prentice Hall, Englewood Cliffs (NJ).
Noë, A., 2004. Action in Perception. MIT Press, Cambridge (MA).
Noë, A., 2010. Vision without representation.
In: Gangopadhyay, N., Madary, M., Spicer, F. (Eds.), Perception, Action, and Consciousness: Sensorimotor Dynamics and Two Visual Systems. Oxford University Press, Oxford (UK), pp. 245–256.
Norman, J., 2002. Two visual systems and two theories of perception: An attempt to reconcile the constructivist and ecological approaches. Behavioral and Brain Science 25, 73–144.
Novaes, C. D., 2012. Formal languages in logic: a philosophical and cognitive analysis. Cambridge University Press, Cambridge (UK).
O'Brien, G., Opie, J., 2004. Notes toward a structuralist theory of mental representation. In: Clapin, H., Staines, P., Slezak, P. (Eds.), Representation in Mind – New Approaches to Mental Representation. Elsevier, Amsterdam.
Ofner, A., Stober, S., 2018. Towards bridging human and artificial cognition: Hybrid variational predictive coding of the physical world, the body and the brain. In: Advances in Neural Information Processing Systems.
Olazaran, M., 1996. A sociological study of the official history of the perceptrons controversy. Social Studies of Science 26, 611–659.
Olshausen, B. A., 2014. Perception as an inference problem. In: Gazzaniga, M. S. (Ed.), The Cognitive Neurosciences. MIT Press, Cambridge (MA), pp. 295–304, fifth edition.
O'Regan, J. K., Noë, A., 2001. A sensorimotor account of vision and visual consciousness. Behavioral and Brain Science 24, 939–1031.
Parisi, G. I., Kemker, R., Part, J. L., Kanan, C., Wermter, S., 2019. Continual lifelong learning with neural networks: A review. Neural Networks 113, 54–71.
Pater, J., 2019. Generative linguistics and neural networks at 60: Foundation, friction, and fusion. Language 95, e41–e74.
Pearl, J., 1986. Fusion, propagation, and structuring in belief networks. Artificial Intelligence 29, 241–288.
Piccinini, G., 2008. Computation without representation. Philosophical Studies 137, 205–241.
Piccinini, G., Bahar, S., 2013. Neural computation and the computational theory of cognition. Cognitive Science 34, 453–488.
Piccinini, G., Scarantino, A., 2010. Computation vs. information processing: why their difference matters to cognitive science. Studies in History and Philosophy of Science 41, 237–246.
Pinker, S., 1997. How the Mind Works. Norton, New York.
Pinker, S., Prince, A., 1988. On language and connectionism: Analysis of a parallel distributed processing model of language acquisition. Cognition 28, 73–193.
Plebe, A., 2018. The search of "canonical" explanations for the cerebral cortex. History and Philosophy of the Life Sciences 40, 40–76.
Plebe, A., De La Cruz, V. M., 2016. Neurosemantics – Neural Processes and the Construction of Linguistic Meaning. Springer, Berlin.
Plebe, A., De La Cruz, V. M., 2018. Neural representations beyond "plus X". Minds and Machines 28, 93–117.
Plebe, A., Grasso, G., 2019. The unbearable shallow understanding of deep learning. Minds and Machines 29, 515–553.
Polak, E., 1971. Computational Methods in Optimization: A Unified Approach. Academic Press, New York.
Ponte Costa, R., Assael, Y. M., Shillingford, B., de Freitas, N., Vogels, T. P., 2017. Cortical microcircuits as gated-recurrent neural networks. In: Advances in Neural Information Processing Systems. pp. 272–283.
Putnam, H., 1978. Meaning and the Moral Sciences. Routledge, London.
Pylyshyn, Z., 1984. Computation and Cognition. MIT Press, Cambridge (MA).
Rao, R. P., Ballard, D. H., 1995. An active vision architecture based on iconic representations. Artificial Intelligence 78, 461–505.
Ras, G., van Gerven, M., Haselager, P., 2018.
Explanation methods in deep learning: Users, values, concerns and challenges. In: Escalante, H. J., Escalera, S., Guyon, I., Baró, X., Güçlütürk, Y., Güçlü, U., van Gerven, M. (Eds.), Explainable and Interpretable Models in Computer Vision and Machine Learning. Springer-Verlag, Berlin, pp. 19–36.
Rawat, W., Wang, Z., 2017. Deep convolutional neural networks for image classification: A comprehensive review. Neural Computation 29, 2352–2449.
Rezende, D. J., Mohamed, S., Wierstra, D., 2014. Stochastic backpropagation and approximate inference in deep generative models. In: Xing, E. P., Jebara, T. (Eds.), Proceedings of Machine Learning Research. pp. 1278–1286.
Robbins, H., Monro, S., 1951. A stochastic approximation method. Annals of Mathematical Statistics 22, 400–407.
Robinson, L., Rolls, E. T., 2015. Invariant visual object recognition: biologically plausible approaches. Biological Cybernetics 109, 505–535.
Rolls, E., 2016. Cerebral Cortex: Principles of Operation. Oxford University Press, Oxford (UK).
Rolls, E. T., Stringer, S. M., 2006. Invariant visual object recognition: A model, with lighting invariance. Journal of Physiology – Paris 100, 43–62.
Rosch, E., 1973. On the internal structure of perceptual and semantic categories. In: Moore, T. E. (Ed.), Cognitive development and acquisition of language. Academic Press, New York.
Rosenblatt, F., 1958. The perceptron: a probabilistic model for information storage and organisation in the brain. Psychological Review 65, 386–408.
Rosenblatt, F., 1962. Principles of Neurodynamics: Perceptron and the Theory of Brain Mechanisms. Spartan, Washington (DC).
Rosenfeld, A., Kak, A. C., 1982. Digital Picture Processing, 2nd Edition. Academic Press, New York.
Rowlands, M., 2006. Body Language. MIT Press, Cambridge (MA).
Rumelhart, D. E., Hinton, G. E., Williams, R. J., 1986. Learning representations by back-propagating errors. Nature 323, 533–536.
Rumelhart, D. E., McClelland, J. L., 1986a. On learning the past tenses of English verbs. In: Rumelhart and McClelland (1986b), pp. 216–271.
Rumelhart, D. E., McClelland, J. L. (Eds.), 1986b. Parallel Distributed Processing: Explorations in the Microstructure of Cognition. MIT Press, Cambridge (MA).
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A. C., Fei-Fei, L., 2015. ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115, 211–252.
Saon, G., Kurata, G., Sercu, T., Audhkhasi, K., Thomas, S., Dimitriadis, D., Cui, X., Ramabhadran, B., et al., 2017. English conversational telephone speech recognition by humans and machines. In: Conference of the International Speech Communication Association. pp. 132–136.
Schenk, T., McIntosh, R. D., 2010. Do we have independent visual streams for perception and action? Cognitive Neuroscience 1, 52–62.
Schmidhuber, J., 2015. Deep learning in neural networks: An overview. Neural Networks 61, 85–117.
Schmidt, M., Roux, N. L., Bach, F., 2017. Minimizing finite sums with the stochastic average gradient. Mathematical Programming 162, 83–112.
Schubbach, A., 2019. Judging machines: philosophical aspects of deep learning. Synthese https://doi.org/10.1007/s11229-019-02167-z, 1–21.
Shagrir, O., Bechtel, W., 2018. Marr's computational level and delineating phenomena. In: Kaplan, D. M. (Ed.), Explanation and Integration in Mind and Brain Science. Oxford University Press, Oxford (UK), pp. 190–214.
Shea, N., 2014. Exploitable isomorphism and structural representation.
Proceedings of the Aristotelian Society 114, 123–144.
Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T., Leach, M., Kavukcuoglu, K., Graepel, T., Hassabis, D., 2016. Mastering the game of Go with deep neural networks and tree search. Nature 529, 484–489.
Simonyan, K., Zisserman, A., 2015. Very deep convolutional networks for large-scale image recognition. CoRR abs/1409.1556.
Smith, L. B., 1999. Children's noun learning: How general learning processes make specialized learning mechanisms. In: MacWhinney (1999).
Stringer, S. M., Rolls, E. T., 2002. Invariant object recognition in the visual system with novel views of 3D objects. Neural Computation 14, 2585–2596.
Stringer, S. M., Rolls, E. T., Tromans, J. M., 2007. Invariant object recognition with trace learning and multiple stimuli present during training. Network: Computation in Neural Systems 18, 161–187.
Swoyer, C., 1991. Structural representation and surrogative reasoning. Synthese 87, 449–508.
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A., 2015. Going deeper with convolutions. In: Proc. of IEEE International Conference on Computer Vision and Pattern Recognition. pp. 1–9.
Tacchetti, A., Isik, L., Poggio, T. A., 2018. Invariant recognition shapes neural representations of visual input. Annual Review of Vision Science 4, 403–422.
Tripp, B. P., 2017. Similarities and differences between stimulus tuning in the inferotemporal visual cortex and convolutional networks. In: International Joint Conference on Neural Networks. pp. 3551–3560.
Trischler, A., Ye, Z., Yuan, X., He, J., Bachman, P., Suleman, K., 2016. A parallel-hierarchical model for machine comprehension on sparse data. CoRR abs/1603.08884.
Turing, A., 1948. Intelligent machinery. Tech. rep., National Physical Laboratory, London, reprinted in Ince, D. C. (ed.), Collected Works of A. M. Turing: Mechanical Intelligence, Edinburgh University Press, 1969.
Ullman, S. (Ed.), 1996. High-level Vision – Object Recognition and Visual Cognition. MIT Press, Cambridge (MA).
VanRullen, R., 2017. Perception science in the age of deep neural networks. Frontiers in Psychology 8, 142.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., Polosukhin, I., 2017. Attention is all you need. In: Advances in Neural Information Processing Systems. pp. 6000–6010.
Veselý, K., Ghoshal, A., Burget, L., Povey, D., 2013. Sequence-discriminative training of deep neural networks. In: Conference of the International Speech Communication Association. pp. 2345–2349.
Viéville, T., 1997. A Few Steps Towards 3D Active Vision. Springer-Verlag, Berlin.
von der Malsburg, C., 1973. Self-organization of orientation sensitive cells in the striate cortex. Kybernetik 14, 85–100.
von Helmholtz, H., 1866. Handbuch der physiologischen Optik. Voss, Hamburg, English translation, 1925: Treatise on Physiological Optics, Dover Pub., New York.
Wallis, G., Rolls, E., 1997. Invariant face and object recognition in the visual system. Progress in Neurobiology 51, 167–194.
Warstadt, A., Cao, Y., Grosu, I., Peng, W., Blix, H., Nie, Y., Alsop, A., Bordia, S., Liu, H., Parrish, A., Wang, S.-F., Phang, J., Mohananey, A., Htut, P. M., Jeretic, P., Bowman, S. R., 2019. Investigating BERT's knowledge of language: Five analysis methods with NPIs. CoRR abs/1909.02597.
Werbos, P., 1974. Beyond regression: New tools for prediction and analysis in the behavioral sciences. Ph.D. thesis, Harvard University.
Werbos, P., 1994. The Roots of Backpropagation: From Ordered Derivatives to Neural Networks. John Wiley, New York.
Wheeler, M., 2005. Reconstructing the Cognitive World: The Next Step. MIT Press, Cambridge (MA).
Wickelgren, W., 1969. Context sensitive coding, associative memory, and serial order in (speech) behavior. Psychological Review 76, 1–15.
Wilcox, E., Levy, R., Futrell, R., 2019a. Hierarchical representation in neural language models: Suppression and recovery of expectations. In: Proceedings BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP. Association for Computational Linguistics, pp. 181–190.
Wilcox, E., Levy, R., Morita, T., Futrell, R., 2019b. What syntactic structures block dependencies in RNN language models? In: Goel, A., Seifert, C., Freksa, C. (Eds.), Proceedings of the Annual Conference of the Cognitive Science Society. Cognitive Science Society, pp. 1199–1205.
Willshaw, D. J., von der Malsburg, C., 1976. How patterned neural connections can be set up by self-organization. Proceedings of the Royal Society of London B 194, 431–445.
Yamins, D. L. K., Hong, H., Cadieu, C. F., Solomon, E. A., Seibert, D., DiCarlo, J. J., 2014. Performance-optimized hierarchical models predict neural responses in higher visual cortex. Proceedings of the National Academy of Sciences USA 111, 8619–8624.
Zhou, J., Cao, Y., Wang, X., Li, P., Xu, W., 2016. Deep recurrent models with fast-forward connections for neural machine translation. Transactions of the Association for Computational Linguistics 4, 371–383.