Representation Learning in Sensory Cortex: A Theory

Anselmi, F; Poggio, T

doi:10.1109/ACCESS.2022.3208603

We review and apply a computational theory based on the hypothesis that the main function of the feedforward path of the ventral stream in visual cortex is the encoding of invariant representations of images. A key justification of the theory is a result linking invariant representations to small sample complexity for image recognition: invariant representations allow learning from very few labeled examples. The theory characterizes how an algorithm that can be implemented by a set of "simple" and "complex" cells (a "Hubel-Wiesel module") provides invariant and selective representations. The invariance can be learned in an unsupervised way from observed transformations. Our results show that an invariant representation implies several properties of the ventral stream organization, including the emergence of Gabor receptive fields and specialized areas. The theory requires two stages of processing: the first, consisting of retinotopic visual areas such as V1, V2 and V4 with generic neuronal tuning, leads to representations that are invariant to translation and scaling; the second, consisting of modules in IT (inferior temporal cortex), with class- and object-specific tuning, provides a representation for recognition with approximate invariance to class-specific transformations, such as pose (of a body, of a face) and expression. In summary, our theory is that the ventral stream's main function is to implement the unsupervised learning of "good" representations that reduce the sample complexity of the final supervised learning stage.
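
As a rough illustration of the "simple" and "complex" cell computation described above, the following minimal sketch builds a signature that is unchanged when the input is translated. It is not the authors' implementation, and it rests on several assumptions: images are 1D integer lists, translation is modeled as a circular shift, the stored "observed transformations" are all shifts of each template, and pooling uses the max, mean, and mean square of the responses. The names `circular_shift`, `hw_module`, and `signature` are hypothetical.

```python
import random


def circular_shift(signal, k):
    """Shift a 1D signal by k positions with wrap-around (assumed stand-in for translation)."""
    n = len(signal)
    k %= n
    return signal[n - k:] + signal[:n - k]


def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


def hw_module(image, template_orbit):
    """One hypothetical Hubel-Wiesel module.

    Simple cells: inner products of the image with each stored
    transformed version of a template.
    Complex cell: pooling statistics over those responses. Shifting the
    image only permutes the responses, so the pooled values do not change.
    """
    responses = [dot(image, t) for t in template_orbit]  # simple cells
    n = len(responses)
    return (
        max(responses),                        # max pooling
        sum(responses) / n,                    # mean pooling
        sum(r * r for r in responses) / n,     # mean-square ("energy") pooling
    )


def signature(image, templates):
    """Concatenate the pooled outputs of one module per template."""
    sig = []
    for template in templates:
        # The orbit plays the role of transformations observed without labels.
        orbit = [circular_shift(template, k) for k in range(len(template))]
        sig.extend(hw_module(image, orbit))
    return sig


# Illustrative data only: random integer templates and a random integer image.
rng = random.Random(0)
templates = [[rng.randint(-5, 5) for _ in range(8)] for _ in range(3)]
image = [rng.randint(-5, 5) for _ in range(8)]

original = signature(image, templates)
shifted = signature(circular_shift(image, 3), templates)
print(original == shifted)
```

Integer inputs are used so that the comparison is exact; with floating-point inputs the pooled sums could differ by rounding, and a tolerance would be needed. Using several templates and several pooling statistics is what makes the signature selective as well as invariant in this toy setting.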