Every so often, someone claims they've invented a revolutionary AI architecture. But once you see the same mathematical pattern — selective amplification + normalization — emerge independently from gradient descent, evolution, and chemical reactions, you realize we didn't invent the attention mechanism with the Transformer architecture. We rediscovered fundamental optimization principles that govern how any system processes information under energy constraints. Understanding attention as amplification rather than selection suggests specific architectural improvements and explains why current approaches work. Eight minutes here gives you a mental model that could guide better system design for the next decade.
When Vaswani and colleagues published "Attention Is All You Need" in 2017, they thought they were proposing something revolutionary [1]. Their transformer architecture abandoned recurrent networks entirely, relying instead on attention mechanisms to process entire text sequences simultaneously. The mathematical core was simple: compute compatibility scores between positions, convert them to weights, and use those weights to combine information selectively.
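In the notation of [1], that core is the scaled dot-product attention:

$$\mathrm{Attention}(Q, K, V) \;=\; \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V,$$

where Q, K, and V are the query, key, and value matrices and d_k is the key dimension.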
But this pattern appears to emerge independently wherever information processing systems face resource constraints under complexity. Not because there is some universal law of attention, but because certain mathematical structures seem to represent convergent solutions to fundamental optimization problems.
We may be looking at one of those rare cases where biology, chemistry, and AI have converged on similar computational strategies — not through shared mechanisms, but through shared mathematical constraints.
The 500-Million-Year Experiment
The biological evidence for attention-like mechanisms is remarkably deep. The optic tectum/superior colliculus system, which implements spatial attention through competitive inhibition, shows extraordinary evolutionary conservation across vertebrates [2]. From fish to humans, this neural architecture maintains structural and functional consistency across 500+ million years of evolution.
But perhaps more intriguing is the convergent evolution.
Independent lineages developed attention-like selective processing multiple times: compound eye systems in insects [3], camera eyes in cephalopods [4], hierarchical visual processing in birds [5], and cortical attention networks in mammals [2]. Despite vastly different neural architectures and evolutionary histories, these systems converged on similar solutions for selective information processing.
This raises a compelling question: are we seeing evidence of fundamental computational constraints that govern how complex systems must process information under resource limitations?
Even simple organisms suggest this pattern scales remarkably. C. elegans, with only 302 neurons, demonstrates sophisticated attention-like behaviors in food seeking and predator avoidance [6]. Plants exhibit attention-like selective resource allocation, directing growth responses toward relevant environmental stimuli while ignoring others [7].
The evolutionary conservation is striking, but we should be cautious about direct equivalences. Biological attention involves specific neural circuits shaped by evolutionary pressures quite different from the optimization landscapes that produce AI architectures.
Attention as Amplification: Reframing the Mechanism
Recent theoretical work has fundamentally challenged how we understand attention mechanisms. The philosophers Peter Fazekas and Bence Nanay argue that traditional "filter" and "spotlight" metaphors mischaracterize what attention actually does [8].
On their account, attention does not select inputs — it amplifies presynaptic signals in a non-stimulus-driven way, interacting with built-in normalization mechanisms that create the appearance of selection. The mathematical structure they identify is the following (a minimal numerical sketch follows the list):
- Amplification: Increase the strength of certain input signals
- Normalization: Built-in mechanisms (like divisive normalization) process these amplified signals
- Apparent Selection: The combination creates what looks like selective filtering
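To make the distinction concrete, here is a minimal NumPy sketch (my illustration, not code from [8]): a non-stimulus-driven gain on one channel, passed through divisive normalization, produces responses that look like selection even though nothing is ever filtered out.

```python
import numpy as np

def divisive_normalization(responses, sigma=0.1):
    """Each unit's response is divided by the pooled activity of all units."""
    return responses / (sigma + responses.sum())

# Identical bottom-up drive to five input channels.
drive = np.array([1.0, 1.0, 1.0, 1.0, 1.0])

# "Attention" as a non-stimulus-driven gain applied to channel 2 only.
gain = np.array([1.0, 1.0, 3.0, 1.0, 1.0])

without_attention = divisive_normalization(drive)
with_attention = divisive_normalization(drive * gain)

print(without_attention)  # uniform responses
print(with_attention)     # channel 2 dominates, the others are suppressed
```

Boosting one channel's gain both raises its normalized response and suppresses the others, which is exactly the signature the filter metaphor attributes to selection.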

This framework explains seemingly contradictory findings in neuroscience. Effects like increased firing rates, receptive field shrinkage, and surround suppression all emerge from the same underlying mechanism — amplification interacting with normalization computations that operate independently of attention.
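In simplified form, the normalization model of attention [16] captures this interaction (the full model also includes spatial and feature dimensions):

$$R_i \;=\; \frac{a_i\, E_i}{c \;+\; \sum_{j \in \mathrm{pool}} a_j\, E_j},$$

where E_i is the stimulus drive to unit i, a_i is an attentional gain that need not be stimulus-driven, c is a small constant, and the sum pools activity over surrounding units. Varying a_i alone shifts responses in ways that read as gain changes, receptive-field shrinkage, or surround suppression, depending on the pool.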
Fazekas and Nanay focused specifically on biological neural systems. Whether this amplification framework extends to other domains remains an open question, but the mathematical parallels are suggestive.
Chemical Computers and Molecular Amplification
Perhaps the most surprising evidence comes from chemical systems. Baltussen and colleagues demonstrated that the formose reaction — a network of autocatalytic reactions involving formaldehyde, dihydroxyacetone, and metal catalysts — can perform sophisticated computation [9].

The system shows selective amplification across up to 10⁶ different molecular species, achieving >95% accuracy on nonlinear classification tasks. Different molecular species respond differentially to input patterns, creating what looks like chemical attention through selective amplification. Remarkably, the system operates on timescales (500 ms to 60 minutes) that overlap with biological and artificial attention mechanisms.
But the chemical system lacks the hierarchical control mechanisms and learning dynamics that characterize biological attention. Yet the mathematical structure — selective amplification creating apparent selectivity — looks strikingly similar.
Programmable autocatalytic networks provide further evidence. Metal ions like Nd³⁺ create biphasic control mechanisms, both accelerating and inhibiting reactions depending on concentration [10]. This yields controllable selective amplification that implements Boolean logic functions and polynomial mappings through purely chemical processes.
Information-Theoretic Constraints and Universal Optimization
The convergence across these very different domains may reflect deeper mathematical necessities. Information bottleneck theory provides a formal framework: any system with limited processing capacity must solve the optimization problem of compressing its input while preserving task-relevant details [11].
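Formally, the information bottleneck objective of [11] seeks a compressed representation T of the input X that remains predictive of a target Y:

$$\min_{p(t \mid x)} \; I(X;T) \;-\; \beta\, I(T;Y),$$

where the Lagrange multiplier β sets the trade-off between compression and preserved task-relevant information.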
Jan Karbowski's work on information thermodynamics highlights universal energy constraints on information processing [12]. The fundamental thermodynamic bound on computation creates selection pressure for efficient selective processing mechanisms across all substrates capable of computation. In Landauer-style form, the bound can be written as

$$\sigma \;\geq\; k_B \ln 2 \cdot \Delta I,$$

where σ represents the entropy production rate and ΔI represents the rate of information processed. Information processing costs energy, so efficient attention mechanisms carry a survival and performance advantage.
Whenever any system — whether a brain, a computer, or even a network of chemical reactions — processes information, it must dissipate energy as waste heat. The more information you process, the more energy you have to waste. Since attention mechanisms process information (deciding what to focus on), they are subject to this energy tax.
This creates universal pressure for efficient architectures — whether you are evolution designing a brain, chemistry organizing reactions, or gradient descent training transformers.
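For scale, Landauer's principle puts the minimum energy cost of erasing a single bit at

$$E_{\min} \;=\; k_B T \ln 2 \;\approx\; 2.9 \times 10^{-21}\ \text{J at } T = 300\ \text{K},$$

so every bit a system processes and then discards carries an irreducible energy price.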
Neural networks operating at criticality — the edge between order and chaos — maximize information processing capacity while maintaining stability [13]. Empirical measurements suggest that conscious attention in humans occurs at these critical transitions [14]. Transformer networks have been reported to exhibit similar phase transitions during training, organizing attention weights near critical points where information processing is optimized [15].
This raises the possibility that attention-like mechanisms may emerge wherever systems face the fundamental trade-off between processing capacity and energy efficiency under resource constraints.
Convergent Mathematics, Not Universal Mechanisms
The evidence points toward a preliminary conclusion: rather than discovering universal mechanisms, we may be witnessing convergent mathematical solutions to similar optimization problems.

The mathematical structure — selective amplification combined with normalization — appears across all of these domains, but the underlying mechanisms and constraints differ significantly.
For transformer architectures, this reframing suggests specific insights (a short code sketch decomposing the stages follows this list):
- Q·K computation implements amplification

The dot product Q·K^T computes semantic compatibility between query and key representations, acting as a learned amplification function in which high compatibility scores amplify signal pathways. The scaling factor √d_k prevents saturation in high-dimensional spaces, maintaining gradient flow.
- Softmax normalization creates winner-take-all dynamics

Softmax implements competitive normalization: the exponential amplifies differences between scores (winner-take-all dynamics), while dividing by the sum ensures Σ_j w_ij = 1. Mathematically, this is a form of divisive normalization.
- Weighted V combination produces apparent selectivity

There is no explicit selection operator in this combination; it is simply a linear mixture of value vectors. The apparent selectivity emerges from the sparsity pattern induced by softmax normalization: high attention weights create effective gating without any explicit gating mechanism.
Composed, softmax applied to the amplified scores induces winner-take-all dynamics over the value space.
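Here is a minimal NumPy sketch of that decomposition, using toy random matrices (illustrative only): the three stages (amplification via Q·K^T/√d_k, divisive normalization via softmax, and a plain weighted sum of values) are computed explicitly, and no selection step appears anywhere.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_k, d_v = 4, 8, 8

Q = rng.normal(size=(seq_len, d_k))
K = rng.normal(size=(seq_len, d_k))
V = rng.normal(size=(seq_len, d_v))

# Stage 1 -- amplification: compatibility scores, scaled to avoid saturation.
scores = Q @ K.T / np.sqrt(d_k)

# Stage 2 -- normalization: softmax is divisive normalization of the exponentiated scores.
weights = np.exp(scores)
weights /= weights.sum(axis=-1, keepdims=True)   # each row sums to 1

# Stage 3 -- combination: a linear mixture of values; no explicit selection operator.
output = weights @ V

print(weights.round(2))  # near-one-hot rows show the apparent "selection"
```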


Implications for AI Development
Understanding attention as amplification + normalization rather than selection offers several practical insights for AI architecture design:
- Separating Amplification and Normalization: Current transformers conflate these mechanisms. We might explore architectures that decouple them, allowing more flexible normalization strategies beyond softmax [16].
- Non-Content-Based Amplification: Biological attention includes non-stimulus-driven amplification. Current transformer attention is purely content-based (Q·K compatibility). We could investigate learned positional biases, task-specific amplification patterns, or meta-learned amplification strategies.
- Local Normalization Pools: Biology uses pools of surrounding neurons for normalization rather than global normalization. This suggests exploring local attention neighborhoods, hierarchical normalization across layers, or dynamic normalization pool selection (a toy sketch follows this list).
- Critical Dynamics: The evidence that attention operates near critical points suggests that effective attention mechanisms should exhibit specific statistical signatures — power-law distributions, avalanche dynamics, and critical fluctuations [17].
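As one concrete reading of the "local normalization pools" idea, here is a hypothetical sketch; the windowed scheme below is my illustration of the concept, not a published architecture. Each query position normalizes its scores over a local pool of neighboring keys instead of the full sequence.

```python
import numpy as np

def local_pool_attention(Q, K, V, window=2):
    """Softmax restricted to a local pool of keys around each query position."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    seq_len = scores.shape[0]
    out = np.zeros_like(V)
    for i in range(seq_len):
        lo, hi = max(0, i - window), min(seq_len, i + window + 1)
        local = np.exp(scores[i, lo:hi] - scores[i, lo:hi].max())
        weights = local / local.sum()          # divisive normalization over the local pool
        out[i] = weights @ V[lo:hi]
    return out

rng = np.random.default_rng(1)
Q = rng.normal(size=(6, 4)); K = rng.normal(size=(6, 4)); V = rng.normal(size=(6, 4))
print(local_pool_attention(Q, K, V).shape)  # (6, 4)
```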
Open Questions and Future Directions
Several fundamental questions remain:
- How far do the mathematical parallels extend? Are we seeing true computational equivalence or superficial similarity?
- What can chemical reservoir computing teach us about minimal attention architectures? If simple chemical networks can achieve attention-like computation, what does that suggest about the complexity requirements for AI attention?
- Do information-theoretic constraints predict how attention evolves in scaling AI systems? As models become larger and face more complex environments, will their attention mechanisms naturally converge toward these optimization principles?
- How do we integrate biological insights about hierarchical control and adaptation into AI architectures? The gap between static transformer attention and dynamic biological attention remains substantial.
Conclusion
The story of attention looks less like invention and more like rediscovery. Whether in the formose reaction's chemical networks, the superior colliculus's neural circuits, or transformer architectures' learned weights, we see variations on a mathematical theme: selective amplification combined with normalization to create apparent selectivity.
This does not diminish the achievement of transformer architectures — if anything, it suggests they embody a fundamental computational insight that transcends their specific implementation. The mathematical constraints that govern efficient information processing under resource limitations appear to push very different systems toward similar solutions.
As we continue scaling AI systems, understanding these deeper mathematical principles may prove more valuable than mimicking biological mechanisms directly. The convergent evolution of attention-like processing suggests we are working with fundamental computational constraints, not arbitrary engineering choices.
Nature spent 500 million years exploring these optimization landscapes through evolution. We rediscovered similar solutions through gradient descent in a few years. The question now is whether understanding these mathematical principles can guide us toward even better solutions that transcend both biological and current artificial approaches.
Closing note
The real test: if someone reads this and designs a better attention mechanism as a result, we've created value.
Thanks for reading — and sharing!
Javier Marin
Applied AI Consultant | Production AI Systems + Regulatory Compliance
[email protected]
References
[1] Vaswani, A., et al. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30, 5998–6008.
[2] Knudsen, E. I. (2007). Fundamental components of attention. Annual Review of Neuroscience, 30, 57–78.
[3] Nityananda, V., et al. (2016). Attention-like processes in insects. Proceedings of the Royal Society B, 283(1842), 20161986.
[4] Cartron, L., et al. (2013). Visual object recognition in cuttlefish. Animal Cognition, 16(3), 391–401.
[5] Wylie, D. R., & Crowder, N. A. (2014). Avian models for 3D scene analysis. Proceedings of the IEEE, 102(5), 704–717.
[6] Jang, H., et al. (2012). Neuromodulatory state and sex specify alternative behaviors through antagonistic synaptic pathways in C. elegans. Neuron, 75(4), 585–592.
[7] Trewavas, A. (2009). Plant behaviour and intelligence. Plant, Cell & Environment, 32(6), 606–616.
[8] Fazekas, P., & Nanay, B. (2021). Attention is amplification, not selection. British Journal for the Philosophy of Science, 72(1), 299–324.
[9] Baltussen, M. G., et al. (2024). Chemical reservoir computation in a self-organizing reaction network. Nature, 631(8021), 549–555.
[10] Kriukov, D. V., et al. (2024). Exploring the programmability of autocatalytic chemical reaction networks. Nature Communications, 15(1), 8649.
[11] Tishby, N., & Zaslavsky, N. (2015). Deep learning and the information bottleneck principle. arXiv preprint arXiv:1503.02406.
[12] Karbowski, J. (2024). Information thermodynamics: From physics to neuroscience. Entropy, 26(9), 779.
[13] Beggs, J. M., & Plenz, D. (2003). Neuronal avalanches in neocortical circuits. Journal of Neuroscience, 23(35), 11167–11177.
[14] Freeman, W. J. (2008). Neurodynamics: An exploration in mesoscopic brain dynamics. Springer-Verlag.
[15] Gao, J., et al. (2016). Universal resilience patterns in complex networks. Nature, 530(7590), 307–312.
[16] Reynolds, J. H., & Heeger, D. J. (2009). The normalization model of attention. Neuron, 61(2), 168–185.
[17] Shew, W. L., et al. (2009). Neuronal avalanches imply maximum dynamic range in cortical networks at criticality. Journal of Neuroscience, 29(49), 15595–15600.