How Computers “See” Molecules | Towards Data Science


To a computer, Edvard Munch’s The Scream is nothing more than a grid of pixel values. It has no sense of why swirling lines in a twilight sky convey the agony of a scream. That’s because (modern digital) computers fundamentally process only binary signals [1,2]; they don’t inherently comprehend the objects and emotions we perceive.

To mimic human intelligence, we first need an intermediate form (a representation) to “translate” our sensory world into something a computer can handle. For The Scream, that might mean extracting edges, colors, shapes, etc. Likewise, in Natural Language Processing (NLP), a computer sees human language as an unstructured stream of symbols that must be turned into numeric vectors or other structured forms. Only then can it begin to map raw input to higher-level concepts (i.e., building a model).

Human intelligence also depends on internal representations.

In psychology, a representation refers to an internal mental symbol or image that stands for something in the outside world [3]. In other words, a representation is how information is encoded in the brain: the symbols we use (words, images, memories, artistic depictions, etc.) to stand for objects and ideas.

Our senses don’t simply put the external world directly into our brains; instead, they convert sensory input into abstract neural signals. For example, the eyes convert light into electrical signals on the retina, and the ears turn air vibrations into nerve impulses. These neural signals are the brain’s representation of the external world, which is used to reconstruct our perception of reality, essentially building a “model” in our mind.

Between ages one and two, children enter Piaget’s early preoperational stage [4]. This is when kids start using one thing to represent another: a toddler might hold a banana up to their ear and babble as if it’s a phone, or push a box around pretending it’s a car. This kind of symbolic play is crucial for cognitive development, because it shows the child can move beyond the here-and-now and project the concepts in their mind onto reality [5].

Without our senses translating physical signals into internal codes, we couldn’t perceive anything [5].

“Garbage in, garbage out”. The quality of a representation sets an upper bound on the performance of any model built on it [6,7].

Much of the progress in human intelligence has come from improving how we represent knowledge [8].

One of the core goals of education is to help students form effective mental representations of new knowledge. Seasoned educators use diagrams, animations, analogies and other tools to present abstract concepts in a vivid, relatable way. Richard Mayer argues that meaningful learning happens when learners form a coherent mental representation or model of the material, rather than just memorizing disconnected facts [8]. In meaningful learning, new information integrates into existing knowledge, allowing students to transfer and apply it in novel situations.

However, in practice, factors like limited model capacity and finite computing resources constrain how complex our representations can be. Compressing input data inevitably risks information loss, noise, and artifacts. So, as the first step, developing a “good enough” representation requires balancing several key properties:

It should retain the information essential to the task. (A clear problem definition helps filter out the rest.)
It should be as compact as possible: minimizing redundancy and keeping dimensionality low.
It should separate classes in feature space. Samples from the same class cluster together, while those from different classes stay far apart.
It should be robust to input noise, compression artifacts, and shifts in data modality.
Invariance. Representations should be invariant to task-irrelevant changes (e.g. rotating or translating an image, or changing its brightness).
Generalizability.
Interpretability.
Transferability.

These limitations on representation complexity are somewhat analogous to the limited capacity of our own working memory.

Human short-term memory, on average, can only hold about 7±2 items at once [9]. When too many independent pieces of information arrive simultaneously (beyond what our cognitive load can handle), our brains bog down. Cognitive psychology research shows that with the right guidance (by adjusting how information is represented), people can reorganize information to overcome this apparent limit [10,11]. For example, we can remember a long string of digits more easily by chunking them into meaningful groups (which is why phone numbers are often split into shorter blocks).

Now, moving from The Scream to the microscopic world of molecules, we face the same challenge: how can we translate real-world molecules into a form that a computer can understand? With the right representation, a computer can infer chemical properties or biological functions, and ultimately map these to higher-level concepts (e.g., a drug’s activity or a molecule’s protein binding). In this article, we’ll explore the common methods that let computers “see” molecules.

Chemical Formula

Perhaps the most straightforward depiction of a molecule is its chemical formula, like C₈H₁₀N₄O₂ (caffeine), which tells us there are 8 carbon atoms, 10 hydrogen atoms, 4 nitrogen atoms and 2 oxygen atoms. However, its very simplicity is also its limitation: a formula conveys nothing about how those atoms are connected (the bonding topology), how they’re arranged in space, or where functional groups are located. That’s why isomers (like ethanol and dimethyl ether) can share the formula C₂H₆O yet differ completely in structure and properties.
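To make the limitation concrete, here is a minimal sketch that reads a formula into atom counts and then groups molecules by those counts. The function name `parse_formula` and the small `molecules` table are illustrative assumptions, and the parser only handles flat formulas without brackets or charges.

```python
import re
from collections import Counter

def parse_formula(formula: str) -> Counter:
    """Count atoms in a flat formula such as 'C2H6O' (no brackets, no charges)."""
    counts = Counter()
    # Each match is an element symbol followed by an optional count.
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(number) if number else 1
    return counts

# Hypothetical lookup table used only for this illustration.
molecules = {
    "ethanol": "C2H6O",         # CH3-CH2-OH
    "dimethyl ether": "C2H6O",  # CH3-O-CH3
    "caffeine": "C8H10N4O2",
}

# Group molecule names by their atom counts: isomers collapse into one key.
by_formula = {}
for name, formula in molecules.items():
    key = tuple(sorted(parse_formula(formula).items()))
    by_formula.setdefault(key, []).append(name)

for key, names in by_formula.items():
    print(dict(key), names)
```

Ethanol and dimethyl ether end up under the same key, which is exactly the information a formula cannot separate.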

Figure 1: Chemical formula and 2D structures of ethanol and dimethyl ether. Image by author.

Linear String

Another common way to represent molecules is to encode them as a linear string of characters, a format widely adopted in databases [12,13].

SMILES

The most classic example is SMILES (Simplified Molecular Input Line Entry System) [14], developed by David Weininger in the 1980s. SMILES treats atoms as nodes and bonds as edges, then “flattens” them into a 1D string via a depth-first traversal, preserving all the connectivity and ring information. Single, double, triple, and aromatic bonds are denoted by the symbols “-”, “=”, “#”, and “:”, respectively. Numbers are used to mark the start and end of rings, and branches off the main chain are enclosed in parentheses. (See more in the Wikipedia article on SMILES.)
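The rules above can be sketched as a tiny reader that turns a SMILES string back into atoms and bonds. This is a toy for illustration only, not a real SMILES implementation: the function name `parse_simple_smiles` is hypothetical, and it assumes one-letter atoms, single-digit ring closures, and no brackets, charges or stereochemistry. Hydrogens stay implicit.

```python
def parse_simple_smiles(smiles: str):
    """Parse a small SMILES subset into (atoms, bonds).

    Supported here: one-letter atoms (lower case = aromatic), the bond
    symbols - = # :, parentheses for branches, single-digit ring closures.
    Not supported: brackets, charges, stereo, two-letter elements (Cl, Br).
    """
    bond_orders = {"-": 1.0, "=": 2.0, "#": 3.0, ":": 1.5}
    atoms, bonds = [], []
    branch_stack = []   # atoms to return to when a branch closes
    open_rings = {}     # ring digit -> (atom index, explicit order or None)
    prev = None         # index of the atom the next atom bonds to
    pending = None      # explicit bond order waiting for its second atom

    def default_order(i, j):
        # Unwritten bonds are aromatic between two aromatic atoms, else single.
        return 1.5 if atoms[i].islower() and atoms[j].islower() else 1.0

    for ch in smiles:
        if ch in bond_orders:
            pending = bond_orders[ch]
        elif ch == "(":
            branch_stack.append(prev)
        elif ch == ")":
            prev = branch_stack.pop()
        elif ch.isdigit():
            if ch in open_rings:            # second occurrence closes the ring
                start, order = open_rings.pop(ch)
                order = pending or order or default_order(start, prev)
                bonds.append((start, prev, order))
            else:                           # first occurrence opens the ring
                open_rings[ch] = (prev, pending)
            pending = None
        elif ch.isalpha():
            atoms.append(ch)
            idx = len(atoms) - 1
            if prev is not None:
                bonds.append((prev, idx, pending or default_order(prev, idx)))
            pending = None
            prev = idx
        else:
            raise ValueError(f"unsupported SMILES character: {ch!r}")
    return atoms, bonds

# Caffeine written in Kekulé form, then benzene in aromatic form.
caffeine_atoms, caffeine_bonds = parse_simple_smiles("CN1C=NC2=C1C(=O)N(C(=O)N2C)C")
benzene_atoms, benzene_bonds = parse_simple_smiles("c1ccccc1")
print(len(caffeine_atoms), "heavy atoms,", len(caffeine_bonds), "bonds")
```

The two ring-closure digits in the caffeine string each add one bond between atoms that are far apart in the text, which is how a linear string keeps the ring information.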

SMILES is simple, intuitive, and compact for storage. Its extended syntax supports stereochemistry and isotopes. There’s also a rich ecosystem of tools supporting it: most chemistry libraries let us convert between SMILES and other standard formats.

However, without an agreed-upon canonicalization algorithm, the same molecule can be written in multiple valid SMILES forms. Ethanol, for instance, can be written as CCO, OCC, or C(O)C. This can lead to inconsistencies or “data pollution”, especially when merging data from multiple sources.

InChI

Another widely used string format is InChI (International Chemical Identifier) [15], introduced by IUPAC in 2005 to generate globally standardized, machine-readable, and unique molecule identifiers. InChI strings, though longer than SMILES, encode more details in layers (including atoms and their bond connectivity, tautomeric state, isotopes, stereochemistry, and charge), each with strict rules and priority. (See more in the Wikipedia article on InChI.)

Because an InChI string can become very lengthy as a molecule grows more complex, it’s often paired with a 27-character InChIKey hash [15]. InChIKeys aren’t human-friendly, but they’re ideal for database indexing and for exchanging molecule identifiers across systems.
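As a rough illustration of the layered layout, the sketch below splits an InChI string on its “/” separators and checks the overall shape of an InChIKey (14 letters, 10 letters, 1 letter, joined by hyphens). The caffeine strings are quoted from public database records as best recalled and should be verified before reuse; the helper names and the partial `LAYER_NAMES` table are assumptions made for this example, and nothing here computes or validates a real hash.

```python
import re

# Caffeine identifiers (check against a database before relying on them).
caffeine_inchi = "InChI=1S/C8H10N4O2/c1-10-4-9-6-5(10)7(13)12(3)8(14)11(6)2/h4H,1-3H3"
caffeine_key = "RYYVLZVUVIJVGH-UHFFFAOYSA-N"

# Partial map from layer prefix letter to a readable name (illustrative subset).
LAYER_NAMES = {"c": "connectivity", "h": "hydrogens", "q": "charge",
               "p": "protons", "i": "isotopes"}

def split_inchi(inchi: str) -> dict:
    """Split an InChI string into version, formula and prefixed layers."""
    prefix, _, body = inchi.partition("=")
    if prefix != "InChI":
        raise ValueError("not an InChI string")
    version, formula, *layers = body.split("/")
    parts = {"version": version, "formula": formula}
    for layer in layers:
        # Unknown prefixes are kept under their raw letter.
        parts[LAYER_NAMES.get(layer[0], layer[0])] = layer[1:]
    return parts

KEY_SHAPE = re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")

def looks_like_inchikey(key: str) -> bool:
    """Check only the 27-character layout, not the hash itself."""
    return bool(KEY_SHAPE.match(key))

print(split_inchi(caffeine_inchi))
print(len(caffeine_key), looks_like_inchikey(caffeine_key))
```

The fixed key length is what makes InChIKeys convenient as database index columns, while the full InChI remains the readable, layered record.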

Figure 2: Linear representations of caffeine. Image by author.

Molecular Descriptor

Many computational models require numeric inputs. Compared to linear string representations, molecular descriptors turn a molecule’s properties and patterns into a vector of numerical features, delivering satisfactory performance in many tasks [7, 16-18].

Todeschini and Consonni describe the molecular descriptor as the “final result of a logical and mathematical procedure, which transforms chemical information encoded within a symbolic representation of a molecule into a useful number or the result of some standardized experiment” [16].

We can think of a set of molecular descriptors as a standardized “physical examination sheet” for a molecule, asking questions like:

Does it have a benzene ring?
How many carbon atoms does it have?
What’s the predicted octanol-water partition coefficient (LogP)?
Which functional groups are present?
What’s its 3D conformation or electron distribution like?
…

Their answers can take various forms, such as numerical values, categorical flags, vectors, graph-based structures, tensors, etc. Because every molecule in our dataset is described using the same set of questions (the same “physical examination sheet”), comparisons and model inputs become straightforward. And because each feature has a clear meaning, descriptors improve the interpretability of the model.

Of course, just as a physical examination sheet can’t capture absolutely everything about a person’s health, a finite set of molecular descriptors can never capture all aspects of a molecule’s chemical and physical nature. Computing descriptors is usually a non-invertible process, inevitably leading to a loss of information, and the results are not guaranteed to be unique. Therefore, there are different types of molecular descriptors, each focusing on different aspects.

Thousands of molecular descriptors have been developed over the years and are available in toolkits such as RDKit [19], CDK [20], and Mordred [17]. They can be broadly categorized by the dimensionality of information they encode (these categories aren’t strict divisions):

0D: formula-based properties independent of structure (e.g., atom counts or molecular weight).
1D: sequence-based properties (e.g., counts of certain functional groups).
2D: derived from the 2D topology (e.g., eccentric connectivity index [21]).
3D: derived from 3D conformation, capturing geometric or spatial properties (e.g., charged partial surface area [22]).
4D and higher: these incorporate additional dimensions such as time, ensemble, or environmental factors (e.g., descriptors derived from molecular dynamics simulations, or from quantum chemical calculations like HOMO/LUMO).
Descriptors obtained from other sources, including experimental measurements.
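As a small worked case for the 0D category, the sketch below computes a few formula-only descriptors from atom counts. The function `zero_d_descriptors`, the chosen feature names, and the rounded atomic masses are assumptions for illustration; real toolkits compute many more descriptors and handle far more elements.

```python
# Rounded standard atomic masses for the elements used below.
ATOMIC_MASS = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999}

def zero_d_descriptors(atom_counts: dict) -> dict:
    """Compute a few 0D descriptors that depend only on the formula."""
    total = sum(atom_counts.values())
    weight = sum(ATOMIC_MASS[el] * n for el, n in atom_counts.items())
    return {
        "n_atoms": total,
        "n_heavy_atoms": total - atom_counts.get("H", 0),
        "mol_weight": round(weight, 3),
        "carbon_fraction": round(atom_counts.get("C", 0) / total, 3),
    }

caffeine = {"C": 8, "H": 10, "N": 4, "O": 2}
ethanol = {"C": 2, "H": 6, "O": 1}
dimethyl_ether = {"C": 2, "H": 6, "O": 1}

print(zero_d_descriptors(caffeine))
# Isomers are indistinguishable at this level of description.
print(zero_d_descriptors(ethanol) == zero_d_descriptors(dimethyl_ether))
```

Because every molecule is answered on the same fixed set of keys, the dictionary values can be stacked directly into a feature matrix. The last line also shows the non-uniqueness mentioned earlier: two different molecules can receive identical descriptor values.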

Molecular fingerprints are a special kind of molecular descriptor that encode substructures into a fixed-length numerical vector [16]. This table summarizes some commonly used molecular fingerprints [23], such as MACCS [24], which is shown in the figure below.

Similarly, human fingerprints or product barcodes can also be seen as (or converted to) fixed-format numerical representations.
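The idea of a key-based fingerprint can be sketched with a handful of yes/no questions and a Tanimoto similarity on the resulting bit vectors. The seven keys below are made up for this example and are not the MACCS keys; the feature sets are annotated by hand rather than detected from structures.

```python
# Hypothetical structural keys (not MACCS); one bit per question.
KEYS = ["has_ring", "has_aromatic_ring", "has_N", "has_O",
        "has_carbonyl", "has_hydroxyl", "has_ether"]

def fingerprint(features: set) -> list:
    """Fixed-length bit vector: 1 if the molecule has the feature, else 0."""
    return [1 if key in features else 0 for key in KEYS]

def tanimoto(a: list, b: list) -> float:
    """Shared on-bits divided by on-bits in either vector."""
    both = sum(x & y for x, y in zip(a, b))
    either = sum(x | y for x, y in zip(a, b))
    return both / either if either else 0.0  # 0.0 for two empty vectors is a choice

# Hand-annotated features, for illustration only.
caffeine = fingerprint({"has_ring", "has_aromatic_ring", "has_N", "has_O", "has_carbonyl"})
ethanol = fingerprint({"has_O", "has_hydroxyl"})
dimethyl_ether = fingerprint({"has_O", "has_ether"})

print(caffeine, ethanol, dimethyl_ether)
print(tanimoto(ethanol, dimethyl_ether), tanimoto(caffeine, ethanol))
```

Unlike the formula alone, these bits separate ethanol from dimethyl ether, because the hydroxyl and ether keys ask about substructures rather than atom counts.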

Different descriptors describe molecules from various aspects, so their contributions to different tasks naturally vary. In a task of predicting the aqueous solubility of drug-like molecules, over 4,000 computed descriptors were evaluated, but only about 800 made significant contributions to the prediction [7].

Figure 3: Some molecular descriptors of caffeine from PubChem, DrugBank and RDKit. Image by author.

Point Cloud

Sometimes, we need our models to learn directly from a molecule’s 3D structure. For example, this is important when we’re interested in how two molecules might interact with each other [25], need to search the possible conformations of a molecule [26], or want to simulate its behavior in a certain environment [27].

One simple way to represent a 3D structure is as a point cloud of its atoms [28]. In other words, a point cloud is a collection of coordinates of the atoms in 3D space. However, while this representation shows which atoms are near each other, it doesn’t explicitly tell us which pairs of atoms are bonded. Inferring connectivity from interatomic distances (e.g., via cutoffs) can be error-prone, and may miss higher-order chemistry like aromaticity or conjugation. Moreover, our model must account for changes of raw coordinates due to rotation or translation. (More on this later.)
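A short sketch shows why distance cutoffs are fragile. The water coordinates are approximate (in ångströms), and the function `infer_bonds` and both cutoff values are illustrative assumptions rather than recommended settings.

```python
import math
from itertools import combinations

# Approximate water geometry as a point cloud: (element, (x, y, z)) in angstroms.
water = [
    ("O", (0.000, 0.000, 0.000)),
    ("H", (0.957, 0.000, 0.000)),
    ("H", (-0.240, 0.927, 0.000)),
]

def infer_bonds(atoms, cutoff):
    """Guess bonds as all atom pairs closer than `cutoff` (a crude heuristic)."""
    bonds = []
    for (i, (_, p)), (j, (_, q)) in combinations(enumerate(atoms), 2):
        if math.dist(p, q) <= cutoff:
            bonds.append((i, j))
    return bonds

print(infer_bonds(water, cutoff=1.2))  # the two O-H pairs
print(infer_bonds(water, cutoff=1.6))  # also picks up a spurious H-H "bond"
```

A slightly looser cutoff invents a bond between the two hydrogens, and no cutoff at all can tell a single bond from a double or aromatic one.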

Graph

A molecule can also be represented as a graph, where atoms (nodes) are connected by bonds (edges). Graph representations elegantly handle rings, branches, and complex bonding arrangements. For example, in a SMILES string, a benzene ring must be “opened” and denoted by special symbols, whereas in a graph, it’s simply a cycle of nodes connected in a loop.

Molecules are commonly modeled as undirected graphs (since bonds have no inherent direction) [29-31]. We can further “decorate” the graph with additional domain-specific knowledge to make the representation more interpretable: tagging nodes with atom features (e.g., element type, charge, aromaticity) and edges with bond properties (e.g., order, length, strength). Therefore,

(uniqueness) each distinct molecular structure could correspond to a unique graph, and
(reversibility) we could reconstruct the original molecule from its graph representation.
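A minimal data structure for such an attributed, undirected graph might look like the sketch below. The class `MolGraph` and the `degree_signature` helper are hypothetical names for this example; only heavy atoms are stored, and the signature is a cheap fingerprint of the graph, not a full isomorphism test.

```python
from dataclasses import dataclass, field

@dataclass
class MolGraph:
    atoms: list                                # node feature: element symbol per atom
    bonds: dict = field(default_factory=dict)  # frozenset({i, j}) -> bond order

    def add_bond(self, i, j, order=1):
        # frozenset makes the edge undirected: {i, j} == {j, i}
        self.bonds[frozenset((i, j))] = order

    def neighbors(self, i):
        return sorted(j for pair in self.bonds if i in pair for j in pair if j != i)

    def degree_signature(self):
        """Sorted (element, degree) pairs; a coarse way to compare two graphs."""
        return sorted((el, len(self.neighbors(i))) for i, el in enumerate(self.atoms))

# Heavy-atom graphs of the two C2H6O isomers.
ethanol = MolGraph(["C", "C", "O"])          # C-C-O
ethanol.add_bond(0, 1)
ethanol.add_bond(1, 2)

dimethyl_ether = MolGraph(["C", "O", "C"])   # C-O-C
dimethyl_ether.add_bond(0, 1)
dimethyl_ether.add_bond(1, 2)

print(ethanol.degree_signature())
print(dimethyl_ether.degree_signature())
```

The oxygen has one neighbor in ethanol and two in dimethyl ether, so the graphs differ even though the formulas are identical.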

Figure 4: Ball-and-stick and two representations of caffeine’s 3D conformation. (Grey: carbon; blue: nitrogen; plum: hydrogen; pink: oxygen). Image by author.

Chemical reactions essentially involve breaking bonds and forming new ones. Using graphs makes it easier to track these changes. Some reaction-prediction models encode reactants and products as graphs and infer the transformation by comparing them [32,33].

Graph Neural Networks (GNNs) can directly process graphs and learn from them. Using a molecular graph representation, these models can naturally handle molecules of arbitrary size and topology. In fact, many GNNs have outperformed models that relied only on descriptors or linear strings on many molecular tasks [7,30,34].
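The core operation behind many GNNs is message passing: each node updates its features using its neighbors’ features, and a readout pools the nodes into one graph-level vector. The sketch below is a deliberately stripped-down version with no learned weights or nonlinearities, so it is not any particular published architecture; the function names and the two-element one-hot features ([is carbon, is oxygen]) are assumptions for illustration.

```python
def message_passing_step(features, edges):
    """One round of h_i <- h_i + sum of neighbor features (no learned weights)."""
    updated = [list(h) for h in features]
    for i, j in edges:                      # undirected: messages flow both ways
        for k in range(len(features[i])):
            updated[i][k] += features[j][k]
            updated[j][k] += features[i][k]
    return updated

def readout(features):
    """Sum over nodes, giving a fixed-size vector for a graph of any size."""
    return [sum(column) for column in zip(*features)]

C, O = [1, 0], [0, 1]   # one-hot node features: [is carbon, is oxygen]

ethanol = readout(message_passing_step([C, C, O], [(0, 1), (1, 2)]))
ethanol_relabeled = readout(message_passing_step([O, C, C], [(0, 1), (1, 2)]))
dimethyl_ether = readout(message_passing_step([C, O, C], [(0, 1), (1, 2)]))

print(ethanol, ethanol_relabeled, dimethyl_ether)
```

Renumbering the atoms of ethanol leaves the readout unchanged, while the isomer dimethyl ether yields a different vector, since the update depends on connectivity and not on atom order.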

Often, when a GNN makes a prediction, we can inspect which parts of the graph were most influential. These “important bits” frequently correspond to actual chemical substructures or functional groups. In contrast, a particular substring of a SMILES is not guaranteed to map neatly to a meaningful substructure.

A graph doesn’t always mean just the direct bonds connecting atoms. We can construct different kinds of graphs from molecular data depending on our needs, and sometimes these alternative graphs yield better results for particular applications. For example:

Complete graph: Every pair of nodes is connected by an edge. It can introduce redundant connections, but can be used to let a model consider all pairwise interactions.
Bipartite graph: Nodes are divided into two sets, and edges only connect nodes from one set to nodes from the other.
Nearest-neighbor graph: Each node is connected only to its nearest neighbors (according to some criterion), to control complexity.
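Two of these constructions can be sketched directly from coordinates. The helper names and the four collinear toy points are assumptions for illustration, and Euclidean distance is just one possible nearest-neighbor criterion.

```python
import math
from itertools import combinations

def complete_graph(n):
    """Every pair of the n nodes is connected."""
    return list(combinations(range(n), 2))

def knn_graph(coords, k):
    """Connect each node to its k nearest nodes by Euclidean distance."""
    edges = set()
    for i, p in enumerate(coords):
        others = sorted((j for j in range(len(coords)) if j != i),
                        key=lambda j: math.dist(p, coords[j]))
        for j in others[:k]:
            edges.add((min(i, j), max(i, j)))   # store undirected edges once
    return sorted(edges)

# Toy coordinates, not a real molecule.
coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.5, 0.0, 0.0), (6.0, 0.0, 0.0)]

print(len(complete_graph(len(coords))), "edges in the complete graph")
print(knn_graph(coords, k=1))
```

The complete graph grows quadratically with the number of nodes, while the nearest-neighbor graph keeps the edge count roughly proportional to it.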

Extensible Graph Representations

We can incorporate chemical rules or impose constraints within molecular graphs. In de novo molecular design, (early) SMILES-based generative models often ended up proposing invalid molecules, because: (1) assembling characters may break SMILES syntax, and (2) even a syntactically correct SMILES might encode an impossible structure. Graph-based generative models avoid these problems by building molecules atom by atom and bond by bond (under user-specified chemical rules). Graphs also let us impose constraints: require or forbid specific substructures, enforce 3D shapes or chirality, and so on, thereby guiding generation toward valid candidates that meet our goals [35,36].

Molecular graphs can also handle multiple molecules and their interactions (e.g., drug-protein binding, protein-protein interfaces). “Graph-of-graphs” approaches treat each molecule as its own graph, then deploy a higher-level model to learn how they interact [37]. Alternatively, we may merge the molecules into one composite graph, including all atoms from both partners and adding special (dummy) edges or nodes to mark their contacts [38].

So far, we’ve been considering the standard graph of bonds (the 2D connectivity), but what if the 3D arrangement matters? Graph representations can certainly be augmented with 3D information: 3D coordinates could be attached to each node, or distances/angles could be added as attributes on the edges, to make models more sensitive to differences in 3D configurations. A better option is to use models like SE(3)-equivariant GNNs, which ensure their outputs (or key internal features) transform consistently (or stay invariant) under any rotation or translation of the input.

In 3D space, the special Euclidean group SE(3) describes all possible rigid motions (any combination of rotations and translations). (It’s commonly described as a semidirect product of the rotation group SO(3) with the translation group R³.) [28]

When we say a model or a function has SE(3) invariance, we mean that it gives the same result no matter how we rotate or translate the input in 3D. This kind of invariance is essential for many molecular modeling tasks: a molecule floating in solution has no fixed reference frame (i.e., it can tumble around in space). So, if we predict some property of the molecule (say its binding affinity), that prediction shouldn’t be influenced by the molecule’s orientation or position.
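The difference between invariant and equivariant quantities can be checked numerically. In the sketch below, the rigid motion is restricted to a rotation about the z-axis followed by a translation, which is only one family of SE(3) elements; the function names and the approximate water coordinates are assumptions for illustration.

```python
import math

def rigid_motion(points, angle, shift):
    """Rotate about the z-axis by `angle` (radians), then translate by `shift`."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y + shift[0], s * x + c * y + shift[1], z + shift[2])
            for x, y, z in points]

def pairwise_distances(points):
    return [math.dist(p, q) for i, p in enumerate(points) for q in points[i + 1:]]

def centroid(points):
    n = len(points)
    return tuple(sum(p[k] for p in points) / n for k in range(3))

water = [(0.000, 0.000, 0.000), (0.957, 0.000, 0.000), (-0.240, 0.927, 0.000)]
angle, shift = math.radians(70), (5.0, -2.0, 3.0)
moved = rigid_motion(water, angle, shift)

# Invariant: pairwise distances do not change under the rigid motion.
invariant = all(math.isclose(a, b, abs_tol=1e-9)
                for a, b in zip(pairwise_distances(water), pairwise_distances(moved)))

# Equivariant: the centroid moves exactly as the input moves.
equivariant = all(math.isclose(a, b, abs_tol=1e-9)
                  for a, b in zip(centroid(moved),
                                  rigid_motion([centroid(water)], angle, shift)[0]))

print(invariant, equivariant)
```

A scalar property such as binding affinity should behave like the distances (unchanged), whereas a vector-valued output such as a predicted force or position should behave like the centroid (transformed along with the input).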

Sequence Representations of Biomacromolecules

We’ve talked mostly about small molecules. But biological macromolecules (like proteins, DNA, and RNA) can contain thousands or even millions of atoms. SMILES or InChI strings become extremely long and complex, leading to correspondingly large computational, storage, and analysis costs.

This brings us back to the importance of defining the problem: for biomacromolecules, we’re often not interested in the precise position of every single atom or the exact bonds between each pair of atoms. Instead, we care about higher-level structural patterns and functional modules: like a protein’s amino acid backbone and its alpha-helices or beta-sheets, which fold into tertiary and quaternary structures. For DNA and RNA, we may care about nucleotide sequences and motifs.

We describe these biological polymers as sequences of their building blocks (i.e., primary structure): proteins as chains of amino acids, and DNA/RNA as strings of nucleotides. There are well-established codes for these building blocks (defined by IUPAC/IUBMB): for instance, in DNA, the letters A, C, G, T represent the bases adenine, cytosine, guanine, and thymine respectively.

Static Embeddings and Pretrained Embeddings

To convert a sequence into numerical vectors, we can use static embeddings: assigning a fixed vector to each residue (or k-mer fragment). The simplest static embedding is one-hot encoding (e.g., encode adenine A as [1,0,0,0]), turning a sequence into a matrix. Another approach is to learn dense (pretrained) embeddings by leveraging large databases of sequences. For example, ProtVec [39] breaks proteins into overlapping 3-mers and trains a Word2Vec-like model (commonly used in NLP) on a large corpus of sequences, assigning each 3-mer a 100D vector. These learned fragment embeddings are shown to capture biochemical and biophysical patterns: fragments with similar functions or properties cluster closer in the embedding space.

k-mer fragments (or k-mers) are substrings of length k extracted from a biological sequence.
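Both ideas fit in a few lines. The sketch below one-hot encodes a DNA string using the A, C, G, T ordering from the text and extracts overlapping k-mers; the function names are illustrative, and ambiguous bases such as N are not handled.

```python
DNA_ALPHABET = "ACGT"

def one_hot(sequence, alphabet=DNA_ALPHABET):
    """Turn a sequence into a matrix with one row per residue."""
    index = {ch: i for i, ch in enumerate(alphabet)}
    return [[1 if index[ch] == i else 0 for i in range(len(alphabet))]
            for ch in sequence]

def kmers(sequence, k):
    """All overlapping substrings of length k, in order."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

print(one_hot("ACGT"))
print(kmers("ACGTAC", 3))
```

A sequence of length L becomes an L × 4 matrix under one-hot encoding, and yields L − k + 1 overlapping k-mers, each of which could then be looked up in a learned embedding table.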

Tokens

Inspired by NLP, we can treat a sequence as if it’s a sentence composed of tokens or words (i.e., residues or k-mer fragments), and then feed them into deep language models. Trained on massive collections of sequences, these models learn biology’s “grammar” and “semantics” just as they do in human language.

Transformers can use self-attention to capture long-range dependencies in sequences; we essentially use them to learn a “language of biology”. (Some of) Meta’s ESM series of models [40-42] trained Transformers on hundreds of millions of protein sequences. Similarly, DNABERT [43] tokenizes DNA into k-mers for BERT training on genomic data. The resulting embeddings have been shown to encapsulate a wealth of biological information. In many cases, these embeddings can be used directly for various tasks (i.e., transfer learning).

Descriptors

In practice, sequence-based models often combine their embeddings with physicochemical properties, statistical features, and other descriptors, such as the percentage of each amino acid in a protein, the GC content of a DNA sequence, or indices like hydrophobicity, polarity, charge, and molecular volume.

Beyond the main categories above, there are some other unconventional ways to represent sequences. Chaos Game Representation (CGR) [44] maps DNA sequences to points in a 2D plane, creating distinctive image patterns for downstream analysis.
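The chaos game itself is a simple iteration: start at the center of the unit square and, for each base, move halfway toward that base’s corner. The corner assignment below (A bottom-left, C top-left, G top-right, T bottom-right) is one commonly used convention and is an assumption here, as is the function name `chaos_game`.

```python
# Corner of the unit square assigned to each base (one common convention).
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}

def chaos_game(sequence):
    """Return one 2D point per base: move halfway toward that base's corner."""
    x, y = 0.5, 0.5            # start at the center of the square
    points = []
    for base in sequence:
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2, (y + cy) / 2
        points.append((x, y))
    return points

print(chaos_game("ACG"))
```

Because each step halves the distance to a corner, the final point’s position encodes the recent history of the sequence: sequences ending in the same k-mer land in the same small sub-square, which is what produces the characteristic image patterns.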

Structural Representations of Biomacromolecules

The complex structure (of a protein) determines its functions and specificities [28]. Simply knowing the linear sequence of residues is often not enough to fully understand a biomolecule’s function or mechanism (i.e., the sequence-structure gap).

Structures tend to be more conserved than sequences [28, 45]. Two proteins might have very divergent sequences but still fold into highly similar 3D structures [46]. Solving the structure of a biomolecule can give insights that we wouldn’t get from the sequence alone.

Granularity and Dimensionality Control

A single biomolecule may contain on the order of 10³-10⁵ atoms (or even more). Encoding every atom and bond explicitly into numerical form produces prohibitively high-dimensional, sparse representations.

Adding dimensions to the representation can quickly run into the curse of dimensionality. As we increase the dimensionality of our data, the “space” we’re asking our model to cover grows exponentially. Data points become sparser relative to that space (it’s like having a few needles in an ever-expanding haystack). This sparsity means a model may need vastly more training examples to find reliable patterns. Meanwhile, the computational cost of processing the data often grows polynomially or worse with dimensionality.

Not every atom is equally important for the question we care about, so we often control the granularity of our representation or reduce dimensionality in smart ways (such data often has a lower-dimensional effective representation that can describe the system without (significant) performance loss [47]):

For proteins, each amino acid can be represented by the coordinates of just its alpha carbon (C_α). For nucleic acids, one might take each nucleotide and represent it by the position of its phosphate group or by the center of its base or sugar ring.
Another example of controlled granularity comes from how AlphaFold [49] represents proteins using backbone rigid groups (or frames). Essentially, for each amino acid, a small set of main-chain atoms, usually the N, C_α, C (and maybe O), is treated as a unit. The relative geometry of these atoms is almost fixed (covalent bond lengths and angles don’t vary much), so that unit can be considered a rigid block. Instead of tracking each atom individually, the model tracks the position and orientation of that entire block in space, reducing the risks associated with excessive degrees of freedom [28] (i.e., errors from the internal movement of atoms within a residue).

Figure 5: Heavy atoms in protein backbone with dihedral angles. Image derived from [28].

If we have a large set of protein structures (or a long molecular dynamics trajectory), it can be useful to cluster these conformations into a few representative states. This is often done when building Markov state models: by clustering continuous states into a finite set of discrete “metastable” states, we can simplify a complex energy landscape into a network of a few states connected by transition probabilities.
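Once each frame of a trajectory has been assigned to a cluster, the transition probabilities can be estimated by counting. The sketch below assumes the clustering has already been done and that the state labels in `trajectory` are made up; it uses a lag of one frame and plain row normalization, without the reversibility constraints or lag-time validation that real Markov state model software applies.

```python
def transition_matrix(trajectory, n_states):
    """Estimate P[a][b] = probability of moving from state a to b in one step."""
    counts = [[0] * n_states for _ in range(n_states)]
    for a, b in zip(trajectory, trajectory[1:]):   # consecutive frame pairs
        counts[a][b] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix

# Hypothetical cluster label for each frame of a short trajectory.
trajectory = [0, 0, 0, 1, 1, 0, 0, 2, 2, 2, 2, 0]

for row in transition_matrix(trajectory, n_states=3):
    print([round(p, 2) for p in row])
```

Each row sums to one and describes where the system tends to go next from that state; large diagonal entries indicate long-lived (metastable) states.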

Many coarse-grained molecular dynamics force fields, such as MARTINI [50] and UNRES [51], have been developed to represent structural details using fewer particles.

To account for side-chain effects without modelling all internal atoms or adding excessive degrees of freedom, a common approach is to represent each side chain with a single point, usually its center of mass [52]. Such side-chain centroid models are often used in conjunction with backbone models.
The 3Di alphabet introduced by Foldseek [53] defines a 3D interaction “alphabet” of 20 states that describe protein tertiary interactions. Thus, a protein’s 3D structure can be converted into a sequence over 20 symbols, and two structures can be aligned by aligning their 3Di sequences.
We may spatially crop or focus on just part of a biomolecule. For instance, if we’re studying how a small drug molecule binds to a protein (say, in a dataset like PDBBind [54], which is full of protein-ligand complexes), we may only feed the pockets and drugs into our model.
Combining different granularities or modalities of data.
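Two of the reductions above, keeping only the C_α position and collapsing the side chain to its center of mass, can be sketched for a single residue. The function `coarse_grain`, the atom-record layout, and the serine-like coordinates are all hypothetical; atom names follow the usual PDB convention (N, CA, C, O for the backbone).

```python
BACKBONE = {"N", "CA", "C", "O"}
MASS = {"C": 12.011, "N": 14.007, "O": 15.999, "S": 32.06}

def coarse_grain(residue_atoms):
    """Reduce one residue to (C-alpha position, side-chain center of mass).

    residue_atoms: list of (atom_name, element, (x, y, z)) for heavy atoms.
    The second value is None when there are no side-chain heavy atoms.
    """
    ca = next(pos for name, _, pos in residue_atoms if name == "CA")
    side = [(MASS[el], pos) for name, el, pos in residue_atoms
            if name not in BACKBONE]
    if not side:
        return ca, None                      # e.g. glycine
    total = sum(m for m, _ in side)
    com = tuple(sum(m * p[k] for m, p in side) / total for k in range(3))
    return ca, com

# Made-up coordinates for a serine-like residue (not from a real structure).
serine = [
    ("N", "N", (-1.2, 0.6, 0.0)), ("CA", "C", (0.0, 0.0, 0.0)),
    ("C", "C", (0.5, -1.4, 0.0)), ("O", "O", (1.7, -1.6, 0.0)),
    ("CB", "C", (1.0, 0.0, 0.0)), ("OG", "O", (2.0, 1.0, 0.0)),
]

print(coarse_grain(serine))
```

Six heavy atoms (18 coordinates) are reduced to two points (6 coordinates), at the cost of discarding the internal geometry of the residue.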

Point Cloud

We could model a biomacromolecule as a huge 3D point cloud of every atom (or residue). As noted earlier, the same limitations apply.

Distance Matrix

A distance matrix records all pairwise distances between certain key atoms (for proteins, commonly the C_α of each amino acid), and is inherently invariant to rotation and translation, because rigid motions preserve pairwise distances. A contact map simplifies this further by indicating only which pairs of residues are “close enough” to be in contact. However, both representations lose directional information, so not all structural details can be recovered from them alone.
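A minimal sketch of both representations follows. The four C_α-like coordinates are invented, and the 8 Å contact threshold is a frequently used but not universal choice, so both are assumptions. The last lines illustrate the lost information: a mirror image of the structure has exactly the same distance matrix.

```python
import math

def distance_matrix(coords):
    """Symmetric matrix of all pairwise Euclidean distances."""
    return [[math.dist(p, q) for q in coords] for p in coords]

def contact_map(dist, threshold=8.0):
    """1 where two residues are within `threshold` angstroms, else 0."""
    return [[1 if d <= threshold else 0 for d in row] for row in dist]

# Invented C-alpha-like positions (angstroms), roughly 3.8 apart along the chain.
ca_coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (5.5, 3.4, 0.0), (9.0, 4.5, 1.5)]

dist = distance_matrix(ca_coords)
contacts = contact_map(dist)
for row in contacts:
    print(row)

# Mirror the structure through the z = 0 plane: the distances cannot tell.
mirrored = [(x, y, -z) for x, y, z in ca_coords]
print(distance_matrix(mirrored) == dist)
```

This is why models built only on distance matrices or contact maps cannot distinguish a structure from its mirror image without extra information such as signed dihedral angles.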

Graph

Just as we can use graphs for small molecules, we can use graphs for macromolecular structures [55,56]. Instead of atoms, each node might represent a larger unit (see Granularity and Dimensionality Control). To improve interpretability, additional knowledge, like residue descriptors and known interaction networks within a protein, can also be incorporated in nodes and edges. Note that the graph representation for biomacromolecules inherits many of the advantages we discussed for small molecules.

For macromolecules, edges are often pruned to keep the graph sparse and manageable in size: essentially a form of local magnification that focuses on local substructures, while far-apart relationships are treated as background context.

General dimensionality reduction methods such as PCA, t-SNE and UMAP are also widely used to analyze the high-dimensional structural data of macromolecules. While they don’t give us representations for computation in the same sense as the others we’ve discussed, they help project complex data into lower dimensions (e.g., for visualization or insights).

Latent Space

When we train a model (especially a generative model), it often learns to encode data into a compressed internal representation. This internal representation lives in some space of lower dimension, called the latent space. Think of London’s complex urban layout, dense and intricate, while the latent space is like a “map” that captures its essence in a simplified form.

Latent spaces are usually not directly interpretable, but we can explore them by seeing how changes in latent variables map to changes in the output. In molecular generation, if a model maps molecules into a latent space, we can take two molecules (say, as two points in that space) and generate a path between them. Ochiai et al. [57] did this by taking two known molecules as endpoints, interpolating between their latent representations, and decoding the intermediate points. The result was a set of new molecules that blended features of both originals: hybrids that may have mixed properties of the two.
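The interpolation step on its own is straightforward. The sketch below assumes straight-line (linear) interpolation between two latent vectors, which may or may not match the scheme used in [57]; the two-dimensional vectors are invented, and the encoder and decoder that would map molecules to and from this space are not shown.

```python
def interpolate(z_start, z_end, steps):
    """Return `steps` evenly spaced latent points from z_start to z_end (inclusive).

    Assumes steps >= 2 and straight-line interpolation.
    """
    path = []
    for s in range(steps):
        t = s / (steps - 1)                      # t runs from 0.0 to 1.0
        path.append([(1 - t) * a + t * b for a, b in zip(z_start, z_end)])
    return path

# Invented latent codes of two known molecules.
z_a = [0.0, 2.0]
z_b = [1.0, 0.0]

for z in interpolate(z_a, z_b, steps=5):
    print(z)   # each intermediate point would be passed to a trained decoder
```

Whether the decoded intermediates are valid, sensible molecules depends entirely on how smooth the learned latent space is, which is a property of the trained model rather than of the interpolation.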

About the Author

Tianyuan Zheng
Computational Biology, Bioinformatics, Artificial Intelligence

Department of Computer Science and Technology
Department of Applied Mathematics and Theoretical Physics
University of Cambridge

Reference

Patterson DA, Hennessy JL. Pc group and design ARM version: the {hardware} software program interface. Morgan kaufmann; 2016 Could 6.
Harris S, Harris D. Digital Design and Computer Architecture, RISC-V Edition. Morgan Kaufmann; 2021 Jul 12.
Kosslyn SM, Koenig O. Wet mind: The new cognitive neuroscience. Simon and Schuster; 1992.
Piaget J, Cook M. The origins of intelligence in children. New York: International universities press; 1952.
Bergen D. The role of pretend play in children’s cognitive development. Early Childhood Research & Practice. 2002;4(1):n1.
Bengio Y, Courville A, Vincent P. Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence. 2013 Mar 7;35(8):1798-828.
Zheng T, Mitchell JB, Dobson S. Revisiting the application of machine learning approaches in predicting aqueous solubility. ACS omega. 2024 Jul 31;9(32):35209-22.
Mayer RE. Multimedia learning. In Psychology of learning and motivation 2002 Jan 1 (Vol. 41, pp. 85-139). Academic Press.
Miller GA. The magical number seven, plus or minus two: Some limits on our capacity for processing information. Psychological review. 1956 Mar;63(2):81.
Chase WG, Simon HA. Perception in chess. Cognitive psychology. 1973 Jan 1;4(1):55-81.
Simon HA. How Big Is a Chunk? By combining data from several experiments, a basic human memory unit can be identified and measured. Science. 1974 Feb 8;183(4124):482-8.
Kim S, Chen J, Cheng T, Gindulyte A, He J, He S, Li Q, Shoemaker BA, Thiessen PA, Yu B, Zaslavsky L. PubChem 2025 update. Nucleic acids research. 2025 Jan 6;53(D1):D1516-25.
Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE. The protein data bank. Nucleic acids research. 2000 Jan 1;28(1):235-42.
Weininger D. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. Journal of chemical information and computer sciences. 1988 Feb 1;28(1):31-6.
Heller S, McNaught A, Stein S, Tchekhovskoi D, Pletnev I. InChI-the worldwide chemical structure identifier standard. Journal of cheminformatics. 2013 Jan 24;5(1):7.
Todeschini R, Consonni V. Molecular descriptors for chemoinformatics: volume I: alphabetical listing/volume II: appendices, references. John Wiley & Sons; 2009 Oct 30.
Moriwaki H, Tian YS, Kawashita N, Takagi T. Mordred: a molecular descriptor calculator. Journal of cheminformatics. 2018 Feb 6;10(1):4.
Jaganathan K, Tayara H, Chong KT. An explainable supervised machine learning model for predicting respiratory toxicity of chemicals using optimal molecular descriptors. Pharmaceutics. 2022 Apr 11;14(4):832.
RDKit: Open-source cheminformatics. https://www.rdkit.org
Willighagen EL, Mayfield JW, Alvarsson J, Berg A, Carlsson L, Jeliazkova N, Kuhn S, Pluskal T, Rojas-Chertó M, Spjuth O, Torrance G. The Chemistry Development Kit (CDK) v2.0: atom typing, depiction, molecular formulas, and substructure searching. Journal of cheminformatics. 2017 Jun 6;9(1):33.
Sharma V, Goswami R, Madan AK. Eccentric connectivity index: A novel highly discriminating topological descriptor for structure-property and structure-activity studies. Journal of chemical information and computer sciences. 1997 Mar 24;37(2):273-82.
Stanton DT, Jurs PC. Development and use of charged partial surface area structural descriptors in computer-assisted quantitative structure-property relationship studies. Analytical Chemistry. 1990 Nov 1;62(21):2323-9.
Boldini D, Ballabio D, Consonni V, Todeschini R, Grisoni F, Sieber SA. Effectiveness of molecular fingerprints for exploring the chemical space of natural products. Journal of Cheminformatics. 2024 Mar 25;16(1):35.
Durant JL, Leland BA, Henry DR, Nourse JG. Reoptimization of MDL keys for use in drug discovery. Journal of chemical information and computer sciences. 2002 Nov 25;42(6):1273-80.
Kitchen DB, Decornez H, Furr JR, Bajorath J. Docking and scoring in virtual screening for drug discovery: methods and applications. Nature reviews Drug discovery. 2004 Nov 1;3(11):935-49.
Friedrich NO, Meyder A, de Bruyn Kops C, Sommer K, Flachsenberg F, Rarey M, Kirchmair J. High-quality dataset of protein-bound ligand conformations and its application to benchmarking conformer ensemble generators. Journal of chemical information and modeling. 2017 Mar 27;57(3):529-39.
Karplus M, McCammon JA. Molecular dynamics simulations of biomolecules. Nature structural biology. 2002 Sep 1;9(9):646-52.
Zheng T, Rondina A, Micklem G, Lio P. Challenges and Guidelines in Deep Generative Protein Design: Four Case Studies.
Duvenaud DK, Maclaurin D, Iparraguirre J, Bombarell R, Hirzel T, Aspuru-Guzik A, Adams RP. Convolutional networks on graphs for learning molecular fingerprints. Advances in neural information processing systems. 2015;28.
Gilmer J, Schoenholz SS, Riley PF, Vinyals O, Dahl GE. Neural message passing for quantum chemistry. In International conference on machine learning 2017 Jul 17 (pp. 1263-1272). PMLR.
Wu Z, Pan S, Chen F, Long G, Zhang C, Yu PS. A comprehensive survey on graph neural networks. IEEE transactions on neural networks and learning systems. 2020 Mar 24;32(1):4-24.
Jin W, Coley C, Barzilay R, Jaakkola T. Predicting organic reaction outcomes with weisfeiler-lehman network. Advances in neural information processing systems. 2017;30.
Shi C, Xu M, Guo H, Zhang M, Tang J. A graph to graphs framework for retrosynthesis prediction. In International conference on machine learning 2020 Nov 21 (pp. 8818-8827). PMLR.
Wu Z, Ramsundar B, Feinberg EN, Gomes J, Geniesse C, Pappu AS, Leswing K, Pande V. MoleculeNet: a benchmark for molecular machine learning. Chemical science. 2018;9(2):513-30.
Lim J, Ryu S, Kim JW, Kim WY. Molecular generative model based on conditional variational autoencoder for de novo molecular design. Journal of cheminformatics. 2018 Jul 11;10(1):31.
Maziarka Ł, Pocha A, Kaczmarczyk J, Rataj K, Danel T, Warchoł M. Mol-CycleGAN: a generative model for molecular optimization. Journal of Cheminformatics. 2020 Jan 8;12(1):2.
Wang H, Lian D, Zhang Y, Qin L, Lin X. Gognn: Graph of graphs neural network for predicting structured entity interactions. arXiv preprint arXiv:2005.05537. 2020 May 12.
Jiang D, Hsieh CY, Wu Z, Kang Y, Wang J, Wang E, Liao B, Shen C, Xu L, Wu J, Cao D. InteractionGraphNet: a novel and efficient deep graph representation learning framework for accurate protein–ligand interaction predictions. Journal of medicinal chemistry. 2021 Dec 8;64(24):18209-32.
Asgari E, Mofrad MR. Continuous distributed representation of biological sequences for deep proteomics and genomics. PloS one. 2015 Nov 10;10(11):e0141287.
Rives A, Meier J, Sercu T, Goyal S, Lin Z, Liu J, Guo D, Ott M, Zitnick CL, Ma J, Fergus R. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proceedings of the National Academy of Sciences. 2021 Apr 13;118(15):e2016239118.
Rao R, Meier J, Sercu T, Ovchinnikov S, Rives A. Transformer protein language models are unsupervised structure learners. Biorxiv. 2020 Dec 15:2020-12.
Rao RM, Liu J, Verkuil R, Meier J, Canny J, Abbeel P, Sercu T, Rives A. MSA transformer. In International conference on machine learning 2021 Jul 1 (pp. 8844-8856). PMLR.
Ji Y, Zhou Z, Liu H, Davuluri RV. DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome. Bioinformatics. 2021 Aug 1;37(15):2112-20.
Jeffrey HJ. Chaos game representation of gene structure. Nucleic acids research. 1990 Apr 25;18(8):2163-70.
Illergård K, Ardell DH, Elofsson A. Structure is three to ten times more conserved than sequence—a study of structural response in protein cores. Proteins: Structure, Function, and Bioinformatics. 2009 Nov 15;77(3):499-508.
Chothia C, Lesk AM. The relation between the divergence of sequence and structure in proteins. The EMBO journal. 1986 Apr 1;5(4):823-6.
Roel-Touris J, Don CG, V. Honorato R, Rodrigues JP, Bonvin AM. Less is more: coarse-grained integrative modeling of large biomolecular assemblies with HADDOCK. Journal of chemical theory and computation. 2019 Sep 20;15(11):6358-67.
Duong VT, Diessner EM, Grazioli G, Martin RW, Butts CT. Neural Upscaling from Residue-Level Protein Structure Networks to Atomistic Structures. Biomolecules. 2021 Nov 30;11(12):1788.
Jumper J, Evans R, Pritzel A, Green T, Figurnov M, Ronneberger O, Tunyasuvunakool K, Bates R, Žídek A, Potapenko A, Bridgland A. Highly accurate protein structure prediction with AlphaFold. Nature. 2021 Aug 26;596(7873):583-9.
Marrink SJ, Risselada HJ, Yefimov S, Tieleman DP, De Vries AH. The MARTINI force field: coarse grained model for biomolecular simulations. The journal of physical chemistry B. 2007 Jul 12;111(27):7812-24.
Liwo A, Baranowski M, Czaplewski C, Gołaś E, He Y, Jagieła D, Krupa P, Maciejczyk M, Makowski M, Mozolewska MA, Niadzvedtski A. A unified coarse-grained model of biological macromolecules based on mean-field multipole–multipole interactions. Journal of molecular modeling. 2014 Aug;20(8):2306.
Cao F, von Bülow S, Tesei G, Lindorff‐Larsen K. A coarse‐grained model for disordered and multi‐domain proteins. Protein Science. 2024 Nov;33(11):e5172.
Van Kempen M, Kim SS, Tumescheit C, Mirdita M, Lee J, Gilchrist CL, Söding J, Steinegger M. Fast and accurate protein structure search with Foldseek. Nature biotechnology. 2024 Feb;42(2):243-6.
Wang R, Fang X, Lu Y, Yang CY, Wang S. The PDBbind database: methodologies and updates. Journal of medicinal chemistry. 2005 Jun 16;48(12):4111-9.
Ingraham J, Garg V, Barzilay R, Jaakkola T. Generative models for graph-based protein design. Advances in neural information processing systems. 2019;32.
Jing B, Eismann S, Suriana P, Townshend RJ, Dror R. Learning from protein structure with geometric vector perceptrons. arXiv preprint arXiv:2009.01411. 2020 Sep 3.
Ochiai T, Inukai T, Akiyama M, Furui K, Ohue M, Matsumori N, Inuki S, Uesugi M, Sunazuka T, Kikuchi K, Kakeya H. Variational autoencoder-based chemical latent space for large molecular structures with 3D complexity. Communications Chemistry. 2023 Nov 16;6(1):249.
