Explore chapters and articles related to this topic
Monte Carlo Molecular Simulations
Published in Mihai V. Putz, New Frontiers in Nanochemistry, 2020
Bogdan Bumbăcilă, Mihai V. Putz
A 2014 study showed that the Monte Carlo method is an efficient approach for building a robust model to estimate HIV-1 integrase inhibition by coumarin compounds. The aim of this study was to build QSAR models for coumarin derivatives as HIV-1 integrase inhibitors by applying the Monte Carlo method. A dataset of 26 coumarin derivatives with known HIV-1 integrase inhibition activity was selected for the study. Canonical simplified molecular input-line entry system (SMILES) notations for all the compounds were generated with the ACD/ChemSketch program (ACD/ChemSketch v.11.0) to preserve consistency, because different software may generate different SMILES notations. SMILES is a representation of the molecular structure by a sequence of symbols. Some symbols represent molecular fragments, such as atoms or bonds (e.g., ‘C,’ ‘N,’ ‘=,’ ‘#,’ etc.). Some of these fragments are represented by two symbols (e.g., ‘Br,’ ‘Cl,’ ‘@@,’ etc.) which cannot be separated. Optimal SMILES-based descriptors, determined by the descriptor correlation weight (DCW(T,Nepoch)), were calculated with the CORAL software (Veselinović et al., 2014).
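The point about inseparable two-symbol fragments can be illustrated with a minimal tokenizer sketch. It is not the CORAL parsing routine. The function name `tokenize_smiles` and the pattern are assumptions made for illustration, and only the two-symbol fragments named in the text (‘Br,’ ‘Cl,’ ‘@@’) are treated as units.

```python
import re

# Hypothetical pattern: try the two-symbol fragments first, then fall back
# to any single character, so 'Cl' is never split into 'C' and 'l'.
TOKEN_PATTERN = re.compile(r"Br|Cl|@@|.")


def tokenize_smiles(smiles):
    """Split a SMILES string into symbols, keeping Br, Cl and @@ whole."""
    return TOKEN_PATTERN.findall(smiles)


if __name__ == "__main__":
    print(tokenize_smiles("ClC(Br)=O"))
    print(tokenize_smiles("N[C@@H](C)C(=O)O"))
```

Putting the two-symbol alternatives first matters because a regular expression tries alternatives from left to right; with the single-character fallback first, every fragment would be split.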
Electrolytes for High-Voltage Lithium-Ion Battery
Published in Ming-Fa Lin, Wen-Dung Hsu, Jow-Lay Huang, Lithium-Ion Batteries and Solar Cells, 2021
Ming-Hsiu Wu, Chih-Ao Liao, Ngoc Thanh Thuy Tran, Wen-Dung Hsu
SMILES is a line notation that describes molecular structure using ASCII characters. For example, benzene is written in SMILES as c1ccccc1. It is the most popular representation in the field of machine learning, since it follows a particular grammar and can be fed directly to natural language processing (NLP) models. In practice, because many machine learning algorithms cannot process character strings as input directly, a SMILES string has to be converted into numeric form. The standard way to do this is to first treat every distinct character as an atom type. Then, for a given molecule, every character of its SMILES is converted into a bit vector over the atom types. The bit vectors are then stacked in the order in which the characters appear in the SMILES, forming a binary matrix that can serve directly as machine learning input. This scheme is known as one-hot encoding. Invertibility is a main advantage of SMILES, since a one-hot encoded representation can be converted back to the original molecule directly. SMILES, however, also has a drawback: one molecule can have multiple SMILES representations. This nonuniqueness stems from the fact that an arbitrary starting atom in a molecule can be used to construct its SMILES. Some cheminformatics packages, such as RDKit [12], provide a function to canonicalize SMILES. However, Bjerrum et al. argue that the latent space created from canonical SMILES may be problematic, since only a specific grammar syntax is learned, instead of the general underlying rules of molecular structure [13].
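The character-level one-hot scheme described above, including its invertibility, can be sketched in a few lines of plain Python. The function names (`build_vocabulary`, `one_hot_encode`, `one_hot_decode`) are illustrative assumptions, not part of any library, and the vocabulary is simply the sorted set of characters found in the example strings.

```python
def build_vocabulary(smiles_list):
    """Collect every distinct character ("atom type" in the text) in sorted order."""
    return sorted({ch for smiles in smiles_list for ch in smiles})


def one_hot_encode(smiles, vocabulary):
    """Return a binary matrix: one row per character, one column per vocabulary entry."""
    index = {ch: i for i, ch in enumerate(vocabulary)}
    matrix = []
    for ch in smiles:
        row = [0] * len(vocabulary)
        row[index[ch]] = 1  # exactly one bit set per character
        matrix.append(row)
    return matrix


def one_hot_decode(matrix, vocabulary):
    """Invert the encoding by reading off the position of the set bit in each row."""
    return "".join(vocabulary[row.index(1)] for row in matrix)


if __name__ == "__main__":
    vocab = build_vocabulary(["c1ccccc1", "CCO"])
    encoded = one_hot_encode("c1ccccc1", vocab)
    print(vocab)
    print(encoded)
    print(one_hot_decode(encoded, vocab))
```

Real pipelines typically also pad all matrices to a common length and add start and end symbols; those steps are omitted here to keep the sketch close to the description.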
Smiles
Published in Mihai V. Putz, New Frontiers in Nanochemistry, 2020
Mihai V. Putz, Nicoleta A. Dudaş
Nowadays, SMILES is most commonly used for storage and retrieval of compounds across multiple computer platforms, allowing interpretation and generation of chemical notation independent of the specific computer system in use (Weininger, 1988; 1990; Weininger et al., 1989; Daylight Chemical Information Systems, 2008a,b,c, 2013; Sliwoski et al., 2013).
Target-specific toxicity knowledgebase (TsTKb): a novel toolkit for in silico predictive toxicology
Published in Journal of Environmental Science and Health, Part C, 2018
Yan Li, Gabriel Idakwo, Sundar Thangapandian, Minjun Chen, Huixiao Hong, Chaoyang Zhang, Ping Gong
Chemical data table: This table stores inherent chemical properties that can be calculated or derived from a chemical's structure. Currently, more than 100,000 chemical substances with known toxicity targets have been curated and deposited. Each chemical is labeled with three identifiers: Chemical Abstracts Service number (CASN), US EPA’s DTXSID/DTXCID, and PubChem CID. Through DTXSID/DTXCID and CID, a chemical can be linked to DSSTox and PubChem, respectively, enabling access to additional chemical information from these external sources. The chemical structure is described by SMILES. Other properties include the International Chemical Identifier (InChI), general name, IUPAC name, and so on, which can be used for queries.
Prediction of the coefficient of linear thermal expansion for the amorphous homopolymers based on chemical structure using machine learning
Published in Science and Technology of Advanced Materials: Methods, 2021
Ekaterina Gracheva, Guillaume Lambard, Sadaki Samitsu, Keitaro Sodeyama, Ayako Nakata
Since a given SMILES can be written in multiple forms when starting from different atoms within a molecule, multiple SMILES strings correspond to the same molecule, which raises an input-invariance issue. To address this issue, SMILES-X implements data augmentation: given a molecule, one can write its SMILES representations starting from different atoms, with duplicates removed. This allows the model to deepen its understanding of structure–property relationships by becoming agnostic to the multiple arrangements of a SMILES. Details on the SMILES-X software can be found in the corresponding paper [12].
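The idea of writing one SMILES per starting atom and then removing duplicates can be shown with a toy sketch. It is not the SMILES-X implementation: it assumes a single unbranched ring with only single bonds, given as a list of atom symbols in ring order, and the function name `ring_smiles_variants` is hypothetical.

```python
def ring_smiles_variants(ring_atoms):
    """Write one SMILES per starting atom of a simple ring, dropping duplicates.

    Toy assumption: ring_atoms lists the atoms of one unbranched,
    single-bonded ring in order; nothing else is supported.
    """
    variants = []
    for start in range(len(ring_atoms)):
        # Rotate the ring so that a different atom comes first.
        rotated = ring_atoms[start:] + ring_atoms[:start]
        # Open the ring at the first atom and close it at the last one.
        variants.append(rotated[0] + "1" + "".join(rotated[1:]) + "1")
    # dict.fromkeys removes duplicates while preserving first-seen order.
    return list(dict.fromkeys(variants))


if __name__ == "__main__":
    # Morpholine ring: six distinct strings for the same molecule.
    print(ring_smiles_variants(["O", "C", "C", "N", "C", "C"]))
    # Cyclohexane: every rotation gives the same string, so one survives.
    print(ring_smiles_variants(["C"] * 6))
```

For general molecules with branches, several rings, or stereochemistry, a cheminformatics toolkit would be needed to generate valid alternative SMILES; this sketch only makes the counting and deduplication step concrete.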
NMR-TS: de novo molecule identification from NMR spectra
Published in Science and Technology of Advanced Materials, 2020
Jinzhe Zhang, Kei Terayama, Masato Sumita, Kazuki Yoshizoe, Kengo Ito, Jun Kikuchi, Koji Tsuda
NMR-TS is still in development and has a handful of limitations and room for improvement. First, SMILES cannot represent many features of organic molecules, such as axial chirality. This may be resolved by using graph-based representations [24]. In our study, NMR-TS is tested only with computationally generated spectra and still needs to be tested with experimental spectra, where peaks are unclear. Impurities are possible obstacles to accurate identification. NMR-TS cannot identify multiple compounds in a mixture, but could be extended by incorporating the peak separation techniques presented in [14]. To save computational time, we employed only one conformer per molecule. If k conformers are considered, the accuracy of NMR-TS should improve at the expense of an almost k-fold increase in computational cost. Also, our DFT-based spectrum computation can be replaced, e.g., by ENSO [44] in pursuit of better accuracy. See Fig. S3 for a comparison of our spectrum with that of ENSO. ENSO took 250 minutes to compute a spectrum, while our DFT calculation took 11 minutes. Compared to our DFT calculation, ENSO showed better accuracy in predicting the experimental spectrum, presumably because ENSO uses multiple conformers for spectrum calculation, while our calculation relies on only one conformer. For molecule generation from experimental spectra, we would need a robust method like ENSO. At this point, the application of NMR-TS is limited to relatively small molecules due to its high computational cost. To deal with larger molecules, the incorporation of fragment assembly [14] into NMR-TS might be beneficial. Finally, it is difficult for users to understand why NMR-TS succeeds for some molecules and fails for others. In general, interpreting the results of a neural-network-based system is known to be very difficult [45]. Nevertheless, some methods for explainable AI might improve the interpretability of NMR-TS [45].