Symbols, Terminology, and Nomenclature
Published in W. M. Haynes, David R. Lide, Thomas J. Bruno, CRC Handbook of Chemistry and Physics, 2016
W. M. Haynes, David R. Lide, Thomas J. Bruno
The IUPAC International Chemical Identifier (InChI) is a freely available, non-proprietary identifier for chemical substances that can be used in both printed and electronic data sources. It is generated from a computerized representation of a molecular structure diagram, which can be produced by chemical structure-drawing software. Its use enables linking of diverse data compilations and unambiguous identification of chemical substances. A full description of the Identifier and software for its generation are available from the IUPAC Web site (Ref. 1), and a helpful compilation of answers to frequently asked questions has been put together at the Unilever Centre for Molecular Science Informatics (Ref. 2). Commercial structure-drawing software that will generate the Identifier is available from several organizations, listed on the IUPAC Web site.
The conversion of structural information to the Identifier is based on a set of IUPAC structure conventions, and on rules for normalization and canonicalization (conversion to a single, predictable sequence) of an input structure representation. The resulting InChI is simply a series of characters that serves to uniquely identify the structure from which it was derived.
The InChI uses a layered format to represent all available structural information relevant to compound identity. InChI layers are listed below. Each layer in an InChI representation contains a specific type of structural information. These layers, automatically extracted from the input structure, are designed so that each successive layer adds detail to the Identifier. The specific layers generated depend on the level of structural detail available and on whether allowance is made for tautomerism. Of course, any ambiguities or uncertainties in the original structure will remain in the InChI. This layered design offers a number of advantages.
If two structures for the same substance are drawn at different levels of detail, the one with the lower level of detail will, in effect, be contained within the other. Specifically, if one substance is drawn with stereo-bonds and the other without, the layers in the latter will be a subset of the former. The same will hold for compounds treated by one author as tautomers and by another as exact structures with all H-atoms fixed. This can work at a finer level. For example, if one author includes double bond and tetrahedral stereochemistry, but another omits stereochemistry, the latter InChI will be contained in the former.
The InChI layers are:
1. Formula
2. Connectivity (no formal bond orders)
   a. disconnected metals
   b. connected metals
3. Isotopes
4. Stereochemistry
   a. double bond (Z/E)
   b. tetrahedral (sp3)
5. Tautomers (on or off)
Charges are not part of the basic InChI, but rather are added at the end of the InChI string. Two examples of InChI representations are given below. It is important to recognize, however, that InChI strings are intended for use by computers and end users need not understand any of their details. In fact, the open nature of InChI and its flexibility of representation, after implementation into software systems, may allow chemists to be even less concerned with the details of structure representation by computers.
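The containment property described above can be sketched in a few lines of Python. This is a minimal illustration and not the IUPAC software: it assumes only that an InChI string begins with `InChI=` and separates its layers with `/`, and it compares layers purely as text. The function names and the two alanine strings are illustrative choices and are not the Handbook's own examples.

```python
from __future__ import annotations


def split_inchi_layers(inchi: str) -> list[str]:
    """Split an InChI string into its '/'-separated parts.

    The first part is the version prefix, the second is the formula layer,
    and each later part is one further layer of structural detail.
    """
    prefix = "InChI="
    if not inchi.startswith(prefix):
        raise ValueError("not an InChI string")
    return inchi[len(prefix):].split("/")


def is_contained_in(less_detailed: str, more_detailed: str) -> bool:
    """Return True if every layer of less_detailed also occurs in more_detailed.

    This is a textual subset test only; it does not re-derive or validate
    either identifier.
    """
    return set(split_inchi_layers(less_detailed)) <= set(
        split_inchi_layers(more_detailed)
    )


# Illustrative strings (assumed): alanine drawn without stereochemistry,
# and the same string with stereochemical layers appended.
ALANINE_FLAT = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)"
ALANINE_STEREO = ALANINE_FLAT + "/t2-/m0/s1"

if __name__ == "__main__":
    print(split_inchi_layers(ALANINE_STEREO))
    print(is_contained_in(ALANINE_FLAT, ALANINE_STEREO))   # True
    print(is_contained_in(ALANINE_STEREO, ALANINE_FLAT))   # False
```

A set comparison is used here because it mirrors the "subset" wording of the text; a production tool would more likely rely on the official InChI library than on string handling.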
A Systematic Review of Deep Learning Approaches for Natural Language Processing in Battery Materials Domain
Published in IETE Technical Review, 2022
Geetanjali Singh, Namita Mittal, Satyendra Singh Chouhan
The growth in available chemical data, together with machine learning and deep learning (DL) architectures, has advanced the field of cheminformatics and the discovery of important materials, and is useful for representing and exploring the formation of new materials. New materials have been discovered using various traditional methods, but these suffer from a bottleneck: the interactions between molecules are still difficult to predict, and the whole process is time-consuming. A few text alternatives for representing chemical structure are the chemical formula, the International Chemical Identifier (InChI) defined by IUPAC [50], and the simplified molecular input line entry system (SMILES) [51]. These text-based representations of chemical entities are readily available to researchers on the internet, so NLP techniques can be used to process them and help uncover unstructured or hidden knowledge. Chemical databases used for NLP purposes include UniProt, PDB, PFam, PROSITE, PubChem, and DrugBank [52]. Carrera et al. [53] discovered six novel guanidinium salts using a QSPR model based on CPG neural networks, which can understand the structural relationships of the guanidinium cations.
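As a rough illustration of how a text-based representation can be fed to NLP tooling, the sketch below splits a SMILES string into tokens. It is a simplified assumption and not a method taken from the review: the regular expression only recognizes bracket atoms, the two-letter halogens Cl and Br, two-digit ring-closure labels such as `%12`, and otherwise treats every character as its own token. It does not validate the SMILES, and the name `tokenize_smiles` is hypothetical.

```python
from __future__ import annotations

import re

# Alternatives are tried left to right, so multi-character tokens are
# matched before the single-character fallback.
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")


def tokenize_smiles(smiles: str) -> list[str]:
    """Split a SMILES string into atom, bond, branch, and ring-closure tokens."""
    return SMILES_TOKEN.findall(smiles)


if __name__ == "__main__":
    print(tokenize_smiles("CCO"))        # ['C', 'C', 'O']
    print(tokenize_smiles("ClCCBr"))     # ['Cl', 'C', 'C', 'Br']
    print(tokenize_smiles("C[NH3+]"))    # ['C', '[NH3+]']
```

The resulting token lists can then be mapped to integer IDs or embeddings in the same way as words in ordinary text.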
Target-specific toxicity knowledgebase (TsTKb): a novel toolkit for in silico predictive toxicology
Published in Journal of Environmental Science and Health, Part C, 2018
Yan Li, Gabriel Idakwo, Sundar Thangapandian, Minjun Chen, Huixiao Hong, Chaoyang Zhang, Ping Gong
Chemical data table: This table stores inherent chemical properties that can be calculated or derived from a chemical's structure. Currently, more than 100,000 chemical substances with known toxicity targets have been curated and deposited. Each chemical is labeled with three identifiers: Chemical Abstracts Service number (CASN), US EPA’s DTXSID/DTXCID, and PubChem CID. Through DTXSID/DTXCID and CID, a chemical can be linked to DSSTox and PubChem, respectively, enabling access to additional chemical information from these external sources. The chemical structure is described by SMILES. Other properties, which can be used for queries, include the International Chemical Identifier (InChI), general name, IUPAC name, and so on.
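One way to picture such a table is as a set of records indexed by each of their identifiers. The sketch below is hypothetical: the class and field names are assumptions and do not reflect TsTKb's actual schema, and the DTXSID value is a placeholder, not a real identifier.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class ChemicalRecord:
    """One row of a hypothetical chemical data table."""
    casn: str
    dtxsid: str
    pubchem_cid: int
    smiles: str
    inchi: str
    general_name: str


def build_index(records: Iterable[ChemicalRecord]) -> dict[str, ChemicalRecord]:
    """Map every identifier of every record to that record."""
    index: dict[str, ChemicalRecord] = {}
    for rec in records:
        keys = (rec.casn, rec.dtxsid, str(rec.pubchem_cid),
                rec.inchi, rec.general_name.lower())
        for key in keys:
            index[key] = rec
    return index


def find(index: dict[str, ChemicalRecord], query: str) -> Optional[ChemicalRecord]:
    """Look up a record by exact identifier or case-insensitive name."""
    return index.get(query) or index.get(query.lower())


# Illustrative record; the DTXSID below is a placeholder (assumption).
ETHANOL = ChemicalRecord(
    casn="64-17-5",
    dtxsid="DTXSID_PLACEHOLDER",
    pubchem_cid=702,
    smiles="CCO",
    inchi="InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3",
    general_name="Ethanol",
)

if __name__ == "__main__":
    idx = build_index([ETHANOL])
    print(find(idx, "64-17-5").smiles)
    print(find(idx, "ETHANOL").pubchem_cid)
```

Indexing every identifier in a single dictionary keeps the lookup simple; a real database would more likely use separate indexed columns.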