Euclidean Chemical Spaces from Molecular Fingerprints: Hamming Distance and Hempel's Ravens

Martin, Eric and Cao, Edward (2014) Euclidean Chemical Spaces from Molecular Fingerprints: Hamming Distance and Hempel's Ravens. Journal of Computer-Aided Molecular Design.


Molecules are often characterized by sparse binary fingerprints, where 1s represent the presence of substructures and 0s represent their absence. Fingerprints are especially useful for similarity calculations, such as database searching or clustering, where similarity is generally measured by the Tanimoto coefficient. In other cases, such as visualization, design of experiments, or latent variable regression, a low-dimensional Euclidean “chemical space” is more useful, where proximity between points reflects chemical similarity. It is tempting to apply principal components analysis (PCA) directly to these fingerprints to obtain a low-dimensional continuous chemical space. However, Gower has shown that distances from PCA on bit vectors are proportional to the square root of Hamming distance. Unlike Tanimoto similarity, Hamming similarity gives equal weight to shared 0s and shared 1s; that is, it gives as much weight to substructures that neither molecule contains as to substructures that both molecules contain. Proximity in the corresponding chemical space therefore reflects mainly similar size and complexity rather than shared chemical substructures. Such spaces are not well suited to visualizing and optimizing coverage of chemical space, or to supplying latent variables for regression. A more suitable alternative is to perform multidimensional scaling (MDS) on the Tanimoto distance matrix, which produces a space where proximity does reflect structural similarity.
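
The contrast described in the abstract can be illustrated with a minimal sketch on toy data. Everything below is an assumption made for illustration and is not taken from the paper: the four 32-bit fingerprints, the helper names (`make_fp`, `hamming_distance`, `tanimoto_similarity`, `pca_scores`, `classical_mds`), the use of unscaled PCA, and the choice of classical (Torgerson) MDS as the MDS variant.

```python
import numpy as np

def make_fp(bits, n_bits=32):
    """Build a toy binary fingerprint with the given bit positions set."""
    fp = np.zeros(n_bits, dtype=np.uint8)
    fp[list(bits)] = 1
    return fp

def hamming_distance(a, b):
    """Number of bit positions where the two fingerprints differ."""
    return int(np.sum(a != b))

def tanimoto_similarity(a, b):
    """Shared 1s divided by 1s present in either fingerprint; shared 0s are ignored."""
    both = int(np.sum(a & b))
    either = int(np.sum(a | b))
    return 1.0 if either == 0 else both / either

def pca_scores(X):
    """Unscaled PCA scores (all components) via SVD of the mean-centered matrix."""
    Xc = X - X.mean(axis=0)
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    return U * s

def classical_mds(D, k=2):
    """Classical (Torgerson) MDS: top-k eigenvectors of the double-centered squared distances."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J
    w, V = np.linalg.eigh(B)
    order = np.argsort(w)[::-1][:k]
    w_top = np.clip(w[order], 0.0, None)  # guard against small negative eigenvalues
    return V[:, order] * np.sqrt(w_top)

# Hypothetical molecules: A and B are small and share substructures,
# C is small but shares nothing with A, D is large and contains all of A.
names = ["A", "B", "C", "D"]
fps = np.array([
    make_fp({0, 1, 2}),
    make_fp({0, 1, 3}),
    make_fp({10, 11, 12}),
    make_fp(range(10)),
])
n = len(fps)

hamming = np.array([[hamming_distance(fps[i], fps[j]) for j in range(n)] for i in range(n)])
tanimoto = np.array([[tanimoto_similarity(fps[i], fps[j]) for j in range(n)] for i in range(n)])

# Pairwise Euclidean distances between rows of the full PCA score matrix.
scores = pca_scores(fps.astype(float))
pca_dist = np.linalg.norm(scores[:, None, :] - scores[None, :, :], axis=-1)

# 2-D embedding from the Tanimoto distance matrix (1 - similarity).
mds_coords = classical_mds(1.0 - tanimoto, k=2)

for i in range(n):
    for j in range(i + 1, n):
        print(names[i], names[j],
              "Hamming:", hamming[i, j],
              "PCA distance squared:", round(float(pca_dist[i, j] ** 2), 6),
              "Tanimoto:", round(float(tanimoto[i, j]), 3))
```

In this toy set, A and C share no substructures (Tanimoto 0) yet are only 6 bits apart in Hamming distance, while D contains every substructure of A (Tanimoto 0.3) and is 7 bits away. Hamming distance therefore ranks the unrelated but similarly sized C as closer to A than the related but larger D, which is the size effect the abstract describes. With unscaled PCA and all components retained, the squared distance between score vectors reproduces the Hamming distance, consistent with the square-root relationship attributed to Gower.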

Item Type: Article
Keywords: chemical fingerprint, PCA, principal components analysis, MDS, multidimensional scaling, distance geometry, Tanimoto similarity, Hamming similarity
Date Deposited: 26 Apr 2016 23:45
Last Modified: 26 Apr 2016 23:45