SUBSET: computation of a representative subset from a large dataset

Introduction

SUBSET is a clustering program for selecting a set of input vectors that is evenly scattered over the entire input space. So far it has been applied in chemo-informatics. Although SUBSET was developed to cluster chemical databases, it contains no algorithm for handling molecular structures. Its inputs are the usual bitstrings, e.g., "1100011001...", which are widely used to represent the presence or absence of molecular fragments and/or structural features. The only similarity measure available so far is the Tanimoto coefficient.
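For two bitstrings, the Tanimoto coefficient is the number of positions set in both strings divided by the number of positions set in at least one of them. The following is a minimal sketch in C that works directly on the text form. The function name tanimoto_text is illustrative and is not taken from the SUBSET source, and returning 1.0 for two all-zero strings is an assumed convention.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Illustrative helper, not part of SUBSET: Tanimoto coefficient of two
 * equal-length strings made of '0' and '1' characters. */
double tanimoto_text(const char *a, const char *b)
{
    size_t n = strlen(a);
    size_t both = 0;    /* positions set in a AND b */
    size_t either = 0;  /* positions set in a OR b  */
    size_t i;

    assert(strlen(b) == n);
    for (i = 0; i < n; i++) {
        int x = (a[i] == '1');
        int y = (b[i] == '1');
        both += (size_t)(x & y);
        either += (size_t)(x | y);
    }
    /* Assumption: two all-zero fingerprints count as identical. */
    return either == 0 ? 1.0 : (double)both / (double)either;
}
```

For example, "1100" and "1000" share one set position out of two set in either string, so the coefficient is 0.5.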

SUBSET works as a batch program and is meant to be very simple to use. There is only one parameter to choose (the Tanimoto similarity threshold). Input data are expected in a simple text format with one entry per line. The program output is a selection of the input lines.

SUBSET has been designed for performance. The algorithm used to compute the Tanimoto coefficient is heavily optimized: the coefficient between two 431-bit vectors can be calculated about 1,900,000 times per second on a 500 MHz Intel Pentium II computer running Linux. The clustering algorithm is based on the Stochastic Clustering Algorithm (SCA) described by Reynolds et al. [Lead Discovery Using Stochastic Cluster Analysis (SCA): A New Method for Clustering Structurally Similar Compounds. J. Chem. Inf. Comput. Sci. 1998; 38(2): 305-312]. Unlike the published algorithm, SUBSET does not randomize the input order of the database vectors. Calculating a subset of 34,471 molecules from the NCI database (~250K entries), using 431-bit vectors and a typical Tanimoto coefficient of 80%, takes about 30 minutes and requires 4 MB of memory.
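The speed figures above suggest bit-level operations on packed fingerprints. As a rough sketch, assume that fingerprints are packed into unsigned long words and that a candidate is kept only when its similarity to every previously kept entry stays below the threshold. The function names and the exact acceptance rule are assumptions made for illustration; neither is taken from the SUBSET source or from the SCA paper.

```c
#include <assert.h>
#include <stddef.h>

/* Portable population count (Kernighan's method): each iteration
 * clears the lowest set bit of w. */
static unsigned popcount_word(unsigned long w)
{
    unsigned c = 0;
    while (w != 0) {
        w &= w - 1;
        c++;
    }
    return c;
}

/* Tanimoto coefficient of two fingerprints packed into nwords words. */
double tanimoto_packed(const unsigned long *a, const unsigned long *b,
                       size_t nwords)
{
    unsigned long both = 0, either = 0;
    size_t i;
    for (i = 0; i < nwords; i++) {
        both += popcount_word(a[i] & b[i]);
        either += popcount_word(a[i] | b[i]);
    }
    /* Assumption: two all-zero fingerprints count as identical. */
    return either == 0 ? 1.0 : (double)both / (double)either;
}

/* Hypothetical greedy selection: visit fingerprints in input order and
 * keep one only if its similarity to every fingerprint kept so far is
 * below sim. fps holds n fingerprints of nwords words each, stored back
 * to back. Indices of kept entries go to selected (capacity n); the
 * return value is the number of entries kept. */
size_t select_subset(const unsigned long *fps, size_t n, size_t nwords,
                     double sim, size_t *selected)
{
    size_t nsel = 0, i, j;

    assert(sim >= 0.0 && sim <= 1.0);
    for (i = 0; i < n; i++) {
        int keep = 1;
        for (j = 0; j < nsel; j++) {
            if (tanimoto_packed(fps + i * nwords,
                                fps + selected[j] * nwords,
                                nwords) >= sim) {
                keep = 0;   /* too close to an entry already kept */
                break;
            }
        }
        if (keep)
            selected[nsel++] = i;
    }
    return nsel;
}
```

Under this rule the result depends on the input order, because the first entry is always kept and later entries are judged against it.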

SUBSET is written in ANSI C, so the source code should be easy to recompile on any platform. Unlike many scientific programs, SUBSET imposes no arbitrary limits on the number of entries, the subset size, or the length of the bitstrings. All data structures grow dynamically when needed. Great care has been taken to ensure that the C code behaves correctly at run time.
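Dynamic growth of this kind is commonly done in C by doubling a buffer with realloc whenever it fills up. The sketch below shows the pattern for a list of selected entry indices. The type IndexList and its functions are hypothetical names chosen for this example and do not describe SUBSET's actual data structures.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical growable list of entry indices. */
typedef struct {
    size_t *items;
    size_t len;   /* number of items in use  */
    size_t cap;   /* number of items allocated */
} IndexList;

/* Append one index, doubling the capacity when the list is full.
 * Returns 1 on success, 0 if memory could not be allocated. */
int index_list_push(IndexList *list, size_t value)
{
    assert(list != NULL);
    if (list->len == list->cap) {
        size_t new_cap = list->cap ? list->cap * 2 : 8;
        size_t *tmp = realloc(list->items, new_cap * sizeof *tmp);
        if (tmp == NULL)
            return 0;   /* the old block is still valid */
        list->items = tmp;
        list->cap = new_cap;
    }
    list->items[list->len++] = value;
    return 1;
}

/* Release the storage and reset the list to its empty state. */
void index_list_free(IndexList *list)
{
    assert(list != NULL);
    free(list->items);
    list->items = NULL;
    list->len = 0;
    list->cap = 0;
}
```

A list starts out as {NULL, 0, 0}; the first push allocates storage because realloc with a null pointer behaves like malloc.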

SUBSET has been applied to an evaluation of the diversity of chemical databases (see Johannes H. Voigt, Bruno Bienfait, Shaomeng Wang, and Marc C. Nicklaus, Comparison of the NCI Open Database with Seven Large Chemical Structural Databases. J. Chem. Inf. Comput. Sci. 2001; 41(3): 702-712).

Download

The latest release of the program is available as subset_1.0.tgz (~64 kB).

Installation

SUBSET is delivered only in source code in the form of a Unix tar archive compressed with gzip. To extract the archive, use the following standard command:

gunzip < 'ARCHIVE_NAME' | tar xfv -

where ARCHIVE_NAME is the name of the downloaded file. (Note: MS Windows users can use WinZip 8 to open the archive.)

Using your Unix shell, change the working directory to the installation directory and type the command:

make

This command compiles and links the program. To check that SUBSET has been built correctly, make also runs a small test suite.

Input File Format

Input files for SUBSET are plain text files in which each line consists of a label, a blank separator, and a bitstring.

One or more blank characters are used as a separator. Example:

Mol 0101010101010
Mol2 0101010010010

More examples can be found in the Test directory. These example files were generated with the CACTVS toolkit (see http://www.xemistry.com/). The CACTVS subdirectory contains a Tcl script for generating SUBSET input files from SMILES, MDL, or any other chemistry file format supported by CACTVS.
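A line in the format described in this section can be split into its two fields with a few standard library calls. The parser below is an illustrative sketch, not SUBSET's own reader. The name parse_entry is hypothetical, and rejecting lines that begin with a blank or contain characters other than 0 and 1 in the bitstring is an assumption.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Illustrative parser, not SUBSET's own: split "label<blanks>bitstring"
 * in place. On success returns 1 and points *label and *bits at
 * NUL-terminated pieces inside line; returns 0 if the line is malformed.
 * The buffer may be modified even when 0 is returned. */
int parse_entry(char *line, char **label, char **bits)
{
    char *p = line;
    size_t n;

    assert(line != NULL && label != NULL && bits != NULL);

    while (*p != '\0' && !isspace((unsigned char)*p))
        p++;                        /* advance to the end of the label */
    if (p == line || *p == '\0')
        return 0;                   /* empty label or no separator */
    *p++ = '\0';                    /* terminate the label */
    while (isspace((unsigned char)*p))
        p++;                        /* skip the run of blanks */

    n = strspn(p, "01");            /* length of the leading 0/1 run */
    if (n == 0)
        return 0;                   /* no bitstring present */
    /* Only trailing whitespace (e.g. the newline) may follow the bits. */
    if (p[n + strspn(p + n, " \t\r\n")] != '\0')
        return 0;
    p[n] = '\0';                    /* terminate the bitstring */

    *label = line;
    *bits = p;
    return 1;
}
```

Because the function writes terminators into the buffer, it must be given a modifiable copy of the line rather than a string literal.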

Usage

Here is a Unix command line example:

subset -sim 0.5 < Test/nci_1000.tab > temp.tab

The argument to the -sim option is the similarity threshold (Tanimoto coefficient), a number in the range 0.0 to 1.0. A smaller value yields a smaller subset.

Limitations

The *reported* number of distance comparisons might be wrong when using large datasets because of an integer overflow. This does not affect the results saved in the output file.
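To see how such a counter can overflow, consider a crude upper bound on the work: if every entry were compared with every selected entry, the NCI run described earlier would need up to 250,000 x 34,471, or about 8.6 billion, comparisons. That is more than a 32-bit unsigned integer can hold (4,294,967,295). The sketch below checks this bound without overflowing itself. The function name is hypothetical, and the width of SUBSET's actual counter is not stated in this document.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical check: would n_entries * n_selected exceed UINT_MAX?
 * A division is used so that the check itself cannot overflow. */
int product_exceeds_uint(unsigned long n_entries, unsigned long n_selected)
{
    /* A subset cannot be larger than the dataset it is drawn from. */
    assert(n_entries >= n_selected);
    if (n_selected == 0)
        return 0;
    return n_entries > (unsigned long)UINT_MAX / n_selected;
}
```

On a platform where unsigned int is 32 bits wide, calling the function with 250000 and 34471 reports that the bound does not fit.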

Disclaimer

This program is provided free of charge to anyone on an "as is" basis, and without warranty of any kind, including but not limited to any implied warranty of merchantability or fitness for a particular purpose. In no event shall the author or the National Institutes of Health be liable for any direct, indirect, incidental, special, or consequential damages arising from use or distribution of this software.

SUBSET was written by Bruno Bienfait while on a Visiting Fellowship at the National Cancer Institute, National Institutes of Health.
