GDC - Generic Data Clustering

The GDC software attempts to mimic the C4.5 system, except that it implements unsupervised clustering algorithms rather than supervised methods. Currently supported clustering algorithms are k-means, k-medoids and agglomerative clustering.
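If you haven't seen k-means before, here is a minimal sketch in Haskell (the language GDC itself is written in) of what the algorithm does. This is purely illustrative -- it is not GDC's actual implementation, and all the names in it (Point, sqDist, centroid, kmeans) are made up for this example:

import Data.List (minimumBy)
import Data.Ord (comparing)

type Point = [Double]

-- squared Euclidean distance, the quantity k-means minimizes
sqDist :: Point -> Point -> Double
sqDist xs ys = sum [ (x - y) * (x - y) | (x, y) <- zip xs ys ]

-- component-wise mean of a non-empty list of points
centroid :: [Point] -> Point
centroid ps = map (/ fromIntegral (length ps)) (foldr1 (zipWith (+)) ps)

-- assignment step: group each point with its nearest centre
assign :: [Point] -> [Point] -> [[Point]]
assign centres ps = [ [ p | p <- ps, nearest p == c ] | c <- centres ]
  where nearest p = minimumBy (comparing (sqDist p)) centres

-- one full iteration; an empty cluster keeps its old centre
step :: [Point] -> [Point] -> [Point]
step ps centres = [ if null c then old else centroid c
                  | (old, c) <- zip centres (assign centres ps) ]

-- iterate until the centres stop moving
kmeans :: [Point] -> [Point] -> [Point]
kmeans ps centres
  | centres' == centres = centres
  | otherwise           = kmeans ps centres'
  where centres' = step ps centres

Each iteration assigns every point to its nearest centre and then moves each centre to the mean of its points. K-medoids is similar except that centres are restricted to being actual data points; agglomerative clustering instead starts with each point in its own cluster and repeatedly merges the closest pair of clusters.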

To run GDC on a Solaris 2.7 machine at ISI, run /nfs/isd/hdaume/bin/gdc; for other systems, or from outside ISI, download the source code (GDC.tar.gz) and compile it with GHC.

Input Format

The input format for GDC is simple: each line corresponds to one data element, and each entry on that line gives its value for one feature, except for the first column, which names the entity (and is not used for clustering). Features should be prescaled. Additionally, the last column may be used to specify the correct classification of the entry, for evaluation purposes (also not used for clustering). For instance, the following is a valid input file:

E1 0.5 1 1.2 0.2 C1
E2 1 0.75 1 1.3 C2
E3 1 1 1 1.4 C2

In this instance, there are three entities (E1, E2 and E3), each with a four-value feature vector, and two classes (C1 and C2). Class names must begin with an alphabetic character and feature values must begin with a digit (i.e., you must write "0.2", not just ".2"). The entity name is also optional, in which case entities will be named En, with n counting up from 1.

We allow Haskell-style comments: single-line comments begin with -- and comment spans are enclosed in {- and -}.
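For example, the data above could be annotated like this (assuming, as seems natural, that comments may appear on lines of their own):

-- three toy entities, four features each
E1 0.5 1 1.2 0.2 C1
{- E2 and E3 share a class; the last
   column is only used for evaluation -}
E2 1 0.75 1 1.3 C2
E3 1 1 1 1.4 C2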

Running GDC

This is equally simple. Assume we saved the above data in a file named test; we could run gdc as:

% gdc test -o test --k-means

This would create several files, which I will explain briefly:

test.clusters

- contains the actual cluster layouts; for instance, for the data above we get:

k chosen by slope statistic = 2

1: [["E1","E2","E3"]]
2: [["E2","E3"],["E1"]]
3: [["E3"],["E2"],["E1"]]
This means that if we want one cluster, we put all three entities together; if we want two clusters (which the slope statistic thinks is the best choice), we put E2 and E3 together and leave E1 by itself; and if we want three clusters, each entity goes in its own cluster.

test.dot

- a file suitable for input to dot that shows you the hierarchical clustering structure

test.graph

- a file suitable for input to xgraph that plots the within-cluster mean squared error (and its slope) against k; a sketch of this quantity appears below

test.pr

- a precision/recall graph for xgraph

test.se

- a k/string-edit-distance graph for xgraph

If you don't include true classification information (i.e., the last column in the input file), gdc won't create the .pr and .se files, since the "right answer" is needed to calculate them.
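For reference, the within-cluster mean squared error plotted in the graph file is presumably the standard quantity sketched below (reusing Point, sqDist and centroid from the k-means sketch above); GDC's exact formula, and its slope statistic in particular, may differ:

-- average squared distance from each point to its cluster's centroid
wcMSE :: [[Point]] -> Double
wcMSE clusters =
  sum [ sqDist p (centroid c) | c <- clusters, p <- c ]
    / fromIntegral (sum (map length clusters))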

Command line arguments

You can run gdc in one of three ways (currently -- this may be expanded later): --k-means, --k-mediods or --agglom. When run in k-medoids or agglomerative mode, you need to specify a distance metric (one of --cosine, --euclidean-distance or --euclidean-sqr). For binary data, you can also use --dice, --jaccard or --overlap for those coefficients. Furthermore, when running in agglomerative mode, you need to specify the clustering metric (i.e., when to merge two clusters): one of --min, --max or --avg (single-link, complete-link and average-link clustering, respectively).
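The metrics themselves are standard; the Haskell definitions below are illustrative only (again, not GDC's code), and I am treating the binary coefficients as similarities -- exactly how gdc converts them to distances is not specified here:

-- distance metrics for real-valued feature vectors
cosineDist, euclidean, euclideanSqr :: [Double] -> [Double] -> Double
cosineDist xs ys = 1 - dot xs ys / (norm xs * norm ys)
  where dot a b = sum (zipWith (*) a b)
        norm a  = sqrt (dot a a)
euclideanSqr xs ys = sum [ (x - y) * (x - y) | (x, y) <- zip xs ys ]
euclidean xs ys = sqrt (euclideanSqr xs ys)

-- similarity coefficients for binary data
dice, jaccard, overlap :: [Bool] -> [Bool] -> Double
dice    xs ys = 2 * inter xs ys / (count xs + count ys)
jaccard xs ys = inter xs ys / count (zipWith (||) xs ys)
overlap xs ys = inter xs ys / min (count xs) (count ys)

count :: [Bool] -> Double
count = fromIntegral . length . filter id

inter :: [Bool] -> [Bool] -> Double
inter xs ys = count (zipWith (&&) xs ys)

-- linkage criteria for agglomerative clustering: cluster-to-cluster
-- distance under --min (single), --max (complete) and --avg (average)
linkMin, linkMax, linkAvg
  :: ([Double] -> [Double] -> Double) -> [[Double]] -> [[Double]] -> Double
linkMin d xs ys = minimum [ d x y | x <- xs, y <- ys ]
linkMax d xs ys = maximum [ d x y | x <- xs, y <- ys ]
linkAvg d xs ys = sum [ d x y | x <- xs, y <- ys ]
                / fromIntegral (length xs * length ys)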

You can specify the prefix of all the output files with -o and the titles of the graphs with --title=STRING.

You can specify the number of clusters (if this isn't specified, it clusters for all possible values): -k3 does only three clusters, and -k3-8 clusters for values of k from 3 to 8 (inclusive).

Finally, you can control the output more precisely. You can suppress the creation of any of the output files by specifying --no-output-EXTENSION, for instance --no-output-graph. Furthermore, you can rename any of these output files by specifying --output-EXTENSION=FILE, for instance --output-se=my-se-file.
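Putting these together, a hypothetical (untested) invocation that clusters test agglomeratively with cosine distance and average linkage, for k from 3 to 8, suppressing the dot file, would look like:

% gdc test -o myrun --agglom --cosine --avg -k3-8 --no-output-dot --title="test run"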

You can get a list of all the options by running gdc --help, but this doesn't include any information about which options go with which other options. gdc should warn you if you try to do something dumb (like specifying --cosine with k-means, which must use squared Euclidean distance), but it might not catch everything. Use your judgment.

If you have any questions or encounter any bugs, please let me know. Finally, if something seems to be running too slowly, it may simply be because I didn't spend much time optimizing the algorithms (ahem, k-means -- k-medoids might be a better bet); if you really want a faster version, let me know and I'll put it on my to-do list.

To-do list

- Allow feature weighting
- Implement divisive clustering
- Optimize k-means; general optimizations
- Allow unknown entries in the input files
- Implement fuzzy k-means
- Allow random seeds to be specified on the command line