ABSTRACT
Fast retrieval of the relevant information from the
databases has always been a significant issue. Differenttechniques have been
developed for this purpose one of them is Data Clustering. Data
clustering is a common technique for statistical data analysis, which is
used in many fields, including machine learning, data mining, pattern
recognition, image analysis and bioinformatics. Clustering is the
classification of similar objects into different groups, or more precisely, the
partitioning of a data set into subsets (clusters), so that the data in each
subset (ideally) share some common trait - often proximity according to some
defined distance measure. The Data Clustering Engine (DCE) can be used
with formatted or unformatted data. The data does not need to be cleaned or
scrubbed to achieve high-quality results.
INTRODUCTION:
Data clustering is a method in which we make cluster of
objects that are somehow similar in characteristics. The criterion for checking
the similarity is implementation dependent.
Clustering is often confused with classification, but there
is some difference between the two. In classification the objects are assigned
to pre defined classes, whereas in clustering the classes are also to be
defined. Precisely, Data Clustering is a technique in which, the information
that is logically similar is physically stored together. In order to increase
the efficiency in the database systems the number of disk accesses are to be
minimized. In clustering the objects of similar properties are placed in one
class of objects and a single access to the disk makes the entire class
available.
Example to Elaborate the Idea of
Clustering
In order to
elaborate the concept a little bit, let us take the example of the library
system. In a library books concerning to a large variety of topics are
available. They are always kept in form of clusters. The books that have some
kind of similarities among them are placed in one cluster. For example, books
on the database are kept in one shelf and books on operating systems are kept
in another cupboard, and so on. To further reduce the complexity, the books
that cover same kind of topics are placed in same shelf. And then the shelf and
the cupboards are labeled with the relative name. Now when a user wants a book
of specific kind on specific topic, he or she would only have to go to that
particular shelf and check for the book rather than checking in the entire
library.
DEFINITIONS:
In this section some
frequently used terms are defined.
1.Cluster
A cluster is an ordered list of objects, which have some common
characteristics. The objects belong to an interval [a , b], in our case [0 , 1]
[1]
2. Distance
Between Two Clusters
The distance between
two clusters involves some or all elements of the two clusters. The clustering
method determines how the distance should be computed.
3. Similarity
A similarity measure
SIMILAR ( Di, Dj ) can be used to represent the
similarity between the documents. Typical similarity generates values of 0 for
documents exhibiting no agreement among the assigned indexed terms, and 1 when
perfect agreement is detected. Intermediate values are obtained for cases of
partial agreement.
4. Average
Similarity
If the similarity
measure is computed for all pairs of documents ( Di, Dj )
except when i=j, an average value AVERAGE SIMILARITY is obtainable.
Specifically, AVERAGE SIMILARITY = CONSTANT SIMILAR ( Di, Dj
), where i=1,2,….n and j=1,2,….n and i <
> j
5. Threshold
The lowest possible
input value of similarity required to join two objects in one cluster.
6. Similarity
Matrix
Similarity between
objects calculated by the function SIMILAR (Di,,Dj),
represented in the form of a matrix is called a similarity matrix.
7. Cluster Seed
First document or
object of a cluster is defined as the initiator of that cluster i.e. every
incoming object’s similarity is compared with the initiator. The initiator is
called the cluster seed.
TYPES OF CLUSTERING METHODS:
There are many
clustering methods available, and each of them may give a different grouping of
a dataset. The choice of a particular method will depend on the type of output
desired, The known performance of method with particular types of data, the
hardware and software facilities available and the size of the dataset. In
general , clustering methods may be divided into two categories based on the
cluster structure which they produce. The non-hierarchical methods divide a
dataset of N objects into M clusters, with or without overlap.
These methods are
sometimes divided into partitioning methods, in which the classes
are mutually exclusive, and the less common clumping method, in
which overlap is allowed. Each object is a member of the cluster with which it
is most similar, however the threshold of similarity has to be defined. The
hierarchical methods produce a set of nested clusters in which each pair of
objects or clusters is progressively nested in a larger cluster until only one
cluster remains.
The hierarchical methods can be further divided into agglomerative
or divisive methods. In agglomerative methods ,
the hierarchy is build up in a series of N-1 agglomerations, or Fusion, of
pairs of objects, beginning with the un-clustered dataset. The less common divisive
methods begin with all objects in a single cluster and at each of N-1 steps
divides some clusters into two smaller clusters, until each object resides in
its own cluster.
Some of the
important Data Clustering Methods are described below.
1.Partitioning
Methods
The partitioning
methods generally result in a set of M clusters, each object belonging to one
cluster. Each cluster may be represented by a centroid or a cluster
representative; this is some sort of summary description of all the objects
contained in a cluster. The precise form of this description will depend on the
type of the object which is being clustered. In case where real-valued data is
available, the arithmetic mean of the attribute vectors for all objects within
a cluster provides an appropriate representative; alternative types of centroid
may be required in other cases, e.g., a cluster of documents can be represented
by a list of those keywords that occur in some minimum number of documents
within a cluster. If the number of the clusters is large, the centroids can be
further clustered to produces hierarchy within a dataset.
Single Pass: A very simple partition method, the single
pass method creates a partitioned dataset as follows:
Make the first
object the centroid for the first cluster.
For the next object,
calculate the similarity, S, with each existing cluster centroid, using some
similarity coefficient.
If the highest
calculated S is greater than some specified threshold value, add the object to
the corresponding cluster and re determine the centroid; otherwise, use the
object to initiate a new cluster. If any objects remain to be clustered, return
to step 2.
As its name implies,
this method requires only one pass through the dataset; the time requirements
are typically of order O(NlogN) for order O(logN) clusters. This
makes it a very efficient clustering method for a serial processor. A
disadvantage is that the resulting clusters are not independent of the order in
which the documents are processed, with the first clusters formed usually being
larger than those created later in the clustering run.
2. Hierarchical
Agglomerative methods
The hierarchical
agglomerative clustering methods are most commonly used. The construction of an
hierarchical agglomerative classification can be achieved by the following
general algorithm.
Find the 2 closest
objects and merge them into a cluster
Find and merge the
next two closest points, where a point is either an individual object or a
cluster of objects.
If more than one
cluster remains, return to step 2
Individual methods
are characterized by the definition used for identification of the closest pair
of points, and by the means used to describe the new cluster when two clusters
are merged.
There are some
general approaches to implementation of this algorithm, these being stored
matrix and stored data, are discussed below
In the second matrix
approach, an N*N matrix containing all pairwise distance values is first
created, and updated as new clusters are formed. This approach has at least an
O(n*n) time requirement, rising to O(n3) if a simple serial scan of
dissimilarity matrix is used to identify the points which need to be fused in
each agglomeration, a serious limitation for large N.
The stored data
approach required the recalculation of pairwise dissimilarity values for each
of the N-1 agglomerations, and the O(N) space requirement is therefore achieved
at the expense of an O(N3) time requirement.
3. The Single
Link Method (SLINK)
The single link
method is probably the best known of the hierarchical methods and operates by
joining, at each step, the two most similar objects, which are not yet in the
same cluster. The name single link thus refers to the joining of pairs
of clusters by the single shortest link between them.
4. The Complete
Link Method (CLINK)
The complete link
method is similar to the single link method except that it uses the least
similar pair between two clusters to determine the inter-cluster similarity (so
that every cluster member is more like the furthest member of its own cluster
than the furthest item in any other cluster). This method is characterized by
small, tightly bound clusters.
5 .The Group
Average Method
The group average
method relies on the average value of the pair wise within a cluster, rather
than the maximum or minimum similarity as with the single link or the complete
link methods. Since all objects in a cluster contribute to the inter –cluster
similarity, each object is, on average more like every other member of its own
cluster then the objects in any other cluster.
6. Text Based
Documents
In the text based
documents, the clusters may be made by considering the similarity as some of
the key words that are found for a minimum number of times in a document. Now
when a query comes regarding a typical word then instead of checking the entire
database, only that cluster is scanned which has that word in the list of its
key words and the result is given. The order of the documents received in the
result is dependent on the number of times that keyword appears in the
document.
Applications
1.Biology
In biology has two main applications in the fields of computational biology
and bioinformatics.
- In transcriptomics,
clustering is used to build groups of genes with related expression
patterns. Often such groups contain functionally related proteins, such as
enzymes for a specific pathway, or genes that are co-regulated. High
throughput experiments using expressed sequence tags (ESTs) or DNA micro
arrays can be a powerful tool for genome annotation, a general aspect of genomics.
- In sequence analysis,
clustering is used to group homologous sequences into gene families. This
is a very important concept in bioinformatics, and evolutionary biology in
general. See evolution by gene duplication.
2.Marketing research
Cluster analysis is widely used in market research when working with
multivariate data from surveys and test panels. Market researchers use cluster
analysis methods to partition the general population of consumers into market
segments and to better understand the relationships between different groups of
consumers/potential customers.
- Segmenting the market and
determining target markets
- Product positioning
- New product development
- Selecting test markets
3.Other
applications
Social network analysis: In the study of social networks, clustering
may be used to recognize communities within large groups of people.
Image segmentation: Clustering can be used to divide a digital image
into distinct regions for border detection or object recognition.
Data mining: Many data mining applications involve partitioning data
items into related subsets; the marketing applications discussed above
represent some examples. Another common application is the division of
documents, such as World Wide Web pages, into genres.