Krify Articles
  Sign Up | Log In     Krify Home | Videos | Buzz | Free SMS | Matrimony | Answers | Web Hosting | Bulk SMS | Mail | News | Blogs


Music | Shopping | Forums | Freelance | Wall papers | Classifieds | Web director | Greetings | Bookmarks | Search | Games | Adman

Home Recent Articles Popular Articles Article of the day
Category list


Advertisements Ask Questions, Share Knowledge with Krify Answers
 
Posted by:  Raghu
 Article viewed:  1371  times



Introduction about data clustering

ABSTRACT 

Fast retrieval of the relevant information from the databases has always been a significant issue. Differenttechniques have been developed for this purpose one of them is Data Clustering. Data clustering is a common technique for statistical data analysis, which is used in many fields, including machine learning, data mining, pattern recognition, image analysis and bioinformatics. Clustering is the classification of similar objects into different groups, or more precisely, the partitioning of a data set into subsets (clusters), so that the data in each subset (ideally) share some common trait - often proximity according to some defined distance measure. The Data Clustering Engine (DCE) can be used with formatted or unformatted data. The data does not need to be cleaned or scrubbed to achieve high-quality results.
 

INTRODUCTION:

Data clustering is a method in which we make cluster of objects that are somehow similar in characteristics. The criterion for checking the similarity is implementation dependent.

Clustering is often confused with classification, but there is some difference between the two. In classification the objects are assigned to pre defined classes, whereas in clustering the classes are also to be defined. Precisely, Data Clustering is a technique in which, the information that is logically similar is physically stored together. In order to increase the efficiency in the database systems the number of disk accesses are to be minimized. In clustering the objects of similar properties are placed in one class of objects and a single access to the disk makes the entire class available.

Example to Elaborate the Idea of Clustering

In order to elaborate the concept a little bit, let us take the example of the library system. In a library books concerning to a large variety of topics are available. They are always kept in form of clusters. The books that have some kind of similarities among them are placed in one cluster. For example, books on the database are kept in one shelf and books on operating systems are kept in another cupboard, and so on. To further reduce the complexity, the books that cover same kind of topics are placed in same shelf. And then the shelf and the cupboards are labeled with the relative name. Now when a user wants a book of specific kind on specific topic, he or she would only have to go to that particular shelf and check for the book rather than checking in the entire library.

 

DEFINITIONS:

In this section some frequently used terms are defined.
 

1.Cluster

A cluster is an ordered list of objects, which have some common characteristics. The objects belong to an interval [a , b], in our case [0 , 1] [1]

2. Distance Between Two Clusters 

The distance between two clusters involves some or all elements of the two clusters. The clustering method determines how the distance should be computed. 

3. Similarity 

A similarity measure SIMILAR ( Di, Dj ) can be used to represent the similarity between the documents. Typical similarity generates values of 0 for documents exhibiting no agreement among the assigned indexed terms, and 1 when perfect agreement is detected. Intermediate values are obtained for cases of partial agreement.

4. Average Similarity 

If the similarity measure is computed for all pairs of documents ( Di, Dj ) except when i=j, an average value AVERAGE SIMILARITY is obtainable. Specifically, AVERAGE SIMILARITY = CONSTANT SIMILAR ( Di, Dj ), where i=1,2,….n and j=1,2,….n and i < > j 

5. Threshold 

The lowest possible input value of similarity required to join two objects in one cluster. 

6. Similarity Matrix 

Similarity between objects calculated by the function SIMILAR (Di,,Dj), represented in the form of a matrix is called a similarity matrix. 

7. Cluster Seed

First document or object of a cluster is defined as the initiator of that cluster i.e. every incoming object’s similarity is compared with the initiator. The initiator is called the cluster seed. 

TYPES OF CLUSTERING METHODS:

There are many clustering methods available, and each of them may give a different grouping of a dataset. The choice of a particular method will depend on the type of output desired, The known performance of method with particular types of data, the hardware and software facilities available and the size of the dataset. In general , clustering methods may be divided into two categories based on the cluster structure which they produce. The non-hierarchical methods divide a dataset of N objects into M clusters, with or without overlap. 

These methods are sometimes divided into partitioning methods, in which the classes are mutually exclusive, and the less common clumping method, in which overlap is allowed. Each object is a member of the cluster with which it is most similar, however the threshold of similarity has to be defined. The hierarchical methods produce a set of nested clusters in which each pair of objects or clusters is progressively nested in a larger cluster until only one cluster remains.

The hierarchical methods can be further divided into agglomerative or divisive methods. In agglomerative methods , the hierarchy is build up in a series of N-1 agglomerations, or Fusion, of pairs of objects, beginning with the un-clustered dataset. The less common divisive methods begin with all objects in a single cluster and at each of N-1 steps divides some clusters into two smaller clusters, until each object resides in its own cluster. 

Some of the important Data Clustering Methods are described below. 

1.Partitioning Methods

The partitioning methods generally result in a set of M clusters, each object belonging to one cluster. Each cluster may be represented by a centroid or a cluster representative; this is some sort of summary description of all the objects contained in a cluster. The precise form of this description will depend on the type of the object which is being clustered. In case where real-valued data is available, the arithmetic mean of the attribute vectors for all objects within a cluster provides an appropriate representative; alternative types of centroid may be required in other cases, e.g., a cluster of documents can be represented by a list of those keywords that occur in some minimum number of documents within a cluster. If the number of the clusters is large, the centroids can be further clustered to produces hierarchy within a dataset. 

Single Pass: A very simple partition method, the single pass method creates a partitioned dataset as follows: 

Make the first object the centroid for the first cluster.

For the next object, calculate the similarity, S, with each existing cluster centroid, using some similarity coefficient.

If the highest calculated S is greater than some specified threshold value, add the object to the corresponding cluster and re determine the centroid; otherwise, use the object to initiate a new cluster. If any objects remain to be clustered, return to step 2.

As its name implies, this method requires only one pass through the dataset; the time requirements are typically of order O(NlogN) for order O(logN) clusters. This makes it a very efficient clustering method for a serial processor. A disadvantage is that the resulting clusters are not independent of the order in which the documents are processed, with the first clusters formed usually being larger than those created later in the clustering run. 

2. Hierarchical Agglomerative methods 

The hierarchical agglomerative clustering methods are most commonly used. The construction of an hierarchical agglomerative classification can be achieved by the following general algorithm. 

Find the 2 closest objects and merge them into a cluster

Find and merge the next two closest points, where a point is either an individual object or a cluster of objects.

If more than one cluster remains, return to step 2

Individual methods are characterized by the definition used for identification of the closest pair of points, and by the means used to describe the new cluster when two clusters are merged. 

There are some general approaches to implementation of this algorithm, these being stored matrix and stored data, are discussed below 

In the second matrix approach, an N*N matrix containing all pairwise distance values is first created, and updated as new clusters are formed. This approach has at least an O(n*n) time requirement, rising to O(n3) if a simple serial scan of dissimilarity matrix is used to identify the points which need to be fused in each agglomeration, a serious limitation for large N.

The stored data approach required the recalculation of pairwise dissimilarity values for each of the N-1 agglomerations, and the O(N) space requirement is therefore achieved at the expense of an O(N3) time requirement.

3. The Single Link Method (SLINK) 

The single link method is probably the best known of the hierarchical methods and operates by joining, at each step, the two most similar objects, which are not yet in the same cluster. The name single link thus refers to the joining of pairs of clusters by the single shortest link between them. 

4. The Complete Link Method (CLINK)

The complete link method is similar to the single link method except that it uses the least similar pair between two clusters to determine the inter-cluster similarity (so that every cluster member is more like the furthest member of its own cluster than the furthest item in any other cluster). This method is characterized by small, tightly bound clusters. 

5 .The Group Average Method 

The group average method relies on the average value of the pair wise within a cluster, rather than the maximum or minimum similarity as with the single link or the complete link methods. Since all objects in a cluster contribute to the inter –cluster similarity, each object is, on average more like every other member of its own cluster then the objects in any other cluster.

 

6. Text Based Documents

In the text based documents, the clusters may be made by considering the similarity as some of the key words that are found for a minimum number of times in a document. Now when a query comes regarding a typical word then instead of checking the entire database, only that cluster is scanned which has that word in the list of its key words and the result is given. The order of the documents received in the result is dependent on the number of times that keyword appears in the document.

Applications

1.Biology

In biology has two main applications in the fields of computational biology and bioinformatics.

  • In transcriptomics, clustering is used to build groups of genes with related expression patterns. Often such groups contain functionally related proteins, such as enzymes for a specific pathway, or genes that are co-regulated. High throughput experiments using expressed sequence tags (ESTs) or DNA micro arrays can be a powerful tool for genome annotation, a general aspect of genomics.
  • In sequence analysis, clustering is used to group homologous sequences into gene families. This is a very important concept in bioinformatics, and evolutionary biology in general. See evolution by gene duplication.

2.Marketing research

Cluster analysis is widely used in market research when working with multivariate data from surveys and test panels. Market researchers use cluster analysis methods to partition the general population of consumers into market segments and to better understand the relationships between different groups of consumers/potential customers.

  • Segmenting the market and determining target markets
  • Product positioning
  • New product development
  • Selecting test markets

3.Other applications

Social network analysis: In the study of social networks, clustering may be used to recognize communities within large groups of people.

Image segmentation: Clustering can be used to divide a digital image into distinct regions for border detection or object recognition.

Data mining: Many data mining applications involve partitioning data items into related subsets; the marketing applications discussed above represent some examples. Another common application is the division of documents, such as World Wide Web pages, into genres.

 

 


Disclaimer: The above article is responsible of the individual who post, krify.com does not hold responsible for any kind of disinformation. If you discover one or more of the krify.com pages direct you to messages that harass, abuse, have obscene, unlawful, defamatory, libellous, hateful, or otherwise objectionable content; or have spam, please inform to krify.com and that will be deleted as soon as possible.

Other interesting Article in the Category Other Tech Articles
  e=mc2 103 years later, Einstein's proven right
  The Nature of Electricity
  Learn to Control Your PC
  Boyle’s Law
  Air Pressure
  Planetary Motion
  Levers and Buoyancy
  Get your electricity from outer space
  CERN launches world's largest particle collider
  THE MYTH OF REFRESHING THE DESKTOP

Write your Comment
Name:
Message:
 Verification code:    

Home Email to Friend
Google Pack

© Copyright - 2007 KRIFY SOFTWARE TECHNOLOGIES (P) LTD