Talk:Cluster analysis/Archive 1

This is an archive of past discussions. Do not edit the contents of this page. If you wish to start a new discussion or revive an old one, please do so on the current talk page.

Missing reference

I find this in the article:

This is the basic structure of the algorithm (J. MacQueen, 1967):

But when I looked at the bibliography, it was not there. If anyone has the information, could they add it? Michael Hardy 18:36, 21 November 2005 (UTC)

Impossibility Theorem

The clustering impossibility theorem should be mentioned. It is similar to Arrow's impossibility theorem, except for clustering. I don't know who created it though. Briefly, it states that any clustering algorithm should have these 3 properties:

Richness - "all labelings are possible"
Scale invariance
Consistency - "if you increase inter-cluster distance and decrease intra-cluster distance, cluster assignments should not change"

No clustering algorithm can have all 3.

-- BAxelrod 03:53, 16 December 2005 (UTC)

It is a good thing to have and mostly one should reference, I guess \bibitem{JK02} Jon Kleinberg. An impossibility theorem for clustering - Advances in Neural Information Processing Systems, 2002.

However, if there is not another source, then I'd mention that there is a little problem with this theorem as it is presented in that article.

First, it deals with graphs G(V,E), having |V| >= 2 and having distances d(i,j) = 0 iff i==j, i,j in V. Thus, take richness and scale invariance (which means that a graph with some fixed weights has the same clustering if all the weights are multiplied by some positive constant), a graph with |V| = 2, and boom - here you go. For each clustering we get either scale invariance or richness. If there is richness, then scale invariance does not work and the other way round. Sweet, is not it? Or am I wrong somewhere?
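The two-point case described above can be checked by hand. With |V| = 2, a distance function is a single positive number d, and the only possible partitions are "together" and "apart". The sketch below uses two hypothetical clustering rules (the names and the cutoff are illustrative assumptions, not taken from Kleinberg's paper): a threshold rule that is rich but not scale-invariant, and a constant rule that is scale-invariant but not rich.

```python
# Two points only, so a distance function is just one positive number d.
# Both clustering rules below are hypothetical illustrations.

TOGETHER = ((0, 1),)      # one cluster containing both points
APART = ((0,), (1,))      # two singleton clusters


def threshold_rule(d, cutoff=1.0):
    """Merge the two points only when they are closer than the cutoff."""
    return TOGETHER if d < cutoff else APART


def constant_rule(d):
    """Ignore the distance and always merge."""
    return TOGETHER


def is_rich(rule, distances):
    """Richness: every partition is produced by some distance function."""
    return {rule(d) for d in distances} == {TOGETHER, APART}


def is_scale_invariant(rule, distances, factors):
    """Scale invariance: rule(d) == rule(a * d) for every factor a tried."""
    return all(rule(d) == rule(a * d) for d in distances for a in factors)


distances = [0.1, 0.5, 2.0, 10.0]
factors = [0.01, 0.5, 3.0, 100.0]

print(is_rich(threshold_rule, distances),
      is_scale_invariant(threshold_rule, distances, factors))
print(is_rich(constant_rule, distances),
      is_scale_invariant(constant_rule, distances, factors))
```

The finite grids only probe a few values, so this is an illustration rather than a proof. The underlying argument is that on two points scale invariance forces the same output for every d, so a scale-invariant rule cannot be rich.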

Could you please explain something about Isodata Algorithm for data clustering

The last external link on this page has an example on ISODATA clustering. I will try to do a digest when I have time, but feel free to beat me to it. Bryan

On-line versus off-line clustering

Beyond the division of clustering methodologies into hierarchical/partitional and agglomerative/divisive, it is possible to differentiate between:

Arrival of data points: on-line/off-line
Type of data: stationary/non-stationary

Additionally it may be helpful to discuss some of the difficulties in clustering data, in particular choosing the correct number of centroids and limits on memory or processing time, and techniques for solving them, such as measuring phase transition (Rose, Gurewitz & Fox).

Another division would be Extensional vs. Conceptual Clustering. --Beau 10:40, 28 June 2006 (UTC)

Relation to NLP

It seems a large amount of the effort in text mining related to text clustering is left out of this article, although this seems to be the most appropriate place for it. Josh Froelich 20:16, 9 January 2007 (UTC)

Affinity Propagation (Clustering by Passing Messages Between Data Points)

I believe that this algorithm, developed at the University of Toronto by Brendan Frey, a professor in the Department of Electrical and Computer Engineering, and Delbert Dueck, a member of his research group, and published in Science (journal) in Feb 07, will change the way people think about clustering. See www.psi.toronto.edu/affinitypropagation/ and www.sciencemag.org/cgi/content/full/sci;315/5814/972 . However, I am not capable of writing a full introduction, so I hope someone better equipped for the job will do that. Including the AP breakthrough is a must, in my view, to retain the currency of this article. Bunty.Gill 08:50, 24 April 2007 (UTC)

QT: recursion or iteration?

"Recurse with the reduced set of points." (with link to recursion)

Is this really recursion? I would call it iteration. You repeat the process until you can't go any further, then you stop. Sounds like a while loop to me.

--84.9.95.214 17:28, 1 July 2007 (UTC)

If you take a look at the Recursion (computer science) article, it says: "Any function that can be evaluated by a computer can be expressed in terms of recursive functions without the use of iteration, in continuation-passing style; and conversely any recursive function can be expressed in terms of iteration." That is, you can rewrite anything primitive recursive as an iterative algorithm. Here, recursion is meant in the philosophical sense, since you apply the same analysis to a reduced set of points.
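To make the equivalence concrete, here is a minimal sketch of a simplified QT-style procedure on distinct 1-D points, written once recursively and once as a while loop. The helper names (`candidate`, `largest_candidate`) and the greedy growth rule are assumptions for illustration and are not claimed to match the article's exact description.

```python
def candidate(seed, points, max_diameter):
    """Grow a cluster around seed, always adding the point that keeps
    the cluster diameter smallest, until the threshold would be exceeded."""
    cluster = [seed]
    rest = [p for p in points if p != seed]   # assumes distinct values
    while rest:
        best = min(rest, key=lambda p: max(abs(p - q) for q in cluster))
        if max(abs(best - q) for q in cluster) > max_diameter:
            break
        cluster.append(best)
        rest.remove(best)
    return cluster


def largest_candidate(points, max_diameter):
    """Pick the biggest candidate cluster (first one wins on ties)."""
    return max((candidate(p, points, max_diameter) for p in points), key=len)


def qt_recursive(points, max_diameter):
    """Take the largest cluster, then recurse on the remaining points."""
    if not points:
        return []
    best = largest_candidate(points, max_diameter)
    remaining = [p for p in points if p not in best]
    return [sorted(best)] + qt_recursive(remaining, max_diameter)


def qt_iterative(points, max_diameter):
    """Same procedure expressed as a while loop."""
    clusters = []
    remaining = list(points)
    while remaining:
        best = largest_candidate(remaining, max_diameter)
        clusters.append(sorted(best))
        remaining = [p for p in remaining if p not in best]
    return clusters


data = [1.0, 1.2, 1.4, 5.0, 5.1, 9.0]
print(qt_recursive(data, 0.5))
print(qt_iterative(data, 0.5))
```

Both versions do the same work in the same order; the recursive call is in tail position, which is why it converts directly into a loop.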

Requested move: Should be entitled 'Cluster Analysis' and not 'Data Clustering'.

As mentioned in the article there are several different terms used to describe cluster analysis. However, the most frequent term used in books, papers, etc, always appears to be cluster analysis by a fair margin. It would make sense to use the most widely used term as the page name.

Approximately 7-10 days ago I collected statistics on the number of hits on each term from several different sources. The data could be used as an approximate indicator for the prevalence of each term. The numbers in brackets represent papers released since January 2000. On average the ratio of [papers published since January 2000 to number of papers overall] for the ACM and IEEE combined is approximately 0.757 and 0.759 for data clustering and cluster analysis respectively, thereby suggesting a relatively stable prevalence of each term in the literature over time.

Source	IEEE	ACM	Alexa	Yahoo	Google	Ask	Gigablast	Live Search
Results for "Data Clustering"	524 (368)	682 (555)	20,000	96,900	353,000	43,600	38,219	27,279
Results for "Cluster Analysis"	571 (511)	1159 (724)	130,000	715,000	1,860,000	307,200	316,708	148,096

Based on these results, and the titles and content of books on the subject, I propose that the page title be changed to "Cluster Analysis".

--MatthewKarlsen 16:07, 15 July 2007 (UTC)
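The ratios quoted above can be recomputed from the IEEE and ACM columns of the table. This sketch assumes the figures are unweighted means of the per-source ratios; the post does not say how the average was taken, so that is a guess.

```python
# (total hits, hits since January 2000), copied from the table above
counts = {
    "Data Clustering": {"IEEE": (524, 368), "ACM": (682, 555)},
    "Cluster Analysis": {"IEEE": (571, 511), "ACM": (1159, 724)},
}


def mean_recent_ratio(per_source):
    """Unweighted mean of recent/total over the sources (an assumption)."""
    ratios = [recent / total for total, recent in per_source.values()]
    return sum(ratios) / len(ratios)


for term, per_source in counts.items():
    print(term, round(mean_recent_ratio(per_source), 4))
```

Under that assumption the results come out close to the quoted 0.757 and 0.759, differing only around the third decimal place, which could be down to rounding or truncation.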

Page moved, per unopposed request. Cheers. -GTBacchus^(talk) 01:00, 23 July 2007 (UTC)

Normalized Google Distance

Normalized Google Distance (NGD) -- when I saw this I thought it was a prank. It turns out that someone has actually written a (surprisingly well-informed) paper or two on this [1] -- but that does not mean it is a serious approach (or that it is not just the type of prank that bored computer scientists come up with in their free time).

Is anyone here informed on this? Is NGD a viable norm function (in its domain)? I have been unable to find any peer-reviewed publications regarding the topic (New Scientist hardly counts). --SteelSoul 23:12, 1 November 2007 (UTC)
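For readers who have not seen the definition, NGD is usually stated as (max(log f(x), log f(y)) - log f(x, y)) / (log N - min(log f(x), log f(y))), where f counts the pages containing the term(s) and N is the number of pages indexed. The sketch below only evaluates that formula; the hit counts are made up for illustration, and it says nothing about whether the measure satisfies the axioms of a metric.

```python
import math


def ngd(fx, fy, fxy, n):
    """Normalized Google Distance from hit counts.

    fx, fy : pages containing each term alone
    fxy    : pages containing both terms
    n      : total number of indexed pages
    """
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))


# Made-up counts, for illustration only.
print(ngd(1000, 1000, 1000, 10**6))   # terms that always co-occur
print(ngd(1000, 1000, 10, 10**6))     # terms that rarely co-occur
```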

Elbow criterion section

Someone should edit out the blatant advertising for the excel plugin product. It would also be nice to have some additional info on picking the number of clusters?

Sorry, I just had Excel at hand when making the graph. This is not an advertisement, this is not real data, but a crappy Excel hand made graph to help visualize the EC heuristic. If you look through the archives, you will see I also made an extremely ugly representation for the hierarchical clustering, which was thankfully replaced.

As for how to tweak clustering, either you have a good idea of how many clusters you want (number criterion), or a good idea of how much total variance (or another perf metric) you want to explain, or all you are left with is heuristic. That's from a Computer science POV.

Looking in another direction, for example statistics, there are ways to compare models with a different number of parameters, like Akaike information criterion and the methods linked from there, or maybe something based on information entropy. It will again help you choose in the tradeoff between having too many clusters or having too low perf because of a low cluster number. I'm sorry, I don't have any relevant article nor the time at the moment to find one. Bryan

Actually ran into an article about using entropy criterion to stop clustering: Cluster Identification Using Maximum Configuration Entropy, by C.H. Li. —The preceding unsigned comment was added by 86.53.54.179 (talk) 19:02, 1 April 2007 (UTC).

I'm wondering if there isn't an error in this sentence: "Percent of variance explained or percent of total variance is the ratio of within-group variance to total variance." I'm thinking that as the number of clusters increases, the within-group variation decreases, which is not what is shown on the graph. Should this be "... the ratio of the between-group variance to the total variance." Mhorney (talk) 17:57, 11 January 2008 (UTC)
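A small numeric check of this point, as a sketch on made-up 1-D data with hand-picked labelings (no clustering algorithm is run). It computes the between-group sum of squares divided by the total sum of squares; the within-group share is one minus that value.

```python
def explained_variance(values, labels):
    """Between-group sum of squares divided by total sum of squares."""
    grand_mean = sum(values) / len(values)
    total_ss = sum((v - grand_mean) ** 2 for v in values)
    groups = {}
    for v, g in zip(values, labels):
        groups.setdefault(g, []).append(v)
    between_ss = sum(
        len(members) * (sum(members) / len(members) - grand_mean) ** 2
        for members in groups.values()
    )
    return between_ss / total_ss


values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]     # made-up data
labelings = {
    1: [0, 0, 0, 0, 0, 0],
    2: [0, 0, 0, 1, 1, 1],
    3: [0, 0, 0, 1, 1, 2],
    6: [0, 1, 2, 3, 4, 5],
}

for k, labels in labelings.items():
    print(k, round(explained_variance(values, labels), 3))
```

On this toy data the between-group ratio rises from 0 towards 1 as clusters are added, while the within-group ratio falls, which is consistent with the correction proposed above.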

Addressed by 91.89.16.141 User A1 (talk) 06:38, 17 March 2008 (UTC)

The section is in violation of WP:NOR. —Preceding unsigned comment added by 71.100.12.147 (talk) 11:09, 11 September 2008 (UTC)

"Elbow criterion" is in violation of WP:NOR

Theoretical and Empirical Separation Curves

Use of the term "cluster" to refer to a subset of a group of attributes which define a bounded class shows an obvious lack of comprehension of the subject matter. Use of the term "cluster" is not valid in this context when referring to the number of attributes as a selected subset of a group of attributes but valid only when referring to a multiset count of the values of an attribute where the count of the set or multiset values equals the number of clusters.

The number of attributes selected as the number of attributes in the subset is not arbitrarily selected or fixed but initially set to one for the first separation analysis and thereafter progressively incremented until 100% separation is achieved or to some point prior to target set size exceeding computer capacity or the time allocated for classification is exceeded. The minimum number of attributes (not clusters) is determined mathematically as follows:

 $t_{min}={\frac {\log G}{\log V}}$ , where:

t_min is the minimal number of characteristics to result in theoretical separation,
G is the number of elements in the bounded class and
V is the highest value of logic in the group.

71.100.14.204 (talk) 20:47, 11 September 2008 (UTC)
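One way to read the formula above (an interpretation, not something stated in the comment): t attributes that each take one of V values can distinguish at most V^t elements, so separating G elements needs V^t >= G, that is, t >= log G / log V. The sketch below evaluates the formula and, under that reading, finds the smallest whole number of attributes with integer arithmetic. The function names are hypothetical.

```python
import math


def t_min(g, v):
    """The formula as written: log G / log V (may be fractional)."""
    return math.log(g) / math.log(v)


def min_attributes(g, v):
    """Smallest integer t with v**t >= g (requires v >= 2).

    Uses integer arithmetic to avoid floating-point rounding at
    exact powers.
    """
    t = 0
    while v ** t < g:
        t += 1
    return t


print(t_min(1000, 10), min_attributes(1000, 10))
print(t_min(1001, 10), min_attributes(1001, 10))
```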

Wikipedia:Articles for deletion/Optimal classification. I see you are attacking other clustering algorithms now. How nice. --Jiuguang (talk) 16:40, 11 September 2008 (UTC)

Actually wang I almost informed you of the opportunity here to apply the other skill you know best besides backstabbing, which is nominating articles for deletion, but decided why bother. Having added stalking to your list of skills it would not be long before you showed up without my help. BTW - I know why you are not into this stuff and want to delete it... robots don't need it, they go right from decision table construction to doing their thing.

The only reason you are tagging this as OR is because it directly contradicts your statements on Rypka's method as the only algorithm that can determine the optimal cluster size. Please stop - you've already been blocked for sock puppetry, personal attacks, evading infinite blocks, etc; you don't want to add vandalism to the list. --Jiuguang (talk) 23:38, 11 September 2008 (UTC)

You're delusional wang. Get a life. —Preceding unsigned comment added by 71.100.167.222 (talk) 23:40, 11 September 2008 (UTC)

Well... doesn't this IP look familiar? - Jameson L. Tai ^{talk ♦ contribs} 05:33, 12 September 2008 (UTC)

Not quite as familiar as the jail cells with which you and the other members of the robotics cabal will become. —Preceding unsigned comment added by 71.100.10.82 (talk) 09:04, 12 September 2008 (UTC)

Robotics Cabal! That's a new one. Actually, it's quite catchy. Jiuguang, would you like to join the Robotics Cabal? - Jameson L. Tai ^{talk ♦ contribs} 15:17, 12 September 2008 (UTC)

He's probably busy with his buddy Chavez. Maybe he'll be free later, or you could join them. —Preceding unsigned comment added by 71.100.3.239 (talk) 17:59, 12 September 2008 (UTC)

A user coming from several different IP-numbers 71.100.*.* (DSL verizon) has a compulsion to add tags such as "original research" around the "elbow criterion". It is not apparent to me why this is so. Looking at the explained variance as a function of the number of clusters is a well-known method. I haven't heard of the term "elbow criterion" before, but looking at Google Scholar [2] there seems to be no doubt that it is used in peer-reviewed communication. — fnielsen (talk) 12:24, 15 September 2008 (UTC)

Right, I'm adding tags just out of compulsion to add tags and not because the content is bogus. My question is why you guys do not want trash to be replaced with bona fide content? —Preceding unsigned comment added by 71.100.4.227 (talk) 17:16, 15 September 2008 (UTC)

71.100.*.*, let me put it this way - your conduct is in direct violation of the Verizon Internet Acceptable Use Policy (see [3]), and consider this your final warning to stop your range of disruptive activities on Wikipedia, including but not limited to

Vandalism
Personal attacks
Trolling of WP:Reference desk and WP:Village Pump.
Harassment of editors, both on and off wiki

A report to Wikipedia:Abuse reports will take place for any further abuse, which will then result in an official communication to Verizon, possibly leading to the termination of your account (as detailed by the AUP). You don't want this. Please stop. --Jiuguang (talk) 17:50, 15 September 2008 (UTC)

image for intuition

Hey, it would be nice to have some image that intuitively shows the idea of clustering. Usually in courses on machine learning or tutorials on clustering such images are shown. They are usually two dimensional depicting a number of points dispersed in the coordinate system, circles mark clusters/groups of points. I think for somebody opening an article about clustering and who is new to the topic such an image could be very helpful. Ben ^T/_C 14:03, 10 February 2009 (UTC)

Hierarchical clustering should be its own article

That's a pretty big topic. Shouldn't be just a subsection of this article. I may have to start one if no else does within the next several months. Makewater (talk) 20:36, 6 April 2009 (UTC)

I agree. Unfortunately I know very little about it. -3mta3 (talk) 08:14, 7 April 2009 (UTC)

Standalone page for "Choosing the number of clusters in a data set"

The number of different ways to choose k seems to warrant more than a subsection on this page, especially since identification of the number of clusters in a data set is a separate issue from ways of actually performing clustering. I've expanded the former subsection on the topic into a standalone page, Determining the number of clusters in a data set. -JohnMeier (talk) 00:42, 8 April 2009 (UTC)

link to cluster sampling

Hi, I found this page when I was looking for information on Cluster Sampling. Perhaps there should be one of those nifty disambiguation links at the top of this page. I don't really understand what this cluster analysis thingy is or how important it is so I'm not sure whether or not such a link would be justified, but it would have saved me some time. 220.239.204.226 (talk) 05:43, 4 November 2009 (UTC)

Red links are created based on the following.....

http://scholar.google.com/scholar?as_q=population+management&num=10&btnG=Search+Scholar&as_epq=&as_oq=&as_eq=&as_occt=title&as_sauthors=&as_publication=&as_ylo=&as_yhi=&as_sdt=1.&as_sdtp=on&as_sdts=5&hl=en

--222.64.209.26 (talk) 03:37, 20 November 2009 (UTC)

http://scholar.google.com/scholar?hl=en&q=allintitle%3A+similarity+testing&btnG=Search&as_sdt=2000&as_ylo=&as_vis=0

--222.64.209.26 (talk) 04:01, 20 November 2009 (UTC)

http://scholar.google.com/scholar?as_q=similarity+cluster+analysis&num=10&btnG=Search+Scholar&as_epq=&as_oq=&as_eq=&as_occt=title&as_sauthors=&as_publication=&as_ylo=&as_yhi=&as_sdt=1.&as_sdtp=on&as_sdts=5&hl=en

--222.64.209.26 (talk) 04:04, 20 November 2009 (UTC)

http://scholar.google.com/scholar?as_q=&num=10&btnG=Search+Scholar&as_epq=similarity+test&as_oq=&as_eq=&as_occt=title&as_sauthors=&as_publication=&as_ylo=&as_yhi=&as_sdt=1.&as_sdtp=on&as_sdts=5&hl=en

--222.64.209.26 (talk) 04:19, 20 November 2009 (UTC)

http://scholar.google.com/scholar?hl=en&q=%22facial+similarity%22+test&btnG=Search&as_sdt=2000&as_ylo=&as_vis=0

--222.64.209.26 (talk) 04:28, 20 November 2009 (UTC)

http://scholar.google.com/scholar?hl=en&q=allintitle%3A+%22facial+similarity%22&btnG=Search&as_sdt=2000&as_ylo=&as_vis=0

--222.64.209.26 (talk) 04:29, 20 November 2009 (UTC)

Talk:Shadow (disambiguation)#What is it called like....

http://scholar.google.com/scholar?as_q=conscious+embodiment&num=10&btnG=Search+Scholar&as_epq=&as_oq=&as_eq=&as_occt=title&as_sauthors=&as_publication=&as_ylo=&as_yhi=&as_sdt=1.&as_sdtp=on&as_sdts=5&hl=en

--222.64.209.26 (talk) 05:25, 20 November 2009 (UTC)

Daemon (mythology), Demon or spiritual Plug-in --- http://scholar.google.com/scholar?as_q=&num=10&btnG=Search+Scholar&as_epq=unconscious+embodiment&as_oq=&as_eq=&as_occt=any&as_sauthors=&as_publication=&as_ylo=&as_yhi=&as_sdt=1.&as_sdtp=on&as_sdts=5&hl=en

--222.64.209.26 (talk) 05:29, 20 November 2009 (UTC)

It's pity that the application of the technique has been limited

http://scholar.google.com/scholar?hl=en&q=allintitle%3A+population+management+cluster+analysis&btnG=Search&as_sdt=2000&as_ylo=&as_vis=0

--222.64.209.26 (talk) 03:40, 20 November 2009 (UTC)

I'm sure if the CA is used in conjunction with the DNA profiling for Population management, lots of overloads of population can be managed.--222.64.209.26 (talk) 03:46, 20 November 2009 (UTC)

Look at that...

http://scholar.google.com/scholar?hl=en&q=allintitle%3A+population+management+DNA+profiling&btnG=Search&as_sdt=2000&as_ylo=&as_vis=0

--222.64.209.26 (talk) 03:51, 20 November 2009 (UTC)

http://scholar.google.com/scholar?hl=en&q=allintitle%3A+similarity+population+management&btnG=Search&as_sdt=2000&as_ylo=&as_vis=0

--222.64.209.26 (talk) 04:14, 20 November 2009 (UTC)

Addressing WHAT IS NOT

http://scholar.google.com/scholar?hl=en&q=allintitle%3A+difference+testing&btnG=Search&lr=lang_en&as_sdt=2000&as_ylo=&as_vis=0

--222.67.208.51 (talk) 06:46, 24 November 2009 (UTC)

Rewrite

I've tagged this article as a cleanup as it is becoming very confusing and unwieldy. My vague suggestions:

Focus on the main types of clustering, with a section for each. My impression is that the most common are:
- k-means clustering
- Hierarchical clustering (now has its own article)
- Mixture models (currently not mentioned at all)
Simplify all the others to bullet points (or mention in the appropriate section if they are a variant of the above). Create sub articles if need be.
Prune the reference and external links lists

Any comments? —3mta3 (talk) 17:00, 18 May 2009 (UTC)

Support. Also spectral clustering is worth a separate section in my opinion. Took (talk) 22:20, 7 September 2009 (UTC)

I also agree (with both), spectral clustering is widely being used in graph theory. So, I think it should be also a separate section. --Conjugado (talk) 15:19, 11 January 2010 (UTC)

No strict definition for the problem itself.

There are lots of details about different methods and metrics used to solve the problem, but the problem itself is defined too loosely.

Maybe that's right

As I understand it, clustering can be used for different purposes, but it is generally used to find classes of data points that aren't easily noticeable otherwise. However, if you are using the term 'problem' in the CS theoretical sense, then each clustering algorithm really has a different problem.

Darthhappyface (talk) 04:32, 25 May 2010 (UTC)

clustergram - a method for visualizing cluster analysis results

Hi all,

I wrote an article about a method to visualize cluster analysis, here: http://www.r-statistics.com/2010/06/clustergram-a-graph-for-visualizing-cluster-analyses-r-code/

I was wondering where (and if) to add the above information about this method to the page, but couldn't quite figure out where in the page to do so. Any suggestions?

Talgalili (talk) 16:36, 15 June 2010 (UTC)

I don't think it needs to be included. It's just one out of a dozen visualization methods around. The article needs cleanup, not further bloat. --Chire (talk) 16:44, 15 June 2010 (UTC)

Hi Chire, I'll take your opinion in the matter and won't try to add it.

At the same time, why not include in the article various visualization methods for clustering? (or start another article on the subject). I (personally) find it both interesting and useful.

With much respect, Talgalili (talk) 16:50, 15 June 2010 (UTC)

I do agree that cluster visualization is an interesting subject. But beware of WP:COI. I am also not sure about whether it actually suits the "encyclopedia" aspects of Wikipedia. Given that articles these days are quickly deleted, my recommendation would be to start the article in your user namespace (i.e. User:Talgalili/Cluster_visualization), and when you have substantial content, covering multiple visualizations and links to the visualized algorithms (the article you linked seems to be very k-means-centric, and k-means by far is not the most advanced clustering algorithm; its results are quite unstable, and k is not easy to choose right), all the references to the relevant articles, then move it to the main namespace. This will likely save you some headaches and frustration, since new articles often face an "request for deletion" within a few days, unfortunately. --Chire (talk) 17:52, 15 June 2010 (UTC)

P.S. it also is related to Scientific visualization, Data visualization and Scatter plot, these might already contain some clustering visualization information. --Chire (talk) 18:25, 15 June 2010 (UTC)

Hello Chire, thank you for the suggestions :)

It sounds like a big project to take on myself. Maybe at a later stage.

BTW, that technique was only implemented there on k-means, but it is meant for help with assessing any clustering algorithm

Cheers, Talgalili (talk) 19:51, 15 June 2010 (UTC)

Error in one of the formulae for Spectral clustering?

In the section on spectral clustering, I think the formula P = S*D^(-1) should be P = D^(-1)*S instead, as written in eq. 4 of the paper by Meila and Shi, namely "A random walks view of spectral segmentation", Meila, M., Shi J., AISTATS 2001. In general D^(-1) and S DO NOT commute. One definition leads to the transpose of the other for P - same eigenvalues but different eigenvectors. Unless someone can provide a justification for this formula, I think this could be a bona fide error. TonyMath (talk) 19:47, 19 April 2011 (UTC)
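The difference between the two orderings can be seen on a small example. This is a sketch with a made-up symmetric similarity matrix, in plain Python without a linear algebra library: D^(-1)*S divides each row by its degree and is row-stochastic, while S*D^(-1) divides each column and is the transpose of the former.

```python
# Symmetric similarity matrix with made-up values.
S = [[0.0, 2.0, 1.0],
     [2.0, 0.0, 3.0],
     [1.0, 3.0, 0.0]]

n = len(S)
degrees = [sum(row) for row in S]            # diagonal entries of D

# P_left = D^(-1) S: row i of S divided by degree i
P_left = [[S[i][j] / degrees[i] for j in range(n)] for i in range(n)]
# P_right = S D^(-1): column j of S divided by degree j
P_right = [[S[i][j] / degrees[j] for j in range(n)] for i in range(n)]

row_sums_left = [sum(row) for row in P_left]
row_sums_right = [sum(row) for row in P_right]
transpose_right = [[P_right[j][i] for j in range(n)] for i in range(n)]

print(row_sums_left)               # each row of D^(-1) S sums to 1
print(row_sums_right)              # rows of S D^(-1) generally do not
print(transpose_right == P_left)   # the two matrices are transposes
```

Only D^(-1)*S has rows that sum to 1, which is what a random-walk transition matrix needs under the usual row-vector convention.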

Something else: is that definition for the Laplacian matrix correct? The link on the Laplacian matrix gives L = D-A, and that agrees with the Mathworld definition, but I don't know how to reconcile it with the formula used. The formula used is L = I - D^(-1/2)*S*D^(-1/2), which is eq. (5) of another paper by Meila and Shi entitled "Multi-way cuts and spectral clustering", taken from "Spectral Graph Theory" by Fan R.K. Chung, but are the formulae consistent?

Laplacian matrix in Spectral Clustering

This section is not complete. It should give more details about the eigenvector extraction, and explain why. Moreover, the formula for the Laplacian matrix is L=D-S. Indeed, the formula in the article is the one for the normalized Laplacian matrix. This normalization is due to the relaxation of the constraints on the indicator vector. —Preceding unsigned comment added by Guillaumew (talk • contribs) 09:26, 4 May 2011 (UTC)
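The two definitions are related by a scaling: I - D^(-1/2)*S*D^(-1/2) equals D^(-1/2)*(D - S)*D^(-1/2). A minimal sketch that checks this entry by entry on a made-up similarity matrix:

```python
import math

# Symmetric similarity matrix with made-up values.
S = [[0.0, 2.0, 1.0],
     [2.0, 0.0, 3.0],
     [1.0, 3.0, 0.0]]

n = len(S)
d = [sum(row) for row in S]                  # degrees, the diagonal of D

# Unnormalized Laplacian: L = D - S
L = [[(d[i] if i == j else 0.0) - S[i][j] for j in range(n)]
     for i in range(n)]

# Normalized Laplacian: L_sym = I - D^(-1/2) S D^(-1/2)
L_sym = [[(1.0 if i == j else 0.0) - S[i][j] / math.sqrt(d[i] * d[j])
          for j in range(n)] for i in range(n)]

# D^(-1/2) L D^(-1/2), computed entry by entry
scaled_L = [[L[i][j] / math.sqrt(d[i] * d[j]) for j in range(n)]
            for i in range(n)]

max_gap = max(abs(L_sym[i][j] - scaled_L[i][j])
              for i in range(n) for j in range(n))
print(max_gap)                     # numerically zero
print([sum(row) for row in L])     # rows of D - S sum to zero
```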

Limits of Cluster Analysis

I understand that CA will cluster just anything we throw at it: random data, linearly correlated data, etc. Could somebody knowledgeable please point out when NOT to use CA? --Stevemiller (talk) 04:56, 15 February 2008 (UTC)

It is often a good idea to compare the clustering results with random data and then compare it to confidence intervals. So I would say don't use clustering algorithms unless you are looking for or anticipating clusters (or their absence). Clustering algorithms can be used to compare the probability of a particular distribution of clusters forming and compare that to a random (or other comparator) case. For example you might be looking at two sets of star systems, and you want to know if there is a tendency for them to form into clusters when there are detectable levels of ficticium (a fictitious element). So you run your clustering algorithms past the data and see if there is a difference between the star systems with low ficticium and high ficticium, and how that correlates to clustering and if that is statistically significant, most likely using confidence intervals. You also may need to compare it to random data. User A1 (talk) 16:02, 15 February 2008 (UTC)

First of all, many algorithms will only work on vector space data (e.g. k-means needs to be able to compute a mean). Then if you have unnormalized data, it will likely have a bias. And finally, in particular when you choose the various input parameters (e.g. k for k-means, but also distance functions and similar things) inappropriately, you'll often not get any sensible result. So it's not "plug and play", but much of the work is finding the appropriate parameters. --Chire2 (talk) 14:18, 7 May 2010 (UTC)

One person's opinion. Please allow me to suggest a criterion for determining if two separate clusters really should be separate: two distinct well-defined subpopulations constitute distinct clusters if and only if characteristics of interest possess a non-zero difference in their means at a statistically significant level of confidence. Thus, (1) a criterion for identifying distinct clusters is given utilizing a standard, well accepted statistical methodology - the difference between two means (2) whether two clusters are distinct may be ambiguous depending on the confidence level required. For example, two subpopulations may be distinct clusters with 95% confidence but not 99% confidence (3) if no distinct clusters are identified under this criterion, cluster analysis fails: the null hypothesis cannot be rejected and the population should therefore be considered homogeneous at the level of statistical confidence used in testing the hypothesis of distinctness. —Preceding unsigned comment added by Davidjcorliss (talk • contribs) 02:36, 24 May 2011 (UTC)
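As a rough illustration of the criterion suggested above, the following sketch computes a two-sample t statistic for one characteristic measured in two candidate clusters. Welch's unequal-variance form is an assumption made here; the comment does not name a specific test, and the data are made up.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic for the difference between two sample means."""
    standard_error = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error


# Made-up values of one characteristic in two candidate clusters.
cluster_a = [1.0, 2.0, 3.0]
cluster_b = [11.0, 12.0, 13.0]

print(welch_t(cluster_a, cluster_b))   # large magnitude: means differ clearly
print(welch_t(cluster_a, cluster_a))   # zero: identical samples
```

The statistic would still have to be compared against a t distribution at the chosen confidence level. Since a clustering algorithm picks groups that are far apart to begin with, a test run on the same data tends to overstate significance.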

The mean is only a sensible cluster approximation when you have spherical clusters such as produced by kMeans or EM and a sensible notion of mean such as given by a vector space. For the more advanced cluster analysis methods and data models, this may not be appropriate. Even k-medoids already highlights that there may not be a sensible "mean" at all (despite being the simplest modification of k-means that can do without having a computable mean). Maybe you're just using k-means too much. :-) --Chire (talk) 05:29, 24 May 2011 (UTC)

Merger proposal

Yes, the two should be merged. I don't think Wikipedia is a how-to manual - so the applications, use and basic methodology should be in a single article--Maven111 (talk) 12:36, 3 March 2010 (UTC)

The article cluster analysis (in marketing) seems to repeat a lot of information. Should we incorporate it somehow? —3mta3 (talk) 08:19, 21 May 2009 (UTC)

I think it's not a good idea, because the cluster analysis is dealing with the methods and the "in marketing" one is simply mentioning the uses of the analysis in economy. Kroolik (talk) 11:45, 25 May 2009 (UTC)

The applications of cluster analysis are numerous, but in all cases it is for grouping and the underlying method is domain independent. Nearly all of the marketing version is on cluster analysis with some added relationship with factor analysis, and some other multidimensional ordination/projection techniques. The merge seems to be a good idea. An application section can be added to this article. Shyamal (talk) 10:45, 29 May 2009 (UTC)

This section should be merged with market segmentation discussions of benefit segmentation analysis (marketing). It is an application of a broader concept, but it is very specific to benefit segmentation. —Preceding unsigned comment added by 74.75.128.221 (talk) 23:43, 25 October 2009 (UTC)

In my opinion, the cluster analysis article is already too long, so the marketing application should not be merged. Instead, more parts could be split out into separate articles (there is such a request for the agglomerative hierarchical methods) The "applications" section gives a lot more applications than the "market analysis" application. --Chire2 (talk) 13:59, 7 May 2010 (UTC)

Yes - the two should be merged. I apply cluster analysis in marketing segmentation as a consultant and also in astrostatistics and in education analysis in my academic research. Yet, the mathematics is, at all points, the same. It is a single discipline with a multiplicity of applications. Davidjcorliss (talk) 02:46, 24 May 2011 (UTC)

Well, describe the mathematical parts here, and the application in various domains on a separate page? This one is overfull and a mess. And yet, it does barely touch the advanced methods. There is so much beyond k-means and single-link, but still everybody in "application" still uses them despite all their known drawbacks and defects (in particular, preferring same-spatially-sized clusters due to the voronoi partitioning clearly is not sensible in real world applications). Just recently I met a biology researcher who was surprised to find OPTICS produce much more useful results for him on his retina data ... despite OPTICS having been around for over 10 years already it just started to arrive in applications. Therefore in my opinion the article should include much more of these advanced methods and the application examples should be moved to separate pages. --Chire (talk) 05:41, 24 May 2011 (UTC)

Boldface & Robotics project attention needed

There seems to be an inordinate and excessive use of boldface. In particular there is what appears to be simply a list of applications of clustering which has very little prose in it. Some trimming is suggested as per MOS:BOLD#Boldface, where the specific example shows the list as a list article, not in the middle of a non-list article. Chaosdruid (talk) 14:23, 16 February 2012 (UTC)

Spectral Clustering

No link or mention in this article of spectral clustering, which has its own entire wikipedia article. — Preceding unsigned comment added by 192.249.47.174 (talk) 21:05, 19 June 2012 (UTC)

External validation indices

About the pairwise F-measure : "This measure is able to compare clusterings with different numbers of clusters". Actually, all of the reviewed indices are able to do so. Well, there are some well known drawbacks, especially with the Rand index which tends to 1 while the number of clusters increases because of the False Positive term, but you just have to keep that in mind ;) Moreover, if you want to compare clusters independently, a matching is obviously required and it is generally done by the Hungarian method for the sake of efficiency. — Preceding unsigned comment added by 84.98.253.168 (talk) 02:19, 23 October 2012 (UTC)
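The drift of the Rand index towards 1 can be reproduced with a toy construction (an assumption made for illustration): two labelings of 2k points into k clusters each, arranged so that no pair of points is grouped together in both. The index still grows with k, because most pairs are separated in both labelings and those pairs count as agreements.

```python
from itertools import combinations


def rand_index(a, b):
    """Fraction of point pairs on which the two labelings agree
    (both put the pair together, or both put it apart)."""
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)


def toy_labelings(k):
    """Two labelings of 2k points with k clusters each that never
    group the same pair of points together (for k >= 2)."""
    n = 2 * k
    first = [i // 2 for i in range(n)]    # consecutive pairs
    second = [i % k for i in range(n)]    # points k apart
    return first, second


for k in (2, 5, 20, 50):
    print(k, round(rand_index(*toy_labelings(k)), 3))
```

For this construction the index works out to 1 - 2/(2k - 1), so it approaches 1 even though the two clusterings have nothing in common.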

Adding citations to support the statement that Clustering is a main task of explorative data mining

On 11/3/2012 I added 4 independent citations to support the article's statement that Clustering is a main task of explorative data mining. These citations were removed the same day. I propose that it would be good to have citations for claims like this.

I am a novice wikipedia author, and on 11/3, I did not sign my edit or say anything on the TALK page. I now know that I should do both of these things. I should also point out that I am an author on 2 of the citations I added. However, I do not think that there is a COI in this. It is an area in which I have done research, so it is an area I know. I am simply trying to add citations that I think would help the article. However, if other wikipedia authors evaluate this and feel that my 2 citations should be excluded, I would still encourage the community to retain the other two citations that I added on 11/3.

Thank you for considering this. Karl (talk) 01:43, 6 November 2012 (UTC)

Long sentence with little value.

The sentence "Since algorithms that produce clusters with low intra-cluster distances (high intra-cluster similarity) and high inter-cluster distances (low inter-cluster similarity) will have a low Davies–Bouldin index, the clustering algorithm that produces a collection of clusters with the smallest Davies–Bouldin index is considered the best algorithm based on this criterion." is very long and adds very little value to the article. It should be rewritten. — Preceding unsigned comment added by 92.229.28.245 (talk) 13:44, 30 January 2013 (UTC)
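A small worked example might say the same thing more briefly than the quoted sentence. The sketch below assumes the usual textbook form of the Davies–Bouldin index (mean distance to the centroid as the within-cluster scatter, Euclidean distance between centroids); the function names and the two toy clusterings are made up for illustration and are not from the article.

```python
import math


def centroid(points):
    """Coordinate-wise mean of a non-empty list of equal-length tuples."""
    dim = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dim))


def davies_bouldin(clusters):
    """Davies-Bouldin index for a list of clusters (each a list of points).

    Assumes at least two clusters with distinct centroids.
    Lower values mean tighter, better separated clusters.
    """
    cents = [centroid(c) for c in clusters]
    # Within-cluster scatter: mean distance of members to their centroid.
    scatter = [
        sum(math.dist(p, m) for p in c) / len(c)
        for c, m in zip(clusters, cents)
    ]
    k = len(clusters)
    total = 0.0
    for i in range(k):
        # Worst-case ratio of combined scatter to centroid separation.
        total += max(
            (scatter[i] + scatter[j]) / math.dist(cents[i], cents[j])
            for j in range(k)
            if j != i
        )
    return total / k


# Hypothetical toy clusterings: compact and far apart vs. spread and close.
tight = [[(0, 0), (0, 1)], [(10, 0), (10, 1)]]
loose = [[(0, 0), (0, 4)], [(3, 0), (3, 4)]]

if __name__ == "__main__":
    print(davies_bouldin(tight), davies_bouldin(loose))
```

The compact, well-separated clustering gets the smaller index, which is the whole point the long sentence is making.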

Terminology: "classification" is supervised, "clustering" is unsupervised -- Really?

please see Talk:Statistical classification Fgnievinski (talk) 23:38, 3 May 2014 (UTC)

Misplaced comment by Ninjarua

The data in the figures should be explained; that is, the figures should say what the x-axis and the y-axis represent. The same problem also occurs in other sections of this article. Moreover, the data in the two figures are different, so how can we compare the two methods?

The preceding comment was placed at the bottom of the Connectivity based clustering section in the article by Ninjarua. — Anita5192 (talk) 17:55, 21 June 2014 (UTC)

This is artificial data. There is nothing to explain on the axes. The data sets are independent and cannot be compared; but it is the same method. See other sections for results by other methods on the same data. But the article is organized by method, not by data set (as the data sets are not of interest). --Chire (talk) 08:38, 23 June 2014 (UTC)

Multi-assignment clustering?

The topic of multi-assignment clustering seems to be missing altogether from wikipedia. --Nicolamr (talk) 22:55, 28 July 2014 (UTC)

I haven't seen that term used a lot. But isn't this the same as this (in the article):

* overlapping clustering (also: alternative clustering, multi-view clustering): while usually a hard clustering, objects may belong to more than one cluster.

It's not covered in much detail, but it's also not covered much in the literature; there are not many "landmark" approaches yet like k-means and DBSCAN. --Chire (talk) 09:14, 31 July 2014 (UTC)

You are right, it's precisely that. Thanks for the answer. --Nicolamr (talk) 19:57, 31 July 2014 (UTC)

A few missing historical references

The text says: "Cluster analysis was originated in anthropology by Driver and Kroeber in 1932 and introduced to psychology by Zubin in 1938 and Robert Tryon in 1939[1][2] and famously used by Cattell beginning in 1943[3] for trait theory classification in personality psychology."

Driver and Kroeber 1932 and Zubin 1938 do not appear in the reference list. Does anyone have access to Ken (1994) to check the full references for these two works?

--Lucas Gallindo (talk) 23:12, 2 December 2014 (UTC)

All software links were removed

Here they are, should we put them back ?

Software implementations

Free

The flexclust package for R
COMPACT - Comparative Package for Clustering Assessment (in Matlab)
YALE (Yet Another Learning Environment): freely available open-source software for data pre-processing, knowledge discovery, data mining, machine learning, visualization, etc. also including a plugin for clustering, fully integrating Weka, easily extendible, and featuring a graphical user interface as well as a XML-based scripting language for data mining;
mixmod : Model Based Cluster And Discriminant Analysis. Code in C++, interface with Matlab and Scilab
LingPipe Clustering Tutorial Tutorial for doing complete- and single-link clustering using LingPipe, a Java text data mining package distributed with source.
Weka : Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes.
Tanagra : a free data mining software including several clustering algorithms such as K-MEANS, SOM, Clustering Tree, HAC and more.
Cluster : Open source clustering software. The routines are available in the form of a C clustering library, an extension module to Python, a module to Perl.
python-cluster Pure python implementation

Non-free

Clustan
Peltarion Synapse (using self-organizing maps)[4]
Eudaptics Viscovery: Data Mining Suite for Visual Cluster Analysis

I removed these, about a month ago. Wikipedia's external link guidelines discourage large link directories like this. None of the links appear to meet the guidelines, each is just an instance of data clustering software with nothing to indicate it is especially notable to the topic. Normally I would have replaced them with a DMOZ link, but there doesn't appear to be a category for this. -SpuriousQ (talk) 10:14, 22 March 2007 (UTC)

Actually it did take some time to build up such a list of software. Also, since this is an article about algorithms, any software implementing them is relevant to the topic (if you have an interest in the algo, it might very well be because you need a working implementation, or a reference implementation). Bryan

I understand that, but that list was just an invitation for spam and links to data clustering software someone wrote one day. WP:EL is clear that links should be kept to a minimum. I would probably have no problem with links to clearly notable implementations; for example, one that was the subject of a peer-reviewed published paper or one created by noted researchers in the field. -SpuriousQ (talk) 15:12, 28 March 2007 (UTC)

No problem, I accept the policy and see its utility, but think the links and comments are important, that's why I asked for a DMOZ directory to be created. Also, there is a problem with the fact that you removed some internal wikipedia links in the process (the YALE, Weka, Peltarion Synapse links). So, if we want to keep the links somewhere, do we try to move them to DMOZ or do we create an intermediary Wikipedia page for each package ? Bryan

Hmm. The standard way of linking to internal articles is in the See also section. But I feel it's a bit questionable to give such prominence to the three instances that happen to have Wikipedia articles. Another idea would be to have an article List of data clustering software and link to that in See also. I would prefer that option, but it's also iffy to have a list article with so few examples. Do you know of any other articles we could add to that list? -SpuriousQ (talk) 01:58, 30 March 2007 (UTC)

Data mining, software section. The software links there actually suffer the same problem as the ones here, although to a lesser extent (more wikipedia articles, less external links). Please see discussion with David Eppstein. Bryan

I think the list should either be kept out of the article altogether, or it should be explicitly limited to a small number (say half a dozen) of particularly notable clustering systems, for which some strong argument can be made for their inclusion beyond "it's a clustering system" or even "it's a free clustering system". By a strong argument, I mean statements such as "it's the most widely used free clustering system for Linux" or the like. Anything less restrictive is just an invitation for an unencyclopedic link farm. —David Eppstein 03:22, 30 March 2007 (UTC)

I agree, it's time for a cleanup. Just to give a little "historical perspective", the list was started when commercial spam appeared, it kept it under very good control because it gave space for people tempted to spam the page with their soft's link to express themselves. The list has grown, and it now has many annotations, so I'm quite happy with it and would like to keep some of its info.

Now, to continue the conversation together with SpuriousQ, imagine we move everything to a list of Data mining software. We would lose info about implementations (libraries,scripts) that are not autonomous software, thus failing to highlight "reference implementations" that people may want to examine at a code level, only linking to "working implementations" that people would like to use. So my proposition would be: if the soft is autonomous, we add it to the data mining software list. If the soft is just a library, and seems to be from academia, we keep a link to it. In our case, that would keep flexclust, COMPACT, mixmod and Cluster, the python-cluster link would be lost forever, and the rest would be transferred to the more general and richer data mining software list. What do you think ? Bryan.

This would have been very useful a few weeks ago... trawling Google for OSS implementations and avoiding all the spam took ages. I am actually quite angry that someone just removed them all without a discussion. --MatthewKarlsen 09:34, 21 June 2007 (UTC)

There are other well-known (within the field) directories for such things... Maybe use one of them? [5] and [6], for example? stoborrobots (talk) 15:23, 22 February 2011 (UTC)

Yes, they should all be put back. Software links are very useful on statistics pages -- PeterLFlomPhD (talk) 22:21, 23 August 2015 (UTC)

This will just be a useless mess again. There are over 100 clustering packages on CRAN, do you want to link all of them? This topic is too broad for links to be useful. Use Google, or go to a more specific topic. **Wikipedia is not a link directory**, that is DMOZ. 91.52.54.125 (talk) 05:56, 24 August 2015 (UTC)

Software links

I read the earlier discussion "Software links were removed". I can somewhat appreciate the concerns raised but I think there is a problem.

Without information about relevant analysis software, the article seems to me lacking. I came to the article with a general concept of what cluster analysis is about and a desire to find out how to look for clusters in a data set I have.

The problem is that (in my perception) this article goes from the general conceptual level into the advanced level of discussing different theoretic approaches, with very little in between.

The article List of statistical packages has an extensive list. Could this article refer to that article to the extent of indicating packages there that include cluster analysis? Ideally also indicating which packages would be more helpful to someone unused to cluster analysis?

If this sort of information is NOT appropriate to have in the article, can anyone offer me any "private" user talk page advice? Thanks. Wanderer57 (talk) 19:21, 11 May 2011 (UTC)

Link spam is a big issue in this article, therefore I'm opposed to adding a software section. It might however be ok to start an article "List of Cluster Analysis Software" and have it linked. However, understand WP:NOT. If it just boils down to collecting links, it doesn't belong into Wikipedia but instead we should just link to the appropriate DMOZ/ODP category. --Chire (talk) 09:02, 12 May 2011 (UTC)

I think a section here or a separate list is ok if and only if all of the entries in the list are wikilinked articles to notable packages. I'd prefer not to see any external links in such a section or list. —David Eppstein (talk) 15:00, 12 May 2011 (UTC)

This also occurred to me on that List of statistical packages - pretty much all of them have Wikipedia articles, so this must be a list with "notable" packages only. Because there ought to be just thousands more... As for this article, I believe it is a bit too messy and confusing already, I'd prefer splitting such things off to a separate article. Especially since there is little overlap (references etc.) I guess. --Chire (talk) 17:10, 12 May 2011 (UTC)

Maybe for R the link could go to a CRAN task view? -- PeterLFlomPhD (talk) 11:44, 24 August 2015 (UTC)

Content in section 3 duplicated verbatim a published journal article.

It is unclear if they copied wikipedia, or someone pasted content from this article: http://files.aiscience.org/journal/article/html/70110028.html — Preceding unsigned comment added by Sakoht (talk • contribs) 02:23, 25 April 2016 (UTC)

It's obvious they copied from Wikipedia. You should contact the journal. But this looks like one of the Indian spam publishers, so they likely won't care. HelpUsStopSpam (talk) 05:39, 25 April 2016 (UTC)

ROFL: "An overview of algorithms explained in Wikipedia can be found in the list of statistics algorithms." - they even copied that from Wikipedia... Chire (talk) 07:32, 25 April 2016 (UTC)

Citations needed

I do not think references for applications are needed. This just attracts spam. HelpUsStopSpam (talk) 19:51, 1 November 2016 (UTC)

Requiring Reliable Sources is a central pillar of Wikipedia. Far from attracting spam, it prevents it, as only suitably sourced materials will survive. If you are supposing that bluelinks demonstrate Notability, recall that Wikipedia, like all websites "that anyone can edit" is not itself a reliable source. Therefore, sources are required. Chiswick Chap (talk) 20:02, 1 November 2016 (UTC)

Wikipedia requires reliable sources, but only for "material whose verifiability has been challenged or is likely to be challenged". For applications, I am not convinced that it is likely that someone will challenge that Cluster Analysis could be used here (I'm pretty sure someone has tried cluster analysis even on his bowel contents, although he probably did not find anything interesting). But calling people to add references to every use case means that we will see a huge citation link farm at the bottom of the page. But I would actually prefer to delete that "applications" section entirely, it is a rather useless list. Or turn it into a "please spam here" dead-end article, like Examples of data mining. HelpUsStopSpam (talk) 21:38, 12 November 2016 (UTC)

I hear what you say, but don't see signs of spamcruft accumulating as in some popular topics I've edited. And unfortunately or not, I have already challenged the section, so there it is. A longer citation list is not a big problem, nobody is forced to read it. The list is not useless, as it shows where cluster analysis is applied, and more interestingly to my mind, what benefits it brings in each area – and those reasons are remarkably diverse. We may also note that other parts of the article like the Evaluation and assessment section could be better cited. All the best, Chiswick Chap (talk) 09:28, 13 November 2016 (UTC)

External links modified

Hello fellow Wikipedians,

I have just modified one external link on Cluster analysis. Please take a moment to review my edit. If you have any questions, or need the bot to ignore the links, or the page altogether, please visit this simple FaQ for additional information. I made the following changes:

Added archive https://web.archive.org/web/20100421170848/http://academic.research.microsoft.com/CSDirectory/Paper_category_7.htm to http://academic.research.microsoft.com/CSDirectory/paper_category_7.htm

When you have finished reviewing my changes, you may follow the instructions on the template below to fix any issues with the URLs.

This message was posted before February 2018. After February 2018, "External links modified" talk page sections are no longer generated or monitored by InternetArchiveBot. No special action is required regarding these talk page notices, other than regular verification using the archive tool instructions below. Editors have permission to delete these "External links modified" talk page sections if they want to de-clutter talk pages, but see the RfC before doing mass systematic removals. This message is updated dynamically through the template {{source check}} (last update: 18 January 2022).

If you have discovered URLs which were erroneously considered dead by the bot, you can report them with this tool.
If you found an error with any archives or the URLs themselves, you can fix them with this tool.

Cheers.—InternetArchiveBot (Report bug) 20:32, 9 August 2017 (UTC)

What happened to the Quality Threshold algorithm description?

I distinctly remember there was a section on the Quality Threshold (QT) algorithm. I cannot find it anymore and it's a very useful algorithm that should be included. Why was this section deleted? I cannot find any discussion in the Talk section about this removal? — Preceding unsigned comment added by 131.107.160.106 (talk) 23:44, 1 April 2018 (UTC)

Evaluation section refinement and copyright issues

The evaluation section extension with benchmarking frameworks contained copied content from the Clubmark paper, however this paper had already been published in the public domains (The Clubmark paper, arXiv) with the respective public licences and the original paper was explicitly cited in the content.

In addition, today the author(s) have emailed the permission to permissions-en(at)wikimedia(dot)org to use the paper under the (CC-BY-SA), version 3.0. {{OTRS pending}}

How can the rolled-back refinements to the Cluster Analysis page, which were temporarily removed because of the copyright issues, be recovered? --Glokc (talk) 06:12, 11 February 2019 (UTC)

@Glokc: I've replied to the copyright holder seeking clarification about something. Once permission has been confirmed, I will ask an administrator to undelete the content. --AntiCompositeNumber (talk) 14:54, 31 May 2019 (UTC)

@Glokc: before restoring the contents, please first seek consensus. Because you seem to have a Wikipedia:Conflict of interest, you should avoid adding advertisement of the Clubmark project to Wikipedia. The old contents were full of advertisement speech ("industrial grade") that is not appropriate for an encyclopedia and hence should not be copy&pasted here anyway. Furthermore, Clubmark appears to be centered on community detection rather than traditional clustering, and ignores the most standard clustering methods such as k-means. Obviously these domains overlap, but yours appears to be pretty much on the community detection side, and insights obtained from your benchmark are better located there (the concerns with respect to WP:COI however remain). Your point of view on what "clustering" is may be biased by a network science viewpoint, but Wikipedia should reflect a broader consensus beginning with the highly-cited classics such as "Jain, A. K., & Dubes, R. C. (1988). Algorithms for clustering data (Vol. 6). Englewood Cliffs: Prentice hall.", "Kaufman, L., & Rousseeuw, P. J. (1990). Finding groups in data: an introduction to cluster analysis (Vol. 344). John Wiley & Sons." and "Anderberg, M. R. (1973). Cluster analysis for applications: Probability and mathematical statistics.". Corrections to definitions and explanations here are of course welcome, but do not start another WP:EDITWAR just to get your link into Wikipedia, please. Thank you. HelpUsStopSpam (talk) 08:03, 1 June 2019 (UTC)

What's the point of cluster analysis?

Could someone from the statistical field include a line or two in the intro (or elsewhere) that explains the purpose of cluster analysis? The "What" and "How" are explained to a good extent, but I can't find the "why" anywhere. Given its use in machine learning and data mining, I think it would be timely to include the reasons. Economicactvist (talk) 08:26, 25 June 2019 (UTC)

Wiki Education Foundation-supported course assignment

This article was the subject of a Wiki Education Foundation-supported course assignment, between 6 September 2020 and 6 December 2020. Further details are available on the course page. Student editor(s): Rc4230.

Above undated message substituted from Template:Dashboard.wikiedu.org assignment by PrimeBOT (talk) 17:53, 16 January 2022 (UTC)

External evaluation - Jaccard Index

Currently the description of Jaccard Index ends by saying "Also TN is not taken into account and can vary from 0 upward without bound." However, this is incorrect and contradicts the beginning of the description which correctly says the metric ranges from 0 to 1. The zero to one range is also attested in the main article for Jaccard index. I am therefore going to remove that incorrect last sentence. Showeropera (talk) 18:51, 14 February 2023 (UTC)
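As a quick check of the range argument, here is a minimal sketch assuming the usual pair-counting form of the Jaccard index, TP / (TP + FP + FN). The function name and the choice to return 0.0 when all three counts are zero are assumptions made for this illustration; the index is undefined in that case.

```python
def jaccard_from_counts(tp, fp, fn):
    """Pair-counting Jaccard index: TP / (TP + FP + FN).

    TN is not a parameter at all, so it cannot affect the value.
    Returning 0.0 for an all-zero input is an arbitrary choice made
    for this sketch; the index is undefined there.
    """
    denom = tp + fp + fn
    if denom == 0:
        return 0.0
    return tp / denom


if __name__ == "__main__":
    # Perfect agreement, no agreement, and an in-between case.
    print(jaccard_from_counts(10, 0, 0))
    print(jaccard_from_counts(0, 5, 5))
    print(jaccard_from_counts(420, 450, 450))
```

Since the counts are non-negative and TP is never larger than TP + FP + FN, the value stays between 0 and 1 no matter how large TN becomes.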