Talk:K-means++

deletion?

Why is this article nominated for deletion? I did not write it, but it seems correctly cited and of some marginal worth. The Wikipedia CS and stats pages are already kinda bare bones. It seems arbitrarily spiteful to the author to remove this page when it is of some use and is being removed simply for “not being of note”. Madjack74 (talk) 15:51, 19 May 2009 (UTC)

I nominated it because I didn't believe it satisfied the requirements per Wikipedia:Notability, as it seems to be based entirely on one paper from a CS conference. It appears that the material was originally on the k-means clustering page, where there was some question of its notability on the talk page, before it was moved here. —3mta3 (talk) 18:02, 19 May 2009 (UTC)
I have added a web citation for something that says this method is good (it is not obvious whether this is "published" in a recognised sense), and another for slides by the original authors. A simple web search found two other conference papers that reference the basic paper, but I was unable to access them to see whether they show significant applicability. I think that the prod tag needs to be replaced with something more appropriate, so have done so. Please add citations and revise as necessary. Melcombe (talk) 09:31, 20 May 2009 (UTC)
Here are a few people who claim independent reviews of k-means++: http://lingpipe-blog.com/2009/03/23/arthur-vassilvitskii-2007-kmeans-the-advantages-of-careful-seeding/ http://www.cse.ohio-state.edu/~johansek/clustering.pdf http://www1.cs.columbia.edu/~rjaiswal/jaiswal-research.pdf. Honestly, I'm not really sure what more could be expected. After all, the paper is only two years old. Also, how is a peer-reviewed research paper from a major conference considered a primary source? As long as the paper is considered reliable (why wouldn't it be?), the algorithm itself seems clearly notable. It's faster and more accurate than k-means, and it has a provable accuracy guarantee, which k-means does not. 69.181.143.165 (talk) 01:00, 21 May 2009 (UTC)
One question is why you didn't add these citations to the article yourself, as they are at least some evidence that someone besides the article's originator and the originators of the method has tried out the approach. This is just what the earlier version of the article was/is lacking. Of course, there is the problem that these citations don't actually indicate that they are from a "peer-reviewed conference", but perhaps you can provide more details for these articles than just the web link... and include them anyway if no better details are available, as they can always be removed if more substantial references appear. I think the discussion has anyway moved on beyond "deletion" to one of improving the article. Melcombe (talk) 08:48, 21 May 2009 (UTC)
No, I can't provide more detail, which is why I didn't add them. Sorry! I just found those articles by searching for "k-means++ test" on Google as a way of defending notability in this discussion. I'll take a look at the article when I get a chance. 69.181.143.165 (talk) 15:16, 21 May 2009 (UTC)
I have added these, and another, as a start, in the hope that fuller info on conference details etc., and more "reliable sources", can be found later. Melcombe (talk) 16:36, 21 May 2009 (UTC)

This is my first time editing Wiki, so I'm not sure if this is where to discuss, but is the term "vanilla k-means" really a good way to refer to the original algorithm? "Original k-means" sounds much more professional and less hipster-doofus to me... MetaphysicalBrian (talk) 16:57, 4 June 2009 (UTC)

I think there is a mistake in the "Example Bad Case". Suppose the four points are (0,0), (3,0), (0,8), (3,8), and the initial arbitrary partition is {(0,0), (0,8)}, {(3,0), (3,8)}. In this case the centroids are (0,4) and (3,4), and the total cost is 4 × 4^2 = 64. Without loss of generality, suppose we pick point (0,0) and try to improve by moving it to the second cluster; then we get {(0,8)}, {(0,0), (3,0), (3,8)}. In this case the centroids are (0,8) and (2, 8/3), and the total cost is 0 + 11.11 + 8.11 + 29.44 ≈ 48.66, which is already better. (Moreover, I didn't check, but I think that the next step will move point (3,8) to the first cluster, which gives the optimal solution.) Hagai Cibulski —Preceding unsigned comment added by Hagai Cibulski (talk • contribs) 21:30, 7 December 2009 (UTC)

The example case is correct. The point is that the k-means algorithm will not try the step you suggested. You might want to review its definition, but basically, the improvement you suggested requires two steps: moving a point from one cluster to another, and moving the centers to new locations. k-means can only do one step at a time, which means it will be stuck. Of course, you could modify it to search for things like this, but then it would become much slower, which defeats its purpose. 76.66.24.24 (talk) 00:20, 20 December 2009 (UTC)
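To see the stuck state concretely, here is a minimal sketch (plain NumPy; not part of the original discussion) that runs standard k-means from the bad partition above. The assignment step and the update step each change nothing, so the run converges immediately at cost 64, even though the optimal clustering ({(0,0), (3,0)} and {(0,8), (3,8)}) has cost 9:

```python
import numpy as np

# The four points from the example: corners of a 3-by-8 rectangle.
X = np.array([[0, 0], [3, 0], [0, 8], [3, 8]], dtype=float)

# Centroids of the "bad" initial partition {(0,0),(0,8)}, {(3,0),(3,8)}.
centers = np.array([[0, 4], [3, 4]], dtype=float)

for _ in range(10):
    # Assignment step: each point goes to its nearest center.
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    # Update step: each center moves to the centroid of its points.
    new_centers = np.array([X[labels == j].mean(axis=0) for j in range(2)])
    if np.allclose(new_centers, centers):
        break  # nothing moved: k-means is stuck
    centers = new_centers

print(labels, ((X - centers[labels]) ** 2).sum())  # [0 1 0 1] 64.0
```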

I would like to bring up the case for deletion again. Even as an important method for initialisation, I fear it is not sufficiently distinct from other forms of initialisation to warrant its own page. Putting a positive spin on things, it would be of more use alongside other forms of initialisation; the discussion on the main K-means page is short. K-means++ should be on Wikipedia in an article alongside the other 101 methods of parameter initialisation, where it will be of more use to readers. SymplectoJim (talk) 11:24, 22 March 2019 (UTC)

@SymplectoJim: no need to delete. You can merge articles, which would make sense in my opinion. I suggest you follow Wikipedia:Be bold and just do this. Follow the steps in WP:PROMERGE; in particular, add Template:Merged-from templates to the talk pages, and clean up at least some of the links. HelpUsStopSpam (talk) 21:20, 22 March 2019 (UTC)

Algorithm doesn't make any sense

Especially step 3. Can anyone figure out what that is supposed to mean? Seriously, x_{k+1} > D(x_k)^2 is just nonsense. In the paper I've found, it says something completely different: you're supposed to take as a new center any data point x with probability proportional to D(x)^2. So basically, the standard k-means initialization chooses centers randomly among the data points, each being equally weighted (i.e. weighted by D(x)^0), while k-means++ weights them by D(x)^2. 77.23.64.247 (talk) 13:43, 4 January 2010 (UTC)
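For concreteness, the rule described here is short to write down in code. A minimal sketch of D(x)^2 seeding (plain NumPy; the function name and structure are my own, not from the paper):

```python
import numpy as np

def kmeanspp_seed(X, k, rng=None):
    """Pick k initial centers from the rows of X by k-means++ seeding."""
    if rng is None:
        rng = np.random.default_rng()
    n = len(X)
    # First center: uniformly at random, i.e. weights D(x)^0.
    centers = [X[rng.integers(n)]]
    for _ in range(1, k):
        # D(x)^2: squared distance from each point to its nearest chosen center.
        diffs = X[:, None, :] - np.array(centers)[None, :, :]
        d2 = (diffs ** 2).sum(axis=2).min(axis=1)
        # Next center: data point x chosen with probability D(x)^2 / sum(D(x)^2).
        centers.append(X[rng.choice(n, p=d2 / d2.sum())])
    return np.array(centers)
```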

Step 3 appears to have been wrong. I corrected it. 76.66.24.24 (talk) 14:38, 4 January 2010 (UTC)
Thanks, it looks correct now. The whole article seems to need to be reworked, though. Comparisons between k-means++ initialization and standard or "vanilla" k-means are invalid, as there are many different initialization methods for k-means. The bad example case given here assumes choosing centers uniformly at random, after which k-means will fail with a fixed probability. k-means++ initialization, however, can fail arbitrarily badly as well, with rapidly decreasing probability. Choosing centers furthest away from already-chosen centers will in fact always yield the optimal solution for this case. Besides being completely unrealistic, the bad example here doesn't actually show superiority of k-means++ over different initializations.
A better example that shows k-means++'s strength in a realistic situation would be the following: let there be a number of dense clouds of data points and a few outliers around them, with the desired number of clusters equal to the number of point clouds. Uniformly random initialization is likely to select more than one initial center from the same cloud, which will result in k-means incorrectly clustering different clouds together. Furthest-away initialization will select the outliers as initial centers and is also likely to fail. k-means++ is likely to choose the centers far away from each other, yet it is improbable that it selects outliers, as there are only a few of them compared to the larger number of data points which aren't. If no one objects, I might create a visualization and rewrite the example case from scratch. 77.23.64.247 (talk) 17:09, 5 January 2010 (UTC)
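Pending that visualization, a rough sketch of the scenario in code (synthetic data; all constants and names are my own choosing, purely illustrative) contrasts furthest-away seeding with D(x)^2 seeding:

```python
import numpy as np

rng = np.random.default_rng(0)

# Three dense clouds of 200 points each, plus three far-away outliers.
clouds = [rng.normal(loc=c, scale=0.2, size=(200, 2))
          for c in ([0, 0], [10, 0], [5, 8])]
outliers = np.array([[40.0, 40.0], [-40.0, 40.0], [40.0, -40.0]])
X = np.vstack(clouds + [outliers])

def seed(X, k, weighted):
    """Pick k centers; weighted=True is k-means++, False is furthest-away."""
    centers = [X[rng.integers(len(X))]]
    for _ in range(1, k):
        # D(x)^2: squared distance to the nearest already-chosen center.
        d2 = ((X[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        if weighted:
            # k-means++: sample with probability proportional to D(x)^2.
            centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
        else:
            # Furthest-away: deterministically take the most distant point.
            centers.append(X[d2.argmax()])
    return np.array(centers)

print("furthest-away:", seed(X, 3, weighted=False))  # tends to grab the outliers
print("k-means++:    ", seed(X, 3, weighted=True))   # usually lands in the clouds
```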
Sounds good to me! 76.68.82.182 (talk) 18:25, 10 January 2010 (UTC)

External links modified

Hello fellow Wikipedians,

I have just modified one external link on K-means++. Please take a moment to review my edit. If you have any questions, or need the bot to ignore the links, or the page altogether, please visit this simple FAQ for additional information. I made the following changes:

When you have finished reviewing my changes, you may follow the instructions on the template below to fix any issues with the URLs.

This message was posted before February 2018. After February 2018, "External links modified" talk page sections are no longer generated or monitored by InternetArchiveBot. No special action is required regarding these talk page notices, other than regular verification using the archive tool instructions below. Editors have permission to delete these "External links modified" talk page sections if they want to de-clutter talk pages, but see the RfC before doing mass systematic removals. This message is updated dynamically through the template {{source check}} (last update: 18 January 2022).

  • If you have discovered URLs which were erroneously considered dead by the bot, you can report them with this tool.
  • If you found an error with any archives or the URLs themselves, you can fix them with this tool.

Cheers.—InternetArchiveBot (Report bug) 20:13, 3 December 2017 (UTC)