Talk:DBSCAN

This is the talk page for discussing improvements to the DBSCAN article.
This is not a forum for general discussion of the article's subject.

Put new text under old text. Click here to start a new topic.
New to Wikipedia? Welcome! Learn to edit; get help.

Article policies

Find sources: Google (books · news · scholar · free images · WP refs) · FENS · JSTOR · TWL

Computer science Mid‑importance

This article is within the scope of WikiProject Computer science, a collaborative effort to improve the coverage of Computer science related articles on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.Computer scienceWikipedia:WikiProject Computer scienceTemplate:WikiProject Computer scienceComputer science articles

Mid

This article has been rated as Mid-importance on the project's importance scale.

Things you can help WikiProject Computer science with:

Here are some tasks awaiting attention:

Article requests :
- Requested articles/Applied arts and sciences/Computer science, computing, and Internet
Cleanup :
- Computer science articles needing attention
- Computer science articles needing expert attention
Copyedit :
- Computing
Expand :
- Computer science
Infobox :
- Computer science articles without infoboxes
Maintain :
- Timeline of computing 2020–present
Photo :
- Find pictures for the biographies of computer scientists (see List of computer scientists)
- Computing articles needing images
Stubs :
- Computer science stubs
Unreferenced :
- WikiProject Computer science/Unreferenced BLPs
Project-related :
- Tag all relevant articles in Category:Computer science and sub-categories with {{WikiProject Computer science}}

Databases (inactive)

This article is within the scope of WikiProject Databases, a project which is currently considered to be inactive.DatabasesWikipedia:WikiProject DatabasesTemplate:WikiProject DatabasesDatabases articles

Robotics Mid‑importance

	This article is within the scope of WikiProject Robotics, a collaborative effort to improve the coverage of Robotics on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.RoboticsWikipedia:WikiProject RoboticsTemplate:WikiProject RoboticsRobotics articles
Mid	This article has been rated as Mid-importance on the project's importance scale.

Untitled

Is the abbreviation of epsilon to eps really necessary? Pflaquerre (talk) 07:52, 15 July 2008 (UTC)[reply]

No it is not - but it is consistent with the original publication of Ester, Kriegel, Sander and Xu. mathias (talk) 17:52, 7 April 2009 (UTC)[reply]

It seems as though the algorithm could benefit from being clarified a little bit. What exactly are we recursing on? Is C global? What are the procedure/function's parameters? Why is noise never marked as visited? Pflaquerre (talk) 07:55, 15 July 2008 (UTC)[reply]

I second this Crenshaw (talk) 23:42, 26 November 2008 (UTC)[reply]

In Steps, it says: " starting point and its neighbors are added to this cluster". The pseudo code has only: "mark P as visited". Should the pseudocode say "Mark P and all points in N as visited?" —Preceding unsigned comment added by 98.207.54.162 (talk) 20:59, 7 February 2009 (UTC)[reply]

The problem of big O notation

the notation "O((n2-n)/2)" in the "Nearest Neighborhood Detection" section is wrongly used. It should be O(n^2), because the constant "2" and lower order "n" are asymptotically insignificant. --Dexter (talk) 02:18, 28 April 2009 (UTC)[reply]

Pseudo code wrong?

Timmenzies (talk) 15:07, 14 June 2011 (UTC)[reply]

In the original paper, if the expanding cluster runs into another older cluster then the new cluster gets the same id as the older cluster.

Wasn't sure that this would happens in practice (since expandCluster a complete search through all relevant neighbors) but the original paper says that this can happen in the special case of border points (a note that I do not understand, BTW).

In any case, to be consistent with the paper and to handle that special case, should not the pseudo code be as follows? (my changes are all marked marked with //

DBSCAN(D, eps, MinPts)
   C = 0
   for each unvisited point P in dataset D
      mark P as visited
      N = getNeighbors (P, eps)
      if sizeof(N) < MinPts
         mark P as NOISE
      else
         C = next cluster
         C.id = a new id                                  // new
         expandCluster(P, N, C, eps, MinPts)
          
expandCluster(P, N, C, eps, MinPts)
   add P to cluster C
   for each point P' in N 
      if P' is not visited
         mark P' as visited
         N' = getNeighbors(P', eps)
         if sizeof(N') >= MinPts
            N = N joined with N'
      if P' is not yet member of any cluster
         add P' to cluster C
      else                                                         // new 
         C.id = the id  of  the cluster holding P' // new

The result of DBSCAN is non-deterministic on border points; but you clearly do NOT want to merge clusters when they share a border point, because that would violate the model of density connectedness. Your modification would cause this, breaking the whole model of density connectedness. Also I belive your first sentence is wrong. IIRC the new *point* keeps the Cluster-ID of the old cluster, which is exactly what the code does. An alternative would be to just overwrite the cluster assignment. The two variants are isomorphic, you can emulate the other by just processing the dataset in the opposite order.

Border points are density-reachable from a cluster but not dense themselves. They can be density-reachable by two clusters, but since they are not core points they will NOT merge the two clusters. By being density reachable, they would belong to both clusters, but the DBSCAN algorithm assigns them to whichever cluster discovered them first. They are usually not significant, and this way you get a proper partitioning and not some fuzzy or overlapping clustering. --87.174.53.81 (talk) 22:13, 14 June 2011 (UTC)[reply]

Pictures

This article is in serious need of some pictures. Particularly, the section "Basic Idea". — Preceding unsigned comment added by 99.181.61.118 (talk) 22:19, 15 August 2011 (UTC)[reply]

In the caption of the figure in section "Preliminary", some of the core points only have 3 reachable neighbours. If I'm not mistaken, then given the definition or core points, either the label is wrong (minPts = 4 should be minPts = 3) OR the figure is wrong and only two points should be considered as core points (I guess correcting the caption is easier than correcting the figure). I didn't want to edit before being sure. — Preceding unsigned comment added by 143.239.81.7 (talk) 14:13, 26 September 2018 (UTC)[reply]

The number of points is 4. It does not say "number of points different from the query point". The range query does not exclude the query point - it is part of the database. Chire (talk) 14:29, 12 October 2018 (UTC)[reply]

The topic is not that clear... what kind of ordering is created...

There are few images on this page which seem more useful http://de.wikipedia.org/wiki/OPTICS

Moreover there is no explanation of the ordering... Maybe an example would be very useful.. Please help need to implement but have no idea how the ordering works

cant find any other better source Thank you Varunkumar Jayapaul jv5k5-wiz@yahoo.co.in — Preceding unsigned comment added by 203.196.165.100 (talk) 12:36, 14 June 2012 (UTC)[reply]

DBSCAN does not generate an ordering. You are confusing it with OPTICS algorithm. Have you considered looking at the original publications? They contain pseudocode. --Chire (talk) 16:17, 14 June 2012 (UTC)[reply]

Basic Idea Section Unreadable

The Basic Idea section is incomprehensible to someone who does not understand the algorithm ahead of time, due to having too many pronouns. The "it's" could be p or q.

192.249.47.174 (talk) 15:51, 21 June 2012 (UTC)[reply]

It should be noted in the algorithm details that DBSCAN can create a cluster that is less than minPts

A cluster in DBSCAN is only guaranteed to consists of at least 1 core point.

Since border points that belong to more than 1 cluster will be "randomly" (usually: first-come) assigned to one of the clusters, a core point may not be able to retain/get all its neighbors. Thus, it may be smaller than minPts.

One dimensional example: minPts=4, epsilon=1:

0.0 0.5 1.0 2.0 3.0 3.5 4.0

|-------X-------|                   Core point at 1.0, Cluster 1.
                |-------X-------|   Core point at 3.0, Cluster 2.

All other points are not core points. Since the point 2.0 is not a core point, it does not connect the two clusters. One of the clusters will have 4 members, the other only 3. — Preceding unsigned comment added by 208.38.35.13 (talk) 16:21, 25 February 2014

When copying data from a web site (Stackoverflow): [1] you should A) obey copyrights, and B) give credit. --188.98.200.186 (talk) 11:31, 2 March 2014 (UTC)[reply]

RegionQuery: Including reference point P?

The current regionQuery explicitly contains the given reference point P as well.

regionQuery(P, eps): return all points within P's eps-neighborhood (including P)

In the original paper, the authors state:

A call of SetOfPoints.regionQuery(Point, Eps) returns the Eps-Neighborhood of Point in SetOfPoints as a list of points.

In Neighbourhood (graph_theory), it is stated:

... The neighbourhood described above does not include v itself, and is more specifically the open neighbourhood of v; it is also possible to define a neighbourhood in which v itself is included, called the closed neighbourhood and denoted by NG[v]. When stated without any qualification, a neighbourhood is assumed to be open. ...

The authors do not elaborate on whether it's a closed or open neighborhood. The last sentence of the article about neighborhoods would suggest to exclude P unless it is explicitly stated otherwise.

--Fretdf (talk) 06:27, 20 August 2014 (UTC)[reply]

In a database context, such as when using an R*-tree as recommended by the authors, the query would include the query point. It is part of the database, too. --Chire (talk) 11:24, 20 August 2014 (UTC)[reply]

I have also verified by looking at the original source code. The query point is included (for all I can tell). --Chire (talk) 13:01, 20 August 2014 (UTC)[reply]

Added a link to the SPMF implementation of DBScan

Yesterday, I have added a link to the SPMF implementation of DBScan and it was deleted by IP address "94.216.196.253" as being "an advertisement". However, SPMF is a well-known data mining library used in more than 160 research papers (see SPMF website - citations page), and has many users (more than than 70,000 visitors to the website last year). It is thus relevant to mention that it offers a DBScan implementation. Beside, it is interesting to note that the ip address that have delete the link to SPMF is located in Muenchen, Germany. Thus, it might have been deleted by members of the ELKI data mining library and it may look like a conflict of interest... By the way, I'm the founder of SPMF. — Preceding unsigned comment added by 139.103.155.68 (talk) 19:31, 8 March 2015 (UTC)[reply]

You added this to three articles (WEKA, k-means and this, so I reverted all three...):

GPL-V3 Java implementations of more than 80 data mining algorithms. It offers an efficient Java implementation of the DBScan algorithm using a KD-Tree that is easy to integrate in other Java software. A GUI and command line interface is provided.

This is advertisement text. — Preceding unsigned comment added by 94.216.196.253 (talk) 23:43, 8 March 2015 (UTC)[reply]

The SPMF library is relevant. If you think that there is a problem with the writing style, which is subjective, I suggest to rewrite rather than deleting. You only need to remove the part that you think is inappropriate. I will put back the text for Weka and K-Means page but I will rewrite it differently. Again, if you have a problem with the text, you can rewrite. — Preceding unsigned comment added by 99.252.71.54 (talk) 01:17, 9 March 2015 (UTC)[reply]

Discretization of the space for the graph analogy

I am not sure how important this is for experienced readers, but the fact that a possibly continuous space is seen as a graph should be made explicit; I would like to talk about edges and paths only once the following points are put: -every point in the space is a node -there exists an edge (x,y) if d(x,y)<=e Thank you! Reim (talk) 12:17, 1 July 2015 (UTC)[reply]

Negative description about other software

I just came back to this page after a few months and read that the description of SPMF has been changed to that: "* SPMF offers a minimalistic GPL-V3 Java implementation of the DBSCAN algorithm for Euclidean distance only with KD-Tree support, that often doesn't yield any result at all." According to the IP address, it seems that the text has been rewritten by authors of the Elki software. It is claimed that it often yield no result. But I run the algorithm and there is definitely a result that is produced. If a bug has been found such that it does not produce result in some cases (which need to be demonstrated) what about reporting it so that it can be fixed like most users do when finding a bug, instead of writing some very negative description on Wikipedia. I mean, you can promote your software without writing negative stuff like that about other software. Besides, there is more than one distance function offered in SPMF (subclasses of DistanceFunction). This has just not been mentioned very clearly in the SPMF documentation. The documentation and code has been updated to make this clearer and fix that. I have now fixed the description in the Wikipedia article. In the future, if you find a bug, report it to me, and I will fix it. I'm the founder of SPMF. ~~

Relation to Friends-of-Friends algorithm.

The algorithm is widely used in cosmology to identify dark matter halos from simulations, under the name FOF (friends-of-friends):

The FOF algorithm was described and studied on Page 382, Section e) multiplicity functions of

http://adsabs.harvard.edu/abs/1985ApJ...292..371D

This use was much earlier than the 1996 paper.

There is a short discussion about FOF and DBSCAN under the discussion under 'related work' on page 4 of

http://homes.cs.washington.edu/~magda/papers/kwon-ssdbm10.pdf

FOF is the special case where minPTS == 1. Although with FOF objects with less than (typically 32) points are removed from the catalogue.

Based on this I propose we add a sentence somewhere in this article:

  a special case of DBSCAN, known as Friends-of-friends
  algorithm was first proposed by Davis et al 1985 and still widely used in cosmology.

Objections?

Yu Feng Rainwoodman (talk) 20:03, 18 September 2015 (UTC)[reply]

The pointless case of minPts < 2 is also a special case of single-link hierarchical clustering, and this kind of clustering was discussed already by Sneath 1957. I doubt "FOF" is the first. It's nothing but the transitive closure of the neighbor relation, and not worth being called an "algorithm" at all - transitive closures are a well understood and old mathematical concept. Did Davis discuss anything to make it fast? 93.204.126.210 (talk) 08:42, 19 September 2015 (UTC)[reply]

The article already discusses this:

The low value of minPts = 1 does not make sense, as then every point on its own will already be a cluster. With minPts = 2, the result will be the same as of hierarchical clustering with the single link metric, with the dendrogram cut at height ε.

In other words, they are using single-link clustering by the name "FOF" in cosmology, not DBSCAN. 93.204.126.210 (talk) 08:45, 19 September 2015 (UTC)[reply]

Noise points vs pseudo code

In the DBSCAN function in the pseudo code, a point can be marked as noise, however later that already marked noise point can become a part of a cluster. This is also mentioned in the original article: "The ClId (clusterId) of points which have been marked to be NOISE may be changed later, it they are density-reachable from some other point of the database."

I think this should be indicated in the pseudo code, to avoid the collection of the noise points in the DBSCAN function (as I did). E.g. like this:

expandCluster(P, NeighborPts, C, eps, MinPts) {
   add P to cluster C
   for each point P' in NeighborPts { 
      if P' is not visited {
         mark P' as visited
         NeighborPts' = regionQuery(P', eps)
         if sizeof(NeighborPts') >= MinPts
            NeighborPts = NeighborPts joined with NeighborPts'
      }
      if P' is not yet member of any cluster
         add P' to cluster C
         if P' is marked as NOISE                                       // < here
            unmark P' as NOISE                                          // < here
   }
}

DeVi (talk) 11:39, 28 November 2016 (UTC)[reply]

I'd rather stick to the published pseudocode than make a Wikipedia version of it and worst-case introduce some bugs. I'd also prefer to keep the pseudocode short. Chire (talk) 16:23, 1 December 2016 (UTC)[reply]

Complexity

Algorithm's authors claim that it is possible to achieve DBSCAN with O(n log n) time complexity with an implementation based on index structures. However, there is an ongoing discussion whether the claim is valid. To the best of my knowledge, no formal proof of the linarithmic complexity exists.

The edit was reverted with a comment "already discussed on the article. see references." -- I see no references for complexity analysis in the article. The linarithmic claim was questioned for instance in:

Gunawan, A. and de Berg, M., 2013. A faster algorithm for DBSCAN. Master’s thesis.

Gan, J. and Tao, Y., 2015, May. DBSCAN revisited: Mis-claim, un-fixability, and approximation. In Proceedings of the 2015 ACM SIGMOD international conference on management of data (pp. 519-530).

or even:

Schubert, E., Sander, J., Ester, M., Kriegel, H.P. and Xu, X., 2017. DBSCAN revisited, revisited: why and how you should (still) use DBSCAN. ACM Transactions on Database Systems (TODS), 42(3), pp.1-21. 91.193.208.197 (talk) 14:30, 9 May 2024 (UTC)[reply]

none of them doubt "the worst case run time complexity remains O(n²)." - so the article seems up to date, and I'd argue that last article settled the "ongoing discussion" - was there anything new in the last 5 years? Check if the article supports the "O(n²) worst case" claim of the wikipedia article and then just reuse the citation.