Talk:Locality-sensitive hashing

Reference from "Locality preserving hashing"[edit]

What is the difference between "Locality-sensitive hashing" and "Locality preserving hashing"? The article on the latter refers to this article, but there is no detailed explanation of the motivation behind the "sensitive" and "preserving" terminology. — Preceding unsigned comment added by 92.200.44.109 (talk) 08:50, 30 July 2015 (UTC)[reply]

I added a reference to this article that goes into a lot of detail about two specific algorithms, LSH and LPH.
I agree that the difference in terminology (if any) is unclear. What (if anything) is the difference between "sensitive" and "preserving" to a hash function? --DavidCary (talk) 16:52, 22 August 2016 (UTC)[reply]
I don't think there's much of a difference. I think we should fold both terms into the same article. Prad Nelluru (talk) 05:12, 17 April 2019 (UTC)[reply]

Suggestion to remove Nilsimsa Hash section

The Nilsimsa hash does not really fit either LSH definition (Indyk's or Charikar's). Consequently, there is no way to plug it into the common LSH framework and obtain good index-size and query-performance guarantees, which is one of the strengths of LSH approaches. Hence, I suggest this section be removed, or at least moved from its current place (at the same level as random projection, SimHash, and p-stable-based LSH).

Is there a name for the more general category that includes the Nilsimsa hash, LSH, LPH, TLSH, etc.? --DavidCary (talk) 16:52, 22 August 2016 (UTC)[reply]

Untitled

Just made the page. There are some variations among definitions of LSH - I am using Charikar's. Flamholz 19:40, 6 June 2007 (UTC)[reply]

Charikar's definition is too narrow, though it is easier for beginners to understand.

Do you have a reference to a "better" definition? Please add that reference to the article. Thank you. --DavidCary (talk) 16:52, 22 August 2016 (UTC)[reply]

Definition of an LSH

I don't think the current definition really makes sense, although maybe it could be modified a little to work.

Specifically: for a metric phi(x,y), we have phi(x,y)->0 (intuitively) as x->y. But if Pr[h(x)=h(y)] -> 0 as x->y, that's bad! I mean, that is just about the opposite of a locality-sensitive hash.

One fix might be to say Pr[h(a) = h(b)] = 1 - phi(a,b) instead.

Although it would also be nice to allow for general Boolean combinations of hashes, such as simultaneously hashing to many different values, and calling it a hit if some combination of them actually collides. —Preceding unsigned comment added by PhiloMath (talkcontribs) 07:14, 5 December 2007 (UTC)[reply]

I totally agree. The definition as it stands is wrong. The phi(a,b) is a similarity, not a distance or metric. Another error is that the Jaccard index is a similarity but is currently referred to as the "Jaccard distance". Notice that the correct definition looks like a probabilistic version of an injective mapping. cmobarry (talk) 17:26, 20 December 2007 (UTC)[reply]
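
For later readers, a small numerical illustration of the point above: in min-wise hashing (MinHash), the collision probability of two sets equals their Jaccard index, which is a similarity in [0, 1] rather than a distance. The Python sketch below uses explicit random permutations purely for illustration; the function and variable names are ad hoc and not taken from the article.

    import random

    def make_minhash(universe_size, seed):
        """One MinHash function: a random permutation of the universe.
        h(S) is the first element of the permutation that lies in S."""
        rng = random.Random(seed)
        perm = list(range(universe_size))
        rng.shuffle(perm)
        return lambda s: next(x for x in perm if x in s)

    def jaccard(a, b):
        return len(a & b) / len(a | b)

    A = {1, 2, 3, 4, 5}
    B = {3, 4, 5, 6, 7, 8}

    trials = 20000
    collisions = 0
    for seed in range(trials):
        h = make_minhash(10, seed)
        if h(A) == h(B):
            collisions += 1

    print(collisions / trials)   # empirically close to 3/8
    print(jaccard(A, B))         # Jaccard(A, B) = 3/8 = 0.375

So Pr[h(A) = h(B)] behaves like a similarity: it tends toward 1, not 0, as the two sets become more alike.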

Added another variant of LSH definition

We added the Indyk-Motwani definition of the LSH family, plus an LSH family for the Hamming space (by bit sampling), as well as the LSH algorithm for the nearest neighbor search (approximate). Alex and Piotr. 128.30.48.53 (talk) 02:13, 7 February 2008 (UTC)[reply]

Hey guys, i have a question.

in the last section:

LSH Algorithm for the Nearest Neighbor Search

... it is being claimed that: ...

query time: O(L(kt + d n P2^k));

I am trying to figure out where the d n P2^k term comes from... why is the probability of collision P2^k?

can someone please shed some light on this? —Preceding unsigned comment added by Caligola0 (talkcontribs) 18:01, 18 June 2009 (UTC)[reply]

I agree -- what is the meaning of d, P1, and P2? --DavidCary (talk) 16:52, 22 August 2016 (UTC)[reply]
d is the dimension of the data points, P1 is the probability that two close points (distance at most R) collide, and P2 is the probability that two far points (distance at least cR) collide. --Thomasda (talk) 18:13, 3 November 2021 (UTC)[reply]
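
For later readers, and assuming the elided expression above is the P2^k term in the query-time bound: each hash table uses g(p) = (h_1(p), ..., h_k(p)), a concatenation of k functions drawn independently from the family, so two far points collide under g only if they collide under every h_i, which happens with probability at most P2^k. Below is a rough Python sketch using the bit-sampling family for Hamming space mentioned earlier; the dimensions, distances, and names are made-up illustration values.

    import random

    d = 100   # dimension of the binary data points (illustrative value)
    k = 10    # number of bit-sampling functions concatenated into one g

    def make_g(rng):
        """g(p) = (h_1(p), ..., h_k(p)); each h_i reads one random bit of p."""
        idx = [rng.randrange(d) for _ in range(k)]
        return lambda p: tuple(p[i] for i in idx)

    def flip_bits(p, n, rng):
        """Copy of p with n randomly chosen bits flipped (Hamming distance n)."""
        q = p[:]
        for i in rng.sample(range(d), n):
            q[i] ^= 1
        return q

    rng = random.Random(0)
    far_dist = 30                 # a "far" pair, e.g. distance cR = 30
    P2 = 1 - far_dist / d         # single-bit collision probability = 0.7

    trials = 20000
    collisions = 0
    for _ in range(trials):
        p = [rng.randint(0, 1) for _ in range(d)]
        q = flip_bits(p, far_dist, rng)
        g = make_g(rng)
        if g(p) == g(q):
            collisions += 1

    print(collisions / trials)    # empirically close to P2**k
    print(P2 ** k)                # 0.7**10 ≈ 0.028

Amplification then uses L such independent tables, so a close pair (per-function collision probability at least P1) still collides in at least one table with probability at least 1 - (1 - P1^k)^L.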

Relation to Vector Quantization?

hi - could someone clarify the relation to Vector Quantization please? --mcld (talk) 09:21, 8 April 2010 (UTC)[reply]

merge

I suggest merging locality-preserving hashing into locality-sensitive hashing. There seems to be enough WP:OVERLAP that a single article can cover both, and clarify the distinction (if any) between them. --DavidCary (talk) 15:50, 22 August 2016 (UTC)[reply]

 Done Klbrain (talk) 08:13, 10 May 2018 (UTC)[reply]

Random projection[edit]

How is "closely related" to for small ? It is surely not a Taylor expansion, or anything of that sort. How is this even a relevant comment at this point? — Preceding unsigned comment added by 37.24.141.200 (talk) 21:43, 2 September 2016‎ (UTC)[reply]

That's a referenced example of the method; I agree that the approximation is not a Taylor expansion, but it is the method used by the paper. The comment about the relationship between 1 - theta/pi and cos(theta) is necessary to support the final statement in the section: "Two vectors' bits match with probability proportional to the cosine of the angle between them". Klbrain (talk) 09:09, 4 April 2018 (UTC)[reply]
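
For later readers, a quick numerical check, assuming the two quantities discussed above are 1 - theta/pi and cos(theta): with random-hyperplane hashing the probability that two vectors receive the same bit is exactly 1 - theta/pi, and for small angles both that quantity and cos(theta) are close to 1, which is the sense in which they are "closely related". A minimal Python sketch; the vector dimension, angle, and seed are arbitrary illustration values.

    import numpy as np

    rng = np.random.default_rng(0)
    dim, n_bits = 50, 20000

    # Two unit vectors separated by a known angle theta.
    theta = 0.3                                  # radians
    u = np.zeros(dim); u[0] = 1.0
    v = np.zeros(dim); v[0] = np.cos(theta); v[1] = np.sin(theta)

    # One hash bit per random hyperplane: the sign of the projection.
    planes = rng.standard_normal((n_bits, dim))
    bits_u = planes @ u > 0
    bits_v = planes @ v > 0

    print(np.mean(bits_u == bits_v))   # fraction of matching bits
    print(1 - theta / np.pi)           # 0.9045..., the exact match probability
    print(np.cos(theta))               # 0.9553..., also near 1 for small theta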

External links modified

Hello fellow Wikipedians,

I have just modified one external link on Locality-sensitive hashing. Please take a moment to review my edit. If you have any questions, or need the bot to ignore the links, or the page altogether, please visit this simple FaQ for additional information. I made the following changes:

When you have finished reviewing my changes, you may follow the instructions on the template below to fix any issues with the URLs.

This message was posted before February 2018. After February 2018, "External links modified" talk page sections are no longer generated or monitored by InternetArchiveBot. No special action is required regarding these talk page notices, other than regular verification using the archive tool instructions below. Editors have permission to delete these "External links modified" talk page sections if they want to de-clutter talk pages, but see the RfC before doing mass systematic removals. This message is updated dynamically through the template {{source check}} (last update: 18 January 2022).

  • If you have discovered URLs which were erroneously considered dead by the bot, you can report them with this tool.
  • If you found an error with any archives or the URLs themselves, you can fix them with this tool.

Cheers.—InternetArchiveBot (Report bug) 23:13, 4 January 2018 (UTC)[reply]

Is Geohash algorithm L-s hashing?

See Geohash. How can one prove it?

  • Yes, it is locality-preserving. The global probabilities (to check P1 and P2) are not easy to calculate...
  • Amplification? How can one check it?

Perhaps the first and easiest step is to transform the Geohash digest from base32 to base4, because Geohash divides the globe into 4 regions.
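
For later readers, a sketch of that proposed first step, assuming the standard Geohash base-32 alphabet and the usual bit order (longitude bit first); each base-4 digit below is one (longitude bit, latitude bit) pair, i.e. one quadrant choice.

    BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"   # standard Geohash alphabet

    def geohash_to_base4(gh):
        """Re-express a Geohash string as base-4 digits, one per quadrant split."""
        bits = "".join(format(BASE32.index(c), "05b") for c in gh)
        # Geohash bits alternate longitude, latitude, ...; pair them up.
        usable = len(bits) - len(bits) % 2
        return [int(bits[i:i + 2], 2) for i in range(0, usable, 2)]

    print(geohash_to_base4("ezs42"))   # [1, 2, 3, 3, 3, 3, 0, 0, 1, 0, 0, 1]

Each successive base-4 digit halves the current cell in both longitude and latitude, which is why nearby points tend to share long digit prefixes.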

High-dimension data?

The lead now refers to high-dimension data. I've spent 40 years as a software engineer, and know what ordinary hash coding is all about. And I thought I knew what a dimension is (an orthogonal axis in a graph, or a measure of the wiggliness of a line in fractal theory), but I haven't seen it used to refer to data before, possibly because I've never done any business programming (only systems and tools programming). Perhaps a brief definition could be added here? Just something that could distinguish high from low dimension data. For example, is the data set {1, 2, 3} high or low dimension? Is the data list (0:3, 1: 8, 2: -3) high or low dimension? I have no idea, but I think if the lead is going to use this term, it should give at least one example, if nothing else. David Spector (talk) 23:29, 12 November 2021 (UTC)[reply]
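
Not an answer from the article, but a minimal illustration of the usual usage, assuming "dimension" here simply means the number of coordinates (features) per data point; the vocabulary size and word id below are made up.

    # One data point = one vector; its dimension = its number of coordinates.
    low_dim_point = (3.0, 8.0, -3.0)      # a single 3-dimensional point
    # {1, 2, 3} would usually be read as three separate 1-dimensional points.

    # "High-dimensional" data: e.g. a document as a bag-of-words count vector
    # over a 50,000-word vocabulary, so each point has 50,000 coordinates.
    vocabulary_size = 50_000
    document_vector = [0] * vocabulary_size
    document_vector[123] = 2              # the word with id 123 occurs twice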

f=?

No explanation about function f at

Please add an explanation. Krauss (talk) 11:49, 27 August 2022 (UTC)[reply]

I agree that such an abstract function can use an explanation. David Spector (talk) 17:09, 27 August 2022 (UTC)[reply]