Jump to content

Wikipedia:Reference desk/Archives/Mathematics/2020 September 2

From Wikipedia, the free encyclopedia
Mathematics desk
< September 1 << Aug | September | Oct >> September 3 >
Welcome to the Wikipedia Mathematics Reference Desk Archives
The page you are currently viewing is a transcluded archive page. While you can leave answers for any questions shown below, please ask new questions on one of the current reference desk pages.


September 2

[edit]

Magnitude of binary vectors

[edit]

I was looking at math examples, and it was stated that for binary strings, (I don't know how to type math. If anyone knows how to change || to the double bar and . to the product dot, please do.) I have been trying to work it out, but it is just beyond me. How do you get from the left side of the equality to the right side? 97.82.165.112 (talk) 18:42, 2 September 2020 (UTC)[reply]

Done. –Deacon Vorbis (carbon • videos) 19:29, 2 September 2020 (UTC)[reply]
However, I didn't change the single bars; are those supposed to be double as well? If not, what are they supposed to mean? Because what you have now isn't correct. For example, if one of your vectors is zero and the other isn't, then the LHS is 0, but the RHS is non-0 (assuming the single bars should be double, assuming you mean the regular old norm by that). –Deacon Vorbis (carbon • videos) 19:32, 2 September 2020 (UTC)[reply]
I'm looking at Jaccard index and Cosine similarity. Not only on Wikipedia, but on other sources. They claim that for binary vectors, the two are the same. I personally think that they are not the same. They are similar in function, but not the same. If you look at them, both have in the numerator, which implies that the denominators are equal. 97.82.165.112 (talk) 21:39, 2 September 2020 (UTC)[reply]
Let be a zero binary vector (a bitstring of only zeros). Then , unless these notations mean something else than I think they mean. Then the claim under suspicion simplifies to .  --Lambiam 22:52, 2 September 2020 (UTC)[reply]
It would be nice if everyone used the same notation for all concepts, and all concepts had their own notation, but that's sometimes not the case, so I think some additional context is needed. |A| is sometimes used to denote the cardinality of a set A. Binary vectors can be identified with subset of the set of indices: if A=(a1, a2, ... , an) then i∈A (as a set) iff ai=1. Assuming this, the cardinality of A, |A|, is equal to ‖A‖2, where ‖A‖ is the norm of A taken as a vector in Rn. It's true that |A∪B| = |A|+|B|-|A∩B| for finite sets. It's also true that |A∩B| = A⋅B where the rhs is the dot product of A and B taken as vectors. So one can write |A∪B| = |A|+|B|- A⋅B with this interpretation. I don't know what to make of ‖A‖‖B‖ though. I'm guessing that the statement itself will be trivial to prove (assuming it's true) once we figure out what it means. But we need an explanation of what the symbols mean here using words (e.g. norm, cardinality, union, intersection). A link to where the statement is coming from would be ideal. --RDBury (talk) 05:48, 3 September 2020 (UTC)[reply]
PS. The Jaccard index article does not use the |⋅| notation consistently even within itself; at start of the article |A| means the cardinality of A, and in the section "Other definitions of Tanimoto distance" it means Euclidean norm. --RDBury (talk) 06:03, 3 September 2020 (UTC)[reply]
I think that confusion over notation is the root cause of incorrect claims. If we are limited to binary vectors (or bitmaps), is equivalent to because both are just adding up . Further, is equivalent to because when . Therefore, adding and removing the 2 becomes commonplace, but if it is done sloppily, it is incorrect. For example if . Example: For , , , and . Now, . Because we are using binary vectors, we can state that . But, . Going that far implies that , which is easy to prove wrong. For , and . What I think is happening is that one person is claiming that . Then, another person is claiming that . Then, yet another person sees that and uses transitive property to claim that . If that were the case (which it isn't), then . So, as you say, I think it is all due to bad definitions. In my opinion, the use of to mean should be removed from the Jaccard index page. 97.82.165.112 (talk) 13:27, 3 September 2020 (UTC)[reply]
Whether this is the source of the confusion in this specific case or not – I do not quite see where the term is coming from – you are absolutely right that the inane notation for in the Jaccard article is inexcusably confusing if not outright wrong. I see no reason to use it either, just like we do not express Euler's identity in the form . --Lambiam 15:08, 3 September 2020 (UTC)[reply]
A bit of archeology: this edit of 12 May 2011 replaced by . Wrong for bit vectors, doubly wrong here because the vectors could represent multisets, where the element values are multiplicities. The next edit restricted it to bit vectors. --Lambiam 15:42, 3 September 2020 (UTC)[reply]
This discussion is helpful because I don't feel I'm simply misreading things. But, it means that I have to do a lot more research to see how Tanimoto distance relates to cosine similarity when using bitmaps/binary vectors. The numerator is is both measurements. I'm certain that the denominators are not "equivalent" as I've read from many websites. So, I want to work out how the denoninators relate to one another. In other words, how does relate to . My end goal is to tell someone else: If you accept that Tanimoto distance is a valid metric for comparing your binary data, then cosine similarity is just as valid because it is just a version of Tanimoto created by... (I don't know what the conversion is). 97.82.165.112 (talk) 18:03, 3 September 2020 (UTC)[reply]
Do you know where the lhs came from? Did you find this on the Web? If so, where? Assuming that stands for pointwise (bit by bit) inclusive or, . By the inclusion–exclusion principle, , and using this can be rewritten as
as already noted by RDBury above. Note the striking similarity to
This is unlikely to be a coincidence, but I see no plausible way to connect and , also not by a sequence of bloopers such as .  --Lambiam 20:18, 3 September 2020 (UTC)[reply]
Very late note, but I figured out what the user who added meant. First, it is important to note that there is confusion about use of and . We know that . But, is confusing. When you are working with sets, it is commonly a count of members of the set. When you are working with vectors, it is the magnitude of the vector. To avoid that confusion, is used for magnitude. But, it is also very common to see used for magnitude. So, let's assume that the editor was using . If that is the case, then . But, we know that for bitmaps, . Now, if we change what we mean by and refer to it as a the size of a set, not magnitude of a vector, . To avoid this confusion, I am editing the article to use for magnitude and for count size. Also, I was able to use geometry to make the jump from Jaccard/Tanimoto for bitmaps and cosine distance for bitmaps. Kind of obvious now that cosine comes from angles which essentially comes from geometry, so representing bitmaps geometrically was the way to go. 97.82.165.112 (talk) 13:03, 6 September 2020 (UTC)[reply]
But why write ? What is the square of a binary vector, whether it represents a geometric vector or a set? If this is bitwise squaring, where , then of course , but then what is the point of denoting this no-op?  --Lambiam 13:37, 6 September 2020 (UTC)[reply]
I *think* that the goal is to relate Tanimoto distance (and Jaccard index) to cosine similarity. That implies that there is some relationship between and . But, as I noted, it is mixing notation and creating confusion. It confused me at least. The more I looked at it, there is no relationship between the denominators and . They look similar at a glance, but is just a simple count: . Even if you want to square A, it is . Lookint as , you square A, which is just A. But, you take the square root of the sum. So, if we omit the squaring of A, you still end up with . Expanding it out, it is clear that has nothing to do with . If you square both of them to get rid of the square roots, it still doesn't help. You have a lot more addition and subtraction in the first and just a multiplaction in the second. Therefore, I've ignored everything I've read where people do a lot of handwaving and claim that, somehow, Tanimoto distance bridges between Jaccard index and cosine similarity. One is based on counts of items in a set. The other is based on vectors in Euclidean space. Even if you use binary values for the vectors, you aren't on a unit circle. With two dimensions, you have the possible points (0,0), (0,1), (1,1), and (1,0). That's a square, not a circle. The vector magnitudes are are 0, 1, and 1.41. You aren't binary anymore. But, that is more than anyone likely cares to know. I just wanted to point out that I believe I figured out what the article meant when it claimed . 97.82.165.112 (talk) 19:28, 6 September 2020 (UTC)[reply]