Talk:Content-addressable storage

	This article is within the scope of WikiProject Computing, a collaborative effort to improve the coverage of computers, computing, and information technology on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
	This article has been rated as Low-importance on the project's importance scale.
	This article is supported by WikiProject Software (assessed as Low-importance).

Is the term CAS too EMC-specific? Some might prefer the expression "disk archiving". Westwind273 00:34, 8 September 2006 (UTC)

This page seems entirely biased towards a particular view of CAS technology, and the number of mentions of "John Canessa" is daunting. There's a lot more in content-based storage than is mentioned in this article; it feels like it was written by one person with a very strong bias about the history of the technology, and lacks any authoritative citations for why that view of history is correct. There's a lot of relevant academic work on content addressability - Venti, which _is_ cited, as well as systems such as the Low-Bandwidth File System, Windows' "Single Instance Storage", and enormous work on disk deduplication in research (Fred Douglis at IBM is a good starting point, and Data Domain, recently acquired by EMC, is a good starting point on the corporate side). 128.2.209.18 (talk) 14:56, 3 November 2009 (UTC) DaveAndersen

It's disgraceful that an article with the title "Content-addressable storage" should suggest that the history of the topic began in 1992. Content-addressable storage was a term that had been around for several decades by then, products providing content-addressed storage had been available for a long time, and the article looks like an attempt to claim an undeserved priority for specific people and products. The coat-hook metaphor is NOT relevant to CAS in general, but only to a particular firm's product, and I guess the use of this is part of the same over-inflated claim. Maybe a disambiguation page would avoid the appearance of commercial puffery instead of an encyclopedia article, with this page NOT carrying the simple title it currently carries (since that would belong to the disambiguation page), but I think an article on content-addressable storage in general is needed as a top-level article rather than just a disambiguation page. Michealt (talk) 14:52, 25 July 2010 (UTC)

No info on hash collisions

Since hashing produces non-unique keys, collisions are always a risk, even though very long keys lower that risk, so content-addressable storage doesn't scale safely for massive collections. The issue is both that multiple documents may share the same key and, more problematically, that the hubris of overconfident programmers leads them to skip writing collision-handling code. The article brazenly omits this risk.

For people who say "oh, well these hashes can't collide, they could label every atom in the universe uniquely" - the reality is that this is merely another case of the birthday paradox. And if the hash length *were* enough to be certain, surely tossing one bit wouldn't make it too short, right?...[repeat until interlocutor gets uncomfortable with the shrinking bit count]. Alex North-Keys (talk) 00:18, 28 April 2023 (UTC)
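
To put rough numbers on the birthday argument above, here is a minimal sketch, not taken from the article or any cited source. It assumes hash outputs behave like uniformly random bit strings, which is an idealisation, and uses the standard approximation p ≈ 1 - exp(-n(n-1) / 2^(b+1)) for n items and a b-bit hash. The function name `collision_probability` is made up for illustration.

```python
import math


def collision_probability(num_items: int, hash_bits: int) -> float:
    """Approximate chance that at least two of num_items hashes collide.

    Assumption: hash values are uniformly random hash_bits-bit strings.
    Uses the birthday approximation 1 - exp(-pairs / 2**hash_bits).
    """
    # Number of distinct pairs of items that could collide.
    pairs = num_items * (num_items - 1) / 2
    # ldexp(x, -b) computes x / 2**b without building a huge integer.
    exponent = math.ldexp(pairs, -hash_bits)
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for tiny x.
    return -math.expm1(-exponent)


if __name__ == "__main__":
    items = 10 ** 18
    # Dropping bits one step at a time shows how the estimate grows.
    for bits in (256, 160, 128, 64):
        print(bits, collision_probability(items, bits))
```

Under these assumptions each bit removed roughly doubles the estimate while it is still small, which is the "toss one more bit" argument in numeric form: the probability is never exactly zero, only small.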

Seconded. This needs to be mentioned in this article^[a], and prominently^[b].

It is possible to safely use hashes for addressing storage, but each new copy^[c] ingested needs to be checked in some additional way ^[d].

If they match, great; only one copy needs to be retained!
If they don't match, however, then some sort of secondary 'collision identifier' needs to be used. As more and more data is encrypted (and therefore is effectively random), the collision risk becomes higher still. Trying to de-duplicate on the block level (or even worse, using a 'rolling window' method)

Hashes (cryptographic or otherwise) by themselves can be very useful for message authentication, or even just to guarantee data integrity (i.e. making sure a file wasn't intentionally or inadvertently changed or corrupted); alone, however, they cannot safely be used to add content to a storage system.^[e]
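
The ingest check described above (compare the new copy against what is already stored under the same hash, and fall back to a secondary collision identifier on a mismatch) could look roughly like the sketch below. It is purely illustrative and not how any particular product works: the class `VerifiedStore`, its in-memory dictionary, and the `(digest, index)` key shape are all assumptions made for this example.

```python
import hashlib


class VerifiedStore:
    """Hypothetical in-memory store keyed by (digest, collision index)."""

    def __init__(self, hash_func=None):
        # Default to SHA-256; a custom hash_func can be injected for testing.
        self._hash = hash_func or (lambda data: hashlib.sha256(data).hexdigest())
        # digest -> list of distinct contents that share that digest
        self._blobs = {}

    def put(self, data: bytes):
        digest = self._hash(data)
        bucket = self._blobs.setdefault(digest, [])
        for index, existing in enumerate(bucket):
            # Full binary comparison: only identical content is deduplicated.
            if existing == data:
                return (digest, index)
        # First time this digest is seen, or a genuine collision:
        # store the content under the next collision index.
        bucket.append(data)
        return (digest, len(bucket) - 1)

    def get(self, key) -> bytes:
        digest, index = key
        return self._blobs[digest][index]


if __name__ == "__main__":
    store = VerifiedStore()
    first = store.put(b"hello")
    again = store.put(b"hello")
    print(first == again)  # the second ingest is deduplicated
```

The cost of this design is that the address is no longer the hash alone, and every repeat ingest pays for a byte-for-byte comparison; whether that trade-off is worth it is exactly the question raised in this thread.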

See also: the Pigeonhole principle, Record linkage, and the Gambler's fallacy.

- Jim

(I don't have any sources handy at the moment, but I took the time to write this in the hope that someone else does; I'm sure there are numerous papers in the ACM library, for example.)

^[a] (warnings are likely needed in other articles as well)
^[b] Just today I discovered (yet another) very well-meaning open-source backup project (Kopia) that thinks it can get data deduplication 'for free' just because it is using cryptographic hashes.
^[c] If a particular hash hasn't been seen yet at all, then no additional checks are needed. Every subsequent time, though, these check(s) are vital.
^[d] (e.g. ideally a full binary comparison, but at least another type of checksum / hash might also be used; storing and checking the file / message size is also a good practice to consider)
^[e] Git gets away with this because most computer source code is written in fairly low-entropy text files, which reduces the chance of a collision greatly.

- Jim Grisham (talk) 05:53, 8 September 2023 (UTC)

External links modified

Hello fellow Wikipedians,

I have just modified one external link on Content-addressable storage. Please take a moment to review my edit. If you have any questions, or need the bot to ignore the links, or the page altogether, please visit this simple FaQ for additional information. I made the following changes:

Added archive https://web.archive.org/web/20071012085111/http://www.opensolaris.org/os/project/honeycomb/ to http://www.opensolaris.org/os/project/honeycomb/

When you have finished reviewing my changes, you may follow the instructions on the template below to fix any issues with the URLs.

This message was posted before February 2018. After February 2018, "External links modified" talk page sections are no longer generated or monitored by InternetArchiveBot. No special action is required regarding these talk page notices, other than regular verification using the archive tool instructions below. Editors have permission to delete these "External links modified" talk page sections if they want to de-clutter talk pages, but see the RfC before doing mass systematic removals. This message is updated dynamically through the template {{source check}} (last update: 18 January 2022).

If you have discovered URLs which were erroneously considered dead by the bot, you can report them with this tool.
If you found an error with any archives or the URLs themselves, you can fix them with this tool.

Cheers. InternetArchiveBot (Report bug) 05:39, 25 May 2017 (UTC)
