Talk:Data deduplication

Computing High‑importance

	This article is within the scope of WikiProject Computing, a collaborative effort to improve the coverage of computers, computing, and information technology on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
High	This article has been rated as High-importance on the project's importance scale.

meaning

I think Data De-duplication, or triple D, means the removal of redundant and duplicate data at the subfile level. The email example is single-instance storage (SIS). Vendors often use the glass-of-water example to explain deduplication. Compression is also a technology for removing redundant data for archival purposes. Even with single-instance storage, to back up a glass of water, another glass is needed to copy the water molecule by molecule. With 3D, to back up a glass of water, a map of the molecules is created along with a single molecule (H2O), thus reducing the capacity requirement to 1/500 or 1/1000 of the original data.

Just for the record, I think it would be a really bad habit to use "3D" as the abbreviation for "data deduplication", for obvious reasons. I would love to see more info on this subject: specifically, an explanation differentiating this from deduplication at the file level, and a link to what that is called. "File organizing" etc. is far too generic to find any useful results. Cheers. —Preceding unsigned comment added by 68.41.20.99 (talk) 16:15, 3 March 2010 (UTC)

OK, so WHERE is this implemented ???

The article doesn't list a single practical implementation. Where is this implemented? In which file system, for example? --boarders paradise (talk) 00:06, 6 July 2011 (UTC)

When I modified the article a year ago (I think I was using the name StorageMogul), I also added a bunch of references to different implementations, to cite real-world examples of the techniques in use by various vendors. I think only one remains. I don't have the energy to get into a wikifight to put them back. Fabkins (talk) 13:25, 7 July 2011 (UTC)

I restored the section Major commercial players and technology. It was removed by the revision

08:31, 28 November 2010 Marokwitz (talk | contribs) m . . (18,067 bytes) (-4,629) . . (Non notable / adverting, no citations)

I don't see it as useful in its current state, but it is not advertising, nor is there a reason to remove such information. Medende (talk) 19:44, 12 July 2012 (UTC)

Some are free, so I removed "commercial". — Preceding unsigned comment added by 24.226.192.10 (talk) 23:54, 4 October 2012 (UTC)

SIS is implemented in Windows Server operating systems. It is mainly used as part of features like WDS, where multiple images are stored on a server for reimaging/staging workstations. You should do a little more research before challenging something.

Merge proposal

A recent contribution by Skeptiks (talk · contribs) indicates that Data deduplication and Single-instance storage are "somewhat synonymous". Perhaps a merge is in order. —Kvng 15:41, 22 November 2012 (UTC)

Speaking as the editor of the SNIA Dictionary, I can understand the motivation for this, but disagree. SIS systems do not necessarily do what is usually thought of as dedup. Alanyoder (talk) 01:57, 21 December 2012 (UTC)

I agree with Alanyoder. The distinction is subtle from the user's point of view but quite meaningful from the developer's perspective. ThomasMikael (talk) 22:47, 19 February 2013 (UTC)

I also agree with Alanyoder. SIS and data deduplication are related but quite different. It is perfectly possible to have a system that does SIS but does not do active data deduplication. A good example is an e-mail system that stores only one copy of each e-mail (SIS), but if you get two different e-mails with the exact same attachment, the attachment is stored twice, because the system is not doing any active data deduplication to point both messages at a single stored copy. Joel2600 (talk) 11:49, 6 June 2013 (UTC)
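To make the e-mail example above concrete, here is a minimal sketch in Python. The class names `MessageLevelStore` and `ContentAddressedStore` are hypothetical and do not describe any particular mail product; they only contrast keeping one copy per message with keeping one copy per distinct attachment content.

```python
import hashlib


class MessageLevelStore:
    """SIS keyed by message ID: one stored copy per distinct message (hypothetical)."""

    def __init__(self):
        self.messages = {}  # message_id -> attachment bytes

    def put(self, message_id: str, attachment: bytes) -> None:
        # Delivering the same message to many recipients stores it only once.
        self.messages.setdefault(message_id, attachment)

    def stored_bytes(self) -> int:
        return sum(len(a) for a in self.messages.values())


class ContentAddressedStore:
    """Deduplication keyed by content hash: one copy per distinct attachment (hypothetical)."""

    def __init__(self):
        self.blobs = {}     # SHA-256 hex digest -> attachment bytes
        self.messages = {}  # message_id -> digest (a pointer to the blob)

    def put(self, message_id: str, attachment: bytes) -> None:
        digest = hashlib.sha256(attachment).hexdigest()
        self.blobs.setdefault(digest, attachment)
        self.messages[message_id] = digest

    def stored_bytes(self) -> int:
        return sum(len(b) for b in self.blobs.values())


attachment = b"quarterly-report" * 1000
sis, dedup = MessageLevelStore(), ContentAddressedStore()
for store in (sis, dedup):
    store.put("msg-1", attachment)  # first e-mail
    store.put("msg-1", attachment)  # same e-mail, second recipient
    store.put("msg-2", attachment)  # different e-mail, identical attachment
```

In this sketch the message-level store ends up holding the attachment twice (once per distinct message), while the content-addressed store holds it once.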

Merge: same topic. Widefox; talk 13:01, 4 April 2013 (UTC)

I also agree that they are related, but they are not the same. This should be preserved as a separate topic. — Preceding unsigned comment added by 12.16.33.89 (talk) 13:08, 14 August 2013 (UTC)

Rabin?

How can an article on data deduplication not contain a link to Rabin's fingerprinting algorithm? — Preceding unsigned comment added by 70.181.173.18 (talk) 11:38, 28 February 2013 (UTC)

Wikipedia needs relevant information

I agree with several of the discussions here.

A) There needs to be a cursory listing of vendors with products that use deduplication or single-instance storage.
B) Per this discussion, single-instance storage is one method of deduplicating data, and hence should be cross-referenced at the very least.
C) Last but not least, Rabin is indeed the father of deduplication and should be referenced, well and often IMHO.

Additional notes: I spent a significant amount of time adding content and creating a list of all significant vendors who have deduplication products, meaning, roughly, those whose products have more than 10 million in sales per year.

I am a recognized expert in this field, since I am an industry analyst and formerly a degreed engineer by training with significant IT experience. I have written several papers on the topic, I create primary research on the topic, and I maintain a listing and analysis of about 20 of the top deduplication products in the industry. Additionally, I have tested several of these products. I believe that all of these facts should provide some basis for me to add relevant content and not have some 17-year-old hacker remove my content because it differs from their interpretation of the goals of Wikipedia.

Yet somehow, a wiki-nazi[1] decided that he didn't like any actual examples or names of companies, stating that it was against Wikipedia policy. Interestingly, I see hundreds of wiki entries that do in fact list examples. Since I am not promoting any company, nor do I receive any direct or indirect benefit from doing so, you would think that would provide a neutral basis to add entries. Apparently not…

In fact, I would propose a new wiki entry on wiki-nazis and could give some names and examples from this very article.

Comments? Rfellows (talk) 17:17, 17 December 2013 (UTC)

I recently became interested in the topic of data deduplication and so went to Wikipedia. What I found leads me to agree with you.

Right now the article is of hardly any use at all. It mentions several different approaches (including "Single Instance Storage") all under one and the same article name. This does nothing to clarify the topic.

There is also no overview of actual practical implementations, since someone "helpfully" removed all mentions. The last revision containing that info is https://en.wikipedia.org/w/index.php?title=Data_deduplication&oldid=608804064, and it was changed with the comment "removing excessive references", which somehow came to mean all of them. This makes the article appear as if deduplication is not actually used, or hardly used, since there are NO useful references to implementations. --89.14.74.47 (talk) 13:03, 18 November 2015 (UTC)

References

[1] "[1]". UB, 2013. Retrieved 2013-12-17.

Hardlinking / "copy only if changed" needs to be in this article

This article seems to suggest that hardlinks are not a form of deduplication, as if finding duplicate filesystem blocks were the only possible definition of deduplication. To me this suggests bias by the authors of this article, to promote or maybe advertise their specific type or brand of deduplication.

Hardlinking is definitely a form of deduplication, and it has existed in Unix/Linux filesystems going back at least 30 years now.

cp -al combined with an rsync --delete-before copy is a very effective form of deduplication.

If anything, sector/block-level deduplication requires massive CPU power dedicated to the task, to constantly rescan all existing data for patterns that match new incoming data.

Hardlinking and rsync don't require that. The approach is very specific about what to check to find existing duplicate data, namely the previous versions of the same data files to be backed up again.

-- DMahalko (talk) 23:30, 18 October 2014 (UTC)
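The hardlink approach described above can be sketched roughly as follows. This is an illustration in Python, not a reimplementation of cp or rsync: the function names `unchanged` and `snapshot` are hypothetical, and the size-plus-modification-time comparison only approximates rsync's default quick check. It assumes a filesystem that supports hard links.

```python
import os
import shutil
from typing import Optional


def unchanged(current: str, old: str) -> bool:
    """Assumed 'quick check': same size and same whole-second modification time."""
    a, b = os.stat(current), os.stat(old)
    return a.st_size == b.st_size and int(a.st_mtime) == int(b.st_mtime)


def snapshot(source: str, previous: Optional[str], target: str) -> None:
    """Back up `source` into `target`, hard-linking files unchanged since `previous`."""
    for root, _dirs, files in os.walk(source):
        rel = os.path.relpath(root, source)
        dest_dir = os.path.normpath(os.path.join(target, rel))
        os.makedirs(dest_dir, exist_ok=True)
        for name in files:
            src = os.path.join(root, name)
            dst = os.path.join(dest_dir, name)
            old = os.path.join(previous, rel, name) if previous else None
            if old and os.path.isfile(old) and unchanged(src, old):
                os.link(old, dst)       # no new data blocks: the file is deduplicated
            else:
                shutil.copy2(src, dst)  # new or changed file: store a fresh copy
```

Only the previous version of each file is examined, which is why this approach needs no global index or rescan. The trade-off is that identical content under a different path, or a file with only a few changed blocks, is stored again in full.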

This article is a good start, but it seems to have been edited at least once by a person who doesn't understand deduplication very well. Deduplication as it is practiced today has almost nothing to do with single instance storage, that is, storing one copy of a file with subsequent copies consisting of pointers to the original file. While that is an early method of reducing duplicate file storage, it is not what people are calling deduplication today.

The repeated reference to this file-replacement method throughout the article is confusing, at the very least. Obviously so, since someone here in the talk section has suggested combining the deduplication article with the SIS article. That would be like combining arithmetic with calculus.

Anyone investigating deduplication with the thought of applying the information today would be looking for block-based, cryptographic-hash-mediated data reduction technology.
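For readers unfamiliar with what "block-based, cryptographic-hash-mediated" means in practice, here is a minimal sketch in Python. The class name `BlockStore` and the 4096-byte fixed block size are assumptions chosen for illustration; real products differ, and many use variable-size chunking rather than fixed blocks.

```python
import hashlib
from typing import Dict, List


class BlockStore:
    """Toy fixed-size block deduplication keyed by SHA-256 (illustrative only)."""

    def __init__(self, block_size: int = 4096):
        self.block_size = block_size
        self.blocks: Dict[str, bytes] = {}  # digest -> unique block contents

    def write(self, data: bytes) -> List[str]:
        """Store data and return its 'recipe': the ordered list of block digests."""
        recipe = []
        for offset in range(0, len(data), self.block_size):
            block = data[offset:offset + self.block_size]
            digest = hashlib.sha256(block).hexdigest()
            # A block whose digest is already known is not stored again.
            self.blocks.setdefault(digest, block)
            recipe.append(digest)
        return recipe

    def read(self, recipe: List[str]) -> bytes:
        """Rebuild the original data from its recipe."""
        return b"".join(self.blocks[d] for d in recipe)


store = BlockStore()
data = b"A" * 4096 * 3 + b"B" * 4096  # four blocks, only two distinct
recipe = store.write(data)
```

Here the recipe has four entries but the store keeps only two unique blocks. The sketch trusts the digest alone to decide that two blocks are equal, which is the assumption the hash-collision discussion further down this page is about.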

The fact that the article does not mention common and even ubiquitous implementations of this technology, such as Data Domain, EMC Avamar, NetApp, Quantum, ExaGrid, GreenBytes, Microsoft (which is free), and ZFS (which is available in numerous free and fee versions, such as Solaris, OpenSolaris, FreeNAS, ZFS on Linux, Nexenta and others), will also lead to confusion on the part of readers. Do the Wikipedia editors know what they are talking about? Are they talking about the same kind of technology that I am considering, or something else?

I am not a deduplication vendor, but I work with it quite a bit because many of my customers use it.

I've never edited a Wikipedia article before. My one question is: if the troll who keeps editing out useful links for people trying to learn about deduplication comes back, can we report this person for the behavior and get them banned from hijacking the article over and over? I guess the other thing that could be done is to post corrected information somewhere else and get out the word that the Wikipedia article is close to worthless. Bradjensen3 (talk) 20:40, 16 November 2014 (UTC) bradjensen3

@Bradjensen3: Sounds like a great opportunity for you to improve this article. Brycehughes (talk) 07:51, 17 November 2014 (UTC)

Error rate calculation is unsupportable.

From the article, "The hash functions used include standards such as SHA-1, SHA-256 and others. These provide a far lower probability of data loss than the risk of an undetected and uncorrected hardware error in most cases and can be in the order of 10^-49% per petabyte (1,000 terabyte) of data.[8]"

The footnote #8 reference is no longer valid. The cited website is apparently for sale.

The calculation of data loss probability is directly related to the probability of a hash collision. While the average probability can be calculated if you assume that the output of the SHA-1 hash is uniformly distributed across all of the possible (2^160) outputs, there is to my knowledge no analysis showing such a distribution. The output distribution of SHA-1 is in fact not well known. Under the uniformity assumption, two given random outputs collide with probability 1 in 2^160, and the birthday paradox says that a collision somewhere in a set becomes likely once roughly 2^80 outputs have been generated.

However, cryptographic research has shown that the work needed to deliberately construct a collision can be reduced to about 2^61 operations.

But if the output distribution is highly skewed, then accidental collisions may be far more likely than the uniform estimate suggests. A collision would cause a storage device using inline deduplication based on SHA-1 to experience data corruption. The likelihood of data corruption corresponds to that of collisions, and that is related to the skewness of the SHA-1 output distribution. Since the skewness is unknown, the likelihood cannot be calculated or estimated. The article's claim of 10^-49% per petabyte cannot be verified by calculation or simulation. It should be removed.
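For reference, this is what the uniform-distribution assumption questioned above would give. The sketch below, in Python, uses the standard birthday approximation p ≈ n(n−1)/2 ÷ 2^bits, which holds only for an ideal hash and only while p is small. The function names and the 8 KiB block size are assumptions made for illustration; the sketch does not attempt to reproduce or verify the article's 10^-49% figure.

```python
def birthday_collision_probability(num_items: int, hash_bits: int) -> float:
    """Approximate chance of at least one collision among num_items hashes.

    Assumes outputs are uniform and independent over 2**hash_bits values,
    which is exactly the assumption disputed in the comment above.
    """
    pairs = num_items * (num_items - 1) // 2  # number of distinct pairs of items
    return min(1.0, pairs / 2 ** hash_bits)


def blocks_per_petabyte(block_size: int) -> int:
    """Number of whole blocks in one petabyte (10**15 bytes)."""
    return 10 ** 15 // block_size


# Hypothetical example: 8 KiB blocks hashed with a 160-bit function such as SHA-1.
n_blocks = blocks_per_petabyte(8192)
p_uniform = birthday_collision_probability(n_blocks, 160)
```

The result depends heavily on the block size and on the uniformity assumption, so any single per-petabyte figure should state both.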

External links modified

Hello fellow Wikipedians,

I have just modified 2 external links on Data deduplication. Please take a moment to review my edit. If you have any questions, or need the bot to ignore the links, or the page altogether, please visit this simple FaQ for additional information. I made the following changes:

When you have finished reviewing my changes, please set the checked parameter below to true or failed to let others know (documentation at {{Sourcecheck}}).

This message was posted before February 2018. After February 2018, "External links modified" talk page sections are no longer generated or monitored by InternetArchiveBot. No special action is required regarding these talk page notices, other than regular verification using the archive tool instructions below. Editors have permission to delete these "External links modified" talk page sections if they want to de-clutter talk pages, but see the RfC before doing mass systematic removals. This message is updated dynamically through the template {{source check}} (last update: 18 January 2022).

If you have discovered URLs which were erroneously considered dead by the bot, you can report them with this tool.
If you found an error with any archives or the URLs themselves, you can fix them with this tool.

Cheers.—InternetArchiveBot (Report bug) 07:56, 7 December 2016 (UTC)

External links modified

Hello fellow Wikipedians,

I have just modified 5 external links on Data deduplication. Please take a moment to review my edit. If you have any questions, or need the bot to ignore the links, or the page altogether, please visit this simple FaQ for additional information. I made the following changes:

Corrected formatting/usage for http://www.backupcentral.com/content/view/134/47/
Corrected formatting/usage for http://www.microsoft.com/windowsserver2008/en/us/WSS08/SIS.aspx
Corrected formatting/usage for http://www.infostor.com/webcast/display_webcast.cfm?id=540
Corrected formatting/usage for http://www.snia.org/forums/dmf/knowledge/white_papers_and_reports/Understanding_Data_Deduplication_Ratios-20080718.pdf
Added {{dead link}} tag to http://public.dhe.ibm.com/common/ssi/ecm/en/tsu12345usen/TSU12345USEN.PDF
Added archive https://web.archive.org/web/20100911194757/http://www.itnext.in/content/doing-more-less.html to http://www.itnext.in/content/doing-more-less.html

When you have finished reviewing my changes, you may follow the instructions on the template below to fix any issues with the URLs.

Cheers.—InternetArchiveBot (Report bug) 06:15, 5 September 2017 (UTC)

List of file-systems

I believe it would be relevant to include a list of file-systems with data deduplication.

NatoBoram (talk) 20:04, 26 August 2018 (UTC)

Implementations

I'm not prepared right now to edit the section, but it seems to me that the most common place to find deduplication these days is in backup and archiving software. It might be helpful if the Implementations section mentioned that and maybe also gave a few examples. 2601:280:5D00:4690:342F:E2FB:9AB5:1A6E (talk) 14:40, 22 March 2023 (UTC)