Data proliferation

Data proliferation refers to the prodigious amount of data, structured and unstructured, that businesses and governments continue to generate at an unprecedented rate and the usability problems that result from attempting to store and manage that data. While originally pertaining to problems associated with paper documentation, data proliferation has become a major problem in primary and secondary data storage on computers.

While digital storage has become cheaper, the associated costs, from raw power to maintenance and from metadata to search engines, have not kept up with the proliferation of data. Although the power required to maintain a unit of data has fallen, the cost of facilities which house the digital storage has tended to rise.[1]

At the simplest level, company e-mail systems spawn large amounts of data. Business e-mail – some of it important to the enterprise, some much less so – is estimated to be growing at a rate of 25-30% annually. And whether it’s relevant or not, the load on the system is being magnified by practices such as multiple addressing and the attaching of large text, audio and even video files.

— IBM Global Technology Services[2]

Data proliferation has been documented as a problem for the U.S. military since August 1971, in particular regarding the excessive documentation submitted during the acquisition of major weapon systems.[3] Efforts to mitigate data proliferation and the problems associated with it are ongoing.[4]

Problems caused

The problem of data proliferation is affecting all areas of commerce as a result of the availability of relatively inexpensive data storage devices. This has made it very easy to dump data into secondary storage immediately after its window of usability has passed. This masks problems that could gravely affect the profitability of businesses and the efficient functioning of health services, police and security forces, local and national governments, and many other types of organizations.[2] Data proliferation is problematic for several reasons:

  • Difficulty finding and retrieving information. At Xerox, employees take on average more than one hour per week to find hard-copy documents, and managing and storing those documents costs $2,152 a year. For businesses with more than 10 employees, this increases to almost two hours per week at $5,760 per year.[5] In large networks of primary and secondary data storage, problems finding electronic data are analogous to problems finding hard-copy data.
  • Data loss and legal liability when data is disorganized, not properly replicated, or cannot be found promptly. In April 2005, the Ameritrade Holding Corporation told 200,000 current and past customers that a tape containing confidential information had been lost or destroyed in transit. In May of the same year, Time Warner Incorporated reported that 40 tapes containing personal data on 600,000 current and former employees had been lost en route to a storage facility. In March 2005, a Florida judge hearing a $2.7 billion lawsuit against Morgan Stanley issued an "adverse inference order" against the company for "willful and gross abuse of its discovery obligations." The judge cited Morgan Stanley for repeatedly finding misplaced tapes of e-mail messages long after the company had claimed that it had turned over all such tapes to the court.[6]
  • Increased manpower requirements to manage increasingly chaotic data storage resources.
  • Slower networks and application performance due to excess traffic as users search and search again for the material they need.[2]
  • High cost in terms of the energy resources required to operate storage hardware. A 100-terabyte system can cost up to $35,040 a year to run, not counting cooling costs.[7]
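
The energy figure in the last item can be made concrete with simple arithmetic: a continuously running load costs its power draw, times the hours in a year, times the price of electricity. The sketch below is illustrative only. The cited article's own inputs are not given here, so the 40 kW draw and the $0.10 per kWh price are hypothetical values chosen because their product comes to roughly $35,040 a year; the function name annual_energy_cost is likewise an assumption.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, ignoring leap years


def annual_energy_cost(power_kw: float, price_per_kwh: float) -> float:
    """Return the yearly electricity cost of a load that runs continuously."""
    return power_kw * HOURS_PER_YEAR * price_per_kwh


# Hypothetical inputs, not taken from the cited article: a 40 kW draw at
# $0.10 per kWh is one combination that gives roughly $35,040 per year.
cost = annual_energy_cost(power_kw=40.0, price_per_kwh=0.10)
print(f"${cost:,.2f} per year")
```

Because the cost is linear in both inputs, halving either the power draw or the electricity price halves the yearly bill. Cooling would add to this total, as the item above notes.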

Proposed solutions

  • Applications that better utilize modern technology
  • Reductions in duplicate data (especially as caused by data movement)
  • Improvement of metadata structures
  • Improvement of file and storage transfer structures
  • User education and discipline[3]
  • The implementation of Information Lifecycle Management solutions to eliminate low-value information as early as possible before putting the rest into actively managed long-term storage in which it can be quickly and cheaply accessed.[2]
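
As one possible illustration of reducing duplicate data, the sketch below groups files under a directory by a hash of their contents, so that byte-identical copies can be reviewed and consolidated. It is a minimal example under stated assumptions, not a method described in the cited sources: the helper names file_digest and find_duplicates are hypothetical, and SHA-256 is simply one reasonable choice of content hash.

```python
import hashlib
from collections import defaultdict
from pathlib import Path
from typing import Dict, List


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks to bound memory use."""
    hasher = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def find_duplicates(root: Path) -> List[List[Path]]:
    """Group regular files under root by content; return groups with more than one file."""
    by_digest: Dict[str, List[Path]] = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            by_digest[file_digest(path)].append(path)
    return [paths for paths in by_digest.values() if len(paths) > 1]
```

Calling find_duplicates on a storage directory returns lists of paths whose contents are identical. The sketch only reports duplicates and deletes nothing, which leaves the decision about which copy to keep to a person or to a retention policy.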

References

  1. "Downsizing the digital attic". Deloitte Technology Predictions. Archived from the original on July 22, 2011.
  2. "The Toxic Terabyte". IBM Global Technology Services, July 2006.
  3. "Evolution of the Data Proliferation Problem within Major Air Force Acquisition Programs". Archived from the original on 2007-10-09. Retrieved 2007-10-09.
  4. "Data Proliferation: Stop That".
  5. Vawn Himmelsbach. "Dealing with data proliferation". itbusiness.ca: Canadian Technology News, September 19, 2006.
  6. "Data: Lost, Stolen or Strayed". Computer World, Security.
  7. "Power and storage: the hidden cost of ownership". Computer Technology Review, October 2003.