Wikipedia talk:Database download/Archive 1

From Wikipedia, the free encyclopedia

Wiki2static ... or?

What is currently the best way to create a static version of Wikipedia? Does wiki2static have any problems? I would like to put a national Wikipedia on a CD. Thanks. -Juraj

Wiki2static down - Mirror?

It seems the original page of Wiki2static (http://www.tommasoconforti.com/wiki/) is down. All I get is an ad by some ISP, nothing more. It would be nice if either User:Alfio would put the page back up, or anyone who still has that file would put up a mirror. If it's just a problem with the hoster, I can provide webspace with a good connection.

- Dario


Wikipedia as XML? / Download one article at time

I want to write software that displays information from Wikipedia. I don't want to do a huge database download; I just want it to have access to an up-to-date copy of a single Wikipedia article (in wiki source), more like a browser. How can I? --Alexandre Van de Sande 01:09, 2 Aug 2004 (UTC)

Wouldn't it be nice to have a download of Wikipedia based on an XML language? ~cpb 2004-04-26

Maybe Special:Export would help? Angela. 16:51, Aug 12, 2004 (UTC)

To access any article in xml, one at a time, link:

-- Blinklmc
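
A minimal sketch of the one-article-at-a-time approach, assuming the Special:Export address quoted further down this page (http://en.wikipedia.org/wiki/Special:Export/) followed by the article title. The function names are made up for illustration, and the fetch function needs network access.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Base address as quoted elsewhere on this page; other wikis would differ.
EXPORT_BASE = "http://en.wikipedia.org/wiki/Special:Export/"


def export_url(title: str) -> str:
    """Build the Special:Export URL for a single article title."""
    # Wiki titles use underscores where the displayed title has spaces.
    return EXPORT_BASE + quote(title.replace(" ", "_"))


def fetch_article_xml(title: str) -> bytes:
    """Download the XML export of one article (hypothetical helper)."""
    with urlopen(export_url(title)) as response:
        return response.read()
```

Calling export_url("Tim Berners-Lee") gives the address to request; the wiki source of the article sits inside the returned XML.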

Experimental Mirror of Wikipedia

I've been attempting to set up a copy of wikipedia on one of my servers for experimenting and testing. I want to use the real data to experiment with the wikipedia code and be able to more closely examine the data structure. I have in mind potentially altering the code to use in another project that is something of a "People Data Store". Example: "Quotes" are made by people, people have biographies that relate them to other people, places, and events in time. This could also apply to many other works of people such as "Lyrics", "Books", "Articles", "Film", "Programming code". Lots of possibilities. In many ways it is much like an encyclopedia, just more (for lack of a better way of expressing it) factual and concrete. 8-)

My problem: for a couple of days now I've been attempting to download the data dump of the encyclopedia and history from the download page. Unfortunately, all I get instead of a gzip, tar, or zip file is the text data dump in my browser. Is there some other way that I can get the current data and history files? I don't care about the size, but a file is much more useful than a text data listing of many MB. Also, is there some method of data replication that is used to keep other copies current? Any help with this would be much appreciated. Thanks (albrown AT chook DOT com or al AT thetinfoilhat DOT com)

Your browser appears to be helpfully un-gzipping the data for you. If this is a problem (ie, you don't want to take up that much hard disk space just for the dump), try a less intelligent program. ;) "wget" is a nice command-line web/ftp file fetcher; I think there's a version compiled for windows. (Google it.) Keep in mind that the SQL dump will be equally effective zipped or unzipped; you have to read it back into the database or write your own program to suck the data out of the SQL commands. --Brion 19:01 Oct 2, 2002 (UTC)
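
For anyone without wget, a rough Python equivalent of "a less intelligent program": it copies the bytes to disk exactly as the server sends them and never tries to un-gzip anything. The function name and the chunk size are my own choices, not anything from the download page.

```python
import shutil
from urllib.request import urlopen


def download_raw(url: str, dest_path: str, chunk_size: int = 1 << 20) -> None:
    """Save the response body to dest_path without decoding or unpacking it."""
    with urlopen(url) as response, open(dest_path, "wb") as out:
        # Copy in fixed-size chunks so a multi-gigabyte dump never
        # has to fit in memory at once.
        shutil.copyfileobj(response, out, chunk_size)
```

The saved file should then still be compressed, and can be unpacked or piped into the database whenever convenient.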

There are times when I actively hate the latest IE. Is ftp an option then? I tried to connect to ftp.wikipedia.com and didn't get very far as "anon".

Get Mozilla!
Try this instead: make a link to the file you want to download by putting it in brackets, e.g. The Internet Movie Database ([http://www.imdb.com The Internet Movie Database]), then right click on the link and "save target as."
IE for me has been particularly contrary and addlebrained; I can only assume that that's what you're using too. Best, --KQ 22:20 Oct 2, 2002 (UTC)

I was able to get the files by using wget. Thanks for all the help. I will remember the trick about making a link in brackets. This is odd though, as this is the first and only time that I have ever had IE (and Netscape 4.7; I even tried to download with that brain dead clunky ......) unzip the file into the page; normally I can click on any download link and then get a message asking me if I want to save or not. Oh well, got the files and thanks much again for the help. Al Brown 23:21 CST Oct 2, 2002.

Daily tarballs of older Non-English Wikipedias

These have not yet been upgraded and are running on UseMod-wiki. The software and data are included together in a single tarball.

<-- this is broken, I think The German, Polish, and Esperanto wiki trees are also --> <-- available for live update via rsync. -->

I removed the above section as it is no longer true and the links are dead. Angela. 14:49, Feb 21, 2004 (UTC)

wiki-table to html-table problems in script wiki2static

Hi, is there an update of wiki2static script which converts the sql-dump to a html file structure? This script has problems with conversion of the wiki table definitions to html. For this reason, articles containing tables aren't readable.

Hi, I haven't worked on the script for a while, so I didn't put the new table syntax into wiki2static. I'll see if I can do it in the following days. Alfio 14:58, 16 Mar 2004 (UTC)

Compressed text?

I have the Spanish version in a SQL database and the old_text entries are a bunch of symbolic nonsense. For instance, the entry for Andorra is "UAN1E÷•¸ƒ×¨*g@€`›Š]Õ…;q[£ÄŽìÑ#³D\O;@Ê"þqÞÿÿ¾J#(}8*4¦RªîŒàÈž1á2BQiø¥Žp+IÍBun=ž²:T6R_ÂaPq:tN ]Dî2A5õJÆ $òD#z\”à:7¢HbÌ€Ðß?=3´nìEWðŒù!a.ZGCŒåÑ /U¸>œËE¢XðÔÃçâºÇÜÎw.µ“7õÕÕbçz³¹yâŒq»„GÒ€íÿ)þtÇd¢³ð0e›…;b<àÀ*ä³ü½ªýa½pæ†6ÏkÊÓ/ ¨)oQþBø"

I don't know if this is related, but when I loaded up the .sql file I got the following error: "ERROR 1064 at line 269: You have an error in your SQL syntax..."

It's compressed text, I would guess. See the section "Format Change" at http://download.wikimedia.org/ Mr. Jones 12:55, 9 Aug 2004 (UTC)

Problems with Python

Hmmm. I'm having some trouble with this using python (specifically, ipython).

import MySQLdb as m
conn=m.connect("localhost", "user", "password")
conn.select_db("wikipedia")
conn.query("SELECT * FROM old WHERE old_user=333")
r=conn.store_result()
c=[0]
while (c[len(c) - 1] != ()):
    c.append(r.fetch_row())

z=c[len(c) - 2][0][3]

import gzip
gzip.zlib.decompress(z)
---------------------------------------------------------------------------
error                                     Traceback (most recent call last)

/home/mrjones/<console>

error: Error -3 while decompressing data: incorrect header check

f=file("temp.gz", "w")
f.write(z)
f.close()
del f, z, c, r, conn, m
^D
Do you really want to exit ([y]/n)? y
host:~$gunzip temp.gz

gunzip: temp.gz: not in gzip format
host:~$

Mr. Jones 14:24, 16 Dec 2004 (UTC)

Ah, this seems relevant: Old entries marked with old_flags="gzip" have their old_text compressed with zlib's deflate algorithm, with no header bytes. PHP's gzinflate() will accept this text plainly; in Perl etc set the window size to -MAX_WSIZE to disable the header bytes.

However, it's still not working:

gzip.zlib.decompress(z, gzip.zlib.MAX_WBITS)
---------------------------------------------------------------------------
error                                     Traceback (most recent call last)

/home/mrjones/<console>

error: Error -3 while decompressing data: incorrect header check

Mr. Jones 14:30, 16 Dec 2004 (UTC)

meta:Compression looks relevant. Mr. Jones 14:44, 16 Dec 2004 (UTC)

meta:Old_table gave me the clue I needed. -MAX_WSIZE is not a command-line option; it means (0 - MAX_WSIZE), i.e. a negative window size. Now working :-D Will come back and clarify docs a bit later. Mr. Jones 14:49, 16 Dec 2004 (UTC)
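
For the record, a small sketch of the working call in Python. Python's zlib module names the constant MAX_WBITS (the MAX_WSIZE spelling comes from the quoted docs); passing it negated tells zlib to expect a raw deflate stream with no header bytes. The helper names are mine, and deflate_raw is only there so the round trip can be tried without a database.

```python
import zlib


def inflate_raw(data: bytes) -> bytes:
    """Decompress header-less deflate data, as described for old_flags="gzip"."""
    # Negative wbits = raw deflate stream, no zlib or gzip header expected.
    return zlib.decompress(data, -zlib.MAX_WBITS)


def deflate_raw(text: bytes) -> bytes:
    """Produce header-less deflate data (for testing inflate_raw)."""
    compressor = zlib.compressobj(
        zlib.Z_DEFAULT_COMPRESSION, zlib.DEFLATED, -zlib.MAX_WBITS
    )
    return compressor.compress(text) + compressor.flush()
```

With the session above, inflate_raw(z) would replace the failing gzip.zlib.decompress(z) call.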


Lost Connection Error using MySQL

I tried importing the en.sql db into MySQL 4.1.2-alpha and got this error:

$ mysql wiki -u root -p < en.sql
Enter password: *******
ERROR 2013 at line 13147: Lost connection to MySQL server during query

Ran it again, got the same error after about 20 GB had been processed.

--

If I remember correctly, these errors are because the data that you are importing is sent over the connection in packets. Typically, in the server, you configure a limit to the size of these packets to avoid denial of service conditions. However, by doing that, you are also limiting the amount of data that can be put in one field, because each field has to be sent in one packet. It could be that one wiki page contains so much data that the server thinks you are trying to cause a denial of service, and therefore it kicks you off.

The same thing happens when you are running mozilla's bugzilla; the packet size limits the size of attachments that users can insert into the database.

Tinus 22:19, 10 Aug 2004 (UTC)
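
One way to check this theory before re-running a 20 GB import is to measure the longest line in the dump, on the assumption that each INSERT statement sits on one line, and compare it with the server's packet limit. I believe the MySQL setting involved is max_allowed_packet, but Tinus does not name it, so treat that as my assumption. The function name is made up.

```python
def longest_line_bytes(path: str) -> int:
    """Return the length in bytes of the longest line in a dump file."""
    longest = 0
    # Binary mode: we only care about byte counts, not character decoding.
    with open(path, "rb") as dump:
        for line in dump:
            longest = max(longest, len(line))
    return longest
```

If the result is larger than the configured packet limit, raising the limit on both client and server would be the first thing to try.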

meta?

Is this information on Meta somewhere? It looks as though this page was once at m:Meta:Database download (back when the Meta: namespace there was called Wikipedia: also)... but it's gone now. +sj+ 05:38, 16 Jul 2004 (UTC)

I don't think it was on Meta. Links on Meta to Wikipedia:Database download will lead to this page via a redirect from Database download. Angela. 07:20, 16 Jul 2004 (UTC)

Sample blocked crawler email

For some odd reason the arrow on the link http://en.wikipedia.org/wiki/Wikipedia:Database_download in the page is not showing properly. I think it's something to do with the <i> and the wiki stylesheet, or maybe some extra bug in Internet Explorer... Vbs 09:03, 23 Jul 2004 (UTC)

Joining SQL Download Files

What do I use to join the split SQL dump files?

See the Joining SQL Dump Files thread on wikitech-l. Angela. 20:23, Aug 2, 2004 (UTC)
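
In case the thread is hard to find: joining is plain byte-for-byte concatenation in the right order (cat on Unix, copy /b on Windows). A Python sketch of the same thing follows; the function name is mine, and the caller has to pass the parts in the correct order, since I don't know how the split files are named.

```python
import shutil


def join_parts(part_paths, dest_path):
    """Concatenate the given files, in the order given, into dest_path."""
    with open(dest_path, "wb") as out:
        for part in part_paths:
            with open(part, "rb") as chunk:
                # Stream each part so large files are never held in memory.
                shutil.copyfileobj(chunk, out)
```

If the part names sort correctly, sorted(glob.glob(pattern)) is a convenient way to build part_paths.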

Titles only download

Is it possible to download all article titles as a single compressed file? Pjacobi 17:53, 11 Aug 2004 (UTC)

The article titles of the English Wikipedia are available. Download allentitlesinns0.gz from download.wikimedia.org/archives/en/. It's possible other languages will be added in future. Angela. 00:28, Oct 5, 2004 (UTC)
This is now moved to all_titles_in_ns0.gz which is linked to from download.wikimedia.org/wikipedia/en/. Angela. 04:14, May 2, 2005 (UTC)

How big is the uncompressed wikipedia?

How big is the uncompressed 20040811_old_table.sql.bz2 from en? Less than 30gb, I would guess, given that it was reported to be 18gb fairly recently, and starts out at about 9gb. What is the compression ratio? It seems that the suggested bzip2.exe for windows (used with XP pro) does not work (it's much slower than unxutils' one, which does the same thing, i.e. produces a file of over 30gb without stopping, it just does it faster). I'll have to see if I can do it under Debian. Mr. Jones 21:56, 13 Aug 2004 (UTC)

OK, so the 18Gb is the size of all database files when compressed. I'll clarify the text. Still, the question remains. Mr. Jones 04:33, 14 Aug 2004 (UTC)

The answer is, for the record, that the size of the decompressed old table for the en database of 8-8-2004 is about 40Gb. Mr. Jones 14:13, 14 Aug 2004 (UTC)

Database Dump Compression Format

Is there anywhere I can download the dump files that have been compressed with something other than bz2!? maybe gzip or zip?

Why would you want to do that? (Please sign your posts with ~~~~) Mr. Jones 20:48, 16 Dec 2004 (UTC)

Using Special Export

Is there any way to use http://en.wikipedia.org/wiki/Special:Export/ to return a nearest match or Wikipedia's search results page?

Current size of database?

How much disk space is currently needed for the en.wikipedia.org DB when it's imported? Currently the SQL file is 52 GB, but when importing, the InnoDB database ends up exceeding 70 GB. Are there also any suggested MySQL settings for handling a DB this large?

  • The .sql file you download contains only the information contained in wikipedia. When you import this information into a MySQL InnoDB database a number of bits of extra information are calculated, normally to help the database keep track of the data and access it quickly. This extra data (indexes, page-alignments, padding) accounts for the difference you see.
  • MySQL can handle databases of Wikipedia's size (which are, in database terms, quite modest) with the default settings. If running complex or repetitive queries, you may want to adjust the innodb_buffer_pool_size variable in your my.ini file to about 2/3 of the physical memory on your PC, for example innodb_buffer_pool_size=640M on a PC with 1GB of RAM.
  • - TB 15:48, 2005 Feb 8 (UTC)
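
The rule of thumb above, worked through as a tiny calculation. The function names are invented for illustration; the 2/3 fraction is TB's suggestion, not a hard rule.

```python
def suggested_buffer_pool_mb(physical_ram_mb: int, fraction: float = 2 / 3) -> int:
    """Return roughly `fraction` of physical RAM, in whole megabytes."""
    return int(physical_ram_mb * fraction)


def config_line(physical_ram_mb: int) -> str:
    """Format the value as it would appear in my.ini."""
    return "innodb_buffer_pool_size=%dM" % suggested_buffer_pool_mb(physical_ram_mb)
```

For 1024 MB of RAM this gives 682M; the 640M in the example above is a slightly more conservative 62.5%.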

Wikinews database?

If someone gets to this, please include the different language wikinews databases in the DB dumps. -- Ilya 07:51, 23 Dec 2004 (UTC)

bunzip2 in WinXP

Using bunzip for WindowsXP I am unable to unzip the current DB for en. I have had the problem for around three months and was wondering if anyone else has the same issue, or knows a solution. I've tried other programs which handle bunzip files such as WinRAR and I get the same error. I also get an incorrect MD5 sum. The correct one should be "7a70559f2089155f441c322f6c565cc5" and mine is "1d423915d294592237f4450ded3b386b"  :


C:\Documents and Settings\*\Desktop>bunzip2 2004*

bunzip2: Caught a SIGSEGV or SIGBUS whilst decompressing, which probably indicates that the compressed data is corrupted. Input file = 20041023_cur_table.sql.bz2, output file = 20041023_cur_table.sql

It is possible that the compressed file(s) have become corrupted. You can use the -tvv option to test integrity of such files.

You can use the `bzip2recover' program to *attempt* to recover data from undamaged sections of corrupted files.

bunzip2: Deleting output file 20041023_cur_table.sql, if it exists.


Thanks. - Alterego @ 1:29AM on 10-29-04

Which version of bunzip2.exe are you using? Some versions (e.g. that at http://unxutils.sf.net last time I looked) can't handle files > 2GB. Mr. Jones 13:25, 5 Dec 2004 (UTC)
I am using "Version 0.1p12 29 Aug 1997". Admittedly a bit old lol. Do you know a better one? --Alterego 9:52 12/5/04
Well, that's not going to work, is it? :-) The latest version is 1.02. Try the link on the article page (press Ctrl+F and type "bzip2") Let us know how you get on. Mr. Jones 11:42, 8 Dec 2004 (UTC)
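
Since the MD5 sums above did not match, it may be worth re-checking the download itself before blaming bunzip2. A sketch of computing the sum in Python, reading in chunks so that files over 2GB are not a problem for memory; the function name is my own.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

If the result differs from the published sum, the file was damaged or truncated in transit and needs downloading again.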

Dumps disabled?

Is there some reason that the weekly dumps have not been updated in three weeks? Michael L. Kaufman 16:06, Jan 26, 2005 (UTC)

They are doing a database conversion I believe. It has been going on for quite some time. --Alterego 17:44, Jan 26, 2005 (UTC)
There are new dumps up now from 2005-02-09, but unfortunately they seem to exclude "de" and "en" (and possibly others that I didn't notice). So if you're looking for those languages, it looks like you're still out of luck. -- All the best, Nickj (t) 01:04, 11 Feb 2005 (UTC)
OK, I'm confused ... if you look here, there is a current en dump. It's not listed on the download page though. Does that mean that they should be listed and it's just a mistake that it's not, or does it mean that there's something wrong with that dump and that it has been excluded for a reason? The same thing applies to de. -- All the best, Nickj (t) 01:14, 11 Feb 2005 (UTC)


Are there plans to update downloads.wikimedia.org in 2005?

Hi, I've been wondering why the wiki DB dumps were updated only at the start of 2005 and have not been updated for 3 weeks since. Are they no longer supported? I need just a single (Lithuanian) cur table dump for some statistical analysis of Lithuanian articles, so it would be nice to know if and when there are plans to update. Thanks. Knutux 10:28, 2005 Feb 2 (UTC)

Is there any way to get an older Dump?

I need a copy of any dump before the current one (1/7/05). Preferably the one from just before it, but I will take whatever I can get. Any help would be appreciated. Thanks, Michael L. Kaufman 05:15, Feb 4, 2005 (UTC)

Have you tried rsync ([rsync://download2.wikimedia.org/dumps/])? I guess it may be possible to retrieve older revisions using it. Knutux 07:58, 2005 Feb 4 (UTC)
I'm not sure about that. The links with dates are actually just symlinks(sp) to download.wikimedia.org/archives<insert wiki>/<insert lang>/cur_table.sql.bz2 or the appropriate filenames. --Alterego 17:03, Feb 4, 2005 (UTC)

Problems importing the compressed data

I'm trying to import the OLD data of the Hebrew Wikipedia (he), but it seems that because of the compression there are some special chars in the text (such as - ' etc.) that confuse the SQL, which returns tons of errors. It seems to me that someone forgot to format the text (adding / before special chars and stuff) while exporting. (I imported the data a few months ago on the same system with no problem, and the CUR data imports without any problem.)

  • Has anyone else experienced the same problem?
  • Can anyone check whether it's just my local problem?
  • Any idea how to solve it?

thnx, Costello 19:44, 24 Feb 2005 (UTC).

  I had no problems doing just that. You can contact me directly if you require assistance. --LevMuchnik 16:52, May 20, 2005 (UTC)


imagelinks table

Any chance of adding the imagelinks table to the dump script, too? It's a quite small table, and since links, categorylinks, brokenlinks etc. are getting dumped already, imagelinks is the only one still missing from a complete dump set.

Yes, I know that refreshlinks.php can be used to generate imagelinks - but it takes an *extremely* long time to run and also tends to bomb out on my installation eventually.

I've also tried to rework the obsolete rebuildlinks.php script for imagelinks only, but couldn't get it to work for images included by templates etc. and finally had to give up on this. I need imagelinks to decide which images to include in a notebook installation - more precisely, to drop (or not to mirror) orphaned images which are not referenced by a NS_MAIN or NS_TALK article (a considerable percentage BTW). --DerHund 14:14, 27 Feb 2005 (UTC)

I've added this to the script, so it'll be in the next public dump. I'm expecting to start that within the next few hours. Any other table requests from anyone? Jamesday 13:21, 8 Mar 2005 (UTC)
Dump started. We may have to abort it or part of it if it kills the site excessively - still working on making it less disruptive. It's sharing the image server which is experiencing very high disk load. Expect it to take 12 or more hours to run. Jamesday 07:56, 9 Mar 2005 (UTC)
Suspended while processing de wikipedia because it was hurting the site too much. Will resume after peak time today. Jamesday 14:04, 9 Mar 2005 (UTC)

Thanks much for including the imagelinks table in the DB dump, you've been very helpful. --DerHund 22:32, 15 Mar 2005 (UTC)

Image tarball dumps

Production of new image tarball dumps has been temporarily suspended while we work on preventing the production of them from taking the whole site down. Will probably be back again within a few weeks. Jamesday 13:21, 8 Mar 2005 (UTC)

Obviously the use of "a few weeks" is a bit liberal here. :-) 74.166.95.223 01:46, 9 October 2007 (UTC)

No SQL Query

Whenever I try to import the sql dumps using phpmyadmin, it says No SQL Query

AxyJo 23:59, 17 Mar 2005 (UTC)

I may be wrong, having only used phpmyadmin a few times, but I don't believe you will be successful importing a large dump through it. --Alterego

How would I then import the dumps? 70.49.148.112 21:03, 19 Mar 2005 (UTC)

PhpmyAdmin

How can I import large dumps without phpmyadmin? 69.156.100.44 03:19, 20 Mar 2005 (UTC)

http://dev.mysql.com/doc/mysql/en/mysql.html --Brion 03:27, Mar 20, 2005 (UTC)


Actual size available for download differs from reported

Across 200 wikis

cur_table.sql.bz2:  1531747056 (exactly equal to reported)
       upload.tar: 42964259653
old_table.sql.bz2: 20287081657
                   ___________
total            : 64783088366 bytes

                         61782 actual megabytes
                         50503 reported megabytes
                         _____
                         11279 difference in megabytes

--Alterego 21:32, Mar 20, 2005 (UTC)
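
The figures above can be re-derived in a few lines, taking "megabyte" to mean 2**20 bytes (which is what makes the posted totals come out). The numbers are copied from the table; nothing else is assumed.

```python
# Byte counts as posted above.
sizes = {
    "cur_table.sql.bz2": 1531747056,
    "upload.tar": 42964259653,
    "old_table.sql.bz2": 20287081657,
}

total_bytes = sum(sizes.values())          # 64783088366
actual_mb = round(total_bytes / 2**20)     # 61782
reported_mb = 50503
difference_mb = actual_mb - reported_mb    # 11279
```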

The archive link (http://download.wikimedia.org/archives/en/) does not work from http://en.wikipedia.org/wiki/Wikipedia:Database_download. Is it a temporary problem? I'd like to access to the previous (March, 2005) MySQL dump file.

Archives were moved but the link was not updated. Fixed. It's http://download.wikimedia.org/wikipedia/en/ now. JRM · Talk 02:45, 2005 May 6 (UTC)

Compression Format Change and Size Issues

The concatenation of the dump files (English version of Wikipedia) has ended up as a file of around 32 GB. Apparently the compression format has changed, for bzip2 does not recognize the resulting file as a bz2 one, but gunzip is able to uncompress it (after naming the compressed file old_table.sql.gz). Can anyone officially confirm the change in the compression format? Moreover, the uncompressed file has a size of only 34,201,462 KB, which is not much bigger than the compressed file. Is that normal? Nonetheless, the resulting SQL file seems to be readable, for it is possible to import the 'old' table from it. But I don't know whether the file is complete or not, and whether the old table that I got is missing any records.

Does anyone have similar problems?

--Kevouze 14:44, Apr 25, 2005 (UTC)

As of time of this post, the dump files now have the extension .sql.gz. Does that mean that gzip should now be used to decompress them, and that the section on bzip2 is irrelevant? Tsointsoin 00:47, 30 July 2005 (UTC)


Database dumps old, image dumps gone

The database dump hasn't been updated in a month now (it says it's done twice a week on the meta-wiki).

Image dumps have been broken for 2 weeks' time now. Image dumps use some strange compression that apparently can only be uncompressed using the right version of the right set of programs on the right platform (in other words, anything but standard platforms).

Is there any effort to standardize these practices, or is it always going to be "Whenever someone gets around to it or feels like doing it?" Is there anything Joe User can do to help out??

-- Q2 03:52, 16 Jun 2005 (UTC)

Downloading Large Files on Linux

I'm running stock Redhat 9 and can't seem to get the 16 gig image dump to download. I've tried both wget and curl but they both appear to have a 2gig file limit.

Has anyone had any success getting these to download on linux? If so what did you use?

Thanks.

TheLoneCoder 02:30, 1 August 2005 (UTC)

Nevermind. I figured it out by using lynx -dump url | tar xv


Error "Duplicate entry '8-VfD-Q+ç?' for key 2" on import

I've downloaded and ungzipped the latest available dump at the time of this posting (20050623_cur_table.sql). Upon importing it using "mysql -p -u root wikipedia < 20050623_cur_table.sql", I got an error and the import was not completed correctly: ERROR 1062 (23000) at line 1488: Duplicate entry '8-VfD-Q+ç?' for key 2 According to "select count(*) from cur;", I only got 606,328 entries in the table. I was able to work around the problem by editing the text file 20050623_cur_table.sql by hand, and removing the "UNIQUE" in front of the key 'name_title_dup_prevention' (for info HexToolbox is the only text editor I found capable of editing such a large file). The import then gave me 1,811,554 entries. Am I the only one getting this error? Is there any better solution than this workaround? I am on Windows XP using MySQL 14.9 Distrib 5.0.3-beta Tsointsoin 17:09, 2 August 2005 (UTC)
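
For those without an editor that can open a file this size, the same workaround can be scripted by streaming the dump line by line. This is only a sketch: it assumes the key is declared on its own line beginning with UNIQUE KEY inside the CREATE TABLE statement, which I have not checked against the actual dump, and the function name is invented.

```python
def drop_unique_from_key(src_path, dest_path, key_name="name_title_dup_prevention"):
    """Copy a dump, turning 'UNIQUE KEY <key_name>' into a plain 'KEY'.

    Returns the number of lines changed (expected: 1).
    """
    needle = key_name.encode()
    changed = 0
    # Binary mode avoids any guesswork about the dump's character encoding.
    with open(src_path, "rb") as src, open(dest_path, "wb") as dest:
        for line in src:
            # Only touch the key definition line, never the INSERT data.
            if line.lstrip().startswith(b"UNIQUE KEY") and needle in line:
                line = line.replace(b"UNIQUE KEY", b"KEY", 1)
                changed += 1
            dest.write(line)
    return changed
```

Note that this only hides the duplicate rows from MySQL rather than explaining them, so it is the same workaround as the manual edit, not a better solution.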


XML import

There was a new dump of the en.wikipedia cur table this weekend, and I'm itching to get my hands on it. Unfortunately for me and my software, the dump is in the new XML format rather than an SQL query. Is there a tool, perhaps, for importing the XML dump into mysql? — brighterorange (talk) 01:50, 7 September 2005 (UTC)
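
Not an importer, but a starting point for anyone who wants to roll their own: a sketch that streams (title, text) pairs out of an XML dump without loading it all into memory. It assumes the dump uses page, title and text elements as in the Special:Export format, and it ignores whatever XML namespace is declared. The function names are mine, and the step of writing rows into MySQL is left out.

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip a leading '{namespace}' from an element tag, if present."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(source):
    """Yield (title, text) for each page element in an XML dump.

    `source` may be a file name or an open binary file object.
    """
    title, text = None, None
    for _event, elem in ET.iterparse(source, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, None
            # Free the finished page so memory use stays modest.
            elem.clear()
```

For a compressed dump, passing an object from bz2.open(path, "rb") or gzip.open(path, "rb") as `source` avoids unpacking the file to disk first.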


Wikipedia in DICT format

Moved from Wikipedia:Village pump on Thursday, July 10th, 02003.

I wonder if someone thought about making dict files of the Wikipedia. It would be cool to have the Wikipedia wherever I am, independent of an internet connection. (Okay, I still need my laptop for this...) dict seems a good way to achieve this. I'm willing to spend some time hacking a Python script that can create the dict files from the SQL stuff. But I'd like to know if other people are interested in this as well, or maybe there's someone who already did this job... :) --Guaka 22:38 5 Jul 2003 (UTC)

Doesn't Tombraider achieve this? CGS 22:40 5 Jul 2003 (UTC).

Dear Wikipedians! I enjoy my TomeRaider Wikipedia edition from December 2003 very much, and I dream of downloading a current version. At that time it was 180 MB with 180000 articles. Now there are 360000. Please !!!! Thousands of PDA friends will be grateful to you! The German Wikipedia for TomeRaider is available for download as of 1 September 2004, with 217 MB and 180000 articles. Vlad

You mean Tomeraider? No... First of all, tomeraider is shareware. And AFAICS it is totally not meant to convert the wikipedia into the dict format. Guaka 02:37 6 Jul 2003 (UTC)
Ha ha :) I know it's not meant to convert files to dict format, but it does what you want - view files on the go without a net connection. CGS 20:28 6 Jul 2003 (UTC)
Another thing is... Tomeraider is non-free software. This is already enough reason not to use it. But even if I wanted to, I couldn't because I run GNU/Linux. Guaka 16:06 7 Jul 2003 (UTC)
If it's the right tool for the job, swallow your pride and run it through Wine. CGS 22:15 7 Jul 2003 (UTC).
I guess you paid for the PDA hardware. So why is $20 for good software a no-go? I chose TomeRaider because it was the best option at the time (and it may still be, not sure). Some people write software for a living, and if they are good at it, I hope they continue doing so. Just because they earn some money they are not necessarily a second Bill. Not that I wouldn't prefer GNU software which is equally cross-platform, fast and economical with PDA space; it just doesn't seem a matter of higher principle to me. Erik Zachte 22:54, 18 Mar 2004 (UTC)

Erm, excuse me if I'm missing something, but wouldn't it be silly to view Wikipedia on non-free software after we go through so much trouble to make sure that the content is under the GFDL? If the content is free but the medium is not, then the company that produces it controls the content, albeit in an indirect fashion. The company could go out of business and render Tomeraider files useless, etc. At any rate, I would be interested in seeing a GPL'ed Python script that could accomplish this task, especially since I'm a beginning programmer and I'm interested in learning Python. And I'm a beginning Linux user who doesn't have a clue how to use Wine, fix problems with a program running in Wine, or anything particularly complex at all. --Nelson 23:41 8 Jul 2003 (UTC)

I fully agree with that Nelson. We just need to have a name now, so that we have a page for it. Or maybe this project would better fit on the Meta Wikipedia? Guaka 00:10 10 Jul 2003 (UTC)

Try the following Perl script for generating the Dict database. Change the DBI->connect to have the correct values for username and password instead of dbuser and dbpass.

#!/usr/bin/perl -w

use strict;
use DBI();

sub article2dict {
  my ($title, $text) = @_;

  $title =~ s/_/ /g;
  $text =~ s/\r//g;
  $text =~ s/^/  /mg;

  print "$title\n";
  print $text;
  print "\n\n";
}

# Connect to the database.
my $dbh = DBI->connect("DBI:mysql:database=wikipedia;host=localhost",
		       "dbuser", "dbpass",
		       {'RaiseError' => 1});

# Now retrieve data from the table.
my $sth = $dbh->prepare("SELECT cur_title, cur_text FROM cur " .
			"WHERE cur_namespace = 0 ORDER BY cur_title");
$sth->execute();
while (my $ref = $sth->fetchrow_hashref()) {
  article2dict($ref->{'cur_title'}, $ref->{'cur_text'});
}
$sth->finish();

# Disconnect from the database.
$dbh->disconnect();

wik2dict.py

I finally wrote something: wik2dict.py. It tries to create reasonably laid-out dict articles. It can also automatically fetch the database dumps. There are some requirements though. And currently it is only version 0.2. So beware.

I would appreciate it if someone (possibly someone at Wikimedia?) could run the script regularly and make the dict files available for everyone to download. Too bad they can't be included in Debian though ("GFDL is non-free"). However, the script itself could probably be included in contrib :) G-u-a-k-@ 17:50, 27 Jul 2004 (UTC)


Moved from Wikipedia:Village pump:


CVSup/Rsync, and why it would be useful

For those who would like to keep their local copy in sync, I would suggest setting up a CVSup server ( http://www.cvsup.org/ ). The snapshot can be made a number of times per day (say 4 times a day) and people can very efficiently synchronize to the latest version of the wikipedia. This is a lot faster and saves a lot of bandwidth compared to downloading a complete tarball every time. The server does not even have to run on the wikipedia server itself, but it seems the logical choice. CVSup is very efficient. If the wikipedia dumps can be tagged with a version number similar to RCS, the synching will probably be blazingly fast. -- Tim Hemel

I don't think that would work very well with the dumps. The bzip2-compressed versions are not going to be cleanly diffable, and if I leave them as text (~380 megabytes for English current revisions, a few gigabytes for old revisions; I'm reluctant to have them sitting around uncompressed), they're still not going to work that well in CVS if I understand its storage system correctly. Each line of the dump is an SQL INSERT statement for about 500 pages, and the slightest change to any of them (including cache invalidation timestamps) would cause the whole line to be sucked out and replaced. --Brion 18:33 25 May 2003 (UTC)
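
To illustrate the point about long lines, here is a toy measurement using Python's difflib as a stand-in for a line-based diff (this is not how CVS or CVSup store things internally, just the same line-at-a-time idea). The function name and the sample INSERT line are made up.

```python
import difflib


def added_bytes(old_dump: str, new_dump: str) -> int:
    """Count the characters on lines a line-based diff reports as added."""
    diff = difflib.unified_diff(
        old_dump.splitlines(), new_dump.splitlines(), lineterm="", n=0
    )
    return sum(
        len(line) - 1  # drop the leading '+' marker
        for line in diff
        if line.startswith("+") and not line.startswith("+++")
    )


# One INSERT statement covering 500 pages, then a one-character edit.
old_line = "INSERT INTO cur VALUES " + ",".join(
    "(%d,'Page_%d','text %d')" % (i, i, i) for i in range(500)
) + ";"
new_line = old_line.replace("'text 250'", "'text 250!'")
```

Here added_bytes(old_line, new_line) equals len(new_line): one extra character in one page costs the whole 500-page line.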
I'm not sure how applicable cvsup would be, but I think rsync is worth considering. Rsync doesn't use diff for computing deltas, so I don't think the "long lines with small changes" problem applies. As for rsyncing compressed files, there are techniques to do that, such as http://svana.org/kleptog/rgzip.html or http://lists.samba.org/archive/rsync/2002-October/004035.html (merely two results from a brief Google search, I'm sure more research on the topic would be fruitful). If people feel this would be worth pursuing, let me know. Neilc 11:30, 7 Aug 2004 (UTC)
gzip does have an --rsyncable option (see http://rsync.samba.org/ftp/unpacked/rsync/patches/gzip-rsyncable.diff and gzip --help). bzip2 doesn't seem to. See http://www.debianplanet.org/node.php?id=524 for a discussion of pros and cons (for w:Debian). I don't know how relevant the server load problem is for d.wp .

See also http://lists.debian.org/debian-devel/2001/10/msg02187.html About efficiency in part (reportedly): http://rsync.samba.org/tech_report/

More later.

Mr. Jones 05:07, 14 Aug 2004 (UTC)


Offsite backup of dumps and mailing lists

I have some questions regarding downloading the database dumps. On the page it says last dump made July 13. Does that mean what I think it means (i.e. if I download the English and non-English tarballs I only have revisions up to the 13th?). Also, as I understand it, I would only have to download the cur tarballs from here on in (if I saved the old ones), is this correct? I figure having an extra backup of the database can't hurt...especially after last night :). Addendum: should I also download the mailing list archives (from what I gather, they're separate from the dumps)? Geez, another question: is it safe to assume images are not included with the dumps? -- Notheruser 15:42 28 Jul 2003 (UTC)

Ok, I think I've found most of the answers to my above questions; I'll list them here in case anyone else was curious. The database hadn't been backed up since July 13 at the time, but, currently, it is now updated until August 1. You have to download the cur and old files to completely backup the English Wikipedia (don't forget about otherlanguages.tar for a full backup). The mailing lists are archived offsite, so they seem safe and images are currently not backed up (about 1GB worth of files). -- Notheruser 18:53, 2 Aug 2003 (UTC)

If one downloads the old database for English and then imports it into a MySQL database, one finds that it is larger than the 4.1 GB limit most operating systems put on the table size. Of course, I can go in and edit the SQL myself, but it would be better if this was done at the source. Would it be possible to have the dump write out more than one table, each less than 4.1 GB in size? -- RayKiddy 20:01, 13 Sep 2003 (UTC)

If your OS still has a 4 GB file limit, you really need a new OS. :) Multiple tables doesn't make any sense, as it wouldn't be usable. I'd recommend (well, I'd really recommend getting yourself a modern Linux or FreeBSD or something) creating the table as InnoDB and making sure your configuration is set up to use <4GB files for the innodb space (as it can use multiple files). --Brion

Incremental wikipedia updates?

from village pump

Once the full Wikipedia is downloaded, can smaller periodic updates covering new stuff and changes be obtained and used to synch the local? --Ted Clayton 04:26, 13 Sep 2003 (UTC)

No, you can't. I've been thinking the same thing myself. I think we need to:
  • Allow incremental updates for all types of download
  • Allow bulk image downloads
  • Package a stripped-down version of the old table in with the cur dumps, where the revision history (users, times, comments etc.) is included, but the old text itself is not
  • Develop a method of compressing the old table so that the similarity between adjacent revisions can be used to full advantage
-- Tim Starling 04:38, Sep 13, 2003 (UTC)
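
The last point, using the similarity between adjacent revisions, can be illustrated with a rough sketch. This is not how the dumps are actually produced: the sample revisions and the function names are made up for illustration, and zlib merely stands in for whatever compressor would really be used.

```python
import zlib

def separate_size(revisions):
    """Total bytes when every revision is compressed on its own."""
    return sum(len(zlib.compress(r.encode("utf-8"))) for r in revisions)

def concatenated_size(revisions):
    """Bytes when all revisions go through one compressed stream, so the
    compressor can refer back to text it already saw in earlier revisions."""
    joined = "\x00".join(revisions).encode("utf-8")
    return len(zlib.compress(joined))

# Hypothetical page history: each revision adds one sentence to the previous one.
revisions = []
text = "Wikipedia is a free encyclopedia that anyone can edit. "
for i in range(20):
    text += "Sentence number %d was added in this revision. " % i
    revisions.append(text)

print("compressed separately:", separate_size(revisions), "bytes")
print("compressed together:  ", concatenated_size(revisions), "bytes")
```

The gain only appears while the earlier revision is still inside the compressor's look-back window, so very long pages would need a different approach (for example storing diffs).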

Would it be easier to have incremental updates on something like a subscription basis? The server packages dailies or weeklies and shoots them out to everyone on the list? During off hours, mass-mail fashion?

Can you suggest sources or search-terms for table manipulations treatments, as background for stripping and compressing? --Ted Clayton 03:14, 14 Sep 2003 (UTC)

I'm going to continue this on wikitech-l, because it's very much on-topic there. See Wikipedia:Mailing lists for more information. -- Tim Starling 12:48, Sep 14, 2003 (UTC)
Also Wikitech-l thread on incremental backup
Wouldn't it be a great idea to provide split database dumps: one package with only articles, and one package with articles, talk, users and all? This would reduce the spreading of Wikipedia userpages to forks. — Sverdrup 08:24, 18 Mar 2004 (UTC)

BitTorrent

An idea is to use a distributed downloading system. In such a system, multiple computers with the client running would help each other download faster. I recommend BitTorrent, an open-source distributed downloading system. However, to be most effective, BitTorrent should be used on large files that are frequently downloaded. --Ixfd64 06:25, 2004 Aug 15 (UTC)

  • Are the dumps available as torrents? That would be both cool and beneficial, I think. Seriously consider.

Direct access to textual content of Wiki Pages via mySQL?

I am conducting research which utilizes the content of Wikipedia. I can access the content of my wiki database dump via a web browser and Apache, but this does not suit the nature of my work. I would like to just access the plain (or marked up) text of pages through SQL queries. How can I do this? I would imagine this requires two things: mapping a page title to an ID, and selecting the textual content associated with that ID. Can anyone please advise? —Preceding unsigned comment added by 128.238.35.108 (talk) 02:05, 7 December 2007 (UTC)
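
For what it's worth, here is a sketch of the two-step lookup described above. It assumes the MediaWiki 1.5-style schema, in which page.page_latest points at revision.rev_id and revision.rev_text_id points at text.old_id (older dumps keep everything in a single cur table instead). An in-memory SQLite database with invented rows stands in for MySQL so the example is self-contained; the same SELECT should carry over, but check the column names against your own import. The function names are made up for this example.

```python
import sqlite3

QUERY = """
SELECT t.old_text
FROM page AS p
JOIN revision AS r ON r.rev_id = p.page_latest
JOIN text AS t ON t.old_id = r.rev_text_id
WHERE p.page_namespace = ? AND p.page_title = ?
"""

def page_text(conn, title, namespace=0):
    """Return the current wikitext of a page, or None if there is no such page."""
    # Titles are stored with underscores instead of spaces.
    row = conn.execute(QUERY, (namespace, title.replace(" ", "_"))).fetchone()
    return row[0] if row else None

def demo_connection():
    """Build a tiny stand-in database (all rows are invented sample data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE page (page_id INTEGER, page_namespace INTEGER,
                           page_title TEXT, page_latest INTEGER);
        CREATE TABLE revision (rev_id INTEGER, rev_page INTEGER, rev_text_id INTEGER);
        CREATE TABLE text (old_id INTEGER, old_text TEXT);
        INSERT INTO page VALUES (1, 0, 'New_York', 11);
        INSERT INTO revision VALUES (10, 1, 100);
        INSERT INTO revision VALUES (11, 1, 101);
        INSERT INTO text VALUES (100, 'old wikitext');
        INSERT INTO text VALUES (101, 'current wikitext');
    """)
    return conn

conn = demo_connection()
print(page_text(conn, "New York"))
```

Depending on how the wiki was configured, old_text may be stored compressed (see the old_flags column), in which case it needs decompressing after the query.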


Unwieldy Images

Is the English Wikipedia image dump actually up to 75.5G, from 15G in June? Could this huge tar file be broken into 10G batches or something? 75G is quite a download.

Rsync is the best solution for downloading the "upload.tar" file with images (which is 75 GB in size). It allows resuming a download, and automatic error checking and fixing (separately for any small part of the file). And if you already have an old version of "upload.tar", it will download only the difference between them. But this "updating" feature of Rsync is useful only for TAR archives (because they only collect data and don't compress it), and useless for files compressed with gzip/bzip2/7-zip, because any small change in the data will cause large changes in almost the whole file.
For more info look here: http://en.wikipedia.org/wiki/Wikipedia:Database_download#Rsync
--Alexey Petrov 03:26, 9 April 2006 (UTC)
I thought that the images served by Wikipedia are already compressed, so their data is near random and thus gzip/bzip/7-zip/etc are useless on them? —Preceding unsigned comment added by 67.53.37.218 (talk) 03:21, 11 May 2008 (UTC)

Mailing list for notification of new dumps?

It would be great if there was a mailing list to notify interested users when there were new database dumps available. (I couldn't see such a mailing list currently, so I'm assuming it doesn't exist). Ideally the notification email would include HTTP links to the archive files, and would be generated after all the dumps had been completed (I've seen sometimes that the download page has an "intermediate" stage, where the old dumps are no longer listed, but the new ones have not been fully created). This mailing list would presumably be very low-volume, and would be especially useful as there doesn't seem to be an easily predictable timetable for dumps to be released (sometimes it's less than a week, often it's once every two weeks, and sometimes such as now it's up to 3 weeks between updated dumps), and because (for some applications) getting the most current Wikipedia dump possible is quite desirable. -- All the best, Nickj (t) 00:36, 28 Jan 2005 (UTC)

I agree something like this would be useful. I'll try to cook up an RSS feed for that page. --Alterego 04:14, Jan 28, 2005 (UTC)

Perhaps we can share Wiki through BitTorrent and view it on our iPod

Help: Namespace download

It seems to me that a lot of people download the database for the "Help:" namespace, so it would seem logical to provide a separate download just for this. This could be done via the perl script that is provided in this very article. It would: 1, save bandwidth; 2, save time; 3, make life easier for MediaWiki users :); 4, not be very hard.

Otherwise, would anyone be able to point me in the right direction of someone's own download of it?

I would like to second this comment. The idea of creating such a great tool and then forcing new administrators to have to download the entire database just to get the "Help:" pages is crazy Armistej

And one more "gotcha". You need the "Template:" namespace too. Otherwise a lot of your "Help:" pages are missing bits and pieces from them. Unfortunately, the "Template:" namespace contains templates from all parts of the MediaWiki, not just templates relating to "Help:". If I'm running an all-English MediaWiki on my Intranet, I couldn't give $0.02 about the Russian, Chinese, or other non-English templates which will never be referenced. We need some sort of script to find the "dangling" templates and zap them once and for all.

Exactly! I started a Help Desk query about this on 07/08/06, not having seen the above yet, nor having known to ask about the template. I'll repeat here that my motivation was that I do not currently have internet at home, so learning about wiki takes away from my short online opportunities. I bet others are in this boat.
I am appalled that there still appears to be no simple and straightforward way to download the "Help:" namespace and/or the "Template:" namespace. Either that, or it's too difficult to find. Come on, the comment just above is over a year old! Anyone have a simple point-and-click solution yet? --Wikitonic 18:40, 24 September 2007 (UTC)
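
As a starting point for the "find the dangling templates" script asked for above, here is a hedged sketch. It assumes you already have (title, wikitext) pairs from some dump parser; the function names and the sample pages are invented, and the regular expression is a simplification that ignores parser functions, nested templates and templates that call other templates.

```python
import re

# Captures the name in {{Name}} or {{Name|args}}. Names containing ':' or '#'
# (parser functions, cross-namespace transclusions) are deliberately skipped.
TEMPLATE_RE = re.compile(r"\{\{\s*([^{}|#:]+?)\s*(?:\||\}\})")

def help_pages_and_templates(pages):
    """Return the Help: titles and the set of Template: pages they reference."""
    help_titles = []
    needed = set()
    for title, text in pages:
        if title.startswith("Help:"):
            help_titles.append(title)
            for name in TEMPLATE_RE.findall(text):
                needed.add("Template:" + name.strip().replace("_", " "))
    return help_titles, needed

def dangling_templates(pages, needed):
    """Template pages in the dump that no Help: page refers to directly."""
    return sorted(t for t, _ in pages
                  if t.startswith("Template:") and t not in needed)

# Invented sample data standing in for a parsed dump.
pages = [
    ("Help:Editing", "Intro {{Note|be bold}} and {{Stub}}."),
    ("Template:Note", "..."),
    ("Template:Stub", "..."),
    ("Template:Unused", "..."),
    ("Main Page", "{{Unused}}"),
]
help_titles, needed = help_pages_and_templates(pages)
print(help_titles, sorted(needed), dangling_templates(pages, needed))
```

A real version would have to repeat the template scan on the templates it finds, since templates often include other templates.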

More frequent cur table dumps?

I think it would be valuable if you did more frequent dumps of the cur table. Full dumps take a lot of time and not everyone needs them. As I see it, one was started yesterday and has not finished yet. What do you think about making small dumps once a week and full dumps once a month?

Margospl 19:45, 10 Mar 2005 (UTC)

Saving user contributions

Is there any way to save all my user contributions? That is, when I click on "my contributions" from my user page, I want to have all the links to "diff" saved. I wonder if there is a faster way to do this than click "save as" on each link. Q0 03:52, 7 May 2005 (UTC)

Look into a recursive wget with a recursion level of one. --maru (talk) contribs 03:44, 19 May 2006 (UTC)

New XML database dump format

  • Thank you Beland for updating the page, and for making it clear that XML dumps are the way of the future.
  • Is there a pre-existing way that anyone knows of to load the XML file into MySQL without having to deal with MediaWiki? (What I and presumably most people want is to get the data into a database with minimum pain and as quickly as possible.)
  • Shouldn't this generate no errors?
xmllint 20050909_pages_current.xml

Currently for me it generates errors like this:

20050909_pages_current.xml:2771209: error: xmlParseCharRef: invalid xmlChar value 55296
[[got:&#xD800;&#xDF37;&#xD800;&#xDF3B;&#xD800;&#xDF30;&#xD800;&#xDF39;&#xD800;&
                                                                              ^
20050909_pages_current.xml:2771209: error: xmlParseCharRef: invalid xmlChar value 57158
&#xD800;&#xDF37;&#xD800;&#xDF3B;&#xD800;&#xDF30;&#xD800;&#xDF39;&#xD800;&#xDF46
                                                                              ^

-- All the best, Nickj (t) 08:36, 14 September 2005 (UTC)
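
The values in the error messages are telling: 55296 is 0xD800 and 57158 is 0xDF46, which are the two halves of UTF-16 surrogate pairs written out as separate character references. XML does not allow surrogates as character references, so xmllint is right to complain. One possible workaround, sketched below, is to merge each pair into a single reference before parsing. The function name is made up for this example, and it is an assumption that every offending reference in the dump follows this hexadecimal pattern.

```python
import re

# A high surrogate (D800-DBFF) immediately followed by a low surrogate
# (DC00-DFFF), both written as hexadecimal character references.
PAIR_RE = re.compile(
    r"&#x(D[89AB][0-9A-F]{2});&#x(D[C-F][0-9A-F]{2});", re.IGNORECASE)

def merge_surrogate_refs(xml_text):
    """Rewrite each surrogate-pair reference as one reference to the real code point."""
    def combine(match):
        high = int(match.group(1), 16)
        low = int(match.group(2), 16)
        code_point = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
        return "&#x%X;" % code_point
    return PAIR_RE.sub(combine, xml_text)

print(merge_surrogate_refs("[[got:&#xD800;&#xDF37;&#xD800;&#xDF3B;"))
```

The merged references land in the U+10330 range, which is the Gothic block, consistent with the "got:" interlanguage prefix in the offending line.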

Problem with importDump.php

I was trying to import the id-language Wikipedia, but importDump.php stopped at article 467 (out of a total of around 12000+). Can anybody help me with this problem? Borgx(talk) 07:53, 16 September 2005 (UTC)

How to import the files?

Is there any way to import the XML dump files successfully? importDump.php stops before completing, with no error. The SQL dump files are too old :( and xml2sql-java from Filzstift only imports the "cur" table (I need all tables for statistical needs. Thanks Wanchun, but I still need to see the tables). Borgx(talk) 00:44, 21 September 2005 (UTC)

  • I successfully imported using mwdumper, though it took all day. — brighterorange (talk) 04:20, 21 September 2005 (UTC)
  • I ran mwdumper on the 20051002_pages_articles.xml file using the following command:
 java -jar mwdumper.jar --format=sql:1.5 20051002_pages_articles.xml>20051002_pages_articles.sql

and received the following error concerning an invalid XML format in the download file:

1,000 pages (313.283/sec), 1,000 revs (313.283/sec)
...
230,999 pages (732.876/sec), 231,000 revs (732.88/sec)
231,000 pages (732.88/sec), 231,000 revs (732.88/sec)
Exception in thread "main" java.io.IOException: org.xml.sax.SAXParseException: XML document structures must start and end within the same entity.
        at org.mediawiki.importer.XmlDumpReader.readDump(Unknown Source)
        at org.mediawiki.dumper.Dumper.main(Unknown Source)

ECHOOooo... (talk) 9:22, 8 October 2005 (UTC)

Disable keys

It might be worth commenting that importing the data from the new SQL dumps will be quicker if the keys and indices are disabled when importing the SQL. --Salix alba (talk) 23:22, 30 January 2006 (UTC)

Arrgh. Gad, it's slow. I'm trying to import the new SQL format dumps into a local copy of MySQL. I've removed the key definitions, which sped things up, but it's still taking an age. I'm trying to import enwiki-20060125-pagelinks.sql.gz on quite a new machine and so far it's taken over a day and only got to L. Does anyone have hints on how to speed this process up? --Salix alba (talk) 09:49, 1 February 2006 (UTC)

Check out meta:Talk:Data_dumps#HOWTO_quickly_import_pagelinks.sql.--Bkkbrad 15:51, 22 February 2007 (UTC)
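
As a rough illustration of the "disable keys" idea (not a substitute for the HOWTO linked above), the sketch below wraps a stream of INSERT statements in MySQL settings that defer index maintenance. It assumes the table already exists and that the input contains only INSERT lines; the function name is made up. ALTER TABLE ... DISABLE KEYS only affects non-unique indexes on MyISAM tables, while the two SET statements mainly help InnoDB.

```python
def wrap_for_bulk_load(table, sql_lines):
    """Yield the dump's INSERT statements surrounded by settings that defer index work."""
    yield "SET unique_checks=0;"
    yield "SET foreign_key_checks=0;"
    yield "ALTER TABLE `%s` DISABLE KEYS;" % table
    for line in sql_lines:
        yield line
    # Indexes are rebuilt in one pass here instead of row by row.
    yield "ALTER TABLE `%s` ENABLE KEYS;" % table
    yield "SET foreign_key_checks=1;"
    yield "SET unique_checks=1;"

# Invented sample row; a real run would read lines from the decompressed dump.
sample = ["INSERT INTO `pagelinks` VALUES (1,0,'Example');"]
for statement in wrap_for_bulk_load("pagelinks", sample):
    print(statement)
```

The output can be piped into the mysql client in place of the raw dump.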


Trademark violation?

Can someone please explain the "trademark violation" to me? How exactly is the wikipedia's GFDL content rendered through the GPL MediaWiki software a trademark violation, suitable only for "private viewing in an intranet or desktop installation"? -- All the best, Nickj (t) 22:50, 1 February 2006 (UTC)

Take a look at sites like http://www.lookitup.co.za/n/e/t/Netherlands.html (a mirror violating our copyright terms) that have used the static dumps. Many lack the working link back to the article and GFDL, etc. The wording itself is from http://static.wikipedia.org/. Cheers. -- WB 00:47, 2 February 2006 (UTC)
But http://static.wikipedia.org/ does not explain why it is a trademark violation, and neither do you. Can you please explain why it is a trademark violation? Is it because of the Wikipedia logo? If so, MediaWiki does not show the Wikipedia logo by default (certainly the article-only dumps don't), hence no violation. -- All the best, Nickj (t) 06:02, 2 February 2006 (UTC)
Perhaps trademark violation is not the proper term. It is my understanding that without the proper link back to the original article the terms of the GFDL have not been met; because of this, without a working hyperlink to the original location, no permission is granted to use the subject material (the static dumps, in this case), which would make most uses illegal in most of the world. Triddle 07:06, 2 February 2006 (UTC)
Yeah, I didn't write the website. It should be called "copyrights violation" instead. What I found was that most static mirror/forks lack proper licensing. (so do other mirrors/forks though) -- WB 07:16, 2 February 2006 (UTC)
I wasn't aware of the linking back to the original requirement; live and learn! I take it this would be section 4. J. of the GFDL license text, listing the requirements to distribute modified copies? But what if it's not modified? (e.g. it uses an unmodified dump of the data, which presumably is what most static mirrors will do). In that situation, why would the requirement to link back to the previous versions still apply? -- All the best, Nickj (t) 01:09, 3 February 2006 (UTC)
As far as I know, you need a live link back to the original article in Wikipedia, attribution to Wikipedia, mention of the GFDL license, and a link to some copy of GFDL as well. It derives from section 2 of GFDL:
You may copy and distribute the Document in any medium, either commercially or noncommercially, provided that this License, the copyright notices, and the license notice saying this License applies to the Document are reproduced in all copies, and that you add no other conditions whatsoever to those of this License. You may not use technical measures to obstruct or control the reading or further copying of the copies you make or distribute. However, you may accept compensation in exchange for copies. If you distribute a large enough number of copies you must also follow the conditions in section 3.
As well as Wikipedia's license:
Wikipedia content can be copied, modified, and redistributed so long as the new version grants the same freedoms to others and acknowledges the authors of the Wikipedia article used (a direct link back to the article satisfies our author credit requirement).
I hope that helped. I hope information can be added in DB download page somehow. If you have time, take a look at WP:MF. Cheers. -- WB 04:48, 3 February 2006 (UTC)
I will do! Thank you for clarifying. -- All the best, Nickj (t) 05:58, 3 February 2006 (UTC)
First of all, it's quite difficult to make a Verbatim Copy. Some would say it's almost impossible. More importantly, Wikipedia believes the link satisfies the history requirement for Verbatim Copies; this is important because creating a Verbatim Copy requires copying the history section along with the page. Superm401 - Talk 04:43, 4 February 2006 (UTC)

How do you retrieve previously deleted files?

Maybe it says somewhere, but I just don't see it.Gil the Grinch 16:09, 21 February 2006 (UTC)

  • You will not be able to retrieve deleted files from the dumps. If your contribution was deleted for some reason, an admin may be able to retrieve it; however, it is not guaranteed. Cheers! -- WB 07:35, 22 February 2006 (UTC)

Latest dumps

There were several links to the latest versions of dumps in this article, e.g.: http://download.wikimedia.org/enwiki/latest/

But if somebody uses them, it may be dangerous, because many of the latest dumps (especially English) are broken. And worst of all, there is no information about the completeness of each file at these links, just a list of files which looks absolutely normal.
For example, link http://download.wikimedia.org/enwiki/20060402/ shows warnings about all broken files - I wonder why link http://download.wikimedia.org/enwiki/latest/ doesn't look the same way.

So I changed those links to the last complete dumps (they should be updated manually). E.g. the last complete dump for enwiki is http://download.wikimedia.org/enwiki/20060219/ --Alexey Petrov 03:59, 9 April 2006 (UTC)

I added the latest versions link. As the dumps are now happening on a regular basis, about weekly, a link to the last complete dump would need to be updated on a weekly basis. This makes updating this page a considerable maintenance task, and the page is often out of date, so latest is likely to be more current than the specific date on this page. As for which are broken, it generally seems to be only the one with complete history. The recommended version pages-articles.xml.bz2 (This contains current versions of article content, and is the archive most mirror sites will probably want.) is considerably more up to date (2006-03-26) than 2006-02-19.
Maybe the best thing is just to point people to http://download.wikimedia.org/enwiki/ and let the user browse from there. --Salix alba (talk) 15:06, 9 April 2006 (UTC)
Yes, that seems to be the best solution. I have changed links. --Alexey Petrov 14:36, 11 April 2006 (UTC)

Opening the content of a dump

I have downloaded and decompressed a dump into an XML file. Now what? I made the mistake of trying to open that and it froze up my computer it was so huge! What program do I open it with in order to view the Wikipedia content? What next?! J@red  01:08, 31 May 2006 (UTC)

Yes, I've had the same problem; I never managed to successfully get a dump imported into SQL on my machine. It all depends on what you want to do. Wikipedia:Computer help desk/ParseMediaWikiDump describes a perl library which can do some operations. I've avoided SQL entirely; instead I've created my own set of perl scripts which extract linking information and write it out to text files. I can then use the standard unix grep/sed/sort/uniq etc. tools to gather certain statistics; for example, see User:Pfafrich/Blahtex en.wikipedia fixup, which illustrates the use of grep to pull out all the mathematical equations. --Salix alba (talk) 06:50, 31 May 2006 (UTC)
Well I understand that there are several ways to parse it, but is there any feasible way of easily opening up information from a dump in a program like IE or FireFox for viewing, just like I'm viewing this page now? J@red  19:28, 1 June 2006 (UTC)
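
On the "it froze up my computer" part: the file does not have to be opened all at once. Below is a minimal sketch of reading a dump page by page with Python's standard library, so memory use stays small. It only extracts titles and wikitext; it does not render pages the way a browser would, which still needs MediaWiki or a similar tool. The function names and the file path are placeholders.

```python
import bz2
import xml.etree.ElementTree as ET

def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]

def iter_pages(stream):
    """Yield (title, wikitext) for each <page> without loading the whole file."""
    root = None
    title, text = None, None
    for event, elem in ET.iterparse(stream, events=("start", "end")):
        if root is None:
            root = elem  # the first start event is the document root
        if event != "end":
            continue
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, None
            root.clear()  # drop pages already handled so memory stays flat

def iter_dump(path):
    """Stream pages from a .bz2 dump; the path is whatever file you downloaded."""
    with bz2.open(path, "rb") as stream:
        for page in iter_pages(stream):
            yield page
```

For example, `for title, text in iter_dump("pages-articles.xml.bz2"): print(title)` would list every title without decompressing the file to disk first. With a full-history dump, the text yielded is that of the last revision listed for each page.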

Simple steps for wikipedia while you travel

I travel a lot and sometimes like to look up stuff on my laptop. I see that I can do it, but like a lot of people I am clueless about computers. I was hoping that someone would make a section with easy-to-follow steps to make this work; I figure that I'm not the only one who would like this. I see the size of the wiki with pictures is 90G, wow, a little too big hehe, but how hard would it be to just get pictures for, say, featured articles and the rest text only? Britannica and Encarta are both 3-4 gigs. If it is text only, that would be fine. Ansolin 05:59, 14 June 2006 (UTC)

That's basically what I want, too. J@red  19:39, 15 June 2006 (UTC)
If you're running GNU/Linux you can give wik2dict a try. DICT files are compressed hypertext. wik2dict needs an upgrade though. I will work on that very soon. I have also written MaemoDict, to have Wikipedia on my Nokia 770, though I don't have any space on there to put a big Wikipedia on it (so one of the things I will add to wik2dict is to just convert some categories, instead of the whole thing).
There might also be some software to support dict stuff in Windows. Guaka 20:28, 15 June 2006 (UTC)

Did not follow most of that. I use Windows, btw. I thought text only was only about a gig, which should fit, but as I said I have no clue about computers :). Ansolin 04:41, 16 June 2006 (UTC)

See, the problem I have is that I've downloaded the wiki from the downloads page and I've decompressed it, but it's just a big gargantuan xml file that you can't read. How would I now, offline, open/dump that information in a wiki of my own? Do I need mediawiki? This is all so confusing and it seems that nowhere is a proper explanation! J@red  12:36, 16 June 2006 (UTC)

How to import image dumps into a local Wiki installation?

I've imported the text dump file into my local wiki using importDump.php. And I've downloaded the huge tar image dump, and put the extracted files (directory intact) under the upload path specified in LocalSettings.php. But my wiki installation doesn't seem to recognize them. What do I need to do? I think it has something to do with RebuildLinks or RebuildImages or something else. But I want to know the specific procedures. Thanks! ----Jean Xu 12:26, 2 August 2006 (UTC)

Using XML dumps directly

People seem to be having lots of issues with trying to get this working correctly without any detailed instructions, myself included. As an alternative, there is WikiFilter which is a far simpler tool to setup and get working. I am not affiliated with WikiFilter, I have just used the app and found it quite successful. WizardFusion 08:59, 20 September 2006 (UTC)

Possibility of using Bittorrent to free up bandwidth

Is it possible for someone to create an official torrent file from the Wikipedia .bz2 dumps? This would encourage leeching to some degree, but it would make it easier for people to download the large files for offline use (I would like Wikipedia as a maths/science reference for offline use when travelling). 129.78.64.106 04:54, 22 August 2006 (UTC)

Why is this all so complicated?

  • Not everybody who wants to have access to Wikipedia offline is a computer systems expert or a computer programmer.
  • What is with this bzip crap, why not use winzip. I've spent so long trying to figure out this bzip thing, that I could have downloaded 10 winzip files already
  • Even if i get this to work, can I get a html end product that I browse offline as if I was online?

195.46.228.21 12 September 2006. Agree, Wikipedia, a very useful site, has failed the user-friendly concept of offline browsing.

  1. Granted.
  2. The best reason not to use WinZip is probably that WinZip is not free. On the other hand, bzip2 is. It is also a better compressor than WinZip 8, although WinZip 10 does compress more. [1] And because the source code of bzip2 is freely available, if someone has a computer that cannot decompress bzip2 files, it is much easier to rectify this situation than in the proprietary case.
  3. It sounds like you want a Static HTML tree dump for mirroring.
  4. Remember: just the articles are about 1.5GB compressed (so maybe around 7-8GB uncompressed?). Current versions of articles, talk pages, user pages, and such are about 2.5GB compressed (so I'd estimate about 13GB uncompressed); Images (e.g. pictures) will run you another 75.5GB. If you also want the old revisions (and who wouldn't?) that's about 45.9GB (I guess around 238GB uncompressed).

The Storm Surfer 04:08, 7 October 2006 (UTC)

Try 7-Zip for a compression utility that supports many formats. Superm401 - Talk 09:00, 3 February 2007 (UTC)

Possible Solution

I have written a simple page for getting the basics working on Ubuntu Linux (server edition). This worked for me, but there are issues with special pages and templates not showing. If anyone can help with this it would be great. It's located at http://meta.wikimedia.org/wiki/Help:Running_MediaWiki_on_Ubuntu. WizardFusion 23:31, 1 October 2006 (UTC)

old images

Hello, why are the image dumps from last year? Would it be possible to get the small thumbnails as an extra file? It should be possible to separate the fair-use images from the dump. de:Benutzer:Kolossos 09:14, 12 October 2006 (UTC)

Making A DVD

How could I convert a Wikisource/Wikibooks/Wikipedia XML dump to HTML? I'm using Windows XP. —The preceding unsigned comment was added by Uiop (talkcontribs) 12:03, 24 December 2006 (UTC).

You would have to download the Static Wikipedia, but it's something like 60GB in all--including Talk files, User pages, User talk pages, Help and Wikipedia namespaces. I suppose you could write a script to automatically delete those files, but even after that it would be something like 30GB.
A DVD today is limited to 4.7GB. You'll have to compress it beyond recognition.Thegeorgebush
What about Blu-ray Disc (50 GB)? That would work if one could only figure out how to get the damn static HTML pages --Sebastian.Dietrich 21:13, 5 August 2007 (UTC)

Update

When is Wikipedia going to be dumped again? Salad Days 06:35, 30 December 2006 (UTC)

It seems to be about monthly. Usually on the lower side. Reedy Boy 13:12, 30 December 2006 (UTC)
Where are the new dumps? The new place for dumps has also been orphaned for more than a week. We at Wikipedia-World could really use it. Thanks. Kolossos 17:54, 23 January 2007 (UTC)
The latest dump is also over a day beyond its estimated finish time. Does this mean it failed? Is there anyone who can comment on the actual status? —The preceding unsigned comment was added by 216.84.45.194 (talk) 21:50, 1 February 2007 (UTC).
Looks like it did fail but a new one was started since Flamesplash 18:15, 10 February 2007 (UTC)

Info on gzip --rsyncable vs. bzip

The Rsync section seems to say that rsync will work at least as well with bzip2 as with gzip --rsyncable. I'm not sure this is true. bzip2 does compress using blocks, but (1) the bzip2 manpage says blocks are at least 100KB long, and (2) the blocks are on even 100KB boundaries, so any change in the length of one article will affect the rest of the archive. gzip --rsyncable seems to ensure that less than 100KB of compressed output after a change is affected (I'm basing that on the RSYNC_WIN constant of 4096 in the gzip patch; I figure that's just the order of magnitude). More importantly, gzip --rsyncable uses a rolling checksum to decide where block boundaries go, in such a way that inserting or deleting bytes doesn't affect the location of all future block boundaries. It's clever, and kind of out of my league to explain clearly.

If CPU time on the wikimedia servers isn't an issue, my guess is that rsync -z on an uncompressed file is the lowest-bandwidth way to do incremental transfers, because changes are more localized than under other approaches and the network stream is still gzip-compressed (possibly at a lower compression level than gzip uses on files). Would require more disk space, too.

Of course, really it comes down to 1) what differences between these approaches actually turn out to be in testing, and 2) whether there's demand for less bandwidth-heavy updates given the costs (or other things like how busy the wiki sysadmins are). I'm not up to date on what discussions have already happened. —The preceding unsigned comment was added by 67.180.140.96 (talk) 00:32, 14 January 2007 (UTC).
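
To make the rolling-checksum idea a bit more concrete, here is a simplified illustration. It is not the code from the gzip patch: the window and modulus are small made-up values, and only the boundary selection is shown, not the compression. A boundary is placed wherever the sum of the last few bytes hits a chosen value, so the decision depends only on nearby content.

```python
import random

def boundaries(data, window=16, modulus=64):
    """Return the offsets at which a new block would start.

    A boundary is declared wherever the sum of the last `window` bytes is a
    multiple of `modulus`, so each decision depends only on nearby content.
    """
    cuts = []
    rolling = 0
    for i, byte in enumerate(data):
        rolling += byte
        if i >= window:
            rolling -= data[i - window]  # drop the byte that left the window
        if i >= window - 1 and rolling % modulus == 0:
            cuts.append(i + 1)
    return cuts

def survives_insertion(data, pos, inserted, window=16, modulus=64):
    """Check that boundaries away from the edit keep their place in the content."""
    before = boundaries(data, window, modulus)
    after = boundaries(data[:pos] + inserted + data[pos:], window, modulus)
    shift = len(inserted)
    same_prefix = ([c for c in before if c <= pos] ==
                   [c for c in after if c <= pos])
    same_suffix = ([c + shift for c in before if c >= pos + window] ==
                   [c for c in after if c >= pos + shift + window])
    return same_prefix and same_suffix

# Pseudo-random bytes stand in for dump content.
rng = random.Random(0)
data = bytes(rng.randrange(256) for _ in range(4000))
print(survives_insertion(data, 1000, b"a few new bytes"))
```

With fixed-size blocks, by contrast, every block after the insertion point would contain different bytes, which is the concern raised about bzip2 above.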

Followup on rsync/incremental transfer

For whatever it's worth, diffing XML dumps using rsync --only-write-batch -B300 and compressing the result with bzip2 seems to produce a file that's substantially smaller than the monthly dump. (This is based on testing with ruwiki-20061108-pages-articles.xml and ruwiki-20061207-pages-articles.xml: rsync-batch.bz2 was 16.6 MB while the 20061207 dump was 79.9 MB bzipped.) Producing dumps with only new and changed articles (and tools to process them) might also be useful. Again, wikifolks may not have the time or the need for either. Bigger gains may come from using smaller/larger rsync block sizes. 67.180.140.96 03:35, 14 January 2007 (UTC)

1-24-2007 dump frozen

The 1-24-2007 dump looks to be stalled/broken. It's several days past its ETA, and in the past it has only taken a day or two. Can someone comment on the actual status, or on whether I should be notifying someone else through a different mechanism so that this can be restarted or the next dump begun? Flamesplash 17:02, 5 February 2007 (UTC)

Image dumps getting old

The image dumps are over a year old now (last modified 2005-Nov-27), is there any plan to update them? Bryan Derksen 10:23, 15 February 2007 (UTC)

What are you talking about...? Look Here 2007-02-07... Reedy Boy 10:57, 15 February 2007 (UTC)
There are only database dumps for the various SQL tables at that URL. Image file dumps are located at http://download.wikimedia.org/images/wikipedia/en/ and were last updated in November 2005, as Bryan wrote. I'd also be interested in a newer version.--134.130.4.46 23:54, 22 February 2007 (UTC)
The image metadata database does get dumped so if all else failed I suppose one could rig up a script to download the matching images directly from en.wikipedia.org. I imagine that would take longer and put more of a load on the servers than an image dump would, but if that's the only place the images are available then that's all that I can think of doing. Bryan Derksen 01:26, 1 March 2007 (UTC)
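
A sketch of the first half of such a script, working out where a given image name would live. It assumes MediaWiki's hashed upload directory layout, where the path is built from the first one and two hex digits of the MD5 of the file name (with underscores for spaces). The function names are made up, and the base URL is a parameter because the actual host and path are assumptions to be checked against real image links. Any real downloader should throttle itself heavily, given the server-load concern above.

```python
import hashlib
from urllib.parse import quote

def hashed_upload_path(filename):
    """Relative path of an uploaded file under the hashed directory layout."""
    name = filename.replace(" ", "_")
    name = name[:1].upper() + name[1:]  # the first letter is capitalised
    digest = hashlib.md5(name.encode("utf-8")).hexdigest()
    return "%s/%s/%s" % (digest[0], digest[:2], name)

def image_url(base, filename):
    """Join a base URL (an assumption, supply your own) with the hashed path."""
    return base.rstrip("/") + "/" + quote(hashed_upload_path(filename))

# Hypothetical base URL and file name, for illustration only.
print(image_url("http://example.org/images/", "Example image.jpg"))
```

The file names themselves would come from the img_name column of the image table dump.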

Link to HTML dumps not updated

Not sure where to put this; I kept looking for a mail address for some tech staff. Anyway, according to the logfile on http://download.wikipedia.org, the December dump is in progress, and the link points to the November one. However the URL http://static.wikipedia.org/downloads/December_2006 works perfectly, so I guess all that needs doing is updating the link? (http://houshuang.org/blog - pre-alpha tool to view html-dumps without unzipping) Houshuang 06:13, 21 February 2007 (UTC)

Wikisign.org alternative

Wikisign seems to be down. Is there an alternative? MahangaTalk 02:53, 12 April 2007 (UTC)

How do HTML dumps work?

The page says, that beginning with V1.5 there are routines to dump a wiki to html. How does this work? How can I use this on my own mediawiki? --Sebastian Dietrich 10:32, 19 May 2007 (UTC)

Image Dumps Stolen

Why are all image dumps gone??? Thousands of users provided this information with the intent that it be freely available, but now ONLY the Wikipedia site can provide this information in a drip format (HTML). This looks a lot like what happened to the CDDB album database. It was collected as free info, but now it's been stolen and is handed out piecemeal so only Wikipedia can provide the info, everyone else must beg. Of course if you actually try to get all the images via a spider you will be banned. This is quite a corruption of how most people (including myself) who contributed to Wikipedia envisioned the information being used. —Preceding unsigned comment added by 63.100.100.5 (talk) 20:19, 31 May 2007

It's probably not as helpful as you'd hope to fling around conspiracy accusations. I don't understand what you mean by saying that the image dumps are "gone" or "stolen"; there are full database dumps available at download.wikimedia.org/enwiki, just as the page says. The current dump, started on May 27, is still in progress, though the dump containing the articles, templates, image descriptions and some metadata pages is complete. What are you looking for that you can't find? grendel|khan 06:12, 1 June 2007 (UTC)
About the only thing you won't be able to get your hands on would be the private user data. Reedy Boy 07:59, 1 June 2007 (UTC)
Maybe he's referring to the claim which is also in the article, in the section Currently Wikipedia does not allow or provide facilities to download all Images. I doubt that information is true, but the article currently says image dumps are and are not available. (SEWilco 13:27, 1 June 2007 (UTC))
No, as I stated before, the images are all gone. Please look at the URL you sent: download.wikimedia.org/enwiki, there are NO images present. If you can find even one image, please state how you found it. All images are gone, all images are stolen. Maybe not a conspiracy, but certainly a very disappointing "change in operating policy", corporate speak for we are *blanking* you and you can't do anything about it.
Still no word on image dump file availability. --66.74.75.39 01:51, 25 July 2007 (UTC)
Still no word on image dumps as of December of 2007. The notice of "Check back mid-2007" has obviously been removed. What gives? Dchristle (talk) 21:43, 15 December 2007 (UTC)