Wikipedia:WikiProject Wikidemia/Quant/StatsProduction


Initially we intended to break the statistics to be drawn from the data dumps into three categories:

  • continuously updateable stats
  • stochastically updateable stats, and
  • one-shot stats

For the time being, all listed statistics can theoretically be produced from diff feeds (i.e. continuously) unless otherwise noted. As we explore the problem of producing these statistics in a continuous manner further, we will reevaluate and update this categorization. For now, categorization is by data source.

Data Sources

Data dumps

The Wikimedia data dumps contain full logs of changes to site content in XML format. We can transform this data into Stata/SPSS/SAS/R-readable content by parsing a set of headers from the XML dump.
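As a rough illustration of that transformation (a sketch only; the element names follow the MediaWiki export schema, and the exact field list is an assumption to be checked against the dump version in use), the following streams a full-history XML dump and writes one tab-separated row per revision:

  # Sketch: stream a MediaWiki XML dump and emit one TSV row per revision
  # (title, revision id, timestamp, editor, text length). Verify element
  # names against the export schema version of the dump actually used.
  import sys
  import xml.etree.ElementTree as ET

  def localname(tag):
      """Strip the XML namespace so tags match regardless of schema version."""
      return tag.rsplit('}', 1)[-1]

  def dump_to_tsv(xml_file, out=sys.stdout):
      title = None
      for event, elem in ET.iterparse(xml_file, events=('end',)):
          tag = localname(elem.tag)
          if tag == 'title':
              title = elem.text
          elif tag == 'revision':
              rev_id = timestamp = editor = ''
              length = 0
              for child in elem:
                  name = localname(child.tag)
                  if name == 'id':
                      rev_id = child.text or ''
                  elif name == 'timestamp':
                      timestamp = child.text or ''
                  elif name == 'contributor':
                      # username for registered editors, ip for anonymous edits
                      parts = {localname(c.tag): (c.text or '') for c in child}
                      editor = parts.get('username') or parts.get('ip', '')
                  elif name == 'text':
                      length = len(child.text or '')
              out.write('\t'.join([title or '', rev_id, timestamp, editor, str(length)]) + '\n')
              elem.clear()  # free memory as we stream

  if __name__ == '__main__':
      dump_to_tsv(sys.argv[1])

The resulting table can be loaded directly into R, Stata, SPSS, or SAS.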

In turn, this data can be transformed along various dimensions. Given a database of revisions, we can produce an equivalent one ordered by editor, page, time, length of edit, type of edit, etc. Various metrics can be produced directly from this output:

  • Counts along basic dimensions (edits/revisions, articles, editors, time, units of content, categories, pages): Raw data is split into these fields. Once the proper transformations have been completed, it is relatively straightforward to extract data from the headers of the data dump, enabling analyses of the form "X per Y per Z" (for example, edits per editor over time), as in the sketch below.
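Once the revisions are in tabular form, reordering and aggregating along those dimensions is straightforward in any statistical environment. A hypothetical example in Python/pandas (the file name and column names simply match the sketch above):

  # Sketch: load the per-revision TSV and pivot it along several dimensions.
  import pandas as pd

  cols = ['title', 'rev_id', 'timestamp', 'editor', 'length']
  revs = pd.read_csv('revisions.tsv', sep='\t', names=cols, parse_dates=['timestamp'])

  edits_per_editor = revs.groupby('editor').size()   # edits per editor
  edits_per_page = revs.groupby('title').size()      # edits per article
  edits_per_month = revs.groupby(revs['timestamp'].dt.to_period('M')).size()  # edits over time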

Other measures and databases may require more specific analysis to produce:

  • Measures of content: Some metrics are based primarily on Wikipedia's content. One might measure the flow of phrases between connected/similar articles, or the flow of phrases/grammatical structures among editors. We could use modern statistical techniques to measure the entropy of content in various articles (looking at repetition, repeated use of simile, variations in vocabulary, etc.); a rough entropy sketch appears after this list.
  • Graph-theoretic measures: Wikipedia can be described in a variety of ways using graph-theoretic descriptions of page-to-page links and implicit social networks among editors. That said, many of these links are automatically created by non-human editors, so we must be careful when inferring things about users from link creation. Additionally, one might infer loosely defined social networks by parsing discussion threads on talk pages and by looking at common edit histories among editors; a co-editing sketch also follows this list.
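As one very rough notion of "content entropy" (an assumed working definition, not an agreed one), the sketch below computes the Shannon entropy of the word distribution in a piece of article text; heavier repetition of a small vocabulary gives lower values:

  # Sketch: Shannon entropy of an article's word distribution, a crude proxy
  # for "content entropy". The tokenization and the choice of words as the
  # unit are assumptions; n-grams or grammatical features would need more work.
  import math
  import re
  from collections import Counter

  def word_entropy(text):
      words = re.findall(r"[a-z']+", text.lower())
      if not words:
          return 0.0
      counts = Counter(words)
      total = len(words)
      return -sum((c / total) * math.log2(c / total) for c in counts.values())

  print(word_entropy("the cat sat on the mat"))   # more varied vocabulary, higher entropy
  print(word_entropy("the the the the the the"))  # pure repetition, zero entropy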
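For the graph-theoretic side, a minimal sketch of one implicit editor network: link two editors whenever they have edited the same page (networkx and the column names from the earlier TSV sketch are assumptions, and the pairing is deliberately naive):

  # Sketch: build a co-editing graph (editors who touched the same page) and
  # compute one simple centrality measure. Quadratic in editors per page, so
  # only suitable for small samples as written.
  from itertools import combinations
  import networkx as nx
  import pandas as pd

  cols = ['title', 'rev_id', 'timestamp', 'editor', 'length']
  revs = pd.read_csv('revisions.tsv', sep='\t', names=cols)

  g = nx.Graph()
  for _, editors in revs.groupby('title')['editor']:
      for a, b in combinations(sorted(set(editors)), 2):
          g.add_edge(a, b)

  centrality = nx.degree_centrality(g)
  print(sorted(centrality.items(), key=lambda kv: -kv[1])[:10])  # most-connected editors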

Diff feeds

In most aforementioned cases (perhaps not some of the content measures) we can establish techniques for incrementally updating the resultant datasets, so that repetitive processing can be avoided and scarce resources saved.

You'll need to reparse all diffs on some/most algorithm updates. Unrelated: old diffs can be out of date for at least two reasons: article deletions, and article renames. Also there are legal reasons why some revisions need to be deleted instead of hidden (e.g. private data added by another person). Would the diffs be cleaned then as well? (come to think of it, older online xml dumps are not cleaned on revision delete in the live database) Erik Zachte 22:33, 20 June 2006 (UTC)
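A minimal sketch of what incremental updating could look like in practice (the state file and the feed format, an iterable of (rev_id, editor) pairs, are assumptions):

  # Sketch: extend a per-editor edit-count table from a feed of new revisions
  # instead of recomputing it from a full dump each time.
  import json
  import os

  STATE_FILE = 'editor_counts.json'  # hypothetical location for accumulated state

  def load_state():
      if os.path.exists(STATE_FILE):
          with open(STATE_FILE) as f:
              return json.load(f)
      return {'last_rev_id': 0, 'edits_per_editor': {}}

  def apply_feed(state, feed):
      for rev_id, editor in feed:
          if rev_id <= state['last_rev_id']:
              continue  # already counted in a previous run
          counts = state['edits_per_editor']
          counts[editor] = counts.get(editor, 0) + 1
          state['last_rev_id'] = max(state['last_rev_id'], rev_id)
      return state

  def save_state(state):
      with open(STATE_FILE, 'w') as f:
          json.dump(state, f)

As the comment above notes, such accumulated state would still have to be rebuilt from scratch whenever the parsing algorithm changes or revisions are deleted.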

Random subsets

Methods for extracting a random subset from a given dataset will aid development. (TODO: describe)
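One candidate method (a sketch, not a settled choice): reservoir sampling draws a uniform random subset of k records in a single pass, which suits dumps and feeds whose total size is not known in advance.

  # Sketch: single-pass uniform sampling ("Algorithm R") of k items from a
  # stream of records (e.g. revision rows) of unknown length.
  import random

  def reservoir_sample(stream, k, seed=None):
      rng = random.Random(seed)
      sample = []
      for i, item in enumerate(stream):
          if i < k:
              sample.append(item)
          else:
              j = rng.randint(0, i)  # new item is kept with probability k / (i + 1)
              if j < k:
                  sample[j] = item
      return sample

  # Example: 5 random lines from the revision table built earlier.
  with open('revisions.tsv') as f:
      print(reservoir_sample(f, 5, seed=42))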

Readership data

We are interested in using anonymized logs from the Wikimedia Squid servers. At present it appears that a reasonable degree of anonymity can be achieved by collecting a random subset of the full access logs and anonymizing each entry to conceal the identity of the person accessing the content. See Quant:Readership Stats.
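A minimal sketch of that idea, assuming a whitespace-delimited Squid-style access log with the client address in the first field (the field position, sampling rate, and salted hashing below are all assumptions, not a vetted anonymization scheme):

  # Sketch: keep a random fraction of access-log lines and replace the client
  # address with a salted hash. Illustrative only; a real scheme needs
  # privacy review before any logs are collected or shared.
  import hashlib
  import random
  import sys

  SAMPLE_RATE = 0.01         # keep roughly 1% of lines (assumed rate)
  SALT = b'rotate-me-often'  # hypothetical secret salt

  def anonymize(line):
      fields = line.rstrip('\n').split()
      if fields:  # assume the client address is the first field
          fields[0] = hashlib.sha256(SALT + fields[0].encode()).hexdigest()[:16]
      return ' '.join(fields)

  for line in sys.stdin:
      if random.random() < SAMPLE_RATE:
          print(anonymize(line))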

Policy changes

Understanding the effects of changes in site policy is important both for specifying data correctly (some questions do not make sense during certain time periods because of policy changes) and for understanding the effects of policy on community behavior.

We would like to establish a database of policy changes. Are there any good sources? Such a database might also include information about automated changes to Wikipedia and the behavior of administrators.

Specific Statistics and Questions

Please add statistical metrics, signing if you would like to be contacted for further explanation.

  • Edits per editor per age of editor: Could be used to gauge the health of the site. Are older editors continuing to contribute at a high rate? Could be used to look at the life cycle of editors; could be augmented with information about changes in the types of editing/articles edited over time. A recent thread on Wikitech-l prompted some exploration of this topic.[2] A rough sketch of this metric follows this list. --Erik Garrison 22:11, 20 June 2006 (UTC)
  • Content entropy: Could be used to augment an analysis of the editor life cycle. Do people "improve" as editors over time? Can we find groups of editors who behave similarly as their number of edits increases? As their time with the site increases? Do certain articles or category areas have different content signatures (entropy, grammar, etc.)? --Erik Garrison 22:11, 20 June 2006 (UTC)
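A rough sketch of the first metric above, edits per editor per age of editor, building on the pandas frame from the data-dump section (column names are carried over from that sketch, and editor "age" here is approximated as time since the editor's first observed edit):

  # Sketch: average edits per editor, binned by editor "age" in months since
  # the editor's first observed edit in the revision table.
  import pandas as pd

  cols = ['title', 'rev_id', 'timestamp', 'editor', 'length']
  revs = pd.read_csv('revisions.tsv', sep='\t', names=cols, parse_dates=['timestamp'])

  first_edit = revs.groupby('editor')['timestamp'].transform('min')
  age_months = (revs['timestamp'] - first_edit).dt.days // 30

  edits_by_age = (revs.assign(age=age_months)
                      .groupby(['editor', 'age']).size()   # edits per editor per age bin
                      .groupby('age').mean())              # averaged across editors
  print(edits_by_age.head())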