User:Dcljr/Milestones


A long time ago (January 2005, to be more precise), I stumbled upon Wikipedia:Milestone statistics, and, well, it needed work. The colors were quite garish and not consistent with the milestone levels, which themselves were not consistently chosen. The information was pretty up to date, but the table just didn't look very good. So I (eventually) changed the colors, standardized the milestone levels, and started helping to keep the information up to date.

For a while I used User:Davidcannon's page of interwiki links, but later I switched to use Wikipedia:Multilingual monthly statistics (both pages are no longer maintained). Eventually I started using m:List of Wikipedias, which has been more or less regularly updated for the past 4 years—especially since m:User:Mutante wrote a script to help automate the process. I built a simple spreadsheet based on these stats, which I would collect twice a month, to help track which Wikipedia languages would most likely be in need of promoting in (or adding to) the milestone table each month. I did that for over 5 years.

Unfortunately, while Mutante's script was run regularly, its output wasn't always pasted into m:List of Wikipedias/Table often enough to determine what day a given language passed a given milestone, so finally (in October 2009) I wrote my own Perl script to collect and track daily article counts for all the Wikipedias.

Actually, since February 2011, I've been tracking almost all the stats collected for almost all the Wikimedia wikis (although I certainly don't have all that information stored—see below for a brief description of how I track the stats). I'm using that information to keep m:Wikimedia News up to date with milestone announcements (not just article counts) and to maintain the milestone tables on that page.

Starting in April 2011, I tried out a new method of predicting milestone-table promotions (in the same Perl script that collects the stats). On August 1, 2011, I finally analyzed how well both methods (new script and old spreadsheet) were predicting table promotions.

The results are as follows. Counts are the number of languages predicted to achieve their next milestone within the month, for April–July 2011, classified as correct (the milestone was actually achieved that month) or incorrect. Percentages are out of row totals (not shown) for each method separately.

Standard method
prediction                     correct     incorrect
will reach                     7 (100%)    0 (0%)
will almost certainly reach    2 (67%)     1 (33%)
will probably reach            1 (17%)     5 (83%)
might reach                    4 (33%)     8 (67%)
might possibly reach           5 (23%)     17 (77%)

New method
prediction                     correct     incorrect
will reach                     8 (80%)     2 (20%)
will most likely reach         5 (36%)     9 (64%)
will probably reach            3 (50%)     3 (50%)
might reach                    2 (14%)     12 (86%)

Some more information about these methods:

The "standard" method is the spreadsheet-based one I started using in July 2006, based on article counts downloaded twice a month from s23.org. The predictions were based on the number of times the past rates of article growth exceeded the rate of growth required to achieve the next milestone within the month. If only 1 of the last 5 semi-monthly periods has shown sufficient growth, then the language "might possibly" reach the milestone that month; if 2 of 5, it "might" reach it; 3 of 5, "will probably"; 4 of 5, "will almost certainly"; and all 5, "will".
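As a rough illustration of the counting rule just described, here is a minimal Python sketch (the original was a spreadsheet, not code). The function name, the input layout, and in particular the way the "required" rate is computed (the remaining gap split evenly over the two semi-monthly periods of the coming month, with "sufficient" meaning meets or exceeds) are assumptions made for the example, not details taken from the spreadsheet.

```python
# Labels for the spreadsheet-based ("standard") method, indexed by how many of
# the last 5 semi-monthly periods showed sufficient growth (0 = no prediction).
STANDARD_LABELS = [
    None,
    "might possibly reach",
    "might reach",
    "will probably reach",
    "will almost certainly reach",
    "will reach",
]


def standard_prediction(counts, milestone):
    """Classify a language from its 6 most recent semi-monthly article counts.

    `counts` is ordered oldest to newest, which gives 5 growth periods.
    Assumption: the required growth per period is the remaining gap to the
    milestone spread over the two semi-monthly periods of the coming month.
    """
    if len(counts) != 6:
        raise ValueError("need exactly 6 counts (5 semi-monthly periods)")
    required = (milestone - counts[-1]) / 2
    # Growth in each period is the difference between consecutive counts.
    growth = [later - earlier for earlier, later in zip(counts, counts[1:])]
    # Assumption: "sufficient" means the period's growth meets or exceeds
    # the required growth.
    hits = sum(1 for g in growth if g >= required)
    return STANDARD_LABELS[hits]
```

For example, with made-up counts of 9000, 9050, 9300, 9350, 9700, and 9760 and a milestone of 10,000, only two of the five periods grew by the required 120 articles, so the sketch returns "might reach".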

The "new" method is similar, but more complicated because it's based on the daily article counts my Perl script gets directly through the MediaWiki API. The counts are pushed through what I call a "binary queue" (something I made up for this project, but it probably exists under some other name). Each article count that enters the queue pushes an older count out of the queue (not surprisingly), but not necessarily the oldest one; instead, the one that's dropped is based on the number of counts that have entered the queue so far. It's done in such a way that when item number 2^N (that's 2 to the Nth power) enters the queue (not counting the very first item, which is considered the "zeroth" item), all the other items in the queue have "ages" that are also powers of two. For example, on the 64th "day" (actually, just the 64th running of the script after the queue was "initialized" with the zeroth item), the counts in the queue are 0, 1, 2, 4, 8, 16, 32, and 64 "days" old. This is only true on "days" that are powers of two; on "day" 106, say, the items happen to be 0, 1, 2, 6, 10, 26, 42, and 106 "days" old. In any case, this allows the queue to be weighted toward recent activity and yet still contain very old article counts as well. There are only 8 spots in the queue, so items over 127 "days" old are pushed out of the queue permanently.

The nitty-gritty details (you can safely skip this part): When the Nth item is added to the queue, the item that gets pushed out is based on the "length" of the result of bitwise XOR-ing the binary representations of N and N−1. For example, because 71 and 72 in binary are:

71 = 1000111
72 = 1001000

XOR-ing these values gives:

71 XOR 72 = 0001111

which means that when the 72nd item is added to the queue, it "pushes" only the first 4 elements of the queue forward (because of the 4 ones in the XOR result), and the 5th element gets dropped. To illustrate this visually, if the queue looked like this:

71, 70, 68, 64, 56, 48, 32, 0

where the values displayed are the ordinal values of the items added to the queue (remember we begin with the "zeroth" item, not the 1st), then after the 72nd item was added to the left end of the queue, it would look like this:

72, 71, 70, 68, 64, 48, 32, 0

Items 71, 70, 68, and 64 were advanced toward the right end of the queue; item 56 wasn't advanced and so dropped out of the queue; the rest of the items (48, 32, and 0) were not touched. Note that the values shown here are not the "ages" of the items in the queue, as discussed above. The "ages" of the items are obtained by subtracting each item's ordinal value from that of the item most recently added to the queue (72, in the case of the last queue shown above):

0, 1, 2, 4, 8, 24, 40, 72
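The push rule can be sketched in a few lines of Python (the real script is in Perl and is not reproduced here). The function names are made up for this illustration, and the way the queue fills up before it holds 8 items is an assumption: it starts as just the zeroth item and grows until full.

```python
QUEUE_SIZE = 8  # the text says there are only 8 spots in the queue


def push(queue, n):
    """Return the queue after the n-th item (n >= 1) is added at the left end.

    `queue` lists item ordinals, newest first. The number of one-bits in
    n XOR (n - 1) says how many leading items are advanced one spot; the
    item just behind them is dropped and everything further right is untouched.
    """
    shift = (n ^ (n - 1)).bit_length()  # e.g. 72 ^ 71 = 0b1111 -> 4
    # When shift is at least the current length, no item is skipped; the final
    # slice caps the queue at QUEUE_SIZE, which drops the oldest item once
    # the queue is full.
    return ([n] + queue[:shift] + queue[shift + 1:])[:QUEUE_SIZE]


def queue_after(n_runs):
    """Build the queue from the "zeroth" item through item n_runs."""
    queue = [0]  # assumption: the queue starts with only the zeroth item
    for n in range(1, n_runs + 1):
        queue = push(queue, n)
    return queue


def ages(queue):
    """Age of each item: newest ordinal minus the item's own ordinal."""
    return [queue[0] - item for item in queue]
```

Under these assumptions, queue_after(71) and queue_after(72) reproduce the two queues shown above, and ages(queue_after(106)) gives the ages quoted earlier for "day" 106.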

Anyway, on the first of each month (and now I mean literally the first day of the month), the growth rates over these various time periods (from the present back however many days) were calculated, ignoring any period based on a count less than 8 days old, and finally predictions were made much like the other method (if only 1 time period showed sufficient growth, then the language "might" achieve the milestone, etc.).
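A hedged Python sketch of that monthly step follows. The text above only spells out the "1 period means might" case; the rest of the label mapping is inferred from the order of the rows in the results table, and the definition of "sufficient growth" (average daily growth over the period would close the gap within the month) is an assumption. All names are hypothetical.

```python
# Assumed mapping from number of qualifying periods to label; only the
# "1 -> might" case is stated in the text, the rest follows the table order.
NEW_LABELS = {
    1: "might reach",
    2: "will probably reach",
    3: "will most likely reach",
    4: "will reach",
}
MIN_AGE = 8  # periods based on counts under 8 "days" old are ignored


def new_prediction(samples, milestone, days_in_month):
    """Classify a language from (age_in_days, article_count) queue samples.

    `samples` must include the current count as the age-0 entry.
    Assumption: a period shows "sufficient growth" when its average daily
    growth would close the gap to the milestone within `days_in_month` days.
    """
    current = dict(samples)[0]
    required = (milestone - current) / days_in_month
    hits = sum(
        1
        for age, count in samples
        if age >= MIN_AGE and (current - count) / age >= required
    )
    # Returns None when no period qualifies (or more than 4 do).
    return NEW_LABELS.get(hits)
```

With an 8-spot queue, at most four samples are 8 or more "days" old, which would fit the four prediction levels listed for the new method in the table.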

Unfortunately, now that I'm collecting daily stats, there's really no need to keep making the monthly predictions. After all, it's not like any milestones are going to go unnoticed anymore, which was the only reason I started doing the "predictions" in the first place. So, I've stopped making these monthly predictions and am now just concentrating on keeping both Wikipedia:Milestone statistics and m:Wikimedia News (announcements and tables) up to date.

I just wanted this to be documented somewhere, and Wikipedia talk:Milestone statistics didn't seem to be the place for this much detail.