User:Mr.Z-man/pageview


It was observed that total pageviews appear to be down from ~6 months ago.

One possible reason given was that HTTPS pageviews are not counted. After doing some tests of my own, I'm not convinced that this is the case.

The tests were done using a program I'm developing that parses the raw pagecounts files, does some basic title normalization, then puts the data into a MySQL table.
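For readers unfamiliar with the raw files, here is a minimal sketch of the parsing and normalization step. It is not the program described above. It assumes the four-field, space-separated layout of the hourly pagecounts files (project, title, request count, bytes), and the names `normalize_title` and `parse_pagecounts` and the sample lines are made up for illustration. Namespace detection and the MySQL insert are left out.

```python
from collections import Counter
from urllib.parse import unquote


def normalize_title(raw_title):
    """Percent-decode a title, use underscores for spaces, capitalize the first letter."""
    title = unquote(raw_title).replace(" ", "_")
    return title[:1].upper() + title[1:]


def parse_pagecounts(lines, project="en"):
    """Sum request counts per normalized title for a single project."""
    hits = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) != 4 or parts[0] != project:
            continue  # malformed line, or a different project
        try:
            count = int(parts[2])
        except ValueError:
            continue  # count field is not a number
        hits[normalize_title(parts[1])] += count
    return hits


# Hypothetical sample lines, not taken from a real file.
sample = [
    "en Life_(1984_film) 3 30000",
    "en Life_%281984_film%29 2 20000",  # same page, percent-encoded
    "de Life_(1984_film) 7 70000",      # other project, ignored
]
print(parse_pagecounts(sample)["Life_(1984_film)"])
```

Merging the encoded and unencoded spellings of a title is the point of the normalization: without it, hits for one page are split across several rows.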

Test 1

I chose 3 articles that generally get extremely few pageviews and visited each one 30 times: the first while logged in with HTTPS, the second while logged out with HTTP, and the third while logged in with HTTPS disabled. In all cases, far more hits were registered than random views would explain, but fewer than the 30 visits I actually made:

MariaDB [p50380g50816__pop_temp]> SELECT * FROM pop_temp WHERE date='test' AND ns=0 AND title='Life_(1984_film)';
+----+------------------+------+------+
| ns | title            | hits | date |
+----+------------------+------+------+
|  0 | Life_(1984_film) |   22 | test |
+----+------------------+------+------+
1 row in set (0.00 sec)

MariaDB [p50380g50816__pop_temp]> SELECT * FROM pop_temp WHERE date='test' AND ns=0 AND title='James_Hare_(boxer)';
+----+--------------------+------+------+
| ns | title              | hits | date |
+----+--------------------+------+------+
|  0 | James_Hare_(boxer) |   20 | test |
+----+--------------------+------+------+
1 row in set (0.00 sec)

MariaDB [p50380g50816__pop_temp]> SELECT * FROM pop_temp WHERE date='test' AND ns=0 AND title='Reese_Public_Schools';
+----+----------------------+------+------+
| ns | title                | hits | date |
+----+----------------------+------+------+
|  0 | Reese_Public_Schools |   24 | test |
+----+----------------------+------+------+
1 row in set (0.00 sec)

Test 2

For this test, I used a non-existent page with a title that would never get any views normally and visited it 30 times logged in with HTTPS:

MariaDB [p50380g50816__pop_temp]> SELECT hits FROM pop_temp WHERE date='test' AND ns=0 AND title='ThisIsATestToSeeIfPageViewsAreLoggedRight';
+------+
| hits |
+------+
|   21 |
+------+
1 row in set (0.00 sec)

I also checked the views in the adjacent hourly files in case some were logged earlier/later, and both were 0.

Test 3

This was basically a repeat of test 2, later the same day:

MariaDB [p50380g50816__pop_temp]> SELECT hits FROM pop_temp WHERE date='test' AND ns=0 AND title='ThisIsAnotherPageViewTest';
+------+
| hits |
+------+
|   27 |
+------+
1 row in set (1.35 sec)

There seem to be substantially fewer missing hits here. I didn't have time to do another test, so it may be a coincidence, but this test was done much later in the day (03:30 UTC, 10:30 PM EST) when general load was likely lower. Perhaps the problem is excessive packet loss in the logging program during periods of high load?
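To put the first three tests on a common scale, the shortfall can be expressed as a percentage of the 30 visits made to each page. The sketch below uses only the hit counts from the query results above. The helper name `percent_missing` is an illustration, and for Test 1 the result is a lower bound at best, since a few stray views from other readers could have inflated the measured count.

```python
VISITS = 30  # each test page was loaded 30 times

# Hit counts copied from the query results for tests 1 to 3.
measured = {
    "Test 1: Life_(1984_film)": 22,
    "Test 1: James_Hare_(boxer)": 20,
    "Test 1: Reese_Public_Schools": 24,
    "Test 2": 21,
    "Test 3": 27,
}


def percent_missing(actual, counted):
    """Share of real views that never showed up in the counts, in percent."""
    return (actual - counted) / actual * 100


shortfall = {
    name: round(percent_missing(VISITS, hits), 1)
    for name, hits in measured.items()
}
for name, pct in shortfall.items():
    print(f"{name}: {pct}% missing")
```

This gives 20% to 33.3% missing for the Test 1 and Test 2 pages and 10% for the late-night Test 3.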

Test 4

To test my hypothesis from test 3, I wrote a quick script to load a specific nonexistent page every 10 seconds, ~360 times per hour (due to the finite time of actually doing the request, it may be slightly less, though in tests it only took a few ms). Then I modified the program that extracts the pageviews (which is already set to run every hour as part of another test) to log views to that specific page to a file. The results are shown graphically below:
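A script of this kind could look roughly like the sketch below. This is not the script actually used. The URL, the User-Agent string and the function names are placeholders, and it assumes that a 404 response for a nonexistent page still counts as a logged request (which tests 2 and 3 suggest). Because it sleeps a full interval after each request, one cycle takes slightly longer than 10 seconds, which is why fewer than 360 requests fit into an hour.

```python
import time
import urllib.error
import urllib.request

# Placeholder values, not the ones used in the test.
TEST_URL = "https://en.wikipedia.org/wiki/SomeNonexistentTestPage"
USER_AGENT = "pageview-test-sketch/0.1 (hypothetical contact address)"


def fetch_once(url):
    """Request the page once and discard the response body."""
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    try:
        with urllib.request.urlopen(request, timeout=30) as response:
            response.read()
    except urllib.error.HTTPError:
        pass  # a nonexistent page answers 404; the request was still made


def poll(url, interval=10.0, count=360, fetch=fetch_once, sleep=time.sleep):
    """Request url `count` times, sleeping `interval` seconds after each request."""
    made = 0
    for _ in range(count):
        fetch(url)
        made += 1
        sleep(interval)
    return made


if __name__ == "__main__":
    # One hour's worth of requests at the default settings.
    print(poll(TEST_URL), "requests made")
```

Passing `fetch` and `sleep` as parameters is only there so the loop can be exercised without network access or real delays.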

The "actual" values are the number of pageviews the script really made in each hour, which is a fairly constant 354-355. The "measured" values are the numbers from the raw pagecounts files. "Percent missing" is (actual-measured)/actual*100. "1-min load average" is the total WMF grid load from [1]. The results seem to confirm my earlier hypothesis. The minimum on each day is generally around 06:00 – 08:00 UTC, which corresponds to late night in North America and early morning in western Europe, the time of lowest activity on Wikimedia sites. The average percentage of missing views for the 3 days that have full data (11/21 – 11/23) is 14.78%. The minimum was 5.9% (11/23 08:00) and the maximum was 29.9% (11/21 19:00).

Other observations

As HTTPS is enabled by default for logged-in users, one would expect pageviews on pages visited only by logged-in users to drop to near zero. But this is not the case. From Oct 2012 to Oct 2013, views on Special:Watchlist dropped, but only from 2,353,585 to 1,887,523, a drop of ~20%, comparable to the 20-30% drops in my tests.

One would also expect virtually no change or a much smaller change on pages viewed mostly by logged-out users if HTTPS was the problem. But from Oct 2012 to Oct 2013, views on Portal:Arts (linked from the Main Page) dropped from 77,120 to 56,469, a drop of almost 27%.
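The two year-over-year drops quoted above can be checked directly from the monthly totals. The function name `percent_drop` is just for illustration.

```python
def percent_drop(before, after):
    """Relative decrease from `before` to `after`, in percent."""
    return (before - after) / before * 100


# Monthly view totals quoted above, Oct 2012 vs. Oct 2013.
watchlist = percent_drop(2_353_585, 1_887_523)
portal_arts = percent_drop(77_120, 56_469)

print(f"Special:Watchlist: {watchlist:.1f}% drop")  # 19.8% drop
print(f"Portal:Arts: {portal_arts:.1f}% drop")      # 26.8% drop
```

Both figures fall within the range of losses seen in the tests, regardless of whether the page is visited mostly by logged-in or logged-out users.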