User talk:West.andrew.g/Popular redlinks/Archive 1

This is an archive of past discussions. Do not edit the contents of this page. If you wish to start a new discussion or revive an old one, please do so on the current talk page.

An idea for later

Andrew. Thanks so much for this list. I'm seeing opportunities already. What if there was also a daily list that included anything that got over 100 hits? =) I know you're busy, but I think that would be pretty cool, whenever you have time. It could help prompt page creation more within page demand peaks. Biosthmors (talk) 19:20, 16 February 2013 (UTC)

I'll quickly note this is non-trivial. Preparing the current 1000+ list takes on the order of 30+ minutes of continuously querying the WMF servers; having to ask about *every* page that gets 1000+ views and its "redlink" status. Given what we know about the long-tail distribution of page views, I am assuming the number of pages that receive between 100 and 1000 hits per week is humongous -- and the query time/load would be huge. Thanks, West.andrew.g (talk) 19:33, 16 February 2013 (UTC)

Perhaps query time/load wouldn't be as huge as initially thought, since it would be daily instead of weekly (and 1000 ÷ 7 ≈ 143), so it would catch many more articles, but maybe we could cut it off at 120 or 140 instead of 100? Biosthmors (talk) 19:50, 16 February 2013 (UTC)

Double mention?

It appears "Tha Joker" is #10 and #85. Biosthmors (talk) 02:41, 25 February 2013 (UTC)

If you look at the Wikitext for #85, you'll notice that it is [[Tha Joker ]], i.e., there is an explicit trailing space. Whatever automated mechanism is asking for this page is doing so with that space. This is likely to result in an auto-redirect in the title normalization stage (something I cannot speak to the subtleties of, but maybe it's done by the parser?) -- but this is where the distinction comes from. Thanks, West.andrew.g (talk) 03:45, 25 February 2013 (UTC)

Is the popular redlinks page some evidence that semi-popular black rap artists are falling afoul of Wikipedia:Systemic bias? We have Tha Joker (#10 right now), Young Scooter (#14), and Rich Homie Quan (#15). However, none of these are getting a lot of mainstream news coverage either, so Wikipedia may be reflecting that fact as well. But Rich Homie Quan has 159,000 "views" in the last 90 days[1]; it's consistently being requested. Tha Joker has 58,000[2]. There are Brooklyn indie bands getting far fewer views that will survive AfDs, e.g., [3].--Milowent • ^hasspoken 06:59, 25 February 2013 (UTC)

Or could this be the work of a popular rap website that "pulls" its content from Wikipedia in a live/on-demand fashion? Perhaps whenever you visit an "artist" page on that website, it pings the Wikipedia API to get their "biography" which they reproduce for the user surrounded with some community-specific content. While these artists may be "trending" on that particular site, they have no articles here. Maybe these aren't really people directly and purposefully trying to query Wikipedia for this data. We could imagine a similar circumstance for all the guitar related articles. Thanks, West.andrew.g (talk) 07:11, 25 February 2013 (UTC)

Ah, that's a good possibility too. Does Google do the same thing (ping the Wikipedia API) when you google a person, and you get some of their wikipedia article text on the right?--Milowent • ^hasspoken 07:23, 25 February 2013 (UTC)

Evidence suggests that Google has a cache for this purpose (i.e., a change won't be *immediately* reflected). West.andrew.g (talk) 07:26, 25 February 2013 (UTC)

Maybe the systemic bias is present in the sources. Indie bands in Brooklyn might have easier ways to demonstrate notability (through established and accepted media) rather than upcoming rap artists. That said, I guess there's always the potential for people on the internet to try and get others to create an article through requesting its existence via page views. Biosthmors (talk) 20:08, 25 February 2013 (UTC)

I do think there is bias in sourcing. I started one on Rich Homie Quan and found his YouTube views are pretty high; numerous videos have views in the 100,000-400,000 range (unofficial uploads at that). The news coverage is still a bit thin in comparison to his social media clout. This sort of systemic bias used to exist, and probably still does exist, for popular YouTubers. Ray William Johnson, for example, had been deleted every which way for quite a while (see the amusing article milestone list on its talk page) until I started a version with sufficient work done to trump any AfD. The WP:TOPRED data may prove to be an interesting tool to track such cases.--Milowent • ^hasspoken 22:47, 25 February 2013 (UTC)

Great new addition! Thousands of people will now have an article to read instead of just being disappointed. =) Biosthmors (talk) 23:21, 25 February 2013 (UTC)

Craziest entries

The internet is disturbing! -- List of Male Superheroes Who Have Been Raped (#97, 2885 requests).--Milowent • ^hasspoken 16:01, 1 March 2013 (UTC)
- Odd indeed. I'd like to think this is just due to one person's actions though. =) Biosthmors (talk) 20:08, 1 March 2013 (UTC)

Will someone think of the children!!

The latest TOPRED list has two high listings where articles were recently deleted at AfD. Madison Ivy is #13 this week. She is a pornstar whose article regularly got over 2500 views a day. It was deleted by the anti-fap crowd at Wikipedia:Articles for deletion/Madison Ivy (2nd nomination), not a real robust discussion, and not really a consensus to delete that I can see. I don't doubt there aren't many mainstream news articles on her. And MattyBraps is #22, deleted via "protect the children" crowd at Wikipedia:Articles for deletion/MattyBraps, although this child is a marketing machine trying to get famous and apparently succeeding to some extent.--Milowent • ^hasspoken 02:29, 13 March 2013 (UTC)

Should typo redlinks be fixed?

Currently one of the redlinks is for David Sharp (mountaineer, which presumably is a typo of David Sharp (mountaineer). Should redirects be created for links that are missing a parenthesis? I've done a few in the past, until I realized that it might become an endless and pointless task. Trivialist (talk) 18:45, 21 April 2013 (UTC)

I am really not familiar with policy matters surrounding redirects. Maybe ask at WT:REDIRECT for some clarity. Thanks, West.andrew.g (talk) 13:01, 9 June 2013 (UTC)

Relevant discussion thread at VPT

Link: Wikipedia:Village_pump_(technical)#Are_thousands_of_people_a_day_not_finding_the_articles_they_want.3F. Biosthmors (talk) 10:48, 9 June 2013 (UTC)

I have responded on the talk page there. Thanks, West.andrew.g (talk) 13:03, 9 June 2013 (UTC)

Popular redlinks with \x

User:West.andrew.g/Popular redlinks is currently dominated by entries containing \x. See Wikipedia:Village pump (technical)#Are thousands of people a day not finding the articles they want? for discussion. I suggest mentioning the issue in the User:West.andrew.g/Popular redlinks header. Something like: "Entries containing \x appear to originate from percent encodings of real titles but with % incorrectly replaced by \x (discussion). The error is probably made by some external software." PrimeHunter (talk) 11:34, 9 June 2013 (UTC)

Go ahead, be bold and make the change. FWIW, these improper encodings look to me like someone is trying to in-line Wikipedia content with their own site and has seemingly screwed up the encoding code. Thanks, West.andrew.g (talk) 13:00, 9 June 2013 (UTC)

I have added the suggested text.[4] PrimeHunter (talk) 22:31, 9 June 2013 (UTC)

Non-roman-character entries

According to Google's autotranslator:
- Entry #40 娱乐 yúlè is Chinese for "entertainment".
- Many or all of the Arabic-language entries are sexual.
  - Anthony Appleyard (talk) 08:29, 10 June 2013 (UTC)

Actually all the Arabic entries on the list are sexual. I can provide the exact translations if needed. --Meno25 (talk) 15:02, 16 June 2013 (UTC)

Any chances of lengthening the list anytime soon?

Perhaps down to 500 or 100 views per week? I know this takes significant computational resources, but I figured I'd ask about it again anyways. Thanks. Biosthmors (talk) 16:07, 13 August 2013 (UTC)

I believe a similar page was posted on my talk page User_talk:West.andrew.g and I have responded there about my willingness to do something one-off for the Signpost (or possibly just for the heck of it, I suppose). I may be able to extend the list slightly, but I am assuming the list entries will grow exponentially as the threshold is lowered. Thanks, West.andrew.g (talk) 16:43, 16 August 2013 (UTC)

What's going on?

I created Iran\xE2\x80\x93United States relations because it was listed. However, http://stats.grok.se/en/latest/Iran%5CxE2%5Cx80%5Cx93United_States_relations shows no views. Thoughts? Biosthmors (talk) 16:11, 13 August 2013 (UTC)

I cannot speak to how the "stats.grok.se" service chooses to interpret the character encodings given to its input field. From a programming perspective, this can be complex and non-trivial (obviously, since so many of these seem to end up on the red-link list). I spot checked several of the encoded red-links in their service and all showed zero views throughout August. Obviously they are doing things differently; but I am confident I am just reporting strings as they appear in the raw stats. Thanks, West.andrew.g (talk) 16:52, 16 August 2013 (UTC)
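
For anyone who wants to see which real title a "\x" entry most likely stands for, here is a minimal Python sketch. It assumes (as suggested in the "\x" thread above) that each literal `\xHH` sequence is a UTF-8 byte that should have been percent-encoded as `%HH`. The helper name `repair_backslash_x` is made up for illustration; this is not the code that generates the list.

```python
import re

# Matches a literal backslash, the letter x, and two hex digits (e.g. "\xE2").
_ESCAPE = re.compile(r"\\x([0-9A-Fa-f]{2})")

def repair_backslash_x(title: str) -> str:
    """Rebuild the UTF-8 title that literal "\\xHH" escapes probably stood for.

    Assumption: every "\\xHH" is one byte of a UTF-8 sequence.
    """
    raw = bytearray()
    pos = 0
    for match in _ESCAPE.finditer(title):
        # Copy the ordinary text before this escape, then the escaped byte.
        raw += title[pos:match.start()].encode("utf-8")
        raw.append(int(match.group(1), 16))
        pos = match.end()
    raw += title[pos:].encode("utf-8")
    # Invalid byte sequences become U+FFFD instead of raising an error.
    return raw.decode("utf-8", errors="replace")

broken = r"Iran\xE2\x80\x93United States relations"
print(repair_backslash_x(broken))
```

The three bytes E2 80 93 are the UTF-8 encoding of an en dash, so the repaired string is the title with a dash between "Iran" and "United States".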

Daniel Luke Barth

Trying to spam himself back into existence, #1 on the new topred list [5]. When it briefly existed in April he almost got onto the top 25 - Wikipedia:Wikipedia_Signpost/2013-04-29/Traffic_report.--Milowent • ^hasspoken 14:21, 3 November 2013 (UTC)

oddities with plus sign

List of high school football rivalries (100 years ) is listed with 1035 hits, but stats.grok.se says it only had 11. A plus sign after "years" has been replaced with what looks like a space: List of high school football rivalries (100 years+).

The same thing is happening with N 1 redundancy/N+1 redundancy, C classes/C++ classes, Template (C )/Template (C++), Reference (C )/Reference (C++) and New (C )/New (C++).

In WP:TOPRED, O++++++++++++++++++++++++++++++++++++++++++ is shown with 1924 requests, but stats.grok.se says it had none —rybec 21:19, 3 November 2013 (UTC)

Thanks User:Rybec for noting this. This seems to be caused by a funny encoding issue. For reasons of spatial efficiency, my back-end database stores title names in plain ASCII format, encoding any special characters. Then, for analysis and presentation purposes, I convert them back to their Unicode/UTF-8 representation. This seems to work fine except for the "The plus sign "+" is converted into a space character ' '." caveat I just discovered in the documentation. A fix has been made and subsequent reports should not have this issue.

However, this is half the problem. There are two ways one can denote a "+", either by typing it or by its "%2B" encoding. The O++++++++++++++++++++++++++++++++++++++++++ example above were actually requests for the page "O%2B%2B%2B%2B%2B%2B%2B%2B%2B%2B%2B%2B%2B%2B%2B%2B%2B%2B%2B%2B%2B%2B%2B%2B%2B%2B%2B%2B%2B%2B%2B%2B%2B%2B%2B%2B%2B%2B%2B%2B%2B%2B" (and WP's statistical logger sees these as distinct). However, you can't look up the former on stats.grok.se because it decodes all input it gets (whereas WP statistics has no such canonical representation). Basically we're at a weird corner case here regarding what software chooses to decode/encode and when (and we can also throw WP's parser and web server into this mix of confusion). I can confirm there were ~2000 requests for "O%2B%2B%2B%2B%2B%2B%2B%2B%2B%2B%2B%2B%2B%2B%2B%2B%2B%2B%2B%2B%2B%2B%2B%2B%2B%2B%2B%2B%2B%2B%2B%2B%2B%2B%2B%2B%2B%2B%2B%2B%2B%2B" last week, in that exact request format. The "+" character is particularly problematic because it has both standard and encoded representations. West.andrew.g (talk) 18:05, 4 November 2013 (UTC)
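
The "+" caveat can be reproduced in a few lines. This is only an illustrative sketch in Python; the report's actual back-end is not shown on this page, and the helper name `decode_title` is hypothetical. Python's `urllib.parse.unquote_plus` applies form-style decoding (a "+" becomes a space), while `urllib.parse.unquote` leaves a literal "+" alone.

```python
from urllib.parse import unquote, unquote_plus

def decode_title(raw: str) -> str:
    """Percent-decode a stored title without treating "+" as a space."""
    return unquote(raw)

typed = "N+1_redundancy"        # "+" typed literally in the request
encoded = "N%2B1_redundancy"    # the same character sent as "%2B"

# Form-style decoding loses the plus sign in the typed variant.
print(unquote_plus(typed))      # N 1_redundancy
# Plain percent-decoding keeps it.
print(decode_title(typed))      # N+1_redundancy
# Both request forms decode to the same title, although the raw
# strings are different and are therefore counted separately.
print(decode_title(encoded))    # N+1_redundancy
print(typed == encoded)         # False
```

The last two lines show the other half of the problem: the typed and "%2B" forms are distinct strings in the raw data even though they display identically once decoded.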

Comma?

A lot of these articles have commas at the end of them when they shouldn't be there. Is there a reason for this?--Launchballer 10:11, 1 September 2013 (UTC)

Hey User:Launchballer. I think it's just a bot being silly (not real human attempts). Biosthmors (talk) 10:24, 1 September 2013 (UTC)

On the current list, ISO_3166,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,, is #58 with 3654 hits, whereas stats.grok.se is showing 2 requests in the last 90 days. —rybec 21:11, 4 November 2013 (UTC)

I cannot speak to how stats.grok.se is doing things. There have been plenty of posts on "TOPRED" and "5000" about how the numbers sometimes don't agree. I have spent a lot of time tracking down these instances, and my aggregation has overwhelmingly come back error free. There are slight logical differences in the way redirection and normalization are handled between the two pieces of software. I can confirm per the raw data that several thousand hits occurred on UTC "Thursday" of last week. Oddly, if you double the number of commas and ask stats.grok.se, it produces a figure approaching that of my aggregation with half the commas. Maybe some weird encoding thing going on over there? I can also confirm that whatever was doing this also made accesses for virtually every quantity of trailing commas (though usually just 1-3 hits). West.andrew.g (talk) 21:33, 4 November 2013 (UTC)

CSD policy discussion

FYI I have also started a policy discussion at Wikipedia talk:Criteria for speedy deletion#Non-human_typo_redirect. John Vandenberg ^(chat) 02:50, 13 December 2013 (UTC)

I think all encoding errors like \x redirects should be just speedily deleted. No need to create complex proposals for a simple task. As an aside, someone has software that tries to access Suzukake no Ki no Michi de "Kimi no Hohoemi o Yume ni Miru" to Itte Shimattara Bokutachi no Kankei wa Dō Kawatte Shima redlink over 1800 times while the correct page is Suzukake no Ki no Michi de "Kimi no Hohoemi o Yume ni Miru" to Itte Shimattara Bokutachi no Kankei wa Dō Kawatte Shimau no ka, Bokunari ni Nan-nichi ka Kangaeta Ue de no Yaya Kihazukashii Ketsuron no Yō na Mono. Too long title but apparently cannot be shortened :) Or maybe this is simply truncation error in the WP:TOPRED list itself? jni (talk) 15:46, 17 December 2013 (UTC)

@John Vandenberg: Thanks for the pointer. FWIW, I have probably made about 5 redirects in my history, all related to my own user and project space. I tend to be a data steward and try to stay out of policy debates. I am happy to chime in if there is a technical question, but otherwise I am quite agnostic on the outcome.

@Jni: It would not surprise me if I truncated my database column for titles at 128 or 256 characters for spatial efficiency (although it's still a VARCHAR). I can't say I am too traumatized by these very, very isolated corner cases. West.andrew.g (talk) 17:38, 17 December 2013 (UTC)

trailing m and n

Any ideas why there are so many entries now with a trailing n or m, such as Central processing unitn? I suspect a bug in some bot. user:Vanisaac is creating these entries (using automation .. ironically ..), which I don't think is a good idea. John Vandenberg ^(chat) 11:53, 17 November 2013 (UTC)

John Vandenberg, what is your concern with putting in redirects from all of these? Van Isaac_{WS Vex}^contribs 11:56, 17 November 2013 (UTC)

If those items on this list are simply because of a bug in someone's software, it is probably fixed now and you're not helping the bot writer anyway - they need to fix their code. The concern is that you're putting improbable typos into the page list (which is a valid WP:CSD criterion - R3), which will pop up in lots of undesirable spots, such as a space-wasting entry in the auto-complete search results. John Vandenberg ^(chat) 12:05, 17 November 2013 (UTC)

Hmm. Well, I got about halfway through the list before I saw the notice, but I'll hold off on doing anything for a few days and see if we can't figure out what's going on. I've got both this page and VPP (where I first saw this problem) on my watchlist, so we'll see if it doesn't get more opaque. I'll save up a copy of all the redirects so I can run them back through and tag them for CSD if we figure out they aren't any good. Van Isaac_{WS Vex}^contribs 12:14, 17 November 2013 (UTC)

No drama. I expect they will drop off the next refresh of this list. If not, we may have a head scratcher. John Vandenberg ^(chat) 14:11, 17 November 2013 (UTC)

It may not be one bot that is at fault, but many servers across the world storing or sending page addresses with a final junk character (m or n or occasionally p). User:Vanisaac may as well continue his good work; I am having a go at making these redirects also. Anthony Appleyard (talk) 19:44, 17 November 2013 (UTC)

Err... what is the logic behind servers randomly inserting a junk character? Quite the contrary, I imagine this is all the work of a (single) popular content scraper having a bug. With so many WP mirrors out there, it would be difficult to determine where this is without the help of the analytics team. I don't want drama either, but I'm not sure that creating redirects is the appropriate solution: we're treating symptoms instead of the cause. This site could have an ENTIRE WP mirror and we don't want to create these "m" and "p" redirects for every article (and do it all again later if a different site develops a different bug). If anything, helping these pages resolve to the correct location will keep the developers from realizing there is a problem. West.andrew.g (talk) 22:22, 17 November 2013 (UTC)

I've changed a few of the redirects that could have other uses, for example Catp now points to the same target as CATP. Peter James (talk) 23:40, 17 November 2013 (UTC)

Yeah, I avoided redirecting anything that could have been some sort of technical acronym. Basically every -p seemed to fit that criterion: URLp, catp, etc. And while I know that the sheer number of them means that many or even most should be redirected, I don't really have a litmus to tell which ones. Almost all of the -m and -n's are unambiguously redirectable. Van Isaac_{WS Vex}^contribs 02:25, 18 November 2013 (UTC)

There are links Catp, Dogp, Foxp, Lawp, IBMp, YouTubep, and suchlike: these make it more likely that the -p is a spurious addition like -m and -n, and that e.g. Catp should have been left redirecting to Cat. "what is the logic behind servers randomly inserting a junk character?" :: if many servers are using the same faulty software. Anthony Appleyard (talk) 14:56, 18 November 2013 (UTC)

Only if readers mention how they are reaching these pages. These page views could all be from spammers or bots (with a few added because of the links on this page, and creation of redirects) - this is already known to occur [6][7][8] - and if that's the case do we want properly functioning links? Some of the page views have declined after a few days; maybe they have already moved to a new set of pages. Peter James (talk) 17:42, 18 November 2013 (UTC)

I'm wondering if we can't schedule in another bot run to see what the list looks like for the past couple days - maybe set the threshold at 200. I figure there are three possibilities here: 1) If none of those same "typos" are getting repeated, then it's probably just a bot that's gone on to other targets - should we maybe give some thought to the possibility of an utterly pitiful DDOS? 2) If none of the same type of bad calls are getting processed, then the problem has gotten fixed somewhere else. 3) If the same queries are showing up, then these may have leaked out - maybe reddit or some other external site is having issues and we should just accept that these bad queries need to be dealt with for a while. Van Isaac_{WS Vex}^contribs 19:42, 18 November 2013 (UTC)

Statistics show many of these started to appear on 1-2 November, and the number of page views was at its highest around 11-14 November; others have stayed "popular", or have increased more recently (Hallucinogenic_fishm, almost all views on 15 November[9]). There are others, such as "BBCp"[10], but there are not similar results for all titles, so maybe they are linked somewhere on external sites, but without any comments from readers or referrer data it's unclear; it could be either incoming traffic that could be directed to the correct pages (either via redirects or hatnotes) or just spam. I'd expect, if it was actual readers, they would see the incorrect title and go to the correct page. One of the more obscure pages to appear in the list has an unusual pattern of page views: [11][12][13]. I've noticed other unusual statistics. Does anyone know which time zone http://stats.grok.se uses? Peter James (talk) 21:26, 18 November 2013 (UTC)

The other unusual statistics I could find, related to at least one entry here, were probably from Reddit. Peter James (talk) 21:44, 18 November 2013 (UTC)

To answer the one question above, I would assume we are all operating on UTC time. West.andrew.g (talk) 21:50, 18 November 2013 (UTC)

Update. The bad +m/+n page requests are still coming in, and many of them are the same this week as last. I've made a list of all the redirects, along with a link to the page view stats. Just a few random ones are Master (Doctor Who)m views, List of poker handsm views, High School Musicalm views, Andrés Iniestam views, Wolf of Wall Streetm views, and Physical attractivenessn views, and you can see that they are seeing persistent hits across the weeks. I'll keep checking in every week, and keep making these guys until (reddit?) whoever stops posting the bad links. I'm looking for feedback on how long until the expiry date, even if new ones keep popping up, because I know they are not optimal redirects. VanIsaac_{WS Vex}^contribs 03:47, 26 November 2013 (UTC)
- It's absurd that these are on the list at all. It would be nice if some software/script could check, for each string, (a) whether the string ends with an "m" or an "n" and, if so, (b) whether an article exists that matches the string minus its last character. If the answer to (b) is "yes", then there shouldn't be an entry in the list. Maybe there should be a separate list for these, but they shouldn't be mixed in with valid redlinks.
- As for how long to keep the redirects (if I understand the prior comment correctly), they can (and should) be left in place forever, simply because it's more trouble to remove the redirects (requires human effort) than to leave them in place (requires a vanishingly trivial amount of storage space and processing power). -- John Broughton (♫♫) 19:49, 28 November 2013 (UTC)
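
The (a)/(b) check described in the first bullet above could look roughly like the Python sketch below. Everything here is hypothetical: the function name `split_suffix_noise`, the in-memory set of existing titles (a real run would have to ask the wiki whether each stem exists), and the view counts, which are invented purely for the demonstration.

```python
def split_suffix_noise(redlinks, existing_titles, suffixes=("m", "n")):
    """Separate ordinary redlinks from "existing title + junk letter" entries.

    redlinks: iterable of (title, views) pairs.
    existing_titles: set of titles known to exist (assumed to be available).
    """
    genuine, noise = [], []
    for title, views in redlinks:
        stem = title[:-1]
        # (a) ends with a suspect letter, (b) the title minus that letter exists.
        if len(title) > 1 and title.endswith(suffixes) and stem in existing_titles:
            noise.append((title, views))
        else:
            genuine.append((title, views))
    return genuine, noise

# Invented sample data, for illustration only.
existing = {"Central processing unit", "Rhubarb"}
sample = [
    ("Central processing unitn", 1200),
    ("Rhubarbm", 1100),
    ("Riley Stearns", 1713),
]
genuine, noise = split_suffix_noise(sample, existing)
print(genuine)  # [('Riley Stearns', 1713)]
print(noise)    # [('Central processing unitn', 1200), ('Rhubarbm', 1100)]
```

Returning both lists, rather than dropping the noise, matches the suggestion of keeping such entries in a separate list instead of hiding them.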

The problem with leaving the redirects is that they are not a neutral presence. They show up in the search bar as a possible article hit, there is server overhead, and I'm sure there are many bots, scripts, and other processes that have to deal with their existence. Deleting them is actually fairly straightforward: since I've kept a list of every one of them that I've created, I can use AWB - which I use to create all of them - to tag them all for deletion. Van Isaac_{WS Vex}^contribs 20:51, 28 November 2013 (UTC)

The problem with that is if we create these to accommodate the page-views (assuming that these are human page-views) people will bookmark them and post links to them (Reason for not deleting #4). If we're not going to keep them, we shouldn't create them. Emmette Hernandez Coleman (talk) 11:43, 29 November 2013 (UTC)

In an RFD, a bunch of these were deleted. Emmette Hernandez Coleman (talk) 11:31, 29 November 2013 (UTC)

Well, if they show up in this week's run, I'll be recreating them. Note: I didn't actually create any of those. Van Isaac_{WS Vex}^contribs 11:46, 29 November 2013 (UTC)

If you mass recreate redirects that were deleted at RfD, that would be disruptive and may result in your being blocked from editing. Please don't do it. WJBscribe (talk) 13:27, 29 November 2013 (UTC)

See discussion at Wikipedia:Administrators' noticeboard/Archive257#Mass creation of very improbable redirects. Anthony Appleyard (talk) 13:17, 29 November 2013 (UTC)

I have posted an important observation at Wikipedia:Administrators' noticeboard#WP:TOPRED up to 940: The +m/+n titles very consistently report 10-13% of the page views of the corresponding real title. It seems this cannot be a coincidence. A few tests indicate there is also a strong correlation for +p titles, but the percentage is much lower. PrimeHunter (talk) 17:46, 19 December 2013 (UTC)

A naive question: is there someone who has access to server logs who can track where these clicks are coming from, either the IP address of the click or the referring web pages? – Jonesey95 (talk) 22:53, 20 December 2013 (UTC)

@Jonesey95: Certainly that capability exists. WMF machines are storing raw log information before it is aggregated in the form I operate over. What precise individuals have the permission to look at this information (assuming it is retained), or would have the ability to retain it, is less clear to me. Erik Zachte leads the analytics team. There is an analytics mailing list that includes several devs and related persons. That could be a reasonable starting point. I have long thought that referrer/source-IP information (even extremely aggregated, for privacy reasons) would serve great utility. It could troubleshoot issues like this and could explain whether WP:5000 entries are legitimate looking or not. West.andrew.g (talk) 04:37, 24 December 2013 (UTC)

duplicate entries

In the current list, List of misconceptions about illegal drugs" \l "Man slices off his face and feeds it to dogs appears at #16 and #20, whereas Milky Wink (album) shows up at #90 and #92. —rybec 23:14, 18 December 2013 (UTC)

@Rybec: -- This is an artifact of how the pages are presented in order to improve human readability of this list. The back-end database shows separate entries "Milky%20Wink_(album) 3522" and "Milky_Wink_(album) 3502". For human output we resolve the percent encoding and change underscores to spaces. Thus, even though these pages both get users to the same place after redirection and/or canonical representation takes over, they represent distinct entry points. West.andrew.g (talk) 17:11, 19 December 2013 (UTC)
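
A small Python sketch of that presentation step shows how two distinct raw entries end up looking like duplicates. The two raw strings and their counts are the ones quoted in the reply above; the function name `display_title` and the grouping code are assumptions for illustration, not the report's actual implementation.

```python
from collections import defaultdict
from urllib.parse import unquote

def display_title(raw: str) -> str:
    """Resolve percent encoding, then show underscores as spaces."""
    return unquote(raw).replace("_", " ")

# Raw back-end entries as quoted in the discussion.
raw_counts = {
    "Milky%20Wink_(album)": 3522,
    "Milky_Wink_(album)": 3502,
}

# Group the raw entry points by the title a reader would see.
by_display = defaultdict(list)
for raw, views in raw_counts.items():
    by_display[display_title(raw)].append((raw, views))

for shown, entries in by_display.items():
    print(shown, entries)
```

Both raw strings map to the displayed title "Milky Wink (album)", which is why the list appears to contain the same article twice with slightly different counts.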

Thank you for explaining! I saw on your talk page where someone had asked that the underscores be suppressed. The face-slicing one looked like it could be a broken hyperlink from someone's Web page. I had tried to find such a page using the Google inurl: operator. If I had tried with the uncooked information, perhaps my search would have more chance of success. Last week's list included Rich_The_Kid_ (stats.grok.se) at #106, but this week Rich The Kid (stats.grok.se) appears twice, at #117 and #344. These different entry points may give us clues about the sources of failed requests. If I had my 'druthers, even the percent encodings would be made visible, by piping: [[title%20with%20spaces]] displays as

title with spaces

but [[title%20with%20spaces|title%20with%20spaces]] displays as

title%20with%20spaces

While I'm asking for things...the entries in the list look like this:

Rank	Article	Views
379	Riley Stearns	1,713

I usually want to search Wikipedia, and look at the graph on stats.grok.se, so if those could be linked directly from the list, like this:

Rank	Article	Search	Stats	Views
379	Riley Stearns	search	stats	1,713

it would be a little bit more convenient, for me. —rybec 21:41, 19 December 2013 (UTC)

@Rybec: I have no issue including the "search on Wikipedia" link. I also agree that providing the "literal" requested title with encodings could also be helpful, although I am inclined to do it adjacent to the more "user friendly" version rather than replacing it altogether. The only thing I am not in love with is the "stats" link. Stats.grok.se is known to choke on some encoding operations, particularly those of the "\x" variety (and nasty encoding bits are not uncommon among redlinks). Stats.grok.se is generally a helpful service, but if its numbers don't agree with mine (even when I know I am right) then the talk page messages will come in bulk. Maybe we could find a way to disclaimer this? (Even though the number of warning notes on that page is quickly stacking up). Thoughts? West.andrew.g (talk) 07:20, 24 December 2013 (UTC)

A * redirects

What is it with all these redirects that say "A *" (e.g. A bit)? I created a few of them, but I stopped once it became apparent there were a lot of them (except for some obvious ones like A woman). Emmette Hernandez Coleman (talk) 10:41, 5 February 2014 (UTC)

The ones that are still redlinks (i.e. I haven't created them) are: A look, A problem, ~~A while~~, A plan, and A way. Emmette Hernandez Coleman (talk) 10:47, 5 February 2014 (UTC)

Same thing seems to be happening with "The *": The hell (soft rdr to Wiktionary), The middle, The pin, The dance, The difference, The rest, The year, The traffic, The fridge, The field, The spot, The house, The loss, The mall (rdr to disambig page Mall). Emmette Hernandez Coleman (talk) 10:55, 5 February 2014 (UTC)

"A *" redirects I have created are: A sandwich (rdr to sandwich), A phone (rdr to Telephone), A lot (rdr to Allot (disambiguation)), ~~A woman (rdr to A Woman)~~, A bit (soft rdr to Wiktionary). Emmette Hernandez Coleman (talk) 11:02, 5 February 2014 (UTC)

Never mind about the "The *" redirects; all of them except for The pin and The traffic are just capitalization variations of existing titles or redirects, and those remaining two aren't worth starting a discussion over. This still leaves the "A *" mystery. Emmette Hernandez Coleman (talk) 11:19, 5 February 2014 (UTC)

I've eliminated A while and A woman from the list, as they are alternate spellings/capitalizations of Awhile and A Woman. Emmette Hernandez Coleman (talk) 11:42, 5 February 2014 (UTC)

Please read the warnings atop WP:TOPRED (and their linked discussions) regarding the mass creation of redirects based on this list (some people have been quite riled up about it recently). As a red-link phenomenon this is no different than the "trailing m and n" issue we've been dealing with for several months, i.e., probably just a misconfigured bot or script somewhere out there. Please secure consensus before proceeding with mass redirect creation. Thanks, West.andrew.g (talk) 15:10, 5 February 2014 (UTC)

@Emmette Hernandez Coleman: compare the traffic graphs on stats.grok.se. For example, look at how the November 2013 traffic for A bit, A lot, and The rest all have a gap from the 8th to the 12th (where the traffic for the Main Page does not). That looks like most of the traffic is coming from one source. Before the recent surge in requests, these did get a fair number of requests (examples: [14] [15]). If it's desirable to create redirects for the terms which have gotten an artificial boost in requests, it would be more desirable to have articles or redirects for all the nouns in a Basic English word list [16], prefixed with the definite and indefinite articles. —rybec 20:46, 5 February 2014 (UTC)

When an English phrase like that gets quite a few (legitimate) requests, it would probably be worth creating a Wiktionary redirect. Emmette Hernandez Coleman (talk) 03:42, 6 February 2014 (UTC)

I did. I didn't realize there were so many of them at first, but when I did I brought this here. I only created three of these (ignoring A woman which is just a cap variation of A Woman, and A lot which is just an alt spelling of Allot (disambiguation)). Emmette Hernandez Coleman (talk) 20:53, 5 February 2014 (UTC)

And it's not like there are dozens, let alone hundreds of these redlinks, unlike m/n. Emmette Hernandez Coleman (talk) 03:28, 6 February 2014 (UTC)

more on trailing m/n

Can the script that generates this list please check whether pagename is bluelinked-pagename + m or n, and if so not list it? — The Great Redirector 01:12, 3 February 2014 (UTC)

@The Great Redirector: It *could* be done, but I don't find it necessary. (1) The inclusion of these entries does not exclude any other entries from the list (it's based on number of redlinks with views over a threshold). (2) It makes clear the fact that this "m" and "n" thing is a problem, and some bot/client out there is misconfigured. (3) I am just presenting the raw data "as is". Next month there will be some new odd phenomena that we might need an exclusion rule for. Should we be excluding entries that do special character encoding incorrectly? (4) It is easy to game this list. We don't know which traffic is from humans and which is from bots. I don't want to start making editorial decisions about such things. West.andrew.g (talk) 17:43, 3 February 2014 (UTC)

It's difficult to find the legitimate redlinks (things we might want to actually create redirects/pages for) with all the noise from the m/n's. I'm not asking for them to not be listed, but could they at least be separated into their own section or something? Emmette Hernandez Coleman (talk) 08:57, 5 February 2014 (UTC)

Now that you mention it, it might be a good idea to separate (software-detectable) encoding errors into their own section, provided the false positive rate would be low. Emmette Hernandez Coleman (talk) 09:11, 5 February 2014 (UTC)

Since we are not allowed now to redirect (say) Rhubarbn or Rhubarbm or Rhubarbp to Rhubarb, these "bluelink plus m/n/p" entries are merely clutter and should be listed separately. Anthony Appleyard (talk) 08:25, 6 February 2014 (UTC)
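
For illustration only, the "bluelink plus m/n/p" separation requested in this thread could be sketched roughly as follows. This is not the script that generates the list; the function name, the `existing_titles` set (standing in for whatever lookup tells you a title is bluelinked), and the choice of suffix letters are all assumptions.

```python
def split_trailing_suffix(redlinks, existing_titles, suffixes=("m", "n", "p")):
    """Split redlinks into (regular, suspect).

    A redlink is 'suspect' when dropping its last character yields a title
    that exists AND that last character is one of the stray suffix letters.
    `existing_titles` is a hypothetical set of known bluelinked titles.
    """
    regular, suspect = [], []
    for title in redlinks:
        if len(title) > 1 and title[-1] in suffixes and title[:-1] in existing_titles:
            suspect.append(title)
        else:
            regular.append(title)
    return regular, suspect
```

Under those assumptions, "Rhubarbn" and "Rhubarbm" would land in the separate section when "Rhubarb" exists, while a genuine redlink that merely ends in "n" would stay in the main list. The false positive rate on real data is untested here.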

Has trailing m/n disappeared?

I haven't done a complete analysis, but it appears the "trailing m/n" case that dominated our talk page discussions for so long has either disappeared or faded into the background. Three question marks "???" appended to titles seems like an emerging trend. Further proof that we shouldn't be too stressed out by these things and that custom written output filters wouldn't be of tremendous utility. Just my two cents. West.andrew.g (talk) 01:38, 20 May 2014 (UTC)

On the Non-Reporting of Mobile Views

A significant statistical issue has come to my attention. Quite simply, the WMF does not record/report per-article mobile views, and thus they are unavailable for my aggregation....

The complete write-up is at User_talk:West.andrew.g/Popular_pages#STICKY:_On_the_Non-Reporting_of_Mobile_Views.

Please consolidate all discussion at that location. Thanks, West.andrew.g (talk) 18:41, 4 September 2014 (UTC)

Sortability of the list

[Andrew, do you think you could...] Make the mobile list sortable? It would really help in sorting the spam from the real inquiries. Thanks. Serendi ^pod ous 17:33, 8 November 2014 (UTC)

@Serendipodous: Done in code. It wasn't there already, because prior to the introduction of mobile/stats, there was only a single "views" column. The change will be reflected when the automatic update triggers this evening. Please ping me immediately if something doesn't come out right. Thanks, West.andrew.g (talk) 18:40, 8 November 2014 (UTC)

2 early entries

Entry rank #3 X-default: see https://www.google.co.uk/search?q=%22X-default%22&ie=utf-8&oe=utf-8&gws_rd=cr&ei=5eYWVfytAaSQ7AaLlICYDg for information.
Entry rank #17 Vortaroj is Esperanto for "dictionaries".
- Anthony Appleyard (talk) 17:39, 28 March 2015 (UTC)

Popular redlinks for non-English language Wikipedias

Is it possible to generate a list of popular redlinks for Welsh (language code cy), and for other non-English Wikipedias where there is demand please? Alternatively is there a script which I can run myself? Thanks. --Oergell (talk) 14:09, 26 April 2016 (UTC)

@Oergell: Thanks for your question. I store data only for English Wikipedia and the process is not as straightforward as using a single script to generate. I have a server dedicated to my Wikipedia research. It takes over an hour a night to ingest and parse that day's data, and then there is some additional weekly report generation overhead atop that. Storage space and run time were why I chose to limit my scope. I also know there are some changes coming in the next few weeks regarding how the WMF shares raw statistical data. I do not yet understand the impacts, if any, on my ability to produce the red links report. Unfortunately, I have no plans to expand my scope, and before I help anyone else mirror my workflow/infrastructure I'd like to make sure my strategy isn't about to break. Thanks, West.andrew.g (talk) 16:31, 27 April 2016 (UTC)

@West.andrew.g: That is very interesting and I understand that you are already busy. I wonder if I could work on a partial solution to generate a sample of redlinks likely to be worthy of attention. It would not have to be as definitive as yours. Thanks again for your response. --Oergell (talk) 15:54, 29 April 2016 (UTC)

@Oergell: The vastly oversimplified process is as follows. I get the raw statistics from [17]. Currently I use the deprecated and soon-to-be discontinued "pagecount-all-sites" (the last link). I don't yet understand the preferred files, their format, or how red-link processing is impacted. The statistical dumps are produced hourly. I wait until there are 24 unprocessed hours and parse them in batch and then aggregate to create daily totals for all (non)-pages and parse these into a database for persistent storage. When I want to report red-links, I query for all pages with at least X hits in the time period of interest. I then check all these titles against the Mediawiki API to determine if the title is actually a page that exists (being careful with character encodings throughout). I output stats for only pages that do not exist (thus, red links). The statistical files are organized by country/project code so one could quickly drill down to the areas of interest. Thanks, West.andrew.g (talk) 20:30, 3 May 2016 (UTC)
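
As a rough sketch of the steps described above (and not the actual code behind the report), the aggregation and existence check might look something like this. It assumes each hourly dump line has the space-separated form `project title views bytes`, that titles are percent-encoded with underscores for spaces, and that the API response comes from a MediaWiki `action=query&titles=...&format=json` request, where non-existent pages carry a `missing` key. All function names are hypothetical, and the network request and database storage are left out.

```python
from collections import Counter
from urllib.parse import unquote


def daily_totals(hourly_lines, project="en"):
    """Sum per-title views across a batch of hourly dump lines for one project."""
    totals = Counter()
    for line in hourly_lines:
        parts = line.split(" ")
        # Assumed line layout: project, title, view count, byte count.
        if len(parts) != 4 or parts[0] != project:
            continue
        try:
            views = int(parts[2])
        except ValueError:
            continue  # skip malformed counts
        # Assumed encoding: percent-escapes plus underscores for spaces.
        title = unquote(parts[1]).replace("_", " ")
        totals[title] += views
    return totals


def candidates(totals, threshold):
    """Titles with at least `threshold` views, most viewed first."""
    return [title for title, views in totals.most_common() if views >= threshold]


def missing_titles(api_response):
    """Titles flagged as missing in an already-decoded query response (a dict)."""
    pages = api_response.get("query", {}).get("pages", {})
    return sorted(page["title"] for page in pages.values() if "missing" in page)
```

For another language edition, the `project` argument would be the relevant code (for example `cy`); whether the newer dump files keep the same layout is not something this sketch addresses.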