User:Certes/Gene links


This page lists gene articles which are not linked from the base name. For example, there is no obvious route from ACR to ACR (gene). FOO is used as a placeholder to denote the base name such as ACR.

Dab missing entry

FOO is (or redirects to) a dab which does not list the gene.

 Done Section completed: add an entry for FOO (gene) to existing dab FOO.

Unrelated article with dab

FOO is (or redirects to) an article about an unrelated primary topic. FOO (disambiguation) is (or redirects to) a dab which does not list the gene.

 Done: Section completed: add an entry for FOO (gene) to existing dab FOO (disambiguation).

Unrelated article without dab

FOO is (or redirects to) an article about an unrelated topic. FOO (disambiguation) does not exist.

Fix: If the incumbent article is not primary, move it to FOO (topic) and list it along with the gene on a new dab FOO. Check for incoming links to FOO and update these. If the topic is primary but the initials also denote other topics, create FOO (disambiguation). Otherwise, the primary topic article needs a hatnote to the gene.

 Done Section complete except for CTU2, which is the actual name of the C16orf84 gene: requesting a second opinion from PamD or Seppi333.
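The fixes in the three sections above amount to a small decision procedure. The sketch below is only an illustration of that reading: the function name, its parameters and the returned wording are hypothetical, and deciding whether a topic is primary remains an editorial judgement that code cannot make.

```python
def suggest_fix(base_is_dab: bool,
                dab_page_exists: bool,
                incumbent_is_primary: bool = True,
                initials_have_other_topics: bool = False) -> str:
    """Map the situation at FOO to the fix suggested in the sections above.

    All parameter names are hypothetical; they restate the conditions
    used in the section descriptions.
    """
    if base_is_dab:
        # "Dab missing entry"
        return "add FOO (gene) to dab FOO"
    if dab_page_exists:
        # "Unrelated article with dab"
        return "add FOO (gene) to dab FOO (disambiguation)"
    # "Unrelated article without dab"
    if not incumbent_is_primary:
        return "move article to FOO (topic), create dab FOO, update incoming links"
    if initials_have_other_topics:
        return "create FOO (disambiguation)"
    return "add hatnote to FOO (gene) on the primary topic article"


print(suggest_fix(base_is_dab=False, dab_page_exists=False))
```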

Enzyme or protein article

FOO describes an enzyme or protein related to FOO (gene) but does not link to the gene.

Fix: Expert advice is needed.

Miscellaneous

See individual entries for a description of each anomaly.

Fix: Expert advice is needed.

Merged the wikidata sitelinks for NFATC2IP, KCTD9, and NFAM1 and the corresponding (gene) pages. Will deal with the rest a bit later. Seppi333 (Insert ) 00:10, 30 November 2019 (UTC)

Re-ALG2 (gene): I think it may be worth recoding and rerunning my User:Seppi333/GeneListNLP script to detect/write a list of target pages that are wikilinked from the gene lists and that contain all 5 of the words "Set", "index", "page", "lists", and "articles" on them in order to identify links to set index articles, unless you can locate those with an SQL query. The last time I ran that script, it took 1:33:45 (about 1.5 hours) to download and process all the pages, so if it's possible to locate them using another method, it'd probably be best to do that instead. Seppi333 (Insert ) 01:23, 30 November 2019 (UTC)

This PetScan query identifies SIAs linked from gene lists. Certes (talk) 10:25, 30 November 2019 (UTC)
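For illustration, the five-word check proposed above could look roughly like the sketch below. It is not the GeneListNLP script itself: the function name is hypothetical, and whole-word, case-insensitive matching is an assumption.

```python
import re

# The five words named in the comment above. Case-insensitive matching
# is an assumption; the original script's behaviour is not shown here.
SIA_WORDS = {"set", "index", "page", "lists", "articles"}


def looks_like_set_index(page_text: str) -> bool:
    """Return True if every word in SIA_WORDS occurs as a whole word."""
    words = set(re.findall(r"[a-z]+", page_text.lower()))
    return SIA_WORDS <= words


sample = ("This set index page lists articles about genes "
          "that share the same symbol.")
print(looks_like_set_index(sample))
print(looks_like_set_index("ALG2 is a human gene."))
```

A check like this would still need every linked page to be downloaded first, which is the slow step mentioned above.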

False positives

FOO links to FOO (gene) (or the target of that redirect) in a complex way not spotted by the Quarry queries.

Fix: probably no action but we may consider a more direct link.

Other links

Here are some other link issues raised by the gene lists. They need an expert to fix them because the suggested fix may be wrong, they may indicate wider problems, or the initialism redirect might merit conversion into a dab.

Direct links

The gene lists link directly to a page which is not in gene categories. These fall into two sections.

1. The target page appears not to be a gene. The link needs to be corrected. In each case, incoming links suggest that the non-gene article is the primary topic, but we could consider moving that article and creating a dab.

2. The target page appears to be a gene or closely related topic. Links may be correct but the gene page could be added to appropriate gene categories.

Redirects

The gene lists link to a redirect to a page which is not in gene categories.

Ahh. I was wondering why my NLP script didn’t locate those... it’s the hatnotes. I should probably reprogram it to fix that bug. Will fix these pages later tonight and (nothing to fix, except maybe conversion to DABs; I think you guys are better judges of when/how to disambiguate than I am, though, so I'll leave it to you) revise the wikitables once we locate all these pages. Seppi333 (Insert ) 02:02, 1 December 2019 (UTC)

Looks like you're right; all of them should link to the SYMBOL (gene) page since those are all the correct articles. I moved the Syk page to the official UniProt name for the protein (Tyrosine-protein kinase SYK) since the only synonym/alias with a lowercase spelling was "p72-Syk". I'll retarget the links in the gene lists/tables once we find the rest of these since it's much less work for me to add them all at once than piecewise. I can rewrite my script to detect the multi-word expressions used on the hatnote pages and just parse the leads to identify ones like Rho tomorrow since it's fairly easy to code that; but, I get the impression that you're able to identify all of the remaining mistargeted links by simpler means than downloading and parsing 11500 pages.
Makes me want to learn SQL. What other methods do you use to locate pages like this? I'm really curious now. Seppi333 (Insert ) 04:51, 1 December 2019 (UTC)
@Seppi333: In theory I could have located these with SQL. In practice, it might have been too complex to complete within Quarry's 30 minute limit, so I used PetScan instead with a Wikipedia search for incoming links. You mention checking 11,500 pages manually. In a way I've done that check myself, but only on the 30 or so suspicious pages that remained after filtering out cases that the queries suggest to be correct. Certes (talk) 12:57, 1 December 2019 (UTC)
Oh. Wow, that's a surprisingly useful tool then. The algorithm is actually fully-automated; it basically just iteratively goes through all ~11500 of the blue wikilinks on the four list pages one at a time, loads the page (it takes 1.5 hours to run almost entirely because it has to load 11500 pages; I can't run it on a database dump), and determines whether or not the words "gene", "genes", "protein" or "proteins" are present on the page. It missed most of the links above because those words are in the DAB hatnotes. I hadn't considered that being a possibility when I wrote it. I should have some time to revise both the wikitable script to fix the lists and mistargeted link detection script to do a second check within the next 12-24 hours; shouldn't take that long to do. Seppi333 (Insert ) 22:00, 1 December 2019 (UTC)
Finding the bad direct links is as simple as this, which takes 4 seconds. There are a few false positives such as Locus (genetics) from wikilinks not in the table, but they're obvious. The links via redirects took a little more fiddling. Certes (talk) 22:52, 1 December 2019 (UTC)
I'll have to make use of that tool; seems very handy. Going to work on the gene lists now and update it once I'm done. Seppi333 (Insert ) 10:07, 2 December 2019 (UTC)
Following up, I retargeted the links in the gene lists yesterday. Haven't quite finished reprogramming the other one yet, but I probably will be tomorrow. I'll retarget the non-list gene articles with mistargeted links sometime within the next couple of hours.
Assuming neither of us finds any additional pages, I suppose we're done. Thanks again for your help. Edit: I didn't notice the sections above; will get to them after I retarget the links. Seppi333 (Insert ) 10:04, 3 December 2019 (UTC)
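As a rough sketch of the whole-page check described in this thread, and of why hatnotes hid the mistargeted links from it: the code below flags a page only when none of the four words appears anywhere in its source. The function names and the sample pages are hypothetical, and page loading is left out.

```python
import re

# The four words the original check looked for, per the thread above.
GENE_TERMS = {"gene", "genes", "protein", "proteins"}


def mentions_gene_terms(wikitext: str) -> bool:
    """True if any gene-related word occurs anywhere in the page source."""
    words = set(re.findall(r"[a-z]+", wikitext.lower()))
    return not GENE_TERMS.isdisjoint(words)


def flag_suspects(pages: dict[str, str]) -> list[str]:
    """Titles whose source contains none of the gene-related words."""
    return sorted(title for title, text in pages.items()
                  if not mentions_gene_terms(text))


# Hypothetical sample pages. FOO is about a river, but its hatnote
# mentions "the gene", so the whole-page check does not flag it.
pages = {
    "FOO": "{{about|the river|the gene|FOO (gene)}}\n'''FOO''' is a river.",
    "BAR": "'''BAR''' is a village.",
    "BAZ": "'''BAZ''' is a protein encoded by the BAZ gene.",
}
print(flag_suspects(pages))  # ['BAR']
```

FOO is the false negative here: it is not a gene article, yet it passes the check because of the words in its hatnote.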

Further progress

@Seppi333: I've fixed incoming links apart from the gene lists which should link to CHML (gene) rather than CHML, AAMP (gene) rather than AAMP, etc. I see that some of these have been done manually in the lists (though a piped link might be better) but not in the Python. Also, do you have any thoughts about AKNA, CD96 and WRAP53? Certes (talk) 00:25, 16 December 2019 (UTC)

@Certes: Hey there! I'm really sorry for falling off the grid after my last reply here; it seems rather rude of me. I've been really busy off-wiki lately and forgot to work on this. My bad about that. I'll go ahead and finish addressing the links above within the next day or so since I now have some time to work on WP. I'll fix AKNA, CD96, and WRAP53 right now though. I only need to adjust their wikidata sitelinks and add {{infobox gene}} to the article source.  Done
BTW, I finished recoding an updated version of my mistargeted link detection algorithm last week. The updated algorithm is designed to detect the type of mistargeted links you uncovered since I used all of the links that you listed in this section as a sample of testcases; I continually revised the algorithm until it had a 100% detection rate on that sample. This time around, it took 3.5 hours (originally, 1.5 hours) for the algorithm to finish processing all ~12,500 blue wikilinks in the gene lists (LOL). The likely mistargeted links it found are included in the collapse tab below. It found a few more articles with similar issues to the ones that you listed above; these articles would be included in the 2nd list in the tab below. Sometime within the next 24-48 hours, I'll manually go through all the links in the tab below and highlight the mistargeted ones I find. This is probably the last set of links in the gene lists that need to be fixed/retargeted since I think I've accounted for all possible ways that a false negative might occur. Seppi333 (Insert ) 00:39, 18 December 2019 (UTC)
Output of the updated algorithm – will follow up after I've gone through it and marked the ones that need to be addressed.
I ran the updated algorithm early last week, so you might've already found/fixed some of these. Seppi333 (Insert )

Note: immediately after each bulleted entry below, there are two index values listed: i=# and j=#. Index i is the number of distinct gene-related terms that are present in the lead's source code and index j is the number of distinct gene-related terms that are present in the input parameters of the lead's hatnote templates, provided that any were found (NB: there are no entries in either list where one index is equal to 0 and the other is non-zero).

My original script detected links to articles where none of 4 gene-related terms (i.e., "gene", "genes", "protein", "proteins") were found anywhere in the article's source code (NB: these links would be marked with i=0; j=0 in the 1st list below); the updated version of my algorithm checked the source code of only the lead for 5 word tokens (i.e., the original 4 and "infobox_gene") instead of searching the full article's source code, so there's additional entries in the 1st list below that weren't detected by the original algorithm.
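A minimal sketch of how index i might be computed, under two stated assumptions: the lead is taken to be everything before the first "==" section heading, and "infobox gene" is normalised to the single token "infobox_gene". Neither detail is confirmed above, and the function names are hypothetical.

```python
import re

# The five single-word tokens listed above.
LEAD_TOKENS = {"gene", "genes", "infobox_gene", "protein", "proteins"}


def lead_of(wikitext: str) -> str:
    """Everything before the first '==' section heading (an assumption)."""
    return re.split(r"^==", wikitext, maxsplit=1, flags=re.MULTILINE)[0]


def index_i(wikitext: str) -> int:
    """Count the distinct tokens from LEAD_TOKENS found in the lead."""
    lead = lead_of(wikitext).lower()
    # Treat the template name "infobox gene" as one token (an assumption).
    lead = re.sub(r"infobox[ _]gene", "infobox_gene", lead)
    words = set(re.findall(r"[a-z_]+", lead))
    return len(LEAD_TOKENS & words)


article = ("{{Infobox gene}}\n"
           "'''FOO''' is a gene; the gene encodes a protein.\n"
           "== Function ==\n"
           "Proteins bind.")
print(index_i(article))  # 3
```

Repeats of the same word count once, and "Proteins" in the body section is ignored because only the lead is searched.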

The updated algorithm also listed all articles that included specific gene-related multi-word expressions (i.e., the following phrases: "the gene", "the genes", "the protein", "the proteins", "the enzyme", "the enzymes", "(gene)", "(enzyme)", and "(protein)") in the parameters of certain lead hatnotes if any were present: specifically, the {{about}} hatnote, {{for}} hatnote, and the family of redirect hatnotes like {{redirect}}/{{RDR}}, {{redirect2}}, etc. These new entries are included in the 2nd list below and have corresponding index values of i>0; j>0. If an entry in that list is marked with index values of 0<i<j, it's extremely likely that the link is mistargeted.
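The hatnote check can be sketched along the following lines. This is an illustration rather than the actual script: the regular expression covers only {{about}}, {{for}}, {{redirect}} with an optional numeric suffix, and {{RDR}}; nested templates are not handled; and the function names are hypothetical. Parameters are tokenised so that "the genes" is not also counted as "the gene".

```python
import re

# The nine multi-word expressions listed above.
EXPRESSIONS = {
    "the gene", "the genes", "the protein", "the proteins",
    "the enzyme", "the enzymes", "(gene)", "(enzyme)", "(protein)",
}

# Simplified hatnote matcher (an assumption, see the note above).
HATNOTE = re.compile(
    r"\{\{\s*(?:about|for|redirect\d*|rdr)\s*\|([^{}]*)\}\}",
    re.IGNORECASE,
)
# A token is either a parenthesised word such as "(gene)" or a bare word.
TOKEN = re.compile(r"\([a-z]+\)|[a-z]+")


def index_j(lead: str) -> int:
    """Count distinct expressions found in hatnote parameters of the lead."""
    found = set()
    for match in HATNOTE.finditer(lead):
        for param in match.group(1).lower().split("|"):
            tokens = TOKEN.findall(param)
            grams = set(tokens)  # single tokens, e.g. "(gene)"
            grams.update(" ".join(pair) for pair in zip(tokens, tokens[1:]))
            found |= EXPRESSIONS & grams
    return len(found)


def likely_mistargeted(i: int, j: int) -> bool:
    """The 0 < i < j rule stated above."""
    return 0 < i < j


lead = "{{about|the river|the gene|FOO (gene)}}\n'''FOO''' is a river."
print(index_j(lead))  # 2
```

With i supplied from the lead-token count, `likely_mistargeted(1, 2)` is true, while an entry with equal indices such as `likely_mistargeted(2, 2)` is not singled out by the rule.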


Entries in this list are articles where none of these 5 single-word tokens – gene, genes, infobox_gene, protein, proteins – are present in the source code of the article's lead.


Entries in this list are articles where one or more of these 5 single-word tokens – gene, genes, infobox_gene, protein, proteins – are present in the source code of the article's lead (index i is the count of how many distinct tokens were found, so if the word gene is repeated 2+ times in the lead and none of the other word tokens were found, the linked entry would have an index value of i=1) AND one or more of the following tokenized multi-word expressions – the gene, the genes, (gene), the protein, the proteins, (protein), the enzyme, the enzymes, (enzyme) – are present in the parameter inputs of an {{about}}, {{for}}, or redirect-type hatnote template that the algorithm found in the lead (index j is the count of the number of distinct aforementioned expressions that were detected in the hatnote's parameter inputs):

Also, thank you so much for helping me find and address the problematic links in the gene lists! I can't adequately express just how much I appreciate your assistance thus far.
If it weren't for you, several dozen links in the gene lists probably would've continued to point to the wrong articles since I don't think I would've realized the issues with the original algorithm that were producing false negatives. Seppi333 (Insert ) 00:46, 18 December 2019 (UTC)
No problem: there is no deadline and we all have things to do offline, especially in December. I'm happy to have helped but have probably done all I can for now. I think the only outstanding issue not mentioned above is cases like CTU2, where the base name leads to a rather flimsy non-gene primary topic and we need either a {{redirect}} hatnote or a two-entry dab. (I'm not sure which is better.) However, I think all the wikilinks now lead to the right destination even in those cases. We've made a lot of improvements and it looks as if the job's almost complete. Certes (talk) 01:26, 18 December 2019 (UTC)

I went through all the links and fixed problems that I found. In addition to the 4 you identified (CHML, DR1, HPX, and PIM2), it looks like only DDT is new. I'll fix these links in the lists shortly. Seppi333 (Insert ) 15:58, 24 December 2019 (UTC)

@Seppi333: I missed DDT because it's in Category:Nonsteroidal antiandrogens, a subcategory of Hormones, which I viewed as legitimate link targets. When I stopped excluding Hormones from my Petscan query, DDT appeared and nothing else did, so I don't see any similar cases. Most links to the pesticide seem correct but please can you fix the Python for List of human protein-coding genes 1 and check Protein design, which should perhaps link to DDT (gene) instead? Certes (talk) 16:23, 24 December 2019 (UTC)
Looks like the DDT link in protein design is correctly targeted; had to read the paper to verify which page to link to (quote: Then they synthesized the 24-mer (MIF1RPNVGAMSNFYHYPNIIIII:) designed to form a four-stranded β-sheet and to bind the insecticide DDT. It did indeed...). Working on recoding the Python script for the list pages right now. Seppi333 (Insert ) 17:23, 24 December 2019 (UTC)
 Done The lists have been updated with piped links for these genes. Seppi333 (Insert ) 18:48, 24 December 2019 (UTC)