User talk:Boud/sandbox/draft RfC Reduce advocacy in Find sources Module

From Wikipedia, the free encyclopedia

This is the talk page for a draft RfC. Although the draft RfC and this talk page are in user space, they should be considered non-deletable unless all authors of the pages clearly agree to delete them. For transparency, I personally would object to any (hypothetical) proposals to delete the two pages. Boud (talk) 21:53, 11 November 2023 (UTC)

We shouldn't focus on Google, but on the quality of results

So I understand that you want to make sure data privacy is preserved and that's a commendable goal, but I think in toolbars like these what should count most is utility.

So I went with the search query "Why is Tasmania so sparsely populated?". Compared with Google or Bing:

  • Qwant kinda passed the test, but barely. It mostly spoke of Australia and not Tasmania, but one or two links were relevant.
  • Mojeek failed it
  • Startpage gives almost exactly the same thing as Google, so I kinda get it.
  • Searx doesn't work at all. Maybe I'm dumb but just pasting the query gives no results.

Google Books can in fact be complemented with books from the Internet Archive. A lot of the books there are rubbish, but many more can actually be read for free and (likely) legally.

Google Scholar generally works well. Semantic Scholar has better options and is more user-friendly, but its scope is narrower. Internet Archive's is even smaller. So removing Google Scholar is a bad idea. For general scholarly indexing engines, you have Web of Science and Scopus, but these are not free and not available with TWL. There is no need to single out JSTOR from other scholarly providers, though, as many other providers are included in The Wikipedia Library. Hiding it in the weird "TWL" acronym makes little sense and doesn't encourage its usage. (While we are at it, the template can also advertise WP:RX and its advice on finding sources, so that folks who can't access a source but need to can go there for advice.)

Finally, you didn't mention it, but I think this will come up sooner or later. I personally have nothing against using Sci-Hub or Library Genesis for personal purposes, and at least in my country it's legal so long as you are not uploading or using it outside of fair use, essentially. I have to say these are great resources to use, but advertising shadow libraries this way may get us into trouble, so it is out of the question. Szmenderowiecki (talk) 23:21, 11 November 2023 (UTC)

Focussing on "results" alone would be difficult, since these will vary considerably for individual users, especially for the engines that most violate privacy, and are difficult to measure, especially with risks of circularity loops. The biggest risk of saying that we "only" want results is that we forget the idea of making a reasonable compromise between advocacy and "results". And privacy is an issue too, of course. I guess that some qualitative examples of search results could help create a viable RfC for people unfamiliar with the different search engines.
For Searx, which instances to choose is an open question. Ideally, Wikimedia techies would run "our own" Searx engine. They would adjust parameters based on the usual open discussion, with a mix of techie and non-techie issues. But that's not going to happen overnight, and there would need to be human + server resources to actually set up the server and run it.
The particular keywords matter. Inclusion of special characters such as ? is risky - each engine has its own choice of special sequences and symbols. But let's see:

The results seem to be about as good as for Startpage. But this particular query is not really the sort of question likely to be directly useful to Wikipedia - population sparsity as an average value is the population divided by the area - so the scientific question would be why the population is low (implicitly, in comparison to the area). Blogs are findable but not usable. So something like

give science articles. I don't see links about the current low population in Tasmania, but other search engines don't do much better.

The reliability of the different instances varies - that's why https://searx.space (for example) has a list of servers and its estimates of their reliability, speed, location, type of encryption certificate, and which network they are hosted on. searx.thegpm.org isn't currently listed there - so it's presumably one of the less reliable searx instances.

As for Mojeek and Qwant, I gave those to make the RfC as neutral as possible. Personally, I generally use Startpage and Searx, and sometimes DuckDuckGo. Limiting the RfC to the ones that we personally find useful would be against the aim of being neutral.

There are some sensitivities (probably WMF and US law and the intellectual heritage of Aaron Swartz?) about Sci-Hub and Library Genesis, so I agree that there's no point proposing them in the RfC. Boud (talk) 00:46, 12 November 2023 (UTC)

The point of an apparently weird query is the possibility of retrieving rare resources, even if they are blogs. When you have those blogs, you can then look for the same material in more authoritative sources. When you can't find anything, you don't know where to start. I have this problem with editing Le Touquet, a French seaside resort (for now in my sandbox). Even though the French article is an FA, it's a total mess to read. Sources about this town can be found, but there are pretty few of them and it takes a lot of queries to unearth them. With poor web engines, it becomes pretty much impossible.
Each Searx query gives vastly different results, and most of the results are really hit-and-miss or at best tangential. So if you want to include Searx, you have to find a specific instance that is, broadly speaking, configured best.
As for advocacy, Wikipedia does not advocate Google or its products to use. What ultimately matters for the encyclopedia is the possibility to unearth as many usable sources as possible and let editors then apply their wisdom to weed out whatever is not good enough. We don't really care how you got your info. So long as it is relevant, reliable and properly used, we are doing great. It's not an endorsement of Google in any way, just that data privacy ≠ utility. Violations of privacy are a concern, but forcing a product with potentially inferior search quality down the editors' throats for the sake of their data privacy is not the way to go. If they want to share data with Google or Microsoft or whatever, it's their choice, not ours to make. Doing anything else is IMHO anti-Google advocacy.
At least my viewpoint is that this source bar should only reflect the resources with most utility to editors. But that's my view.
Also, as a piece of advice on how to make an RfC, I propose that you show before and after options and ask which one people support. In your !vote, not in the question, you can outline all concerns you have about Google.

Szmenderowiecki (talk) 09:25, 12 November 2023 (UTC)[reply]

"As for advocacy, Wikipedia does not advocate Google or its products to use." On the contrary, in {{Find general sources}}, which is used on approximately 868,000 pages to recommend to people ways of finding sources, we are advocating for Google. Giving people a link to Google as the "main" engine and so many links to Google for specific types of questions is advocating for it.
The rest of your paragraph presents no arguments for why advocating is not advocating. I wouldn't say that we don't care how Wikipedians get their info, it's rather that we are not going to and cannot ask anyone to justify how they found the info; we know that we are likely to be biased by the dominant search engines, no matter what we recommend. Meta reviews of medical research papers must state very clearly their exact method of searching, so that others can judge the validity of the meta-review itself. Wikipedia in some sense does meta-reviews, but we cannot force people to say how they found the info; our method is (in this sense) inferior to that of Cochrane-style meta-reviews. Weeding out sources from a list that is already biased is going to miss the sources that are absent from the list. Someone looking at Google results will miss some results that are ranked low based on Google's advertising priorities and its detailed psychosocial profile of the person doing the search.
Privacy is what justifies using the privacy-protecting meta-search engines rather than Google or Bing directly. While the decision-making process for UCOC was controversial, the spirit of it is valid. Encouraging people to violate their privacy is contrary to the spirit of UCOC.
Claiming that Google necessarily gives the "best" search results gets back to the adjective "would-be totalitarian" that some people got upset about. The belief that Google is the only viable search engine is qualitative evidence consistent with Google being "would-be totalitarian". The evidence of Google paying tens of billions of dollars to Apple and Samsung is circumstantial evidence suggesting that Google is not the "best" search engine, since it is unlikely that Apple and Samsung would even consider choosing a default search engine that makes their customers consider their products inferior. Boud (talk) 12:29, 12 November 2023 (UTC)
Another way of saying what I find confusing on the issue of the word advocacy is that your argument seems to be that advocacy for Google is necessary based on utility; thus it is not advocacy. That doesn't make sense to me; state health authorities advocate the use of bicycle helmets and car seatbelts based on safety - this is advocacy for bike helmets and car seatbelts - it is advocacy. In any case, replacing the word "advocacy" by "recommendations" makes no significant semantic difference as I understand the words, but may have a better chance of achieving consensus, which is what we need.
Could you clarify what you mean by before and after options? Do you mean the current state of the module and template ("before", e.g. with {{oldid}} links) and what they might look like after modification? If I understand correctly, what you are proposing is a major rewrite of the initial post-statement proposed structure, which is obviously needed, given the concerns expressed. I had thought of waiting a bit longer for more feedback on the simplified initial statement/question, to see if there's a chance of consensus on it, and then get on to this part. Boud (talk) 20:51, 13 November 2023 (UTC)
I started on that part without waiting, but your thoughts (or direct edits) are welcome. Boud (talk) 21:25, 13 November 2023 (UTC)

Version oldid 1185294549

@Szmenderowiecki: Do you see any improvements to be made to the current version? Boud (talk) 21:21, 15 November 2023 (UTC)

I think that your question is kind of too open and doesn't really specify concretely what is to be changed. RfCs should state just that.
The best is a before-after comparison, and previous discussions that may relate to this topic for reference. People have to see you have engaged with other editors and, if someone questions you on that, explain why you ignored other people's advice.
Asking open-ended questions is a de facto invitation to speculation, demagoguery and generally a waste of our resources, plus RfCs with too many options may often require "run-off" RfCs, but do we really need two RfCs to resolve one issue? No.
So two, max three options (leave as it is, consensus option you worked out in your discussion with others and maybe the one you would like to see).
You can explain why you did that in your vote, and people will see what you thought, but it's only up to them to decide. Szmenderowiecki (talk) 21:38, 15 November 2023 (UTC)
The first question is very specific, about a specific module and a specific template. The second question is open, but it's about two specific Wikipedia locations, and the User:Boud/sandbox/draft RfC Reduce advocacy in Find sources Module#Additional comments below first statement and timestamp give quite concrete guidelines about what sort of concrete changes would make sense in the "yes" case. I don't understand what is too open here. Boud (talk) 22:14, 15 November 2023 (UTC)
The "if yes, what should it be" is the open part. I'd advise you to avoid it. The "clear guidelines" can nevertheless be interpreted in so many ways (i.e. what for one is a diverse enough option would be plainly not enough for others). It begs for a messy discussion, don't do it. Szmenderowiecki (talk) 22:23, 15 November 2023 (UTC)
I've simplified the text. The original was too complicated (the technical implementation [module vs template] does not matter at all, for the purpose of this proposed RFC), and some of the language was advocating for particular responses ("diversify", "notable"). WhatamIdoing (talk) 04:44, 18 November 2023 (UTC)
@WhatamIdoing: I guess you're right about removing the reference to the module. If there were consensus to replace link X by a link absent from the module, it would be uncontroversial for the techies to add that to the list of available modules. I think it still would be useful to add more search engines to the list in the module, since that would (if I understand right) make them available by adding some parameters to an invocation of find source, but I guess that would be an easier and separate proposal to make, that would not need RfC level.
I guess the other changes can't hurt either - non-Wikipedia-notable engines will probably have more difficulty convincing people, and the range of possible answers and motivations is widened, but I guess we'll see. This seems to be the spirit of RfCs. Boud (talk) 18:50, 21 November 2023 (UTC)
I agree. If someone wants to add a default-off option/parameter, then nobody else is likely to object. WhatamIdoing (talk) 19:48, 21 November 2023 (UTC)

Results of Nov/Dec 2023 discussion

Wikipedia:Village_pump (proposals)/Archive 209#Modify Module:Find sources/templates/Find general sources concerning the template {{Find sources}} and its source module:

  1. Add more individual sources - consensus against
  2. Remove all individual news outlets - strong consensus in favour; implemented (removal of NYT + AP)
  3. Replace the generic link - rough consensus against, but with acknowledgement that participants who supported replacing it cited concerns about user privacy and the potential systemic bias of Google's search engine, and that some alternative websites were suggested.

Given the ongoing decay in the quality of Google search results, which are expected to get much worse due to the LLM feedback loop, revisiting this issue in a few years' time might give different results. Boud (talk) 15:04, 3 March 2024 (UTC)