User:DexDor/Categorization


Categorization, if used correctly and consistently, may provide a useful facility for some Wikipedia editors and for some readers. However, placing articles in too many categories (overcategorization) can have negative consequences.

Ideally, the amount of categorization (i.e. the number of categories that each article is placed in) should be sufficient that the benefits of categorization can be realised whilst minimising the costs.

For most readers, and for many editors, categorization is an irrelevance.[1]

Benefits of categorization in Wikipedia

Uses of categorization:

  • Wikipedia categories can be used (by both readers and editors) to find an article on a specific topic (e.g. if they can't remember the name of the person they are looking for information about) or to navigate between articles. However, other facilities (e.g. templates) often provide a better way of doing this (particularly for readers).
  • Wikipedia categories provide a structured way for editors to look at all the pages of a particular type (e.g. articles on a particular subject) - including those articles that have few links from other articles (which are often articles that are in a poor state).
  • Categorization is a way for editors to spot duplicate articles and content forks.[2]

Costs of categorization in Wikipedia

Costs of categorization include:

  • Edits to the categorization of an article appear in the article's edit history and hence on the watchlist of those maintaining the article.[3] This "watchlist noise" causes extra work for those editors protecting the article against vandalism etc.[4][5]
  • Editor time is spent working on categorization, rather than on the text of articles.[6] This includes both maintenance of the category tags on articles[7] and maintenance of the category structure.[8] Several editors have asked whether that is worthwhile.[9][10]
  • Categories that are not fully populated or are badly organized may mislead readers/editors into thinking that Wikipedia does not have an article on the topic they are looking for.[11][12]
  • Readers/editors may assume that categorization is correct (when it isn't).[13]

Characteristics used for categorization

A city has lots of characteristics.

A Wikipedia article can contain hundreds, even thousands, of pieces of information - e.g. an article about a city may mention the city's opera house, football team etc. In theory, each of these could be a characteristic to categorize by (e.g. we could have a category for articles about "Cities that have had an openly gay mayor"[14]).[15] However, that sort of categorization could cause articles to be in hundreds of categories and require a huge amount of maintenance (on both the articles and the huge category trees that would result).[16] Instead, Wikipedia categorization[17] is based on categorizing articles only by the most important[18] characteristics of the topic of the article (plus a few categories required for administrative reasons). In Wikipedia these are called "defining characteristics". The exact meaning of that term will probably never be agreed by all editors, but the principle is generally accepted. So, for example, the article about a city is normally in a category like "Cities in <country>" and a few other categories for important long-term characteristics (like being a capital city or being on the coast); of the hundreds of facts in the article only a small number are used for categorization.

Problems

Editors interested in a particular topic tend (perhaps inevitably) to view characteristics that relate to that topic as being of particular importance. For example, an editor interested in time zones[19] may think that a city's time zone is one of its important characteristics. Other editors might be more interested in the types of public transport a city has, the ethnicity of its inhabitants, sporting events held in the city etc. (these are all examples of categories that have been deleted).[20] Some editors even place articles in a category despite the articles not mentioning the characteristic that the category is about.[21][22][23] If all these editors got their way then a link to "their" category would appear at the bottom of lots of articles, but it would be hidden amongst hundreds of other categories and thus unlikely to be used to navigate from the article.[24]

Sometimes editors start from an off-wiki list and try to add all the corresponding Wikipedia articles to a category, regardless of whether the article's contents show it meeting the inclusion criteria for the category. Some examples:

  • An editor categorizes articles based on results of a census, even though the article text doesn't contain the results from that census.[25]
  • An editor finds "Hutu" on a list of Māori plant names so places the Hutu article in that category even though that article is about an ethnic group in Africa.[26]

Similarly, editors adding articles to a category based on an off-wiki list may miss articles that meet the inclusion criteria, but aren't on their list. An example:

  • An editor puts articles about places into a category about a path (that goes through the places), but doesn't categorize an article about a tunnel that the path goes through.[27]

Some examples of overcategorization:

  • A person in 5 "descent" categories (i.e. categorizing the person based on the ancestry of a great-grandparent).[28]
  • A person in 7 "people from" categories.[29]
  • A person in 5 categories relating to their suicide.[30]

Consider, for example, an article about a singer. An editor interested in awards might look at the part of the article listing awards the person has received, create categories such as "Winners of <prize>" and place the article in those categories. Someone interested in festivals might place the article in categories for "People who performed at <festival>", someone interested in personal lives might add category tags for "People who have dated <person>", and so on. The list of categories would then be as long as the article. In fact it could be much longer, as there can be categories for combinations of characteristics; so the article might be in categories for both "Singers who performed at <festival>" and "People from <city> who performed at <festival>".

A common problem in Wikipedia is that wherever a list lacks a precise definition of what is eligible to be in it, (well-meaning) editors keep adding "just one more" item to the list.[31] This happens both with new categories[32] and with category tags on articles.[33]

Similar articles and related articles

Categories[34] are for grouping articles about similar topics; that is not quite the same thing as grouping articles about related topics. For example, an article about a soldier who was awarded a medal for his actions in a particular battle should be linked to articles about related topics (e.g. the article about the battle, his regiment, weapons used etc.), but in categorization his article should be grouped with articles about similar topics (i.e. other soldiers decorated for valour) even though there are few direct links between such articles.

Another example: Charles Darwin and HMS Beagle are related topics - the articles are linked to each other, but in categorization one belongs under people categories (e.g. Category:English naturalists) and the other belongs under ship categories (e.g. Category:Ships of the Royal Navy). In this case categorizing articles because they are related can lead to a category loop (Category:Charles Darwin and Category:HMS Beagle).[35]
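The category loop mentioned above can be pictured as a cycle in the graph of "category contains subcategory" links. The following is only an illustrative sketch, not a description of how Wikipedia or any of its tools checks for loops: the function name find_category_loop and the two small dictionaries are hypothetical, and the subcategory links in them are assumed for the sake of the example.

```python
from typing import Dict, List, Optional


def find_category_loop(subcats: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one loop as a list of names (first == last), or None if there is none.

    `subcats` maps a category name to the names of its direct subcategories.
    This is a plain depth-first search; the name and data shape are hypothetical.
    """
    UNSEEN, IN_PROGRESS, DONE = 0, 1, 2
    state: Dict[str, int] = {}
    path: List[str] = []  # categories on the current search path

    def visit(cat: str) -> Optional[List[str]]:
        state[cat] = IN_PROGRESS
        path.append(cat)
        for child in subcats.get(cat, []):
            child_state = state.get(child, UNSEEN)
            if child_state == IN_PROGRESS:
                # The child is already on the current path, so this is a loop.
                return path[path.index(child):] + [child]
            if child_state == UNSEEN:
                loop = visit(child)
                if loop is not None:
                    return loop
        path.pop()
        state[cat] = DONE
        return None

    for cat in list(subcats):
        if state.get(cat, UNSEEN) == UNSEEN:
            loop = visit(cat)
            if loop is not None:
                return loop
    return None


# Assumed data: each category has been placed inside the other because the
# topics are related.
related = {
    "Category:Charles Darwin": ["Category:HMS Beagle"],
    "Category:HMS Beagle": ["Category:Charles Darwin"],
}

# Assumed data: each category sits only under categories of similar topics.
similar = {
    "Category:English naturalists": ["Category:Charles Darwin"],
    "Category:Ships of the Royal Navy": ["Category:HMS Beagle"],
}

print(find_category_loop(related))
print(find_category_loop(similar))
```

With the first dictionary the search reports a loop that starts and ends at Category:Charles Darwin; with the second it reports none, because grouping by similarity keeps the structure a tree.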

Solutions

Possible solutions to some/all of the problems outlined in this essay include:

  • Delete the entire category system.
  • Delete bad categories that have been created.
  • Reduce the number of bad categories being created.

(Under construction)

Cost-benefit analysis of a category

Cost-benefit analysis for (particular types of) categories:

Costs

  • Clutter, making it harder to see the real defining categories (the main categories are normally at the start of the list of categories, but not always).
  • Watchlist noise (e.g. potentially hiding vandalism).
  • Encourages the creation of more bad categories.
  • Editor time could be spent more productively.[36]
  • Bandwidth for upload/download (minimal).

Specific topics

See also

Notes

  1. ^ E.g. "there are way too many categories in WP" [1], "Wikipedia would be better off, if it got rid of categorizing." [2], "categorizing should be gotten rid of" [3], "This kind of backs up what I found in researching page views for categories a few years back: readers don't use them. The main take-way I got from that is that categories aren't worth wasting editing time on." [4]
  2. ^ For example, the articles Col de la Madone transmitter and Col de la Madonne transmitter.
  3. ^ See, for example, edits to the Sue Gardner article (of the 16 edits in the month 12 are just changing categorization).
  4. ^ Note: There may, in the future, be technical solutions to this, e.g. in the way that interlanguage links (which also caused watchlist noise) were moved to Wikidata.
  5. ^ E.g. "My watchlist is groaning with your edits, most of which are the addition of various categories, which I do not consider worth the overhead. I do not want to see them, and I end up culling the number of articles on my watchlist more aggressively as a result, or just giving up on WP for a time. ..." (comment on a user talk page)
  6. ^ An example of an extreme case of adding a category tag instead of article text: "I added Chris to the College Republican category. I couldn't find a place to include this in the text of the article, so I thought the category would do. ..."[5].
  7. ^ Example edit that adds a category not supported by the article: [6]
  8. ^ Some examples of problem edits to categories: [7] and [8]
  9. ^ "I wonder whether they're worth the aggravation"[9]
  10. ^ "I [care] only about articles and their content and (perhaps too much) about correct wording/spelling/grammar. Things like lists and templates and categories are beyond my interest, partly because I don't understand them - nor do I see any need for them. And they all are obvious sources of confusion and even conflict." User:Jan olieslagers 10 June 2019 at Wikipedia_talk:WikiProject_Aviation#List_warrior
  11. ^ E.g. as of early January 2014, Category:World War II military vehicles of Germany contained only 3 articles (probably because of the similar Category:World War II German vehicles).
  12. ^ E.g. as of early February 2014, the category with text "... passenger-carrying ships designed, built, or operated in Denmark during the Cold War era (approximately 1945 to 1990)" contained just one article (CFD)
  13. ^ E.g. using italictitle on the Village article[10]
  14. ^ Or even "Cities where the mean temperature in January is 2°C"
  15. ^ Note: Under such a system the category tags on an article might present information in a more structured way than the text of an article, and this might enable some automated querying, but Wikipedia is primarily an encyclopedia aimed at human readers; it's not a knowledge base such as Wikidata.
  16. ^ Vandal-fighting if nothing else - categories are sometimes vandalised - e.g. [11], [12], [13], [14] and [15]. There are also edits to categories such as newbie editors trying to edit the category page to add someone to a category (e.g. [16], [17]).
  17. ^ Note: The categorization referred to here is the topic-based categorization, not categorization of talk pages, hidden categories etc.
  18. ^ "Important" refers to long-term encyclopedic importance so, for example, a person might consider that their most important characteristic is that they're a parent or their religious beliefs, but that's not (in most cases) why there is an encyclopedia article about them (see WP:NOTABILITY).
  19. ^ See Wikipedia:WikiProject Current Local City Time
  20. ^ Wikipedia:Categories_for_discussion/Log/2007_February_26#Category:Cities_in_the_UTC-5_timezone, Wikipedia:Categories_for_deletion/Log/2005_December_4#Category:Cities_with_significant_Arab_Israeli_populations, Wikipedia:Categories_for_discussion/Log/2007_January_3#Category:Cities_with_trolleybus_system, Wikipedia:Categories_for_discussion/Log/2013_June_4#Category:Host_cities_of_the_Olympic_Games
  21. ^ E.g. Wikipedia:Categories_for_discussion/Log/2013_April_4#Category:Geocaching_in_the_United_Kingdom and [18]
  22. ^ E.g. Ship of the line was placed in an English Civil War category with this edit [19] (the category was still there in 2014).
  23. ^ Wikipedia:Categories_for_discussion/Log/2019_May_4#Category:ASEMUS_museums deleted a category where not one article in the category mentioned the characteristic (example edit adding the category tag).
  24. ^ Is this an example of the tragedy of the commons?
  25. ^ "St. Francis County, Arkansas: Difference between revisions - Wikipedia".
  26. ^ [20] (the link to the African tribe also went into List of Māori plant common names when the category was listified).
  27. ^ See discussion at Wikipedia:Categories_for_discussion/Log/2013_September_21#Category:Capital_Ring
  28. ^ Chelsea Clinton - [21], Beyonce is in 5 descent categories as of Feb 2014. Some very specific descent categories have been created (example CFD)
  29. ^ Jerry Springer as of January 2014
  30. ^ Dan White - [22]
  31. ^ Example of an edit that added a 25th entry to a list of comparable vehicles.
  32. ^ E.g. categories for alumni leads to a category for people who were homeschooled (CFD1, CFD2).
  33. ^ example
  34. ^ Note: This is referring specifically to categories used for articles, but similar principles apply to other types of pages such as those used for Wikipedia administration.
  35. ^ Another example: Richard Nixon and Gerald Ford are similar topics (Republican US presidents of the 1970s) so they should be closely linked through the category system. Nixon and Watergate are closely related topics so they should be well linked in the article text (e.g. using a "main" tag), but they are not so similar that they need to be directly linked by categorization (although both topics fit under categories for US politics etc.). (related CFD).
  36. ^ E.g. from "Category talk:UK locations with ethnic minority-majority populations" (deleted by this CFD):

    I've made some additions. I thought that some of the little-known places in the Kirklees district needed a reference, so here is a table from the 2001 census http://www.kirklees.gov.uk/community/statistics/census-by-settlement/KS06settle2003.xls This includes Batley Carr, Lockwood, Mount Pleasant, Ravensthorpe and Thornton Lodge. Some might fret that the figures for Ravensthorpe shows 49.5% White population, so Whites are only just in a minority, but I am more than confident that the large numbers of Kurds who have settled in Ravensthorpe since 2003 would alter the figures for now so that there would be a clear ethnic minority-majority in Ravensthorpe. [Editor] (talk) 18:22, 27 November 2008 (UTC)