Talk:Popularity of text encodings

This page was proposed for deletion by Thumperward (talk · contribs) on 13 March 2023.

	This article is within the scope of WikiProject Computing, a collaborative effort to improve the coverage of computers, computing, and information technology on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
	This article has not yet received a rating on the project's importance scale.

Writing systems

	This article falls within the scope of WikiProject Writing systems, a WikiProject interested in improving the encyclopaedic coverage and content of articles relating to writing systems on Wikipedia. If you would like to help out, you are welcome to drop by the project page and/or leave a query at the project's talk page.
	This article has not yet received a rating on the project's importance scale.

Typography

	This article is within the scope of WikiProject Typography, a collaborative effort to improve the coverage of articles related to Typography on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
	This article has not yet received a rating on the importance scale.

Proposed deletion

Probably OK to delete this. It was created to remove a block of bloat people kept adding to the UTF-8 page, mostly covering which encoding is the second most popular behind UTF-8 in various countries. The rest of the text is filler I added to try to give this article an actual subject. Something must be done to prevent people from re-adding all this to UTF-8, however. Spitzak (talk) 19:28, 16 May 2023 (UTC)

yep 217.174.52.77 (talk) 15:53, 5 August 2023 (UTC)

I think this is good as its own topic. The distribution of text encodings is an interesting subject. 50.46.252.164 (talk) 21:29, 12 September 2023 (UTC)

I don't favor deletion. The topic is important and, as pointed out, not appropriate to be shoved into the UTF-8 article.

However, I'm not very certain of the data quality. The figures cited for UTF-8 are a little higher than those in other sources I've seen, for example. 50.46.252.164 (talk) 21:34, 12 September 2023 (UTC)

The Cyrillic Comment about Being 2x as efficient as UTF-8 is misleading

The statement says that the native Cyrillic codepage is twice as efficient as UTF-8, but that most Cyrillic websites still use UTF-8 despite that.

However, website content consists largely of markup and tags that are not in the target language of the page, and that markup is usually almost all ASCII. So a Cyrillic web page as a whole is only slightly less efficient in UTF-8 than in a native codepage. The same holds for most scripts and languages when comparing UTF-8 with a native codepage. 50.46.252.164 (talk) 21:28, 12 September 2023 (UTC)
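A rough way to check the markup argument above is to encode the same page both ways and compare sizes. The Python sketch below is illustrative only: the sample page and the helper name `utf8_overhead` are invented here, and real pages usually carry far more ASCII markup, scripts, and CSS than this toy example, so the actual ratio will vary from site to site.

```python
# Hypothetical sample page, invented purely for illustration.
PAGE = (
    "<!DOCTYPE html>\n"
    '<html lang="ru">\n'
    "<head><title>Пример</title></head>\n"
    "<body>\n"
    '<div class="content"><p class="lead">Привет, мир!</p>\n'
    '<a href="/index.html" class="nav-link">Главная страница</a></div>\n'
    "</body>\n"
    "</html>\n"
)

# The visible Cyrillic text on its own, without any markup.
TEXT_ONLY = "Привет, мир! Главная страница"


def utf8_overhead(s: str) -> float:
    """Return the UTF-8 byte size divided by the windows-1251 byte size."""
    return len(s.encode("utf-8")) / len(s.encode("cp1251"))


if __name__ == "__main__":
    # Bare Cyrillic text approaches 2x; the full page sits much closer to 1x
    # because every ASCII markup byte costs the same in both encodings.
    print(f"text only:  {utf8_overhead(TEXT_ONLY):.2f}x")
    print(f"whole page: {utf8_overhead(PAGE):.2f}x")
```

The "twice as efficient" figure only applies to runs of pure Cyrillic letters; spaces, punctuation, and markup are one byte each in both encodings.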

The GB18030 statement is also misleading

Typically, Chinese webpages use GB2312/GBK, or in effect Windows code page 936, and not GB18030. 50.46.252.164 (talk) 21:32, 12 September 2023 (UTC)
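For context on how these encodings relate, here is a small Python sketch using the standard library codecs `gb2312`, `gbk`, and `gb18030`. The helper name `encode_all` and the sample strings are made up for this illustration. It only shows that common text produces the same bytes under all three, so the difference on a typical page is mostly which label is declared; it says nothing about how often each label actually appears on the web.

```python
def encode_all(s: str) -> dict:
    """Encode s with each GB-family codec; None means the codec cannot represent it."""
    result = {}
    for codec in ("gb2312", "gbk", "gb18030"):
        try:
            result[codec] = s.encode(codec)
        except UnicodeEncodeError:
            result[codec] = None
    return result


if __name__ == "__main__":
    # Common simplified Chinese text: identical two-byte codes in all three.
    print(encode_all("中文"))
    # A character outside GBK (an emoji here): only GB18030 can encode it,
    # using a four-byte sequence.
    print(encode_all("\U0001F600"))
```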

The Argument for UTF-8 over UTF-16 internally is subjective.

"Recently it has become clear that the overhead of translating from/to UTF-8 on input and output, and dealing with potential encoding errors in the input UTF-8, vastly overwhelms any savings UTF-16 could offer" seems to be an unsupported opinion.

For example, "dealing with potential encoding errors in the input UTF-8" is just words. If the input UTF-8 is corrupt, then natively handling UTF-8 will also have to deal with the corrupted UTF-8 stream.

Additionally, most character property processing libraries, such as ICU, depend on data tables that are UTF-16. If you want to sort a bunch of Unicode strings linguistically, you're going to be converting them to UTF-16 to discover the sort weights (or your library will need to do it for you). The same applies if you're interested in character properties or normalization of the strings.

UTF-8 is certainly a valid choice, and good for many applications. However, I find the statement "vastly overwhelms any savings UTF-16 could offer" to be narrow-minded. 50.46.252.164 (talk) 21:41, 12 September 2023 (UTC)
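Two of the points in this thread can be illustrated with a minimal Python sketch. The helper names `is_valid_utf8` and `sizes` and the sample strings are invented for this example. The first part shows that malformed input has to be handled whether the bytes are kept as UTF-8 or converted at the boundary; the second compares storage size, which depends heavily on the script. It does not measure conversion cost, so it does not settle the "vastly overwhelms" claim either way.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Check validity by attempting a strict decode."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def sizes(s: str) -> tuple:
    """Return (UTF-8 byte count, UTF-16 byte count without BOM)."""
    return len(s.encode("utf-8")), len(s.encode("utf-16-le"))


if __name__ == "__main__":
    # "caf" followed by the first byte of a truncated two-byte sequence.
    bad = b"caf\xc3"

    # Code that stores UTF-8 natively still has to detect this...
    print(is_valid_utf8(bad))

    # ...and code that converts on input still has to pick a policy,
    # for example substituting U+FFFD for the malformed part.
    print(repr(bad.decode("utf-8", errors="replace")))

    # Storage size by script: ASCII favors UTF-8, Cyrillic is a tie,
    # and CJK favors UTF-16.
    for sample in ("hello", "Привет", "中文"):
        print(sample, sizes(sample))
```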