Talk:Variant Chinese characters

	Writing portal This article falls within the scope of WikiProject Writing systems, a WikiProject interested in improving the encyclopaedic coverage and content of articles relating to writing systems on Wikipedia. If you would like to help out, you are welcome to drop by the project page and/or leave a query at the project’s talk page.
High	This article has been rated as High-importance on the project's importance scale.

China Mid‑importance

	China portal This article is within the scope of WikiProject China, a collaborative effort to improve the coverage of China related articles on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
Mid	This article has been rated as Mid-importance on the project's importance scale.

Relevant image

A variant character of "國" used exclusively within Korea.

Would this image be relevant in improving the article? Just placing it here in case someone needs it. -- 李博杰 | —Talk contribs email guestbook complaints 13:44, 1 June 2009 (UTC)[reply]

Better definition?

Variants aren't really just allographs. People should be able to read all allographs of the same grapheme, but some people can't read certain variants, e.g. 㠯 is a variant of 以, but some people can't read it, so we can't call it an allograph. Can someone come up with a better definition for variants? Asoer (talk) 01:17, 29 January 2011 (UTC)[reply]

Question about Kangxi and Traditional characters in Taiwan

I would like to know whether there are any Chinese characters for which there is a significant difference between the KANGXI form and the current Taiwanese form. I am not talking about very small design differences like for 亡. Thank you! Maidodo (talk) 15:10, 7 May 2016 (UTC)[reply]

Hi Maidodo, I’ve produced and uploaded a graphic showing the character 望 for you. This character also contains 亡 which you mentioned, but the major difference is the “moon-flesh” element.

P.S. The rumour is that Korean Hanja are most similar to Kāngxī style Chinese characters because their shapes didn’t get re-standardized after the Kāngxī Dictionary was published. Love —LiliCharlie (talk) 17:09, 7 May 2016 (UTC)[reply]

P.P.S. Talking about “moon-flesh,” the Kāngxī Dictionary doesn’t observe different shapes of 月 as a radical, i.e. whether they are variants of “moon” or of “flesh” 肉 they look the same (as in Mainland China), not different (as in Taiwan). Love —LiliCharlie (talk) 17:20, 7 May 2016 (UTC)[reply]

Thank you so much LiliCharlie, I am very happy with your reply. I didn't know that such "big" differences exist. Could I ask you two additional things if you have time?

In the Wiktionary, I looked up 望. The form shown in the Translingual section is different both from 望 and from the KANGXI form. It looks more like the Japanese form, but I am pretty sure it is a Chinese font used in the template. Do you know which form is used in the Translingual section?
I seek a way to display the KANGXI forms with a combination of a Unicode character (or entity number) and a tag. Is it impossible? The reason is I would like to propose on the Japanese or French Wiktionary to display the Chinese characters in the following manner: KANGXI form (translingual); Taiwanese norm (tradit.); PRC norm (simpl.); variants with Japanese font. I have an issue with the KANGXI part, because I don't know how to display it.

--Maidodo (talk) 13:50, 8 May 2016 (UTC)[reply]

Hi Maidodo, the problem with Kāngxī display is that 1. hardly any (and certainly no common) fonts have been produced in this style and 2. there is no markup for a defined Kāngxī locale either, so in the “Translingual” (as in the “Chinese”) section of 望 on Wiktionary the character carries the markup lang=zh for (unspecified) Chinese, which will be displayed according to one’s default system/browser settings for Chinese. (On my machine it’s Simplified Chinese.)
Ken Lunde of Adobe has written a proposal to the UTC that would allow Unicode encoding of different CJKV locales in plain text (=using only Unicode characters without markup), and this proposal includes a Kāngxī (pseudo-)locale. For further details, see The “PanCJKV” IVD Collection—Unregistered and Proposal to accept the submission to register the “PanCJKV” IVD collection. — Because you mentioned tags: Please read Unicode’s web page Language Tagging to learn that language tag characters were not created for HTML and similar protocols rich in markup, therefore you can’t expect any browser to understand and handle them as such. Love —LiliCharlie (talk) 15:22, 8 May 2016 (UTC)[reply]
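
For anyone who wants to try the markup-based approach described above, here is a minimal sketch that wraps one character in HTML spans carrying different `lang` attributes. The list of language tags and the helper name `tagged_spans` are assumptions chosen for illustration, and, as noted above, there is no tag for a Kāngxī locale. How each span is actually rendered depends on the fonts installed and on the browser's settings.

```python
from html import escape

# Illustrative choice of BCP 47 language tags (an assumption, not a fixed list).
LOCALES = ["zh-Hans", "zh-Hant", "ja", "ko"]

def tagged_spans(char, locales=LOCALES):
    """Return one <span lang="..."> line per language tag for the given character."""
    return "\n".join(
        f'<span lang="{escape(tag)}">{escape(char)}</span>' for tag in locales
    )

print(tagged_spans("望"))
```

Pasting the output into an HTML page shows whether the local browser and fonts pick different glyphs per language tag.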

Thanks a lot LiliCharlie. Your explanation and references are really nice. I read the proposal. Some parts are too technical for me, but it is amazing. If I understand well, it could allow displaying in plain text mode all ideographs x 11 regions (including the Kāngxī pseudo-region). I am a bit confused with the "alias" concept, but, in this proposal, at the end, is it reasonable to say that the genuine Kāngxī forms (not the Korean ones) will be available? Because for 曜 the Korean form shows something close enough to the Kāngxī form, but with 望 it is not the case (to me): the first stroke of 王 is a significant difference, and there is also the form of 亡.

Thank you about your remark on tags. I should have said markup. --Maidodo (talk) 05:45, 9 May 2016 (UTC)[reply]

Hi Maidodo, thanks for your thanks and don’t worry about the "alias" concept. This has nothing to do with later implementation, but with the fonts that Ken Lunde has developed so far: currently Source Han Sans ≈ Noto Sans CJK exist in versions for Mainland China (CN), Taiwan (TW), South Korea (KR), and Japan (JP), so the font he uses to illustrate his proposal (SourceHanSansR11-Regular.otf) contains only glyphs from these four locales (the darker columns in this graphic). The other seven locales are currently illustrated by (frequently incorrect) “alias” glyphs taken en bloc from one of these four, for example the Kāngxī (XK) locale uses glyphs that actually reflect KR usage. You correctly observed that Korean 望 is different from Kāngxī 望, but in the illustrative test font that exists so far they look the same. Ken has plans to change this over time, and the next font will probably be the one for Hong Kong (HK), which he and his team will produce after the forthcoming HKSCS revision. Ken actually hopes the HK font will be done sometime this year, but it is not yet clear when the Kāngxī font (and a “PanCJKV” font containing genuine Kāngxī glyphs) will follow.

As far as display of Kāngxī is concerned, the timeline for actual usability in Wiki projects is unfortunately rather long. I suppose it will take years or a decade or even longer before all major operating systems ship with appropriate fonts (Kāngxī is certainly not on their priority list), and another 5–10 years before the vast majority of users are equipped with such an OS. Désolé ! Love —LiliCharlie (talk) 17:36, 9 May 2016 (UTC)

Thank you so much! I just hope it will not take 20 years ;) Until then, we will still rely on pictures / scans of the Kāngxī... --Maidodo (talk) 01:43, 10 May 2016 (UTC)[reply]

Right, Maidodo. I produced such graphics of the Kāngxī radicals in early 2014, and if you email me I will be pleased to send you a font containing about 17,000 scanned Kāngxī characters. Love —LiliCharlie (talk) 13:33, 10 May 2016 (UTC)[reply]

OR

@Verdy p please stop adding your uncited paragraph. I am well aware much of the content of the article is uncited—I didn't add any of it, but I had added the article's first citations, and I would like to make it a Good Article once I'm finished working on Chinese characters and Classical Chinese. You are directly working against policy and are pretty directly giving me more work to do in the future. You are required not to readd uncited content without a citation when asked to provide one. Please remove it. Remsense诉 17:44, 25 April 2024 (UTC)[reply]

Stop your completely lazy instant reverts. Everything is sourceable, but you removed even the sources and links I was adding. Nothing in this article is sourced, and it is full of assumptions everywhere (including the assumption that language tagging may work; it does not in this article with any modern browser). You have not added ANY citation in your article. We can work on improving this, but all you do is ONLY destructive and against policy. verdy_p (talk) 17:48, 25 April 2024 (UTC)[reply]

Again, I would like to work on this article and have added its only sourced content to date. Its present state is not a reason to make it worse. I'm likely the one who's going to have to cite every claim regardless. Remsense诉 17:49, 25 April 2024 (UTC)[reply]

WP:BURDEN. Remsense诉 17:50, 25 April 2024 (UTC)[reply]

This is what you don't want to see, but it is all true (and in fact essential to this article). I made this to really improve the article, which is defective:

Instead, the Unicode standard allows encoding these variants as variation sequences^[1], by appending a variation selector format control to the standard CJK unified ideograph (it also works directly inside plain text, without needing to use any rich format to select the appropriate language or script, and allows easier and more selective control when the same language/script combination needs several variants). The list of valid variation sequences is standardized by Unicode, defined in the Ideographic Variation Database (IVD)^[2]^[3], part of the Unicode Character Database, and it is extensible without encoding new code points in the UCS (and since the Unicode version where variation selectors were encoded, it is no longer necessary to encode any new compatibility ideograph to render them; the two blocks CJK Compatibility Ideographs and CJK Unified Ideographs Extension A in the BMP have been frozen since Unicode 4.1, except to fix a few past mistakes that were forgotten during the Han unification process for the review of normative sources).^[4]

verdy_p (talk) 17:53, 25 April 2024 (UTC)[reply]
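
To make the mechanism in the quoted paragraph concrete, here is a minimal sketch of how an ideographic variation sequence is put together in plain text: a base ideograph followed by one selector from the Variation Selectors Supplement (VS17 to VS256, U+E0100 to U+E01EF). The helper names are hypothetical, and the base character 葛 (U+845B) is only an illustrative choice. The code does not check whether a sequence is actually registered; that would require looking it up in the IVD's sequence list.

```python
# Variation selector ranges defined by Unicode.
VS_BASIC = range(0xFE00, 0xFE10)         # VS1..VS16
VS_SUPPLEMENT = range(0xE0100, 0xE01F0)  # VS17..VS256, used for ideographic variation sequences

def is_variation_selector(ch):
    """True if ch is any variation selector (VS1..VS256)."""
    cp = ord(ch)
    return cp in VS_BASIC or cp in VS_SUPPLEMENT

def make_ivs(base, selector_number):
    """Append VS<selector_number> (17..256) to a single base character."""
    if len(base) != 1:
        raise ValueError("base must be a single character")
    if not 17 <= selector_number <= 256:
        raise ValueError("selector number must be between 17 and 256")
    return base + chr(0xE0100 + selector_number - 17)

def strip_variation_selectors(text):
    """Remove all variation selectors, leaving only the base characters."""
    return "".join(ch for ch in text if not is_variation_selector(ch))

seq = make_ivs("\u845B", 17)  # 葛 followed by VS17 (U+E0100)
print([f"U+{ord(c):04X}" for c in seq])  # ['U+845B', 'U+E0100']
```

A renderer or font that does not know the sequence is expected to ignore the selector and show the default glyph, which is why stripping the selectors still leaves readable text.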

Thank you, that's all I ask for. Make sure all your claims are cited inline as such. Remsense 诉 17:54, 25 April 2024 (UTC)[reply]

But you reverted it, instead of improving it. That's why your revert was just too fast, and really lazy, just destructive and against policies. verdy_p (talk) 17:56, 25 April 2024 (UTC)[reply]

The additional statement about using language tagging is also essential: the table in the article just demonstrates that this does not work in practice (this was suggested many years ago, but since the adoption of the IVD, language tagging is deprecated in Unicode for this use; Unicode language tag characters have also long been deprecated).

Another trick is to use specific fonts that have different default mappings to their own preferred variant; however, it is not maintainable in the long term, as these fonts are also evolving to support more variants (and sometimes they even need to support several simultaneously in the same text, in the same language, according to the same standard source). This solution only works when tagging individual characters to select the expected variant; however, this severely limits the choice of font styles, and these specific fonts may not be widely available or may no longer be available. Additionally, it is too complex to maintain in large text corpora, as it requires rich-text tagging of encoded documents and does not allow easy reuse of their content, which is tied to a specific presentation and specific document format technologies.

Variation sequences have even been extended to more than just the IVD, they are also used for emojis whose development is very active.
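
The emoji case can be sketched the same way, assuming the standardized selectors VS15 (U+FE0E, text presentation) and VS16 (U+FE0F, emoji presentation). The helper name `with_presentation` is hypothetical, and U+2764 HEAVY BLACK HEART is just one base character that has both sequences defined.

```python
VS15 = "\uFE0E"  # requests text presentation
VS16 = "\uFE0F"  # requests emoji presentation

def with_presentation(base, emoji):
    """Return base with exactly one presentation selector appended."""
    base = base.rstrip(VS15 + VS16)  # drop any selector already present
    return base + (VS16 if emoji else VS15)

heart = "\u2764"  # HEAVY BLACK HEART
print(ascii(with_presentation(heart, emoji=True)))   # '\u2764\ufe0f'
print(ascii(with_presentation(heart, emoji=False)))  # '\u2764\ufe0e'
```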

Variation sequences are the reason why Unicode and ISO TC have stated that no new blocks will be added to the UCS for compatibility ideographs (the last time that significant sets of compatibility ideographs were encoded was in a now very old version of Unicode, many years ago, before the adoption of VS characters and standardization of the IVD): they are no longer needed, and language tagging was a bad interim solution. Variation sequences were the solution found for those who criticized the UCS encoding of Han ideographs (and attempted to develop their own glyph-based and largely unsourced competing encoding "standard", a failed attempt that was completely incompatible with the UCS and was left largely behind by the UCS evolution, with many more characters and variants forgotten). Since the adoption of the IVD, the Unicode and ISO charts are "sourcing" all standardized variants of CJK unified ideographs (including for all the many new extensions added after CJK Extension A in the BMP, and in the two added ideographic planes).

Variation sequences are supported today and work reliably for ideographs and emojis in modern browsers (which have removed their earlier, failed, experimental support using language tagging, either with the deprecated language tag characters or with rich-text language attributes, which were not really designed for selecting variants), with modern renderers (e.g. Chrome/Chromium/Edge), and with fonts that define glyphs for these standard IVS (e.g. "Noto Sans/Serif CJK SC/TC/JP/KR") and "emoji sequences".

Ideographic variation sequences even allow developing and deploying universal ideographic fonts that can support multiple source standards, instead of maintaining region-specific fonts. So it may be possible in the future to have a single "Noto Sans CJK" font working with SC/TC/JP/KR for a large common subset supported by active national standards (the default mapping table may be sensitive to the language/script selection, from user preferences, but variation sequences will allow specific rendering of all other variants interoperably). Additional fonts (e.g. "Noto Sans CJK 2") will then provide mappings for additional rare ideographs (that can't all fit in the base universal font, such as specific ideographs for uncommon place names or personal names, for specific concepts in science and technology, for possibly controversial terms in politics, sociology and social media, and for some popular vulgar terms), but also with the same system of variation sequences (if variants are still needed in their supported subset, and if these variants have at least one active open standard supporting them). verdy_p (talk) 18:00, 25 April 2024 (UTC)[reply]

^ "Variation Sequences; FAQ". Unicode Consortium.
^ "Ideographic Variation Database". Unicode Consortium.
^ "UTS #37, Unicode Ideographic Variation Database". Unicode Consortium.
^ "Unicode® Standard Annex #45, U-Source Ideograph". Unicode Consortium.
