This page primarily concerns the following concepts:
- Case mapping: the conversion performed by the toUpperCase and toLowerCase methods. (See UnicodeData.txt.) Importantly, this is not the same as case folding (used in JavaScript by Unicode-aware case-insensitive matching), which is a canonicalization process. It is also not the locale-sensitive conversion provided by the toLocaleUpperCase and toLocaleLowerCase methods.
- Uppercase and lowercase letters: characters with the Uppercase_Letter (Lu) or Lowercase_Letter (Ll) general category, respectively.
- Uppercase and lowercase characters: characters with the Uppercase or Lowercase property, respectively.

There's no great way to list all existing Unicode code points in JavaScript. However, most of these code points, even if they exist, are things like & which are (1) not letters, (2) not cased, (3) invariant under case conversion, and therefore not relevant to this discussion. The non-existent code points also check all these boxes and are therefore thrown out. So, we can simply list all the code points and check if they are either cased or case-variant. We call these code points "interesting".
At the time of writing (Unicode v17), there are 4632 such interesting code points.
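A minimal sketch of how such a list could be produced is shown below. The names `CASED`, `isInteresting`, and `interesting` are illustrative assumptions, not the original code, and "cased" is assumed to mean "has the Uppercase or Lowercase property, or the Lu or Ll general category". The resulting count depends on the Unicode version shipped with the JavaScript engine, so it may differ from the number quoted above.

```javascript
// Assumed definition of "cased": any of the four properties discussed above.
const CASED = /[\p{Lu}\p{Ll}\p{Uppercase}\p{Lowercase}]/u;

// A code point is "interesting" if it is cased or changes under case mapping.
function isInteresting(ch) {
  const caseVariant = ch.toUpperCase() !== ch || ch.toLowerCase() !== ch;
  return CASED.test(ch) || caseVariant;
}

// Brute force over the whole code space; unassigned code points are
// uncased and invariant, so they drop out automatically.
const interesting = [];
for (let cp = 0; cp <= 0x10ffff; cp++) {
  const ch = String.fromCodePoint(cp);
  if (isInteresting(ch)) interesting.push(ch);
}

console.log(interesting.length); // engine-dependent
```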
But that's not the end of the story. Some single-code-point characters map to multiple code points under case conversion, which can then be further case-converted, and so on. (We'll look at these more closely later.) So, we also need to include all the strings that are reachable from these 4632 code points by applying toUpperCase and toLowerCase repeatedly until we reach a fixed point.
This brings us to a total of 4778 strings. To confirm that this set is closed under case conversion:
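One way to build the closed set and then confirm closure is sketched here, under the same assumed definition of "interesting" as above. The names `candidates`, `seedCount`, and `escapes` are hypothetical. The sketch relies on the fact that a `Set` iterator also visits elements added during iteration, so a single loop reaches the fixed point.

```javascript
const CASED = /[\p{Uppercase}\p{Lowercase}]/u;

// Seed set: single code points that are cased or case-variant.
const candidates = new Set();
for (let cp = 0; cp <= 0x10ffff; cp++) {
  const ch = String.fromCodePoint(cp);
  if (CASED.test(ch) || ch.toUpperCase() !== ch || ch.toLowerCase() !== ch) {
    candidates.add(ch);
  }
}
const seedCount = candidates.size;

// Fixed point: newly added strings are visited by the same loop.
for (const s of candidates) {
  candidates.add(s.toUpperCase());
  candidates.add(s.toLowerCase());
}

// Closure check: no candidate may map to something outside the set.
const escapes = [...candidates].filter(
  (s) => !candidates.has(s.toUpperCase()) || !candidates.has(s.toLowerCase()),
);

console.log(seedCount, candidates.size, escapes.length);
```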
By definition, "uninteresting" characters cannot change into interesting characters because they are invariant. Can an interesting character change into an uninteresting one? We check whether all of the candidates are still interesting.
So the candidates set remains fully interesting; the only difference is that we have added some multi-code point strings that are reachable from interesting code points by case conversion.
The Uppercase_Letter and Lowercase_Letter general categories are supposedly subsets of the Uppercase and Lowercase properties, respectively. Furthermore a character cannot be simultaneously uppercase and lowercase. We shall verify this:
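The check can be sketched as a brute-force scan over all code points (a superset of the candidates). The `props` object and `violations` array are illustrative names, not taken from the original code.

```javascript
const props = {
  Lu: /\p{Lu}/u,
  Ll: /\p{Ll}/u,
  Uppercase: /\p{Uppercase}/u,
  Lowercase: /\p{Lowercase}/u,
};

// Collect every code point that breaks one of the three claims:
// Lu implies Uppercase, Ll implies Lowercase, never both Uppercase and Lowercase.
const violations = [];
for (let cp = 0; cp <= 0x10ffff; cp++) {
  const ch = String.fromCodePoint(cp);
  const lu = props.Lu.test(ch);
  const ll = props.Ll.test(ch);
  const upper = props.Uppercase.test(ch);
  const lower = props.Lowercase.test(ch);
  if ((lu && !upper) || (ll && !lower) || (upper && lower)) {
    violations.push(cp);
  }
}

console.log(violations.length);
```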
Indeed this is true. Other than the 4778 - 4632 = 146 multi-code-point strings, the remaining ones may have a maximum of 2^4 = 16 combinations of these 4 boolean properties, but since Uppercase_Letter implies Uppercase, Lowercase_Letter implies Lowercase, Uppercase implies not Lowercase and vice versa, we are left with only 5 valid combinations:
| Uppercase_Letter | Lowercase_Letter | Uppercase | Lowercase | Category | Count |
|---|---|---|---|---|---|
| 1 | 0 | 1 | 0 | Uppercase letter | 1886 |
| 0 | 1 | 0 | 1 | Lowercase letter | 2283 |
| 0 | 0 | 1 | 0 | Uppercase non-letter | 120 |
| 0 | 0 | 0 | 1 | Lowercase non-letter | 312 |
| 0 | 0 | 0 | 0 | Uncased | 31 |
Here's the data structure we are going to be working with: a mapping graph. The node set is the candidates set. Each node has exactly two outgoing edges, one labeled toUpperCase and the other labeled toLowerCase, which point to the nodes corresponding to the result of applying the respective case conversion method to the node. Edges can be self-referential, and there can be any number of incoming edges.
Nodes in this graph can be classified into the following categories:
It's apparent that this graph isn't going to be very connected: it's going to have many isolated components. We identify these components by their shapes.
To begin with, we build a reverse graph for easy lookup:
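A small sketch of both the forward and the reverse graph follows. It uses a handful of hand-picked seeds instead of the full candidates set, and the names `seeds`, `nodes`, `graph`, and `reverse` are assumptions made for illustration.

```javascript
// Hypothetical seed list; the real analysis uses the full candidates set.
const seeds = [
  "a",
  "\u00DF", // ß
  "\u1E9E", // ẞ
  "\u01C5", // ǅ
  "\u00B5", // µ (micro sign)
  "\u00AA", // ª
];

// Close the seeds under both mappings (Set iterators see later additions).
const nodes = new Set(seeds);
for (const s of nodes) {
  nodes.add(s.toUpperCase());
  nodes.add(s.toLowerCase());
}

// Forward graph: every node has exactly two outgoing edges.
const graph = new Map();
for (const s of nodes) {
  graph.set(s, { upper: s.toUpperCase(), lower: s.toLowerCase() });
}

// Reverse graph: for each node, the other nodes that map to it.
// Self-loops are skipped so that only real incoming edges are recorded.
const reverse = new Map([...nodes].map((s) => [s, { upper: [], lower: [] }]));
for (const [s, { upper, lower }] of graph) {
  if (upper !== s) reverse.get(upper).upper.push(s);
  if (lower !== s) reverse.get(lower).lower.push(s);
}

console.log(reverse.get("SS"));
```

Looking up `reverse.get("SS")` lists the nodes whose toUpperCase edge lands on "SS", which is what makes later questions such as "does this node have incoming edges?" cheap to answer.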
There are some nodes that are both toUpperCase-invariant and toLowerCase-invariant. These are still interesting because they are cased, but they have 2 self-loop edges. Furthermore, all of them are isolated, meaning that no other node points to them.
There are 1651 such isolated nodes. Among these, there are 499 uppercase letters, 805 lowercase letters, 78 uppercase non-letters, and 269 lowercase non-letters.
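The isolated nodes and their breakdown could be computed roughly as follows. `categoryOf`, `isolated`, `targets`, and `tally` are hypothetical names, and the counts printed depend on the engine's Unicode version. Only single code points are scanned, on the assumption that case mapping never shortens a string, so a multi-code-point string cannot point at a single code point.

```javascript
const CASED = /[\p{Uppercase}\p{Lowercase}]/u;

// Order matters: the letter categories are tested before the broader properties.
const CATEGORY_TESTS = [
  ["uppercase letter", /\p{Lu}/u],
  ["lowercase letter", /\p{Ll}/u],
  ["uppercase non-letter", /\p{Uppercase}/u],
  ["lowercase non-letter", /\p{Lowercase}/u],
];
const categoryOf = (ch) =>
  CATEGORY_TESTS.find(([, re]) => re.test(ch))?.[0] ?? "uncased";

const isolated = []; // cased and invariant under both mappings
const targets = new Set(); // everything reached through a non-identity edge
for (let cp = 0; cp <= 0x10ffff; cp++) {
  const ch = String.fromCodePoint(cp);
  const up = ch.toUpperCase();
  const lo = ch.toLowerCase();
  if (up !== ch) targets.add(up);
  if (lo !== ch) targets.add(lo);
  if (up === ch && lo === ch && CASED.test(ch)) isolated.push(ch);
}

const tally = {};
for (const ch of isolated) {
  const category = categoryOf(ch);
  tally[category] = (tally[category] ?? 0) + 1;
}

// Doubly invariant nodes that some other node points to (expected to be none).
const withIncoming = isolated.filter((ch) => targets.has(ch));
console.log(isolated.length, tally, withIncoming.length);
```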
Presumably most of the nodes satisfy the following:
- The uppercase's toLowerCase edge points to the lowercase, and the lowercase's toUpperCase edge points to the uppercase.

There are 1496 such pairs; together with the isolated nodes, they account for 4643 nodes, which is 97.2% of the entire graph's 4778 nodes. Furthermore, each of these pairs is either an uppercase/lowercase letter pair (1381 pairs), an uppercase/lowercase non-letter pair (42 pairs), or a pair in which both members have multiple code points (73 pairs).
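Pair detection can be sketched as below. `buildCandidates` is a hypothetical helper that rebuilds the closed candidates set under the assumed definition of "interesting", and `pairs` is an illustrative name. Each pair is recorded once, from its uppercase side.

```javascript
// Hypothetical helper: interesting code points, closed under both mappings.
function buildCandidates() {
  const cased = /[\p{Uppercase}\p{Lowercase}]/u;
  const set = new Set();
  for (let cp = 0; cp <= 0x10ffff; cp++) {
    const ch = String.fromCodePoint(cp);
    if (cased.test(ch) || ch.toUpperCase() !== ch || ch.toLowerCase() !== ch) {
      set.add(ch);
    }
  }
  // Set iterators also visit elements added during iteration.
  for (const s of set) set.add(s.toUpperCase()).add(s.toLowerCase());
  return set;
}

const candidates = buildCandidates();

// u and l form a pair when u lowercases to a different string l
// and l uppercases back to u.
const pairs = [];
for (const u of candidates) {
  const l = u.toLowerCase();
  if (l !== u && l.toUpperCase() === u) pairs.push([u, l]);
}

console.log(pairs.length); // engine-dependent
```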
There cannot exist cycle kinds other than isolated nodes and pairs. If a triangle exists: A → B → C → A, then two of these three adjacent edges must be labeled the same, violating idempotence. (Note that this is about a circular triangle; there can be a non-circular triangle, where A uppercases to B and lowercases to C, while B and C form a pair. We'll see these later.) There could theoretically be other even-numbered cycles (A uppercases to B, B lowercases to C, C uppercases to D, D lowercases to A), but that would be exceedingly unlikely. Therefore, the remaining 135 nodes must eventually arrive at one of these pairs (since the isolated nodes don't have incoming edges) via some single-directional path.
Since these 135 nodes must be pointing to some pair via a single-directional path, we just traverse each one and see where we end up.
There are two kinds of cases here: (1) the node is invariant under one mapping; (2) the node is variant under both mappings. In the first case, there's one simple path; in the second case, there are two paths that may eventually reach different places. However, we can be sure about one thing: a node that is variant under both mappings must have no incoming edges, because the result of a case conversion is always invariant under that same conversion, and an incoming edge would break this idempotence. So we just need to find all nodes with no incoming edges, and then traverse them.
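The traversal might look like the following sketch. `buildCandidates`, `hasIncoming`, `sources`, and `walk` are hypothetical names, and the candidates set is rebuilt under the assumed definition of "interesting".

```javascript
// Hypothetical helper: interesting code points, closed under both mappings.
function buildCandidates() {
  const cased = /[\p{Uppercase}\p{Lowercase}]/u;
  const set = new Set();
  for (let cp = 0; cp <= 0x10ffff; cp++) {
    const ch = String.fromCodePoint(cp);
    if (cased.test(ch) || ch.toUpperCase() !== ch || ch.toLowerCase() !== ch) {
      set.add(ch);
    }
  }
  for (const s of set) set.add(s.toUpperCase()).add(s.toLowerCase());
  return set;
}

const candidates = buildCandidates();

// Nodes that are the target of at least one non-identity edge.
const hasIncoming = new Set();
for (const s of candidates) {
  const up = s.toUpperCase();
  const lo = s.toLowerCase();
  if (up !== s) hasIncoming.add(up);
  if (lo !== s) hasIncoming.add(lo);
}

// Sources: no incoming edges, but at least one non-trivial outgoing edge
// (this excludes the isolated nodes).
const sources = [...candidates].filter(
  (s) => !hasIncoming.has(s) && (s.toUpperCase() !== s || s.toLowerCase() !== s),
);

// Breadth-first walk along both edge kinds, recording nodes in visit order.
function walk(start) {
  const seen = [start];
  const queue = [start];
  while (queue.length > 0) {
    const s = queue.shift();
    for (const t of [s.toUpperCase(), s.toLowerCase()]) {
      if (!seen.includes(t)) {
        seen.push(t);
        queue.push(t);
      }
    }
  }
  return seen;
}

console.log(sources.length); // engine-dependent
console.log(walk("\u1E9E")); // nodes reachable from ẞ
```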
The results form a few obvious groups:
Latin-1 Supplement, Latin Extended-A, Greek and Coptic, Cyrillic Extended-C, Latin Extended Additional, Greek Extended (1+2+8+9+1+1=22 groups, 22 non-pair extra nodes)
An extra alias for the lowercase form: A → B, B ↔ C. A is a lowercase letter, while B and C are an upper/lowercase letter pair.
| A | B | C |
|---|---|---|
| µ (U+00B5) | Μ (U+039C) | μ (U+03BC) |
| ı (U+0131) | I (U+0049) | i (U+0069) |
| ſ (U+017F) | S (U+0053) | s (U+0073) |
| ς (U+03C2) | Σ (U+03A3) | σ (U+03C3) |
| ϐ (U+03D0) | Β (U+0392) | β (U+03B2) |
| ϑ (U+03D1) | Θ (U+0398) | θ (U+03B8) |
| ϕ (U+03D5) | Φ (U+03A6) | φ (U+03C6) |
| ϖ (U+03D6) | Π (U+03A0) | π (U+03C0) |
| ϰ (U+03F0) | Κ (U+039A) | κ (U+03BA) |
| ϱ (U+03F1) | Ρ (U+03A1) | ρ (U+03C1) |
| ϵ (U+03F5) | Ε (U+0395) | ε (U+03B5) |
| ᲀ (U+1C80) | В (U+0412) | в (U+0432) |
| ᲁ (U+1C81) | Д (U+0414) | д (U+0434) |
| ᲂ (U+1C82) | О (U+041E) | о (U+043E) |
| ᲃ (U+1C83) | С (U+0421) | с (U+0441) |
| ᲄ (U+1C84) | Т (U+0422) | т (U+0442) |
| ᲅ (U+1C85) | Т (U+0422) | т (U+0442) |
| ᲆ (U+1C86) | Ъ (U+042A) | ъ (U+044A) |
| ᲇ (U+1C87) | Ѣ (U+0462) | ѣ (U+0463) |
| ᲈ (U+1C88) | Ꙋ (U+A64A) | ꙋ (U+A64B) |
| ẛ (U+1E9B) | Ṡ (U+1E60) | ṡ (U+1E61) |
| ι (U+1FBE) | Ι (U+0399) | ι (U+03B9) |
Latin Extended-A, Latin Extended-B, Greek and Coptic, Armenian, Latin Extended Additional, Greek Extended, Alphabetic Presentation Forms (1+1+2+1+5+25+12=47 groups, 47 non-pair extra nodes)
An extra alias for the lowercase form: A → B, B ↔ C. A is a lowercase letter, while B and C are multi-code-point strings.
| A | B | C |
|---|---|---|
| ʼn (U+0149) | ʼN (U+02BC U+004E) | ʼn (U+02BC U+006E) |
| ǰ (U+01F0) | J̌ (U+004A U+030C) | ǰ (U+006A U+030C) |
| ΐ (U+0390) | Ϊ́ (U+0399 U+0308 U+0301) | ΐ (U+03B9 U+0308 U+0301) |
| ΰ (U+03B0) | Ϋ́ (U+03A5 U+0308 U+0301) | ΰ (U+03C5 U+0308 U+0301) |
| և (U+0587) | ԵՒ (U+0535 U+0552) | եւ (U+0565 U+0582) |
| ẖ (U+1E96) | H̱ (U+0048 U+0331) | ẖ (U+0068 U+0331) |
| ẗ (U+1E97) | T̈ (U+0054 U+0308) | ẗ (U+0074 U+0308) |
| ẘ (U+1E98) | W̊ (U+0057 U+030A) | ẘ (U+0077 U+030A) |
| ẙ (U+1E99) | Y̊ (U+0059 U+030A) | ẙ (U+0079 U+030A) |
| ẚ (U+1E9A) | Aʾ (U+0041 U+02BE) | aʾ (U+0061 U+02BE) |
| ὐ (U+1F50) | Υ̓ (U+03A5 U+0313) | ὐ (U+03C5 U+0313) |
| ὒ (U+1F52) | Υ̓̀ (U+03A5 U+0313 U+0300) | ὒ (U+03C5 U+0313 U+0300) |
| ὔ (U+1F54) | Υ̓́ (U+03A5 U+0313 U+0301) | ὔ (U+03C5 U+0313 U+0301) |
| ὖ (U+1F56) | Υ̓͂ (U+03A5 U+0313 U+0342) | ὖ (U+03C5 U+0313 U+0342) |
| ᾲ (U+1FB2) | ᾺΙ (U+1FBA U+0399) | ὰι (U+1F70 U+03B9) |
| ᾴ (U+1FB4) | ΆΙ (U+0386 U+0399) | άι (U+03AC U+03B9) |
| ᾶ (U+1FB6) | Α͂ (U+0391 U+0342) | ᾶ (U+03B1 U+0342) |
| ᾷ (U+1FB7) | Α͂Ι (U+0391 U+0342 U+0399) | ᾶι (U+03B1 U+0342 U+03B9) |
| ῂ (U+1FC2) | ῊΙ (U+1FCA U+0399) | ὴι (U+1F74 U+03B9) |
| ῄ (U+1FC4) | ΉΙ (U+0389 U+0399) | ήι (U+03AE U+03B9) |
| ῆ (U+1FC6) | Η͂ (U+0397 U+0342) | ῆ (U+03B7 U+0342) |
| ῇ (U+1FC7) | Η͂Ι (U+0397 U+0342 U+0399) | ῆι (U+03B7 U+0342 U+03B9) |
| ῒ (U+1FD2) | Ϊ̀ (U+0399 U+0308 U+0300) | ῒ (U+03B9 U+0308 U+0300) |
| ΐ (U+1FD3) | Ϊ́ (U+0399 U+0308 U+0301) | ΐ (U+03B9 U+0308 U+0301) |
| ῖ (U+1FD6) | Ι͂ (U+0399 U+0342) | ῖ (U+03B9 U+0342) |
| ῗ (U+1FD7) | Ϊ͂ (U+0399 U+0308 U+0342) | ῗ (U+03B9 U+0308 U+0342) |
| ῢ (U+1FE2) | Ϋ̀ (U+03A5 U+0308 U+0300) | ῢ (U+03C5 U+0308 U+0300) |
| ΰ (U+1FE3) | Ϋ́ (U+03A5 U+0308 U+0301) | ΰ (U+03C5 U+0308 U+0301) |
| ῤ (U+1FE4) | Ρ̓ (U+03A1 U+0313) | ῤ (U+03C1 U+0313) |
| ῦ (U+1FE6) | Υ͂ (U+03A5 U+0342) | ῦ (U+03C5 U+0342) |
| ῧ (U+1FE7) | Ϋ͂ (U+03A5 U+0308 U+0342) | ῧ (U+03C5 U+0308 U+0342) |
| ῲ (U+1FF2) | ῺΙ (U+1FFA U+0399) | ὼι (U+1F7C U+03B9) |
| ῴ (U+1FF4) | ΏΙ (U+038F U+0399) | ώι (U+03CE U+03B9) |
| ῶ (U+1FF6) | Ω͂ (U+03A9 U+0342) | ῶ (U+03C9 U+0342) |
| ῷ (U+1FF7) | Ω͂Ι (U+03A9 U+0342 U+0399) | ῶι (U+03C9 U+0342 U+03B9) |
| ff (U+FB00) | FF (U+0046 U+0046) | ff (U+0066 U+0066) |
| fi (U+FB01) | FI (U+0046 U+0049) | fi (U+0066 U+0069) |
| fl (U+FB02) | FL (U+0046 U+004C) | fl (U+0066 U+006C) |
| ffi (U+FB03) | FFI (U+0046 U+0046 U+0049) | ffi (U+0066 U+0066 U+0069) |
| ffl (U+FB04) | FFL (U+0046 U+0046 U+004C) | ffl (U+0066 U+0066 U+006C) |
| ſt (U+FB05) | ST (U+0053 U+0054) | st (U+0073 U+0074) |
| st (U+FB06) | ST (U+0053 U+0054) | st (U+0073 U+0074) |
| ﬓ (U+FB13) | ՄՆ (U+0544 U+0546) | մն (U+0574 U+0576) |
| ﬔ (U+FB14) | ՄԵ (U+0544 U+0535) | մե (U+0574 U+0565) |
| ﬕ (U+FB15) | ՄԻ (U+0544 U+053B) | մի (U+0574 U+056B) |
| ﬖ (U+FB16) | ՎՆ (U+054E U+0546) | վն (U+057E U+0576) |
| ﬗ (U+FB17) | ՄԽ (U+0544 U+053D) | մխ (U+0574 U+056D) |
Combining Diacritical Marks (1 group, 1 non-pair extra node)
An extra alias for the lowercase form: A → B, B ↔ C. A is a lowercase non-letter (the combining ypogegrammeni, which has the Lowercase property but is not a letter), while B and C are an upper/lowercase letter pair.
| A | B | C |
|---|---|---|
| ◌ͅ (U+0345) | Ι (U+0399) | ι (U+03B9) |
Greek and Coptic, Letterlike Symbols (1+3=4 groups, 4 non-pair extra nodes)
An extra alias for the uppercase form: A → C, B ↔ C. A is an uppercase letter, while B and C are an upper/lowercase letter pair.
| A | B | C |
|---|---|---|
| ϴ (U+03F4) | Θ (U+0398) | θ (U+03B8) |
| Ω (U+2126) | Ω (U+03A9) | ω (U+03C9) |
| K (U+212A) | K (U+004B) | k (U+006B) |
| Å (U+212B) | Å (U+00C5) | å (U+00E5) |
Latin Extended-A (1 group, 1 non-pair extra node)
An extra alias for the uppercase form: A → C, B ↔ C. A is an uppercase letter, while B and C are multi-code-point strings.
| A | B | C |
|---|---|---|
| İ (U+0130) | İ (U+0049 U+0307) | i̇ (U+0069 U+0307) |
Latin Extended-B (4 groups, 4 non-pair extra nodes)
4 triangles of the form A → B, A → C, B ↔ C. A is uncased (contains a capital and a small letter in a ligature), while B and C are an upper/lowercase letter pair.
| A | B | C |
|---|---|---|
| Dž (U+01C5) | DŽ (U+01C4) | dž (U+01C6) |
| Lj (U+01C8) | LJ (U+01C7) | lj (U+01C9) |
| Nj (U+01CB) | NJ (U+01CA) | nj (U+01CC) |
| Dz (U+01F2) | DZ (U+01F1) | dz (U+01F3) |
Latin-1 Supplement and Latin Extended Additional (1 group, 2 non-pair extra nodes)
A path of the form A → B, B → C, C ↔ D. A is an uppercase letter, B is a lowercase letter, while C and D are both multi-code-point strings.
| A | B | C | D |
|---|---|---|---|
| ẞ (U+1E9E) | ß (U+00DF) | SS (U+0053 U+0053) | ss (U+0073 U+0073) |
This seems to be an unfortunate historical artifact: had ß been transformed by toUpperCase into ẞ instead of SS, we would have two trivial pairs.
Greek Extended (27 groups, 54 non-pair extra nodes)
27 pair-and-triangle groups of the form A → B, A → C, B → C, C ↔ D. A is uncased (contains a capital letter and a prosgegrammeni), B is a lowercase letter (contains a lowercase letter and a ypogegrammeni), while C and D are both multi-code-point strings, where the ypogegrammeni is replaced by a real iota.
| A | B | C | D |
|---|---|---|---|
| ᾈ (U+1F88) | ᾀ (U+1F80) | ἈΙ (U+1F08 U+0399) | ἀι (U+1F00 U+03B9) |
| ᾉ (U+1F89) | ᾁ (U+1F81) | ἉΙ (U+1F09 U+0399) | ἁι (U+1F01 U+03B9) |
| ᾊ (U+1F8A) | ᾂ (U+1F82) | ἊΙ (U+1F0A U+0399) | ἂι (U+1F02 U+03B9) |
| ᾋ (U+1F8B) | ᾃ (U+1F83) | ἋΙ (U+1F0B U+0399) | ἃι (U+1F03 U+03B9) |
| ᾌ (U+1F8C) | ᾄ (U+1F84) | ἌΙ (U+1F0C U+0399) | ἄι (U+1F04 U+03B9) |
| ᾍ (U+1F8D) | ᾅ (U+1F85) | ἍΙ (U+1F0D U+0399) | ἅι (U+1F05 U+03B9) |
| ᾎ (U+1F8E) | ᾆ (U+1F86) | ἎΙ (U+1F0E U+0399) | ἆι (U+1F06 U+03B9) |
| ᾏ (U+1F8F) | ᾇ (U+1F87) | ἏΙ (U+1F0F U+0399) | ἇι (U+1F07 U+03B9) |
| ᾘ (U+1F98) | ᾐ (U+1F90) | ἨΙ (U+1F28 U+0399) | ἠι (U+1F20 U+03B9) |
| ᾙ (U+1F99) | ᾑ (U+1F91) | ἩΙ (U+1F29 U+0399) | ἡι (U+1F21 U+03B9) |
| ᾚ (U+1F9A) | ᾒ (U+1F92) | ἪΙ (U+1F2A U+0399) | ἢι (U+1F22 U+03B9) |
| ᾛ (U+1F9B) | ᾓ (U+1F93) | ἫΙ (U+1F2B U+0399) | ἣι (U+1F23 U+03B9) |
| ᾜ (U+1F9C) | ᾔ (U+1F94) | ἬΙ (U+1F2C U+0399) | ἤι (U+1F24 U+03B9) |
| ᾝ (U+1F9D) | ᾕ (U+1F95) | ἭΙ (U+1F2D U+0399) | ἥι (U+1F25 U+03B9) |
| ᾞ (U+1F9E) | ᾖ (U+1F96) | ἮΙ (U+1F2E U+0399) | ἦι (U+1F26 U+03B9) |
| ᾟ (U+1F9F) | ᾗ (U+1F97) | ἯΙ (U+1F2F U+0399) | ἧι (U+1F27 U+03B9) |
| ᾨ (U+1FA8) | ᾠ (U+1FA0) | ὨΙ (U+1F68 U+0399) | ὠι (U+1F60 U+03B9) |
| ᾩ (U+1FA9) | ᾡ (U+1FA1) | ὩΙ (U+1F69 U+0399) | ὡι (U+1F61 U+03B9) |
| ᾪ (U+1FAA) | ᾢ (U+1FA2) | ὪΙ (U+1F6A U+0399) | ὢι (U+1F62 U+03B9) |
| ᾫ (U+1FAB) | ᾣ (U+1FA3) | ὫΙ (U+1F6B U+0399) | ὣι (U+1F63 U+03B9) |
| ᾬ (U+1FAC) | ᾤ (U+1FA4) | ὬΙ (U+1F6C U+0399) | ὤι (U+1F64 U+03B9) |
| ᾭ (U+1FAD) | ᾥ (U+1FA5) | ὭΙ (U+1F6D U+0399) | ὥι (U+1F65 U+03B9) |
| ᾮ (U+1FAE) | ᾦ (U+1FA6) | ὮΙ (U+1F6E U+0399) | ὦι (U+1F66 U+03B9) |
| ᾯ (U+1FAF) | ᾧ (U+1FA7) | ὯΙ (U+1F6F U+0399) | ὧι (U+1F67 U+03B9) |
| ᾼ (U+1FBC) | ᾳ (U+1FB3) | ΑΙ (U+0391 U+0399) | αι (U+03B1 U+03B9) |
| ῌ (U+1FCC) | ῃ (U+1FC3) | ΗΙ (U+0397 U+0399) | ηι (U+03B7 U+03B9) |
| ῼ (U+1FFC) | ῳ (U+1FF3) | ΩΙ (U+03A9 U+0399) | ωι (U+03C9 U+03B9) |
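Among these 107 groups, most pairs only participate in one group, except for six pairs that each appear in two groups (as can be read off the tables above): Θ ↔ θ (with U+03D1 and U+03F4), Т ↔ т (with U+1C84 and U+1C85), Ι ↔ ι (with U+1FBE and U+0345), U+0399 U+0308 U+0301 ↔ U+03B9 U+0308 U+0301 (with U+0390 and U+1FD3), U+03A5 U+0308 U+0301 ↔ U+03C5 U+0308 U+0301 (with U+03B0 and U+1FE3), and ST ↔ st (with U+FB05 and U+FB06).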
Therefore there are 101 unique pairs which host vestigial nodes; all other 1496-101=1395 pairs are trivial.
To summarize, we have the following partition of the mapping graph:
Everything can be summarized by these two tables:
| Input \ toUpperCase output | Identity | lowercase letter | uppercase letter | lowercase non-letter | uppercase non-letter | uncased | multi-code-point |
|---|---|---|---|---|---|---|---|
| lowercase letter | 805 | | 1403 | | | | 75 |
| uppercase letter | 1886 | | | | | | |
| lowercase non-letter | 269 | | | | 42 | | |
| uppercase non-letter | 120 | | | | | | |
| uncased | | | 4 | | | | 27 |
| multi-code-point | 73 | | | | | | 73 |
| Input \ toLowerCase output | Identity | lowercase letter | uppercase letter | lowercase non-letter | uppercase non-letter | uncased | multi-code-point |
|---|---|---|---|---|---|---|---|
| lowercase letter | 2283 | | | | | | |
| uppercase letter | 499 | 1386 | | | | | 1 |
| lowercase non-letter | 312 | | | | | | |
| uppercase non-letter | 78 | | | 42 | | | |
| uncased | | 31 | | | | | |
| multi-code-point | 73 | | | | | | 73 |
Yes. Looking horizontally across the "uppercase letter" and "uppercase non-letter" rows in the first table and "lowercase letter" and "lowercase non-letter" rows in the second table, all characters are identity-mapped.
As a corollary, given a character is cased, lower(upper)case variance implies upper(lower) case.
No. The isolated nodes are cased but case-invariant. Excluding them, the answer is yes (but is almost trivial).
As a corollary, given a character is cased, upper(lower)case invariance does not imply upper(lower) case.
No. There are characters that are uncased, but are case-mapping variant. These are exactly the intermediate nodes in triangles:
No. There are characters that are both uppercase and lowercase variant. These are exactly the intermediate nodes in triangles:
No, never. In our mapping graph, the only sink nodes are the isolated nodes. You can never enter a node without being able to exit.
Yes; specifically, the triangle intermediate nodes are uncased but can change into cased characters. But no, a cased character can never change into an uncased character.
The first question is obviously yes (see the tables above). Namely, there are 75 lowercase letters (the 47 lowercase aliases of multi-code-point pairs, ß, and the Greek lowercase letters with ypogegrammeni) and 27 uncased characters (the Greek uppercase letters with prosgegrammeni) that become multi-code-point after uppercasing; on the other hand, only a single uppercase letter (İ) becomes multi-code-point after lowercasing.
Most of the time, as soon as you change from a single code point to multiple code points, you never go back (resulting in the pair-and-alias groups), so the answer to the second question is no (in the tables, the "multi-code-point" row only maps to "identity" or "multi-code-point").
However, the answer to the second question changes slightly if you perform NFC normalization after case mapping. Some of these sequences stay multi-code-point when they are normalized as-is, but collapse into a single code point when they are case-mapped first and then normalized.
(Note that this isn't meant to be exhaustive, since the candidates list only includes character sequences that are derived from single code points by case mapping in the first place.)
| Sequence | Lowercase |
|---|---|
| J̌ (U+004A U+030C) | ǰ (U+01F0) |
| Ϊ́ (U+03AA U+0301) | ΐ (U+0390) |
| Ϋ́ (U+03AB U+0301) | ΰ (U+03B0) |
| H̱ (U+0048 U+0331) | ẖ (U+1E96) |
| T̈ (U+0054 U+0308) | ẗ (U+1E97) |
| W̊ (U+0057 U+030A) | ẘ (U+1E98) |
| Y̊ (U+0059 U+030A) | ẙ (U+1E99) |
| Υ̓ (U+03A5 U+0313) | ὐ (U+1F50) |
| Υ̓̀ (U+03A5 U+0313 U+0300) | ὒ (U+1F52) |
| Υ̓́ (U+03A5 U+0313 U+0301) | ὔ (U+1F54) |
| Υ̓͂ (U+03A5 U+0313 U+0342) | ὖ (U+1F56) |
| Α͂ (U+0391 U+0342) | ᾶ (U+1FB6) |
| Η͂ (U+0397 U+0342) | ῆ (U+1FC6) |
| Ϊ̀ (U+03AA U+0300) | ῒ (U+1FD2) |
| Ι͂ (U+0399 U+0342) | ῖ (U+1FD6) |
| Ϊ͂ (U+03AA U+0342) | ῗ (U+1FD7) |
| Ϋ̀ (U+03AB U+0300) | ῢ (U+1FE2) |
| Ρ̓ (U+03A1 U+0313) | ῤ (U+1FE4) |
| Υ͂ (U+03A5 U+0342) | ῦ (U+1FE6) |
| Ϋ͂ (U+03AB U+0342) | ῧ (U+1FE7) |
| Ω͂ (U+03A9 U+0342) | ῶ (U+1FF6) |
| Sequence | Uppercase |
|---|---|
| i̇ (U+0069 U+0307) | İ (U+0130) |
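The effect in the two tables above can be reproduced with a few lines. This is a sketch with hypothetical helper names (`lowerNFC`, `upperNFC`, `codePoints`); it only relies on the standard `String.prototype.normalize` method.

```javascript
// Case-map first, then normalize to NFC.
const lowerNFC = (s) => s.toLowerCase().normalize("NFC");
const upperNFC = (s) => s.toUpperCase().normalize("NFC");

// Number of code points in a string.
const codePoints = (s) => [...s].length;

const capitalJCaron = "J\u030C"; // J + combining caron
const smallIDot = "i\u0307"; // i + combining dot above

// Normalizing alone keeps both sequences at two code points...
console.log(codePoints(capitalJCaron.normalize("NFC")));
console.log(codePoints(smallIDot.normalize("NFC")));

// ...but case mapping followed by NFC yields a single precomposed code point.
console.log(lowerNFC(capitalJCaron) === "\u01F0");
console.log(upperNFC(smallIDot) === "\u0130");
```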
toUpper(Lower)Case?
No, because of the existence of case-invariant characters. Also no because of cases where a single-code-point character is mapped to a multi-code-point string. However, if the character is case-variant and the result is single-code-point, then yes.
No (if you ignore multi-code-point cases). Letters stay letters, and non-letters stay non-letters.
Does toUpper(Lower)Case always produce upper(lower)case characters? Can it produce lower(upper)case characters?
Yes (if you ignore invariant and multi-code-point cases) and no (always).
toUpperCase and toLowerCase (ignoring identity mapping)?
No. By idempotence: if A → C by toUpperCase and B → C by toLowerCase, then C is invariant under both operations, but invariant characters are never the result of case mapping from other characters.
toUpper(Lower)Case from two different characters (ignoring identity mapping)?
Yes. By the existence of aliases, there are many cases where two characters map to the same one. Namely, the 18 groups with a lowercase letter alias, 3 groups with an uppercase letter alias, 2 groups with two lowercase letter aliases, and 1 group with an uppercase letter alias and a lowercase letter alias.
Are toUpperCase and toLowerCase reverse operations?
This is to ask, is it true that either x === x.toUpperCase().toLowerCase() or x === x.toLowerCase().toUpperCase() for all single-code-point x?
The answer is no. If x is not an isolated node or part of a pair (the other 135 characters), then you always end up somewhere different from where you started. However, these characters are quite rare, so for about 97% of the interesting characters, the answer is yes.
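The round-trip property can be checked with a short scan. `roundTrips` and `failures` are illustrative names; the number of failures depends on the engine's Unicode version.

```javascript
// True when at least one of the two round trips returns to x.
const roundTrips = (x) =>
  x === x.toUpperCase().toLowerCase() || x === x.toLowerCase().toUpperCase();

// Collect every single code point for which neither round trip works.
const failures = [];
for (let cp = 0; cp <= 0x10ffff; cp++) {
  const ch = String.fromCodePoint(cp);
  if (!roundTrips(ch)) failures.push(ch);
}

console.log(failures.length); // engine-dependent
console.log(roundTrips("a"), roundTrips("\u00DF")); // pair member vs. ß
```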
Do toUpperCase and toLowerCase become reverse operations after applying a toUpperCase? After applying a toLowerCase?
This is to ask, is it true that either x.toUpperCase() === x.toUpperCase().toLowerCase().toUpperCase() or x.toLowerCase() === x.toLowerCase().toUpperCase().toLowerCase() for all single-code-point x?
The answer is unfortunately still no, because a single toUpperCase or toLowerCase is not sufficient to move into a steady state. Namely, if x is ẞ (the end of the path-and-pair group), you must apply x.toLowerCase().toUpperCase() to arrive at steady state "SS". For all other characters, the answer is yes.
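A sketch of this check on a few sample characters, with hypothetical helper names (`steadyAfterUpper`, `steadyAfterLower`, `settles`):

```javascript
// Is x.toUpperCase() already part of a stable upper/lower pair?
const steadyAfterUpper = (x) => {
  const u = x.toUpperCase();
  return u === u.toLowerCase().toUpperCase();
};

// Is x.toLowerCase() already part of a stable upper/lower pair?
const steadyAfterLower = (x) => {
  const l = x.toLowerCase();
  return l === l.toUpperCase().toLowerCase();
};

const settles = (x) => steadyAfterUpper(x) || steadyAfterLower(x);

console.log(settles("\u1E9E")); // ẞ: neither single step reaches a steady state
console.log(settles("\u00DF")); // ß: uppercasing lands directly on "SS"
console.log("\u1E9E".toLowerCase().toUpperCase()); // two steps are needed for ẞ
```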
toUpperCase and toLowerCase on a single character?
The answer is 4, always in the order x, x.toLowerCase(), x.toLowerCase().toUpperCase(), x.toLowerCase().toUpperCase().toLowerCase(). x is either ẞ (ẞ → ß → SS → ss) or one of the 27 uppercase Greek letters with prosgegrammeni (e.g. ᾈ → ᾀ → ἈΙ → ἀι).
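The longest chains can be traced with a small helper. `chain` is a hypothetical name; it applies toLowerCase, toUpperCase, toLowerCase in that order and keeps only the distinct strings it encounters.

```javascript
// Follow x -> lower -> upper -> lower, collecting distinct strings in order.
function chain(x) {
  const seen = [x];
  let current = x;
  for (const method of ["toLowerCase", "toUpperCase", "toLowerCase"]) {
    current = current[method]();
    if (!seen.includes(current)) seen.push(current);
  }
  return seen;
}

console.log(chain("\u1E9E")); // starts at ẞ
console.log(chain("\u1F88")); // starts at ᾈ
console.log(chain("a")); // an ordinary pair member yields only two strings
```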