Unicode case mapping study

This page primarily concerns the following concepts:

Single code point character: Any grapheme that is a single Unicode code point. This includes astral characters, but excludes grapheme clusters, including certain accented characters.
Upper-/lowercase mapping: The default case conversion algorithm implemented in JavaScript by the toUpperCase and toLowerCase methods. (See UnicodeData.txt and the locale-insensitive mappings in SpecialCasing.txt.) Importantly, this is not the same as case folding (used in JavaScript by Unicode-aware case-insensitive matching), which is a canonicalization process. It is also not the locale-sensitive mapping provided by the toLocaleUpperCase and toLocaleLowerCase methods.
Upper-/lowercase letter: Any character that has the Unicode Uppercase_Letter (Lu) or Lowercase_Letter (Ll) general category, respectively.
Upper-/lowercase symbol: Any character that has the Unicode Uppercase or Lowercase property, respectively.
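
As a rough illustration of how these definitions differ in practice, the sketch below contrasts case mapping with case folding, and the letter categories with the symbol properties. The sample characters (ß, ſ, Ⅰ) are picked for illustration and are not part of the definitions above.

```javascript
// Case mapping can change the number of code points.
const sharpS = "\u00DF"; // ß
console.log(sharpS.toUpperCase()); // "SS"

// Mapping vs. folding: the long s (U+017F) is already lowercase, so
// toLowerCase leaves it alone, yet Unicode-aware case-insensitive
// matching (/iu) treats it as equal to "s".
const longS = "\u017F"; // ſ
console.log(longS.toLowerCase() === longS); // true
console.log(/^s$/iu.test(longS)); // true

// Letter vs. symbol: ROMAN NUMERAL ONE (U+2160) has the Uppercase
// property, but its general category is Nl rather than Lu.
const romanOne = "\u2160"; // Ⅰ
console.log(/^\p{Uppercase_Letter}$/u.test(romanOne)); // false
console.log(/^\p{Uppercase}$/u.test(romanOne)); // true
```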

There's no great way to list all existing Unicode code points in JavaScript. However, most of these code points, even if they exist, are things like & which are (1) not letters, (2) not cased, and (3) invariant under case conversion, and are therefore not relevant to this discussion. The non-existent code points also check all these boxes and are therefore thrown out too. So, we can simply list all the code points and check if they are either cased or case-variant. We call these code points "interesting".

const candidates = new Set();

for (let i = 0; i <= 0x10ffff; i++) {
  const char = String.fromCodePoint(i);
  if (
    char.toUpperCase() !== char ||
    char.toLowerCase() !== char ||
    /^[\p{Uppercase_Letter}\p{Lowercase_Letter}\p{Uppercase}\p{Lowercase}]$/u.test(
      char,
    )
  )
    candidates.add(char);
}

console.log(candidates.size);

At the time of writing (Unicode v17), there are 4632 such interesting code points.
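
To see which branch of the condition makes a given code point interesting, the check can be pulled out into a helper. This is only a sketch: `isInteresting` and `casedPattern` are hypothetical names, and the sample characters are chosen for illustration.

```javascript
// Same condition as the loop above, applied to a single code point.
const casedPattern =
  /^[\p{Uppercase_Letter}\p{Lowercase_Letter}\p{Uppercase}\p{Lowercase}]$/u;

function isInteresting(char) {
  return (
    char.toUpperCase() !== char ||
    char.toLowerCase() !== char ||
    casedPattern.test(char)
  );
}

console.log(isInteresting("&")); // false: not cased, not case-variant
console.log(isInteresting("a")); // true: cased and case-variant
console.log(isInteresting("\u00AA")); // true: ª is cased (Lowercase) but invariant
console.log(isInteresting("\u01C5")); // true: ǅ is uncased but case-variant
console.log(isInteresting(String.fromCodePoint(0x10ffff))); // false: no character there
```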

But that's not the end of the story. Some single code point characters map to multiple code points under case conversion, which then can be further case-converted, and so on. (We'll look at these more closely later.) So, we also need to include all the strings that are reachable from these 4632 code points by applying toUpperCase and toLowerCase repeatedly until we reach a fixed point.

for (const char of [...candidates]) {
  let newChar = char.toUpperCase();
  let isUpper = true;
  while (!candidates.has(newChar)) {
    candidates.add(newChar);
    newChar = isUpper ? newChar.toLowerCase() : newChar.toUpperCase();
    isUpper = !isUpper;
  }

  newChar = char.toLowerCase();
  isUpper = false;
  while (!candidates.has(newChar)) {
    candidates.add(newChar);
    newChar = isUpper ? newChar.toLowerCase() : newChar.toUpperCase();
    isUpper = !isUpper;
  }
}

console.log(candidates.size);

This brings us to a total of 4778 strings. To confirm that this set is closed under case conversion:

for (const char of candidates) {
  if (!candidates.has(char.toUpperCase()))
    console.log(`Failed to satisfy closure under toUpperCase for ${char}`);
  if (!candidates.has(char.toLowerCase()))
    console.log(`Failed to satisfy closure under toLowerCase for ${char}`);
}

By definition, "uninteresting" characters cannot change into interesting characters because they are invariant. Can we have an interesting character that changes into an uninteresting character? We check whether every member of candidates is still interesting.

for (const char of candidates) {
  if (
    char.toUpperCase() === char &&
    char.toLowerCase() === char &&
    !/^[\p{Uppercase_Letter}\p{Lowercase_Letter}\p{Uppercase}\p{Lowercase}]$/u.test(
      char,
    )
  )
    console.log(`Failed to satisfy closure under case conversion for ${char}`);
}

So the candidates set remains fully interesting; the only difference is that we have added some multi-code point strings that are reachable from interesting code points by case conversion.

Relationship between Uppercase_Letter/Lowercase_Letter and Uppercase/Lowercase

The Uppercase_Letter and Lowercase_Letter general categories are supposedly subsets of the Uppercase and Lowercase properties, respectively. Furthermore a character cannot be simultaneously uppercase and lowercase. We shall verify this:

let c1 = 0,
  c2 = 0,
  c3 = 0,
  c4 = 0,
  c5 = 0,
  c6 = 0;

for (const char of candidates) {
  const isLu = /^\p{Uppercase_Letter}$/u.test(char);
  const isLl = /^\p{Lowercase_Letter}$/u.test(char);
  const isUppercase = /^\p{Uppercase}$/u.test(char);
  const isLowercase = /^\p{Lowercase}$/u.test(char);
  if (isLu && !isUppercase)
    console.log(`Failed to satisfy Uppercase_Letter ⊆ Uppercase for ${char}`);
  if (isLl && !isLowercase)
    console.log(`Failed to satisfy Lowercase_Letter ⊆ Lowercase for ${char}`);
  if (isUppercase && isLowercase)
    console.log(`Failed to satisfy Uppercase ∩ Lowercase = ∅ for ${char}`);
  if (isLu) c1++;
  else if (isLl) c2++;
  else if (isUppercase) c3++;
  else if (isLowercase) c4++;
  else if ([...char].length === 1) c5++;
  else c6++;
}

console.log(c1, c2, c3, c4, c5, c6);

Indeed this is true. Other than the 4778 − 4632 = 146 multi-code point strings, the remaining ones may have a maximum of 2⁴=16 combinations of these 4 boolean properties, but since Uppercase_Letter implies Uppercase, Lowercase_Letter implies Lowercase, Uppercase implies not Lowercase and vice versa, we are left with only 5 valid combinations:

Uppercase_Letter	Lowercase_Letter	Uppercase	Lowercase	Category	Count
1	0	1	0	Uppercase letter	1886
0	1	0	1	Lowercase letter	2283
0	0	1	0	Uppercase non-letter	120
0	0	0	1	Lowercase non-letter	312
0	0	0	0	Uncased	31
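
For a concrete feel of the five rows, the following sketch prints the four flags for one sample character per row. `flagBits` is a hypothetical helper, and the sample characters are picked for illustration rather than taken from the table.

```javascript
// Returns the four flags in the table's column order:
// Uppercase_Letter, Lowercase_Letter, Uppercase, Lowercase.
function flagBits(char) {
  return [
    /^\p{Uppercase_Letter}$/u,
    /^\p{Lowercase_Letter}$/u,
    /^\p{Uppercase}$/u,
    /^\p{Lowercase}$/u,
  ]
    .map((pattern) => (pattern.test(char) ? "1" : "0"))
    .join("");
}

console.log(flagBits("A")); // "1010": uppercase letter
console.log(flagBits("a")); // "0101": lowercase letter
console.log(flagBits("\u2160")); // "0010": Ⅰ, uppercase non-letter
console.log(flagBits("\u2170")); // "0001": ⅰ, lowercase non-letter
console.log(flagBits("\u01C5")); // "0000": ǅ, uncased
```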

Mapping graph

Here's the data structure we are going to be working with: a mapping graph. The node set is the candidates set. Each node has exactly two outgoing edges, one labeled toUpperCase and the other labeled toLowerCase, which point to the nodes corresponding to the result of applying the respective case conversion method to the node. Edges can be self-referential, and there can be any number of incoming edges.
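
A minimal sketch of this structure on a four-node subgraph, assuming a plain Map from each node to its two labeled out-edges. `buildMappingGraph` is a hypothetical helper; the rest of this page works with the candidates set directly.

```javascript
// Each node gets exactly two outgoing edges, one per mapping.
function buildMappingGraph(nodes) {
  const graph = new Map();
  for (const node of nodes) {
    graph.set(node, {
      toUpperCase: node.toUpperCase(),
      toLowerCase: node.toLowerCase(),
    });
  }
  return graph;
}

// ẞ (U+1E9E), ß (U+00DF), and the strings "SS" and "ss".
const sampleGraph = buildMappingGraph(["\u1E9E", "\u00DF", "SS", "ss"]);
for (const [node, edges] of sampleGraph) {
  console.log(node, "-U->", edges.toUpperCase, "-L->", edges.toLowerCase);
}
```

Self-referential edges show up as an edge whose target equals the node itself, for example the toUpperCase edge of "SS".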

Nodes in this graph can be classified into the following categories:

Multi-code-point
Uppercase letter
Uppercase non-letter
Lowercase letter
Lowercase non-letter
Uncased

It's apparent that this graph isn't going to be very connected: it's going to have many isolated components. We identify these components by their shapes.

To begin with, we build a reverse graph for easy lookup:

const lowersTo = new Map();
const uppersTo = new Map();

for (const s of candidates) {
  const lower = s.toLowerCase();
  const upper = s.toUpperCase();
  if (s !== lower)
    lowersTo.set(lower, (lowersTo.get(lower) ?? new Set()).add(s));
  if (s !== upper)
    uppersTo.set(upper, (uppersTo.get(upper) ?? new Set()).add(s));
}

function formatChar(char) {
  const points = [...char]
    .map(
      (c) =>
        `U+${c.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")}`,
    )
    .join(" ");
  return `${char} (${points})`;
}

function charType(char) {
  if ([...char].length > 1) return "multi-code-point";
  if (/^\p{Uppercase_Letter}$/u.test(char)) return "uppercase letter";
  if (/^\p{Lowercase_Letter}$/u.test(char)) return "lowercase letter";
  if (/^\p{Uppercase}$/u.test(char)) return "uppercase non-letter";
  if (/^\p{Lowercase}$/u.test(char)) return "lowercase non-letter";
  return "uncased";
}

Isolated nodes

There are some nodes that are both toUpperCase-invariant and toLowerCase-invariant. These are still interesting because they are cased, but they have 2 self-loop edges. Furthermore, all of them are isolated, meaning that no other node points to them.

const isolated = [];

for (const char of candidates) {
  if (char.toUpperCase() === char && char.toLowerCase() === char) {
    if (lowersTo.has(char) || uppersTo.has(char))
      console.log(`Failed to satisfy isolation for ${char}`);
    isolated.push(char);
  }
}

const g1 = [],
  g2 = [],
  g3 = [],
  g4 = [];
for (const char of isolated) {
  const type = charType(char);
  if (type === "uppercase letter") g1.push(char);
  else if (type === "lowercase letter") g2.push(char);
  else if (type === "uppercase non-letter") g3.push(char);
  else if (type === "lowercase non-letter") g4.push(char);
  else console.log(`Bad category: ${char} (${type})`);
}

console.table(g1.map(formatChar));
console.table(g2.map(formatChar));
console.table(g3.map(formatChar));
console.table(g4.map(formatChar));

There are 1651 such isolated nodes. Among these, there are 499 uppercase letters, 805 lowercase letters, 78 uppercase non-letters, and 269 lowercase non-letters.
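
One illustrative member of each of the four categories is sketched below. The specific characters are picked for illustration and are not taken from the tables printed by the code above; `isInvariant` is a hypothetical helper.

```javascript
// A node is isolated only if both mappings leave it unchanged.
function isInvariant(char) {
  return char.toUpperCase() === char && char.toLowerCase() === char;
}

const isolatedSamples = [
  ["\u{1D400}", "uppercase letter"], // 𝐀 MATHEMATICAL BOLD CAPITAL A
  ["\u0138", "lowercase letter"], // ĸ LATIN SMALL LETTER KRA
  ["\u{1F130}", "uppercase non-letter"], // 🄰 SQUARED LATIN CAPITAL LETTER A
  ["\u00AA", "lowercase non-letter"], // ª FEMININE ORDINAL INDICATOR
];

for (const [char, label] of isolatedSamples) {
  console.log(char, label, isInvariant(char));
}
```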

Pairs

Presumably, most of the remaining nodes come in pairs that satisfy the following:

One of them is the uppercase: it's toUpperCase-invariant.
The other is the lowercase: it's toLowerCase-invariant.
They point to each other: the uppercase's toLowerCase edge points to the lowercase, and the lowercase's toUpperCase edge points to the uppercase.

const pairs = [];

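for (const upper of candidates) {
  if (upper.toUpperCase() !== upper) continue;
  const lower = upper.toLowerCase();
  // Skip isolated nodes: they are invariant under both mappings and are not pairs.
  if (lower === upper) continue;
  if (lower.toUpperCase() !== upper) continue;
  pairs.push({ upper, lower });
}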

console.table(
  pairs.map(({ upper, lower }) => ({
    upper: formatChar(upper),
    lower: formatChar(lower),
  })),
);

There are 1496 such pairs; together with the isolated nodes, they account for 4643 nodes, which is 97.2% of the entire graph's 4778 nodes. Furthermore, all of these pairs satisfy that either they are an uppercase/lowercase letter pair (1381 pairs), or they are an uppercase/lowercase non-letter pair (42 pairs), or they both have multiple code points (73 pairs).

c1 = 0;
c2 = 0;
c3 = 0;
for (const { upper, lower } of pairs) {
  const upperType = charType(upper);
  const lowerType = charType(lower);
  if (upperType === "uppercase letter" && lowerType === "lowercase letter") {
    c1++;
  } else if (
    upperType === "uppercase non-letter" &&
    lowerType === "lowercase non-letter"
  ) {
    c2++;
  } else if (
    upperType === "multi-code-point" &&
    lowerType === "multi-code-point"
  ) {
    c3++;
  } else {
    console.log(
      `Bad pair: ${formatChar(upper)} (${upperType}) and ${formatChar(lower)} (${lowerType})`,
    );
  }
}
console.log(c1, c2, c3);
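
The three pair conditions can also be checked for two given strings. The sketch below uses a hypothetical helper, `formsPair`, with one example for each kind of pair and one non-pair.

```javascript
// True when `upper` and `lower` are distinct, each is invariant under
// its own mapping, and they map to each other.
function formsPair(upper, lower) {
  return (
    upper !== lower &&
    upper.toUpperCase() === upper &&
    lower.toLowerCase() === lower &&
    upper.toLowerCase() === lower &&
    lower.toUpperCase() === upper
  );
}

console.log(formsPair("A", "a")); // true: letter pair
console.log(formsPair("\u2160", "\u2170")); // true: Ⅰ / ⅰ, non-letter pair
console.log(formsPair("SS", "ss")); // true: multi-code-point pair
console.log(formsPair("\u0398", "\u03D1")); // false: Θ lowercases to θ, not ϑ
```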

In practice, the only cycles are isolated nodes (self-loops) and pairs. A triangle A → B → C → A cannot exist: two of its three adjacent edges would have to carry the same label, violating idempotence. (Note that this is about a circular triangle; there can be a non-circular triangle, where A uppercases to B and lowercases to C, while B and C form a pair. We'll see these later.) Longer even-length cycles (A uppercases to B, B lowercases to C, C uppercases to D, D lowercases to A) are theoretically possible, but would be exceedingly unlikely. Therefore, the remaining 135 nodes must eventually arrive at one of these pairs (since the isolated nodes don't have incoming edges) via some single-directional path.
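
The argument leans on idempotence: applying the same mapping twice gives the same result as applying it once, so two consecutive edges with the same label cannot both move. A sketch of how that assumption could be checked over a range of code points follows; `findIdempotenceViolations` is a hypothetical helper.

```javascript
// Collect code points in [from, to] for which applying the same mapping
// twice differs from applying it once.
function findIdempotenceViolations(from, to) {
  const violations = [];
  for (let i = from; i <= to; i++) {
    const char = String.fromCodePoint(i);
    const upper = char.toUpperCase();
    const lower = char.toLowerCase();
    if (upper.toUpperCase() !== upper || lower.toLowerCase() !== lower)
      violations.push(char);
  }
  return violations;
}

// The reasoning above relies on this list being empty.
console.log(findIdempotenceViolations(0, 0x10ffff).length);
```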

Aliases

Since these 135 nodes must be pointing to some pair via a single-directional path, we just traverse each one and see where we end up.

There are two kinds of cases here: (1) the node is invariant under one mapping; (2) the node is variant under both mappings. In the first case, there's one simple path; in the second case, there are two paths that may eventually reach different places. However, we can be sure about one thing: this node must have no incoming edges, because that would break idempotence. So we just need to find all nodes with no incoming edges, and then traverse them.

const isolatedSet = new Set(isolated);
const pairLookup = new Map(
  pairs.flatMap(({ upper, lower }) => [
    [upper, lower],
    [lower, upper],
  ]),
);

const rest = candidates.difference(isolatedSet).difference(pairLookup);

for (const char of rest) {
  if (lowersTo.has(char) || uppersTo.has(char)) continue;
  const nextNodes = [
    ["-U->", char.toUpperCase()],
    ["-L->", char.toLowerCase()],
  ].filter((c) => c[1] !== char);
  if (nextNodes.length === 0)
    throw new Error(`Can't happen because ${char} is not isolated`);
  for (const [label, next] of nextNodes) {
    const path = [char, label, next];
    let current = next;
    while (!pairLookup.has(current)) {
      if (isolatedSet.has(current))
        throw new Error(`Can't happen because ${current} is isolated`);
      const nextNodes = [
        ["-U->", current.toUpperCase()],
        ["-L->", current.toLowerCase()],
      ].filter((c) => c[1] !== current);
      if (nextNodes.length === 0)
        throw new Error(`Can't happen because ${current} is not isolated`);
      if (nextNodes.length > 1) {
        throw new Error(
          `Can't happen because ${current} shouldn't have multiple outgoing edges`,
        );
      }
      const [label, next] = nextNodes[0];
      path.push(label, next);
      current = next;
    }
    console.log(
      path
        .map((x, i) => {
          if (i === path.length - 1)
            return `${formatChar(x)} <--> ${formatChar(pairLookup.get(x))}`;
          else if (i % 2 === 0) return formatChar(x);
          return x;
        })
        .join(" "),
    );
  }
}
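
For experimenting with a single starting string, the same traversal can be sketched without the precomputed sets. The helpers `isPairMember`, `movingEdges`, and `pathsToPair` are hypothetical names, and the hop cap is an arbitrary safety guard rather than something derived from the data.

```javascript
// A string belongs to a pair if it is invariant under exactly one
// mapping and the other mapping round-trips back to it.
function isPairMember(s) {
  const up = s.toUpperCase();
  const lo = s.toLowerCase();
  if (up === s && lo === s) return false; // isolated node
  if (up === s) return lo.toUpperCase() === s; // uppercase half of a pair
  if (lo === s) return up.toLowerCase() === s; // lowercase half of a pair
  return false; // variant under both mappings
}

// Outgoing edges that are not self-loops, uppercase first.
function movingEdges(s) {
  return [s.toUpperCase(), s.toLowerCase()].filter((t) => t !== s);
}

// Follow each outgoing edge of `start` until a pair member is reached.
function pathsToPair(start) {
  return movingEdges(start).map((first) => {
    const path = [start, first];
    let current = first;
    for (let hops = 0; hops < 8 && !isPairMember(current); hops++) {
      const [next] = movingEdges(current);
      if (next === undefined) break; // would be an isolated node
      path.push(next);
      current = next;
    }
    return path;
  });
}

console.log(pathsToPair("\u1E9E")); // [["ẞ", "ß", "SS"]]
console.log(pathsToPair("\u01C5")); // [["ǅ", "Ǆ"], ["ǅ", "ǆ"]]
```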

The results form a few obvious groups:

Latin-1 Supplement, Latin Extended-A, Greek and Coptic, Cyrillic Extended-C, Latin Extended Additional, Greek Extended (1+2+8+9+1+1=22 groups, 22 non-pair extra nodes)

An extra alias for the lowercase form: A → B, B ↔ C. A is a lowercase letter, while B and C are an upper/lowercase letter pair.

A	B	C
µ (U+00B5)	Μ (U+039C)	μ (U+03BC)
ı (U+0131)	I (U+0049)	i (U+0069)
ſ (U+017F)	S (U+0053)	s (U+0073)
ς (U+03C2)	Σ (U+03A3)	σ (U+03C3)
ϐ (U+03D0)	Β (U+0392)	β (U+03B2)
ϑ (U+03D1)	Θ (U+0398)	θ (U+03B8)
ϕ (U+03D5)	Φ (U+03A6)	φ (U+03C6)
ϖ (U+03D6)	Π (U+03A0)	π (U+03C0)
ϰ (U+03F0)	Κ (U+039A)	κ (U+03BA)
ϱ (U+03F1)	Ρ (U+03A1)	ρ (U+03C1)
ϵ (U+03F5)	Ε (U+0395)	ε (U+03B5)
ᲀ (U+1C80)	В (U+0412)	в (U+0432)
ᲁ (U+1C81)	Д (U+0414)	д (U+0434)
ᲂ (U+1C82)	О (U+041E)	о (U+043E)
ᲃ (U+1C83)	С (U+0421)	с (U+0441)
ᲄ (U+1C84)	Т (U+0422)	т (U+0442)
ᲅ (U+1C85)	Т (U+0422)	т (U+0442)
ᲆ (U+1C86)	Ъ (U+042A)	ъ (U+044A)
ᲇ (U+1C87)	Ѣ (U+0462)	ѣ (U+0463)
ᲈ (U+1C88)	Ꙋ (U+A64A)	ꙋ (U+A64B)
ẛ (U+1E9B)	Ṡ (U+1E60)	ṡ (U+1E61)
ι (U+1FBE)	Ι (U+0399)	ι (U+03B9)

Latin Extended-A, Latin Extended-B, Greek and Coptic, Armenian, Latin Extended Additional, Greek Extended, Alphabetic Presentation Forms (1+1+2+1+5+25+12=47 groups, 47 non-pair extra nodes)

An extra alias for the lowercase form: A → B, B ↔ C. A is a lowercase letter, while B and C are multi-code-point strings.

A	B	C
ŉ (U+0149)	ʼN (U+02BC U+004E)	ʼn (U+02BC U+006E)
ǰ (U+01F0)	J̌ (U+004A U+030C)	ǰ (U+006A U+030C)
ΐ (U+0390)	Ϊ́ (U+0399 U+0308 U+0301)	ΐ (U+03B9 U+0308 U+0301)
ΰ (U+03B0)	Ϋ́ (U+03A5 U+0308 U+0301)	ΰ (U+03C5 U+0308 U+0301)
և (U+0587)	ԵՒ (U+0535 U+0552)	եւ (U+0565 U+0582)
ẖ (U+1E96)	H̱ (U+0048 U+0331)	ẖ (U+0068 U+0331)
ẗ (U+1E97)	T̈ (U+0054 U+0308)	ẗ (U+0074 U+0308)
ẘ (U+1E98)	W̊ (U+0057 U+030A)	ẘ (U+0077 U+030A)
ẙ (U+1E99)	Y̊ (U+0059 U+030A)	ẙ (U+0079 U+030A)
ẚ (U+1E9A)	Aʾ (U+0041 U+02BE)	aʾ (U+0061 U+02BE)
ὐ (U+1F50)	Υ̓ (U+03A5 U+0313)	ὐ (U+03C5 U+0313)
ὒ (U+1F52)	Υ̓̀ (U+03A5 U+0313 U+0300)	ὒ (U+03C5 U+0313 U+0300)
ὔ (U+1F54)	Υ̓́ (U+03A5 U+0313 U+0301)	ὔ (U+03C5 U+0313 U+0301)
ὖ (U+1F56)	Υ̓͂ (U+03A5 U+0313 U+0342)	ὖ (U+03C5 U+0313 U+0342)
ᾲ (U+1FB2)	ᾺΙ (U+1FBA U+0399)	ὰι (U+1F70 U+03B9)
ᾴ (U+1FB4)	ΆΙ (U+0386 U+0399)	άι (U+03AC U+03B9)
ᾶ (U+1FB6)	Α͂ (U+0391 U+0342)	ᾶ (U+03B1 U+0342)
ᾷ (U+1FB7)	Α͂Ι (U+0391 U+0342 U+0399)	ᾶι (U+03B1 U+0342 U+03B9)
ῂ (U+1FC2)	ῊΙ (U+1FCA U+0399)	ὴι (U+1F74 U+03B9)
ῄ (U+1FC4)	ΉΙ (U+0389 U+0399)	ήι (U+03AE U+03B9)
ῆ (U+1FC6)	Η͂ (U+0397 U+0342)	ῆ (U+03B7 U+0342)
ῇ (U+1FC7)	Η͂Ι (U+0397 U+0342 U+0399)	ῆι (U+03B7 U+0342 U+03B9)
ῒ (U+1FD2)	Ϊ̀ (U+0399 U+0308 U+0300)	ῒ (U+03B9 U+0308 U+0300)
ΐ (U+1FD3)	Ϊ́ (U+0399 U+0308 U+0301)	ΐ (U+03B9 U+0308 U+0301)
ῖ (U+1FD6)	Ι͂ (U+0399 U+0342)	ῖ (U+03B9 U+0342)
ῗ (U+1FD7)	Ϊ͂ (U+0399 U+0308 U+0342)	ῗ (U+03B9 U+0308 U+0342)
ῢ (U+1FE2)	Ϋ̀ (U+03A5 U+0308 U+0300)	ῢ (U+03C5 U+0308 U+0300)
ΰ (U+1FE3)	Ϋ́ (U+03A5 U+0308 U+0301)	ΰ (U+03C5 U+0308 U+0301)
ῤ (U+1FE4)	Ρ̓ (U+03A1 U+0313)	ῤ (U+03C1 U+0313)
ῦ (U+1FE6)	Υ͂ (U+03A5 U+0342)	ῦ (U+03C5 U+0342)
ῧ (U+1FE7)	Ϋ͂ (U+03A5 U+0308 U+0342)	ῧ (U+03C5 U+0308 U+0342)
ῲ (U+1FF2)	ῺΙ (U+1FFA U+0399)	ὼι (U+1F7C U+03B9)
ῴ (U+1FF4)	ΏΙ (U+038F U+0399)	ώι (U+03CE U+03B9)
ῶ (U+1FF6)	Ω͂ (U+03A9 U+0342)	ῶ (U+03C9 U+0342)
ῷ (U+1FF7)	Ω͂Ι (U+03A9 U+0342 U+0399)	ῶι (U+03C9 U+0342 U+03B9)
ﬀ (U+FB00)	FF (U+0046 U+0046)	ff (U+0066 U+0066)
ﬁ (U+FB01)	FI (U+0046 U+0049)	fi (U+0066 U+0069)
ﬂ (U+FB02)	FL (U+0046 U+004C)	fl (U+0066 U+006C)
ﬃ (U+FB03)	FFI (U+0046 U+0046 U+0049)	ffi (U+0066 U+0066 U+0069)
ﬄ (U+FB04)	FFL (U+0046 U+0046 U+004C)	ffl (U+0066 U+0066 U+006C)
ﬅ (U+FB05)	ST (U+0053 U+0054)	st (U+0073 U+0074)
ﬆ (U+FB06)	ST (U+0053 U+0054)	st (U+0073 U+0074)
ﬓ (U+FB13)	ՄՆ (U+0544 U+0546)	մն (U+0574 U+0576)
ﬔ (U+FB14)	ՄԵ (U+0544 U+0535)	մե (U+0574 U+0565)
ﬕ (U+FB15)	ՄԻ (U+0544 U+053B)	մի (U+0574 U+056B)
ﬖ (U+FB16)	ՎՆ (U+054E U+0546)	վն (U+057E U+0576)
ﬗ (U+FB17)	ՄԽ (U+0544 U+053D)	մխ (U+0574 U+056D)

Combining Diacritical Marks (1 group, 1 non-pair extra node)

An extra alias for the lowercase form: A → B, B ↔ C. A is a lowercase non-letter (the combining ypogegrammeni, a mark that has the Lowercase property), while B and C are an upper/lowercase letter pair.

A	B	C
◌ͅ (U+0345)	Ι (U+0399)	ι (U+03B9)

Greek and Coptic, Letterlike Symbols (1+3=4 groups, 4 non-pair extra nodes)

An extra alias for the uppercase form: A → C, B ↔ C. A is an uppercase letter, while B and C are an upper/lowercase letter pair.

A	B	C
ϴ (U+03F4)	Θ (U+0398)	θ (U+03B8)
Ω (U+2126)	Ω (U+03A9)	ω (U+03C9)
K (U+212A)	K (U+004B)	k (U+006B)
Å (U+212B)	Å (U+00C5)	å (U+00E5)

Latin Extended-A (1 group, 1 non-pair extra node)

An extra alias for the uppercase form: A → C, B ↔ C. A is an uppercase letter, while B and C are multi-code-point strings.

A	B	C
İ (U+0130)	İ (U+0049 U+0307)	i̇ (U+0069 U+0307)

Latin Extended-B (4 groups, 4 non-pair extra nodes)

4 triangles of the form A → B, A → C, B ↔ C. A is uncased (contains a capital and a small letter in a ligature), while B and C are an upper/lowercase letter pair.

A	B	C
ǅ (U+01C5)	Ǆ (U+01C4)	ǆ (U+01C6)
ǈ (U+01C8)	Ǉ (U+01C7)	ǉ (U+01C9)
ǋ (U+01CB)	Ǌ (U+01CA)	ǌ (U+01CC)
ǲ (U+01F2)	Ǳ (U+01F1)	ǳ (U+01F3)

Latin-1 Supplement and Latin Extended Additional (1 group, 2 non-pair extra nodes)

A path of the form A → B, B → C, C ↔ D. A is an uppercase letter, B is a lowercase letter, while C and D are both multi-code-point strings.

A	B	C	D
ẞ (U+1E9E)	ß (U+00DF)	SS (U+0053 U+0053)	ss (U+0073 U+0073)

This seems to be an unfortunate historical artifact: had ß been transformed by toUpperCase into ẞ instead of SS, we would have two trivial pairs.

Greek Extended (27 groups, 54 non-pair extra nodes)

27 pair-and-triangle groups of the form A → B, A → C, B → C, C ↔ D. A is uncased (contains a capital letter and a prosgegrammeni), B is a lowercase letter (contains a lowercase letter and a ypogegrammeni), while C and D are both multi-code-point strings, where the ypogegrammeni is replaced by a real iota.

A	B	C	D
ᾈ (U+1F88)	ᾀ (U+1F80)	ἈΙ (U+1F08 U+0399)	ἀι (U+1F00 U+03B9)
ᾉ (U+1F89)	ᾁ (U+1F81)	ἉΙ (U+1F09 U+0399)	ἁι (U+1F01 U+03B9)
ᾊ (U+1F8A)	ᾂ (U+1F82)	ἊΙ (U+1F0A U+0399)	ἂι (U+1F02 U+03B9)
ᾋ (U+1F8B)	ᾃ (U+1F83)	ἋΙ (U+1F0B U+0399)	ἃι (U+1F03 U+03B9)
ᾌ (U+1F8C)	ᾄ (U+1F84)	ἌΙ (U+1F0C U+0399)	ἄι (U+1F04 U+03B9)
ᾍ (U+1F8D)	ᾅ (U+1F85)	ἍΙ (U+1F0D U+0399)	ἅι (U+1F05 U+03B9)
ᾎ (U+1F8E)	ᾆ (U+1F86)	ἎΙ (U+1F0E U+0399)	ἆι (U+1F06 U+03B9)
ᾏ (U+1F8F)	ᾇ (U+1F87)	ἏΙ (U+1F0F U+0399)	ἇι (U+1F07 U+03B9)
ᾘ (U+1F98)	ᾐ (U+1F90)	ἨΙ (U+1F28 U+0399)	ἠι (U+1F20 U+03B9)
ᾙ (U+1F99)	ᾑ (U+1F91)	ἩΙ (U+1F29 U+0399)	ἡι (U+1F21 U+03B9)
ᾚ (U+1F9A)	ᾒ (U+1F92)	ἪΙ (U+1F2A U+0399)	ἢι (U+1F22 U+03B9)
ᾛ (U+1F9B)	ᾓ (U+1F93)	ἫΙ (U+1F2B U+0399)	ἣι (U+1F23 U+03B9)
ᾜ (U+1F9C)	ᾔ (U+1F94)	ἬΙ (U+1F2C U+0399)	ἤι (U+1F24 U+03B9)
ᾝ (U+1F9D)	ᾕ (U+1F95)	ἭΙ (U+1F2D U+0399)	ἥι (U+1F25 U+03B9)
ᾞ (U+1F9E)	ᾖ (U+1F96)	ἮΙ (U+1F2E U+0399)	ἦι (U+1F26 U+03B9)
ᾟ (U+1F9F)	ᾗ (U+1F97)	ἯΙ (U+1F2F U+0399)	ἧι (U+1F27 U+03B9)
ᾨ (U+1FA8)	ᾠ (U+1FA0)	ὨΙ (U+1F68 U+0399)	ὠι (U+1F60 U+03B9)
ᾩ (U+1FA9)	ᾡ (U+1FA1)	ὩΙ (U+1F69 U+0399)	ὡι (U+1F61 U+03B9)
ᾪ (U+1FAA)	ᾢ (U+1FA2)	ὪΙ (U+1F6A U+0399)	ὢι (U+1F62 U+03B9)
ᾫ (U+1FAB)	ᾣ (U+1FA3)	ὫΙ (U+1F6B U+0399)	ὣι (U+1F63 U+03B9)
ᾬ (U+1FAC)	ᾤ (U+1FA4)	ὬΙ (U+1F6C U+0399)	ὤι (U+1F64 U+03B9)
ᾭ (U+1FAD)	ᾥ (U+1FA5)	ὭΙ (U+1F6D U+0399)	ὥι (U+1F65 U+03B9)
ᾮ (U+1FAE)	ᾦ (U+1FA6)	ὮΙ (U+1F6E U+0399)	ὦι (U+1F66 U+03B9)
ᾯ (U+1FAF)	ᾧ (U+1FA7)	ὯΙ (U+1F6F U+0399)	ὧι (U+1F67 U+03B9)
ᾼ (U+1FBC)	ᾳ (U+1FB3)	ΑΙ (U+0391 U+0399)	αι (U+03B1 U+03B9)
ῌ (U+1FCC)	ῃ (U+1FC3)	ΗΙ (U+0397 U+0399)	ηι (U+03B7 U+03B9)
ῼ (U+1FFC)	ῳ (U+1FF3)	ΩΙ (U+03A9 U+0399)	ωι (U+03C9 U+03B9)

Among these 107 groups, most pairs only participate in one group, except:

Θ (U+0398) and θ (U+03B8): has ϴ (U+03F4) as another uppercase and ϑ (U+03D1) as another lowercase.
Т (U+0422) and т (U+0442): has both ᲄ (U+1C84) and ᲅ (U+1C85) as alternative lowercase.
Ι (U+0399) and ι (U+03B9): has both ι (U+1FBE) and ◌ͅ (U+0345) as alternative lowercase.
ST (U+0053 U+0054) and st (U+0073 U+0074): has both ﬅ (U+FB05) and ﬆ (U+FB06) as alternative lowercase.
Ϊ́ (U+0399 U+0308 U+0301) and ΐ (U+03B9 U+0308 U+0301): has both ΐ (U+0390) and ΐ (U+1FD3) as alternative lowercase.
Ϋ́ (U+03A5 U+0308 U+0301) and ΰ (U+03C5 U+0308 U+0301): has both ΰ (U+03B0) and ΰ (U+1FE3) as alternative lowercase.

Therefore, there are 101 unique pairs which host vestigial nodes; all other 1496 − 101 = 1395 pairs are trivial.

Graph summary

To summarize, we have the following partition of the mapping graph:

1651 isolated nodes:
- 499 uppercase letters
- 805 lowercase letters
- 78 uppercase non-letters
- 269 lowercase non-letters
1395 trivial pairs (2790 nodes):
- 1353 pairs of uppercase and lowercase letters (2706 nodes)
- 42 pairs of uppercase and lowercase non-letters (84 nodes)
59 pair-and-alias groups with a lowercase alias (177 nodes):
- 18 groups of upper/lowercase letter pairs with a lowercase letter alias (54 nodes)
- 41 groups of multi-code-point string pairs with a lowercase letter alias (123 nodes)
4 pair-and-alias groups with an uppercase alias (12 nodes):
- 3 groups of upper/lowercase letter pairs with an uppercase letter alias (9 nodes)
- 1 group of multi-code-point string pairs with an uppercase letter alias (3 nodes)
5 pair-and-alias groups with two lowercase aliases (20 nodes):
- 2 groups of upper/lowercase letter pairs with two lowercase aliases (8 nodes)
- 3 groups of multi-code-point string pairs with two lowercase letter aliases (12 nodes)
1 pair-and-alias group with both an uppercase and lowercase alias (4 nodes):
- 1 group of upper/lowercase letter pairs with an uppercase letter alias and a lowercase letter alias (4 nodes)
4 triangular groups where A → B, A → C, B ↔ C (12 nodes)
- 4 groups where A is uncased, B and C are an upper/lowercase letter pair (12 nodes)
1 pair-and-path group where A → B, B → C, C ↔ D (4 nodes)
- 1 group where A is an uppercase letter, B is a lowercase letter, while C and D are both multi-code-point strings (4 nodes)
27 pair-and-triangle groups where A → B, A → C, B → C, C ↔ D (108 nodes)
- 27 groups where A is uncased, B is a lowercase letter, C and D are both multi-code-point strings
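
As a sanity check on the arithmetic, the partition can be summed back to the totals quoted earlier. This sketch only re-adds numbers from the list above: each entry is a group count and the number of nodes per group, and every non-isolated group contains exactly one pair.

```javascript
// [description, number of groups, nodes per group]
const partition = [
  ["isolated nodes", 1651, 1],
  ["trivial pairs", 1395, 2],
  ["pair with one lowercase alias", 59, 3],
  ["pair with one uppercase alias", 4, 3],
  ["pair with two lowercase aliases", 5, 4],
  ["pair with an uppercase and a lowercase alias", 1, 4],
  ["triangle", 4, 3],
  ["pair-and-path", 1, 4],
  ["pair-and-triangle", 27, 4],
];

const totalNodes = partition.reduce(
  (sum, [, groups, size]) => sum + groups * size,
  0,
);
const totalPairs = partition
  .slice(1) // every group except the isolated nodes hosts one pair
  .reduce((sum, [, groups]) => sum + groups, 0);

console.log(totalNodes); // 4778
console.log(totalPairs); // 1496
```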

Input/output of case mapping

Everything can be summarized by these two tables:

Input and output cases of `toUpperCase`
Input \ Output	Identity	uppercase letter	uppercase non-letter	multi-code-point
lowercase letter	805	1403		75
uppercase letter	1886
lowercase non-letter	269	1	42
uppercase non-letter	120
uncased		4		27
multi-code-point	73			73

Input and output cases of `toLowerCase`
Input \ Output	Identity	lowercase letter	lowercase non-letter	multi-code-point
lowercase letter	2283
uppercase letter	499	1386		1
lowercase non-letter	312
uppercase non-letter	78		42
uncased		31
multi-code-point	73			73

Does upper(lower) case imply upper(lower)case invariance?

Yes. Looking horizontally across the "uppercase letter" and "uppercase non-letter" rows in the first table and "lowercase letter" and "lowercase non-letter" rows in the second table, all characters are identity-mapped.

As a corollary, given a character is cased, lower(upper)case variance implies upper(lower) case.

Does upper(lower) case imply lower(upper)case variance?

No. The isolated nodes are cased but case-invariant. Excluding them, the answer is yes (but is almost trivial).

As a corollary, given a character is cased, upper(lower)case invariance does not imply upper(lower) case.

Does case-mapping variance imply casedness?

No. There are characters that are uncased, but are case-mapping variant. These are exactly the intermediate nodes in triangles:

The 4 ligatures: ǅ (U+01C5), ǈ (U+01C8), ǋ (U+01CB), ǲ (U+01F2)
All 27 uppercase Greek letters with prosgegrammeni

Are uppercase variance and lowercase variance mutually exclusive?

No. There are characters that are both uppercase and lowercase variant. These are exactly the intermediate nodes in triangles:

The 4 ligatures: ǅ (U+01C5), ǈ (U+01C8), ǋ (U+01CB), ǲ (U+01F2)
All 27 uppercase Greek letters with prosgegrammeni

const bothVariant = [];
for (const char of candidates) {
  if (char.toUpperCase() !== char && char.toLowerCase() !== char)
    bothVariant.push(char);
}
console.table(bothVariant.map(formatChar));

Can a case-variant character change into a case-invariant character?

No, never. In our mapping graph, the only sink nodes are the isolated nodes. You can never enter a node without being able to exit.

Can an uncased character change into a cased character, or vice versa?

Yes; specifically, the triangle intermediate nodes are uncased but can change into cased characters. But no, a cased character can never change into an uncased character.

Can a single-code-point character change into a multi-code-point string, or vice versa?

The first question is obviously yes (see the tables above). Namely, there are 75 lowercase letters (the 47 lowercase aliases of multi-code-point pairs, ß, and the Greek lowercase letters with ypogegrammeni) and 27 uncased characters (the Greek uppercase letters with prosgegrammeni) that become multi-code-point after uppercasing; on the other hand, only a single uppercase letter (İ) becomes multi-code-point after lowercasing.

Once you change from a single code point to multiple code points, you never go back (resulting in the pair-and-alias groups), so the second question is no (in the tables, the "multi-code-point" row only maps to "identity" or "multi-code-point").

However, the answer to the second question changes slightly if you perform NFC normalization after case mapping. Some of these sequences stay multi-code-point when normalized as they are, but become a single code point when they are case-mapped and then normalized.

const lowerToSingle = [];
const upperToSingle = [];

for (const char of candidates) {
  const normChar = char.normalize("NFC");
  if ([...normChar].length > 1) {
    const upper = char.toUpperCase().normalize("NFC");
    const lower = char.toLowerCase().normalize("NFC");
    if ([...upper].length === 1) upperToSingle.push({ char: normChar, upper });
    if ([...lower].length === 1) lowerToSingle.push({ char: normChar, lower });
  }
}

console.table(
  lowerToSingle.map(({ char, lower }) => ({
    char: formatChar(char),
    lower: formatChar(lower),
  })),
);

console.table(
  upperToSingle.map(({ char, upper }) => ({
    char: formatChar(char),
    upper: formatChar(upper),
  })),
);

(Note that this isn't meant to be exhaustive, since the candidates list only includes character sequences that are derived from single code points by case mapping in the first place.)

Sequence	Lowercase
J̌ (U+004A U+030C)	ǰ (U+01F0)
Ϊ́ (U+03AA U+0301)	ΐ (U+0390)
Ϋ́ (U+03AB U+0301)	ΰ (U+03B0)
H̱ (U+0048 U+0331)	ẖ (U+1E96)
T̈ (U+0054 U+0308)	ẗ (U+1E97)
W̊ (U+0057 U+030A)	ẘ (U+1E98)
Y̊ (U+0059 U+030A)	ẙ (U+1E99)
Υ̓ (U+03A5 U+0313)	ὐ (U+1F50)
Υ̓̀ (U+03A5 U+0313 U+0300)	ὒ (U+1F52)
Υ̓́ (U+03A5 U+0313 U+0301)	ὔ (U+1F54)
Υ̓͂ (U+03A5 U+0313 U+0342)	ὖ (U+1F56)
Α͂ (U+0391 U+0342)	ᾶ (U+1FB6)
Η͂ (U+0397 U+0342)	ῆ (U+1FC6)
Ϊ̀ (U+03AA U+0300)	ῒ (U+1FD2)
Ι͂ (U+0399 U+0342)	ῖ (U+1FD6)
Ϊ͂ (U+03AA U+0342)	ῗ (U+1FD7)
Ϋ̀ (U+03AB U+0300)	ῢ (U+1FE2)
Ρ̓ (U+03A1 U+0313)	ῤ (U+1FE4)
Υ͂ (U+03A5 U+0342)	ῦ (U+1FE6)
Ϋ͂ (U+03AB U+0342)	ῧ (U+1FE7)
Ω͂ (U+03A9 U+0342)	ῶ (U+1FF6)

Latin Extended-B
- LATIN SMALL LETTER J WITH CARON (U+01F0)
Greek and Coptic
- GREEK SMALL LETTER IOTA WITH DIALYTIKA AND TONOS (U+0390)
- GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND TONOS (U+03B0)
Latin Extended Additional
- 4 characters
Greek Extended
- 14 characters

Sequence	Uppercase
i̇ (U+0069 U+0307)	İ (U+0130)

Latin Extended-A
- LATIN CAPITAL LETTER I WITH DOT ABOVE (U+0130)
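
The first row of each table can be reproduced by hand. The sketch below walks through J̌ and i̇ step by step, using only String.prototype.normalize and the two case mapping methods.

```javascript
// J + COMBINING CARON has no precomposed capital form, so NFC keeps
// two code points. Its lowercase form does compose, to ǰ (U+01F0).
const jCaron = "J\u030C";
console.log([...jCaron.normalize("NFC")].length); // 2
const lowered = jCaron.toLowerCase().normalize("NFC");
console.log(lowered === "\u01F0", [...lowered].length); // true 1

// i + COMBINING DOT ABOVE stays two code points under NFC, but its
// uppercase form composes to İ (U+0130).
const iDot = "i\u0307";
console.log([...iDot.normalize("NFC")].length); // 2
const raised = iDot.toUpperCase().normalize("NFC");
console.log(raised === "\u0130", [...raised].length); // true 1
```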

Are lower(upper)case characters always mapped to upper(lower)case characters by `toUpper(Lower)Case`?

No, because of the existence of case-invariant characters, and also because of cases where a single-code-point character is mapped to a multi-code-point string. However, if the character is case-variant and the result is single-code-point, then yes.

Can case-mapping convert a letter to a non-letter, or vice versa?

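No (if you ignore multi-code-point cases), with one exception. Letters stay letters, and non-letters stay non-letters, except that the combining ypogegrammeni ◌ͅ (U+0345), a non-letter with the Lowercase property, uppercases to the letter Ι (U+0399).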

Does `toUpper(Lower)Case` always produce upper(lower)case characters? Can it produce lower(upper)case characters?

Yes (if you ignore invariant and multi-code-point cases) and no (always).

Functional properties

Can the same character be produced by both `toUpperCase` and `toLowerCase` (ignoring identity mapping)?

No. By idempotence: if A → C by toUpperCase and B → C by toLowerCase, then C is invariant under both operations, but invariant characters are never the result of case mapping from other characters.

Can the same character be produced by `toUpper(Lower)Case` from two different characters (ignoring identity mapping)?

Yes. By the existence of aliases, there are many cases where two characters map to the same one. Namely, the 18 groups with a lowercase letter alias, 3 groups with an uppercase letter alias, 2 groups with two lowercase aliases, and 1 group with an uppercase letter alias and a lowercase letter alias.
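
A small sketch of how such collisions can be found in a list of characters follows. `sharedImages` is a hypothetical helper that groups inputs by their image under one mapping and skips identity mappings.

```javascript
// Group characters by their image under `method` ("toUpperCase" or
// "toLowerCase") and keep only images reached from 2+ distinct inputs.
function sharedImages(chars, method) {
  const byImage = new Map();
  for (const c of chars) {
    const image = c[method]();
    if (image === c) continue; // ignore identity mapping
    byImage.set(image, [...(byImage.get(image) ?? []), c]);
  }
  return new Map([...byImage].filter(([, sources]) => sources.length > 1));
}

// s and ſ (U+017F) both uppercase to S.
console.log(sharedImages(["s", "\u017F", "t"], "toUpperCase"));
// K and the Kelvin sign (U+212A) both lowercase to k.
console.log(sharedImages(["K", "\u212A", "T"], "toLowerCase"));
```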

Are `toUpperCase` and `toLowerCase` reverse operations?

This is to ask, is it true that either x === x.toUpperCase().toLowerCase() or x === x.toLowerCase().toUpperCase() for all single-code-point x?

The answer is no. If x is not a part of an isolated node or a pair (the other 135 characters), then you always end up different from where you started. However, these characters are actually quite rare, so 97% of the time, the answer is yes.
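
The condition can be written down directly. In this sketch, `roundTrips` is a hypothetical helper and the sample characters are picked for illustration.

```javascript
// True when at least one of the two round trips returns to x.
function roundTrips(x) {
  return (
    x === x.toUpperCase().toLowerCase() || x === x.toLowerCase().toUpperCase()
  );
}

console.log(roundTrips("a")); // true: a → A → a
console.log(roundTrips("\u0138")); // true: ĸ is isolated, nothing changes
console.log(roundTrips("\u00B5")); // false: µ → Μ → μ, a different code point
console.log(roundTrips("\u1E9E")); // false: ẞ → ß → SS
```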

Do `toUpperCase` and `toLowerCase` become reverse operations after applying a `toUpperCase`? After applying a `toLowerCase`?

This is to ask, is it true that either x.toUpperCase() === x.toUpperCase().toLowerCase().toUpperCase() or x.toLowerCase() === x.toLowerCase().toUpperCase().toLowerCase() for all single-code-point x?

The answer is unfortunately still no, because a single toUpperCase or toLowerCase is not sufficient to move into a steady state. Namely, if x is ẞ (the start of the path in the pair-and-path group), you must apply x.toLowerCase().toUpperCase() to arrive at steady state "SS". For all other characters, the answer is yes.
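
The same condition as a sketch, with `settlesAfterOneStep` as a hypothetical helper name:

```javascript
// True when, after one mapping, alternating mappings just bounce
// between two fixed strings.
function settlesAfterOneStep(x) {
  const up = x.toUpperCase();
  const lo = x.toLowerCase();
  return (
    up === up.toLowerCase().toUpperCase() ||
    lo === lo.toUpperCase().toLowerCase()
  );
}

console.log(settlesAfterOneStep("\u00B5")); // true: Μ ↔ μ after one toUpperCase
console.log(settlesAfterOneStep("\u1F88")); // true: ἈΙ ↔ ἀι after one toUpperCase
console.log(settlesAfterOneStep("\u1E9E")); // false: ẞ needs two steps to reach SS
```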

What's the maximum number of strings one can reach by repeatedly applying `toUpperCase` and `toLowerCase` on a single character?

The answer is 4, and is always in the order of x, x.toLowerCase(), x.toLowerCase().toUpperCase(), x.toLowerCase().toUpperCase().toLowerCase(). x is either ẞ (ẞ → ß → SS → ss) or one of the 27 uppercase Greek letters with prosgegrammeni (e.g. ᾈ → ᾀ → ἈΙ → ἀι).
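
The chain can be listed with a short helper. This is a sketch: `caseChain` is a hypothetical name, and it only follows the lowercase-first order described above, dropping repeats.

```javascript
// Collect x, x.toLowerCase(), x.toLowerCase().toUpperCase(), and so on,
// keeping only strings that have not been seen yet.
function caseChain(x) {
  const chain = [x];
  let current = x;
  let lowerNext = true;
  for (let i = 0; i < 3; i++) {
    current = lowerNext ? current.toLowerCase() : current.toUpperCase();
    lowerNext = !lowerNext;
    if (!chain.includes(current)) chain.push(current);
  }
  return chain;
}

console.log(caseChain("\u1E9E").join(" → ")); // ẞ → ß → SS → ss
console.log(caseChain("\u1F88").join(" → ")); // ᾈ → ᾀ → ἈΙ → ἀι
console.log(caseChain("a").join(" → ")); // a → A
```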