On pinyin and Chinese romanization systems

I recently saw the following question one more time:

Why do Chinese people like making up new English names instead of using their real names?

This question has been around for a few decades and has mostly been treated as a cultural issue, but I recently came up with a new theory: it is the result of pinyin being such a damn bad system for phonetic transcription.

When we say "romanization", there are three different goals it could fulfill:

I'm going to do a step-by-step breakdown of the different transcription systems and judge which one makes the best choice for each phone. My source is Wikipedia's "Comparison of Standard Chinese transcription systems" (excluding EFEO and Lessing-Othmer, which are based on French and German orthographies rather than English). If this section is too technical for you, jump to the conclusion at the end.

Consonants

Let's start with the standard Mandarin phonetic inventory and the English phonetic inventory for onset consonants:

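|             | Labial         | Dental | Alveolar [1]  | Post-alveolar | Retroflex       | Palatal  | Velar    | Glottal |
| Nasal       | m 🇨🇳🇺🇸 [2]     |        | n 🇨🇳🇺🇸         |               |                 |          |          |         |
| Plosive     | p/b 🇨🇳🇺🇸 [3]   |        | t/d 🇨🇳🇺🇸       |               |                 |          | k/g 🇨🇳🇺🇸  |         |
| Affricate   |                |        | ts/dz 🇨🇳      | tʃ/dʒ 🇺🇸      | tʂ/dʐ 🇨🇳        | tɕ/dʑ 🇨🇳 |          |         |
| Fricative   | f 🇨🇳🇺🇸 / v 🇺🇸  | θ/ð 🇺🇸 | s 🇨🇳🇺🇸 / z 🇺🇸 | ʃ/ʒ 🇺🇸        | ʂ 🇨🇳; ʐ ~ ɻ 🇨🇳 | ɕ 🇨🇳     | x 🇨🇳 ~   | h 🇨🇳🇺🇸  |
| Approximant | w 🇨🇳🇺🇸         |        | l 🇨🇳🇺🇸         | ɹ 🇺🇸          |                 | j 🇨🇳🇺🇸    |          |         |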

Many consonants are shared, so it's no wonder that basically no one does these differently:

Consensus consonants
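| PY | TY | WG | Yale | GR | GR2 | IPA   |
| m  | m  | m  | m    | m  | m   | m     |
| n  | n  | n  | n    | n  | n   | n     |
| p  | p  | p' | p    | p  | p   | p     |
| b  | b  | p  | b    | b  | b   | b     |
| t  | t  | t' | t    | t  | t   | t     |
| d  | d  | t  | d    | d  | d   | d     |
| k  | k  | k' | k    | k  | k   | k     |
| g  | g  | k  | g    | g  | g   | g     |
| f  | f  | f  | f    | f  | f   | f     |
| s  | s  | s  | s    | s  | s   | s     |
| h  | h  | h  | h    | h  | h   | h     |
| l  | l  | l  | l    | l  | l   | l     |
| w  | w  | w  | w    | u  | w   | w ~ u |
| y  | y  | y  | y    | i  | y   | j ~ i |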

For the ones that are not shared, post-alveolar sounds are really similar to retroflex ones, so everyone follows the convention of approximating Chinese retroflexes with English post-alveolars: [tʂ] ≈ [tʃ] = ch, [ʂ] ≈ [ʃ] = sh, [ɻ] ≈ [ɹ] = r (WG uses [ʐ] ≈ [ʒ] = j instead, and since [ɻ] ~ [ʐ] is a free variation, this is not a bad choice either, although the pronunciation of j is ambiguous in English). However, this rule breaks down for [dʐ], which is supposed to be approximated as [dʒ]. I think English is to blame for not having a canonical spelling for [dʒ] (g is already taken by "hard g", which leaves j and dg). Yale and GR/GR2 still choose j in recognition of its [dʒ] pronunciation. WG chooses ch because it already picked ch' for [tʂ], and it sticks to its principle of spelling aspiration minimal pairs the same way. Pinyin and Tongyong choose zh and jh respectively; I think both are jarring, but jh is marginally more pronounceable because it at least has a j in it.

Pinyin is the only system that maintains internal consistency between alveolar and retroflex consonants, which is a nice principle to have, but it leads to some really awkward choices. Because we've decided that [s] = s and [ʂ] = sh, we've committed ourselves to the rule that "alveolar + h = retroflex". This gives us [ts] = c by virtue of [tʂ] = ch, and [dʐ] = zh by virtue of [dz] = z. But the results, c and zh, are both really jarring for English speakers.
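To make the "alveolar + h = retroflex" rule concrete, here is a minimal Python sketch. The dictionary and function names are my own illustrative choices, and the IPA keys follow the voicing notation used in this post rather than aspiration marks.

```python
# Pinyin spellings of the alveolar series, keyed by IPA (voicing notation).
ALVEOLAR_SPELLING = {"ts": "c", "dz": "z", "s": "s"}

# Each alveolar phone paired with its retroflex counterpart.
RETROFLEX_OF = {"ts": "tʂ", "dz": "dʐ", "s": "ʂ"}


def pinyin_retroflex_spellings() -> dict:
    """Derive every retroflex spelling by appending 'h' to the alveolar one."""
    return {
        RETROFLEX_OF[ipa]: spelling + "h"
        for ipa, spelling in ALVEOLAR_SPELLING.items()
    }


print(pinyin_retroflex_spellings())
# {'tʂ': 'ch', 'dʐ': 'zh', 'ʂ': 'sh'}
```

The rule is a single string concatenation, which is exactly why it is tidy; the cost is that c and zh fall out of it whether or not an English reader can guess them.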

On the other hand, pinyin gives up on internal consistency for fortis/lenis pairs: c and z are not a fortis/lenis pair in English, whereas s and z, or ts and dz, are. Yale maintains this consistency at least for [ts]/[dz], but still gives up for retroflex/palatal pairs. WG maintains the consistency throughout, but at the cost of using apostrophes.
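WG's aspiration principle can be sketched in a few lines of Python: each fortis/lenis pair shares one base spelling, and the apostrophe marks the aspirated member. This is only a rough illustration covering the pairs discussed so far (the plosives and the retroflex affricates); the names are hypothetical.

```python
# (pinyin fortis, pinyin lenis) -> shared Wade-Giles base spelling.
# Only the pairs covered so far in this post; other affricates are left out.
WG_BASE = {
    ("p", "b"): "p",
    ("t", "d"): "t",
    ("k", "g"): "k",
    ("ch", "zh"): "ch",
}


def pinyin_to_wade_giles() -> dict:
    """Map pinyin onsets to WG: fortis gets an apostrophe, lenis does not."""
    mapping = {}
    for (fortis, lenis), base in WG_BASE.items():
        mapping[fortis] = base + "'"
        mapping[lenis] = base
    return mapping


for pinyin, wg in pinyin_to_wade_giles().items():
    print(pinyin, "->", wg)
```

Dropping the apostrophe, as often happens in practice, collapses every pair into one spelling, which is the price of this design.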

The real chaos happens with the remaining phones: [ts], [dz], [tɕ], [dʑ], [ɕ], transcribed in pinyin as c, z, q, j, x respectively. These are also where my biggest complaints lie: I think none of these spellings (except possibly z and j) make sense.

Rhymes

Now let's look at rhyme transcription. The simple ones without medial glides and [y] are mostly consistent across all systems:

Consensus rhymes
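| PY  | TY  | WG         | Yale | GR  | GR2 | IPA   | Note             |
| a   | a   | a          | a    | a   | a   | a     |                  |
| ai  | ai  | ai         | ai   | ai  | ai  | ai    |                  |
| an  | an  | an         | an   | an  | an  | an    |                  |
| ang | ang | ang        | ang  | ang | ang | aŋ    |                  |
| ao  | ao  | ao         | au   | au  | au  | au    |                  |
| e   | e   | o/eh/ê [4] | e    | e   | e   | ɤ ~ e |                  |
| ei  | ei  | ei         | ei   | ei  | ei  | ei    |                  |
| en  | en  | ên         | en   | en  | en  | ən    |                  |
| eng | eng | êng        | eng  | eng | eng | əŋ    |                  |
| i   | i   | i          | i    | i   | i   | i     |                  |
| in  | in  | in         | in   | in  | in  | in    |                  |
| ing | ing | ing        | ing  | ing | ing | iŋ    |                  |
| o   | o   | o          | o    | o   | o   | o ~ ɔ | Rarely by itself |
| ong | ong | ung        | ung  | ong | ung | ʊŋ    |                  |
| ou  | ou  | ou         | ou   | ou  | ou  | ou    |                  |
| u   | u   | u          | u    | u   | u   | u     |                  |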

[y] doesn't exist in English, so the systems diverge in notation. Pinyin and WG use ü (but pinyin has a rule to drop the umlaut where there is no ambiguity, i.e., after the palatals [tɕ] q, [dʑ] j, [ɕ] x, [j] y, which is confusing). Tongyong and Yale use yu. GR/GR2 use iu. Personally I think ü represents it best if you speak German, but yu is better for the average English speaker and also easier to type (and indeed, pinyin lü falls back to lyu when ASCII is required, as on passports). [yn] is derived by adding n.
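Pinyin's umlaut rule is easier to see as code. The following is a minimal Python sketch of the behavior described above, not an official algorithm: the function name and the ascii_only flag are my own inventions, and the yu fallback follows the passport convention mentioned in this post.

```python
# Onsets after which pinyin writes [y] as plain "u" (the palatals, incl. y).
UMLAUT_DROPPING_ONSETS = {"j", "q", "x", "y"}


def pinyin_front_rounded(onset: str, coda: str = "", ascii_only: bool = False) -> str:
    """Spell a syllable whose vowel is [y], e.g. onset 'l' -> 'lü'."""
    if onset in UMLAUT_DROPPING_ONSETS:
        vowel = "u"   # no plain-u syllable competes here, so the dots are dropped
    elif ascii_only:
        vowel = "yu"  # assumed ASCII fallback, as in lü -> lyu
    else:
        vowel = "ü"
    return onset + vowel + coda


print(pinyin_front_rounded("l"))                   # lü
print(pinyin_front_rounded("l", ascii_only=True))  # lyu
print(pinyin_front_rounded("x", coda="n"))         # xun
```

The same letter u therefore stands for [u] in lu and for [y] in xu, and the reader has to know the onset class to tell which.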

Now consider the rhymes with a medial glide: [j], [w], [ɥ] (pinyin i, u, ü respectively). Overall, [j] is represented as i and [w] as u, except in Yale, which uses y and w respectively, recognizing that they are glides rather than vowels. [ɥ] is again as divergent as [y]: pinyin and WG keep using ü (again, pinyin drops the umlaut after the palatals); Tongyong keeps using yu; Yale switches to yw; GR/GR2 keep using iu.

The seven alveolar and retroflex affricates and fricatives can all form syllables on their own: [ts], [dz], [s], [tʂ], [dʐ], [ʂ], [ɻ] (with a neutral [ɨ] nucleus). Everyone recognizes this syllabicity by using a spelling distinct from i, except pinyin.

| System          | [ts], [dz], [s]                                       | [tʂ], [dʐ], [ʂ] | [ɻ] |
| Pinyin          | i                                                     | i               | i   |
| Tongyong        | ih                                                    | ih              | ih  |
| Wade–Giles      | ŭ (spells the consonant differently: tz', tz, ss)     | ih              | ih  |
| Yale            | z                                                     | r               | r   |
| Gwoyeu Romatzyh | y                                                     | y               | y   |
| GR2             | z                                                     | r               | r   |
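To see what the table means for whole syllables, here is a small Python sketch that glues each system's retroflex onset (from the consonant discussion earlier) to its syllabic spelling (the middle column of the table). It is an illustration under my own naming, and it covers only [tʂ], [dʐ], and [ʂ].

```python
SYSTEMS = ("PY", "TY", "WG", "Yale", "GR", "GR2")

# Onset spellings for the retroflex affricates and fricative, per system.
ONSET = {
    "tʂ": dict(zip(SYSTEMS, ("ch", "ch", "ch'", "ch", "ch", "ch"))),
    "dʐ": dict(zip(SYSTEMS, ("zh", "jh", "ch", "j", "j", "j"))),
    "ʂ": dict(zip(SYSTEMS, ("sh", "sh", "sh", "sh", "sh", "sh"))),
}

# Spelling of the syllabic nucleus after a retroflex onset.
NUCLEUS = dict(zip(SYSTEMS, ("i", "ih", "ih", "r", "y", "r")))


def spell_syllabic(onset_ipa: str) -> dict:
    """Spell onset + syllabic nucleus in every system."""
    return {s: ONSET[onset_ipa][s] + NUCLEUS[s] for s in SYSTEMS}


print(spell_syllabic("ʂ"))
print(spell_syllabic("dʐ"))
```

For [ʂ] this gives shi, shih, shih, shr, shy, shr: only pinyin produces something an English speaker would read with the vowel of "she".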

This part also explains why pinyin has to pick different consonants for the palatals: because it treats [i] and [ɨ] as allophones and writes both as i, it can't conflate j with zh; otherwise there would be no way to tell [dʐɨ] (zhi) apart from [dʑi] (ji)!
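Put differently, in pinyin the onset is the only thing that tells you how to read a bare i. A minimal Python sketch of that lookup follows; the function name is hypothetical, and it only handles syllables that consist of an onset plus i.

```python
# Pinyin onsets after which a bare "i" is the syllabic [ɨ] rather than [i].
SYLLABIC_ONSETS = {"z", "c", "s", "zh", "ch", "sh", "r"}


def bare_i_nucleus(onset: str) -> str:
    """Return the IPA nucleus that pinyin 'i' stands for after this onset."""
    return "ɨ" if onset in SYLLABIC_ONSETS else "i"


for onset in ("zh", "j", "s", "x"):
    print(onset + "i", "->", "[" + bare_i_nucleus(onset) + "]")
```

If zh and j were spelled the same, both would reach this function with the same argument and the distinction would be lost, which is the point made above.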

Conclusion

Now I've completed a run-through of all the different transcription systems. My takeaway is this: when no clear analog exists, pinyin consistently makes the least intuitive choice.

In today's world, I would rank the three use cases in the order above by decreasing importance, and pinyin fails the most important one the hardest, which is to help English speakers pronounce Chinese words correctly. (At the time of its creation, pedagogy and literacy were more important, but that has gradually faded; and as I said, it also doesn't succeed as the neatest pedagogical tool.) I recall a period when municipalities pushed really hard for naming all geographic landmarks in pinyin (subway stations, street names, etc.), like (contrived example):

  1. 人民公园站
  2. People's Park Station (full meaning preserved)
  3. Renmin Gongyuan Station (proper nouns not translated)
  4. Renmin Gongyuan Zhan (full phrase transliterated)

It received a lot of backlash from foreigners and locals alike, because virtually no one could figure out what the names meant. Locals have a hard time reading pinyin, and foreigners can't pronounce most of these. Of course, meaning-preserving translations may be better than transliterations, but if you want to transliterate, you should at least pick a system that people can pronounce. The awkward design of pinyin puts it in a situation where no one would gladly use it in fully English contexts, be it for people's names, place names, or quotes. Granted, its simplicity and rigor still make it apt for education and computer input methods, but I don't think it serves the purpose of communicating with speakers of other languages.

It's worth noting that I think most other transcription systems have their own quirks too, although most of them are more foreign-language-friendly than pinyin. With the understanding of Chinese phonology developed over the past half-century, it might be time to smooth the rough edges and create a more rational transcription system for Chinese.

Footnotes

  1. Wikipedia refers to the Mandarin alveolar consonants as "denti-alveolar" because the tongue touches the lower teeth. I think the difference is phonemically insignificant.

  2. It's another interesting question about what flag to use for each language. Especially English—there's an endless debate about whether to use the UK flag or the US flag. I don't think there's an ambiguity though; new Intl.Locale("en").maximize().toString() always returns en-Latn-US according to the Unicode Add Likely Subtags algorithm. But I digress.

  3. The fortis/lenis distinction is implemented in Mandarin via aspiration and in English via voicing. I just consistently use voicing for simplicity.

  4. o if onset is velar ([g], [k], [x]), eh if surface form is [e] (i.e., after y). WG in general uses ê for mid [ə]/[ɤ], eh for front [e], e for front [ɛ].