In search for a good programming language: R
This is a long-overdue piece, a follow-up to the previous one written 16 months ago. I had grand plans at the time to cover all the languages I was fluent in, but it turned out I didn't have enough exposure to form sharp opinions on most of them. Times have changed: after four years in statistics, plus retroactively doing the 2023 Advent of Code in R, I can now talk about R with some confidence. I still don't write enough R to claim that I know it well, but I doubt more than a dozen people in the world can honestly call themselves R experts, so it's still meaningful to offer opinions from the POV of a casual user.
My first impression of R is that it's Python, batteries-included. Where you would need NumPy, Pandas, Matplotlib, or Jupyter, you can just use R's built-in data frames, plotting functions, and REPL. This does avoid issues with ecosystem fragmentation. But on the other hand, R manages to introduce fragmentation elsewhere: for example, it's the only language I know where you can find all of stopifnot, setNames, data.frame, and seq_len in the same standard library; no one knows the correct casing of functions. There are also multiple object systems (S3, S4, R6) with different semantics and use cases. Overall, this seems to be a language randomly thrown together with no particular design philosophy, and the diversity of the standard library is merely a consequence of that.
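The naming zoo is easy to demonstrate with base R alone; all four functions mentioned above ship in the standard library, each following a different convention:

```r
# dot.case, camelCase, snake_case, and plain lowercase all coexist in base R
stopifnot(1 + 1 == 2)                        # lowercase, no separator
v <- setNames(seq_len(3), c("a", "b", "c"))  # camelCase wrapping snake_case
df <- data.frame(key = names(v), value = v)  # dot.case
v[["b"]]  # 2
```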
Speaking of the standard library, one might think it has everything needed for data analysis. But no: if you try to do anything slightly more complicated than transforming data frames and plotting, you will run into some fundamental limitations of the core language. Here are some rants from my AoC experience:
- No built-in hash maps or sets. String keys into lists are secretly linear under the hood. Even when you use hacks like `new.env()`, you still have to serialize complex keys into strings.
- No Int64, like at all?? I had to load gmp to use 64-bit integers (of which there were a lot that year!).
- Compared to all of the above, "no priority queues" and "no queues" seem like trivial problems, especially since they are solved by the collections package.
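For the record, the standard workaround for the missing hash map is an environment, which does hash its keys—but only string keys, so composite keys must be flattened first. A minimal sketch:

```r
# Environments are the closest thing base R has to a hash map,
# but keys must be strings, so composite keys get serialized.
h <- new.env(hash = TRUE, parent = emptyenv())
key <- paste(3, 4, sep = ",")      # a coordinate pair flattened to "3,4"
assign(key, "visited", envir = h)
get(key, envir = h)                # "visited"
exists("5,6", envir = h)           # FALSE
```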
R has a very interesting story with types. I'm no stranger to dynamically typed scripting languages, but R is the first in which I truly feel lost tracking types. Most remarkable is the idea that "everything is a vector", which makes some code very elegant and other code very hard to reason about. For example, I can never figure out whether something is a list, a vector, a matrix, a data frame, or a mix of all of them. The best solution in today's world is to paste the code (or the printed output) into ChatGPT and ask "what's the type of this variable and how do I index into it".
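A few base R one-liners illustrate why tracking types feels hopeless; every line below is standard, documented behavior:

```r
df <- data.frame(x = 1:3, y = 4:6)
is.list(df)         # TRUE: a data frame is secretly a list of columns
df[, "x"]           # an integer vector: one-column selection drops dimensions
df["x"]             # still a data frame
df[["x"]]           # an integer vector again
m <- matrix(1:6, nrow = 2)
m[1, ]              # a matrix row silently becomes a plain vector
is.vector(list())   # TRUE: even a list counts as a vector
```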
Then there are quirky behaviors and semantics that freak out experienced coders but that statisticians seem to embrace without complaint:
- NSE (non-standard evaluation). There are ample resources online explaining how it works. Personally, I find callbacks much more explicit (for example, in JavaScript you would write `subset(df, (row) => row.x > 5)` instead of `subset(df, x > 5)`), but if you read each function call as a template rather than something you need to understand piecemeal, it does make your code more straightforward to read.
- 1-indexing. This is not a problem in itself, but it makes translating algorithms a bit harder, especially ones where the list indices are intended as keys rather than arbitrary positions.
- Secret CoW (copy-on-write). Hidden performance costs and subtle bugs everywhere. I always have to deep-modify like `a[[i]][[j]][k] <- foo`, without saving any intermediate variables.
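To make the NSE point concrete, here is base R's `subset` next to the explicit indexing it stands for (`df` is a throwaway example):

```r
df <- data.frame(x = 1:10)
# NSE: `x > 5` is not evaluated at the call site; subset() captures
# the unevaluated expression and evaluates it with df's columns in scope.
subset(df, x > 5)
# The explicit, NSE-free equivalent:
df[df$x > 5, , drop = FALSE]
```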
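And the copy-on-write point (strictly, R's semantics is copy-on-modify): pulling out an intermediate variable gives you a copy, so a nested update only sticks if the whole access chain stays on the left-hand side of the assignment:

```r
a <- list(list(c(1, 2, 3)))
tmp <- a[[1]]          # conceptually a copy, not a reference
tmp[[1]][2] <- 99      # modifies only the copy
a[[1]][[1]][2]         # still 2: the original is untouched
a[[1]][[1]][2] <- 99   # the full chain must appear on the left-hand side
a[[1]][[1]][2]         # now 99
```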
One cannot talk about R without talking about its ecosystem. I've constantly asked people why they choose R over Python, and the answer is always "because there's package X which I need and which doesn't exist in Python". Even though ggplot2 and the Tidyverse are largely substitutable in Python with pandas, seaborn, matplotlib, etc., their syntactic beauty is unmatched. Causal inference, mixed-effects models, time series analysis, domain-specific data, and more simply have no good Python alternatives.1 CRAN is a fairly high-quality registry compared to common targets of criticism like PyPI, npm, or crates.io; a major reason academia prefers R is that package quality is endorsed by registry policies. On the other hand, R's project-management practices (e.g., renv) see limited adoption (just like everything else that involves engineering), and everyone just uses install.packages, making reproducibility unreliable across the board. Nevertheless, every day that R still possesses these irreplaceable packages is a day people will keep using R.
R is an example of what clueless language designers can produce when they have a specific use case in mind but little engineering background. It hits all the right notes for data analysis, but clearly not enough thought went into its philosophy2. It builds its moat on early academic adoption and path dependency, but I have doubts about it holding that niche in the future. It so happens that R climbed back into the TIOBE top 10 in January and February 2026, so I really hope to stand corrected.