The problem is that all kinds of common operations on strings, such as counting the amount of characters in a string, or converting a string to upper case, become a horrible mess when you want to support all the characters in the world (and then some). People discuss how well their favourite programming language solves this problem for them. The conclusion seems to be that no programming language does it perfectly, because that’s impossible without additional information (such as which language the string is in, and who it is for).
Text is for humans
I daresay the entire discussion is misguided, because strings can be used for different purposes, and you shouldn’t mix those purposes. In code I’ve seen, there’s three categories of string use:
- Strings intended primarily for machine consumption (JSON keys, enum values being sent over the line, dynamic method calls, URLs, etc)
- Strings intended primarily for human consumption (UI texts, user input such as comments on a blog, names, and so on)
- Strings used for both at the same time
Category 3 is the problem. That’s where this entire discussion comes from. A programmer got names of places or people, titles of texts, phone book entries, and turned them into identifiers. He wants to ensure that if another user types the same word again, the same thing is found, so he thinks, “I know! I’ll lowercase everything and replace non-alphanumeric characters by ‘_'”. This goes fine until someone enters “ანბანი”.
What’s really going on here is that the programmer got text from category 2, and wanted to transform it into something from category 1. This is perfectly OK, as long as you stick to two rules:
- You can only go from human-only to machine-only, not back. Essentially, you’re writing a hash function.
- Like with any hash function, you need to think about its uniqueness properties. If you need uniqueness but cannot reasonably it (which happens very quickly once you start converting characters to ‘_’), you need an additional unique identifier. This may have security implications, too.
Any code that does not follow these two rules automatically ends up in category 3, which is a code smell.
Transforming text from category 2 to category 1 is a rather common operation. For example, maybe you want to derive a pretty blog post URL from its title. If a user might write about C one day, and about C++ the other day, you either need to keep the pluses in the URL and end up with “
C%2B%2B_rocks“, or you need additional information. This is why most blog and newspaper URLs contain text and an identifier.
Google does something similar when you search for misspelled words. “Łódź” doesn’t sound like “Lodz” at all, but Google doesn’t care, to great joy of the Łódź tourist board and all Poles who find themselves behind a non-Polish keyboard. Google needed to support a near-perfect conversion from category 2 (user input) to category 1 (indexed keywords). Because this is impossible, Google accepts that sometimes you get results that you don’t want.
It’s a one-way street
Any attempt to predictably go from machine-only text to human-only text is futile. Once a string has turned into an identifier, don’t try to get the original back. It’s a hash, you lose data. You may be able to find multiple human-readable texts that match a single identifier (such as Łódź and Lodz), and this might be useful if you’re building a search engine. In many cases, just don’t try.
The fun thing about this one-way street is that it’s a one-way street that you control. Whatever the user enters, you can set the assumptions for any strings that fall in category 1. You can ensure that machine-only strings contain only alphanumeric characters, or only ASCII characters, or only valid identifiers for your favourite programming language. You can clearly set these assumptions and then work with them. Once you work with a limited string alphabet, you can go wild on substrings and lowercasing and comparisons and all that, without much going wrong.
There is no requirement that JSON keys contain only
[a-zA-Z_-]. Yet, nearly everybody appears to stick to this convention. Why? It’s machine-only data, so no need to make things complicated. String types in nearly every language, even ones with horrible Unicode support, are fine for use in category 1. Go wild! Strings are fine!
Human-readable text should not be touched.
In an odd kind of duality, there is often little need to change or analyse strings that fall in category 2. If you have a user interface in many languages, don’t “intelligently” uppercase words. It’ll go wrong. Have translators produce a string for both “Ok” and “ok”, if you need both. Trust people, not brittle string classes. Similarly, don’t be smart about transforming words, names and sentences that users input. Unless of course this is a core aspect of your product, like when you’re coding Google Maps and you want users to be able to search for both Tokyo and 東京.
For human-only text, you don’t want to do manipulations or analysis. Need a string length for correctly rendering a UI? You can use
String.Length, but be aware that it can produce inaccurate results. If you need to be sure, you need to use the underlying rendering library and measure pixels or centimeters and not characters. Similarly, why would you ever need to take a substring of someone’s name, or a poem, or the Russian word for “Banana”?
The moment you feel like you need to perform these kinds of operations on strings for humans, there might be something the matter. Probably, you need to hash the string first. Go to machine-only strings.
If human-only strings, however, are only read and then displayed, then any sufficiently modern Unicode-supporting string class suffices, again. Indeed, yet again, strings are fine. Go wild!
If there is any takeaway from this entire discussion, it may be that there is a need for multiple string types in strongly-typed languages: one for machine-only text, and at least one for human-only text. Such a human-only string could contain no common string operations at all, except for converting from and to byte streams in various encodings. Similarly, UI frameworks and template engines could make it difficult to display machine-only text, just like how modern HTML template engines help avoid XSS attacks.
Note: I read on Hacker News that Ruby actually does something like this: it has one class per encoding. Declare a law in your Ruby shop that ASCII strings (plain old Ruby strings) are to be treated as machine-only, and you’re pretty far.
Epilogue: There’s no free lunch
Unfortunately, all of the above holds until you want to print a phone book for the entire world. Does Орёл sort before or after Oryel? They’re the same place name, just written differently. Any sort of human-understandable sorted list of things written in multiple languages gets really messy real fast. Fortunately, phone books have been largely replaced by search, and if you accept some false positives, search works better anyway. And always go to category 1 when searching.
Of course, if you only need ordering for some internal algorithm you have, you can probably afford to go to category 1 first. If not, maybe the actual ordering does not matter, as long as it is consistent.
If you got this far, you’ll probably want to hire me as a consultant.