String types are fine. How about your code?

Someone on the internet says that strings are broken, and some more people on the internet disagree.

The problem is that all kinds of common operations on strings, such as counting the number of characters in a string, or converting a string to upper case, become a horrible mess when you want to support all the characters in the world (and then some). People discuss how well their favourite programming language solves this problem for them. The conclusion seems to be that no programming language does it perfectly, because that’s impossible without additional information (such as which language the string is in, and who it is for).

Text is for humans

I daresay the entire discussion is misguided, because strings can be used for different purposes, and you shouldn’t mix those purposes. In code I’ve seen, there are three categories of string use:

  1. Strings intended primarily for machine consumption (JSON keys, enum values sent over the wire, dynamic method calls, URLs, etc)
  2. Strings intended primarily for human consumption (UI texts, user input such as comments on a blog, names, and so on)
  3. Strings used for both at the same time

Category 3 is the problem. That’s where this entire discussion comes from. A programmer gets names of places or people, titles of texts, or phone book entries, and turns them into identifiers. He wants to ensure that if another user types the same word again, the same thing is found, so he thinks, “I know! I’ll lowercase everything and replace non-alphanumeric characters with ‘_’.” This goes fine until someone enters “ანბანი”.
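To make the failure concrete, here is a minimal sketch of such a naive slugifier in Python (the function name and exact rules are illustrative, not from any particular codebase):

```python
import re

def slugify(title):
    # Lowercase, then collapse any run outside ASCII [a-z0-9] into '_'.
    slug = re.sub(r'[^a-z0-9]+', '_', title.lower())
    return slug.strip('_')

print(slugify("Hello, World!"))  # hello_world
print(slugify("ანბანი"))         # '' — every Georgian letter counted as "non-alphanumeric"
```

The Georgian title collapses into an empty slug: the identifier carries no information at all, and every Georgian title collides with every other one.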

Hashing prose

What’s really going on here is that the programmer got text from category 2, and wanted to transform it into something from category 1. This is perfectly OK, as long as you stick to two rules:

  • You can only go from human-only to machine-only, not back. Essentially, you’re writing a hash function.
  • Like with any hash function, you need to think about its uniqueness properties. If you need uniqueness but cannot reasonably guarantee it (which happens very quickly once you start converting characters to ‘_’), you need an additional unique identifier. This may have security implications, too.

Any code that does not follow these two rules automatically ends up in category 3, which is a code smell.

Transforming text from category 2 to category 1 is a rather common operation. For example, maybe you want to derive a pretty blog post URL from its title. If a user might write about C one day, and about C++ the other day, you either need to keep the pluses in the URL and end up with “C%2B%2B_rocks“, or you need additional information. This is why most blog and newspaper URLs contain text and an identifier.
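A sketch of why that identifier is needed, assuming a naive lowercase-and-underscore slug function (illustrative only): two different titles can collide, so a numeric post ID has to carry the uniqueness while the slug is mere decoration.

```python
import re

def slugify(title):
    # Naive: lowercase, collapse runs outside [a-z0-9] into '_'.
    return re.sub(r'[^a-z0-9]+', '_', title.lower()).strip('_')

def post_url(post_id, title):
    # The ID guarantees uniqueness; the slug only helps humans and search engines.
    return f"/posts/{post_id}/{slugify(title)}"

print(slugify("C++ rocks"))       # c_rocks
print(slugify("C rocks"))         # c_rocks — collision!
print(post_url(42, "C++ rocks"))  # /posts/42/c_rocks
```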

Google does something similar when you search for misspelled words. “Łódź” doesn’t sound like “Lodz” at all, but Google doesn’t care, to the great joy of the Łódź tourist board and all Poles who find themselves behind a non-Polish keyboard. Google needed to support a near-perfect conversion from category 2 (user input) to category 1 (indexed keywords). Because this is impossible, Google accepts that sometimes you get results that you don’t want.
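One common folding technique can be sketched in Python using Unicode decomposition to strip accents; note that it already stumbles on “Ł”, whose stroke is part of the letter rather than a combining mark — a small illustration of why the conversion can never be perfect:

```python
import unicodedata

def fold(text):
    # Decompose accented characters, then drop the combining marks (category Mn).
    decomposed = unicodedata.normalize('NFKD', text)
    return ''.join(c for c in decomposed
                   if unicodedata.category(c) != 'Mn')

print(fold("café"))  # cafe
print(fold("Łódź"))  # Łodz — 'Ł' has no decomposition, so the stroke survives
```

Real search engines layer language-specific rules (and accepted fuzziness) on top of mechanical folding like this.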

It’s a one-way street

Any attempt to predictably go from machine-only text to human-only text is futile. Once a string has turned into an identifier, don’t try to get the original back. It’s a hash, you lose data. You may be able to find multiple human-readable texts that match a single identifier (such as Łódź and Lodz), and this might be useful if you’re building a search engine. In many cases, just don’t try.

The fun thing about this one-way street is that it’s a one-way street that you control. Whatever the user enters, you can set the assumptions for any strings that fall in category 1. You can ensure that machine-only strings contain only alphanumeric characters, or only ASCII characters, or only valid identifiers for your favourite programming language. You can clearly set these assumptions and then work with them. Once you work with a limited string alphabet, you can go wild on substrings and lowercasing and comparisons and all that, without much going wrong.
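A minimal sketch of setting such an assumption in Python (the function and alphabet are my own illustrative choices): validate at the boundary, then operate freely within the restricted alphabet.

```python
import re

def assert_machine_only(s):
    # Fail loudly at the boundary instead of corrupting data later on.
    if not re.fullmatch(r'[a-z0-9_]+', s):
        raise ValueError(f"not a machine-only string: {s!r}")
    return s

key = assert_machine_only("blog_post_42")
# Inside the restricted alphabet, the usual operations are safe:
print(key.upper())             # BLOG_POST_42
print(key.startswith("blog"))  # True
```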

There is no requirement that JSON keys contain only [a-zA-Z_-]. Yet, nearly everybody appears to stick to this convention. Why? It’s machine-only data, so no need to make things complicated. String types in nearly every language, even ones with horrible Unicode support, are fine for use in category 1. Go wild! Strings are fine!

Human-readable text should not be touched

In an odd kind of duality, there is often little need to change or analyse strings that fall in category 2. If you have a user interface in many languages, don’t “intelligently” uppercase words. It’ll go wrong. Have translators produce a string for both “Ok” and “ok”, if you need both. Trust people, not brittle string classes. Similarly, don’t be smart about transforming words, names and sentences that users input. Unless of course this is a core aspect of your product, like when you’re coding Google Maps and you want users to be able to search for both Tokyo and 東京.

For human-only text, you don’t want to do manipulations or analysis. Need a string length for correctly rendering a UI? You can use String.Length, but be aware that it can produce inaccurate results. If you need to be sure, use the underlying rendering library and measure pixels or centimeters, not characters. Similarly, why would you ever need to take a substring of someone’s name, or a poem, or the Russian word for “Banana”?
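For instance, in Python, where len counts code points, the decomposed spelling of “noël” reports five “characters” even though a human sees four — which is why measuring pixels, not counting characters, is the right tool for layout:

```python
# "noël" with the diaeresis as a separate combining code point (U+0308).
s = "noe\u0308l"
print(s)       # noël
print(len(s))  # 5 — code points, not what a human would count
```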

The moment you feel like you need to perform these kinds of operations on strings for humans, something is probably wrong. Most likely, you need to hash the string first: go to machine-only strings.

If, however, human-only strings are only read and then displayed, any sufficiently modern Unicode-supporting string class suffices. Yet again, strings are fine. Go wild!

Built-in datatypes

If there is any takeaway from this entire discussion, it may be that there is a need for multiple string types in strongly-typed languages: one for machine-only text, and at least one for human-only text. Such a human-only string could contain no common string operations at all, except for converting from and to byte streams in various encodings. Similarly, UI frameworks and template engines could make it difficult to display machine-only text, just like how modern HTML template engines help avoid XSS attacks.
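A minimal sketch of what such a split could look like in Python (the class names and rules are my own invention, not an existing API): the machine type enforces a restricted alphabet, while the human type offers nothing but encoding and decoding.

```python
import re

class MachineString:
    """Restricted alphabet; safe to compare, slice, and lowercase."""
    def __init__(self, value):
        if not re.fullmatch(r'[a-zA-Z0-9_-]*', value):
            raise ValueError(f"not machine-safe: {value!r}")
        self.value = value

class HumanText:
    """Opaque text destined for display; only byte conversion is offered."""
    def __init__(self, value):
        self._value = value

    def to_bytes(self, encoding='utf-8'):
        return self._value.encode(encoding)

    @classmethod
    def from_bytes(cls, data, encoding='utf-8'):
        return cls(data.decode(encoding))

MachineString("json_key-1")   # fine
HumanText("Łódź").to_bytes()  # fine
# MachineString("Łódź") would raise ValueError at the type boundary.
```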

Note: I read on Hacker News that Ruby actually does something like this: every string instance carries its own encoding. Declare a law in your Ruby shop that ASCII-encoded strings are to be treated as machine-only, and you’re pretty far.

Epilogue: There’s no free lunch

Unfortunately, all of the above holds until you want to print a phone book for the entire world. Does Орёл sort before or after Oryel? They’re the same place name, just written differently. Any sort of human-understandable sorted list of things written in multiple languages gets really messy really fast. Fortunately, phone books have been largely replaced by search, and if you accept some false positives, search works better anyway. And always go to category 1 when searching.

Of course, if you only need ordering for some internal algorithm you have, you can probably afford to go to category 1 first. If not, maybe the actual ordering does not matter, as long as it is consistent.

If you got this far, you’ll probably want to hire me as a consultant.


14 thoughts on “String types are fine. How about your code?”

  1. You seem to be saying that there should be two distinct types: “immutable text” and “sequence of characters for automatic processing”. I agree. But this means that string types are not fine since they are neither of these types (while trying to be both).

    • True, but with programmer discipline, you can still use them for both distinct purposes. What I mean is: the fact that string types in most languages can’t correctly reverse “noël” is not a problem, and as such, they’re functionally sufficient for building any kind of software.

      I agree that it is often a good idea to build a good type system so that you need less programmer discipline. This takes time, though. Just like “goto” was first commonplace, then considered a code smell, and only then got removed from popular languages, we can’t go and demand string classes that allow and disallow exactly the right thing right away. We have to start by calling things a code smell, learn the problems with that, and then, finally, at some point, design better string types.

      I’d love to see an attempt at a string library that enforces rules like this, and I’d definitely try it, but I bet that it’s not as trivial to make the right choices as I hope. The problem is that once the string library authors want to correct a mistake, they’ll often need to break backwards compatibility.

  2. Similarly, why would you ever need to take a substring of someone’s name, or a poem, or the Russian word for “Banana”?

    Because I’m writing a program that requires the user to answer “What is the Russian word for ‘Banana?'” and want to give them a hint.

    • You can just write it backwards yourself and not worry about having your broken language make the effort of reversing a complex Unicode string.

  3. “If there is any takeaway from this entire discussion, it may be that there is a need for multiple string types in strongly-typed languages: one for machine-only text, and at least one for human-only text. Such a human-only string could contain no common string operations at all…”

    What you’re describing is called a symbol, which McCarthy put in his programming language in 1958, and almost every language since then (notable exception: Ruby) has completely ignored. You’re absolutely correct that this would help make the situation much better.

    Similarly, “character” once meant “a small fixint”, but today in almost all modern programming languages, it’s a distinct type for storing character data. Splitting semantically different types into syntactically different types is progress. Using strings for symbols was always a bit of a kluge.

    In a sense, saying that “string types are broken” is isomorphic to saying that “virtually every programming language in the world lacks a symbol type, and thus tries to shoehorn its string type into both a string and a symbol, and does both poorly”. But is that helpful?

    The problem here is that saying “X is broken”, and providing a simple case where it clearly fails (e.g., “noël” reversed is not what any sane person would expect, unless they knew a lot about Unicode encodings and how the compiler/runtime stores strings) is exactly how 99% of bugs in the real world get fixed. Saying “your programming language should make a philosophical change about how it stores data” (i.e., add a symbol type, and retrofit the standard library to use it) is how you get ignored. Especially (voice of experience here) if you’re suggesting that it be more like Lisp, even if you don’t use the L-word.

    You seem to be proposing a third option, “everybody just get more disciplined”. I don’t have enough fingers to count the number of times I’ve heard that proposed (opcode selection, register allocation, malloc/free, integer overflow, syscall return values, …). Has it ever worked? I only see programmers running away from such things, and a very few apologists insisting that there’s some unspoken benefit in staying in a painfully low level on the abstraction scale. It’s a technical solution that ignores all social factors. When it comes down to something that a computer can do, the computer should just do it. (And these Unicode issues can be fixed: you may not see a big need to reverse “noël” today, but it’s not that hard to make it return “lëon”, in accordance with the Principle of Least Surprise.) It’s much easier to fix this once in a compiler, than require everybody to fix it in their programs.

    Don’t get me wrong. I’d love to see everybody add an actual symbol type to their programming languages, which would avoid 99% of these issues in the first place, but I’ll bet you all the bits on my SSD that we’ll see many/most of these Unicode issues fixed within the next year, and zero progress made on adding symbols to any existing languages. Technical progress is always much easier than social progress.

  4. So, how does this fit for processing user input? Input is 2, but it necessarily needs to be converted to 1 in order to be manipulated. Am I right in taking away from this that you need to keep 1 and 2 separate, and additionally ensure you define the transformation explicitly?

    Example: writing a game where a person enters a word, you scramble or break up that word and present it to another user and they have to guess the word. Is this impossible to do generally?

    • “Is this impossible to do generally?” It is impossible to use a single language-agnostic algorithm to do that correctly.

      If you restrict what languages are accepted or use multiple algorithms, then you could still write a program to correctly handle as many languages as you wish.

  5. For the most part I agree with you. But there are some cases where you’d genuinely need to do a substring on user text. For example, the case of automatically paginating a huge article. Think about displaying it on small screens, limited-memory devices (read: mobile phones) and laptops. There are ways to do it right (well, almost right), but I’m just saying there are cases when you need to manipulate human text.

  6. It would be foolish on anyone’s part to expect String to behave perfectly with all those special characters from foreign languages, and that’s why we have character encodings and all. I agree that you should look at your code rather than complain about whether String is fine or not.
