Leaving .NET is a clueless thing to do

Every so often, someone writes an extensive blog post about why they left the .NET framework for greener pastures.

Invariably, the blog post contains complaints about other .NET devs’ inability to see beyond Microsoft Wonderland.

I agree with that complaint. Indeed, many people who code C# are wildly uncomfortable using open source projects, text editors, command lines, and so on. This is a problem and it’s bad for the .NET ecosystem.

What people should be doing instead is using the Microsoft goodies when they make sense and something else when they don’t. For example, instead of using ASP.NET MVC, it’s perfectly possible to make a web app with a C# backend running on Mono, talking to a Postgres server and a custom JavaScript frontend. Nobody says that if you code on .NET, you have to do it exactly like Scott Hanselman does it.

Similarly, when C# or .NET is simply not the best tool for a job, don’t use it. Quick prototypes may be much easier to churn out in Ruby. Self-hacked TCP protocols might be much easier to serve with Node.js.

What irks me is that all these people who complain about this in their .NET post-mortems are doing exactly the same thing. What in the world does it mean to “leave” .NET? I haven’t coded Python in the last 5 months. Does that mean I left Python? Should I now write an annoyed blog post about all the things that are wrong with Python and its community?

Of course not. Despite its shortcomings, a lot of stuff is good about Python, and exactly the same holds for .NET. In fact, as Microsoft itself is opening up, the navel-gazing parts of the .NET ecosystem may very well follow at some point.

C# is great for many purposes. For example, with C# 5’s async/await syntax, it’s one of the best options out there for high-performance asynchronous code. Just like Go has become a go-to language for high concurrency, C# may very well become a go-to language for high asynchronicity. Don’t ditch it just because an earlier career mistake means you never want to write WebForms again.

String types are fine. How about your code?

Someone on the internet says that strings are broken, and some more people on the internet disagree.

The problem is that all kinds of common operations on strings, such as counting the number of characters in a string, or converting a string to upper case, become a horrible mess when you want to support all the characters in the world (and then some). People discuss how well their favourite programming language solves this problem for them. The conclusion seems to be that no programming language does it perfectly, because that’s impossible without additional information (such as which language the string is in, and who it is for).

Text is for humans

I daresay the entire discussion is misguided, because strings can be used for different purposes, and you shouldn’t mix those purposes. In code I’ve seen, there are three categories of string use:

  1. Strings intended primarily for machine consumption (JSON keys, enum values being sent over the line, dynamic method calls, URLs, etc)
  2. Strings intended primarily for human consumption (UI texts, user input such as comments on a blog, names, and so on)
  3. Strings used for both at the same time

Category 3 is the problem. That’s where this entire discussion comes from. A programmer got names of places or people, titles of texts, phone book entries, and turned them into identifiers. He wants to ensure that if another user types the same word again, the same thing is found, so he thinks, “I know! I’ll lowercase everything and replace non-alphanumeric characters by ‘_’”. This goes fine until someone enters “ანბანი”.

Hashing prose

What’s really going on here is that the programmer got text from category 2, and wanted to transform it into something from category 1. This is perfectly OK, as long as you stick to two rules:

  • You can only go from human-only to machine-only, not back. Essentially, you’re writing a hash function.
  • Like with any hash function, you need to think about its uniqueness properties. If you need uniqueness but cannot reasonably guarantee it (which happens very quickly once you start converting characters to ‘_’), you need an additional unique identifier. This may have security implications, too.

Any code that does not follow these two rules automatically ends up in category 3, which is a code smell.

Transforming text from category 2 to category 1 is a rather common operation. For example, maybe you want to derive a pretty blog post URL from its title. If a user might write about C one day, and about C++ the other day, you either need to keep the pluses in the URL and end up with “C%2B%2B_rocks”, or you need additional information. This is why most blog and newspaper URLs contain text and an identifier.
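Here is a minimal sketch of such a one-way conversion in C#. The Slug class, the FromTitle method and the numeric post id are names I made up for illustration; they don’t come from any real blog engine. The text part is lossy on purpose, and the id carries the uniqueness.

```csharp
using System;
using System.Globalization;
using System.Text;

// Hypothetical helper: turns a human-readable title (category 2)
// into a machine-only URL segment (category 1). One way only.
public static class Slug
{
    public static string FromTitle(int postId, string title)
    {
        if (title == null) throw new ArgumentNullException(nameof(title));

        var sb = new StringBuilder();
        foreach (char c in title)
        {
            bool isAsciiAlnum = c < 128 && char.IsLetterOrDigit(c);
            if (isAsciiAlnum)
            {
                sb.Append(char.ToLowerInvariant(c));
            }
            else if (sb.Length > 0 && sb[sb.Length - 1] != '_')
            {
                // Collapse every run of other characters into a single '_'.
                sb.Append('_');
            }
        }

        string text = sb.ToString().Trim('_');
        string id = postId.ToString(CultureInfo.InvariantCulture);

        // The id guarantees uniqueness; the text is only there to look pretty.
        return text.Length == 0 ? id : id + "_" + text;
    }
}
```

With this sketch, Slug.FromTitle(1, "C rocks") and Slug.FromTitle(2, "C++ rocks") give 1_c_rocks and 2_c_rocks: the text parts collide, the ids don’t. A title like “ანბანი” degrades to just the id, which is still a perfectly valid URL segment.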

Google does something similar when you search for misspelled words. “Łódź” doesn’t sound like “Lodz” at all, but Google doesn’t care, to the great joy of the Łódź tourist board and all Poles who find themselves behind a non-Polish keyboard. Google needed to support a near-perfect conversion from category 2 (user input) to category 1 (indexed keywords). Because this is impossible, Google accepts that sometimes you get results that you don’t want.

It’s a one-way street

Any attempt to predictably go from machine-only text to human-only text is futile. Once a string has turned into an identifier, don’t try to get the original back. It’s a hash, you lose data. You may be able to find multiple human-readable texts that match a single identifier (such as Łódź and Lodz), and this might be useful if you’re building a search engine. In many cases, just don’t try.

The fun thing about this one-way street is that it’s a one-way street that you control. Whatever the user enters, you can set the assumptions for any strings that fall in category 1. You can ensure that machine-only strings contain only alphanumeric characters, or only ASCII characters, or only valid identifiers for your favourite programming language. You can clearly set these assumptions and then work with them. Once you work with a limited string alphabet, you can go wild on substrings and lowercasing and comparisons and all that, without much going wrong.

There is no requirement that JSON keys contain only [a-zA-Z_-]. Yet, nearly everybody appears to stick to this convention. Why? It’s machine-only data, so no need to make things complicated. String types in nearly every language, even ones with horrible Unicode support, are fine for use in category 1. Go wild! Strings are fine!

Human-readable text should not be touched.

In an odd kind of duality, there is often little need to change or analyse strings that fall in category 2. If you have a user interface in many languages, don’t “intelligently” uppercase words. It’ll go wrong. Have translators produce a string for both “Ok” and “ok”, if you need both. Trust people, not brittle string classes. Similarly, don’t be smart about transforming words, names and sentences that users input. Unless of course this is a core aspect of your product, like when you’re coding Google Maps and you want users to be able to search for both Tokyo and 東京.

For human-only text, you don’t want to do manipulations or analysis. Need a string length for correctly rendering a UI? You can use String.Length, but be aware that it can produce inaccurate results. If you need to be sure, you need to use the underlying rendering library and measure pixels or centimeters and not characters. Similarly, why would you ever need to take a substring of someone’s name, or a poem, or the Russian word for “Banana”?
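To make the String.Length caveat concrete, here is a small C# sketch. The TextLength class and its method names are mine; StringInfo is the standard class from System.Globalization.

```csharp
using System;
using System.Globalization;

// Hypothetical helper that contrasts two ways of "counting characters".
public static class TextLength
{
    // What String.Length reports: the number of UTF-16 code units.
    public static int CodeUnits(string s)
    {
        return s.Length;
    }

    // The number of text elements, which is closer to what a reader
    // would call a character.
    public static int TextElements(string s)
    {
        return new StringInfo(s).LengthInTextElements;
    }
}
```

For "e\u0301" (an e followed by a combining acute accent) and for "\U0001F600" (a single emoji), CodeUnits returns 2 while TextElements returns 1. Neither number tells you how wide the text will be on screen, which is why measuring with the rendering library remains the only safe option.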

The moment you feel like you need to perform these kinds of operations on strings for humans, there might be something the matter. Probably, you need to hash the string first. Go to machine-only strings.

If human-only strings, however, are only read and then displayed, then any sufficiently modern Unicode-supporting string class suffices, again. Indeed, yet again, strings are fine. Go wild!

Built-in datatypes

If there is any takeaway from this entire discussion, it may be that there is a need for multiple string types in strongly-typed languages: one for machine-only text, and at least one for human-only text. Such a human-only string could contain no common string operations at all, except for converting from and to byte streams in various encodings. Similarly, UI frameworks and template engines could make it difficult to display machine-only text, just like how modern HTML template engines help avoid XSS attacks.
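A rough sketch of what such a pair of types could look like in C#. MachineString and HumanText are hypothetical names, and the allowed alphabet is just one assumption you might pick; nothing like this ships with .NET.

```csharp
using System;
using System.Text;

// Hypothetical machine-only string: a restricted alphabet, enforced once.
public sealed class MachineString : IEquatable<MachineString>
{
    public string Value { get; }

    public MachineString(string value)
    {
        if (value == null) throw new ArgumentNullException(nameof(value));
        foreach (char c in value)
        {
            bool ok = (c >= 'a' && c <= 'z') || (c >= '0' && c <= '9')
                      || c == '_' || c == '-';
            if (!ok)
            {
                throw new ArgumentException("Not a machine-only string: " + value);
            }
        }
        Value = value;
    }

    // Within this alphabet, ordinal comparison is all you ever need.
    public bool Equals(MachineString other)
    {
        return !ReferenceEquals(other, null)
            && string.Equals(Value, other.Value, StringComparison.Ordinal);
    }

    public override bool Equals(object obj) { return Equals(obj as MachineString); }
    public override int GetHashCode() { return StringComparer.Ordinal.GetHashCode(Value); }
    public override string ToString() { return Value; }
}

// Hypothetical human-only text: opaque, with no substring, casing or length.
public sealed class HumanText
{
    private readonly string text;

    public HumanText(string text)
    {
        if (text == null) throw new ArgumentNullException(nameof(text));
        this.text = text;
    }

    public static HumanText FromUtf8(byte[] bytes)
    {
        return new HumanText(Encoding.UTF8.GetString(bytes));
    }

    public byte[] ToUtf8() { return Encoding.UTF8.GetBytes(text); }

    // Only meant to be handed to a UI or template engine for display.
    public string Render() { return text; }
}
```

new MachineString("post_42") is accepted, while new MachineString("Łódź") throws. A HumanText survives a round trip through ToUtf8 and FromUtf8 untouched, and that is about all you can do with it.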

Note: I read on Hacker News that Ruby actually does something like this: every string carries its own encoding. Declare a law in your Ruby shop that ASCII-encoded strings are to be treated as machine-only, and you’re most of the way there.

Epilogue: There’s no free lunch

Unfortunately, all of the above holds until you want to print a phone book for the entire world. Does Орёл sort before or after Oryel? They’re the same place name, just written differently. Any sort of human-understandable sorted list of things written in multiple languages gets really messy real fast. Fortunately, phone books have been largely replaced by search, and if you accept some false positives, search works better anyway. And always go to category 1 when searching.

Of course, if you only need ordering for some internal algorithm you have, you can probably afford to go to category 1 first. If not, maybe the actual ordering does not matter, as long as it is consistent.
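If all you need is a consistent order, an ordinal comparison is one way to get it in C#. The ConsistentOrder class is a made-up name; StringComparer.Ordinal is the standard comparer that compares UTF-16 code units and ignores culture entirely.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: an ordering that is stable across machines and
// cultures, with no claim of being meaningful to a human reader.
public static class ConsistentOrder
{
    public static List<string> Sort(IEnumerable<string> names)
    {
        var sorted = new List<string>(names);
        sorted.Sort(StringComparer.Ordinal);
        return sorted;
    }
}
```

Sorting “Орёл” and “Oryel” this way always puts Oryel first, simply because Latin letters have lower code points than Cyrillic ones. That is useless for a phone book, but fine for an internal algorithm.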

If you got this far, you’ll probably want to hire me as a consultant.

Using time? Inject a clock.

Your favourite language’s way to get the current time (e.g. new Date() in JS or DateTimeOffset.Now in C#) is lovely. It’s also a nasty global singleton, in that you can’t mock or stub it from tests.

The problem with directly referring to “the current time” from application code is that you often end up doing Thread.Sleep(...)s all over your automated test code. This makes tests slow and brittle. Now, an automated test with some busy waiting and ugly thread magic is better than no test at all, but if you have the chance, it’s better to avoid it.

I found that, in nearly all cases, it helps to inject a clock:

public interface IClock
{
    DateTimeOffset Now { get; }
}

This interface will allow you to very easily simulate time-derived behaviour. I usually use a RealClock that the application uses, and a FakeClock used by tests. They’re ridiculously simple too:

public class RealClock : IClock
{
    public DateTimeOffset Now { get { return DateTimeOffset.Now; } }
}

public class FakeClock : IClock
{
    private DateTimeOffset now;

    public FakeClock(DateTimeOffset startTime)
    {
        now = startTime;
    }

    public FakeClock()
    {
        now = DateTimeOffset.Now;
    }

    /// <summary>
    /// Gets or updates the clock's current time.
    /// </summary>
    public DateTimeOffset Now
    {
        get { return now; }
        set
        {
            if (value < now)
            {
                throw new InvalidOperationException("Can't decrease time.");
            }

            now = value;
        }
    }
}

The FakeClock could have been a lot simpler, but I chose to enforce that time never rewinds. Code assumes this more often than you'd think, and there's nothing wrong with that.

Using an injected clock, you get really nice test code for anything that depends on time changes, such as UI animations or hardware simulators.

Imagine a screen that has an alarm trigger which, when activated, stays on for exactly 5 minutes:

[Test]
public void AlarmShouldBeVisibleFor5Minutes()
{
    var clock = new FakeClock();
    // The screen has to receive the fake clock, otherwise advancing it has no effect.
    var screen = new SomethingScreen(clock);

    screen.ActivateAlarm();
    screen.Alarm.Enabled.ShouldBe(true);

    // verify that the alarm is still going 1 second before
    // the 5 minutes have passed.
    clock.Now += TimeSpan.FromSeconds(299);
    screen.Alarm.Enabled.ShouldBe(true);

    clock.Now += TimeSpan.FromSeconds(1);
    screen.Alarm.Enabled.ShouldBe(false);
}

The key point is, of course, in code like clock.Now += [something];. Even if you hate mutable state, you have to accept that time is inherently mutable. Simply updating the time from automated tests allows for a pretty simple way to deal with that.
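For completeness, here is one way the screen side could look. This is a sketch under assumptions: the Alarm class and the constructor that takes an IClock are my guesses at a reasonable shape, not the actual production code. IClock and a trimmed-down FakeClock are repeated so the block compiles on its own.

```csharp
using System;

public interface IClock
{
    DateTimeOffset Now { get; }
}

// Trimmed-down fake clock, without the never-rewind guard.
public class FakeClock : IClock
{
    public FakeClock() { Now = DateTimeOffset.Now; }
    public DateTimeOffset Now { get; set; }
}

// Hypothetical alarm that stays on for 5 minutes after activation.
public class Alarm
{
    private static readonly TimeSpan Duration = TimeSpan.FromMinutes(5);
    private readonly IClock clock;
    private DateTimeOffset? activatedAt;

    public Alarm(IClock clock)
    {
        this.clock = clock;
    }

    public void Activate()
    {
        activatedAt = clock.Now;
    }

    // Computed from the injected clock on every read: no timers, no sleeps.
    public bool Enabled
    {
        get
        {
            return activatedAt.HasValue
                && clock.Now - activatedAt.Value < Duration;
        }
    }
}

// Hypothetical screen that receives its clock through the constructor.
public class SomethingScreen
{
    public SomethingScreen(IClock clock)
    {
        Alarm = new Alarm(clock);
    }

    public Alarm Alarm { get; private set; }

    public void ActivateAlarm()
    {
        Alarm.Activate();
    }
}
```

Because Enabled asks the clock every time it is read, moving the fake clock forward by 299 seconds leaves the alarm on, and one more second turns it off. The application wires in a RealClock instead and nothing else changes.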