Tuesday 10 February 2015

Madness from the C

So, I've decided to take the plunge.

C is so widely used that not being quite intimately acquainted with it is a definite hinderance. I can read C comfortably for the most part, ably wielding the source of a few kernels or utilities to track down bugs and determine exactly how features are used is something that's not beyond my remit.

I might actually try to write some C. Originally I was going to be all hipster and show how to build your web 2.0 service in C using FastCGI, because the 90s are still alive and well. Also, you can do some pretty awesome things relating to jailing strategies (e.g. chroot+systrace, chroot+seccomp or jail+capsicum), and the performance can be good.

Unfortunately, when it comes to writing C, I freeze up. I know there is so much left to the imagination; so much to fear; so much lurking in the shadows, waiting to consume your first born. And I hate giving out advice that could lead to some one waking up to find ETIMEDOUT written in blood on the walls; even if the "advice" is given in jest. (Thanks to Mickens for the wording)

I speak of the dreaded undefined behaviour.

Undefined behaviour is a double edged sword. In tricky situations (such as INT_MAX + 1), it allows the compiler to do as it pleases, for the purposes of simplicity and performance. However, this often leads to "bad" things happening to those who transgress in the areas of undefined behaviour.

I suggest that if you are a practicing C developer, and you don't think this is much of a problem, or you don't know much about it, you read both Regehr's "A Guide to Undefined Behaviour in C and C++" and Lattner's "What Every C Programmer Should Know About Undefined Behaviour" in full.

I was in the camp of "undefined behaviour's bad, but not too much of a problem since it can be easily avoided" camp, since I am more security-oriented than performance-oriented. I much prefer Ada to C.

That was, until I started in on Hatton's "Safer C".

The book is well written, clear, and direct.

Broadly speaking, all was going well, until I got to page 49. Page 49 contains a listing of undefined behaviour, as do pages 50, 51, 52, 53, 54, 55, and about one third of 56. That's over 7 pages of undefined behaviour.

It could be made better, if these were all weird, corner cases, that the compiler clearly could not detect and throw out as "obviously bad," and not even just semantically bad (for example, dereferencing a null pointer); but those that are syntactically bad.

Things like "The same identifier is used more than once as a label in the same function" (Table 2.2, entry 5) and "An unmatched ' or " character is encountered on a logical source line during tokenization" (Table 2.2, entry 4) and my current favourite, "A non-empty source file does not end in a new-line character, ends in a new-line character immediately preceded by a backslash character, or ends in partial preprocessing token or comment" (Table 2.2, entry 1).

Just look at those three. Syntactic errors, which, according to the standard, could bring forth the nasal demons.

Let me put it more bluntly. A conforming compiler may accept syntactically invalid C programs, and then emit (in effect) any code it wants.

Now, clearly, most compilers do not do this; they give define these undefined behaviours as syntax errors. The thing which really scares me is that the errors presented are from the first page of the table, and there's another six pages like that to go.

Further, I think I must come from a really luxurious background. I expect compilers to do everything reasonable to help program writers avoid obvious bugs, rather than simply making the compiler writers life easier.

A compiler is to be written for a finite set of hardware platforms. It needs to be used by countless other (probably less experienced) programmers to produce software which may then go on to be used by an unfathomable number of people. It's no small thing to claim that a large fraction of the world's population have probably interacted, in one way or another, with the OpenSSL code base, or the Linux kernel.

The reliability of these programs is directly correlated to the "helpfulness" of the language they are written in, and as such, C needs to be revisited with a view to "pinning down" many of the undefined behaviours; especially those which commonly lead to serious safety or security vulnerabilities.

No comments:

Post a Comment