PrettyC

A utility to present C source code as HTML with syntax highlighting

News

There's now a program from the Free Software Foundation that does exactly the same thing as PrettyC. It's called "GNU Source-highlight."

Distribution

Inspiration

I'm very fond of several features found in modern editors, among them syntax highlighting and bracket matching. I've often wished for a little program that would take C source code as input and output valid HTML incorporating syntax highlighting. This would aid in the presentation of source code examples in webpages. At the very least, certain symbols must be replaced with escape codes; for instance, because HTML encloses tags in angle brackets ('>' and '<'), it's necessary to replace all occurrences of these characters with their respective escape codes, '&gt;' and '&lt;'. Similarly, because the ampersand introduces these escape codes, a literal '&' must itself be escaped, as '&amp;'. The full list of HTML "character entities" may be found in the HTML 4.0 specification.
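
The escaping step described above can be sketched in a few lines of C. This is only an illustration under stated assumptions, not PrettyC's actual code: the function name html_escape and its buffer-based interface are hypothetical, and it handles only the three characters '<', '>' and '&'.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not taken from PrettyC): copy src into dst,
 * replacing the characters HTML treats specially with their character
 * entities. Returns the number of bytes written, not counting the
 * terminating NUL, or (size_t)-1 if dst is too small. */
size_t html_escape(char *dst, size_t dstsize, const char *src)
{
    size_t used = 0;

    assert(dst != NULL && src != NULL);

    for (; *src != '\0'; src++) {
        char single[2];
        const char *rep;
        size_t len;

        switch (*src) {
        case '<': rep = "&lt;";  break;
        case '>': rep = "&gt;";  break;
        case '&': rep = "&amp;"; break;
        default:                 /* ordinary character: copy unchanged */
            single[0] = *src;
            single[1] = '\0';
            rep = single;
            break;
        }

        len = strlen(rep);
        if (used + len + 1 > dstsize)
            return (size_t)-1;   /* no room for the text plus the NUL */
        memcpy(dst + used, rep, len);
        used += len;
    }

    if (used + 1 > dstsize)
        return (size_t)-1;       /* only reachable for an empty src */
    dst[used] = '\0';
    return used;
}
```

Given the input line #include <stdio.h>, this sketch produces #include &lt;stdio.h&gt;. The ampersand is handled in the same pass as the brackets, so the entities the function itself emits are never escaped a second time.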

Recently I had reason to learn about GNU flex, a utility for generating lexers: programs that scan text looking for particular patterns. As an exercise both in learning flex and in producing a useful little utility, I decided to write my C-to-HTML program and thus complete a project first conceived on page 42 (Sunday, May 14, 2000) of my notebook. Flex might be overkill for this little project, but it's a good exercise, and it's easy to do.

Example Output

#include <stdio.h>

/* The canonical first program */

int main(int argc, char **argv) {
  printf("Hello,  World!\n");
  return 0;
}

Flex

Flex, from the Free Software Foundation, like its older brother Lex, from AT&T, is a tool for generating lexers, sometimes known as scanners. A lexer is a program that scans its input for specific patterns and performs a specific action when each pattern is detected. In our case, we want the lexer to look for elements like comments, numbers, strings, identifiers, reserved words, and the other syntactic elements of C programs.

To use flex, we make a file listing the patterns we want our scanner to look for, and the actions to be taken when patterns are matched. Then we supply this file to flex, and flex outputs a file called lex.yy.c. This C program contains a subroutine called yylex which, each time it is called, finds the next match (usually called a token) and performs the desired action. In our case, the desired action is to return a numeric constant telling us what kind of token the most recently encountered text is. Is it a string? A comment? Something else? The text itself (called the lexeme) is returned in a variable called yytext.
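
To make that calling convention concrete, here is a small hand-written stand-in in C. It is not flex output and not PrettyC's scanner: the names toy_yylex, toy_yytext and toy_set_input and the TOK_* codes are all hypothetical, and the stand-in reads from a string, whereas a flex-generated scanner reads from the yyin stream. It only illustrates the idea of "return a token code, leave the lexeme in a buffer."

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical token codes. 0 marks end of input, which is also what a
 * flex-generated yylex() returns there by default. */
enum { TOK_EOF = 0, TOK_NUMBER, TOK_IDENTIFIER, TOK_WHITESPACE, TOK_SYMBOL };

static const char *toy_input = "";   /* unread remainder of the input */
char toy_yytext[256];                /* lexeme of the latest token    */

/* Point the toy scanner at a NUL-terminated string. */
void toy_set_input(const char *text)
{
    assert(text != NULL);
    toy_input = text;
}

/* Return the code of the next token and copy its lexeme into toy_yytext.
 * There is no keyword table, so "return" comes back as an identifier. */
int toy_yylex(void)
{
    const char *start = toy_input;
    size_t len;
    int type;

    if (*toy_input == '\0')
        return TOK_EOF;

    if (isdigit((unsigned char)*toy_input)) {
        type = TOK_NUMBER;
        while (isdigit((unsigned char)*toy_input))
            toy_input++;
    } else if (isalpha((unsigned char)*toy_input) || *toy_input == '_') {
        type = TOK_IDENTIFIER;
        while (isalnum((unsigned char)*toy_input) || *toy_input == '_')
            toy_input++;
    } else if (isspace((unsigned char)*toy_input)) {
        type = TOK_WHITESPACE;
        while (isspace((unsigned char)*toy_input))
            toy_input++;
    } else {
        type = TOK_SYMBOL;           /* any other single character */
        toy_input++;
    }

    len = (size_t)(toy_input - start);
    if (len >= sizeof toy_yytext)
        len = sizeof toy_yytext - 1; /* truncate over-long lexemes */
    memcpy(toy_yytext, start, len);
    toy_yytext[len] = '\0';
    return type;
}
```

A caller would invoke toy_set_input once and then call toy_yylex in a loop until it returns TOK_EOF, reading toy_yytext after each call. With flex, the hand-written if/else chain is replaced by the list of patterns in the input file.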

All this and more is explained in the Flex manual.

HTML 4.0 Cascading Style Sheets

Once we identify the syntactic elements of the C program, how do we highlight them? We could just embed a whole bunch of <font> tags, but that strikes me as a horrible kludge. Fortunately, modern HTML provides a very cool facility known as cascading style sheets, which allow you to separate the content of an HTML document from its presentation, which is just the way it was intended to be. Not only that, but you can store presentation information in a single place, called a style sheet. Subsequently you can change that style sheet, and magically all your content that references it will be altered in appearance. Furthermore, you can have a hierarchy of style sheets, which can be very useful too.

How does all this work? Well, I'll only cover the little bit that I use here. The first thing is the <span> tag, which lets you assign a class to bits of text:

<span class="foo">this text is of class foo</span>

Next, you must make a style sheet. The style sheet resides in a file, usually with a .css extension, and must be served with a MIME type of text/css. The contents of this file might look something like this:

.foo { color: red; }

Finally, you must reference the style sheet from your HTML document by using the <link> tag:

<head>
<title>the title of this document</title>
<link rel="stylesheet" href="mystylesheet.css">
</head>

So what I did was create a stylesheet (called prettyc.css) defining properties for classes such as identifier, number, bracket, comment, string, whitespace, etc. All prettyc has to do is add the <span> tags to the text based upon the return value of yylex().
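
The wrapping step might be sketched as follows. This is a guess at the shape of the code, not PrettyC's source: the T_* token codes and the functions class_for_token and format_span are hypothetical, and the text passed in is assumed to have had its '<', '>' and '&' characters replaced by character entities already. Only the class names come from the stylesheet described here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical token codes, one per stylesheet class. */
enum token_type {
    T_COMMENT = 1, T_KEYWORD, T_PREPROCESS, T_IDENTIFIER, T_NUMBER,
    T_STRING, T_SYMBOL, T_BRACKET, T_WHITESPACE
};

/* Map a token code to its CSS class name, or NULL if it has none. */
const char *class_for_token(int type)
{
    switch (type) {
    case T_COMMENT:    return "comment";
    case T_KEYWORD:    return "keyword";
    case T_PREPROCESS: return "preprocess";
    case T_IDENTIFIER: return "identifier";
    case T_NUMBER:     return "number";
    case T_STRING:     return "string";
    case T_SYMBOL:     return "symbol";
    case T_BRACKET:    return "bracket";
    case T_WHITESPACE: return "whitespace";
    default:           return NULL;
    }
}

/* Write the token text into dst, wrapped in a <span> of the matching
 * class. Returns the length written, or -1 if dst is too small.
 * The text must already be entity-escaped. */
int format_span(char *dst, size_t dstsize, int type, const char *escaped)
{
    const char *cls = class_for_token(type);
    int n;

    assert(dst != NULL && escaped != NULL);
    assert(strpbrk(escaped, "<>") == NULL);  /* raw brackets not allowed */

    if (cls == NULL)   /* unknown token type: pass the text through as-is */
        n = snprintf(dst, dstsize, "%s", escaped);
    else
        n = snprintf(dst, dstsize, "<span class=\"%s\">%s</span>",
                     cls, escaped);

    return (n >= 0 && (size_t)n < dstsize) ? n : -1;
}
```

For a keyword token with the text return, this yields <span class="keyword">return</span>. Keeping the class lookup in one function means the token codes and the stylesheet's class names only have to agree in a single place.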

The stylesheet prettyc.css might look something like this:

.comment {color: green}
.keyword {color: blue}
.preprocess {color: red}
.identifier {color: black}
.number {color: orange}
.string {color: orange}
.symbol {color: black}
.bracket {color: red}
.whitespace {color: black}