Tru64 UNIX
Compaq C Language Reference Manual




Chapter 1
Lexicon

C, like any language, uses a standard grammar and character set. The specific elements that make up this grammar and character set are described in the sections of this chapter.

C compilers interpret source code as a stream of characters from the source file. These characters are grouped into tokens, which can be punctuators, operators, identifiers, keywords, string literals, or constants. A token is the smallest lexical element of the language. The compiler forms the longest token possible from a given string of characters; the token ends when white space is encountered or when the next character cannot be part of the token.

White space can be a space character, new-line character, tab character, form-feed character, or vertical tab character. Comments are also considered white space. Section 1.1 lists all the white space characters. White space is used as a token separator (except within quoted strings), but is otherwise ignored in the character stream, and is used mainly for human readability. White space may also be significant in preprocessor directives (see Chapter 8).

Consider the following source code line:


static int x=0;  /* Could also be written "static int x = 0;"   */ 

The compiler breaks the previous line into the following tokens (shown one per line):


static 
int 
x 
= 
0 
; 

As the compiler processes the input character stream, it identifies tokens and locates error conditions. The compiler can identify three types of errors:

Logical errors are not identified by the compiler.

An important concept throughout C is the idea of a compilation unit, which is one or more files compiled by the compiler.

Note

The ANSI C standard refers to compilation units as translation units. This text treats these terms as equivalent.

The smallest acceptable compilation unit is one external definition. The ANSI C standard defines several key concepts in terms of compilation units. Section 2.2 discusses compilation units in detail.

A compilation unit with no declarations is accepted with a compiler warning in all modes except for the strict ANSI standard mode.

1.1 Character Set

A character set defines the valid characters that can be used in source programs or interpreted when a program is running. The source character set is the set of characters available for the source text. The execution character set is the set of characters available when executing a program. The source character set does not necessarily match the execution character set; for example, when the execution character set is not available on the devices used to produce the source code.

Different character sets exist; for example, one character set is based on the American Standard Code for Information Interchange (ASCII) definition of characters, while another set includes the Japanese kanji characters. The character set in use makes no difference to the compiler; each character simply has a unique value. C treats each character as a different integer value. The ASCII character set has 128 characters, each of which can be represented in 7 bits and therefore fits in an 8-bit byte. However, in some extended character sets, so many characters exist that the representation of some characters requires more than 8 bits. A special type, called the wchar_t (or wide character) type, was created to accommodate these larger characters. Section 1.8.3.1 discusses wide characters further.

Most ANSI-compatible C compilers accept the following ASCII characters for both the source and execution character sets. Each ASCII character corresponds to a numeric value. Appendix C lists the ASCII characters and their numeric values.

In character constants and string literals, characters from the execution character set can also be represented by character or numeric escape sequences. Section 1.8.3.3 and Section 1.8.3.4 describe these escape sequences.

The ASCII execution character set also includes the following control characters:

The null character is a byte or wide character with all bits set to 0. It is used to mark the end of a character string. Section 1.7 discusses character strings in more detail.

The new-line character splits the source character stream into separate lines for greater legibility and for proper operation of the preprocessor.

Sometimes a line longer than the terminal or window width must be interpreted by the compiler as one logical line. One logical line can be typed as two or more lines by appending the backslash character ( \ ) to the end of the continued lines. The backslash must be immediately followed by a new-line character. The backslash signifies that the current logical line continues on the next line. For example:


#define ERROR_TEXT "Your entry was outside the range of \
0 to 100." 

The compiler deletes the backslash character and the adjacent new-line character during processing, so that this line becomes one logical line, as follows:


#define ERROR_TEXT "Your entry was outside the range of 0 to 100." 

A long string can be continued across multiple lines by using the backslash-newline line continuation feature, but the continuation of the string must start in the first position of the next line. In some cases, this destroys the indentation scheme of the program. The ANSI C standard introduces another string continuation mechanism to avoid this problem. Two string literals, with only white space separating them, are combined to form one logical string literal. For example:


printf ("Your entry was outside the range of " 
        "0 to 100.\n"); 

The maximum logical line length is 32,767 characters.

1.1.1 Trigraph Sequences

To write C programs using character sets that do not contain all of C's punctuation characters, ANSI C allows the use of nine trigraph sequences in the source file. These three-character sequences are replaced by a single character in the first phase of compilation. (See Section 2.16 for an explanation of compilation phases.) Table 1-1 lists the valid trigraph sequences and their character equivalents.

Table 1-1 Trigraph Sequences
Trigraph Sequence    Character Equivalent
??=                  #
??(                  [
??/                  \
??)                  ]
??'                  ^
??<                  {
??!                  |
??>                  }
??-                  ~

No other trigraph sequences are recognized. A question mark (?) that does not begin a trigraph sequence remains unchanged during compilation. For example, consider the following source line:


printf ("Any questions???/n"); 

After the ??/ sequence is replaced, this line is translated as follows:


printf ("Any questions?\n"); 

1.1.2 Digraph Sequences

Digraph processing is supported when compiling in ISO C 94 mode (/STANDARD=ISOC94 on OpenVMS systems).

Digraphs are pairs of characters that translate into a single character, much like trigraphs, except that trigraphs get replaced inside string literals, but digraphs do not. Table 1-2 lists the valid digraph sequences and their character equivalents.

Table 1-2 Digraph Sequences
Digraph Sequence    Character Represented
<:                  [
:>                  ]
<%                  {
%>                  }
%:                  #
%:%:                ##

1.2 Identifiers

An identifier is a sequence of characters that represents a name for the following:

The following rules apply to identifiers:

An identifier without external linkage has at most 32,767 significant characters. An identifier with external linkage has 1023 significant characters on Tru64 UNIX systems and 31 significant characters on OpenVMS systems. (Section 2.8 describes linkage in more detail.) Case is not significant in external identifiers on OpenVMS systems.

Identifiers that differ within their significant characters are different identifiers. If two or more identifiers differ in nonsignificant characters only, they are treated as the same identifier.

1.3 Comments

The /* character combination introduces a comment and the */ character combination ends a comment, except within a character constant or string literal.

Comments cannot be nested; once a comment is started, the compiler treats the first occurrence of */ as the end of the comment.

Avoid using the /* and */ sequences to comment out sections of code. Because comments do not nest, this approach works only for code sections that contain no comments themselves. A better method is to use the #if and #endif preprocessor directives, as in the following example:


#if 0 
/*  This code is excluded from execution because ...  */ 
code_to_be_excluded (); 
#endif 

See Chapter 8 for more information on the preprocessing directives #if and #endif.

Comments cannot span source files. Within a source file, comments can be of any length and are interpreted as white space by both the compiler and the preprocessor.

1.4 Keywords

C defines several keywords, each with special meaning to the compiler. Keywords identify statement constructs and specify basic types and storage classes. Keywords cannot be used as identifiers and cannot be declared.

Table 1-3 lists the C keywords.

Table 1-3 Keywords
auto        double    int         struct
break       else      long        switch
case        enum      register    typedef
char        extern    return      union
const       float     short       unsigned
continue    for       signed      void
default     goto      sizeof      volatile
do          if        static      while

In addition to the keywords listed in Table 1-3, the compiler reserves all identifiers that begin with two underscores (__) or with an underscore followed by an uppercase letter. User variable names must never begin with one of these sequences.

Keywords are used as follows:

The following VAX C keywords are also sometimes recognized by the compiler (see footnote 1 in the Note at the end of this section):


_align 
globaldef 
globalref 
globalvalue 
noshare 
readonly 
variant_struct 
variant_union 

The following C99 Standard keywords are also sometimes recognized by the compiler (see footnote 2 in the Note at the end of this section):


inline 
restrict 

Use of a keyword as a superfluous macro name is not recommended, but is legal; for example, to change the default size of a basic data type:


#define int short 

Here, the keyword int has been redefined as short, which causes all data objects declared with the int data type to be stored as short objects.

Note

1. Recognized on OpenVMS systems when /STANDARD=RELAXED_ANSI (the default), /STANDARD=VAXC, or /ACCEPT=VAXC_KEYWORDS is specified on the compiler command line. Recognized on Tru64 UNIX systems when -vaxc or -accept vaxc_keywords is specified on the compiler command line.

2. Recognized on OpenVMS systems when /STANDARD=RELAXED_ANSI (the default) or /ACCEPT=C99_KEYWORDS is specified on the compiler command line. Recognized on Tru64 UNIX systems when -std (the default) or -accept c99_keywords is specified on the compiler command line.

