Tru64 UNIX
Compaq C Language Reference Manual

Chapter 1
Lexicon

C, like any language, uses a standard grammar and character set. The specific elements that comprise this grammar and character set are described in the following sections:

Character set (Section 1.1)
Rules for identifiers in C (Section 1.2)
Use of comments in a program (Section 1.3)
Keywords (Section 1.4)
Use of C operators (Section 1.5)
Use of punctuation characters (Section 1.6)
Use of character strings in a program (Section 1.7)
Interpretation of constant values (Section 1.8)
Inclusion of function declarations and other definitions, common to multiple source files, in a separate header file or module (Section 1.9)
The limits imposed on a conforming program by the ANSI C standard (Section 1.10)

C compilers interpret source code as a stream of characters from the source file. These characters are grouped into tokens, which can be punctuators, operators, identifiers, keywords, string literals, or constants. A token is the smallest lexical element of the language. The compiler forms the longest token possible from a given string of characters; the token ends when white space is encountered, or when the next character cannot be part of the token.

White space can be a space character, new-line character, tab character, form-feed character, or vertical tab character. Comments are also considered white space. Section 1.1 lists all the white space characters. White space is used as a token separator (except within quoted strings), but is otherwise ignored in the character stream, and is used mainly for human readability. White space may also be significant in preprocessor directives (see Chapter 8).

Consider the following source code line:

static int x=0; /* Could also be written "static int x = 0;" */

The compiler breaks the previous line into the following tokens (shown one per line):

static
int
x
=
0
;
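
The longest-token rule has visible consequences when operators are written without white space. The following is a minimal sketch for illustration only; the function name munch_demo is hypothetical and does not come from this manual:

```c
/* Illustrative sketch: munch_demo is a hypothetical name. */
int munch_demo(void)
{
    int a = 1;
    int b = 2;

    /* The compiler forms the longest token first, so a+++b is read
       as the tokens  a  ++  +  b  (not  a  +  ++  b). */
    int r = a+++b;

    /* r is 3 (1 + 2) and a is now 2; pack both into one result. */
    return r * 10 + a;
}
```

Writing the expression as a++ + b produces the same tokens and is easier to read.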

As the compiler processes the input character stream, it identifies tokens and locates error conditions. The compiler can identify three types of errors:

Lexical errors, which occur when the compiler cannot form a legal token from the character stream (such as when an illegal character is used).
Parsing (syntax) errors, which occur when a legal token can be formed, but the compiler cannot make a legal statement from the tokens. For example, the following line contains incorrect punctuation surrounding an initializer list:
char x[3] = (1,2,3);
Semantic errors, which are grammatically correct but break another C language rule. For example, the following line shows an attempt to assign a floating-point value to a pointer type:
int *x = 5.7;
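
For comparison, the two erroneous declarations above could be corrected along the following lines. This is a sketch only; the names value, p, and first_plus_pointee are hypothetical and were chosen for the example:

```c
/* Braces, not parentheses, enclose an initializer list. */
char x[3] = {1, 2, 3};

/* A pointer is initialized with an address, not a floating-point value. */
int value = 5;
int *p = &value;

/* Hypothetical helper that uses both corrected declarations. */
int first_plus_pointee(void)
{
    return x[0] + *p;
}
```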

Logical errors are not identified by the compiler.

An important concept throughout C is the idea of a compilation unit, which is one or more files compiled by the compiler.

Note

The ANSI C standard refers to compilation units as translation units. This text treats these terms as equivalent.

The smallest acceptable compilation unit is one external definition. The ANSI C standard defines several key concepts in terms of compilation units. Section 2.2 discusses compilation units in detail.

A compilation unit with no declarations is accepted with a compiler warning in all modes except for the strict ANSI standard mode.

1.1 Character Set

A character set defines the valid characters that can be used in source programs or interpreted when a program is running. The source character set is the set of characters available for the source text. The execution character set is the set of characters available when executing a program. The source character set does not necessarily match the execution character set; for example, when the execution character set is not available on the devices used to produce the source code.

Different character sets exist; for example, one character set is based on the American Standard Code for Information Interchange (ASCII) definition of characters, while another set includes the Japanese kanji characters. The character set in use makes no difference to the compiler; C treats each character as a distinct integer value. The ASCII character set has 128 characters, each of which can be represented in 7 bits and therefore fits in an 8-bit byte. However, in some extended character sets, so many characters exist that some characters' representation requires more than 8 bits. A special type was created to accommodate these larger characters, called the wchar_t (or wide character) type. Section 1.8.3.1 discusses wide characters further.
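
As a rough illustration of characters as integer values and of the wchar_t type, consider the following sketch. The function names are hypothetical, and the first function assumes a character set (such as ASCII) in which consecutive letters have consecutive values:

```c
#include <stddef.h>
#include <wchar.h>

/* Each character is an integer value, so arithmetic on it is legal.
   Assumes consecutive letters have consecutive values, as in ASCII. */
int next_character(int c)
{
    return c + 1;
}

/* A wide string literal (L prefix) has elements of type wchar_t;
   wcslen counts wide characters, not bytes. */
size_t wide_length(void)
{
    const wchar_t *text = L"kanji";
    return wcslen(text);
}
```

The size of wchar_t is implementation-defined; a program can check it with sizeof(wchar_t).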

Most ANSI-compatible C compilers accept the following ASCII characters for both the source and execution character sets. Each ASCII character corresponds to a numeric value. Appendix C lists the ASCII characters and their numeric values.

The 26 lowercase Roman characters:

a b c d e f g h i j k l m n o p q r s t u v w x y z

The 26 uppercase Roman characters:

A B C D E F G H I J K L M N O P Q R S T U V W X Y Z

The 10 decimal digits:

0 1 2 3 4 5 6 7 8 9

The 30 graphic characters:

! # % ^ & * ( ) - _ = + ~ ' " : ; ? / | \ { } [ ] , . < > $

A warning is issued if the $ character is used when the compiler's strict ANSI mode option is specified.

The five white space characters:

Space	( )
Horizontal tab	(\t)
Form feed	(\f)
Vertical tab	(\v)
New-line character	(\n)

In character constants and string literals, characters from the execution character set can also be represented by character or numeric escape sequences. Section 1.8.3.3 and Section 1.8.3.4 describe these escape sequences.

The ASCII execution character set also includes the following control characters:

New-line character (represented by \n in the source file)
Alert (bell) tone (\a)
Backspace (\b)
Carriage return (\r)
Null character (\0)

The null character is a byte or wide character with all bits set to 0. It is used to mark the end of a character string. Section 1.7 discusses character strings in more detail.
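
The role of the null character can be seen by comparing the storage a string occupies with its visible length. The sketch below is illustrative; the names text, stored_size, visible_length, and last_byte are hypothetical:

```c
#include <string.h>

/* "abc" occupies four bytes: 'a', 'b', 'c', and the terminating null. */
static const char text[] = "abc";

/* sizeof counts the terminating null character. */
size_t stored_size(void)
{
    return sizeof text;
}

/* strlen stops at the terminating null character and does not count it. */
size_t visible_length(void)
{
    return strlen(text);
}

/* The byte after the last visible character has all bits set to 0. */
int last_byte(void)
{
    return text[3];
}
```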

The new-line character splits the source character stream into separate lines for greater legibility and for proper operation of the preprocessor.

Sometimes a line longer than the terminal or window width must be interpreted by the compiler as one logical line. One logical line can be typed as two or more lines by appending the backslash character ( \ ) to the end of the continued lines. The backslash must be immediately followed by a new-line character. The backslash signifies that the current logical line continues on the next line. For example:

#define ERROR_TEXT "Your entry was outside the range of \
0 to 100."

The compiler deletes the backslash character and the adjacent new-line character during processing, so that this line becomes one logical line, as follows:

#define ERROR_TEXT "Your entry was outside the range of 0 to 100."

A long string can be continued across multiple lines by using the backslash-newline line continuation feature, but the continuation of the string must start in the first position of the next line. In some cases, this destroys the indentation scheme of the program. The ANSI C standard introduces another string continuation mechanism to avoid this problem. Two string literals, with only white space separating them, are combined to form one logical string literal. For example:

printf ("Your entry was outside the range of " "0 to 100.\n");
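
The two continuation mechanisms produce the same string, which the following sketch checks. The macro and function names are hypothetical, and the continued line in the first definition must begin in the first column, as described above:

```c
#include <string.h>

/* Backslash-newline continuation: the next line starts in column 1,
   otherwise the leading spaces would become part of the string. */
#define ERROR_TEXT_SPLICED "Your entry was outside the range of \
0 to 100."

/* Adjacent string literals: the continuation can be indented freely. */
static const char error_text_adjacent[] =
    "Your entry was outside the range of "
    "0 to 100.";

/* Returns 1 if both forms yield the same logical string. */
int same_error_text(void)
{
    return strcmp(ERROR_TEXT_SPLICED, error_text_adjacent) == 0;
}
```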

The maximum logical line length is 32,767 characters.

1.1.1 Trigraph Sequences

To write C programs using character sets that do not contain all of C's punctuation characters, ANSI C allows the use of nine trigraph sequences in the source file. These three-character sequences are replaced by a single character in the first phase of compilation. (See Section 2.16 for an explanation of compilation phases.) Table 1-1 lists the valid trigraph sequences and their character equivalents.

Table 1-1 Trigraph Sequences
Trigraph Sequence	Character Equivalent
??=	#
??(	[
??/	\
??)	]
??'	^
??<	{
??!	|
??>	}
??-	~

No other trigraph sequences are recognized. A question mark (?) that does not begin a trigraph sequence remains unchanged during compilation. For example, consider the following source line:

printf ("Any questions???/n");

After the ??/ sequence is replaced, this line is translated as follows:

printf ("Any questions?\n");
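
When the characters ??/ are meant literally, one common way to keep them from forming a trigraph is the \? escape sequence, which stands for a plain question mark. The following is a sketch under that assumption; the function name literal_question_marks is hypothetical:

```c
#include <string.h>

/* Each \? is an escape sequence for a single question mark. Because no
   two unescaped question marks are adjacent in the source text, no
   trigraph is formed, whether or not trigraph processing is enabled. */
const char *literal_question_marks(void)
{
    return "Any questions?\?\?/n";
}
```

The resulting string holds three question marks followed by the two characters / and n, rather than a new-line character.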

1.1.2 Digraph Sequences

Digraph processing is supported when compiling in ISO C 94 mode (/STANDARD=ISOC94 on OpenVMS systems).

Digraphs are pairs of characters that translate into a single character, much like trigraphs. Unlike trigraphs, however, digraphs are not replaced inside string literals. Table 1-2 lists the valid digraph sequences and their character equivalents.

Table 1-2 Digraph Sequences
Digraph Sequence	Character Represented
<:	[
:>	]
<%	{
%>	}
%:	#
%:%:	##
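
The following sketch shows digraphs standing in for brackets and braces. It assumes a compiler with digraph processing enabled (ISO C 94 mode or later); the function names are hypothetical:

```c
/* <% and %> stand for { and }; <: and :> stand for [ and ]. */
int digraph_sum(void)
<%
    int a<:3:> = <% 1, 2, 3 %>;
    return a<:0:> + a<:1:> + a<:2:>;
%>

/* Digraphs are not replaced inside string literals, so this string
   still holds the two characters '<' and ':'. */
const char *unreplaced(void)
<%
    return "<:";
%>
```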

1.2 Identifiers

An identifier is a sequence of characters that represents a name for the following:

Variable
Function
Label
Type definition
Structure, enumeration, or union tag
Structure, enumeration, or union member
Enumeration constant
Macro
Macro parameter

The following rules apply to identifiers:

Identifiers consist of a sequence of one or more uppercase or lowercase alphabetic characters, the digits 0 to 9, the dollar sign ($), and the underscore character (_).
Using the $ character provokes a warning from the compiler in strict ANSI mode.
Character case is significant in identifiers; for example, the identifier Test1 is different from the identifier test1.
Identifiers cannot begin with a digit.
Do not begin identifiers with an underscore; the ANSI C standard reserves these identifiers for internal names.
Keywords are not identifiers (Section 1.4 lists the C keywords).
Avoid using the names of library functions as identifiers (Chapter 9 lists the C library function names). A function with the same name as a library function supersedes the library function. This may be the desired outcome, but it can make program maintenance confusing.
In general, identifiers are separated by white space, punctuators, or operators. For example, the following code fragment has four identifiers:
struct employee { int number; char sex; } emp;
The identifiers are: employee, number, sex, and emp. (struct, int, and char are keywords.)
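
The identifier rules above can be sketched as follows. The names count_2 and sum_of_both are hypothetical. Test1 and test1 are declared static so that they have internal linkage, because case may not be significant in external identifiers on some systems:

```c
/* Four identifiers: employee, number, sex, and emp. */
struct employee { int number; char sex; } emp;

/* Case is significant: these are two different identifiers. */
static int Test1 = 1;
static int test1 = 2;

/* Digits and underscores are allowed after the first character. */
static int count_2 = 0;

/* Hypothetical helper that uses each identifier declared above. */
int sum_of_both(void)
{
    count_2 = count_2 + 1;
    emp.number = Test1 + test1;
    return emp.number;
}
```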

An identifier without external linkage has at most 32,767 significant characters. An identifier with external linkage has 1023 significant characters on Tru64 UNIX systems and 31 significant characters on OpenVMS systems. (Section 2.8 describes linkage in more detail.) Case is not significant in external identifiers on OpenVMS systems.

Identifiers that differ within their significant characters are different identifiers. If two or more identifiers differ in nonsignificant characters only, they are treated as the same identifier.

1.3 Comments

The /* character combination introduces a comment and the */ character combination ends a comment, except within a character constant or string literal.

Comments cannot be nested; once a comment is started, the compiler treats the first occurrence of */ as the end of the comment.

To comment out sections of code, avoid using the /* and */ sequences. Using the /* and */ sequences works only for code sections containing no comments, because comments do not nest. A better method is to use the #if and #endif preprocessor directives, as in the following example:

#if 0
/* This code is excluded from execution because ... */
code_to_be_excluded ();
#endif

See Chapter 8 for more information on the preprocessing directives #if and #endif.

Comments cannot span source files. Within a source file, comments can be of any length and are interpreted as white space by both the compiler and the preprocessor.
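
The comment rules in this section can be illustrated with a short sketch. The function names marker_length and active_value are hypothetical:

```c
#include <string.h>

/* Inside a string literal, the comment delimiters are ordinary
   characters, so the whole text is part of the string. */
size_t marker_length(void)
{
    return strlen("/* not a comment */");
}

int active_value(void)
{
    int value = 1;
#if 0
    /* Excluded code may contain comments of its own. */
    value = 2;
#endif
    /* A comment between tokens is treated as white space. */
    return value /* still the value set before the #if */ + 0;
}
```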

1.4 Keywords

C defines several keywords, each with special meaning to the compiler. Keywords identify statement constructs and specify basic types and storage classes. Keywords cannot be used as identifiers and cannot be declared.

Table 1-3 lists the C keywords.

Table 1-3 Keywords
auto	double	int	struct
break	else	long	switch
case	enum	register	typedef
char	extern	return	union
const	float	short	unsigned
continue	for	signed	void
default	goto	sizeof	volatile
do	if	static	while

In addition to the keywords listed in Table 1-3, the compiler reserves all identifiers that begin with two underscores (__) or with an underscore followed by an uppercase letter. User variable names must never begin with one of these sequences.

Keywords are used as follows:

To assign a storage class to a variable or function (auto, extern, register, static)
To construct or qualify a data type (char, const, double, enum, float, int, long, short, signed, struct, union, unsigned, void, volatile)
As part of a statement (break, case, continue, default, do, else, for, goto, if, return, switch, while)
To define a new named type (typedef)
To perform an operation (sizeof, __typeof__)

The following VAX C keywords are also sometimes ¹ recognized by the compiler:

_align globaldef globalref globalvalue noshare readonly variant_struct variant_union

The following C99 Standard keywords are also sometimes ² recognized by the compiler:

inline restrict

Using a keyword as a macro name is legal but not recommended. For example, the following directive changes the default size of a basic data type:

#define int short

Here, the keyword int has been redefined as short, which causes all data objects declared with the int data type to be stored as short objects.
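
The effect of such a redefinition can be observed with sizeof. The following is only a sketch; the function names are hypothetical, the header is included before the macro is defined so that it is not affected, and the macro is removed again with #undef to limit its reach:

```c
#include <stddef.h>

/* Compiled before the redefinition: int is still the keyword. */
size_t plain_int_size(void)
{
    return sizeof(int);
}

#define int short   /* legal, but not recommended */

/* Compiled after the redefinition: n is really a short object. */
size_t redefined_int_size(void)
{
    int n = 0;
    return sizeof n;
}

#undef int          /* restore the keyword for the rest of the file */
```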

Note

¹ Recognized on OpenVMS systems when /STANDARD=RELAXED_ANSI (the default), /STANDARD=VAXC or /ACCEPT=VAXC_KEYWORDS is specified on the compiler command line. Recognized on Tru64 UNIX systems when `-vaxc` or `-accept vaxc_keywords` is specified on the compiler command line.

² Recognized on OpenVMS systems when /STANDARD=RELAXED_ANSI (the default), or /ACCEPT=C99_KEYWORDS is specified on the compiler command line. Recognized on Tru64 UNIX systems when `-std` (the default) or `-accept c99_keywords` is specified on the compiler command line.
