Back Contents Next

1. Lexical Specification

Programs are written in Unicode, or in UTF-8 that is then converted to Unicode. Since ASCII is a subset of the UTF-8 encoding, source files that use no characters outside ASCII can be edited with a standard ASCII text editor. Unicode allows many non-English characters to appear in program comments and identifiers. Care is needed because some Unicode characters look alike even though they belong to different scripts. This section treats the subject only informally (see 16.1 Lexical Structure).

1.1 Whitespace

Whitespace is ignored except as it serves to separate identifiers and keywords. Whitespace includes all spaces, tabs, and new lines. New lines may be of the form found on any system. Editors should invisibly convert between the various new line formats.

1.2 Comments

Comments are the explanatory text scattered throughout code. There are two forms. Text following '//' up through the end of the line is a comment. Block comments begin with '/*' and end with '*/', and may be nested to an arbitrary depth. Implementations must document any limit on nesting depth imposed by the finite resources of the computer (a limit of at least 2^31 is suggested). All comments appear as whitespace to the compiler, so they may not appear in the middle of an identifier or keyword.
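
The nesting rule can be illustrated with a small depth counter. The following Python sketch is not part of the language or its compiler; the function name strip_comments is hypothetical, and for brevity it assumes the source contains no string or character literals, which a real lexer would have to skip over.

```python
def strip_comments(source: str) -> str:
    """Replace '//' comments and nested '/* */' comments with a single space.

    Illustrative only: string and character literals are NOT recognized.
    """
    out = []
    i = 0
    depth = 0  # current block-comment nesting depth
    n = len(source)
    while i < n:
        two = source[i:i + 2]
        if depth > 0:
            if two == "/*":        # a nested block comment opens
                depth += 1
                i += 2
            elif two == "*/":      # the innermost open comment closes
                depth -= 1
                i += 2
                if depth == 0:
                    out.append(" ")  # the whole comment counts as whitespace
            else:
                i += 1
        elif two == "/*":
            depth = 1
            i += 2
        elif two == "//":
            # Skip to the end of the line; the newline itself is kept.
            while i < n and source[i] != "\n":
                i += 1
            out.append(" ")
        else:
            out.append(source[i])
            i += 1
    if depth > 0:
        raise ValueError("unterminated block comment")
    return "".join(out)
```

Because each comment is replaced by a space, a comment placed inside an identifier splits it in two, which matches the rule that comments behave as whitespace.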

1.3 Identifiers

Identifiers are used as names in programs. An identifier begins with a letter and continues with numerals or letters until whitespace or another invalid character is reached. Invalid characters include every character contained in any non-alphabetic operator (see 1.6 Operators & Separators) as well as other symbolic characters. The underscore '_' is, however, allowed in identifiers. In addition, an identifier may be terminated with either a question mark '?' or an exclamation point '!'.

Identifiers are case sensitive. That is, two identifiers that differ only in their case are different.
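
As a rough illustration of these rules, the Python sketch below checks whether a string has the shape of an identifier. The function name is_identifier is hypothetical. It assumes that "letter" means any Unicode alphabetic character and "numeral" any Unicode decimal digit, which this section does not state precisely, and it does not reject keywords.

```python
def is_identifier(text: str) -> bool:
    """Return True if text has the lexical shape of an identifier."""
    # An identifier must begin with a letter (an underscore may not come first).
    if not text or not text[0].isalpha():
        return False
    body = text[1:]
    # A single trailing '?' or '!' is allowed as a terminator.
    if body and body[-1] in "?!":
        body = body[:-1]
    # Everything else must be a letter, a numeral, or an underscore.
    return all(ch.isalpha() or ch.isdecimal() or ch == "_" for ch in body)
```

Under these assumptions 'empty?' and 'shift_left' are accepted, while '_x', '9lives' and 'a?b' are rejected.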

1.4 Keywords

In addition to the standard identifiers certain identifiers are reserved for special meanings as keywords and may not otherwise be used as identifiers. The keywords are:

abstract break case catch class const continue create default destroy
do else for final if import interface mutable operator
outer personal public private protected return self scope super
switch this throw throws try while get set

There are also operators which are identifiers (see 1.6 Operators & Separators). In addition to the keywords defined here, many identifiers are already defined in the API. These include the primitive types; like any other identifier, they may not be redefined.

1.5 Reserved Words

The following words are reserved for possible future use as keywords. They may be used as identifiers but a warning will be generated and code which makes use of them may not be valid in future versions of the language.

new delete resize dim sizeof deprecated inner
signal signals receive interrupt

1.6 Operators & Separators

Certain symbol sequences are operators. As noted before, no character contained in a standard operator may be used in an identifier. The standard operators are:

( ) { } [ ] ; : , . ..
== < > <= >= != + - * / %
++ -- << >> = += -= *= /= %=

In addition there are the following alphabetic or partially alphabetic operators:

shift_left shift_right bit_and bit_or bit_xor and or xor
shift_left= shift_right= bit_and= bit_or= bit_xor= create complement

1.7 Primitive Literals

In addition to keywords and operators, a program may also contain literals. Literals are special syntax used to refer to specific object values. There are literals for each of the primitive types and for arrays and strings. All literals have 'const' types.

1.7.1 Boolean Literals

The two boolean literals are 'true' and 'false'. Each has the type bool and may be used anywhere a bool object value is allowed.

1.7.2 Character Literals

Character literals represent one or more characters. They are written by enclosing a sequence of characters or character escape codes in single quotes. New lines and single quotes may not appear in character literals. A character literal is either a short character literal or a normal character literal, and its kind is determined by the characters it contains. If a character literal contains any character which requires two bytes to express, it is a normal character literal. Every character in a normal character literal is expressed in two bytes, whether or not that individual character requires it. Once the number of bytes a character literal occupies is known, its type can be determined. A character literal of one character has the type schar if it is a short literal and char otherwise. Any other character literal has the type of the smallest unsigned integer type which can hold it. A character literal must be short enough to fit in such a type and must match the size of some such type exactly. It is an error if a character literal is shorter than the shortest integer type that can hold it.

Inside a character literal, escape sequences may be used to indicate characters which could not otherwise be contained in a character literal. A Unicode escape sequence forces the character literal which contains it to be normal, even if the escape code describes a character that does not require this. Each escape stands for a single character. The exceptions are the Unicode marker '\U', which forces the characters in the literal to be normal Unicode, and the '\S' marker, which forces the literal to be short. Neither marker is replaced by a character value; each only marks what kind of literal it is. The other escape sequences are:

Name             ASCII Name   Unicode   Escape Code
null             NULL         \u0000    \0
newline          NL (LF)      \u000a    \n
horizontal tab   HT           \u0009    \t
backspace        BS           \u0008    \b
carriage return  CR           \u000d    \r
form feed        FF           \u000c    \f
backslash        \            \u005c    \\
single quote     '            \u0027    \'
double quote     "            \u0022    \"
ASCII escape     hh           \u00hh    \xhh
Unicode escape                \uhhhh    \uhhhh
Unicode marker                          \U

A Unicode or short marker must come before any other character in a character literal, and a literal may not contain both markers. The ASCII and Unicode escape sequences must contain exactly two and four hexadecimal digits respectively. Hexadecimal digits are '0' through '9' and 'a' or 'A' through 'f' or 'F'.
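
The escape table can be read as a simple decoding procedure. The Python sketch below is one possible reading, not the language's actual implementation; the names SIMPLE_ESCAPES and decode_escapes are hypothetical. It takes the text between the quotes and assumes that any leading '\U' or '\S' marker has already been removed by the caller, so a marker left in place is reported as an unknown escape.

```python
# Single-character escapes from the table, mapped to the characters they denote.
SIMPLE_ESCAPES = {
    "0": "\u0000", "n": "\u000a", "t": "\u0009", "b": "\u0008",
    "r": "\u000d", "f": "\u000c", "\\": "\u005c", "'": "\u0027", '"': "\u0022",
}
HEX_DIGITS = set("0123456789abcdefABCDEF")


def decode_escapes(body: str) -> str:
    """Expand the escape sequences in the body of a character or string literal."""
    out = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch != "\\":
            out.append(ch)
            i += 1
            continue
        if i + 1 >= len(body):
            raise ValueError("dangling backslash")
        code = body[i + 1]
        if code in SIMPLE_ESCAPES:
            out.append(SIMPLE_ESCAPES[code])
            i += 2
        elif code in ("x", "u"):
            # '\x' takes exactly two hexadecimal digits, '\u' exactly four.
            width = 2 if code == "x" else 4
            digits = body[i + 2:i + 2 + width]
            if len(digits) != width or not set(digits) <= HEX_DIGITS:
                raise ValueError(f"\\{code} needs exactly {width} hexadecimal digits")
            out.append(chr(int(digits, 16)))
            i += 2 + width
        else:
            raise ValueError(f"unknown escape \\{code}")
    return "".join(out)
```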

1.7.3 Integer Literals

Integer literals represent integer values and are written in either decimal or hexadecimal (base 16). A decimal integer is either '0' or a nonzero decimal digit ('1' through '9') followed by any number of decimal digits. Note that a unary minus sign may appear before a decimal literal but is not strictly part of the literal. A decimal literal is always of the physically smallest signed or unsigned type that can hold its value. If it can be held by both a signed and an unsigned type of the same size, it is signed. If it is of an unsigned type and is negated by a unary minus, it is promoted to the next signed type in order to hold the negative value.

Hexadecimal integer literals always begin '0x' which is then followed by one or more hexadecimal digits. Hexadecimal literals are of the smallest unsigned type that will hold them. Note that leading extra zeros do not affect this. A hexadecimal digit is either a decimal digit or the letters 'a' or 'A' through 'f' or 'F'.
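
The typing rules for integer literals can be sketched as follows. This Python example is only an illustration: the function names are hypothetical, and the set of integer widths (8, 16, 32 and 64 bits) is an assumption, since this section does not list the integer types. Each function returns the signedness and width in bits rather than a type name.

```python
WIDTHS = (8, 16, 32, 64)  # ASSUMED integer widths in bits; not given in this section
HEX = set("0123456789abcdefABCDEF")


def decimal_literal_type(text: str) -> tuple:
    """Return (signedness, bits) for a decimal literal, without any unary minus."""
    well_formed = text == "0" or (
        text.isascii() and text.isdigit() and text[0] != "0"
    )
    if not well_formed:
        raise ValueError("not a decimal integer literal")
    value = int(text)
    for bits in WIDTHS:
        # At equal size the signed type wins; unsigned is used only when
        # the value does not fit the signed type of that size.
        if value < 2 ** (bits - 1):
            return ("signed", bits)
        if value < 2 ** bits:
            return ("unsigned", bits)
    raise ValueError("literal too large for the assumed widths")


def hex_literal_type(text: str) -> tuple:
    """Return (signedness, bits) for a hexadecimal literal such as 0x00ff."""
    digits = text[2:]
    if not text.startswith("0x") or not digits or not set(digits) <= HEX:
        raise ValueError("not a hexadecimal integer literal")
    value = int(digits, 16)  # leading zeros do not change the value or the type
    for bits in WIDTHS:
        if value < 2 ** bits:
            return ("unsigned", bits)
    raise ValueError("literal too large for the assumed widths")
```

Under these assumptions '100' is a signed 8-bit literal, '200' is unsigned 8-bit because it does not fit the signed 8-bit range, and '0x00ff' is unsigned 8-bit despite its leading zeros.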

1.7.4 Floating Point Literals

Floating point literals represent values of type float or double. If the number can be exactly represented in a float then the type is float; otherwise the type is double. A floating point literal is composed of a decimal part, a fractional part and an exponent part. The decimal part is either zero or a decimal integer not beginning with zero. The fractional part is a decimal point followed by one or more decimal digits. The exponent part is an 'e', optionally followed by a '+' or '-', then a nonzero decimal integer not beginning with zero. Either the fractional part or the exponent part, but not both, may be omitted. Likewise either the decimal part or the fractional part, but not both, may be omitted.
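
One way to read the rules above is as a regular expression. The Python sketch below encodes that reading; the names DEC, FRAC, EXP and is_float_literal are hypothetical, and the pattern covers only the lexical shape of a literal, not the choice between float and double.

```python
import re

DEC = r"(?:0|[1-9][0-9]*)"     # decimal part: zero, or no leading zero
FRAC = r"\.[0-9]+"             # fractional part: point plus one or more digits
EXP = r"e[+-]?[1-9][0-9]*"     # exponent part: nonzero, no leading zero

# Allowed shapes: [decimal] fraction [exponent], or decimal exponent.
# A bare decimal part is an integer literal, not a floating point literal.
FLOAT_LITERAL = re.compile(rf"(?:{DEC})?{FRAC}(?:{EXP})?|{DEC}{EXP}")


def is_float_literal(text: str) -> bool:
    """Return True if text has the shape of a floating point literal."""
    return FLOAT_LITERAL.fullmatch(text) is not None
```

With this reading '1.5', '.5', '1e10' and '1.5e-3' are accepted, while '15', '1.', '01.5' and '1e0' are rejected.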

1.7.5 Array Literals

Array literals are enclosed in square brackets and can be distinguished from subscripting by not being next to a literal. The items are listed inside the brackets and separated by commas. Multi-dimensional arrays are specified by listing an array literal inside of an array literal.

1.7.6 String Literals

String literals are closely related to character literals. String literals may be short or normal depending on whether they contain chars. String literals are enclosed in double quotes and may contain many characters, including single quotes. Other characters, including double quotes, may be escaped using the character escapes (see 1.7.2 Character Literals). Just as with character literals, new lines may not be used in strings. A string literal is composed of chars unless the '\S' escape is used to force it to be a string of schars.




jwalker@cs.oberlin.edu