3. Lexical form

3.1 Overview

Although Perfect programs are normally represented using an extended character set in an integrated development environment, they can be represented in the standard ASCII character set or in Unicode. This chapter describes how the character set is used to express the various tokens of the language.

3.2 Character set

The character set used by Perfect comprises the letters A through Z and a through z, the digits 0 through 9 and the following special characters:

. , ; : ? ! ' " ` + - * / % & | ( ) { } [ ] < > ~ # ^ _ = \ @

[SC] Other printable characters supported by the underlying character set are legal only in comments and in character and string literals. Nonprintable characters other than those characters or character combinations used to represent space, newline and horizontal tab are illegal in Perfect text, except that a nonprintable character or character combination representing end-of-file may be present at the very end of the program if allowed by the underlying file system and character representation.

When printing or displaying Perfect text, it is recommended that tab stops be assumed every 4 space-character widths from the left-hand margin.

[Note: the character "$" is the only printable 7-bit character in the ASCII set that is not used.]
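The character-set rule can be sketched as a simple membership test. The following Python fragment is illustrative only and is not part of any Perfect tool; the names SPECIAL, ALLOWED and illegal_chars are hypothetical, and the sketch assumes that newlines have already been normalised to a single line-feed character and that the text being checked lies outside comments and literals.

```python
import string

# Special characters listed in section 3.2: every printable ASCII symbol except "$".
SPECIAL = set(".,;:?!'\"`+-*/%&|(){}[]<>~#^_=\\@")

# Characters legal outside comments and literals: letters, digits, the special
# characters, and the space, newline and horizontal tab characters.
# Assumption: newline is represented as "\n" only.
ALLOWED = set(string.ascii_letters) | set(string.digits) | SPECIAL | set(" \n\t")


def illegal_chars(text):
    """Return, in sorted order, the distinct characters of text that are not allowed."""
    return sorted({ch for ch in text if ch not in ALLOWED})
```

For example, illegal_chars("x + y$") reports only the "$" character.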

3.3 Comments

Comments are introduced by two adjacent forward slash characters ("//") and terminate at the end of the line. The last line in a file is always considered as having an end, even if there is no end-of-line marker before the end of the file.
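As an illustration of this rule, the following Python sketch removes a trailing comment from a single line. It is not taken from any Perfect implementation; strip_comment is a hypothetical name. Because comments are not recognised inside character and string literals, the sketch tracks whether it is inside a literal, and it assumes that a backquote or double quote outside a comment always opens a literal.

```python
def strip_comment(line):
    """Return line with any trailing // comment removed."""
    quote = None  # quote character of the literal currently open, if any
    i = 0
    while i < len(line):
        ch = line[i]
        if quote is not None:
            if ch == "\\":
                i += 2  # skip the backslash and the character it escapes
                continue
            if ch == quote:
                quote = None  # closing quote of the literal
        elif ch in "\"`":
            quote = ch  # opening quote of a string or character literal
        elif line.startswith("//", i):
            return line[:i]  # comment runs to the end of the line
        i += 1
    return line
```

The function works on one line at a time, so a final line without an end-of-line marker is handled in the same way as any other line.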

3.4 White space

Comments, and newline, space and tab characters (other than those within comments, and space characters within string and character literals) are collectively known as whitespace. Multiple adjacent whitespace elements are equivalent to a single whitespace element.

Whitespace may occur between any two program tokens but not within an identifier, literal, reserved word or multi-character token. Whitespace may, however, appear between two tokens that construct a new operator from an existing one according to the rules of the language. Whitespace must occur between a pair of adjacent tokens if the beginning of the second would otherwise be a legal continuation of the first (e.g. between a reserved word and an identifier).

3.5 Multi-character tokens

The multi-character tokens of the language are:

<= >= << >> <<= >>= <== ==> <==> || ~~ ^= :- :: -> <- <-> ++ -- ** ## .. ...

Where an input character sequence can be interpreted in more than one way, the lexical analyser picks the longest leading sub-sequence that forms a token, then applies the same rule to the remainder of the input sequence. For example, "=>>" would be interpreted as "=>" followed by ">", not as "=" followed by ">>", even if the former interpretation gave rise to a parsing or other error message and the latter did not.
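The longest-match rule can be sketched in a few lines of Python. This is an illustrative fragment, not the lexical analyser of any Perfect compiler; the names MULTI, BY_LENGTH and split_symbols are hypothetical. It uses only the multi-character tokens listed in this section and treats any other character as a single-character token.

```python
# Multi-character tokens as listed in section 3.5.
MULTI = ["<=", ">=", "<<", ">>", "<<=", ">>=", "<==", "==>", "<==>", "||", "~~",
         "^=", ":-", "::", "->", "<-", "<->", "++", "--", "**", "##", "..", "..."]

# Try longer candidates first so that the longest leading match wins.
BY_LENGTH = sorted(MULTI, key=len, reverse=True)


def split_symbols(text):
    """Split a run of symbol characters into tokens by longest leading match."""
    tokens = []
    i = 0
    while i < len(text):
        for candidate in BY_LENGTH:
            if text.startswith(candidate, i):
                tokens.append(candidate)
                i += len(candidate)
                break
        else:
            # No multi-character token starts here: emit one character.
            tokens.append(text[i])
            i += 1
    return tokens
```

With this table, "<<==" is split into "<<=" followed by "=", never into "<" followed by "<==", because the choice is made greedily from the left without regard to whether the result can be parsed.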

3.6 Reserved words

The reserved words of the language are:

abstract absurd after any anything as assert associative axiom bag begin bool build byte catch change char class commutative confined const decrease deferred define done early end enum exempt exists false fi final float for forall from function ghost goto has heap highest idempotent identity if implements import in inherits int interface internal invariant is it keep let like limited loop lowest map nonmember null of on opaque operator out over pair par pass post pragma pre proof property rank real redefine ref repeated require result satisfy schema selector self seq set storable super supports tag that then those throw total trace triple true try until value var via void when within yield

Reserved words are case-sensitive. Note that float, implements, supports and trace are not used at present, but are reserved for future use.

The following words are not reserved but are names for built-in global methods and are therefore best avoided:

debugHalt debugPrint flatten interleave loadObject max min storeObject swap

Similarly, the following words are names for built-in classes and are also best avoided:

ByteData ByteStream CharDecoder CharEncoder CharEncoderDecoder Comparator DebugType Environment FileAttribute FileError FileHandle FileMode FileModeType FilePath FileRef FileStats FileStream GuardedObject InputStream nat OsInfo OsType OutputStream ReverseComparator SerialError SerialErrorType SimpleComparator Socket SocketError SocketMode StandardInputStream StandardOutputStream Storable StreamBase StreamHeap string Time

3.7 Identifiers

Identifiers comprise a letter or underscore character optionally followed by any number of characters each of which is a letter, digit or underscore character, provided only that the resulting string is not a reserved word. There is no limit on the length of an identifier. All characters in an identifier are significant. The case of letters is significant.
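The identifier rule corresponds to a simple regular expression together with a reserved-word check. The Python sketch below is illustrative; IDENTIFIER, RESERVED_SUBSET and is_identifier are hypothetical names, and RESERVED_SUBSET holds only a few of the reserved words from section 3.6 to keep the example short.

```python
import re

# A letter or underscore, followed by any number of letters, digits or underscores.
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

# Illustrative subset only; a real check would use the full list in section 3.6.
RESERVED_SUBSET = {"class", "function", "if", "then", "var", "self"}


def is_identifier(word, reserved=RESERVED_SUBSET):
    """Return True if word has identifier form and is not a reserved word."""
    return IDENTIFIER.fullmatch(word) is not None and word not in reserved
```

Because reserved words and identifiers are both case-sensitive, "Class" is accepted as an identifier under this sketch whereas "class" is not.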

3.8 Character literals

Literals of type char are written as the desired character between opening-single-quote symbols (the backquote character, also used in the \` sequence below), thus:

`a`

The backslash character has a special meaning within character literals in that the backslash character and one or more of the characters following it are replaced by a single character, as follows:

\a	alert (bell)
\b	backspace
\f	form feed
\n	line feed
\r	carriage return
\t	horizontal tab
\v	vertical tab
\\	\
\`	`
\"	"
\(ddd)	the character represented by the integer literal ddd

In the case of the form `\(ddd)`, ddd is any integer literal such that the resulting integer is within the range appropriate to the character set in use. There must be no whitespace between the brackets and the integer literal.

The use of any other character following the initial backslash is illegal. The amount of storage associated with each character and the character set supported are implementation dependent (a typical implementation might offer a choice of ASCII or Unicode).

[SC] Exactly one printable character, space character or backslash combination equivalent to one character must appear between the quotes. If the `\(ddd)` form is used then the integer literal ddd must be in the range of the supported character set.
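The rules for character literals can be summarised by the following Python sketch, which decodes a literal into a one-character string. It is illustrative only; SIMPLE_ESCAPES and decode_char_literal are hypothetical names. Two simplifications are assumed: the ddd in the \(ddd) form is restricted to plain decimal digits, and the upper bound of the character set is passed in as max_code (defaulting to the Unicode maximum, which is an assumption, since the character set is implementation dependent).

```python
import re

# Single-character backslash sequences from the table in section 3.8.
SIMPLE_ESCAPES = {"a": "\a", "b": "\b", "f": "\f", "n": "\n", "r": "\r",
                  "t": "\t", "v": "\v", "\\": "\\", "`": "`", '"': '"'}


def decode_char_literal(literal, max_code=0x10FFFF):
    """Decode a backquoted character literal into a one-character string."""
    if len(literal) < 3 or literal[0] != "`" or literal[-1] != "`":
        raise ValueError("expected one character between backquotes")
    body = literal[1:-1]
    if len(body) == 1 and body not in ("\\", "`"):
        if not body.isprintable():  # space counts as printable here
            raise ValueError("nonprintable character in literal")
        return body
    if not body.startswith("\\"):
        raise ValueError("exactly one character or backslash sequence expected")
    # Numeric form: no whitespace is allowed between the brackets and the digits.
    numeric = re.fullmatch(r"\\\(([0-9]+)\)", body)
    if numeric:
        code = int(numeric.group(1))
        if code > max_code:
            raise ValueError("character code out of range")
        return chr(code)
    if len(body) == 2 and body[1] in SIMPLE_ESCAPES:
        return SIMPLE_ESCAPES[body[1]]
    raise ValueError("illegal backslash sequence")
```

For an implementation that supports only ASCII, the same function could be called with max_code=127.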

3.9 String literals

Literals of type "sequence of characters" (seq of char) are written as a sequence of characters enclosed in double quotes. The backslash character has the same special meaning as it does in character literals and every backslash sequence gives rise to a single character in the string. The closing quote must be on the same line as the opening quote.

Within a character or string literal, nonprintable characters (including newline and tab characters) are not permitted, and comments are not recognised.

[SC] The sequence between the quotes must comprise only printable characters, space characters and valid backslash sequences.

3.10 Integer literals

An integer literal is written as a sequence of decimal digits, or as a sequence of hexadecimal digits (0-9 and A-F or a-f) preceded by 0X or 0x, or as a sequence of binary digits preceded by 0B or 0b. There is no fixed limit on the size of integer literals; however, if the compiler or target uses bounded integers, an error message will be generated in respect of any integer literal that cannot be represented. The case of any letter forming part of an integer literal is not significant.

Underscore characters may be inserted within the sequence of digits (but not at the start or the end) to improve readability.
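The three forms of integer literal, together with the underscore rule, can be sketched in Python as follows. The fragment is illustrative only; INTEGER and parse_integer are hypothetical names. The sketch assumes that consecutive underscores are acceptable within the digits, which the text above neither permits nor forbids explicitly, and it relies on Python's unbounded integers, so it never reports a literal as too large.

```python
import re


def _digits(cls):
    # A digit, optionally followed by digits and underscores ending in a digit,
    # so that an underscore can never be first or last.
    return rf"[{cls}](?:[{cls}_]*[{cls}])?"


INTEGER = re.compile(
    rf"0[xX](?P<hex>{_digits('0-9A-Fa-f')})"
    rf"|0[bB](?P<bin>{_digits('01')})"
    rf"|(?P<dec>{_digits('0-9')})"
)


def parse_integer(text):
    """Return the value of an integer literal, or raise ValueError."""
    match = INTEGER.fullmatch(text)
    if match is None:
        raise ValueError(f"not an integer literal: {text!r}")
    for name, base in (("hex", 16), ("bin", 2), ("dec", 10)):
        digits = match.group(name)
        if digits is not None:
            return int(digits.replace("_", ""), base)
```

For example, parse_integer("1_000_000") yields 1000000 and parse_integer("0xFF") yields 255, while "12_" and "0x" are rejected.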

3.11 Real literals

Real literals are written in the form s.s or ses or s.ses, where s is any sequence of decimal digits and e is the letter e or the letter E optionally followed by a minus sign. The digit string following e or E is interpreted as a decimal exponent. Whitespace is not permitted within a real literal. Each digit string s may contain embedded underscore characters to improve readability.
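The three permitted forms can be captured in a single regular expression, as in the following Python sketch. It is illustrative only; REAL, is_real and parse_real are hypothetical names. The sketch reads "embedded" as meaning that an underscore may not begin or end a digit string, and it converts to a Python float purely for demonstration, which need not match the precision of the real type in any Perfect implementation.

```python
import re

# A digit string: starts and ends with a digit, underscores allowed in between.
S = r"[0-9](?:[0-9_]*[0-9])?"

# s.s, s.ses or ses, where the exponent marker may be followed by a minus sign.
REAL = re.compile(rf"{S}(?:\.{S}(?:[eE]-?{S})?|[eE]-?{S})")


def is_real(text):
    """Return True if text has the form of a real literal."""
    return REAL.fullmatch(text) is not None


def parse_real(text):
    """Return the value of a real literal as a Python float, or raise ValueError."""
    if not is_real(text):
        raise ValueError(f"not a real literal: {text!r}")
    return float(text.replace("_", ""))
```

Under this sketch "3.14", "1e10" and "1_000.5e-3" are accepted, whereas "1.", ".5", "1e+5" and "42" are not; the last of these is an integer literal rather than a real literal.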