Current Version: 1.0.32
Project Name: csspp
Lexer Rules

Note
Many of the SVG images below were taken from the CSS Syntax Module Level 3 document.

The lexer is composed of the following rules:

Input Stream (CSS Preprocessor Detail)

Contrary to CSS 3, which allows any encoding as long as the first 128 bytes match ASCII sufficiently, CSS Preprocessor only accepts UTF-8. This is because (1) 99% of the CSS files out there are ASCII anyway and therefore already UTF-8 compatible and (2) the Snap! Websites environment uses UTF-8 throughout all of its documents (although in-memory text data may use a different format such as UTF-16 or UTF-32).

The input stream is checked for invalid data. The lexer generates an error if an invalid character is found. Characters that are considered invalid are:

  • \0 – the NULL terminator; the lexer can still parse strings, but you have to write such strings to an I/O buffer first, and that buffer should not include the NULL terminator (see the example below)
  • \xFFFD – the INVALID character; in CSS 3, this character represents the EOF of a stream; in CSS Preprocessor, it is just viewed as an error
  • \x??FFFE and \x??FFFF – any character that ends with FFFE or FFFF is viewed as invalid and generates an error

Note that the parsing will continue after such errors. However, if one or more errors occurred while parsing an input stream, you should not use the output since it is likely invalid.
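
The list of invalid characters above can be sketched as a single predicate. This is only an illustration: the function name is hypothetical and is not part of the csspp API, and it checks nothing beyond the three rules listed in this section.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not a csspp function): returns true when the code
// point is one of those the input stream section lists as invalid.
bool is_invalid_input_char(char32_t c)
{
    if(c == 0x0000)     // the NULL terminator
    {
        return true;
    }
    if(c == 0xFFFD)     // the INVALID character
    {
        return true;
    }
    // any code point ending with FFFE or FFFF (0xFFFE, 0x1FFFF, 0x10FFFE...)
    std::uint32_t const low(static_cast<std::uint32_t>(c) & 0xFFFF);
    return low == 0xFFFE || low == 0xFFFF;
}
```

A lexer built along these lines would call such a predicate on each decoded character, report an error when it returns true, and keep reading.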

An example using the CSS Preprocessor lexer with a string:

...
// parse a string with CSS code
std::stringstream ss;
ss << my_css_string;
csspp::position pos("string.css");
csspp::lexer l(ss, pos);
csspp::node::pointer_t n(l.next_token());
...

Any Character "ANYTHING" (CSS 3)

\??????

A valid character is any character code point defined between 0x000000 and 0x10FFFF inclusive.

The input-stream defines a small set of characters within that range that are considered invalid in CSS Preprocessor streams. Any character considered invalid is replaced by the 0xFFFD code point so the rest of the implementation does not have to check for invalid characters each time.

ASCII Character "ASCII" (CSS 3)

\0-7f

An ASCII character is any value between 0 and 127 inclusive.

CSS 3 references ASCII and non-ASCII characters.

Non-ASCII Character "NON-ASCII" (CSS 3)

ANYTHING except ASCII

A NON-ASCII character is any valid character code point over 127.

Note
In CSS 2.x, characters between \80 and \9F were considered invalid graphic controls.
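
As a minimal sketch, the ANYTHING, ASCII and NON-ASCII rules boil down to two range checks. The enumeration and function names are assumptions made for this example. The sketch looks at ranges only and does not test for the characters that the input stream section marks as invalid.

```cpp
#include <cassert>

// Hypothetical classification, for illustration only.
enum class char_class
{
    ascii,          // 0 to 127 inclusive
    non_ascii,      // over 127, up to 0x10FFFF
    out_of_range    // not a code point at all
};

char_class classify(char32_t c)
{
    if(c <= 0x7F)
    {
        return char_class::ascii;
    }
    if(c <= 0x10FFFF)
    {
        return char_class::non_ascii;
    }
    return char_class::out_of_range;
}
```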

C-Like Comment "COMMENT" (CSS 3)

/* ANYTHING but * followed by / */

Note that "anything" means any character that is not considered invalid by the CSS Preprocessor implementation.

C++ Comment "COMMENT" (CSS Preprocessor Extension)

// ANYTHING but \n \n

Note that "anything" means any character that is not considered invalid by the CSS Preprocessor implementation.

The CSS Preprocessor returns C++ comments that appear on consecutive lines as a single C++ comment token. It works this way because a comment can be marked with @preserve in order to keep it in the output. In most cases this is used for the comment at the top or bottom of the document which includes the copyright notice.

// CSS Preprocessor
// Copyright (c) 2015-2022 Made to Order Software Corp. All Rights Reserved
//
// @preserve
Note
C++ comments are output as standard C-like comments so they are 100% compatible with CSS.

Newline "NEWLINE" (CSS 3)

\n \r\n \r \f

CSS Preprocessor counts lines starting at one and increments the counter each time a "\n", "\r", or "\r\n" sequence is found.

CSS Preprocessor resets the line counter back to one and increments the page counter by one each time a "\f" is found.

CSS Preprocessor also counts the total number of lines and pages in a separate counter.

The line number is used to print out errors. If you use paging (\f), you may have a harder time finding your errors in the current version.

Note that line counting also happens in C++ and C-like comments.
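
The counting rules above can be sketched as follows. The structure and member names are hypothetical and do not mirror the csspp position class. Two details are assumptions of this sketch: the page counter starts at one, and a "\f" does not add to the total number of lines.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical counters, for illustration only.
struct line_counter
{
    int line = 1;           // current line, reset to one by \f
    int page = 1;           // current page (assumed to start at one)
    int total_lines = 1;    // separate counter, never reset

    void feed(std::string const & input)
    {
        for(std::size_t i(0); i < input.length(); ++i)
        {
            char const c(input[i]);
            if(c == '\r')
            {
                // "\r\n" counts as one newline
                if(i + 1 < input.length() && input[i + 1] == '\n')
                {
                    ++i;
                }
                ++line;
                ++total_lines;
            }
            else if(c == '\n')
            {
                ++line;
                ++total_lines;
            }
            else if(c == '\f')
            {
                line = 1;
                ++page;
            }
        }
    }
};
```

Because the line counter restarts on each page, an error message built from it alone is ambiguous in a paged file. Printing the page or the total line count next to it is one way around that.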

Non-Printable "NON-PRINTABLE" (CSS 3)

\0 \8 \b \e-\1f \7f

URLs do not accept non-printable characters unless they are written between quotes. This rule shows you which characters are considered non-printable in CSS 3.
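
A minimal sketch of this rule as a predicate, reading the first two entries of the rule as the range \0 to \8, which is how CSS Syntax Module Level 3 defines non-printable code points. The function name is hypothetical.

```cpp
#include <cassert>

// Hypothetical helper: true for the code points CSS 3 calls non-printable.
bool is_non_printable(char32_t c)
{
    return c <= 0x08                    // \0 to \8
        || c == 0x0B                    // \b (line tabulation)
        || (c >= 0x0E && c <= 0x1F)     // \e to \1f
        || c == 0x7F;                   // \7f (delete)
}
```

Note that tab (\9), line feed (\a), form feed (\c) and carriage return (\d) are not in the set; they are WHITESPACE characters.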

Whitespace "WHITESPACE" (CSS 3)

space \t NEWLINE

Whitespaces are quite important in CSS since they are required in many cases. For example, a dash (-) can start an identifier, so you want to add a space after a dash if you want to use the minus sign.

The CSS Preprocessor documentation often references the WHITESPACE token, meaning any number of whitespaces, including zero. It may be written as WHITESPACE* (0 or more whitespaces) or WHITESPACE+ (one or more whitespaces) to be more explicit.

CSS 3 defines a whitespace, a whitespace-token, and a ws* to represent all those possibilities.

a - b // 'a' 'minus' 'b'
a -b // 'a' '-b', which are two identifiers
a-b // 'a-b', which is one identifier

Decimal Digit "DIGIT" (CSS 3)

0-9

A decimal digit is any digit from 0 to 9.

Hexadecimal Digit "HEXDIGIT" (CSS 3)

0-9 a-f or A-F

A hexadecimal digit is any digit (0-9) or a letter from A to F, either in lowercase (a-f) or in uppercase (A-F).

Escape "ESCAPE" (CSS 3)

\ not NEWLINE or HEXDIGIT HEXDIGIT 1-5 times WHITESPACE HEXDIGIT 6 times

An escape allows for hexadecimal or direct escaping of any character, except the newline character ('\' followed by any newline). There is an exception to the newline rule in strings.

The hexadecimal syntax allows for any number from 0 to 0xFFFFFF. However, the same constraint applies to escape characters and only code points that are considered valid from the input stream are considered valid in an escape sequence. This means any character between 0 and 0x10FFFF except those marked as invalid in the input-stream section.

Note
Contrary to the CSS 3 definition, this definition clearly shows that the space after an escape is being skipped if the escape includes 1 to 5 digits. In case 6 digits are used, the space is NOT skipped in our implementation. It looks like this is consistent with the text explaining how the escape sequence works. If this is a bug, the lexer will need to be changed.
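
The ESCAPE rule, including the handling of the trailing space described in the note above, can be sketched as follows. All names are hypothetical and this is not the csspp implementation. The sketch works on bytes, so direct escapes are limited to ASCII, and it does not check the resulting code point against the list of invalid characters.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Illustrative types and names only; they are not part of the csspp API.
struct escape_result
{
    char32_t    code_point;  // decoded character
    std::size_t length;      // input characters consumed, 0 if not an escape
};

// value of a HEXDIGIT, or -1 if c is not one
int hex_value(char c)
{
    if(c >= '0' && c <= '9') return c - '0';
    if(c >= 'a' && c <= 'f') return c - 'a' + 10;
    if(c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

bool is_newline(char c)
{
    return c == '\n' || c == '\r' || c == '\f';
}

escape_result parse_escape(std::string const & in, std::size_t pos)
{
    // a backslash followed by a newline is not an escape
    if(pos + 1 >= in.length() || in[pos] != '\\' || is_newline(in[pos + 1]))
    {
        return { 0, 0 };
    }
    std::size_t i(pos + 1);
    if(hex_value(in[i]) < 0)
    {
        // direct escape: the character stands for itself
        return { static_cast<char32_t>(static_cast<unsigned char>(in[i])), 2 };
    }
    char32_t value(0);
    int digits(0);
    for(; i < in.length() && digits < 6 && hex_value(in[i]) >= 0; ++i, ++digits)
    {
        value = value * 16 + static_cast<char32_t>(hex_value(in[i]));
    }
    // one WHITESPACE is skipped after 1 to 5 digits, but not after 6
    if(digits < 6 && i < in.length()
    && (in[i] == ' ' || in[i] == '\t' || is_newline(in[i])))
    {
        ++i;
    }
    return { value, i - pos };
}
```

With this sketch, the input \27 followed by a space consumes four characters, while \000041 followed by a space consumes seven and leaves the space in the stream.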

Identifier "IDENTIFIER" (CSS 3)

- a-z A-Z _ or NON-ASCII ESCAPE a-z A-Z 0-9 _ - or NON-ASCII ESCAPE

We do not have any exception to the identifier. Our lexer returns the same identifiers as CSS 3 allows. Note that there is an extension compared to CSS 2.x: characters 0x80 to 0x9F are accepted in identifiers.

Note that since you can write an identifier using escape characters, it can really be composed of any character except NEWLINE and invalid characters. This allows for CSS to represent any character that can be used in an attribute value in HTML.

.class [attribute] #id :state { ... }
Note
Identifiers that start with a dash must be at least two characters.
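
As a rough sketch, the IDENTIFIER rule can be checked on an already decoded string as shown here. The function names are hypothetical, and ESCAPE sequences are deliberately left out to keep the example short.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helpers; escape sequences are not handled in this sketch.
bool is_identifier_start(char32_t c)
{
    return (c >= U'a' && c <= U'z')
        || (c >= U'A' && c <= U'Z')
        || c == U'_'
        || c > 0x7F;                // NON-ASCII
}

bool is_identifier_part(char32_t c)
{
    return is_identifier_start(c)
        || (c >= U'0' && c <= U'9')
        || c == U'-';
}

bool is_plain_identifier(std::u32string const & s)
{
    std::size_t i(0);
    if(i < s.length() && s[i] == U'-')
    {
        ++i;                        // optional leading dash
    }
    if(i >= s.length() || !is_identifier_start(s[i]))
    {
        return false;               // "-" alone or "" is not an identifier
    }
    for(++i; i < s.length(); ++i)
    {
        if(!is_identifier_part(s[i]))
        {
            return false;
        }
    }
    return true;
}
```

The requirement that a dash be followed by a start character is what makes "-" by itself fail while "-b" is accepted.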

Function "FUNCTION" (CSS 3)

A FUNCTION token is an IDENTIFIER immediately followed by an open parenthesis. No WHITESPACE is allowed between the IDENTIFIER and the parenthesis.

a { color: rgba(255, 255, 255, 0.3); }

At Keyword "AT_KEYWORD" (CSS 3)

Various CSS definitions require an AT-KEYWORD. Note that a full keyword can be defined, starting with a dash, with escape sequence, etc.

Our extensions generally make use of AT-KEYWORD commands to extend the capabilities of CSS.

@media { ... }

Placeholder "PLACEHOLDER" (CSS Preprocessor Extension)

The PLACEHOLDER is a CSS Preprocessor extension allowing for the definition of rules that do not get included in your CSS unless they get referenced.

// a simple rule with a placeholder
rule%one { ... }
// a reference to such a rule
.extended
{
@extend %one;
}
Warning
The placeholder token and rules are supported as expected. The @extend is not yet supported in version 1.0.0 of CSS Preprocessor.

Variable "VARIABLE" (CSS Preprocessor Extension)

$ a-z A-Z 0-9 - _

Variables are a CSS Preprocessor extension, very similar to the variables defined in the SASS language (and also to variables in PHP).

The name of a variable is very limited on purpose.

$text_color = #ff00ff;
font-color: $text_color;
Warning
The '-' character is allowed to be backward compatible with SASS, but you are expected to use '_' instead. The lexer automatically transforms all the '-' characters into '_' when reading the input stream.
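
The transformation of '-' into '_' can be pictured with the following sketch. The function name is hypothetical; the lexer does this work internally while reading the variable name.

```cpp
#include <algorithm>
#include <cassert>
#include <string>

// Hypothetical helper: canonical form of a variable name (without the '$').
std::string normalize_variable_name(std::string name)
{
    std::replace(name.begin(), name.end(), '-', '_');
    return name;
}
```

A consequence is that $text-color and $text_color name the same variable.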

Variable Function "VARIABLE_FUNCTION" (CSS Preprocessor Extension)

$ a-z A-Z 0-9 - _ (

Just like regular identifiers followed by a '(', we view variables immediately followed by a '(' as Variable Functions.

The name of a variable function is very limited on purpose.

$text_color($color): desaturate($color, 10%);
font-color: $text_color(#334699);
Warning
The '-' character is allowed to be backward compatible with SASS, but you are expected to use '_' instead. The lexer automatically transforms all the '-' characters into '_' when reading the input stream.

Hash "HASH" (CSS 3)

# a-z A-Z 0-9 _ - or NON-ASCII ESCAPE

The HASH token is an identifier without the first character being limited.

Note
Contrary to an identifier, "-" by itself can be a HASH token.

String "STRING" (CSS 3)

" not " \ or NEWLINE ESCAPE \ NEWLINE " ' not ' \ or NEWLINE ESCAPE \ NEWLINE '

Strings can be written between '...' or "...". The string can contain a quote if properly escaped. You may either escape the quote itself or use the corresponding hexadecimal encoding:

'... \' or \27 ...'
"... \" or \22 ..."

Of course, you can use ' in a string quoted with " and vice versa.

Strings accept a backslash followed by a newline to insert a newline in the string and write that string on multiple lines. In other words, the backslash is removed, but not the '\n' character.

Warning
Note that the '\n' is a C/C++ syntax which is not supported by CSS. That is, in CSS '\n' is equivalent to 'n'.
As an extension, CSS Preprocessor does not allow unterminated strings; this is always an error. CSS, on the other hand, generally makes sure that improperly terminated style attributes are still given a chance to function.

URL "URL" (CSS 3)

The URL token is quite peculiar in CSS. This keyword has been available since CSS 1, which did not really offer functions per se. For that reason it allows some backward compatible syntax which would certainly be quite different in CSS 3 had they chosen not to allow URLs to be entered as is (i.e. to only allow quoted URLs).

Also because of that, the URL is a special token and not a function. Note that the syntax allows for an empty URL, which is important to be able to cancel a previous URL definition (overwrite a background image with nothing.)

url(/images/background.png)
url('images/background.png')

URL Unquoted "URL-UNQUOTED" (CSS 3)

not " ' ( ) \ WHITESPACE or NON-PRINTABLE ESCAPE

The URL can include nearly any kind of character except spaces, parentheses, quotes, and the backslash. To include such characters you may either ESCAPE them or use quotes.

url( It\'s\ the\ same\ as... )
url( "It's the same as..." )

Number "NUMBER", "INTEGER", "DECIMAL_NUMBER" (CSS 3)

+ - DIGIT . DIGIT DIGIT . DIGIT e E + - DIGIT

CSS 3 distinguishes between integers and floating points, although an integer is simply defined as a floating point number with no decimal digits after the period and no exponent.

The CSS Preprocessor lexer returns two different types of tokens: INTEGER and DECIMAL_NUMBER. The compiler may force the use of one or the other in a few places where the type has to be an INTEGER or a DECIMAL_NUMBER. For example, a PERCENT number always uses a DECIMAL_NUMBER.

The CSS 3 lexer is expected to include the signs as part of a number (to simplify the rest of the grammar.) This is important because otherwise rules such as a background field would look like expressions:

background: transparent url(images/example.png) +3px -5px;

Here the +3px and -5px are viewed as two distinct numbers. If the lexer treated + and - as operators instead of as part of the numbers, these two numbers would look like a subtraction (3px - 5px). In any case, when you write expressions, you should always add spaces around your operators. Another case that may catch you is subtracting the result of a function call. Without the spaces, the dash becomes part of the function name. In the following, you are calling a function named 'color-saturate' instead of subtracting 'saturate($color, -33%)' from '$color':

background-color: $color-saturate($color, -33%);

The correct expression would be:

background-color: $color - saturate($color, -33%);

Example of numbers:

0
-1.3
.55
+801
9e3
-1.001e+4
.45e-7
Note
The number of decimal digits is limited to 20. If you write a number with more decimal digits after the decimal point, then an error is generated.
Warning
When a number includes a decimal point or an exponent, it is considered to be a DECIMAL_NUMBER even when that number is otherwise an integer (1.001e4 = 10010 which is an integer.)
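
The distinction between INTEGER and DECIMAL_NUMBER can be sketched with a small scanner that follows the NUMBER rule. The names are hypothetical and this is not the csspp lexer code. The limit of 20 decimal digits is not enforced, and anything following the number (such as a dimension) makes the whole string rejected.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical result type, for illustration only.
enum class number_type
{
    not_a_number,
    integer,
    decimal_number
};

number_type classify_number(std::string const & s)
{
    std::size_t i(0);

    // consume a run of DIGITs and return how many were found
    auto digits = [&s, &i]() -> std::size_t
    {
        std::size_t n(0);
        while(i < s.length() && s[i] >= '0' && s[i] <= '9')
        {
            ++i;
            ++n;
        }
        return n;
    };
    // consume an optional sign
    auto sign = [&s, &i]()
    {
        if(i < s.length() && (s[i] == '+' || s[i] == '-'))
        {
            ++i;
        }
    };

    bool decimal(false);
    sign();
    std::size_t const integral(digits());
    if(i < s.length() && s[i] == '.')
    {
        ++i;
        if(digits() == 0)
        {
            return number_type::not_a_number;   // '.' requires a DIGIT
        }
        decimal = true;
    }
    else if(integral == 0)
    {
        return number_type::not_a_number;
    }
    if(i < s.length() && (s[i] == 'e' || s[i] == 'E'))
    {
        ++i;
        sign();
        if(digits() == 0)
        {
            return number_type::not_a_number;
        }
        decimal = true;                         // an exponent means decimal
    }
    if(i != s.length())
    {
        return number_type::not_a_number;
    }
    return decimal ? number_type::decimal_number : number_type::integer;
}
```

Applied to the examples above, only 0 and +801 come out as integers; every other example includes a decimal point or an exponent.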

Dimension "DIMENSION" (CSS 3)

When a number is immediately followed by an identifier, the result is a dimension.

Note that the name of a dimension can start with the character 'e' (i.e. "13em"). However, if the character 'e' is followed by a sign ('+' or '-') or a DIGIT, then the 'e' is taken as the exponent character.

Warning
Note that "-" by itself is not a valid identifier. This means a number followed by a dash and another number is clearly a subtraction.
font-size: 12pt;
height: 13px;
width: 3em;

The lexer lets you enter any dimension. At some point the compiler will make sure that all dimensions are understood by CSS. That being said, the CSS Preprocessor is likely to understand many other dimensions and convert them to one that CSS 3 understands.

You may find a complete list of supported CSS 3 dimensions here: http://www.w3.org/TR/css3-values/

Percent "PERCENT" (CSS 3)

A PERCENT is a number immediately followed by the '%' character. Internally a PERCENT is always represented as a decimal number, even if the number was an integer (integers are automatically converted as required).

33.33%
100%
2.25%

Note that the percent character can be appended using the escape character. In that case, it is viewed as a dimension which will fail validation.

// the following two numbers are DIMENSIONs, not PERCENT
33.33\%
100\25

Unicode Range "UNICODE_RANGE" (CSS 3)

U u + HEXDIGIT 1-6 times HEXDIGIT 1-5 times ? 1 to (6 - digits) times HEXDIGIT 1-6 times - HEXDIGIT 1-6 times

Define a range of Unicode characters from their code points. The expression allows for:

  • One specific code point (U+<######>);
  • A mask with all the code points that match (U+<###>???);
  • A range with a start and an end code point (U+<######>-<######>).

The mask mechanism actually generates a range like the third syntax, only it replaces the '?' character with '0' for the start code point and with 'f' for the end code point.

U+ff // characters from 0xFF to 0xFF
U+3?? // characters from 0x300 to 0x3FF
U+320-34F // characters from 0x320 to 0x34F explicitly
Note
The lexer makes use of the csspp::unicode_range_t class to record these values in a UNICODE_RANGE node. The range is then compressed and saved in one 64-bit number.

A Unicode range is used by @font-face definitions to limit the number of characters to be loaded for a page.
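
The way the three syntaxes, and the mask in particular, turn into a start and an end code point can be sketched as follows. The structure and function names are hypothetical. The sketch takes the text that follows "U+" and says nothing about how csspp::unicode_range_t stores the result.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical result type, for illustration only.
struct code_point_range
{
    std::uint32_t start;
    std::uint32_t end;
};

// value of a HEXDIGIT, or -1 if c is not one
int hex_digit_value(char c)
{
    if(c >= '0' && c <= '9') return c - '0';
    if(c >= 'a' && c <= 'f') return c - 'a' + 10;
    if(c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// body is the text after "U+" or "u+"
bool parse_unicode_range(std::string const & body, code_point_range & out)
{
    std::size_t i(0);
    int count(0);
    std::uint32_t start(0);
    for(; i < body.length() && count < 6 && hex_digit_value(body[i]) >= 0; ++i, ++count)
    {
        start = start * 16 + static_cast<std::uint32_t>(hex_digit_value(body[i]));
    }
    if(count == 0)
    {
        return false;               // at least one HEXDIGIT is required
    }
    std::uint32_t end(start);
    bool mask(false);
    for(; i < body.length() && count < 6 && body[i] == '?'; ++i, ++count)
    {
        start = start * 16;         // '?' is replaced by 0 in the start
        end = end * 16 + 0xF;       // and by f in the end
        mask = true;
    }
    if(!mask && i < body.length() && body[i] == '-')
    {
        ++i;
        end = 0;
        int n(0);
        for(; i < body.length() && n < 6 && hex_digit_value(body[i]) >= 0; ++i, ++n)
        {
            end = end * 16 + static_cast<std::uint32_t>(hex_digit_value(body[i]));
        }
        if(n == 0)
        {
            return false;
        }
    }
    if(i != body.length())
    {
        return false;               // unexpected trailing characters
    }
    out = { start, end };
    return true;
}
```

Returning a start and an end in all three cases means that code consuming the range never needs to know which syntax was used.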

Include Match (E[attr~="value"])

~=

Match when the parameter on the right is included in the list of parameters found on the left. The value on the left is a list of whitespace separated words (i.e. a list of classes).

a[class ~= "green"] // equivalent to a.green

Dash Match (E[attr|="value"])

|=

Match when the first element of the hyphen separated list of words on the left is equal to the value on the right.

[lang |= "en"] // match lang="en-US"

Prefix Match (E[attr^="value"])

^=

Match when the value on the left starts with the value on the right.

a[entity ^= 'blue'] // match entity="blue-lagoon"

Suffix Match (E[attr$="value"])

$=

Match when the value on the left ends with the value on the right.

a[entity $= 'lagoon'] // match entity="blue-lagoon"

Substring Match (E[attr*="value"])

*=

Match when the value on the right is found in the value on the left.

a[entity *= '-lag'] // match entity="blue-lagoon"
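
The five attribute match operators described above can be sketched as plain string predicates. The function names are hypothetical; the lexer only returns the operator tokens, and the matching itself happens in the browser. Edge cases such as an empty value are not treated specially here.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// ~= : the value is one of the whitespace separated words of the attribute
bool include_match(std::string const & attr, std::string const & value)
{
    std::istringstream words(attr);
    std::string word;
    while(words >> word)
    {
        if(word == value)
        {
            return true;
        }
    }
    return false;
}

// |= : the first element of the hyphen separated list equals the value
bool dash_match(std::string const & attr, std::string const & value)
{
    // find() returns npos when there is no hyphen, so substr() then
    // returns the whole attribute
    return attr.substr(0, attr.find('-')) == value;
}

// ^= : the attribute starts with the value
bool prefix_match(std::string const & attr, std::string const & value)
{
    return attr.compare(0, value.length(), value) == 0;
}

// $= : the attribute ends with the value
bool suffix_match(std::string const & attr, std::string const & value)
{
    return attr.length() >= value.length()
        && attr.compare(attr.length() - value.length(), value.length(), value) == 0;
}

// *= : the value is found somewhere in the attribute
bool substring_match(std::string const & attr, std::string const & value)
{
    return attr.find(value) != std::string::npos;
}
```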

Column Match "COLUMN" (CSS 3 Token, CSS Preprocessor Extension)

||

The CSS 3 documentation says:

<column-token> has been added, to keep Selectors parsing in single-token lookahead.

At this point I am not too sure whether that means it is only a lexer artifact or whether it would be an operator people can use.

See also
http://stackoverflow.com/questions/30702971/how-is-the-operator-used-in-css

All of that being said, since we support the Logical AND "AND" (CSS Preprocessor extension), we accept this operator as the Logical OR in our expressions.

The OR operator takes two boolean values. If at least one of these boolean values is true, then the result is true, otherwise it is false.

You may also use the 'or' identifier.

width: $bool1 || $bool2 ? $left-column : $right-column;
// or
width: $bool1 or $bool2 ? $left-column : $right-column;

Logical AND "AND" (CSS Preprocessor extension)

&&

The '&&' operator is the logical AND operator. It returns true when its left and right hand sides are both true.

You may also use the 'and' identifier.

width: $bool1 && $bool2 ? $left-column : $right-column;
// or
width: $bool1 and $bool2 ? $left-column : $right-column;
See also
Column Match "COLUMN" (CSS 3 Token, CSS Preprocessor Extension)

Double Equal "EQUAL" (CSS Preprocessor extension)

==

CSS 3 clearly uses '=' to test for equality. Somehow, SASS added '==' which is really not consistent. To be more compatible with SASS, we support both. At this point we do not warn or anything when '==' is found. We may do so later. Internally, we immediately convert '==' to the exact same token as '='.

Not Equal "NOT_EQUAL" (CSS Preprocessor extension)

!=

We offer a 'not equal' operator for our expressions and also attributes.

color: $var != 3 ? red : blue;
and
p[book!='family'] { display: none; }

The attribute extension is converted by the compiler to valid CSS as in:

p:not([book='family']) { display: none; }

Assignment "ASSIGNMENT" (CSS Preprocessor extension)

:=

We added the ':=' operator to allow one to set a variable within an expression. For example, you could write an assignment of a long expression, then reuse that value many times in the rest of the expression:

(value := rather + complicated * expression ** here,
display(value * value), enlarge(value))

Less or Equal To "LESS_EQUAL" (CSS Preprocessor extension)

<=

We added the '<=' operator to allow one to compare values with each other in an expression.

width: $left-column <= $right-column ? $left-column : $right-column;
// which in this case is equivalent to:
width: min($left-column, $right-column);

Greater or Equal To "GREATER_EQUAL" (CSS Preprocessor extension)

>=

We added the '>=' operator to allow one to compare values with each other in an expression.

width: $left-column >= $right-column ? $left-column : $right-column;
// which in this case is equivalent to:
width: max($left-column, $right-column);

Power "POWER" (CSS Preprocessor extension)

**

The '**' operator is an extension that allows you to calculate the power of a number (left hand side) by another (right hand side). Note that dimensions (numbers with a unit) cannot be used with the '**' operator.

width: 1px * 2 ** 5;

Comment Document Open "CDO" (CSS 3)

<!--

The Comment Document Open is understood so that it can be skipped when reading a block of data coming from an HTML <style> tag.

Comment Document Close "CDC" (CSS 3)

-->

The Comment Document Close is understood so that it can be skipped when reading a block of data coming from an HTML <style> tag.

Delimiter "DELIMITER" (CSS 3)

The DELIMITER is activated for any character that does not activate any other lexer rule.

For example, a period that is not followed by a DIGIT is returned as itself. The grammar generally shows delimiters using a simple quoted string rather than its node_type_t name.

The delimiters actually return a specific node_type_t value for each one of these characters:

  • = – EQUAL
  • , – COMMA
  • : – COLON
  • ; – SEMICOLON
  • ! – EXCLAMATION
  • ? – CONDITIONAL
  • > – GREATER_THAN
  • ( – OPEN_PARENTHESIS
  • ) – CLOSE_PARENTHESIS
  • [ – OPEN_SQUAREBRACKET
  • ] – CLOSE_SQUAREBRACKET
  • { – OPEN_CURLYBRACKET
  • } – CLOSE_CURLYBRACKET
  • . – PERIOD
  • & – REFERENCE
  • < – LESS
  • + – ADD
  • - – SUBTRACT (if by itself or at least not followed by an identifier)
  • $ – DOLLAR
  • ~ – PRECEDED
  • * – MULTIPLY
  • | – SCOPE
  • / – DIVIDE
  • % – MODULO

Any character that does not match one of these DELIMITER characters, or another lexer token, generates an immediate lexer error.
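
The mapping from a character to its node type can be pictured as a simple switch. The sketch below covers only a handful of the delimiters listed above and returns the names as strings instead of node_type_t values; the function name is hypothetical.

```cpp
#include <cassert>
#include <string>

// Hypothetical lookup covering a subset of the delimiters; an empty string
// stands for "not a delimiter", where the lexer would generate an error.
std::string delimiter_name(char c)
{
    switch(c)
    {
    case '=': return "EQUAL";
    case ',': return "COMMA";
    case ':': return "COLON";
    case ';': return "SEMICOLON";
    case '+': return "ADD";
    case '*': return "MULTIPLY";
    case '/': return "DIVIDE";
    case '%': return "MODULO";
    default:  return "";
    }
}
```

In a real lexer this lookup only runs once every other rule has failed to match, which is why characters such as '-' or '.' need extra context before they are returned as SUBTRACT or PERIOD.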

Note
The EXCLAMATION is returned as a simple token by the lexer. The parser will convert it to a form of identifier unless it is not followed by an identifier in which case an error is generated. The parser will also take care of removing whitespaces.
The DIVIDE character is viewed as a standard CSS 3 separator when used in the font field (actually, any field that matches, as defined in the scripts/validation/has-font-metrics.scss script) as in:
font: 17px/1.3em helvetica;
Note
The lexer cannot know what to do with the DIVIDE. The compiler, however, knows at the time it runs the expression since it has the name of the field ('font'). In that case it tells the expression class to handle the DIVIDE as a CSS 3 separator. That means the sequence <number> / <number> will generate the token FONT_METRICS.
In any other field, one whose name does not match, the slash is viewed as a regular DIVIDE operator, so <number> / <number> is calculated and the result of the operation is returned.
In order to issue a division in a font field, one can use parentheses:
$height: 480px;
font: ($height / 32) / 1.3em helvetica;

Documentation of CSS Preprocessor.

This document is part of the Snap! Websites Project.

Copyright by Made to Order Software Corp.