Building a Language: Tokens

Continuing our series on building a programming language, this post looks at tokens. A token is the basic unit of source code in a Codexivo program. It represents an indivisible piece of the code: a keyword, an identifier, an operator, or a literal.

A token in Codexivo has two main components:

Token Type (TokenType): Indicates the category of the token. For example, TokenType.IDENT for identifiers, TokenType.LET for the variable keyword, TokenType.PLUS for the + operator, and TokenType.NUM for numeric literals. The token type tells the rest of the pipeline what kind of token this is and how it should be interpreted.
Literal: The textual value associated with the token, exactly as it appears in the source code. For example, for a TokenType.IDENT token, the literal is the identifier's name ("x", "foo", etc.). For a TokenType.NUM token, the literal is the number as written ("123", "3.14"). For a string, the literal would be the text between the quotes ("Hello", "world", etc.).

In addition to these essential components, a token in Codexivo can carry extra information depending on the context and the parser's needs, such as its location in the source code (line and column), which is useful for error diagnostics and syntax highlighting.

In short, a Codexivo token is composed of a token type that indicates its category, and a literal that represents its textual value. It can also carry additional details like source location, as needed.
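The structure described above can be sketched as a small TypeScript type. This is only an illustration: the `Token` interface, the optional `line` and `column` fields, and the `makeToken` helper are hypothetical names chosen for this example, not necessarily how Codexivo defines them.

```typescript
// Minimal subset of token types, just enough for this sketch.
enum TokenType {
  IDENT = "IDENT",
  NUM = "NUM",
}

// Hypothetical token shape: `type` and `literal` are the two required parts;
// `line` and `column` are the optional source location.
interface Token {
  type: TokenType;
  literal: string;
  line?: number;
  column?: number;
}

// Hypothetical helper: builds a token and attaches a location only when
// both line and column are provided.
function makeToken(
  type: TokenType,
  literal: string,
  line?: number,
  column?: number
): Token {
  return line === undefined || column === undefined
    ? { type, literal }
    : { type, literal, line, column };
}

const ident = makeToken(TokenType.IDENT, "x", 1, 10);
const num = makeToken(TokenType.NUM, "3.14");
```

Keeping the location optional means a lexer can start without position tracking and add it later without changing every place that creates a token.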

Some languages split numbers into separate types such as integers, floats, and doubles. In Codexivo, a single TokenType.NUM handles both integers and floats without distinction, much like JavaScript's single number type. With that simplification, these are the TokenTypes we defined for Codexivo, covering the basic operators, delimiters, and keywords:

export enum TokenType {
  AND = "AND",
  ASSIGN = "ASSIGN",
  ASTERISK = "ASTERISK",
  BANG = "BANG",
  COMMA = "COMMA",
  DO = "DO",
  ELSE = "ELSE",
  ELSEIF = "ELSEIF",
  EOF = "EOF",
  EQ = "EQ",
  FALSE = "FALSE",
  FOR = "FOR",
  FUNCTION = "FUNCTION",
  GT = "GT",
  GT_EQ = "GT_EQ",
  IDENT = "IDENT",
  IF = "IF",
  ILLEGAL = "ILLEGAL",
  LBRACE = "LBRACE",
  LET = "LET",
  LPAREN = "LPAREN",
  LT = "LT",
  LT_EQ = "LT_EQ",
  MINUS = "MINUS",
  NEQ = "NEQ",
  NOT = "NOT",
  NUM = "NUM",
  OR = "OR",
  PLUS = "PLUS",
  PLUS_EQ = "PLUS_EQ",
  RBRACE = "RBRACE",
  RETURN = "RETURN",
  RPAREN = "RPAREN",
  SEMICOLON = "SEMICOLON",
  SLASH = "SLASH",
  TRUE = "TRUE",
  WHILE = "WHILE",
}
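To make the single NUM type concrete, here is a minimal sketch of how a lexer might read a number. The `readNumber` and `numberToken` helpers are hypothetical and assume a simple rule: one or more digits, optionally followed by a dot and more digits. Codexivo's real lexer may accept a different number syntax.

```typescript
// Only the token type this sketch needs.
enum TokenType {
  NUM = "NUM",
}

interface Token {
  type: TokenType;
  literal: string;
}

// Hypothetical helper: reads an integer or decimal literal starting at
// `start` and returns its text, or "" if no number starts there.
function readNumber(source: string, start: number): string {
  const isDigit = (ch: string | undefined): boolean =>
    ch !== undefined && ch >= "0" && ch <= "9";

  let pos = start;
  while (isDigit(source[pos])) pos++;

  // Consume a fractional part only when a digit follows the dot,
  // so "7." is read as the number "7" followed by a separate ".".
  if (pos > start && source[pos] === "." && isDigit(source[pos + 1])) {
    pos++;
    while (isDigit(source[pos])) pos++;
  }
  return source.slice(start, pos);
}

// Integers and floats both end up as the same token type.
function numberToken(source: string, start: number): Token | null {
  const literal = readNumber(source, start);
  return literal === "" ? null : { type: TokenType.NUM, literal };
}
```

Because the literal keeps the original text, a later stage can still decide how to convert "10" or "3.14" into a runtime value.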

Once the TokenTypes are defined, we need to establish the literals for the symbols and keywords used in Codexivo. Based on JavaScript, we considered the following symbols:

const tokenPatterns: { [key: string]: TokenType } = {
  "=": TokenType.ASSIGN,
  "==": TokenType.EQ,
  "+": TokenType.PLUS,
  "+=": TokenType.PLUS_EQ,
  "-": TokenType.MINUS,
  "*": TokenType.ASTERISK,
  "/": TokenType.SLASH,
  "<": TokenType.LT,
  "<=": TokenType.LT_EQ,
  ">": TokenType.GT,
  ">=": TokenType.GT_EQ,
  "!": TokenType.BANG,
  "!=": TokenType.NEQ,
  ",": TokenType.COMMA,
  ";": TokenType.SEMICOLON,
  "(": TokenType.LPAREN,
  ")": TokenType.RPAREN,
  "{": TokenType.LBRACE,
  "}": TokenType.RBRACE,
  "": TokenType.EOF,
};
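Several of these symbols share a prefix ("=" and "==", "+" and "+="), so a lexer has to prefer the longer match. The sketch below shows one common way to do that: try the two-character symbol first and fall back to a single character. The `lookupSymbol` and `matchSymbol` helpers are hypothetical, and the table is a reduced copy of the one above.

```typescript
enum TokenType {
  ASSIGN = "ASSIGN",
  EOF = "EOF",
  EQ = "EQ",
  ILLEGAL = "ILLEGAL",
  PLUS = "PLUS",
  PLUS_EQ = "PLUS_EQ",
}

// Reduced copy of the symbol table, for illustration only.
const tokenPatterns: { [key: string]: TokenType } = {
  "=": TokenType.ASSIGN,
  "==": TokenType.EQ,
  "+": TokenType.PLUS,
  "+=": TokenType.PLUS_EQ,
  "": TokenType.EOF,
};

// Looks up own keys only, so inherited names such as "toString" never match.
function lookupSymbol(text: string): TokenType | undefined {
  return Object.prototype.hasOwnProperty.call(tokenPatterns, text)
    ? tokenPatterns[text]
    : undefined;
}

// Hypothetical helper: longest match first, then a single character.
// An empty slice (end of input) maps to EOF, unknown characters to ILLEGAL.
function matchSymbol(
  source: string,
  pos: number
): { type: TokenType; literal: string } {
  const two = source.slice(pos, pos + 2);
  const twoType = two.length === 2 ? lookupSymbol(two) : undefined;
  if (twoType !== undefined) return { type: twoType, literal: two };

  const one = source.slice(pos, pos + 1);
  return { type: lookupSymbol(one) ?? TokenType.ILLEGAL, literal: one };
}
```

Without the longest-match rule, an input such as `a == b` would be split into two ASSIGN tokens instead of a single EQ.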

And the keywords:

const keywords: { [key: string]: TokenType } = {
  falso: TokenType.FALSE,
  procedimiento: TokenType.FUNCTION,
  regresa: TokenType.RETURN,
  si: TokenType.IF,
  si_no: TokenType.ELSE,
  pero_si: TokenType.ELSEIF,
  variable: TokenType.LET,
  verdadero: TokenType.TRUE,
  mientras: TokenType.WHILE,
  hacer: TokenType.DO,
  hasta_que: TokenType.WHILE,
  y: TokenType.AND,
  o: TokenType.OR,
  para: TokenType.FOR,
  no: TokenType.NOT,
};
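A keyword table like this is typically consulted after the lexer has read a whole word: if the word is in the table it is a keyword, otherwise it is an ordinary identifier. The `lookupIdent` function below is a hypothetical helper showing that decision, using a reduced copy of the table.

```typescript
enum TokenType {
  FUNCTION = "FUNCTION",
  IDENT = "IDENT",
  LET = "LET",
  RETURN = "RETURN",
}

// Reduced copy of the keyword table, for illustration only.
const keywords: { [key: string]: TokenType } = {
  procedimiento: TokenType.FUNCTION,
  regresa: TokenType.RETURN,
  variable: TokenType.LET,
};

// Hypothetical helper: returns the keyword's token type, or IDENT for any
// other word. Checking own keys keeps a word like "constructor" an identifier.
function lookupIdent(literal: string): TokenType {
  const found = Object.prototype.hasOwnProperty.call(keywords, literal)
    ? keywords[literal]
    : undefined;
  return found ?? TokenType.IDENT;
}
```

For example, `lookupIdent("variable")` yields TokenType.LET, while `lookupIdent("suma")` yields TokenType.IDENT.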

And with that we have all the tokens needed for this first version. These tokens will be used by the lexer to identify the lexical elements in source code like the following:

// Variable definitions
variable n = 10;
variable m = 20;
variable q = "hola mundo";

// Procedure definitions
procedimiento suma(a, b) {
  regresa a + b;
}

procedimiento resta(a, b) {
  regresa a - b;
}

procedimiento multiplicacion(a, b) {
  regresa a * b;
}

procedimiento division(a, b) {
  regresa a / b;
}

// Procedure calls
suma(n, m);

resta(n, m);

// "para" loop with a condition and a loop body
para (variable i = 0; i < 10; i = i + 1) {
  multiplicacion(n, m);
}

// "hacer" / "hasta_que" loop (tokenized as DO and WHILE) with a condition and a loop body
hacer {
  division(n, m);
} hasta_que (n > 0);
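To give a feel for what the lexer will produce, the sketch below tokenizes the first statement of the program, `variable n = 10;`. It is a deliberately small stand-in, not Codexivo's lexer: it uses a single regular expression instead of a character-by-character scanner, and the `tokenize` and `lookup` helpers, along with the reduced tables, are assumptions made for this example.

```typescript
enum TokenType {
  ASSIGN = "ASSIGN",
  EOF = "EOF",
  IDENT = "IDENT",
  ILLEGAL = "ILLEGAL",
  LET = "LET",
  NUM = "NUM",
  SEMICOLON = "SEMICOLON",
}

interface Token {
  type: TokenType;
  literal: string;
}

// Reduced tables covering only what this one statement needs.
const symbols: { [key: string]: TokenType } = {
  "=": TokenType.ASSIGN,
  ";": TokenType.SEMICOLON,
};
const keywords: { [key: string]: TokenType } = {
  variable: TokenType.LET,
};

// Own-key lookup shared by both tables.
function lookup(
  table: { [key: string]: TokenType },
  key: string
): TokenType | undefined {
  return Object.prototype.hasOwnProperty.call(table, key) ? table[key] : undefined;
}

// Hypothetical regex-based tokenizer: whitespace, word, number, or any
// other single character, tried in that order at each position.
function tokenize(source: string): Token[] {
  const pieces = source.match(/\s+|[A-Za-z_][A-Za-z0-9_]*|\d+(?:\.\d+)?|./g) ?? [];
  const tokens: Token[] = [];
  for (const literal of pieces) {
    if (/^\s/.test(literal)) continue; // whitespace separates tokens
    let type: TokenType;
    if (/^[A-Za-z_]/.test(literal)) {
      type = lookup(keywords, literal) ?? TokenType.IDENT;
    } else if (/^\d/.test(literal)) {
      type = TokenType.NUM;
    } else {
      type = lookup(symbols, literal) ?? TokenType.ILLEGAL;
    }
    tokens.push({ type, literal });
  }
  tokens.push({ type: TokenType.EOF, literal: "" });
  return tokens;
}

// Token types for the statement, in order:
// LET, IDENT, ASSIGN, NUM, SEMICOLON, EOF
const tokens = tokenize("variable n = 10;");
```

Each token keeps its literal, so the parser receives both the category (LET, IDENT, and so on) and the exact text ("variable", "n", "=", "10", ";").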

With the example above, we can see Codexivo taking shape. We've defined the TokenTypes, the symbols, and the keywords that form the foundation of the language. In upcoming posts we'll explore more of Codexivo, including the parser and the Abstract Syntax Tree (AST), which will allow us to understand and execute Codexivo code.

It's exciting to watch a programming language evolve from nothing. Keep following this series to discover more about the process and the key concepts behind Codexivo.