Marc-Olivier Fiset Aspiring at Life and Software Development

A Very Subtle Detail About PHP Character Escaping

Most of you probably already know that PHP offers two ways of defining strings: 'single-quoted strings' and "double-quoted strings". Most of you should also know that unlike single-quoted strings, double-quoted strings offer variable interpolation:

$age = 23;
$name = 'Marco';

echo "$name is $age years old";
// => 'Marco is 23 years old'
?>

This is something you can't do with single-quoted string. For a lot of us, that's about all we know about the two types of string. We often use the two interchangeably without much understanding of their differences, apart from that thing they do with variables. But there is another very subtle difference, the very reason why I wrote this article.

Character Escaping

I'll use an example to demonstrate my point. Let's say we have a multi-line string, and we want to split it into an array containing the individual lines:

You should note that I'm running on Mac OS X, so line endings are actually \n, and not \r\n like on Windows.

$str = 'This is a\nmulti-line string'; // Note the \n in there, indicating a line feed.
$lines = explode('\n', $str);

var_dump($lines);
?>

If you happen to run this code, you will see that it prints out:

array(2) {
  [0] =>
  string(9) "This is a"
  [1] =>
  string(17) "multi-line string"
}

We see an array containing two elements: the lines of our string. Exactly what we expected. Now, we'll create a file whose text is split on multiple lines:

Hello, this is
a file with
three lines.

We'll modifiy the code to read the string from the file we just created:

$str = file_get_contents('file.txt');
$lines = explode('\n', $str);

var_dump($lines);
?>

Run that again, and we should get an array with 3 elements, one for each line of the file. Unfortunately, we don't. Here is what we get instead:

array(1) {
  [0] =>
  string(39) "Hello, this is
a file with
three lines."
}

We get an array with a single element, containing the whole string. Not quite what we expected. Hmmm...

The Solution

The solution couldn't be simpler. The only thing you have to do for this to work, is change the single quotes to double quotes in the call to the explode function, just like this: $lines = explode("\n", $str);. Now it works as it should:

array(3) {
  [0] =>
  string(14) "Hello, this is"
  [1] =>
  string(11) "a file with"
  [2] =>
  string(12) "three lines."
}

The Explanation

In PHP, single-quoted strings don't interpret most escaping sequences. It means that the sequence \n is interpreted as two distinct characters: a backslash \, and the n character. It's also the reason why our first example works: we are using single-quoted strings for both the multi-line string and the split delimiter. If we were to use two different types of string (single-quotes for the string, double quotes for the delimiter, and vice-versa), it wouldn't work, like with the file example.

When loading a file into a string, the line feed is interpreted as a single character. When we try to split it with \n inside single quotes, they are seen by the PHP engine as two distinct characters, so the split doesn't happen. When switching the split delimiter to double quotes, the \n escape sequence is interpreted properly as a single line feed character, making the split work correctly.

This can be demonstrated further by comparing the length of the two types of string:

var_dump(strlen('\n'));
// => 2
var_dump(strlen("\n"));
// => 1
?>

That's it!

Programming Language Implementation - Part 4 - Variables

Today, we'll be looking at how to implement variable support into the Fun programming language. In the previous article, we wrote the Interpreter and everything surrounding it. If you didn't code along with the series so far, here is a downloadable archive of the code including everything until part 3.

Before I get started, some of you might have noticed that I've given up on TDD. While I did give up on TDD for the early stages of the project, I have not tossed aside all unit testing. I will not be writing those tests as part of the articles though, but they will be available in the fun-lang GitHub repository so you can check them out if you want to.

Now hang on tight, we have a lot of stuff to cover today so let's get started!

An Example

Here is a sneak peak of what we'll be able to do at the end of this article:

$ cat op.fun
x = 25 * 8;
y = x * 2;
z = y / 4;
z * z;
$ php fun.php op.fun
int(10000)

A lot of new things are happening here. We can now:

  • Parse multiple expressions, each terminated by a ;
  • Assign variables with the = operator
  • Use variables as part of an operation

You should also note that the last expression of the source is assigned as the result of the program. This will come in handy when we implement functions in the future.

Changes to the grammar

Up until now, I did not talk to you about the grammar. We did not really need to think much about it since it was so simple. Here is the grammar we support for the moment, really not much to see here:

Expression = Number (Operator Number)?
Number     = /\d+/
Operator   = /[+\-*\/]/

That's all we do for the moment. An expression is a Number, optionnally followed by an Operator and another Number. Numbers and operators are represented by the regex that we feed to the Lexer, as we saw in part 1. An important thing to note, is that our Parser is completely based on this grammar:

private function parseExpressionNode()
{
    // An Expression absolutely starts with a Number
    $left = $this->parseNumberNode();

    // And maybe there is nothing else
    if ($this->isEmpty())
        return new ExpressionNode($left);

    // If there is something else, it must be an Operator, and another Number
    $operatorToken = $this->expectTokenType(TokenType::Operator);
    $right = $this->parseNumberNode();

    return new ExpressionNode($left, $operatorToken->getValue(), $right);
}
?>

Just like our grammar told us. Now, we need to parse multiple expressions, each separated by a semi-colon. Here is what such a grammar would look like:

ExpressionList = (Expression Terminator)+
Expression     = Number (Operator Number)?
Number         = /\d+/
Operator       = /[+\-*\/]/
Terminator     = ";"

We can also assign variables, but as we saw, variable assignments can not only contain Numbers, but also Expressions. For a reason I'll explain in a moment, we'll rename the Expression grammar rule to Operation, and we'll use Expression for something else:

ExpressionList = (Expression Terminator)+
Expression     = ??? // What do we do here?
Operation      = Number (Operator Number)?
Number         = /\d+/
Operator       = /[+\-*\/]/
Terminator     = ";"

If we take our example from the beginning of the article, we can see that the whole source file consists of an ExpressionList. An expression list being defined as a series of one of more expression, each terminated by a semi-colon, means that every line of code is actually an Expression. If we take our example a step further, we can see that an Expression can be either of two things: a variable assignment, or an operation. Let's see how this translates to our grammar:

ExpressionList      = (Expression Terminator)+
Expression          = Assignment | Operation
Assignment          = Identifier AssignmentOperator Operation
Operation           = Number (Operator Number)?
Number              = /\d+/
Operator            = /[+\-*\/]/
AssignementOperator = "="
Identifier          = /[_a-zA-Z][_a-zA-Z0-9]*/
Terminator          = ";"

Oh! We introduced two new token types: Identifier and AssignmentOperator.

For the moment, we'll take for granted that every terminating rule (any grammar rule that does not reference another rule) will be implemented at the Lexer level, mapping the regex to a TokenType of the same name as the rule itself.

An identifier can start with a letter or an underscore, followed by any number of letters, numbers or underscores. An assignement operator is simply the equal sign. We'll add more of those in the future: +=, -= and their friends.

We're almost done! We got one last thing to add to the grammar before we can get to code: using variables as an operand. Let's introduce the Term rule, and reference it instead of Number in the Operation rule:

Operation = Term (Operator Term)?
Term      = Number | Identifier

So here is what our complete grammar looks like with variable support:

ExpressionList      = (Expression Terminator)+
Expression          = Assignment | Operation
Assignment          = Identifier AssignmentOperator Operation
Operation           = Term (Operator Term)?
Term                = Number | Identifier
Number              = /\d+/
Identifier          = /[_a-zA-Z][_a-zA-Z0-9]*/
Operator            = /[+\-*\/]/
AssignementOperator = "="
Terminator          = ";"

This new grammar will drive the reimplementation of our Parser, but that being the most complicated part, we'll leave it off until the end.

Changes to the Lexer

The previous section introduced a few new token types. We'll need to define some new lexing rules to accomodate them. But before we do that, we'll create a FunLexer class, that when instantiated will come pre-loaded with all the required rules to tokenize a Fun source file:

// src/Fun/Lexing/FunLexer.php

<?php namespace Fun\Lexing;

class FunLexer extends Lexer
{
    public function __construct()
    {
        $this->addTokenDefinition('/\s+/', TokenType::Whitespace);
        $this->addTokenDefinition('/\d+/', TokenType::Number);
        $this->addTokenDefinition('/[+\-*\/]/', TokenType::Operator);
    }

    private function addTokenDefinition($pattern, $type)
    {
        $this->add(new TokenDefinition($pattern, $type));
    }
}
?>

And in the fun.php script, we'll replace these lines:

$lexer = new Lexer();
$lexer->add(new TokenDefinition('/\s+/', TokenType::Whitespace));
$lexer->add(new TokenDefinition('/\d+/', TokenType::Number));
$lexer->add(new TokenDefinition('/[+\-*\/]/', TokenType::Operator));
?>

with this one simple line:

$lexer = new FunLexer();
?>

A couple of use statements at the top are now obsolete. Here they are after cleaning them up and adding FunLexer to the mix:

use Fun\Lexing\FunLexer;
use Fun\Parsing\Parser;
use Fun\Interpreting\Interpreter;
?>

Now that we built our custom lexer class, let's add the new token definitions that we'll need to implement variable support. Add these new token types at the end of the TokenType class:

const Identifier = 4;
const AssignmentOperator = 5;
const Terminator = 6;
?>

Next, we'll add the new rules matching those tokens to our newly created FunLexer class, at the end of the constructor:

$this->addTokenDefinition('/[_a-zA-Z][_a-zA-Z0-9]*/', TokenType::Identifier);
$this->addTokenDefinition('/=/', TokenType::AssignmentOperator);
$this->addTokenDefinition('/;/', TokenType::Terminator);
?>

Now if you run the usual command to interpret the op.fun file, everything should be working just as before.

New Node Types

Let's take a look back at the grammar we defined a little earlier:

ExpressionList      = (Expression Terminator)+
Expression          = Assignment | Operation
Assignment          = Identifier AssignmentOperator Operation
Operation           = Term (Operator Term)?
Term                = Number | Identifier
Number              = /\d+/
Identifier          = /[_a-zA-Z][_a-zA-Z0-9]*/
Operator            = /[+\-*\/]/
AssignementOperator = "="
Terminator          = ";"

We have 3 new rules that will translate to their respective nodes: ExpressionList, Assignment and Identifier.

Also, I don't know if you noticed, but there is a relatively big change in there too, which is about the Expression rule. An expression can now be an Assignment, or an Operation. So before we implement any new node types, we must make some modifications to the ExpressionNode class. First of all, Expression will not map to its own node type anymore. Instead we'll have the class be split in 2: AssignmentNode and OperationNode. Go ahead and rename the ExpressionNode class to OperationNode.

I'll have PHPStorm do it for me, so it will replace any old reference with the new one. If you don't use an IDE that does it for you, don't forget to run through the code to replace every instance of ExpressionNode to OperationNode.

Now change the content of the file so it matches with our Operation grammar rule:

// src/Fun/Parsing/Nodes/OperationNode.php

<?php namespace Fun\Parsing\Nodes;

use Fun\Interpreting\Visitors\Visitor;

class OperationNode extends Node
{
    private $left;
    private $operator;
    private $right;

    function __construct(Node $left, $operator = null, Node $right = null)
    {
        $this->left = $left;
        $this->operator = $operator;
        $this->right = $right;
    }

    /**
     * @return Node
     */
    public function getLeft()
    {
        return $this->left;
    }

    /**
     * @return mixed
     */
    public function getOperator()
    {
        return $this->operator;
    }

    /**
     * @return Node
     */
    public function getRight()
    {
        return $this->right;
    }

    public function accept(Visitor $visitor)
    {
        return $visitor->visitOperationNode($this);
    }
}
?>

We also renamed the visitExpressionNode method of the Visitor interface to visitOperationNode to keep everything in sync and avoid confusion.

With our new grammar, operation terms can now be of two types (Number and Idenfitier), so that's the reason why we generalized our $left and $right constructor parameters to allow any Node type. Later in the series, we will introduce an intermediate node type that will allow us to restrict which type of nodes that can be used as terms, but we'll leave it at that for the moment.

At this point, everything should still work. We did not introduce any breaking change yet.

Let's create the VariableNode class, that will be used to hold a variable reference inside an operation. You'll see it's a very simple class:

// src/Fun/Parsing/Nodes/VariableNode.php

<?php namespace Fun\Parsing\Nodes;

use Fun\Interpreting\Visitors\Visitor;

class VariableNode extends Node
{
    private $name;

    public function __construct($name)
    {
        $this->name = $name;
    }

    public function getName()
    {
        return $this->name;
    }

    public function accept(Visitor $visitor)
    {
        return $visitor->visitVariableNode($this);
    }
}
?>

The VariableNode class will be instantiated from Identifier tokens that are parsed from an operation. More on that when we reimplement the parsing phase.

Don't forget to add the visitVariableNode in the Visitor interface. Now, to be able to use variables, we first have to assign them! Introducing the VariableAssignmentNode:

// src/Fun/Parsing/Nodes/VariableAssignmentNode.php

<?php namespace Fun\Parsing\Nodes;

use Fun\Interpreting\Visitors\Visitor;

class VariableAssignmentNode extends Node
{
    private $variableName;
    private $operation;

    public function __construct($variableName, OperationNode $operation)
    {
        $this->variableName = $variableName;
        $this->operation = $operation;
    }

    public function getVariableName()
    {
        return $this->variableName;
    }

    /**
     * @return OperationNode
     */
    public function getOperation()
    {
        return $this->operation;
    }

    public function accept(Visitor $visitor)
    {
        return $visitor->visitVariableAssignmentNode($this);
    }
}
?>

This node represents a variable assignment. We only need the variable name, and the operation to be assigned to it. We have a single node type left to implement: the ExpressionListNode:

// src/Fun/Parsing/Nodes/ExpressionListNode.php

<?php namespace Fun\Parsing\Nodes;

use Fun\Interpreting\Visitors\Visitor;

class ExpressionListNode extends Node
{
    private $expressions;

    /**
     * @param ExpressionNode[] $expressions
     */
    public function __construct(array $expressions)
    {
        $this->expressions = $expressions;
    }

    /**
     * @return ExpressionNode[]
     */
    public function getExpressions()
    {
        return $this->expressions;
    }

    public function accept(Visitor $visitor)
    {
        return $visitor->visitExpressionListNode($this);
    }
}
?>

As its name implies, it's a node the contains a list of expressions. It will take the place of the ExpressionNode as the root node of the AST.

Reimplementing the Parser

Since the structure of our tree changed so much, we'll basically reimplement the Parser class from scratch. I'll post the whole source code here for you to read, but I won't explain that much what is going on. If you went through Part 2 without problems, you should be able to reason about this version quite well:

<?php namespace Fun\Parsing;

use Exception;
use Fun\Lexing\Token;
use Fun\Lexing\TokenType;
use Fun\Parsing\Nodes\ExpressionListNode;
use Fun\Parsing\Nodes\NumberNode;
use Fun\Parsing\Nodes\OperationNode;
use Fun\Parsing\Nodes\VariableAssignmentNode;
use Fun\Parsing\Nodes\VariableNode;

class Parser
{
    /**
     * @Token[]
     */
    private $tokens;

    public function parse(array $tokens)
    {
        $tokens = $this->filterTokens($tokens);
        $this->tokens = $tokens;

        return $this->parseExpressionListNode();
    }

    private function filterTokens(array $tokens)
    {
        $filteredTokens = array_filter($tokens, function (Token $t) {
            return $t->getType() !== TokenType::Whitespace;
        });

        // Return the array values only, because array_filter preserves the keys
        return array_values($filteredTokens);
    }

    // ExpressionList = (Expression Terminator)+
    private function parseExpressionListNode()
    {
        $expressions = [];

        while (!$this->isEmpty()) {
            $expr = $this->parseExpressionNode();
            $this->expectTokenType(TokenType::Terminator);

            $expressions[] = $expr;
        }

        return new ExpressionListNode($expressions);
    }

    // Expression = Assignment | Operation
    private function parseExpressionNode()
    {
        $nextToken = $this->lookAhead();

        // If the next token is an AssignementOperator, we know we are dealing with an assignement expression
        if ($nextToken->getType() === TokenType::AssignmentOperator) {
            return $this->parseVariableAssignmentNode();
        }

        // else we have an operation
        return $this->parseOperationNode();
    }

    // Assignment = Identifier AssignmentOperator Operation
    private function parseVariableAssignmentNode()
    {
        $identifier = $this->expectTokenType(TokenType::Identifier);
        $this->expectTokenType(TokenType::AssignmentOperator);

        $operation = $this->parseOperationNode();

        return new VariableAssignmentNode($identifier->getValue(), $operation);
    }

    // Operation = Term (Operator Term)?
    private function parseOperationNode()
    {
        $left = $this->parseTerm();

        if ($this->isEmpty()) {
            return new OperationNode($left);
        }

        $operatorToken = $this->expectTokenType(TokenType::Operator);
        $right = $this->parseTerm();

        return new OperationNode($left, $operatorToken->getValue(), $right);
    }

    // Term = Idenfitier | Number
    private function parseTerm()
    {
        $currentToken = $this->consumeToken();

        // Instantiate the right type of node based on the type of the token
        switch ($currentToken->getType()) {
            case TokenType::Identifier:
                return new VariableNode($currentToken->getValue());

            case TokenType::Number:
                return new NumberNode($currentToken->getValue());

            default:
                throw new Exception('Expected current token to be a Number or an Identifier.');
        }
    }

    /**
     * @param $tokenType
     * @return Token
     * @throws Exception
     */
    private function expectTokenType($tokenType)
    {
        $peek = $this->currentToken();

        if ($peek->getType() !== $tokenType) {
            throw new Exception('Unexpected token type.');
        }

        return $this->consumeToken();
    }

    /**
     * @return Token
     */
    private function consumeToken()
    {
        return array_shift($this->tokens);
    }

    /**
     * @return Token
     * @throws Exception
     */
    private function currentToken()
    {
        return $this->lookAhead(0);
    }

    /**
     * @param int $position
     * @return Token
     * @throws Exception
     */
    private function lookAhead($position = 1)
    {
        if ($position >= count($this->tokens)) {
            throw new Exception('Tried to look past end of token stream');
        }

        return $this->tokens[$position];
    }

    private function isEmpty()
    {
        return count($this->tokens) === 0;
    }
}
?>

Whew! That was a long one. Hopefully everything makes sense to you. The code should be pretty straightforward. Now the last thing we have to do, is actually implement the missing visit* functions on the Interpreter class to evaluate the nodes we just added.

Interpreting those variables!

The main question here is: How will we keep track of our variables' values? We'll stick with a very simple solution for the moment: a map with the variable name as the key, and its value as the value. We'll introduce a run function on our interpreter in order to initialize some stuff before actually running the code:

// src/Fun/Interpreting/Interpreter.php

//...

class Interpreter implements Visitor
{
    private $context;

    public function run(Node $rootNode)
    {
        // The context will hold the variables' values
        $this->context = [];

        return $rootNode->accept($this);
    }

    //...
}
?>

The $rootNode now being an instance of ExpressionListNode, let's implement the visit* function for this node type:

public function visitExpressionListNode(ExpressionListNode $node)
{
    $result = null;

    foreach ($node->getExpressions() as $expr) {
        $result = $expr->accept($this);
    }

    return $result;
}
?>

We simply run through the list of expressions, dispatching the interpreter to each one sequentially, overriding the result each time, which will make the last expression the result of the computation.

The next part is deceptively simple. We'll implement the visit* method for our two variable nodes:

public function visitVariableAssignmentNode(VariableAssignmentNode $node)
{
    $variableName = $node->getVariableName();
    $value = $node->getOperation()->accept($this);

    $this->context[$variableName] = $value;
}

public function visitVariableNode(VariableNode $node)
{
    $variableName = $node->getName();

    if (!array_key_exists($variableName, $this->context))
        throw new Exception('Undefined variable: ' . $variableName);

    return $this->context[$variableName];
}
?>

To assign a variable, we extract its name from the node, and dispatch the interpreter to the operation that we want assigned to it. This gives us back a result, which we store into our local $context, which holds all the variables. To use a variable in an operation, we simple check for its existence, and return its value from the context. It couldn't be simpler than that!

Final touches

The last thing we have to do now, is adapt our fun.php script to call our code in the good way. You only have to replace this line:

$result = $rootNode->accept($interpreter);
?>

with this one:

$result = $interpreter->run($rootNode);
?>

And we're done! Let's try out our new toy with an example:

$ cat op.fun
x = 25 * 8;
y = x * 2;
z = y / 4;
z * z;
$ php fun.php op.fun
int(10000)

Yay!

Wow, that was a good one. Very proud of what we accomplished today, I hope you had fun as much as I did. Until the next article, I will make some code clean up behind the scenes, add a couple of unit tests, and also create some custom exception classes for better error handling. Next post will be a write up about that. It's time to solidify the code base before going further.

Programming Language Implementation - Part 3 - The Interpreter

Before we get started, you might want to download the whole source code built from the previous articles, so you can follow along more easily.

Here we are! Now is the time where we take that AST we parsed in the previous article, and actually interpret the underlying expression :) This is my favorite part, so let's jump right in.

In our simple case, a dedicated interpreting class may not be required. But since we have higher goals than that, we'll implement a solution that will be more extensible for future iterations. Let's define a couple interfaces that will serve as the basis of our interpreting mechanism, and I'll explain everything afterwards:

// src/Fun/Interpreting/Visitors/Visitor.php

<?php namespace Fun\Interpreting\Visitors;

use Fun\Parsing\Nodes\ExpressionNode;
use Fun\Parsing\Nodes\NumberNode;

interface Visitor
{
    function visitExpressionNode(ExpressionNode $node);
    function visitNumberNode(NumberNode $node);
}
?>
// src/Fun/Interpreting/Visitors/Visitable.php

<?php namespace Fun\Interpreting\Visitors;

interface Visitable
{
    function accept(Visitor $visitor);
}
?>

We'll have each node class implement the Visitable interface, but not directly. Let's introduce an abstract Node class that will serve as a base class for our different nodes to inherit from:

// src/Fun/Parsing/Nodes/Node.php

<?php namespace Fun\Parsing\Nodes;

use Fun\Interpreting\Visitors\Visitable;

abstract class Node implements Visitable
{

}
?>

That's it. We leave the class empty for the moment. Even though we implement the Visitable interface, we don't implement it in the Node class, because the implementation will differ for every node. Now, if we make each node class derive from Node, PHP will force us to implement the accept(Visitor $visitor) function, but we will come back to it later.

Some of you may have noticed that the two intertaces we built constitute what we call the Visitor Design Pattern. This pattern is very useful in our case, as it allows us to decouple the interpretation logic from the data structure. Instead of scattering our logic between all of the node classes (we only have 2 for the moment, but we'll eventually have more than a dozen), we can have it centralized in a single place. Let me introduce the Interpreter class, which implements the Visitor interface:

// src/Fun/Interpreting/Interpreter.php

<?php namespace Fun\Interpreting;

use Fun\Interpreting\Visitors\Visitor;
use Fun\Parsing\Nodes\ExpressionNode;
use Fun\Parsing\Nodes\NumberNode;

class Interpreter implements Visitor
{
    public function visitExpressionNode(ExpressionNode $node)
    {
        // evaluate and return the result of the ExpressionNode
    }

    public function visitNumberNode(NumberNode $node)
    {
        // evaluate and return the result of the NumberNode
    }
}
?>

The Visitor interface requires you to implement a visit* method for every type of node that we support. The thing is, we won't call these visit* methods directly. We will rely on each node to dispatch itself to the method that corresponds to its type, hence the accept method in the Visitable interface. Let's derive NumberNode and ExpressionNode from the base Node class we defined earlier, in order to satisfy this interface:

// src/Fun/Parsing/Nodes/ExpressionNode.php

//...

use Fun\Interpreting\Visitors\Visitor;

class ExpressionNode extends Node
{
    //...

    public function accept(Visitor $visitor)
    {
        return $visitor->visitExpressionNode($this);
    }
}
?>
// src/Fun/Parsing/Nodes/NumberNode.php

//...

use Fun\Interpreting\Visitors\Visitor;

class NumberNode extends Node
{
    //...

    public function accept(Visitor $visitor)
    {
        return $visitor->visitNumberNode($this);
    }
}
?>

If you're not familiar with this pattern, you might think that this doesn't make a lot of sense, but I encourage you to read forward, and everything will be clearer once we're done. Let's implement those visit* methods on the Interpreter class. We will start with visitNumberNode:

public function visitNumberNode(NumberNode $node)
{
    return intval($node->getValue());
}
?>

Pretty simple, isn't it? What is the result of interpreting a NumberNode? Simply an integer representation of its underlying value. Nothing more. Now the most interesting part, visitExpressionNode:

public function visitExpressionNode(ExpressionNode $node)
{
    $leftValue = $node->getLeft()->accept($this);
    $operator = $node->getOperator();

    // The operator and the right side are optional, remember?
    if (!$operator)
        return $leftValue;

    $rightValue = $node->getRight()->accept($this);

    // Perform the right operation based on the operator
    switch ($operator) {
        case '+':
            return $leftValue + $rightValue;

        case '-':
            return $leftValue - $rightValue;

        case '*':
            return $leftValue * $rightValue;

        case '/':
            if ($rightValue == 0)
                throw new Exception('Cannot divide by zero');

            return $leftValue / $rightValue;

        default:
            throw new Exception('Unsupported operator: ' . $operator);
    }
}
?>

Okay, a lot is going on here. Let's concentrate on the first line of the function:

$leftValue = $node->getLeft()->accept($this);
?>

At first, I thought this was really confusing. What you are seeing here, is the basis of the Visitor pattern. I'll run you through a step by step example of the execution path that this expression takes through the code. Let's say we have the following tree: ExpressionNode(NumberNode(8)). This representation tells us we have an ExpressionNode, with a single NumberNode in the left position that contains the value 8. We kickoff the interpretation:

$interpreter = new Interpreter();
$expressionNode->accept($interpreter);
?>

$expressionNode being an ExpressionNode, the PHP runtime dispatches this method call to the accept method on the ExpressionNode class:

public function accept(Visitor $visitor)
{
    return $visitor->visitExpressionNode($this);
}
?>

Here, the $visitor we receive as a parameter is actually our $interpreter. We know it implements the Visitor interface, and we know that the Visitor can deal with ExpressionNodes by calling visitExpressionNode on it, and passing $this (the current node), as an argument. Doing that, we are now inside the Interpreter class, in the visitExpressionNode method:

public function visitExpressionNode(ExpressionNode $node)
{
    $leftValue = $node->getLeft()->accept($this);

    //...
}
?>

We know we have a NumberNode on the left side of the expression, so what happens when we call accept on it? It executes the accept method on the NumberNode class, which in turn calls back the visitNumberNode on the interpreter:

public function visitNumberNode(NumberNode $node)
{
    return intval($node->getValue());
}
?>

This simply returns the value of the NumberNode as an int. This brings us back to here, with the actual value of the left node of the expression:

public function visitExpressionNode(ExpressionNode $node)
{
    $leftValue = $node->getLeft()->accept($this);
    // We now have a value for $leftValue!

    $operator = $node->getOperator();

    if (!$operator)
        return $leftValue;

    //...
}
?>

Now we get the operator. In our case we don't have any, so we just return the $leftValue variable, which contains 8, the value of the left NumberNode we defined earlier.

Let's take a look at the rest of the visitExpressionNode function:

$rightValue = $node->getRight()->accept($this);

switch ($operator) {
    case '+':
        return $leftValue + $rightValue;
    case '-':
        return $leftValue - $rightValue;
    case '*':
        return $leftValue * $rightValue;
    case '/':
        if ($rightValue == 0)
            throw new Exception('Cannot divide by zero');

        return $leftValue / $rightValue;

    default:
        throw new Exception('Unsupported operator: ' . $operator);
}
?>

We repeat the same process for the right node of the expression, and then the switch statement makes sure we perform the correct operation based on the operator. Pretty straightforward.

Wireing it up

Now is the time to plug our interpreter into the fun.php script. Scroll to the end of the file, and change the code to this:

//...

$tokens = $lexer->tokenize($fileContent);

$parser = new Parser();
$rootNode = $parser->parse($tokens);

$interpreter = new Interpreter();
$result = $rootNode->accept($interpreter);

var_dump($result);
?>

We simply instantiate our Interpreter, and pass it to the accept method of the $rootNode. That effectively kicks off the interpreting process as we saw earlier. If you populate op.fun with the expression 8 + 5, and you run php fun.php op.fun, you should see 13 printed to the console:

$ cat op.fun
8 + 5
$ php fun.php op.fun
int(13)

Yay! That wraps it up for today, don't hesitate to read the article multiple times to make sure you perfectly grasp the code we wrote, because the rest of the series will build up from here. When you're ready, can can move on to Part 4, where we will implement... variables! See you ;)

Programming Language Implementation - Part 2 - The Parser

In the previous article, we got started with our journey of being able to interpret very simple mathematical expressions from a source file. We wrote the Lexer, a class that is used to translate the input source into a token stream, a representation that will be easier for the parser to work with. The parser's job is to translate the tokens into an AST (Abstract Syntax Tree), which will then be interpreted by the Interpreter in the next article.

Before we can get started with the parser, we need to know exactly what we want to parse. As I said earlier in the series, in this first iteration we are concentrating on very simple mathematical expressions. We will only support single numbers and binary operations for the moment. Here are some examples that we will be able to parse at the end of the article:

3
425
1 + 1
24 - 8
10 * 100
35 / 7

The result of the parsing phase being a tree, we need to define the nodes that will compose the tree. In this case, it's going to be fairly simple. We have a top-level node: the Expression node, which is composed of a single Number node, or two Number nodes and an operator. Here is how we represent this in code:

// src/Fun/Parsing/Nodes/ExpressionNode.php

<?php namespace Fun\Parsing\Nodes;

class ExpressionNode
{
    private $left;
    private $operator;
    private $right;

    function __construct(NumberNode $left, $operator = null, NumberNode $right = null)
    {
        $this->left = $left;

        // $operator and $right are optional in case we have
        // an expression consisting of a single NumberNode
        $this->operator = $operator;
        $this->right = $right;
    }

    /**
     * @return NumberNode
     */
    public function getLeft()
    {
        return $this->left;
    }

    /**
     * @return mixed
     */
    public function getOperator()
    {
        return $this->operator;
    }

    /**
     * @return NumberNode
     */
    public function getRight()
    {
        return $this->right;
    }
}
?>

This class is pretty straightforward. The NumberNode class is even simpler:

// src/Fun/Parsing/Nodes/NumberNode.php

<?php namespace Fun\Parsing\Nodes;

class NumberNode
{
    private $value;

    function __construct($value)
    {
        $this->value = $value;
    }

    /**
     * @return mixed
     */
    public function getValue()
    {
        return $this->value;
    }
}
?>

As you can see, both classes only contain data that get passed to their respective constructors, and expose it through getters. Now we'll write the Parser class, which will build those node objects from the token stream. Here is a first test that will make sure we can parse single number expressions:

// spec/Fun/ParserTest.php

<?php

use Fun\Lexing\Token;
use Fun\Lexing\TokenType;
use Fun\Parsing\Parser;

class ParserTest extends PHPUnit_Framework_TestCase
{
    public function testCanParseSingleNumberExpression()
    {
        $tokens = [
            new Token('3', TokenType::Number)
        ];

        $node = $this->parse($tokens);

        $this->assertInstanceOf('Fun\Parsing\Nodes\ExpressionNode', $node);
        $this->assertNumberNode($node->getLeft(), 3);
    }

    private function parse($tokens)
    {
        $parser = new Parser();
        return $parser->parse($tokens);
    }

    private function assertNumberNode($node, $value)
    {
        $this->assertInstanceOf('Fun\Parsing\Nodes\NumberNode', $node);
        $this->assertEquals($value, $node->getValue());
    }
}
?>

The test simply builds a list made up of a single number token, which we feed into the parser's parse method. We expect to get back an object of type ExpressionNode, which is the top-level node of the tree, that contains a single NumberNode with a value of 3. Here is the code of the Parser class that satisfies the previous test:

// src/Fun/Parsing/Parser.php

<?php namespace Fun\Parsing;

use Fun\Lexing\Token;
use Fun\Lexing\TokenType;
use Fun\Parsing\Nodes\ExpressionNode;
use Fun\Parsing\Nodes\NumberNode;

class Parser
{
    private $tokens;

    public function parse(array $tokens)
    {
        $this->tokens = $tokens;

        return $this->parseExpressionNode();
    }

    private function parseExpressionNode()
    {
        $left = $this->parseNumberNode();

        return new ExpressionNode($left);
    }

    private function parseNumberNode()
    {
        $numberToken = $this->expectTokenType(TokenType::Number);

        return new NumberNode($numberToken->getValue());
    }

    /**
     * @param $tokenType
     * @return Token
     * @throws Exception
     */
    private function expectTokenType($tokenType)
    {
        $peek = $this->tokens[0];

        if ($peek->getType() !== $tokenType)
            throw new UnexpectedTokenException($tokenType, $peek->getType());

        return array_shift($this->tokens);
    }
?>

The only public method on the Parser class is the parse method, which expects an array of tokens as a parameter. We set the $tokens array to a private member of the same name so it is available to other methods. We then start the parsing process by calling the parseExpressionNode method.

As we defined in our test, we only want to parse expressions that consist of a single number. In the parseNumberNode method, we expect the first token of the list to be of type Number, and if it is, remove it from the head of the token stream and return it. However, it's not a token we want to return, but a NumberNode, so we instantiate it with the value of the returned token. If you run the test, it passes. Onto binary expressions!

Add the following failing test to the ParserTest.php file:

public function testCanParseBinaryExpression()
{
    $tokens = [
        new Token('3', TokenType::Number),
        new Token('+', TokenType::Operator),
        new Token('5', TokenType::Number)
    ];

    $node = $this->parse($tokens);

    $this->assertInstanceOf('Fun\Parsing\Nodes\ExpressionNode', $node);

    $this->assertNumberNode($node->getLeft(), 3);
    $this->assertEquals('+', $node->getOperator());
    $this->assertNumberNode($node->getRight(), 5);
}

Nothing to explain here, still pretty straightforward. To make this test pass, simply change the parseExpressionNode of the Parser class to match with this code:

private function parseExpressionNode()
{
    $left = $this->parseNumberNode();

    if ($this->isEmpty())
        return new ExpressionNode($left);

    $operatorToken = $this->expectTokenType(TokenType::Operator);
    $right = $this->parseNumberNode();

    return new ExpressionNode($left, $operatorToken->getValue(), $right);
}

And add the isEmpty method right under expectTokenType:

private function isEmpty()
{
    return count($this->tokens) === 0;
}

The new test should now pass, and the previous one as well. Here a the steps we take now:

  1. We start off just as before: we parse a NumberNode, because our ExpressionNode needs at least one.
  2. Then, if the token stream is empty, we return the expression with a single number just like we did before.
  3. However, if we still got tokens left, we expect the next one to be an operator
  4. We try to parse another NumberNode that we store in the $right variable.
  5. Finally we return an instance of ExpressionNode, but this time we also provide the operator and the right-hand side of the expression.

Putting it all together

Up until now, we only run our code through the tests we write. It's not completely bad, but it's not ideal. We want to be able to run our code from the command line, and supply it different inputs and see what comes out. We'll create a script to read the content of a specified file, pass it to the lexer, then to the parser and we'll dump the tree nodes that we get back. Create the following file:

// fun.php

<?php

use Fun\Lexing\Lexer;
use Fun\Lexing\TokenDefinition;
use Fun\Lexing\TokenType;
use Fun\Parsing\Parser;

include 'vendor/autoload.php';

$fileName = $argv[1];
$fileContent = file_get_contents($fileName);

$lexer = new Lexer();
$lexer->add(new TokenDefinition('/\d+/', TokenType::Number));
$lexer->add(new TokenDefinition('/[+\-*\/]/', TokenType::Operator));

$tokens = $lexer->tokenize($fileContent);

$parser = new Parser();
$tree = $parser->parse($tokens);

var_dump($tree);
?>

If you create a file operation.fun containing 25*8, and run php fun.php operation.fun from the command-line, you will see the ExpressionNode being dumped to the console:

class Fun\Parsing\Nodes\ExpressionNode#11 (3) {
  private $left =>
  class Fun\Parsing\Nodes\NumberNode#9 (1) {
    private $value =>
    string(2) "25"
  }
  private $operator =>
  string(1) "*"
  private $right =>
  class Fun\Parsing\Nodes\NumberNode#10 (1) {
    private $value =>
    string(1) "8"
  }
}

What about whitespaces !?

You may have noticed that introducing any whitespace in the source file makes our script fail with an UnknownTokenException in the tokenize function call of the Lexer class. This happens because we did not define a TokenDefinition to match whitespace characters.

An option to get rid of those characters is to pre-process the source file before feeding it to the lexer and remove all whitespace from it. But it's not quite the right way to go about it. We don't want to ignore the whitespaces during the lexing phase either, even though Fun is going to be whitespace independant. We want to have whitespaces identified as a token type, only to be dumped before the parsing phase in case we change our mind later on. So I guess the best way is to have our lexer tokenize whitespaces, and filter them out in the parsing phase.

Let's add a token definition that will match all whitespace characters. First, create a new token type in the TokenType class:

// src/Lexing/TokenType.php

<?php namespace Fun\Lexing;

final class TokenType
{
    const Number = 1;
    const Operator = 2;
    const Whitespace = 3; // <---- Add this
}
?>
// fun.php

//...

$lexer = new Lexer();
$lexer->add(new TokenDefinition('/\d+/', TokenType::Number));
$lexer->add(new TokenDefinition('/[+\-*\/]/', TokenType::Operator));
// Add the whitespace TokenDefinition to the Lexer
$lexer->add(new TokenDefinition('/\s+/', TokenType::Whitespace));

// ...
?>

Now if you include whitespaces in the source file, the lexing phase will go without any issue, but it is going to fail in the parser. We simply have to filter out the whitespace tokens before kicking off the parsing phase. We'll replace the parse  method with the following implementation:

// src/Fun/Parsing/Parser.php

//...

public function parse(array $tokens)
{
    $tokens = $this->filterTokens($tokens);
    $this->tokens = $tokens;

    return $this->parseExpressionNode();
}

private function filterTokens(array $tokens)
{
    return array_filter($tokens, function(Token $t) {
        return $t->getType() !== TokenType::Whitespace;
    });
}

//...
?>

The filterTokens function simply filters out the tokens of the Whitespace type. The parsing process then goes on as usual, having to deal only with Numbers and Operators. It's that simple :)

Want to know how to take the AST and compute the result? You should definitely check out Part 3 - The Interpreter!

Programming Language Implementation - Part 1 - The Lexer

Let's get started for real this time. In the previous post, we setup our development environment. If you did not do that, go read that article first and come back when you're done.

Today we'll be building the first part of our programming language, the lexer.

The job of the lexer is to tokenize the source file. It transforms the source into a stream of tokens, making it easier for the parser to make assumptions about the formatting of the code. The less the parser has to deal with the lexical part, the easier we can modify how the language is lexed and parsed. It's mostly about separating concerns so that they do not interact with each other too much.

Tokens!

A token is simply a string of characters, with a type attached to it. It represents a symbol of the language, and it can be anything you want it to be. It can also be as specific or as general as you like. Let's start with an example. If we take the expression 3+5 and we feed it to the lexer, we get back the following list of tokens :

['3', 'number'] , ['+', 'operator'] , ['5', 'number']

If we take the first token, ['3', 'number'] means that the lexer encountered a 3 in the source file, and it decided that it was of type number. Same thing goes for the + sign that was identified as a token of the operator type. As I said, you can be more specific if you want, by identifying the + sign as a token of type plus_operator, but we won't go into that for the moment. Every level of granularity just makes it easier for the parser in the end.

How does the lexer decide which type is which token?

The lexer makes its decisions based on the token definitions we provide it. Token definitions are most of the time regular expressions that match a particular symbol in the language. The types of tokens could be something like:

  • Numbers (1, -1, 100, 475.12)
  • Operators (+, -, *, /)
  • Identifiers (a, x, count, firstName)
  • Symbols ((, ), ,, ;)
  • Assignment operators (=, +=, -=, *=, /=)
  • And so on

The fun part is that you get to set the rules. Regular expressions for the previous types could be:

  • /\d+/ to match positive integers
  • /[+\-*/]/ to match the different mathematical operators
  • /[a-zA-Z]+/ to match identifiers (identifiers can be variable names, function names, class names, etc)
  • And so on

Let's create the TokenDefinition class, which will hold the regular expression pattern, along with the type of token it's intended to match:

// src/Fun/Lexing/TokenDefinition.php

<?php namespace Fun\Lexing;

class TokenDefinition
{
    private $pattern;
    private $type;

    public function __construct($pattern, $tokenType)
    {
        $this->pattern = $pattern;
        $this->type = $type;
    }
}

Now that we have a class to hold that data, we need a method in order to figure out if this particular token definition matches the next token in the source file. We'll add the match method to the class, right under the constructor:

public function match($input)
{

}

Hey, wait! Didn't you say that we were going to use TDD throughout the series?

Oh, thanks for noticing ;) Indeed, I said we were going to use TDD, so let's first start by writing a test for the match method, describing the behavior we would like it to have.

// spec/Fun/Lexing/TokenDefinitionTest.php

<?php

use Fun\Lexing\TokenDefinition;
use Fun\Lexing\TokenType;

class TokenDefinitionTest extends PHPUnit_Framework_TestCase
{
    public function testMatchReturnsTokenObjectIfMatchedInput()
    {
        $pattern = '/\d+/';
        $tokenDefinition = new TokenDefinition($pattern, TokenType::Number);

        $token = $tokenDefinition->match('123');

        $this->assertInstanceOf('Fun\Lexing\Token', $token);

        $this->assertEquals('123', $token->getValue());
        $this->assertEquals(TokenType::Number, $token->getType());
    }
}

In this test, we make use of a TokenType class, which will serve as an enumeration of the different token types we will need. For this article, we only need 2:

// src/Fun/Lexing/TokenType.php

<?php namespace Fun\Lexing;

final class TokenType
{
    const Number = 1;
    const Operator = 2;
}

We also make use of the Token class, which holds the value of a matched token, and its type:

// src/Fun/Lexing/Token.php

<?php namespace Fun\Lexing;

class Token
{
    private $value;
    private $type;

    public function __construct($value, $type)
    {
        $this->value = $value;
        $this->type = $type;
    }

    public function getValue()
    {
        return $this->value;
    }

    public function getType()
    {
        return $this->type;
    }
}

If you run the test right now, you will see it fail with the following message:

1) TokenDefinitionTest::testMatchReturnsTokenObjectIfMatchedInput
Failed asserting that null is an instance of class "Fun\Lexing\Token".

Which is perfectly normal, as we did not implement the match method yet. It's a pretty simple function:

public function match($input)
{
    // Match the input with the regex pattern, storing any match found into the $matches variable,
    // along with the index of the string at which it was matched.
    $result = preg_match($this->pattern, $input, $matches, PREG_OFFSET_CAPTURE);

    // preg_match returns false if an error occured
    if ($result === false)
        throw new \Exception(preg_last_error());

    // 0 means no match was found
    if ($result === 0)
        return null;

    return $this->getTokenFromMatch($matches[0]);
}

private function getTokenFromMatch($match)
{
    $value = $match[0];
    $offset = $match[1];

    // If we don't match at the beginning of the string, it fails.
    if ($offset !== 0)
        return null;

    return new Token($value, $this->tokenType);
}

The test now passes. What's going on here is pretty simple. First we try to match the input with the regex. If an error occured we throw an exception. If no match were found, or if the match is not exactly at the start of the string, we return null. If everything is fine, we create a new token with the value that we matched, and the token type associated with that token definition.

It's important that you take the time to fully understand every code sample before continuing, as every step builds on top of the previous one.

Let's add a couple more unit tests to make sure that everything is what we intend it to be:

public function testNoMatchReturnsNull()
{
    $pattern = '/\d+/';
    $tokenDefinition = new TokenDefinition($pattern, TokenType::Number);

    $this->assertNull($tokenDefinition->match('abc'));
}

public function testMatchReturnsNullIfOffsetNotZero()
{
    $pattern = '/\d+/';
    $tokenDefinition = new TokenDefinition($pattern, TokenType::Number);

    $this->assertNull($tokenDefinition->match('abc123'));
}

And all those tests pass. However, I see a lot of duplication, every $tokenDefinition variable is the same in each test. Let's refactor that out. We'll introduce a setUp method for the suite, and use a member variable inside the individual tests. Add those lines at the very beginning of the TokenDefinitionTest class:

private $tokenDefinition;

public function setUp()
{
    $this->tokenDefinition = new TokenDefinition('/\d+/', TokenType::Number);
}

You can now remove the first two lines of every test, and replace the usage of the local $tokenDefinition variable with $this->tokenDefinition. If you did everything right, the tests should still pass.

The Lexer

The lexer is basically just a holder for our token definitions, that will loop through the source file (as a string) and match our tokens along the way. If the source is valid according to our lexical rules, the whole string will be converted into individual tokens.

Let's write the tests for our lexer:

// spec/Fun/Lexing/LexerTest.php

<?php

use Fun\Lexing\Lexer;
use Fun\Lexing\TokenDefinition;
use Fun\Lexing\TokenType;
use Fun\Lexing\Token;
use Fun\Lexing\UnknownTokenException;

class LexerTest extends PHPUnit_Framework_TestCase
{
    private $lexer;

    public function setUp()
    {
        $lexer = new Lexer();

        $lexer->add(new TokenDefinition('/\d+/', TokenType::Number));
        $lexer->add(new TokenDefinition('/\+/', TokenType::Operator));

        $this->lexer = $lexer;
    }

    public function testCanTokenizeNumber()
    {
        $tokens = $this->lexer->tokenize('325');

        $this->assertTokenEquals('325', TokenType::Number, $tokens[0]);
    }

    public function testCanTokenizeOperator()
    {
        $tokens = $this->lexer->tokenize('+');

        $t = $tokens[0];
        $this->assertTokenEquals('+', TokenType::Operator, $t);
    }

    public function testCanTokenizeNumbersAndOperators()
    {
        $tokens = $this->lexer->tokenize('3+5');

        $this->assertCount(3, $tokens);

        $this->assertTokenEquals('3', TokenType::Number, $tokens[0]);
        $this->assertTokenEquals('+', TokenType::Operator, $tokens[1]);
        $this->assertTokenEquals('5', TokenType::Number, $tokens[2]);
    }

    public function testExceptionIsThrownOnUnknownToken()
    {
        $this->setExpectedException(UnknownTokenException::class);

        $this->lexer->tokenize('abc');
    }

    private function assertTokenEquals($value, $type, Token $token)
    {
        $this->assertEquals($value, $token->getValue());
        $this->assertEquals($type, $token->getType());
    }
}

This is a lot less granular that I would like it to be, but that's the best I can do within a blog post. It's hard to demonstrate state-of-the-art TDD practice in writing. But we'll deal with it. The tests are pretty much self-explanatory, I don't feel like I really need to elaborate on what we are trying to do here. If you run this bunch of tests now however, they are going to fail miserably. Let's jump right into the implementation of the Lexer class. Here is an implementation that would satisfy all of the tests we just wrote:

// src/Fun/Lexing/Lexer.php

<?php namespace Fun\Lexing;

class Lexer
{
    private $tokenDefinitions = [];

    public function add(TokenDefinition $tokenDefinition)
    {
        $this->tokenDefinitions[] = $tokenDefinition;
    }

    public function tokenize($input)
    {
        // The list of tokens we'll eventually return
        $tokens = [];

        // The currentIndex indicates where we are inside the input string
        $currentIndex = 0;

        while ($currentIndex < strlen($input)) {

            // We try to match only what is after the currentIndex,
            // as the content before is already converted to tokens
            $token = $this->findMatchingToken(substr($input, $currentIndex));

            // If no tokens were matched, it means that the string has invalid tokens
            // for which we did not define a token definition
            if (!$token)
                throw new UnknownTokenException($currentIndex);

            // Add the matched token to our list of token
            $tokens[] = $token;

            // Increment the string index by the lenght of the matched token,
            // so we can now process the rest of the string.
            $currentIndex += strlen($token->getValue());
        }

        return $tokens;
    }

    private function findMatchingToken($input)
    {
        // Check with all tokenDefinitions
        foreach ($this->tokenDefinitions as $tokenDefinition) {
            $token = $tokenDefinition->match($input);

            // Return the first token that was matched.
            if ($token)
                return $token;
        }

        // Return null if no tokens were matched.
        return null;
    }
}

Please note that this is a minimal and very naive implementation that is suitable for our current needs only. It will most likely evolve into something more robust in the future. Some questions we did not answer here would be:

  • What happens if many token definitions match at the same time?
  • What about whitespaces and new lines?
  • This lexer is context-free, how do I deal with tokens that would be matched conditionally based on their surroundings?

For the moment, don't worry too much about those things, they will be covered when deemed necessary later in the series. Also, using a separate lexing phase for simple expressions like 3+5 is overkill. You could do the lexing and parsing (and to some extent the interpreting too) as a single phase. However, since our target is a more complete programming language, we will thank ourselves later for implementing a lexer from the get go.

Feel free to go ahead and read the next article: Part 2 - The Parser.