Marc-Olivier Fiset Aspiring at Life and Software Development

Programming Language Implementation - Part 1 - The Lexer

Let's get started for real this time. In the previous post, we setup our development environment. If you did not do that, go read that article first and come back when you're done.

Today we'll be building the first part of our programming language, the lexer.

The job of the lexer is to tokenize the source file. It transforms the source into a stream of tokens, making it easier for the parser to make assumptions about the formatting of the code. The less the parser has to deal with the lexical part, the easier we can modify how the language is lexed and parsed. It's mostly about separating concerns so that they do not interact with each other too much.

Tokens!

A token is simply a string of characters, with a type attached to it. It represents a symbol of the language, and it can be anything you want it to be. It can also be as specific or as general as you like. Let's start with an example. If we take the expression 3+5 and we feed it to the lexer, we get back the following list of tokens :

['3', 'number'] , ['+', 'operator'] , ['5', 'number']

If we take the first token, ['3', 'number'] means that the lexer encountered a 3 in the source file, and it decided that it was of type number. Same thing goes for the + sign that was identified as a token of the operator type. As I said, you can be more specific if you want, by identifying the + sign as a token of type plus_operator, but we won't go into that for the moment. Every level of granularity just makes it easier for the parser in the end.

How does the lexer decide which type is which token?

The lexer makes its decisions based on the token definitions we provide it. Token definitions are most of the time regular expressions that match a particular symbol in the language. The types of tokens could be something like:

  • Numbers (1, -1, 100, 475.12)
  • Operators (+, -, *, /)
  • Identifiers (a, x, count, firstName)
  • Symbols ((, ), ,, ;)
  • Assignment operators (=, +=, -=, *=, /=)
  • And so on

The fun part is that you get to set the rules. Regular expressions for the previous types could be:

  • /\d+/ to match positive integers
  • /[+\-*/]/ to match the different mathematical operators
  • /[a-zA-Z]+/ to match identifiers (identifiers can be variable names, function names, class names, etc)
  • And so on

Let's create the TokenDefinition class, which will hold the regular expression pattern, along with the type of token it's intended to match:

// src/Fun/Lexing/TokenDefinition.php

<?php namespace Fun\Lexing;

class TokenDefinition
{
    private $pattern;
    private $type;

    public function __construct($pattern, $tokenType)
    {
        $this->pattern = $pattern;
        $this->type = $type;
    }
}

Now that we have a class to hold that data, we need a method in order to figure out if this particular token definition matches the next token in the source file. We'll add the match method to the class, right under the constructor:

public function match($input)
{

}

Hey, wait! Didn't you say that we were going to use TDD throughout the series?

Oh, thanks for noticing ;) Indeed, I said we were going to use TDD, so let's first start by writing a test for the match method, describing the behavior we would like it to have.

// spec/Fun/Lexing/TokenDefinitionTest.php

<?php

use Fun\Lexing\TokenDefinition;
use Fun\Lexing\TokenType;

class TokenDefinitionTest extends PHPUnit_Framework_TestCase
{
    public function testMatchReturnsTokenObjectIfMatchedInput()
    {
        $pattern = '/\d+/';
        $tokenDefinition = new TokenDefinition($pattern, TokenType::Number);

        $token = $tokenDefinition->match('123');

        $this->assertInstanceOf('Fun\Lexing\Token', $token);

        $this->assertEquals('123', $token->getValue());
        $this->assertEquals(TokenType::Number, $token->getType());
    }
}

In this test, we make use of a TokenType class, which will serve as an enumeration of the different token types we will need. For this article, we only need 2:

// src/Fun/Lexing/TokenType.php

<?php namespace Fun\Lexing;

final class TokenType
{
    const Number = 1;
    const Operator = 2;
}

We also make use of the Token class, which holds the value of a matched token, and its type:

// src/Fun/Lexing/Token.php

<?php namespace Fun\Lexing;

class Token
{
    private $value;
    private $type;

    public function __construct($value, $type)
    {
        $this->value = $value;
        $this->type = $type;
    }

    public function getValue()
    {
        return $this->value;
    }

    public function getType()
    {
        return $this->type;
    }
}

If you run the test right now, you will see it fail with the following message:

1) TokenDefinitionTest::testMatchReturnsTokenObjectIfMatchedInput
Failed asserting that null is an instance of class "Fun\Lexing\Token".

Which is perfectly normal, as we did not implement the match method yet. It's a pretty simple function:

public function match($input)
{
    // Match the input with the regex pattern, storing any match found into the $matches variable,
    // along with the index of the string at which it was matched.
    $result = preg_match($this->pattern, $input, $matches, PREG_OFFSET_CAPTURE);

    // preg_match returns false if an error occured
    if ($result === false)
        throw new \Exception(preg_last_error());

    // 0 means no match was found
    if ($result === 0)
        return null;

    return $this->getTokenFromMatch($matches[0]);
}

private function getTokenFromMatch($match)
{
    $value = $match[0];
    $offset = $match[1];

    // If we don't match at the beginning of the string, it fails.
    if ($offset !== 0)
        return null;

    return new Token($value, $this->tokenType);
}

The test now passes. What's going on here is pretty simple. First we try to match the input with the regex. If an error occured we throw an exception. If no match were found, or if the match is not exactly at the start of the string, we return null. If everything is fine, we create a new token with the value that we matched, and the token type associated with that token definition.

It's important that you take the time to fully understand every code sample before continuing, as every step builds on top of the previous one.

Let's add a couple more unit tests to make sure that everything is what we intend it to be:

public function testNoMatchReturnsNull()
{
    $pattern = '/\d+/';
    $tokenDefinition = new TokenDefinition($pattern, TokenType::Number);

    $this->assertNull($tokenDefinition->match('abc'));
}

public function testMatchReturnsNullIfOffsetNotZero()
{
    $pattern = '/\d+/';
    $tokenDefinition = new TokenDefinition($pattern, TokenType::Number);

    $this->assertNull($tokenDefinition->match('abc123'));
}

And all those tests pass. However, I see a lot of duplication, every $tokenDefinition variable is the same in each test. Let's refactor that out. We'll introduce a setUp method for the suite, and use a member variable inside the individual tests. Add those lines at the very beginning of the TokenDefinitionTest class:

private $tokenDefinition;

public function setUp()
{
    $this->tokenDefinition = new TokenDefinition('/\d+/', TokenType::Number);
}

You can now remove the first two lines of every test, and replace the usage of the local $tokenDefinition variable with $this->tokenDefinition. If you did everything right, the tests should still pass.

The Lexer

The lexer is basically just a holder for our token definitions, that will loop through the source file (as a string) and match our tokens along the way. If the source is valid according to our lexical rules, the whole string will be converted into individual tokens.

Let's write the tests for our lexer:

// spec/Fun/Lexing/LexerTest.php

<?php

use Fun\Lexing\Lexer;
use Fun\Lexing\TokenDefinition;
use Fun\Lexing\TokenType;
use Fun\Lexing\Token;
use Fun\Lexing\UnknownTokenException;

class LexerTest extends PHPUnit_Framework_TestCase
{
    private $lexer;

    public function setUp()
    {
        $lexer = new Lexer();

        $lexer->add(new TokenDefinition('/\d+/', TokenType::Number));
        $lexer->add(new TokenDefinition('/\+/', TokenType::Operator));

        $this->lexer = $lexer;
    }

    public function testCanTokenizeNumber()
    {
        $tokens = $this->lexer->tokenize('325');

        $this->assertTokenEquals('325', TokenType::Number, $tokens[0]);
    }

    public function testCanTokenizeOperator()
    {
        $tokens = $this->lexer->tokenize('+');

        $t = $tokens[0];
        $this->assertTokenEquals('+', TokenType::Operator, $t);
    }

    public function testCanTokenizeNumbersAndOperators()
    {
        $tokens = $this->lexer->tokenize('3+5');

        $this->assertCount(3, $tokens);

        $this->assertTokenEquals('3', TokenType::Number, $tokens[0]);
        $this->assertTokenEquals('+', TokenType::Operator, $tokens[1]);
        $this->assertTokenEquals('5', TokenType::Number, $tokens[2]);
    }

    public function testExceptionIsThrownOnUnknownToken()
    {
        $this->setExpectedException(UnknownTokenException::class);

        $this->lexer->tokenize('abc');
    }

    private function assertTokenEquals($value, $type, Token $token)
    {
        $this->assertEquals($value, $token->getValue());
        $this->assertEquals($type, $token->getType());
    }
}

This is a lot less granular that I would like it to be, but that's the best I can do within a blog post. It's hard to demonstrate state-of-the-art TDD practice in writing. But we'll deal with it. The tests are pretty much self-explanatory, I don't feel like I really need to elaborate on what we are trying to do here. If you run this bunch of tests now however, they are going to fail miserably. Let's jump right into the implementation of the Lexer class. Here is an implementation that would satisfy all of the tests we just wrote:

// src/Fun/Lexing/Lexer.php

<?php namespace Fun\Lexing;

class Lexer
{
    private $tokenDefinitions = [];

    public function add(TokenDefinition $tokenDefinition)
    {
        $this->tokenDefinitions[] = $tokenDefinition;
    }

    public function tokenize($input)
    {
        // The list of tokens we'll eventually return
        $tokens = [];

        // The currentIndex indicates where we are inside the input string
        $currentIndex = 0;

        while ($currentIndex < strlen($input)) {

            // We try to match only what is after the currentIndex,
            // as the content before is already converted to tokens
            $token = $this->findMatchingToken(substr($input, $currentIndex));

            // If no tokens were matched, it means that the string has invalid tokens
            // for which we did not define a token definition
            if (!$token)
                throw new UnknownTokenException($currentIndex);

            // Add the matched token to our list of token
            $tokens[] = $token;

            // Increment the string index by the lenght of the matched token,
            // so we can now process the rest of the string.
            $currentIndex += strlen($token->getValue());
        }

        return $tokens;
    }

    private function findMatchingToken($input)
    {
        // Check with all tokenDefinitions
        foreach ($this->tokenDefinitions as $tokenDefinition) {
            $token = $tokenDefinition->match($input);

            // Return the first token that was matched.
            if ($token)
                return $token;
        }

        // Return null if no tokens were matched.
        return null;
    }
}

Please note that this is a minimal and very naive implementation that is suitable for our current needs only. It will most likely evolve into something more robust in the future. Some questions we did not answer here would be:

  • What happens if many token definitions match at the same time?
  • What about whitespaces and new lines?
  • This lexer is context-free, how do I deal with tokens that would be matched conditionally based on their surroundings?

For the moment, don't worry too much about those things, they will be covered when deemed necessary later in the series. Also, using a separate lexing phase for simple expressions like 3+5 is overkill. You could do the lexing and parsing (and to some extent the interpreting too) as a single phase. However, since our target is a more complete programming language, we will thank ourselves later for implementing a lexer from the get go.

This concludes this week's post. I'll sign off for the holiday period, and I'll be back with more articles, starting January 6th! Stay tuned.

Programming Language Implementation - Part 0 - Setup & Tools

This is part 0 of my ongoing series of tutorials on programming language implementation. I'm sorry, but this part is going to be particularly boring. We'll be setting up our environment so we can get started right away in the next part. I call it part 0 because we actually won't get any coding done today.

If you have not read my introductory post yet, I suggest you do so before continuing with this one.

We Live in a Great Time

If you go back in time for a couple of decades, you will find that developing a programming language (or any kind of software, really) was very far from a walk in the park. You very often had to deal with binary directly, and the only tools you had were cables, which you could plug into different outlets in a strategic manner. Your output was light bulbs that were either turned on or off. That was your program. If you wanted to run a different program, you had to unplug everything and start from scratch.

Thanks to some amazing people, this has changed for the best. We now have access to much better tools. We don't have to care about electronics, and voltages, and all those things that actually had nothing to do with programming itself. We can now really concentrate on the problem we have to solve, expressing a solution for it in a much higher level language. We are going to leverage one of those high level languages to implement our own.

The Tools

We will be using PHP to implement our programming language, for a couple of reasons. First, it is available for all major platforms, is easy to install, and second: we will benefit a lot from its object oriented capabilities. It's also a C-like language, so anybody with a little bit of exposure to languages like C, C++, Java, Javascript and C# will be able to perfectly understand every code sample presented here. We'll also be using TDD, with the PHPUnit framework. This will ensure that expectations about our code are met, and we will also be able to refactor without fear of introducing breaking changes.

As for the IDE, I'll be using PHPStorm, but it's not stricly required. Any text editor and a terminal will do. Pick whatever you're comfortable with for writing PHP. In case you are among the people who love to hate on PHP, you can follow along with your language of choice, but I recommend you stick to dynamic OO languages in order to stay as close to me as possible. I won't be answering any questions if you can't get my examples to work in another language.

Before continuing, it's important that you have PHP 5.5+ installed on your system, alongside Composer, a PHP dependency manager.

Setting up the Project

Open up a terminal, go to the folder where you usually store your code, and run the following commands:

mkdir fun-lang && cd fun-lang
mkdir src spec
mkdir src/Fun spec/Fun
touch composer.json

This will create the basic structure we will need: a source and a test folder, each with a Fun subfolder, for our namespace. Now if you list the content of the current directory using tree, here is what you should see.

$ tree
.
├── composer.json
├── spec
│   └── Fun
└── src
    └── Fun

Now, open up the whole fun-lang folder into your text editor or IDE and write the following lines in your composer.json file:

{
    "require-dev": {
        "phpunit/phpunit": "4.3.*"
    },

    "autoload": {
        "psr-4": {"Fun\\": "src/Fun/"}
    }
}

Go back to your terminal and run composer install. This will install PHPUnit and its dependencies, plus it will setup the autoloader. Next we'll create a sample test and source file, to make sure that everything works as it should. As a convention, I will always put the path to the file as the first line in the block of code, so you can locate it quickly. Go ahead and create these two files:

// src/Fun/Sample.php

<?php namespace Fun;

class Sample
{

}
// spec/Fun/SampleTest.php

<?php

use Fun\Sample;

class SampleTest extends PHPUnit_Framework_TestCase
{
    public function testSample()
    {
        $sample = new Sample();

        $this->assertInstanceOf(Sample::class, $sample);
    }
}

If you now go back to your terminal, and run vendor/bin/phpunit spec, you should have a passing test! If you don't, I suggest you go back and follow every step carefully.

$ vendor/bin/phpunit spec
PHPUnit 4.3.5 by Sebastian Bergmann.

.

Time: 61 ms, Memory: 1.75Mb

OK (1 test, 1 assertion)

Perfect! We're ready to take on the first actual part of the series about building our own programming language : Part 1 - The Lexer.

A Beginner's Guide to Programming Language Implementation

Programming languages look like complicated beasts. And they actually are! You are right to think that implementing a programming language is a difficult task. What you're not allowed to think, however, is that they are too hard. If you came accross this post, high is the chance that you are looking to learn more about how to build your own programming language, and that you are not sure how to tackle such a seemingly overwhelming project. Well, you are at the right place. Read on.

When I decided I wanted to create my own programming language, I had absolutely no idea how to go about it. I started researching on the subject by myself, reading articles, skimming over some books, and mostly by experimenting a lot. I've come to a point where I am confident enough that I can teach other people how to build simple programming languages of their own. I say simple because we won't dive into advanced stuff like native compilation, garbage collection or any optimisation techniques whatsoever. After all, those are decisions that you can defer until much later in the process, and they are far beyond the scope of this series. If you are looking for these kinds of thing, I'm afraid this is not the tutorial for you.

Will we be using flex/yacc/Bison/ANTLR/<insert the name of your favorite lexer/parser generator here>?

It's a fair question. However, considering this is a complete beginner's guide, I really wonder how you came to know these tools ;) If you don't know what I'm talking about, you can skip ahead. If this question resonates with you, keep reading.

While lexer and parser generators are great tools, they hide the details from you. And we are interested in those details. Everything we will do in this lesson will be done by hand. We won't rely on any external tools to generate stuff we can build ourselves fairly easily. If you are constrained by time, want to develop a production-grade language, and that you are ready to sacrifice some flexibility, then I suggest you take a look at them, but it's not the goal of this tutorial.

Introducing the Fun programming language

Programming languages don't come out of thin air. Someone has to design them, define the syntax, the type system (or lack thereof), the semantics and different behaviors, etc. After the design phase, guess what: someone has to program the language. Actually, you don't program the language itself, you program an interpreter (or compiler) for the language, following the specification you designed. In our case, we'll limit ourselves to an interpreter, as it is simpler to make.

I'm sorry I could not include you into the design process, but here is the spec for the language we'll be building together:

  • Our language will be called "Fun"
  • It will be interpreted instead of compiled
  • It will be dynamic (actually for the first pass we'll support numbers, strings and booleans only)
  • Numbers will all be floating point numbers
  • Strings will be delimited by single-quotes
  • Identifiers will be case sensitive, for enhanced consistency
  • Variables must be declared before use (using = as the assignment operator)
  • Functions will be defined with the keyword "fun" (hence the name of the language)
  • Functions must be called with the exact number of arguments that they expect
  • Functions will return the value of the last expression that is evaluated within their bodies
  • Functions will also support the return statement
  • Single-line functions can be defined using the simple = form
  • Scopes will be delimited by curly braces, which will be called blocks
  • Variables defined in a particular scope will be visible to that scope and their children (pretty much like every other language out there)
  • Statements will be terminated by semi-colons
  • Standard mathematical operators will be supported (+, -, *, /, %) and standard precedence will apply (also supporting parentheses)
  • More assignment operators will be supported (+=, -=, *=, /=, %=)
  • We will support if statements and while loops
  • Standard comparison operators will be supported (==, >, <, >=, <=)
  • We will support single-line comments, beginning with //
  • We will support the print built-in function, to output something to the console

Here is a sample of the Fun language, reflecting most bullet points from the list above:

print(name); // name is undefined, exception is thrown

x = 5;
X = 10; // X is a different variable than x.

x += 2;

fun isEven(n) = n % 2 == 0;

if (isEven(3))
    print('Even!');
else
    print('Odd!');

fun fib(n) {
    if (n <= 1) return 1;

    fib(n - 2) + fib(n - 1);
}

// n is not accessible from here

fun sayHello(name) {
    message = 'Hello, ' + name + '!';
    print(message);
}

myName = 'Marco';

sayHello(myName); // Prints 'Hello, Marco!'

If you already know a bit of programming (which you should if you intend to take on this tutorial), then nothing from the code sample above will look alien to you. The goal here is not to invent anything, but to learn about programming language implementation. You'll be free to come up with your own crazy ideas after that ;)

While it may look like a daunting task to even dare write an interpreter for this piece of code, it's actually not that difficult. Don't get me wrong, there are a lot of things that need to be done before we have a fully functionning interpreter, but it is very possible to do so with easy-to-understand concepts. We'll take one step at a time, and I will be explaining every detail in the simplest way possible. The goal here is that you perfectly understand everything that is going on so that when you finish this tutorial, you'll be able to come up with your own ideas and implement them by yourself.

I'm sold! When do we start?

If you made it this far into the post, just go ahead and follow through the next part: Programming Language Implementation - Part 0 : Tools and Setup.

Simulating Computed Properties in PHP

Coming from a .NET background, I definitely miss C# properties. Let's say you have a class that's got a FirstName and a LastName property, defined as follows :

public class Person
{
    public String FirstName { get; set; }
    public String LastName { get; set; }
}

Then you might want to have another property FullName, that concatenates the first and the last name together:

public String FullName
{
    get { return FirstName + " " + LastName; }
}

You would use it like so :

var person = new Person();

person.FirstName = "Marco";
person.LastName = "Fiset";

Console.WriteLine(person.FullName);
// => Marco Fiset

Pretty straigthforward. Now let's try to do the same thing with PHP :

class Person
{
    public $firstName;
    public $lastName;

    public function fullName()
    {
        return $this->firstName . ' ' . $this->lastName;
    }
}

$person = new Person();

$person->firstName = 'Marco';
$person->lastName = 'Fiset';

echo $person->fullName();
// => Marco Fiset

Oh God, this is awful. The thing is, in order to have a computed property in PHP, I had to use a function. I might be the only one, and you might call me crazy, but that annoys the hell out of me. Let's echo each property one after the other so you can better see what makes me cringe:

echo $person->firstName;
echo $person->lastName;
echo $person->fullName(); // Argh! Parentheses!!! >8(

I hate it! Why would fullName be different because he's a computed value? I'm pretty sure the other fields are making fun of his parentheses when we look away. Poor guy, he doesn't deserve this. He's only doing his job the best way he knows. fullName would love to have his handicap removed, but PHP won't let him do that. Fortunately, we can fix that for him.

But... how can we compute a value without using a function !?

Well, suffer no more, because there is a way we can do this. Let me introduce one of my favorite features of PHP:

Magic Methods

You know a language is awesome when it's got a feature called Magic Methods.

Magic methods are a set of functions that you can define on an object, that will be invoked magically in some given circumstances. Here is the full list of magic methods:

__construct(), __destruct(), __call(), __callStatic(), __get(), __set(), __isset(), __unset(), __sleep(), __wakeup(), __toString(), __invoke(), __set_state(), __clone(),

Head over to the PHP documentation if you want to know more on this topic. The function that we will use today is __get(), which has the following signature:

public mixed __get(string $name);

Don't worry, we'll be covering more of those functions in upcoming posts.

I'll start straight with an example and then I'll explain what's going on:

class Foo
{
    public $bar = 'bar';
    public $baz = 'baz';

    public function __get($name)
    {
        return "Property $name not found.";
    }
}

$foo = new Foo();

echo $foo->bar;
// => bar
echo $foo->baz;
// => baz
echo $foo->qux;
// => Property qux not found

I bet you can understand what just happened. __get() acts as a catch-all for object properties. When we try to access a property that's not defined on an object, its __get() method is magically invoked, receiving the name of the property we tried to call as an argument. Now we can use that in a way to make our computed properties work.

Computed Properties

How are we going to apply this to make our computed properties work? Reusing my fullName example from the beginning, I can implement it in a very simple way :

class Person
{
    public $firstName;
    public $lastName;

    public function __get($name)
    {
        if ($name == 'fullName')
            return $firstName . ' ' . $lastName;
    }
}

$person = new Person();

$person->firstName = 'Marco';
$person->lastName = 'Fiset';

echo $person->fullName;
// => Marco Fiset

Ha! Much better, much better! Now let's say we want to define another computed property, representing the last name, followed by a comma, then the first name. We could call this property reversedFullName. Let's go ahead and modify our __get() method to include the new property:

class Person
{
    // ...

    public function __get($name)
    {
        if ($name == 'fullName')
            return $firstName . ' ' . $lastName;

        if ($name == 'reversedFullName')
            return $lastName . ', ' . $firstName;
    }
}

// ...

echo $person->reversedFullName;
// => Fiset, Marco

And it works! However you can probably see that this will grow out of hand very quickly. We need to have a way to separate all those properties, instead of cluttering the __get() function with every one of them. Here is a better version using dynamic dispatching:

public function __get($name)
{
    if (method_exists($this, $name))
        return $this->$name();
}

private function fullName()
{
    return $firstName . ' ' . $lastName;
}

private function reversedFullName()
{
    return $lastName . ', ' . $firstName;
}

As you might have already guessed, it still works. And it's much cleaner. Now every computed property gets its own method. The __get() function only checks if the current object has defined a method with the same name as the property we tried to access, and if yes, returns the result of that function. Pretty simple, huh?

The next step would be to take this __get() function and put it somewhere reusable. We certainly don't want to redefine it in every object where we want to use computed properties.

What about a base class called ComputedProperties that we could derive from?

Well, a base class would work, that's for sure. However, that might not be the best solution. Since PHP does not allow multiple inheritance of classes, our objects could not derive any class other than ComputedProperties. So that's out of question. Fortunately for us, PHP has got another trick up its sleeves.

Traits

Traits have been introduced as part of PHP 5.4. You can more about that feature in the PHP documentation. Here is an exerpt from this page explaining what Traits consist of:

Traits are a mechanism for code reuse in single inheritance languages such as PHP. A Trait is intended to reduce some limitations of single inheritance by enabling a developer to reuse sets of methods freely in several independent classes living in different class hierarchies.

What this means is that traits allow us to import functionnality into a class. In our case, it fits the bill perfectly. We can create a simple trait that will contain our computed property shenanigan:

trait ComputedProperties
{
    public function __get($name)
    {
        if (method_exists($this, $name))
            return $this->$name();
    }
}

Can't really go simpler than that. We can now use the trait in our Person class:

class Person
{
    use ComputedProperties;

    // ...
}

Aaaaannd we're done! What this essentially does is that it includes in our class everything that is contained inside of the trait. You can think of it as copy-pasting the trait's content into our class. The trait has access to everything in the class that imports it, even private methods, and that is the reason why it still works even though our methods are private.

That sums up this tutorial, I hope you had fun and that you learnt something new! I created a PHP Runnable with the full code so you can try it and experiment on your own.