Evaluation in JS Part 1: Tokenizer

Or how to learn the basics of interpreters and compilers

So a long time ago, when I was first getting into software development, I was tasked by a friend of mine to try and solve a simple problem: "How would you evaluate the string '1 + 1 - 1' using only JavaScript? You're not allowed to use eval or Function." It seemed like an easy question when he first asked it, but I was stumped pretty quickly. He pointed me in the right direction after a little bit of time, telling me to look up tokenization and lexical analysis. He also informed me that this was the same concept behind writing a programming language, and that seemed like useful knowledge to me, so I set about learning how to solve this problem!

For those of you who are not familiar, tokenization (also called lexical analysis) is the first step in the process by which we evaluate expressions and code. Evaluation happens in three steps: tokenization, parsing, and evaluation. First, we run through a string and break it up into tokens, each of which stores the actual value of a character along with what type of value it is. Second, we parse our tokens to form what is called an Abstract Syntax Tree, or AST. Third, we walk through our AST and evaluate our expressions, producing output.
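Here's a rough trace of those three stages for the input '1 + 1'. The exact token and AST shapes below are just assumptions for illustration; we'll pin down our own versions as we build each stage.

```javascript
const input = '1 + 1';

// 1. Tokenization: break the string into typed tokens.
const tokens = [
  { value: '1', type: 'literal' },
  { value: '+', type: 'operator' },
  { value: '1', type: 'literal' }
];

// 2. Parsing: arrange the tokens into an Abstract Syntax Tree.
// (This node shape is hypothetical; we'll design our own in part 2.)
const ast = {
  type: 'BinaryExpression',
  operator: '+',
  left: { type: 'Literal', value: 1 },
  right: { type: 'Literal', value: 1 }
};

// 3. Evaluation: walk the tree and compute the result, e.g. 2.
```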

I worked along with a couple of guides and was able to complete the problem, but I want to come back to it today and implement my own version. I've learned a lot since then and feel much more comfortable coming up with my own solutions, so I'm gonna start a small series where I work to complete this. This time I won't be looking at anyone else's code; everything will be written by me! Having done this problem before, I know that parts 1 and 3 will be relatively simple, but the parsing can be a bit of a problem. I will go ahead and write my own parser for this; I know it might not be necessary nor efficient, but it's for the sake of learning!

The plan from here on out is to have this be a 3 part series, working through the Tokenization, Parsing, and Evaluation of our AST.

So, this will be the first part, where we build the tokenizer. As I said above, tokenization is where we loop through our string and break up the various characters into tokens. A token will be an object that has two properties: value, which will hold the character itself, and type, which will be what type of token this is. Let's go ahead and implement a generateToken function.

const generateToken = value => {
  let token = { value };
  if (/[+\-]/.test(value)) token = { ...token, type: 'operator' };
  if (/\d/.test(value)) token = { ...token, type: 'literal' };
  return token;
};

We currently have two token types: literals and operators. Literals will store values for us. We check to see if what has been given is a digit, and if it is we store it in a token with the literal type. Our other type of token is an operator token, which is what we use for the + or - operators. We will use this later when we build our AST; for now we just need to concern ourselves with building tokens.
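We can sanity-check generateToken by calling it on a few characters (the definition is repeated here so the snippet runs on its own):

```javascript
// generateToken as defined above, repeated so this snippet is standalone.
const generateToken = value => {
  let token = { value };
  if (/[+\-]/.test(value)) token = { ...token, type: 'operator' };
  if (/\d/.test(value)) token = { ...token, type: 'literal' };
  return token;
};

console.log(generateToken('7')); // { value: '7', type: 'literal' }
console.log(generateToken('-')); // { value: '-', type: 'operator' }
console.log(generateToken('x')); // { value: 'x' } — neither regex matched, so no type
```

Note that an unrecognized character just comes back with no type at all; whether to throw an error for those instead is a design choice we could revisit later.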

So we have a way to generate a token from a single character; now we need to run it over a whole string. Let's write our tokenize function.

const tokenize = input => (
  input
    .split('')
    .filter(char => char !== ' ')
    .map(char => generateToken(char))
);

So we take in our input and split everything into individual characters. We also filter out whitespace, since we don't care whether things are separated by spaces or not. Next we map our split-up characters to their respective tokens. Running tokenize('1 + 1 - 2') produces the following output.

[
  { value: '1', type: 'literal' },
  { value: '+', type: 'operator' },
  { value: '1', type: 'literal' },
  { value: '-', type: 'operator' },
  { value: '2', type: 'literal' }
]
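One nice property worth noting: because whitespace is filtered out before mapping, spacing in the input doesn't matter at all. Here's a standalone sketch (repeating both functions so it runs on its own) demonstrating that:

```javascript
// Both functions from above, repeated so this snippet is standalone.
const generateToken = value => {
  let token = { value };
  if (/[+\-]/.test(value)) token = { ...token, type: 'operator' };
  if (/\d/.test(value)) token = { ...token, type: 'literal' };
  return token;
};

const tokenize = input => (
  input
    .split('')
    .filter(char => char !== ' ')
    .map(char => generateToken(char))
);

// Spacing makes no difference once whitespace is dropped:
const spaced = tokenize('1 + 1 - 2');
const cramped = tokenize('1+1-2');
console.log(JSON.stringify(spaced) === JSON.stringify(cramped)); // true
```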

Pretty good! We now have an array of tokens that we can use when we go to build our AST! Honestly, tokenization is pretty simple. A lot of the work will come in part two, so stay tuned for that!