Today is #TheoryThursday 🧐!
I want to talk about a tiny part of a topic that is one of my passions: Formal Language Theory.
❓ Why can't you use regular expressions to process HTML? 🧵👇
❓ What is the one thing they all have in common?
👉 They can be infinitely nested.
In all these languages you have some constructions (e.g., expressions) which can be as complex as you desire. These constructions are defined recursively. For example:
✏️ An expression is either a number, or a sum, subtraction, multiplication, or division of two expressions.
❓ But is HTML a programming language or not?
Who cares? HTML has recursive constructions as well. From the syntax perspective, which is what we are discussing here, it is of the same nature as all the other.
(☝️ We are talking about syntax, not semantics here)
⭐ These are all called Context-Free languages.
Their fundamental characteristic is that they can be defined with a formalism called Context-Free Grammars (CFG).
(☝️ There are types of languages which are not context-free)
Informally, a CFG is a set of transformation rules that define which sequences of symbols are valid in a given language. For example, take arithmetic expressions:
- 👍 ( 31 + ( 245 * 15 ) ) -- is a valid sequence
- 👎 63 + ) 23 * ( -- is an invalid sequence
Here is one possible grammar for expressions.
Starting with the symbol
<Expression> and applying randomly any rule (called a production), you will always reach some combination of
NUMBER and the symbols
) that is valid.
This is called a derivation.
👉 What's more, any possible arithmetic expression can be formed by some sequence of applications of that grammar rules. Even better, in this particular example, there is exactly one such sequence for any valid expression.
❓ Why is this relevant?
Because we can analyze the syntactic complexity of all programming languages just by analyzing the language for expressions. They are all equally complex.
The fundamental question we want to answer is this 👇
❓ What is the simplest program than can recognize all (and only) valid expression?
This is called "parsing" an expression. I'm sure you've heard the term.
👉 Parsing is finding the sequence of transformations that goes from
<Expression> to whatever you want to parse.
If that sequence exists, then we know that it is a valid expression, and even more, we know everything we need to evaluate it.
💡 Parsing is the first step that all compilers and interpreters do, from Python to C++, to yes, the HTML parser in Chrome.
Now we come round back to regular expressions.
👉 It turns out that standard regular expressions are computationally insufficient to parse any language that requires a context-free grammar.
(😢 While proving this is not hard, sadly it is too long for this thread)
You can say, intuitively, that regular expressions cannot "count" the matching opening and closing parenthesis. This is because regexes, deep down, are just overpriced finite automata.
(🙄 Yes, I know this is more nuanced, I'm talking about standard regexes only)
So, next time someone suggests you to "just use a regex to parse that HTML", you can 🤦 and answer, in formal language lingo: "no, regexes cannot parse context-free languages".
You won't make any new friends, but at least you'll be right 😝
As usual, if you like this topic, reply in this thread or @ me at any time. Feel free to ❤️ like and 🔁 retweet if you think someone else could benefit from knowing this stuff.
🧵 Read this thread online at https://apiad.net/tweetstorms/theorythursday-cfg
Stay curious 🖖:
- 📄 https://en.wikipedia.org/wiki/Context-free_grammar
- 📚 https://suif.stanford.edu/dragonbook/
- 🏫 https://www.edx.org/course/compilers
- 😆 https://stackoverflow.com/questions/1732348/
- 🖼️ https://xkcd.com/208/
🗨️ You can see this tweetstorm as originally posted in this Twitter thread.