Jed v0.1 – Parsing the DSL with Lark

Camilo MATAJIRA Avatar

Following the previous post, this post deals with parsing the DSL (the grammar) of jed.
The idea of this mini project is to be able to parse the syntax of the substitute command.

Below are examples of the substitute command (S and s).

jed '/^comm.*/./aut.*/ s/Camilo MATAJIRA/Camilo A. MATAJIRA/g' file.json
jed '/author/ S/name/author_name/g' file.json
jed ':/fake_email@gmail.com/ S/email/commiter_mail/g' file.json
jed '/.*/:/fake_email@gmail.com/ S/email/commiter_mail/g' file.json
jed '/.*/./.*/:/fake_email@gmail.com/ S/email/commiter_mail/g' file.json
jed '/.*/./.*/./.*/:/fake_email@gmail.com/ S/email/commiter_mail/g' file.json

To parse the grammar of jed, I use Lark. (https://lark-parser.readthedocs.io/en/stable/)
Lark is “a modern parsing library for Python. Lark can parse any context-free grammar.”

Turns out that parsing the grammar was quite simple. Below is the code snippet to
parse the substitute command:

from lark import Lark, Transformer

grammar = r"""
start: command+

command: JREGEX* ":"? JVALUE? "s/" OLD_PATTERN "/" NEW_PATTERN "/" FLAGS   -> jed_substitute_value_regex
       | JREGEX* ":"? JVALUE? "S/" OLD_PATTERN "/" NEW_PATTERN "/" FLAGS   -> jed_substitute_key_regex

REGEX: /[a-zA-Z0-9 \[\]+.?*]+/
JREGEX: "/"REGEX"/""."?
JVALUE: "/"REGEX"/"
NEW_PATTERN: REGEX
OLD_PATTERN: REGEX
FLAGS: LETTER+

%import common.LETTER
%import common.WS
%ignore WS
"""

Lark grammars define terminals with string literals and regular expressions, and a terminal can be built on top of other terminals.
For example, JREGEX is built on top of REGEX.
Because the rule uses JREGEX*, it can capture a succession of key regexes such as “/.*/./.*/./.*/”.
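To see how the composed terminal splits such a chain, here is a rough illustration outside Lark, using Python's re module. The pattern is my own transcription of JREGEX ("/" REGEX "/" "."?) into a single regular expression, and split_key_path is a hypothetical helper name, so treat it as a sketch of the idea rather than what Lark does internally.

```python
import re

# Character class copied from the REGEX terminal in the grammar above.
REGEX = r"[a-zA-Z0-9 \[\]+.?*]+"

# JREGEX: "/" REGEX "/" "."?  transcribed as one Python pattern (an assumption).
JREGEX = re.compile(rf"/{REGEX}/\.?")


def split_key_path(path):
    """Return each '/regex/' segment, including its optional trailing dot."""
    return JREGEX.findall(path)


print(split_key_path("/.*/./.*/./.*/"))
# ['/.*/.', '/.*/.', '/.*/']
```

With the character class exactly as written, characters such as - or _ are not accepted, so this sketch sticks to patterns that only use the listed characters.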

The most impressive part of Lark is the arrow (->) after a rule alternative: it gives that alternative a name (an alias),
and Lark's Transformer calls the method with the same name when it processes the parse tree. See the code snippet below:

JREGEX* ":"? JVALUE? "s/" OLD_PATTERN "/" NEW_PATTERN "/" FLAGS   -> jed_substitute_value_regex

That is, if the rule on the left matches, the transformer method “jed_substitute_value_regex” is called with the matched tokens.

I created jed v0.1 just to show how easy Lark makes interpreting jed’s DSL, and what output the transformer receives.

#!/usr/bin/env -S uv run --script
#
# /// script
# requires-python = ">=3.12"
# dependencies = ["lark>=1.3.1"]
# ///


import argparse
import sys
from lark import Lark, Transformer

grammar = r"""
start: command+

command: JREGEX* ":"? JVALUE? "s/" OLD_PATTERN "/" NEW_PATTERN "/" FLAGS   -> jed_substitute_value_regex
       | JREGEX* ":"? JVALUE? "S/" OLD_PATTERN "/" NEW_PATTERN "/" FLAGS   -> jed_substitute_key_regex

REGEX: /[a-zA-Z0-9 \[\]+.?*]+/
JREGEX: "/"REGEX"/""."?
JVALUE: "/"REGEX"/"
NEW_PATTERN: REGEX
OLD_PATTERN: REGEX
FLAGS: LETTER+

%import common.LETTER
%import common.WS
%ignore WS
"""


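class SedTransformer(Transformer):
    def __init__(self, text):
        super().__init__()
        self.text = text

    # Lark calls the method whose name matches the alias after "->" in the grammar.
    # args is the list of named tokens; anonymous literals such as "s/" are filtered out.
    def jed_substitute_value_regex(self, args):
        print("Function: jed_substitute_value_regex")
        print(args)
        # v0.1 only demonstrates the dispatch, so stop after the first command.
        sys.exit()

    def jed_substitute_key_regex(self, args):
        print("Function: jed_substitute_key_regex")
        print(args)
        # v0.1 only demonstrates the dispatch, so stop after the first command.
        sys.exit()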


if __name__ == "__main__":
    argument_parser = argparse.ArgumentParser(
        prog="jed",
        description="Sed for json!",
    )
    argument_parser.add_argument("jed_script")
    args = argument_parser.parse_args()

    grammar_parser = Lark(grammar)
    tree = grammar_parser.parse(args.jed_script)
    t = SedTransformer("")
    t.transform(tree)
    print(t.text)

# vim: set syntax=python filetype=python:

Below is the example output:

./jed "/url.*/./b.*/./[a-z]+/:/Helloold/ s/old/new/g"
Function: jed_substitute_value_regex
[Token('JREGEX', '/url.*/.'), Token('JREGEX', '/b.*/.'), Token('JREGEX', '/[a-z]+/'), Token('JVALUE', ':/Helloold/'), Token('OLD_PATTERN', 'old'), Token('NEW_PATTERN', 'new'), Token('FLAGS', 'g')]

./jed "/url.*/./b.*/./[a-z]+/:/hello_old/ s/old/new/g"
Function: jed_substitute_value_regex
[Token('JREGEX', '/url.*/.'), Token('JREGEX', '/b.*/.'), Token('JREGEX', '/[a-z]+/'), Token('JVALUE', ':/hello_old/'), Token('OLD_PATTERN', 'old'), Token('NEW_PATTERN', 'new'), Token('FLAGS', 'g')]

./jed "/url.*/ S/old/new/g"
Function: jed_substitute_key_regex
[Token('JREGEX', '/url.*/'), Token('OLD_PATTERN', 'old'), Token('NEW_PATTERN', 'new'), Token('FLAGS', 'g')]

With this information the rest is easy to build: Lark calls the corresponding function through the transformer,
and the arguments arrive as a flat list of tokens, each carrying its terminal name and the matched text.
The next step is to implement the logic of the substitute command.
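As a rough sketch of where that next step could start, the token list printed above can be folded into a small record. Tok, SubstituteCommand, between_slashes and build_command are hypothetical names, and Tok is only a stand-in for Lark's Token (which exposes .type and .value) so that the sketch runs without the parser.

```python
from dataclasses import dataclass, field
from typing import NamedTuple, Optional


class Tok(NamedTuple):
    """Stand-in for lark.Token: just the terminal name and the matched text."""
    type: str
    value: str


@dataclass
class SubstituteCommand:
    key_patterns: list = field(default_factory=list)  # one regex per key level
    value_pattern: Optional[str] = None                # regex on the value, if any
    old: str = ""
    new: str = ""
    flags: str = ""


def between_slashes(text):
    """'/url.*/.' -> 'url.*' and ':/hello_old/' -> 'hello_old'."""
    return text[text.index("/") + 1 : text.rindex("/")]


def build_command(tokens):
    """Fold a flat token list into a SubstituteCommand."""
    cmd = SubstituteCommand()
    for tok in tokens:
        if tok.type == "JREGEX":
            cmd.key_patterns.append(between_slashes(tok.value))
        elif tok.type == "JVALUE":
            cmd.value_pattern = between_slashes(tok.value)
        elif tok.type == "OLD_PATTERN":
            cmd.old = tok.value
        elif tok.type == "NEW_PATTERN":
            cmd.new = tok.value
        elif tok.type == "FLAGS":
            cmd.flags = tok.value
    return cmd


# Token list taken from the last example output above.
cmd = build_command([
    Tok("JREGEX", "/url.*/"),
    Tok("OLD_PATTERN", "old"),
    Tok("NEW_PATTERN", "new"),
    Tok("FLAGS", "g"),
])
print(cmd.key_patterns, cmd.value_pattern, cmd.old, cmd.new, cmd.flags)
# ['url.*'] None old new g
```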
