Parsing EDL files using parsimonious

Today I had to review code that parses EDL files. In the review I suggested using parsimonious instead of raw regex, as this is my go-to method for parsing any structured text file. Since I wasn’t sure whether EDL files are easy or hard to parse, I decided to have a quick go at writing a parsimonious grammar, to make sure I wasn’t suggesting something worse than the current, already working solution.

Anyway, this is a good opportunity to show my approach to building a grammar. Parsimonious errors are not always helpful and quite often seem totally meaningless. The library tries hard, but it is not easy for it to point out which part of the grammar failed and why. To avoid getting bogged down in errors I usually start very simple, with just one line of the file I need to parse, and then slowly add more lines and extend the grammar as I go, continuously checking that it still parses.

A simple EDL that I found online:

TITLE: TEST PAPEREDIT
FCM: NON-DROP FRAME

001  Card01Ky AA/V  C        00:02:26:21 00:02:30:12 00:00:00:00 00:00:03:16
* FROM CLIP NAME:  KYLE_INTERVIEW.MOV
* COMMENT:
FINAL CUT PRO REEL: Card01_Kyle_Interview REPLACED BY: Card01Ky

002  Card01Ky AA/V  C        00:02:30:12 00:02:34:13 00:00:03:16 00:00:07:17
* FROM CLIP NAME:  KYLE_INTERVIEW.MOV
* COMMENT:
FINAL CUT PRO REEL: Card01_Kyle_Interview REPLACED BY: Card01Ky

Let’s start with the header. We have two header lines (potentially there can be more) with a very clear key-and-value structure. A grammar for it could be something like:

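from parsimonious import Grammar

# [A-Za-z] instead of [A-z]: the range A-z also covers [ \ ] ^ _ and `
grammar = Grammar(
    r"""
      edl             = header_entry+
      header_entry    = key ":" spaces value newline
      key             = ~"[A-Za-z0-9_ ]+"i
      value           = ~".*"i
      newline         = ~"\n*"
      spaces          = ~"\s+"
      """)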

edl = """TITLE: TEST PAPEREDIT
FCM: NON-DROP FRAME
"""

p = grammar.parse(edl)

I think this doesn’t require much explanation. One important fact about parsimonious grammars is that we have to declare everything, including newlines and spaces, similarly to regex but in a more readable way. Since this parses, we can start to extend it and add the clip entries.

grammar = Grammar(
    r"""
      edl             = header_entry+ (empty entry)+
      header_entry    = key ":" spaces value newline
      key             = ~"[A-Za-z0-9_ ]+"i

      entry           = (~".+"i newline)+
      empty           = spaces? newline

      value           = ~".*"i
      newline         = ~"\n*"
      spaces          = ~"\s+"
      """)

edl = """TITLE: TEST PAPEREDIT
FCM: NON-DROP FRAME

001  Card01Ky AA/V  C        00:02:26:21 00:02:30:12 00:00:00:00 00:00:03:16
* FROM CLIP NAME:  KYLE_INTERVIEW.MOV
* COMMENT:
FINAL CUT PRO REEL: Card01_Kyle_Interview REPLACED BY: Card01Ky
"""

p = grammar.parse(edl)

At this stage we can declare the entries in a very rudimentary way, as any characters followed by newlines. We want to go slowly and keep it working without adding too much in each step. What is left now is to extend the definition of “entry” to extract all the data we need from it. The definition of “edl” is already complete: a few header lines followed by clip entries separated by empty lines.

Quite often I build the visitor at the same time as I extend the grammar, again to make sure it all works and I don’t have to deal with complicated error messages while figuring out which part of the code is broken. If it stops working, then whatever I did last is what broke it. It is good practice to write unit tests and iterate between the tests and the code, slowly adding complexity.

The complete code to parse EDL could be something like this:

import attr
from parsimonious import Grammar, NodeVisitor


@attr.s
class Clip:
    index = attr.ib()
    reel = attr.ib()
    tracks = attr.ib()
    transition = attr.ib()
    out_start = attr.ib()
    out_end = attr.ib()
    src_start = attr.ib()
    src_end = attr.ib()
    attrs = attr.ib()

grammar = Grammar(
    r"""
      edl             = header_entry+ (empty entry)+
      header_entry    = key ":" spaces value newline
      key             = ~"[A-Za-z0-9_ ]+"i

      entry           = index spaces reel spaces tracks spaces transition spaces timings newline attrib+
      index           = ~"[0-9]+"

      attrib          = "*" spaces key ":" (newline/spaces) value newline

      tracks          = string (slash string)?
      timings         = timecode spaces timecode spaces timecode spaces timecode

      reel            = ~"[A-Za-z0-9_]+"i
      transition      = ~"[A-Za-z]+"i
      string          = ~"[A-Za-z0-9_]+"i
      timecode        = time ":" time ":" time ":" time
      time            = ~"[0-9][0-9]"
      empty           = spaces? newline
      slash           = "/"
      value           = ~".*"i
      newline         = ~"\n*"
      spaces          = ~"\s+"
      """)


class V(NodeVisitor):

    def generic_visit(self, node, visited_children):
        # Leaves have no children: return the node itself so .text is usable
        return visited_children or node

    def visit_edl(self, node, visited_children):
        # visited_children[0]: one dict per header line
        # visited_children[1]: (empty, entry) pairs, entry is already a Clip
        return {
            'header': {k: v for d in visited_children[0] for k, v in d.items()},
            'clips': [ch[1] for ch in visited_children[1]],
        }

    def visit_header_entry(self, node, visited_children):
        return {visited_children[0].text: visited_children[3].text}

    def visit_entry(self, node, visited_children):
        # Indices 0, 2, 4, 6, 8 are the fields; the ones in between are spaces
        return Clip(index=visited_children[0].text,
                    reel=visited_children[2].text,
                    tracks=visited_children[4],
                    transition=visited_children[6].text,
                    out_start=visited_children[8]['out_start'],
                    out_end=visited_children[8]['out_end'],
                    src_start=visited_children[8]['src_start'],
                    src_end=visited_children[8]['src_end'],
                    attrs=visited_children[-1])

    def visit_tracks(self, node, visited_children):
        return node.text

    def visit_attrib(self, node, visited_children):
        return {visited_children[2].text: visited_children[5].text}

    def visit_timecode(self, node, visited_children):
        return node.text
   
    def visit_timings(self, node, visited_children):
        # EDL (CMX3600) event lines list source in/out first, then the
        # record (output timeline) in/out
        return {'src_start': visited_children[0],
                'src_end': visited_children[2],
                'out_start': visited_children[4],
                'out_end': visited_children[6]}


from pprint import pprint

# edl must hold the complete sample file here; the output below comes
# from a five-clip version of it
p = grammar.parse(edl)
v = V()
pprint(v.visit(p))

{
'clips': [
           Clip(index='001', reel='Card01Ky', tracks='AA/V', transition='C', out_start='00:00:00:00', out_end='00:00:03:16', src_start='00:02:26:21', src_end='00:02:30:12', attrs=[{'FROM CLIP NAME': '  KYLE_INTERVIEW.MOV'}, {'COMMENT': 'FINAL CUT PRO REEL: Card01_Kyle_Interview REPLACED BY: Card01Ky'}]),
           Clip(index='002', reel='Card01Ky', tracks='AA/V', transition='C', out_start='00:00:03:16', out_end='00:00:07:17', src_start='00:02:30:12', src_end='00:02:34:13', attrs=[{'FROM CLIP NAME': '  KYLE_INTERVIEW.MOV'}, {'COMMENT': 'FINAL CUT PRO REEL: Card01_Kyle_Interview REPLACED BY: Card01Ky'}]),
           Clip(index='003', reel='Card02Je', tracks='AA/V', transition='C', out_start='00:00:07:17', out_end='00:00:08:05', src_start='00:00:26:12', src_end='00:00:27:00', attrs=[{'FROM CLIP NAME': '  JEFF_INTERVIEW.MOV'}, {'COMMENT': 'FINAL CUT PRO REEL: Card02_Jeff_Interview REPLACED BY: Card02Je'}]),
           Clip(index='004', reel='Card02Je', tracks='AA/V', transition='C', out_start='00:00:08:05', out_end='00:00:11:08', src_start='00:00:28:22', src_end='00:00:32:00', attrs=[{'FROM CLIP NAME': '  JEFF_INTERVIEW.MOV'}, {'COMMENT': 'FINAL CUT PRO REEL: Card02_Jeff_Interview REPLACED BY: Card02Je'}]),
           Clip(index='005', reel='Card01Ky', tracks='AA/V', transition='C', out_start='00:00:11:08', out_end='00:00:15:24', src_start='00:01:08:03', src_end='00:01:12:19', attrs=[{'FROM CLIP NAME': '  KYLE_INTERVIEW.MOV'}, {'COMMENT': 'FINAL CUT PRO REEL: Card01_Kyle_Interview REPLACED BY: Card01Ky'}])
           ],
'header': 
        {
         'FCM': 'NON-DROP FRAME', 
         'TITLE': 'TEST PAPEREDIT'
        }
}
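The attrs in the output above come back as a list of single-key dicts, and values such as '  KYLE_INTERVIEW.MOV' keep their leading spaces. A possible clean-up step, sketched here with a hypothetical helper merge_attrs that is not part of the code above, could flatten and strip them after visiting:

```python
def merge_attrs(attrs):
    """Collapse a list of single-key dicts into one dict with stripped values."""
    merged = {}
    for attr_dict in attrs:
        for key, value in attr_dict.items():
            merged[key.strip()] = value.strip()
    return merged


# Values copied from the first clip in the output above.
attrs = [
    {'FROM CLIP NAME': '  KYLE_INTERVIEW.MOV'},
    {'COMMENT': 'FINAL CUT PRO REEL: Card01_Kyle_Interview REPLACED BY: Card01Ky'},
]

print(merge_attrs(attrs))
```

Doing this in plain Python after the visit keeps the grammar unchanged. The alternative would be to tighten the attrib rule itself, which is more likely to break on other EDL variants.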

I based this on just one sample EDL, so it is most likely not a complete solution.