WebVTT: Refactoring the Parser

So far I’ve made only a little progress toward finishing the cue text parsing portion of the parser, but I have a pretty clear idea of where I want to go.


I’ve squashed the Node structs into one node and renamed it a cuetext_tag.

  • The renaming should clear up some misunderstandings and make the purpose of the struct clearer.
  • Squashing the structs into one reduces the number of functions we need to maintain and makes the code easier to read.
  • Since all the data is now in one struct and is no longer referenced through a void *, we can easily inspect it while debugging. With a void *, the debugger doesn’t know what type the data is, so you can’t see what is inside it.

	/**
	 * The specification asks for a unidirectional linked list, but we have added a parent
	 * in order to facilitate an iterative cue text parsing solution.
	 */
	webvtt_cuetext_tag *parent;
	webvtt_cuetext_tag_kind kind;

	/**
	 * This union will hold either an internal cue text tag data struct representing the internal data of an internal cue text tag
	 * data type, a bytearray representing the text of a text cue text type, or a time stamp representing the time stamp of
	 * a time stamp cue text tag.
	 */
	union {
		webvtt_internal_cuetext_tag_data *internal_cuetext_tag_data;
		webvtt_bytearray text;
		webvtt_timestamp time_stamp;
	};

	webvtt_string annotation;
	webvtt_bytearray_list *css_class_list;

	webvtt_uint alloc;
	webvtt_uint length;
	webvtt_cuetext_tag *children;

In accordance with this I’ve renamed everything having to do with Nodes to cuetext_tag.
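
To show why the parent pointer helps with an iterative approach, here is a minimal sketch of walking a tag tree in document order without recursion. The demo_tag type and both functions are made up for illustration and are not the parser’s real API. I’m also assuming that children live in one contiguous array, which is how I read the children/length fields above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for webvtt_cuetext_tag. */
typedef struct demo_tag demo_tag;
struct demo_tag {
    demo_tag *parent;   /* NULL for the root */
    demo_tag *children; /* contiguous array of child tags (assumption) */
    size_t length;      /* number of children in use */
    int id;             /* payload used only by this demo */
};

/*
 * Returns the tag that follows `tag` in a pre-order walk of the tree
 * rooted at `root`, or NULL when the walk is finished.
 */
demo_tag *demo_tag_next(const demo_tag *root, demo_tag *tag)
{
    assert(root != NULL && tag != NULL);

    /* Descend first: the first child comes right after its parent. */
    if (tag->length > 0)
        return &tag->children[0];

    /* Otherwise climb until some ancestor has an unvisited sibling. */
    while (tag != root) {
        demo_tag *parent = tag->parent;
        size_t index = (size_t)(tag - parent->children);

        if (index + 1 < parent->length)
            return &parent->children[index + 1];
        tag = parent;
    }
    return NULL;
}

/* Sums the ids of every tag in the tree using a plain loop. */
int demo_tag_sum_ids(demo_tag *root)
{
    int sum = 0;
    demo_tag *tag;

    for (tag = root; tag != NULL; tag = demo_tag_next(root, tag))
        sum += tag->id;
    return sum;
}
```

Without the parent pointer, the climb back up would need either recursion or an explicit stack of ancestors.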

Character Encoding

We’ve decided to switch from UTF-16 to UTF-8, as this will decrease the complexity of the parser.

  • Input to the parser is already supposed to be UTF-8, or converted to UTF-8, so converting it to UTF-16 is an extra step that only adds overhead.
  • All the characters the parser needs to recognize in order to parse a WEBVTT file have the same values in UTF-8 as in ASCII, so we don’t have to write custom functions to work with strings.
  • If we really want the output of the parser to be something other than UTF-8, we can convert it before we make the parsed WEBVTT file available.

Another thing we’ll need to do is make a string list for webvtt_bytearray (what we are using to handle UTF-8), just like we did for the UTF-16 stuff, so that the cue text parser can store the list of CSS classes that have been applied to a cue text tag.
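
To make the ASCII point concrete: every byte of a multi-byte UTF-8 sequence has its high bit set, so a plain byte-by-byte scan for an ASCII character such as '<' can never match in the middle of another character. Here is a small sketch of that idea. The function name demo_find_ascii is my own invention for this example and is not part of the parser.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Returns the byte offset of the first occurrence of the ASCII character
 * `needle` in the UTF-8 buffer `text`, or `length` if it is not present.
 * No decoding is needed because bytes below 0x80 only ever encode ASCII.
 */
size_t demo_find_ascii(const unsigned char *text, size_t length,
                       unsigned char needle)
{
    size_t i;

    assert(text != NULL);
    assert(needle < 0x80); /* only ASCII needles are safe to scan for */

    for (i = 0; i < length; i++) {
        if (text[i] == needle)
            return i;
    }
    return length;
}
```

The same loop over UTF-16 data would have to work in 16-bit units and care about byte order, which is the kind of custom string handling we are trying to avoid.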

Other Parser Things

The other thing I will definitely be doing is getting rid of the token structs in the cue text parser. Right now the cue text parser calls a tokenizer, which creates a token that has basically the same data and structure as a node (now a cuetext_tag). This token then gets mapped over to a particular cuetext_tag. This is how the WEBVTT algorithm describes it, but I don’t really see the point if we are just going to translate one into the other. So I’m going to take the tokens out and instantiate cuetext_tags straight away instead of going through a token struct first.
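
Roughly, the idea is for the code that recognizes a tag to fill in the final struct itself. The sketch below shows that shape for simple start tags only. Everything in it (demo_tag, demo_tag_kind, demo_parse_start_tag, the fixed-size annotation buffer) is hypothetical and much simpler than what the real parser has to handle; it ignores classes, end tags, time stamps and escapes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef enum {
    DEMO_TAG_UNKNOWN,
    DEMO_TAG_CLASS,     /* <c> */
    DEMO_TAG_ITALIC,    /* <i> */
    DEMO_TAG_BOLD,      /* <b> */
    DEMO_TAG_UNDERLINE, /* <u> */
    DEMO_TAG_VOICE      /* <v> */
} demo_tag_kind;

/* Hypothetical, simplified stand-in for webvtt_cuetext_tag. */
typedef struct {
    demo_tag_kind kind;
    char annotation[32]; /* text after the name, e.g. "Bob" in <v Bob> */
} demo_tag;

/*
 * Parses a start tag such as "<b>" or "<v Bob>" at the beginning of `text`
 * and fills `out` directly, with no intermediate token struct. Returns the
 * number of bytes consumed, or 0 if `text` does not begin with a start tag.
 */
size_t demo_parse_start_tag(const char *text, demo_tag *out)
{
    const char *end, *name, *name_end, *ann;
    size_t name_len, ann_len;

    assert(text != NULL && out != NULL);
    if (text[0] != '<' || text[1] == '/')
        return 0;
    end = strchr(text, '>');
    if (end == NULL)
        return 0;

    /* The tag name runs up to the first space, dot, or the closing '>'. */
    name = text + 1;
    name_end = name;
    while (name_end < end && *name_end != ' ' && *name_end != '.')
        name_end++;
    name_len = (size_t)(name_end - name);

    out->kind = DEMO_TAG_UNKNOWN;
    if (name_len == 1) {
        switch (name[0]) {
        case 'c': out->kind = DEMO_TAG_CLASS; break;
        case 'i': out->kind = DEMO_TAG_ITALIC; break;
        case 'b': out->kind = DEMO_TAG_BOLD; break;
        case 'u': out->kind = DEMO_TAG_UNDERLINE; break;
        case 'v': out->kind = DEMO_TAG_VOICE; break;
        default: break;
        }
    }

    /* Anything after the first space is kept as the annotation. */
    out->annotation[0] = '\0';
    ann = memchr(name_end, ' ', (size_t)(end - name_end));
    if (ann != NULL) {
        ann++; /* skip the space itself */
        ann_len = (size_t)(end - ann);
        if (ann_len >= sizeof out->annotation)
            ann_len = sizeof out->annotation - 1; /* truncate to fit */
        memcpy(out->annotation, ann, ann_len);
        out->annotation[ann_len] = '\0';
    }
    return (size_t)(end - text) + 1;
}
```

The trade-off is that the code no longer mirrors the algorithm step for step, but there is one struct fewer to allocate, copy and free for every tag.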

C++ Bindings

We’ve been having a lot of memory leak problems with the Node C++ bindings (soon to be cuetext_tag bindings). I’ve thought about it, and what I would suggest is to loop through the entire tree of cuetext_tags when the cue that owns them is instantiated and create a C++ cuetext_tag object for each one of them.

Then when we call child() in the unit tests we won’t be allocating any new memory; we’d just be retrieving the already created C++ cuetext_tag.

Then, since the cue knows about its tree of cuetext_tags, it can start a cascading delete on the entire tree when its destructor is called.

Voila, no more memory leaks.
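
Here is a rough sketch of that ownership pattern. The real bindings are C++, but I’ve written it in plain C to match the parser code, so the constructor and destructor show up as explicit create and destroy functions. All the names here (demo_tag, demo_wrapper, demo_live_wrappers and the three functions) are hypothetical, and the live counter exists only so a test can check for leaks.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-in for the parser's C cuetext_tag. */
typedef struct demo_tag demo_tag;
struct demo_tag {
    demo_tag *children; /* contiguous array of child tags */
    size_t length;
};

/* Hypothetical binding-side wrapper: one per C tag, owned by its parent. */
typedef struct demo_wrapper demo_wrapper;
struct demo_wrapper {
    const demo_tag *tag;
    demo_wrapper **children;
    size_t length;
};

size_t demo_live_wrappers = 0; /* wrappers currently allocated */

/* Cascading delete: frees a wrapper and every wrapper below it. */
void demo_wrapper_destroy(demo_wrapper *w)
{
    size_t i;

    if (w == NULL)
        return;
    for (i = 0; i < w->length; i++)
        demo_wrapper_destroy(w->children[i]);
    free(w->children);
    free(w);
    demo_live_wrappers--;
}

/* Eagerly builds wrappers for `tag` and its whole subtree. */
demo_wrapper *demo_wrapper_create(const demo_tag *tag)
{
    size_t i;
    demo_wrapper *w = calloc(1, sizeof *w);

    if (w == NULL)
        return NULL;
    demo_live_wrappers++;
    w->tag = tag;
    if (tag->length > 0) {
        w->children = calloc(tag->length, sizeof *w->children);
        if (w->children == NULL) {
            demo_wrapper_destroy(w);
            return NULL;
        }
        for (i = 0; i < tag->length; i++) {
            w->children[i] = demo_wrapper_create(&tag->children[i]);
            if (w->children[i] == NULL) {
                demo_wrapper_destroy(w); /* frees the children built so far */
                return NULL;
            }
            w->length++;
        }
    }
    return w;
}

/* Returns the existing child wrapper; never allocates. */
demo_wrapper *demo_wrapper_child(const demo_wrapper *w, size_t index)
{
    assert(w != NULL);
    return index < w->length ? w->children[index] : NULL;
}
```

Because demo_wrapper_child only hands back a pointer that the parent already owns, calling it any number of times cannot leak, and a single demo_wrapper_destroy on the root releases everything.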

I’m going to be rewriting those bindings after I finish the changes to the parser. Hopefully it won’t take too long.



  1. Jason Ronallo

    Interesting to see you working on a WEBVTT parser. I’d written one a while back in Ruby and just released the gem. It isn’t a conformant parser as I took a few shortcuts where it didn’t matter for the WEBVTT files we were producing. I’d be interested in seeing if I could make it pass a port of the official test suite to get it updated to the spec. Where would I go to get started with that?

  2. Rick Eyre

    There is no official test suite right now, as every browser is currently doing its own thing for WEBVTT. You can check out the work we’ve been doing at https://github.com/humphd/webvtt/tree/seneca. The work is by no means complete and it’s in no way official, but we are hoping that it will eventually get adopted by Firefox and by any other project that would like to use it. The main parser is written in C and uses Google Test with C++ bindings for its test suite, which isn’t finished yet. You would have to do some workarounds to get it to work in Ruby. If you want to get involved you can hop on IRC at irc.mozilla.org, channel #seneca. We could always use help!
