So we’re on the final day before our 0.2 release for WebVTT is due and my teams C parser is yet to be completed. Hopefully, my partner and I can finish off things later tonight and we can get it finished off for tomorrow. However, if we don’t get it completed not all is lost. The primary goal of this release was to learn about the WebVTT parser by having three separate teams work on coding three different parsers. The idea here was to follow the design principle “Build two and plan to throw one away”. Judged under those goals I believe my team has met them. We have learnt much about what we will need to do in order to make a good parser and much about how we should go about doing that. I’ll go through the most crucial things that I believe we need to get right to have a final parser that is good.
Error Handling and Parser Ignore Logging
We need a good design for error handling and ignore logging. Both of them represent similar ideas and similar implementations but have but have different end meanings. Errors are those things which cause the parser to completely break, whereas parser ignore logging is when the parser encounters malformed WebVTT text in the WebVTT byte stream. We want to log all of these ignores and errors and have a way to hook into ignore logger so that we can test properly. The Prof brought up the fact that a lot of the tests where we would expect an ignore will actually be Unit Tests, so that will simplify things a bit.
Efficient Way of Loading and Working With Lines of Text
By far the most annoying thing that we had to deal with during this initial parser development was figuring out how to load and work with lines efficiently while retaining a link to the context of the byte stream.
The parser asks you to work with text in a couple different ways.
- Read an entire line of text and compare a multi-character value to the string
- Split text on spaces and work with those split strings
- Loop through char by char and compare values
All of these things seem pretty simple but what makes it complicated is that we want to be able to use all these different methods while retaining a link to the actual position of what character we are on in the byte stream. This way when we ignore some cue text we can tell exactly where it ignored that cue text and spit that out into a logger. That means that we can’t be creating separate string, or pointers, that are completely decoupled from the byte stream, because we will lose the ability to see where it ignored cue text.
One thing to think about is whether or not loading lines from the stream to a separate string is even needed, maybe we can just work with offsets that can denote the current position and the beginning and end of the line that is going to be parsed.
I don’t think this problem is particularly hard. It just gave my team a lot of trouble because we didn’t know all the ways in which the parser would require us to work with text. If we rewrote it again, which we will, we could solve this problem probably pretty easily.
The main thing here for our track cue text data structure, I discussed this a while ago, is whether or not we should squash the different types of nodes into a couple of main nodes, or if we should keep them separate.
If we squash them it will simplify the code a bit but might be hard in the future to modify the code if the specification changes a lot and we end up having to separate the squashed data structure.
If we keep the data structure roughly as the way it is now it will allow more flexibility in the future if we need to change any one of the node type structs because the specification changes. Keeping the data structure as separate as possible will align us with the design principle of separation of concerns.
This is definitely something I think we need to discuss as a class and make a decision on.
As I worked on the C parser I started work on a general UTF-8 library that our parser could rely on. I did this because at the time I thought we needed to be able to work with UTF-8 in order to parse the WebVTT byte stream correctly.
As I learnt about UTF8 more while writing the library I realized one major thing – all the characters that the parser needs to work with in order to parse the byte stream are represented in UTF8 with the same code points that ASCII uses. This means that in order to parse the stream we do not need to have specific functions that can deal with for example a ‘<‘ UTF8 code point, we can simply use the regular standards for working with ASCII that C code classically uses.
The support that we do need to provide for UTF8, the only thing that the WebVTT specification mentions, is conversion of the byte stream to UTF8 upon the beginning of the parser routine.
The final thing that I found out was that the WebVTT specification makes no mention of what kind of character encoding the parser should be emitting at the end of the program i.e. for rendering purposes. I was talking to a class mate who was saying that we should probably use UTF16 as this is most compatible with Firefox and many other applications and frameworks. This is probably something that we should put up for discussion in class at some point.
You can checkout the code here.
We haven’t provided a way to parse the text track cue settings yet and the cue text function does not load the data structure with the appropriate data yet.
We haven’t provided or even looked at the capability to parse only pieces of the byte stream at a time. We need to have this because when the parser will be used on a browser the browser will only provide small pieces of the WebVTT as it is downloading it.
There’s a lot of smelly code in our C parser. We just need to rewrite it.