Tagged: c parser

November 15, 2012

WebVTT 0.3 Release : Final

Today our 0.3 release is due for the WebVTT parser. I’ve completed the cue text parsing portion of the parser and it’s sitting on my GitHub repo. The main places you can look for the code I have added are:

Much of what I discussed in my last blog post has stayed the same in the final version of the 0.3 release, including the major structure of the algorithm. However, I have changed some of the syntax to get rid of minor bugs. I won’t re-post all of that slightly changed code as it would make this blog post too long. You can either look at the GitHub links for it or check out my earlier blog post.

I’ll go over what I’ve done in the time since my last post:

I’ve completed the UTF16 append functions:

webvtt_status
append_wchar_to_wchar( webvtt_wchar *append_to, webvtt_uint len, webvtt_wchar *to_append, webvtt_uint start, webvtt_uint stop )
{
	webvtt_uint i;

	if( !append_to || !to_append )
		return WEBVTT_INVALID_PARAM;

	/* Copy to_append[start..stop) onto the end of append_to, then re-terminate. */
	for( i = len; start < stop; i++, start++ )
		append_to[i] = to_append[start];
	append_to[i] = UTF16_NULL_BYTE;

	return WEBVTT_SUCCESS;
}

webvtt_status
webvtt_string_append_wchar( webvtt_string *append_to, webvtt_wchar *to_append, webvtt_uint len )
{
	webvtt_status status;

	if( !to_append || !append_to )
		return WEBVTT_INVALID_PARAM;

	if( ( status = grow( (*append_to)->length + len, &(*append_to) ) ) != WEBVTT_SUCCESS )
		return status;

	if( ( status = append_wchar_to_wchar( (*append_to)->text, (*append_to)->length, to_append, 0, len ) ) != WEBVTT_SUCCESS )
		return status;

	(*append_to)->length += len;

	return WEBVTT_SUCCESS;
}

webvtt_status
webvtt_string_append_single_wchar( webvtt_string *append_to, webvtt_wchar to_append )
{
	webvtt_wchar temp[1];

	if( !append_to )
		return WEBVTT_INVALID_PARAM;

	temp[0] = to_append;

	return webvtt_string_append_wchar( append_to, temp, 1 );
}

webvtt_status
webvtt_string_append_string( webvtt_string *append_to, webvtt_string to_append )
{
	if( !to_append )
		return WEBVTT_INVALID_PARAM;

	/* The wchar version already validates append_to and reports any failure. */
	return webvtt_string_append_wchar( append_to, to_append->text, to_append->length );
}
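
To make the growth and terminator handling concrete, here is a minimal self-contained sketch of the same append idea. The `u16_string` type and the `u16_*` helper names are hypothetical stand-ins, not the real webvtt_string API, and the doubling growth policy is an assumption on my part.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for webvtt_string: a growable, null-terminated UTF-16 buffer. */
typedef struct {
	uint16_t *text;   /* null-terminated once allocated */
	size_t length;    /* code units in use, not counting the terminator */
	size_t alloc;     /* code units allocated, including room for the terminator */
} u16_string;

/* Make sure there is room for `needed` code units plus the terminator. Returns 0 on success. */
static int u16_grow(u16_string *s, size_t needed)
{
	size_t want = s->alloc ? s->alloc : 8;
	uint16_t *grown;

	while (want < needed + 1)
		want *= 2;
	if (want == s->alloc)
		return 0;

	grown = realloc(s->text, want * sizeof *grown);
	if (!grown)
		return -1;   /* the old buffer is still valid and still owned by s */

	s->text = grown;
	s->alloc = want;
	return 0;
}

/* Append `len` code units from `src`, keeping the result null-terminated. */
static int u16_append(u16_string *s, const uint16_t *src, size_t len)
{
	assert(s != NULL);
	if (!src)
		return -1;
	if (u16_grow(s, s->length + len) != 0)
		return -1;

	memcpy(s->text + s->length, src, len * sizeof *src);
	s->length += len;
	s->text[s->length] = 0;
	return 0;
}

/* Appending a single code unit is just the one-element case. */
static int u16_append_one(u16_string *s, uint16_t unit)
{
	return u16_append(s, &unit, 1);
}
```

Starting from a zeroed `u16_string`, repeated calls to `u16_append` or `u16_append_one` build the string up, and the caller releases it with `free( s.text )` when done.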

I’ve added in functions that compare two strings or two wchars:

webvtt_uint
webvtt_compare_wchars( webvtt_wchar  *one, webvtt_uint one_len, webvtt_wchar *two, webvtt_uint two_len )
{
	int i;

	/* Should we return a webvtt_status to account for this case here? */
	if( !one || !two )
		return 0;

	if( one_len != two_len )
		return 0;

	for( i = 0; i < one_len; i++ )
	{
		if( one[i] != two[i] )
		{
			return 0;
		}
	}

	return 1;
}

webvtt_uint
webvtt_compare_strings( webvtt_string one, webvtt_string two )
{
	if( !one || !two )
		return 0;

	return webvtt_compare_wchars( one->text, one->length, two->text, two->length );
}

I’ve changed the webvtt_string_list struct and its functions since the last blog post:

struct
webvtt_string_list_t
{
	webvtt_uint alloc;
	webvtt_uint list_count;
	webvtt_string *items;
};

webvtt_status
webvtt_create_string_list( webvtt_string_list_ptr *string_list_pptr )
{
	webvtt_string_list_ptr temp_string_list_ptr = (webvtt_string_list_ptr)malloc( sizeof(*temp_string_list_ptr) );

	if( !temp_string_list_ptr )
		return WEBVTT_OUT_OF_MEMORY;

	temp_string_list_ptr->alloc = 0;
	temp_string_list_ptr->items = NULL;
	temp_string_list_ptr->list_count = 0;

	*string_list_pptr = temp_string_list_ptr;

	return WEBVTT_SUCCESS;
}

void
webvtt_delete_string_list( webvtt_string_list_ptr string_list_ptr )
{
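	webvtt_uint i;

	if( !string_list_ptr )
		return;

	for( i = 0; i < string_list_ptr->list_count; i++ )
	{
		webvtt_delete_string( string_list_ptr->items[i] );
	}

	/* Release the array and the list itself, both of which were malloc'ed. */
	free( string_list_ptr->items );
	free( string_list_ptr );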
}

webvtt_status
webvtt_add_to_string_list( webvtt_string_list_ptr string_list_ptr, webvtt_string string )
{
	if( !string_list_ptr || !string )
	{
		return WEBVTT_INVALID_PARAM;
	}

	/* Grow the array by 4 slots only when it is full. realloc( NULL, n ) acts like malloc( n ), so the empty list needs no special case. */
	if( string_list_ptr->alloc == string_list_ptr->list_count )
	{
		webvtt_string *new_items = (webvtt_string *)realloc( string_list_ptr->items, sizeof(webvtt_string) * ( string_list_ptr->alloc + 4 ) );

		/* On failure the old array is untouched and still owned by the list. */
		if( !new_items )
			return WEBVTT_OUT_OF_MEMORY;

		string_list_ptr->items = new_items;
		string_list_ptr->alloc += 4;
	}

	string_list_ptr->items[string_list_ptr->list_count++] = string;

	return WEBVTT_SUCCESS;
}

I’ve changed the WEBVTT_CALLBACK that will call the webvtt_parse_cuetext function:

static void WEBVTT_CALLBACK
cue( void *userdata, webvtt_cue cue )
{
	webvtt_parse_cuetext( cue->payload->text, cue->node_head );
}

Before, it didn’t call the cue text parsing function, so the cue text was never parsed.

I’ve changed the function signature for webvtt_parse_cuetext:

WEBVTT_EXPORT webvtt_status
webvtt_parse_cuetext( webvtt_wchar *cue_text, webvtt_node_ptr node_ptr )

I’ve gotten rid of the webvtt_parser pointer, the line number, the length of cue_text, and the length of node_ptr that were previously in the function signature.

I removed the webvtt_parser pointer and the line number because their purpose was to let us report an error to the webvtt_parser error callback and reference the line it happened on, but the parser does not currently support this.
I removed the length of cue_text because it should always be a null-terminated buffer, so we can tell that we are at the end simply by checking for the terminator. There is no need for a separate length.
I removed the length of node_ptr because the parser no longer returns an array of node_ptr. It now returns a single node_ptr of type WEBVTT_HEAD_NODE, which contains an array of node_ptr underneath it.

I know we will be changing this in the future, but I got rid of it now to make it more clear.

The other major thing that caitp and I were talking about on IRC last night was the data structure of the nodes. Before, caitp had it set up so that an internal node and a leaf node would each contain a node, making them subclasses of node in effect. Then you could just return an array of nodes and cast each one to a particular type of node based on its node kind.

The way I have it set up right now is similar but slightly different. In my version the node contains a pointer to a leaf node or internal node and based on its node kind you can cast it to either an internal node or a leaf node.

caitp made the case for converting it back to the old format as it might be more readable and possibly take up less space in memory. This is something that we should probably discuss in the future.
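
To make the trade-off easier to discuss, here is a minimal sketch of the two layouts side by side. All of the type and function names are hypothetical and much simpler than the real parser's nodes; the point is only to show where the allocations and casts end up in each design.

```c
#include <assert.h>
#include <stdlib.h>

typedef enum { NODE_TEXT, NODE_BOLD } node_kind;   /* hypothetical kinds */

/* Layout A (embedding): the common header is the first member of each concrete node. */
typedef struct node_a {
	node_kind kind;
	struct node_a *parent;
} node_a;

typedef struct {
	node_a base;          /* must stay first so a node_a* can be cast back */
	node_a **children;
	size_t child_count;
} internal_node_a;

/* Layout B (indirection): the node points at a separately allocated concrete part. */
typedef struct node_b {
	node_kind kind;
	struct node_b *parent;
	void *concrete;       /* internal_part_b* or a leaf part, depending on kind */
} node_b;

typedef struct {
	node_b **children;
	size_t child_count;
} internal_part_b;

/* Layout A needs one allocation per node. */
static node_a *make_bold_a(void)
{
	internal_node_a *n = calloc(1, sizeof *n);
	if (!n)
		return NULL;
	n->base.kind = NODE_BOLD;
	return &n->base;
}

/* Layout B needs two allocations, and two frees later on. */
static node_b *make_bold_b(void)
{
	node_b *n = calloc(1, sizeof *n);
	if (!n)
		return NULL;
	n->concrete = calloc(1, sizeof(internal_part_b));
	if (!n->concrete) {
		free(n);
		return NULL;
	}
	n->kind = NODE_BOLD;
	return n;
}

/* Getting at the internal data: a cast in A, a pointer hop in B. */
static size_t child_count_a(node_a *n)
{
	assert(n->kind == NODE_BOLD);
	return ((internal_node_a *)n)->child_count;
}

static size_t child_count_b(node_b *n)
{
	assert(n->kind == NODE_BOLD);
	return ((internal_part_b *)n->concrete)->child_count;
}
```

In this sketch layout A saves one pointer and one allocation per node, which is roughly the memory argument, while layout B keeps the generic node the same size no matter what it points to.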

Some other things to note that we will need to take care of in the future:

I have not had the chance to test the parsing of escape characters, but the code for it is there.
It does not parse the new “lang” tag that was recently added to the W3C specification.
The memory operations in the node, token, and string list struct do not make use of the allocator functions that we have built into the framework.

Yup, that’s it. See ya.

November 9, 2012

WebVTT 0.3 Release

So it’s been a while since I posted and in that intervening time we have been hard at work on the 0.3 release of our parser.

For this release we are concentrating mainly on getting a fully *working parser* out, getting our build system up to par with a unit testing strategy, and making the parser work across all platforms, i.e. OS X, Linux, and Windows.

For our unit testing we are going with a Node-ffi solution. Node-ffi will allow us to dynamically bind our C library into a JavaScript test suite within which we can easily do unit tests. If you want to read up more about that you can check out the blog of my classmate Dale, who has volunteered to work on it.

For our build system we are using Autotools, a build system from GNU that is designed to assist in making cross-platform builds. You can check out my classmate Caitlin’s blog to read more about that.

I myself have been working more on the C parser. When my class met to discuss our 0.2 release, we chose Caitlin’s version of the parser as the one to go forward with. This is related to the ‘build two and plan to throw one away’ idea that I talked about in my previous blog posts.

I’ve been implementing the cue text parser portion of the C parser. This is the part that parses the payload of a WebVTT text track i.e. the actual text and markup that will be rendered on screen. Going down this road has also led me to work on a couple other parts of the C parser such as:

Creating some utility functions to check whether or not a UTF16 character is a digit or an alphanumeric character respectively
Harnessing our other string code which Caitlin originally worked on to be able to append UTF16 strings together
Normalizing some of the function names in our C parser which Caitlin worked on. These have to do with changing function names from webvtt_x_delete or create to webvtt_delete_x. Not that hard.
We also discussed what character encoding to use internally in our parser. We decided on UTF16 as that gives some benefits such as it being the encoding that is used on the web as well as it being a simple encoding to use unlike UTF8. I will probably be working on getting the parser to use UTF16 strings after I finish the cue text parser.

I’ll briefly go over the code I have done so far.

Cue Text Parser

For the cue text parser I followed the algorithm provided by the W3C specification very closely. Here is the main parsing method:

/**
 * Currently line and len are not being kept track of.
 * Don't think pnode_length is needed as nodes track their list count internally.
 */
webvtt_status
webvtt_parse_cuetext( webvtt_parser self, webvtt_uint line, const webvtt_wchar *cue_text,
	const webvtt_uint len, webvtt_node *pnode, webvtt_uint *pnode_length )
{
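	webvtt_wchar_ptr position_ptr = (webvtt_wchar_ptr)cue_text;
	webvtt_node_ptr current = pnode, temp_node = NULL;
	/* Start as NULL so the first delete call in the loop is not handed an uninitialized pointer (this assumes the delete function tolerates NULL). */
	webvtt_cue_text_token_ptr token_ptr = NULL;
	webvtt_node_kind kind;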

	if( !cue_text )
	{
		return WEBVTT_INVALID_PARAM;
	}

	/**
	 * Routine taken from the W3C specification - http://dev.w3.org/html5/webvtt/#webvtt-cue-text-parsing-rules
	 */
	do {

		webvtt_delete_cue_text_token( token_ptr );

		/* Step 7. */
		switch( webvtt_cue_text_tokenizer( position_ptr, token_ptr ) )
		{
		case( WEBVTT_UNFINISHED ):
			/* Error here. */
			break;
		/* Step 8. */
		case( WEBVTT_SUCCESS ):

			/**
			 * If we've found an end token which has a valid end token tag name and a tag name
			 * that is equal to the current node then set current to the parent of current.
			 */
			if( token_ptr->token_type == END_TOKEN )
			{
				if( webvtt_get_valid_token_tag_name( ((webvtt_cue_text_end_tag_token *) token_ptr->concrete_token)->tag_name, &kind ) == WEBVTT_NOT_SUPPORTED)
					continue;

				if( current->kind == kind )
					current = current->parent;
			}
			else
			{
				/**
				 * Attempt to create a valid node from the token.
				 * If successful then attach the node to the current nodes list and also set current to the newly created node
				 * if it is an internal node type.
				 */
				if( webvtt_create_node_from_token( token_ptr, temp_node, current ) != WEBVTT_SUCCESS )
					/* Do something here. */
					continue;
				else
				{
					webvtt_attach_internal_node( (webvtt_internal_node_ptr)current->concrete_node, temp_node );

					if( WEBVTT_IS_VALID_INTERNAL_NODE( temp_node->kind ) )
						current = temp_node;
				}
			}
			break;
		}

	} while( *position_ptr != UTF16_NULL_BYTE );

	return WEBVTT_SUCCESS;
}

In short – it loops, calling the tokenizer function until it has reached the end of the buffer. Based on the status returned by the tokenizer it will either emit an error (not added yet) or it will add a node to the node list depending on what kind of token is returned.
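
The tree-building part of that loop can be sketched on its own. This is a simplified illustration rather than the parser's code: the token and node types are hypothetical, tags are single characters, and nodes come from a fixed pool instead of malloc to keep it short.

```c
#include <assert.h>
#include <stddef.h>

enum token_type { TOK_TEXT, TOK_START, TOK_END };   /* hypothetical token kinds */

struct token {
	enum token_type type;
	char tag;             /* 'b', 'i', ... for tags; unused for text */
};

#define MAX_CHILDREN 8
#define MAX_NODES 32

struct tree_node {
	char tag;             /* 0 for the head node, 't' for text leaves */
	struct tree_node *parent;
	struct tree_node *children[MAX_CHILDREN];
	size_t child_count;
};

struct tree {
	struct tree_node nodes[MAX_NODES];   /* simple pool instead of malloc */
	size_t used;
};

/* Take a node from the pool and hang it under parent. Returns NULL when out of room. */
static struct tree_node *attach(struct tree *t, struct tree_node *parent, char tag)
{
	struct tree_node *n;

	if (t->used == MAX_NODES || parent->child_count == MAX_CHILDREN)
		return NULL;
	n = &t->nodes[t->used++];
	n->tag = tag;
	n->parent = parent;
	n->child_count = 0;
	parent->children[parent->child_count++] = n;
	return n;
}

/* Returns the head node, or NULL if the pool ran out. */
static struct tree_node *build_tree(struct tree *t, const struct token *tokens, size_t count)
{
	struct tree_node *head, *current, *n;
	size_t i;

	assert(t != NULL);
	head = current = &t->nodes[0];
	t->used = 1;
	head->tag = 0;
	head->parent = NULL;
	head->child_count = 0;

	for (i = 0; i < count; i++) {
		switch (tokens[i].type) {
		case TOK_TEXT:
			if (!attach(t, current, 't'))
				return NULL;
			break;
		case TOK_START:
			n = attach(t, current, tokens[i].tag);
			if (!n)
				return NULL;
			current = n;            /* descend into the new internal node */
			break;
		case TOK_END:
			/* Only a matching end tag closes the current node; others are ignored. */
			if (current != head && current->tag == tokens[i].tag)
				current = current->parent;
			break;
		}
	}
	return head;
}
```

Feeding it the tokens for something like `a<b>c</b>d` gives a head node with three children, the middle one being the `b` node with a single text leaf under it.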

You can see in the code that I have created many utility functions that do things such as creating a node from a token and creating or deleting nodes or tokens. I won’t list those functions here because it would be too much.

The other main bulk of this parser is the actual tokenizer:

webvtt_status
webvtt_cue_text_tokenizer( webvtt_wchar_ptr position_ptr, webvtt_cue_text_token_ptr token_ptr )
{
	webvtt_cue_text_token_state token_state = DATA;
	webvtt_string result, annotation;
	webvtt_string_list css_classes;
	webvtt_timestamp time_stamp;
	webvtt_status status = WEBVTT_UNFINISHED;

	if( !position_ptr )
	{
		return WEBVTT_INVALID_PARAM;
	}

	/**
	 * Loop while the tokenizer is not finished.
	 * Based on the state of the tokenizer enter a function to handle that particular tokenizer state.
	 * Those functions will loop until they either change the state of the tokenizer or reach a valid token end point.
	 */
	while( status == WEBVTT_UNFINISHED )
	{
		switch( token_state )
		{
		case DATA :
			status = webvtt_cue_text_tokenizer_data_state( position_ptr, &token_state, result );
			break;
		case ESCAPE:
			status = webvtt_cue_text_tokenizer_escape_state( position_ptr, &token_state, result );
			break;
		case TAG:
			status = webvtt_cue_text_tokenizer_tag_state( position_ptr, &token_state, result );
			break;
		case START_TAG:
			status = webvtt_cue_text_tokenizer_start_tag_state( position_ptr, &token_state, result );
			break;
		case START_TAG_CLASS:
			status = webvtt_cue_text_tokenizer_start_tag_class_state( position_ptr, &token_state, css_classes );
			break;
		case START_TAG_ANNOTATION:
			status = webvtt_cue_text_tokenizer_start_tag_annotation_state( position_ptr, &token_state, annotation );
			break;
		case END_TAG:
			status = webvtt_cue_text_tokenizer_end_tag_state( position_ptr, &token_state, result );
			break;
		case TIME_STAMP_TAG:
			status = webvtt_cue_text_tokenizer_time_stamp_tag_state( position_ptr, &token_state, result );
			break;
		}

		if( *position_ptr != UTF16_GREATER_THAN && *position_ptr != UTF16_NULL_BYTE )
			position_ptr++;
	}

	/**
	 * Code here to handle if the tokenizer status returned is not WEBVTT_SUCCESS.
	 * Most likely means it was not able to allocate memory.
	 */

	/**
	 * The state that the tokenizer left off on will tell us what kind of token needs to be made.
	 */
	if( token_state == DATA || token_state == ESCAPE )
	{
		 return webvtt_create_cue_text_text_token( token_ptr, result );
	}
	else if(token_state == TAG || token_state == START_TAG || token_state == START_TAG_CLASS ||
			token_state == START_TAG_ANNOTATION)
	{
		return webvtt_create_cue_text_start_tag_token( token_ptr, result, css_classes, annotation );
	}
	else if( token_state == END_TAG )
	{
		return webvtt_create_cue_text_end_tag_token( token_ptr, result );
	}
	else if( token_state == TIME_STAMP_TAG )
	{
		/* Parse time stamp from result. */
		return webvtt_create_cue_text_time_stamp_token( token_ptr, time_stamp );
	}
	else
	{
		return WEBVTT_NOT_SUPPORTED;
	}

	return WEBVTT_SUCCESS;
}

This function takes the byte stream and turns it into tokens that the parser can understand. My main departure from the W3C specification is that I’ve farmed each tokenizer state out to its own function, which meant changing a tiny bit of the logic: instead of using only a result and a buffer to parse the text, I have created separate webvtt_strings for the result, the buffer, and the annotation. This simplifies the code because the state functions don’t have to share just two parameters to keep track of all the parsed output. I also created a webvtt_string_list struct to hold the list of classes on a start tag in the cue text.

Here is an example of one of the functions that parses a tokenizer state:

webvtt_status
webvtt_cue_text_tokenizer_start_tag_class_state( webvtt_wchar_ptr position_ptr,
	webvtt_cue_text_token_state_ptr token_state_ptr, webvtt_string_list css_classes )
{
	webvtt_string buffer;

	CHECK_MEMORY_OP( webvtt_create_string( 1, &buffer ) );

	for( ; *token_state_ptr == START_TAG_CLASS; position_ptr++ )
	{
		if( *position_ptr == UTF16_TAB || *position_ptr == UTF16_FORM_FEED ||
			*position_ptr == UTF16_SPACE || *position_ptr == UTF16_LINE_FEED ||
			*position_ptr == UTF16_CARRIAGE_RETURN)
		{
			/* The list stores the string itself, so it now owns buffer and we must not delete it. */
			CHECK_MEMORY_OP( webvtt_add_to_string_list( css_classes, buffer ) );
			*token_state_ptr = START_TAG_ANNOTATION;
			return WEBVTT_UNFINISHED;
		}
		else if( *position_ptr == UTF16_GREATER_THAN || *position_ptr == UTF16_NULL_BYTE )
		{
			CHECK_MEMORY_OP( webvtt_add_to_string_list( css_classes, buffer ) );
			return WEBVTT_SUCCESS;
		}
		else if( *position_ptr == UTF16_FULL_STOP )
		{
			/* Hand the finished class name to the list and start a fresh buffer for the next one. */
			CHECK_MEMORY_OP( webvtt_add_to_string_list( css_classes, buffer ) );
			CHECK_MEMORY_OP( webvtt_create_string( 1, &buffer ) );
		}
		else
		{
			CHECK_MEMORY_OP( webvtt_string_append_wchar( buffer, position_ptr, 1 ) );
		}
	}

	/* Only reached when the tokenizer was not in the class state to begin with. */
	webvtt_delete_string( buffer );
	return WEBVTT_UNFINISHED;
}

Each one of the tokenizer state functions loops until either it changes the state of the tokenizer, which means it needs to start parsing it in another one of the tokenizer state functions, or it reaches a ‘termination’ point i.e. a point where either a valid token has been parsed or where it has come across the end of the byte stream prematurely.
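
Here is a cut-down, self-contained sketch of that loop-until-state-change idea, with just two states and plain chars instead of UTF16. The names are hypothetical. One deliberate difference from the listings above is that the position is passed as a pointer to a pointer, so that the caller can see how far each state function advanced.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum tok_state { ST_DATA, ST_TAG };
enum tok_kind { KIND_TEXT, KIND_START_TAG, KIND_NONE };

struct simple_token {
	enum tok_kind kind;
	char text[32];
	size_t length;
};

/* Append one character, silently dropping it if the fixed buffer is full. */
static void push(struct simple_token *tok, char c)
{
	if (tok->length + 1 < sizeof tok->text) {
		tok->text[tok->length++] = c;
		tok->text[tok->length] = '\0';
	}
}

/* DATA state: collect text until '<' or end of input. Returns 1 when a token is complete. */
static int data_state(const char **pos, enum tok_state *state, struct simple_token *tok)
{
	for (; **pos != '\0'; (*pos)++) {
		if (**pos == '<') {
			if (tok->length > 0)
				return 1;          /* emit the text; '<' is handled by the next call */
			(*pos)++;
			*state = ST_TAG;
			return 0;              /* state changed, token not finished yet */
		}
		push(tok, **pos);
	}
	return 1;
}

/* TAG state: collect the tag name until '>' or end of input. */
static int tag_state(const char **pos, struct simple_token *tok)
{
	tok->kind = KIND_START_TAG;
	for (; **pos != '\0'; (*pos)++) {
		if (**pos == '>') {
			(*pos)++;              /* consume '>' */
			return 1;
		}
		push(tok, **pos);
	}
	return 1;
}

/* Drive the state functions until one of them reports a finished token. */
static void next_token(const char **pos, struct simple_token *tok)
{
	enum tok_state state = ST_DATA;
	int done = 0;

	assert(pos != NULL && *pos != NULL && tok != NULL);
	tok->kind = KIND_TEXT;
	tok->length = 0;
	tok->text[0] = '\0';

	if (**pos == '\0') {
		tok->kind = KIND_NONE;
		return;
	}

	while (!done) {
		switch (state) {
		case ST_DATA: done = data_state(pos, &state, tok); break;
		case ST_TAG:  done = tag_state(pos, tok); break;
		}
	}
}
```

Calling `next_token` repeatedly on the same position pointer walks through the input one token at a time and reports `KIND_NONE` once the input is used up.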

CHECK_MEMORY_OP is just a macro that takes the returned webvtt_status and compares it to see if it was a success. If it was not then it returns the status that the memory operation returned. One problem that I have here is that since it returns immediately there is no place to deallocate memory that may have been allocated in the function. Should be easy to fix but I haven’t gotten around to it yet.
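
One common way around the early-return problem is to have the macro jump to a cleanup label instead of returning. The sketch below is only an illustration of that pattern: `CHECK_OR_CLEANUP`, `status_t`, and the buffer helpers are hypothetical names, not part of the parser.

```c
#include <assert.h>
#include <stdlib.h>

typedef enum { ST_OK = 0, ST_NO_MEMORY = -1 } status_t;   /* hypothetical status type */

/* On failure, record the status and jump to the function's cleanup label instead of returning. */
#define CHECK_OR_CLEANUP(expr) \
	do { \
		status = (expr); \
		if (status != ST_OK) \
			goto cleanup; \
	} while (0)

static int live_buffers = 0;   /* counts outstanding allocations for the demonstration */

/* Allocates a buffer, or pretends the allocation failed when should_fail is set. */
static status_t make_buffer(char **out, int should_fail)
{
	assert(out != NULL);
	*out = should_fail ? NULL : malloc(16);
	if (!*out)
		return ST_NO_MEMORY;
	live_buffers++;
	return ST_OK;
}

static void drop_buffer(char *buf)
{
	if (buf) {
		free(buf);
		live_buffers--;
	}
}

/* Allocates two buffers; the second allocation can be forced to fail. */
static status_t do_work(int fail_second)
{
	status_t status = ST_OK;
	char *first = NULL, *second = NULL;

	CHECK_OR_CLEANUP(make_buffer(&first, 0));
	CHECK_OR_CLEANUP(make_buffer(&second, fail_second));

	/* ... real work with both buffers would go here ... */

cleanup:
	/* Runs on both the success and the failure path. */
	drop_buffer(second);
	drop_buffer(first);
	return status;
}
```

Because every pointer starts out as NULL and the cleanup code tolerates NULL, the same block releases whatever was allocated before the failure.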

UTF16 String Manipulations

I haven’t completed some of the functions that I call in the parser code for UTF16 strings such as the functions that handle appending webvtt_wchars or webvtt_strings to webvtt_strings, but I will be working on that next. I also need to implement a function that will hopefully take a string literal and append it to a webvtt_string.

The functions for webvtt_strings that I have working so far are the is_digit, is_alphanumeric, and add to webvtt_string_list functions:

webvtt_status
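webvtt_add_to_string_list( webvtt_string_list string_list, webvtt_string string )
{
	/* Note: string_list is passed by value here, so the caller never sees the updated items or list_count. It needs to become a pointer. */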
	if( !string )
	{
		return WEBVTT_INVALID_PARAM;
	}

	if( !string_list.items )
	{
		string_list.list_count = 0;
		string_list.items = (webvtt_string *)malloc( sizeof( webvtt_string ) );
	}
	else
	{
		string_list.items = (webvtt_string *)realloc( string_list.items,
			sizeof( webvtt_string ) * ( string_list.list_count + 1 ) );
	}

	if( string_list.items )
	{
		string_list.items[string_list.list_count] = string;
		string_list.list_count++;
	}
	else
		return WEBVTT_OUT_OF_MEMORY;

	return WEBVTT_SUCCESS;
}

webvtt_uint
webvtt_is_alphanumeric( webvtt_wchar character )
{
	return ( character >= UTF16_DIGIT_ZERO && character <= UTF16_DIGIT_NINE ) ||
			  ( character >= UTF16_CAPITAL_A && character <= UTF16_CAPITAL_Z ) ||
			  ( character >= UTF16_A && character <= UTF16_Z );

}

webvtt_uint
webvtt_is_digit( webvtt_wchar character )
{
	return character >= UTF16_DIGIT_ZERO && character <= UTF16_DIGIT_NINE;
}

One final thing that I want to mention is that all this code is completely untested so far. Once I get the webvtt_string functions in place I will start to debug and test it. If you want to check out the entire code that I’ve been working on you can see that here. The main places you can look in are cuetext, cue, and string. Our 0.3 release is due this coming Thursday. I’m aiming to have this cue text parser done along with the changeover to UTF16 everywhere in the parser, as well as converting my old tests from the 0.1 release to unit tests using the new Node-ffi JavaScript test suite.

Later!

October 26, 2012

WebVTT 0.2 Release Update – Oct 26, 2012

My partner, Shayan Zafar, and I have been hard at work on the 0.2 release of our C parser. We’ve done a lot of work on it. I’ll briefly go over some of the changes we have made.

We presented our initial code and development to our class last Thursday and we got quite a few good comments on how to improve the code, all of which we later implemented:

Change the naming conventions we were using to more C style naming conventions. At the time we were using camel case to name our variables but this is not the way you usually do it in C. In C you usually write variable and function names in all lower case with _ to separate words.
Change some while loops that we had that could be changed into for loops and simplified.
Change our error logging system to be contained within our WebVTT buffer info struct instead of as a global variable.
Change one function that we used to advance a pointer past a single line ending to advance past any number of line endings that might occur.
Change our error logging code to be able to take in a -debug command on the command line in order to tell the program whether or not it needs to log errors.
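
As an illustration of the line ending change in the list above, a minimal sketch might look like this. The function name is hypothetical, and working directly on a plain char buffer is an assumption about how such a helper would be used.

```c
#include <assert.h>
#include <stddef.h>

/* Returns a pointer to the first character that is not part of a line ending. */
static const char *skip_line_endings(const char *position)
{
	assert(position != NULL);
	/* Treat any run of CR and LF characters as skippable, however long it is. */
	while (*position == '\r' || *position == '\n')
		position++;
	return position;
}
```

Because it stops at the first character that is neither CR nor LF, it also stops safely at the terminating null byte of an empty or all-newline buffer.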

The class was also really good because it got me thinking of all the things that we would need to be thinking about for our final implementation, such as the program being UTF-8 compliant. The Prof pointed out to us certain functions that we were using, such as many C string functions, which would not behave the way we would want when using UTF-8.

Overall, from the class I began to realize how much harder this is going to be than what I was originally expecting.

There was also a comment from the Prof about possibly squashing the data structure that we made with structs because many of them contain the same data. As of yet I’m still thinking about what is the best thing to do. What will help is when we are able to begin to actually write the code that will be parsing a text track cue text. At that time we will be able to see how the data structure will fit in to our current C parser implementation and so we will be able to see what can be thrown away or changed more easily.

I’m hoping that my partner and I can finish the parser by this Monday… we have the majority of it done so far. However, some of the hardest parts are left such as parsing the text track cue text. Even if we don’t finish it I will still be happy that we can deliver what we have done so far. It doesn’t help that I also have a test coming up on that Monday followed by an assignment for another class due on the Tuesday…

The most interesting thing I did this week was to begin to work on a UTF-8 library that we can rely on for our C parser. To give you an overview of UTF-8:

UTF-8 encodes ‘characters’ called code points.
Each code point is encoded as a variable number of bytes.
A multi-byte code point consists of one leading byte followed by one or more continuation bytes; a single-byte (ASCII) code point is just one byte with a zero high order bit.
A UTF-8 leading byte has two or more high order ones followed by a zero.
A UTF-8 continuation byte has a single high order one followed by a zero.
The number of high order ones in a leading byte gives the total number of bytes in the code point, so there is one fewer continuation byte than there are high order ones.

For more information you can check out this.
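
The bit patterns described above can be written down directly. This is a minimal sketch with hypothetical function names; it follows the four byte limit from RFC 3629 rather than the older six byte forms, and it does not try to reject overlong encodings.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Returns the total number of bytes in the sequence that starts with `lead`
 * (1 to 4), or 0 if `lead` cannot start a sequence.
 */
static int utf8_sequence_length(unsigned char lead)
{
	if ((lead & 0x80) == 0x00) return 1;   /* 0xxxxxxx: ASCII */
	if ((lead & 0xE0) == 0xC0) return 2;   /* 110xxxxx */
	if ((lead & 0xF0) == 0xE0) return 3;   /* 1110xxxx */
	if ((lead & 0xF8) == 0xF0) return 4;   /* 11110xxx */
	return 0;                              /* 10xxxxxx or 11111xxx */
}

/* A continuation byte always looks like 10xxxxxx. */
static int utf8_is_continuation(unsigned char byte)
{
	return (byte & 0xC0) == 0x80;
}

/* Returns the length of the sequence at s if its bytes are well formed, otherwise 0. */
static int utf8_check_sequence(const unsigned char *s)
{
	int i, length;

	assert(s != NULL);
	length = utf8_sequence_length(s[0]);
	for (i = 1; i < length; i++) {
		/* A premature null byte is not a continuation byte, so this also stops at the end of the string. */
		if (!utf8_is_continuation(s[i]))
			return 0;
	}
	return length;
}
```

Checking the shortest patterns first means plain ASCII text, which is usually the most common case, is recognized with a single comparison.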

The bulk of the UTF-8 code that I wrote for the library is within one function:

int is_utf8_code_point(char *s) {
	int i, required_number_of_bytes = 0;
	/*
	* These variables hold the values that allow the program to check if the first byte of a UTF8 code point conforms to the specification.
	* They are unsigned so that the shifts below stay within 8 bits and never touch a sign bit.
	* The shifting assumes the two constants are the 6 byte pattern 1111 110X and its mask 1111 1110.
	*/
	unsigned char code_point_start = MAX_UTF8_CODE_POINT_IDENTIFIER, code_point_start_and = MAX_UTF8_CODE_POINT_IDENTIFIER_AND;
	unsigned char lead = (unsigned char)*s;

	/* A byte with a zero high order bit is a complete one byte (ASCII) code point. */
	if ((lead & 0x80) == 0)
		return 1;

	/*
	* The original UTF8 design allows a maximum of 6 bytes in a code point (RFC 3629 later restricted this to 4).
	* This will check the first byte of a code point to see if it conforms to any one of the code point byte identifiers.
	* It starts with the maximum byte identifier available and shifts the bits to the left by one each iteration to check the next shorter code point identifier.
	* If it runs out of code point identifiers it will return a fail.
	*/
	for (i = 6; i > 1; i--) {
		/* == binds tighter than &, so the mask has to be parenthesized. */
		if ((lead & code_point_start_and) == code_point_start) {
			required_number_of_bytes = i;
			s++;
			break;
		}
		code_point_start <<= 1;
		code_point_start_and <<= 1;
	}

	/*
	* If we didn't find the required number of bytes that this code point specifies then that means that we did not find a valid UTF8 code point identifier.
	*/
	if (!required_number_of_bytes)
		return 0;

	/*
	* This will loop through the required number of bytes that the code point identifier needs as specified by UTF8.
	* It starts at 1 because the code point identifier byte has already been found and counted.
	* If any one of the remaining bytes does not match the bit pattern 10XX XXXX, where X represents possible bits to encode the character, then it will return a fail.
	*/
	for (i = 1; i < required_number_of_bytes; i++) {
		if (*s == NULL_BYTE)
			return 0;
		if (((unsigned char)*s & UTF8_CODE_POINT_INTERNAL_BYTE_IDENTIFIER_AND) == UTF8_CODE_POINT_INTERNAL_BYTE_IDENTIFIER)
			s++;
		else
			return 0;
	}

	return required_number_of_bytes;
}

The function determines if the passed char array contains a UTF-8 code point. It assumes that the first char pointed to is the leading byte. It will then do a series of & operations on the leading byte to determine if it has the proper amount of high order bits and zeros. If it finds a match it will then loop through the continuation bytes to see if the amount of continuation bytes that the leading byte requires is met.

Some things to note:

This code is completely untested right now. I have no idea if it is completely correct or if there are bugs in it. I might get the chance to start working with it before Monday.
I don’t know if this is the most efficient way to check for a code point. When it looks for a proper leading byte it starts by comparing against the highest number of high bits that a leading byte could have and then proceeds to bit shift left until a match is found or it gets to the lowest amount possible. I was thinking that it might be better to start at the lowest amount of high order bits and then check the next highest, but this would take more arithmetic than a simple bit shift.

Using this function I can do all kinds of things such as:

Calculate the length of a UTF-8 string in code points:

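int utf8_string_length(char *s) {
	int i, bytes;

	for (i = 0; *s != NULL_BYTE; i++) {
		/* is_utf8_code_point returns the size of the code point in bytes, or 0 on failure. */
		if (!(bytes = is_utf8_code_point(s)))
			return -1;
		/* Step over the whole code point, otherwise the loop never advances. */
		s += bytes;
	}

	return i;
}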

Check whether or not a string is UTF-8:

int string_is_utf8(char *s) {
	int bytes;

	while (*s != NULL_BYTE) {
		if (!(bytes = is_utf8_code_point(s)))
			return 0;
		/* Move on to the next code point. */
		s += bytes;
	}

	return 1;
}

Retrieve a sub string from a UTF-8 string:

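int utf8_substring(char *source, char **out, int number_of_code_points) {
	int byte_size = utf8_string_length_in_bytes(source, number_of_code_points);

	if (byte_size == -1)
		return 0;

	/* out is a pointer to the caller's pointer so that the allocation is visible to the caller. */
	*out = (char *)malloc(sizeof(char) * (byte_size + 1));
	if (!*out)
		return 0;

	/* strncpy does not terminate the copy when it stops at byte_size, so do it by hand. */
	strncpy(*out, source, byte_size);
	(*out)[byte_size] = NULL_BYTE;

	return 1;
}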

These functions are also untested, hopefully I will get the time to test them soon.

Even if we don’t end up using this code and instead we go with some open source UTF-8 library, this will have been one of the most fun things I have done so far this year. It’s very rare that I ever get to work with individual bits and bytes at my job or on projects. Most programming languages today are very high level and such low level things are already taken care of for you.

I also did other miscellaneous things such as organizing code better, filling in comments, fixing bugs, filling in utility functions we will need such as functions to create and destroy structs, etc.

The next two major things my partner and I will be working on from now until Monday are the last two things we will need to do for our 0.2 release. Those are parsing the text track cue text and parsing the time stamp and settings.

See yah!

October 17, 2012

WebVTT 0.2 Release

For the 0.2 release the Professor got us to each sign up on different aspects of the development process. We could choose from many different categories:

Documentation
Testing
Solving Bugs in other WebVTT projects such as the online JS Web Validator or the C++ implementation in WebKit
Turning the JS Validator into a full blown JS Parser in order to be able to use that instead of the C parser on browsers that are too old to support the track element like IE8
Writing the C Parser
Fuzz testing
Maintaining the Build System
Continuous Integration (the process of compiling and running the build on every commit to GitHub in order to know if a commit has broken the code)

I chose to sign up for writing the C parser and also creating and maintaining the build system. Currently we have three teams of two people working on three separate implementations of the C parser. We are doing this in order to adhere to the design philosophy of ‘write two and plan to throw one away’. When our 0.2 release lands we will go about selecting the best parts of each and integrating them into the real C parser that we will be releasing in the end.

So far we, my partner and I, have started work on the C parser and we have run into a few issues that we had to think about pretty hard. The first is that we are writing this in C, so it cannot be object oriented, but the WebVTT specification assumes that you will be using OO. You can tell this by looking at some of the terms it uses to describe the data structure that the parser will be emitting. It talks about using classes and ‘concrete classes’ to define implementations of interfaces, etc.

We started talking about this with classmates and were trying to figure out ways to turn C into OO, but as soon as the Prof heard of this he told us – “When in C do as the Cs do”. Which makes sense. You want to use the language the way it was intended, otherwise we should just use C++. I know there are some ways you can work around C’s lack of OO to get a general approximation, but these are all clunky and generally obfuscate the code in my opinion.

So we set out to find a way to respect C’s lack of OO while still conforming to an approximation of the specification. What we decided on was a kind of inheritance structure built from structs: a container struct holds a void pointer, which points to the concrete struct, and an enumeration that identifies what that concrete struct is. This way the enumeration tells you what you must cast the void pointer to in order to get the appropriate data. The data structure that the WebVTT specification asks for is a tree made of InternalNodes, which can contain other InternalNodes and LeafNodes, and LeafNodes, which are terminal nodes, i.e. those which cannot contain other nodes.

Here is an example of what we came up with:

struct Node
{
	int mode;
	union
	{
		struct InternalNode *internalNode;
		struct LeafNode *leafNode;
	};
};

struct InternalNode
{
	struct Node *nodes;

	enum InternalNodeType internalNodeType;
	void *concreteNode;
};

struct LeafNode
{
	enum LeafNodeType leafNodeType;
	void *concreteNode;
};

The Node class is the base which can either be an implementation of LeafNode or InternalNode. Both of those implementations contain a void pointer and an enumeration that specifies what kind of struct the void pointer is. For example the InternalNode enumeration might be Bold, Italic, etc. The InternalNode class also has a list of Node structs that contains the nodes nested within it.

In this way if we wanted to render a Bold WebVTT cue text we would (in pseudocode):

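if (node->mode == 1)
{
	switch (node->internalNode->internalNodeType)
	{
		case Bold:
			/* concreteNode is a void pointer, so it is cast to a pointer to the concrete struct. */
			RenderBold((struct BoldNode *)node->internalNode->concreteNode);
			break;
	}
}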

I don’t know if this is the easiest, or best way of doing this, but I guess that’s what learning is for!
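
For anyone who wants to try the enumeration plus void pointer idea in isolation, here is a small compilable sketch. The struct names and the `describe` function are hypothetical and stand in for the real node types and rendering code.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

enum internal_kind { KIND_BOLD, KIND_ITALIC };

struct bold_node { int weight; };      /* hypothetical concrete payloads */
struct italic_node { int slant; };

struct internal_node {
	enum internal_kind kind;   /* says what concrete points at */
	void *concrete;
};

/* Writes a description of the node into out and returns out, or NULL for an unknown kind. */
static const char *describe(const struct internal_node *node, char *out, size_t size)
{
	assert(node != NULL && node->concrete != NULL && out != NULL);
	switch (node->kind) {
	case KIND_BOLD:
		/* The kind told us the void pointer is really a bold_node. */
		snprintf(out, size, "bold(%d)", ((struct bold_node *)node->concrete)->weight);
		return out;
	case KIND_ITALIC:
		snprintf(out, size, "italic(%d)", ((struct italic_node *)node->concrete)->slant);
		return out;
	}
	return NULL;
}
```

The compiler cannot check that the kind and the pointer agree, so whoever builds a node is responsible for setting the two consistently.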

One of the other interesting things that we have implemented is a struct called WebVttBufferInfo that keeps track of the buffer information of the WebVTT file. That looks like this so far:

struct WebVttBufferInfo
{
	// Will hold the input buffer
	char *inputBuffer;
	// Pointer into input buffer that denotes the current position
	char *position;
	// Represents a line that has been collected from the input buffer i.e. from beginning of line until CR(LF)
	char *currentLine;

	enum WebVttBufferInfoState state;
};

If you want to check out the work done so far you can go here.

I have not started anything for the build system yet. That is mainly because my partner and I wanted to get the C parser more fleshed out first so we could more easily see what we needed to divide up. At that point, which I think we will reach in a day or two, we can assign different things and I can step away from it for some time to take a look at the build system.

I do know that for the build system we will need to:

Create an auto configure file to check and configure our build environment before we build
Make the build environment capable of cross platform development – Linux, OSX, and Windows
There are also some bugs that I need to take care of having to do with correct test failure and pass counting

We’ve got a lot ahead of us. The 0.2 release is due on Oct 29, so I have to get back to work!!

See yeah.