Wednesday, March 5, 2014

Table-based Template Translation in C++

This is a great technique I devised for simplifying the optimization of translation functions.  Its broadly applicable to performance-critical code beyond software graphics.  I’d love to know if there’s a proper name for this pattern, and if anyone else has ever used it and for what?

At UIQ (a Symbian smartphone UI maker) we needed really fast bitmap blitting performance in the Symbian software graphics library and we just weren’t getting it.

If two bitmaps are the same format, you can just memcpy (or blend or whatever) pixels from one to the other.

If two bitmaps are different formats, you have to translate the source bitmap’s pixels to the destination bitmap’s format.

However, the Symbian libraries themselves were shooting us in the foot.  There were some special-cased blits for same-format source and destinations, and a hand-coded special-case for 16-bit colour with a separate 8-bit alpha mask bitmap into a 24-bit destination with 32-bit alignment, and that was it.

All other conversions went via two virtual method calls… one to convert from source to a 32-bit RGBA universal internal format, and the other to convert to destination format.

Who would use virtual inheritance per-pixel for critical-path colour conversion on a constrained device?  I know their name ;)

This was killing us.  If a theme didn’t stick to 16-bit colour with separate 8-bit alpha mask, performance went down the drain.  We actually converted-on-load PNGs in themes to this combination, even if when the screen-buffers were 32-bit.  It was even quicker to drop a JPG down to 16-bit and give it an mask opaque mask than it was to blit from RGB24 to RGB32.

Something needed to be done.

The approach I came up with was to use a 2D table of function pointers:

I have some simplified SDL-based code to hand that illustrates this approach.  Imagine there are just three formats - 1 (1 byte luminance), 3 (RGB) and 4 (RGBA).

You can define a C function type, say:

typedef void (*scanline_blit_t)(const unsigned char*,unsigned char*,int);

Now you can now define a conversion table for all pairs of formats:

static const scanline_blit_t scanline_blit[4*4] = {
	blit_1_1,	NULL,		blit_1_3,	blit_1_4,
	NULL,		NULL,		NULL,		NULL,
	blit_3_1,	NULL,		blit_3_3,	blit_3_4,
	blit_4_1,	NULL,		blit_4_3,	blit_4_4,
};

You easily dispatch to the blit implementation e.g.

void blit(const pixmap_t& src,int x,int y,const pixmap_t& dest) {
	assert_always(src.fmt() > 0 && src.fmt() <= 4);
	assert_always(dest.fmt() > 0 && dest.fmt() <= 4);
	assert_always((x >= 0) && (src.w() >= 0) && (x+src.w() <= dest.w()));
	assert_always((y >= 0) && (src.h() >= 0) && (y+src.h() <= dest.h()));
	scanline_blit_t impl = scanline_blit[(src.fmt()-1)*4+dest.fmt()-1];
	assert_always(impl);
	for(int i=0; i<src.h(); i++) {
		const unsigned char* s = src.pixel(0,i);
		unsigned char* d = dest.pixel(x,y+i);
		impl(s,d,src.w());
	}
}

… its just to implement all those blit_1_1, blit_3_1, blit_3_4 functions to do the actual blitting for all combinations of format.  And this can be tedious!

So this is where we used just a handful of template functions:

template<int fmt> void scanline_copy(const unsigned char* src,unsigned char* dest,int count) {
	count *= fmt;
	while(count--)
		*dest++ = *src++;
}

This template function can be used to do all blits that are just memcpys.

For the rest, this big switch statement works:

template<int src_fmt,int dest_fmt> void scanline_conv(const unsigned char* src,unsigned char* dest,int count) {
	unsigned char r,g,b;
	float a = 1;
	while(count--) {
		switch(src_fmt) {
		case 1:
			r = g = b = *src++;
			break;
		case 3:
			r = *src++; g = *src++; b = *src++;
			break;
		case 4:
			r = *src++; g = *src++; b = *src++; a = 255./((float)*src++);
			break;
		default:
			assert_compile(false);
		}
		switch(dest_fmt) {
		case 1:
			*dest++ = (r+g+b)/3;
			break;
		case 3:
			*dest++ = r; *dest++ = g; *dest++ = b; // discard alpha
			break;
		case 4:
			*dest++ = r; *dest++ = g; dest++ = b; *dest++ = 255.*a;
			break;
		default:
			assert_compile(false);
		}
	}
}

Every mainstream compiler will eliminate all the unused switch cases and turn it into a function with just the combination of source and destination formats.  

Compilers do a pretty good job of this kind of naive C format conversion code, so we didn’t need to go down to ARM assembly for this.

So now the function table looks like this:

static const scanline_blit_t scanline_blit[4*4] = {
	scanline_copy,	NULL,	scanline_conv<1,3>,	scanline_conv<1,4>,
	NULL,			NULL,	NULL,			NULL,
	scanline_conv<3,1>,	NULL,	scanline_copy,	scanline_conv<3,4>,
	scanline_conv<4,1>,	NULL,	scanline_conv<4,3>,	scanline_copy,
};

This example using integers 1, 3 and 4 is actually based on some SDL code I have to hand; at UIQ, we had rather more formats in enums, endian conversions, lots of blend modes and so on to deal with.  We hoisted the loop itself into template functions, and we looked at disassemblies of the generated code, and we benchmarked it on real hardware.  We hand-made some of the most critical conversions, reworked divisions and even used Duff’s Device in anger.  And we were very happy.

With all our blend modes, mask bitmaps, scaling and other ops, we got this down to just a few screenfuls of code.  A dramatic drop in repetitive boilerplate code, and much easier to understand and maintain.

This same approach is applicable beyond software blitting (who does that any more?).  It can be considered for, to invent examples, code that does string searching and supports many character encodings.  Or the code deep in a driver to convert DB column types to runtime types.

Happy coding with table-based template translation functions in C++ :) 

Notes

  1. do-nothing reblogged this from williamedwardscoder
  2. williamedwardscoder posted this

 ↓ click the "share" button below!