Monday, October 19, 2009

The new XML: High Performance Serializers

While JSON has been regarded for some time as a good alternative to XML, binary data serializers such as Thrift and Protocol Buffers have more recently been gaining traction for their performance and compact output. Just the other day I came across Avro (now part of the Hadoop project), which positions itself as a direct competitor to Thrift and ProtoBufs. Avro looks neat, but there are a few other data marshallers worth looking at too.

The Thrift ProtoBuf Compare project is a great resource. They've already added Avro to their benchmarks, and the results look stellar. Personally, I was never fond of the generated-code interface used by Thrift and PB. Avro takes a different approach from both: it is less strictly typed and a little more dynamic, which makes writing the serialization code by hand relatively easy. More importantly, since Avro uses named (rather than indexed) fields, it's still easy to handle schema changes and maintain forward/backward compatibility when the data model changes.
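To make the named-field point concrete, here is a minimal sketch of name-based resolution in plain Java. It is not Avro's API and not its wire format; `NamedFieldReader`, `resolve`, and the field names in the usage note are hypothetical, chosen only for illustration. The idea is simply that a reader looks fields up by name, falls back to a default when the writer didn't know about a field, and ignores fields it doesn't know about itself.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: illustrates resolving fields by name, not Avro itself.
class NamedFieldReader {

    /**
     * Builds the record the reader expects from whatever the writer produced.
     *
     * @param written        field name -> value, as produced by the writer's version of the model
     * @param readerDefaults field name -> default value, one entry per field the reader expects
     */
    static Map<String, Object> resolve(Map<String, Object> written,
                                       Map<String, Object> readerDefaults) {
        Map<String, Object> result = new LinkedHashMap<>();
        for (Map.Entry<String, Object> expected : readerDefaults.entrySet()) {
            String name = expected.getKey();
            // Field present in the written data: use it.
            // Field added after the data was written: fall back to the default.
            Object value = written.containsKey(name) ? written.get(name) : expected.getValue();
            result.put(name, value);
        }
        // Fields the writer had but the reader no longer expects are skipped.
        return result;
    }
}
```

For example, if the written record has `name` and `legacyId` while the reader expects `name` and `email`, the result contains the written `name`, the default `email`, and no `legacyId`. With positional (indexed) fields the same change would need explicit bookkeeping about which position means what in each version.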

Admittedly, the serialization and deserialization process is still a bit fragile. It's definitely a step up from Java's Externalizable interface, though, where you're not only responsible for maintaining the correct order in which fields are serialized and deserialized, but you also need to worry about null fields, which often means adding an extra boolean to the stream to indicate whether the following field is null... Fun times. So we're moving in the right direction, but we're not quite there yet in terms of simplicity.
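For anyone who hasn't had the pleasure, this is roughly what the Externalizable pattern looks like. It's a minimal sketch: the `Person` class, its fields, and the `roundTrip` helper are hypothetical, and `name` is assumed to never be null. Note how `readExternal` has to mirror `writeExternal` field for field, and how the nullable `nickname` needs its own boolean marker in the stream.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.Externalizable;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectInputStream;
import java.io.ObjectOutput;
import java.io.ObjectOutputStream;

// Hypothetical domain class, used only to illustrate the pattern.
class Person implements Externalizable {
    private String name;      // assumed to never be null
    private String nickname;  // may be null
    private int age;

    // Externalizable requires a public no-arg constructor for deserialization.
    public Person() {}

    Person(String name, String nickname, int age) {
        this.name = name;
        this.nickname = nickname;
        this.age = age;
    }

    String getName() { return name; }
    String getNickname() { return nickname; }
    int getAge() { return age; }

    @Override
    public void writeExternal(ObjectOutput out) throws IOException {
        out.writeUTF(name);
        // writeUTF cannot take null, so write a presence flag first.
        out.writeBoolean(nickname != null);
        if (nickname != null) {
            out.writeUTF(nickname);
        }
        out.writeInt(age);
    }

    @Override
    public void readExternal(ObjectInput in) throws IOException {
        // Must read in exactly the order writeExternal wrote.
        name = in.readUTF();
        nickname = in.readBoolean() ? in.readUTF() : null;
        age = in.readInt();
    }

    // Serializes to a byte array and back, to show the two methods agree.
    static Person roundTrip(Person original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(original);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (Person) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("round trip failed", e);
        }
    }
}
```

Swap two lines in one method but not the other, or forget the flag for a nullable field, and nothing complains until the data is read back at runtime.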

Analysing the TPC results further, it seems that the other alternative is the category of streaming XML serializers. Using FastInfoset (i.e. binary XML), it seems that you can pick up most of the performance and size benefit without abandoning the XML data model or moving to a serializer which relies on code generation. However, looking at the source code, it becomes apparent that these APIs are meant for speed, not simplicity, and I've quickly come to the conclusion that for all but the simplest domain model, the serialization code becomes too complex to manage efficiently.

Personally, I'm excited about the JsonMarshaller results. Not because it's particularly competitive in terms of relative performance, but because the code is so simple. Compared to the amount of code required by some of the other serializers, the triviality of the JsonMarshaller API is a relief. It's also worth asking whether the performance of a simpler mechanism like JsonMarshaller isn't already "good enough." That is, unless the serialized data needs to be extremely compact or the serialization extremely fast, a 3x decrease in performance might well be acceptable given the considerably simplified code. One might argue "that's why Thrift and ProtoBufs are so good! You get the performance and a simplified (albeit generated) API!"

Ok, so maybe you're right. But I guess I wrote this to talk about the alternatives to Thrift and PB since they're probably already the most widely used.
