Friday, November 06, 2009

Serialization/Streaming Protocols: What have we got?

It takes a huge effort to build a friendly API and a community around it. But once you have a popular service API, the next challenge is handling the traffic. It doesn't have to be an external API; it can be your own web front end posting requests to a backend service layer.

As the user base explodes, every bit saved is bandwidth and money saved. This applies to mobile clients as well. With so much hosted in the cloud these days, it matters how much bandwidth you use and how few resources you consume.

Two things magnify the problem:

1) User base - if the user base is really large, then even transferring 1MB per user over the wire is going to hit the wall. Imagine 1 million users trying to access your webpage.

2) Amount of data transferred - if you are transferring huge amounts of data, say your website is a cloud-based storage system or an online cloud database, then performance is again going to hit the wall soon.
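To put rough numbers on the two points above, here is a back-of-the-envelope sketch. The user count and payload size come from the example above; the 30% saving is purely a hypothetical figure to show how a smaller encoding scales, not a measured result for any protocol.

```python
# Back-of-the-envelope estimate; every figure here is an illustrative assumption.
USERS = 1_000_000          # 1 million users
PAYLOAD_BYTES = 1_000_000  # 1MB per user (decimal megabyte)

def total_transfer_gb(users: int, bytes_per_user: int) -> float:
    """Total bytes sent for one request per user, expressed in gigabytes."""
    return users * bytes_per_user / 1e9

baseline_gb = total_transfer_gb(USERS, PAYLOAD_BYTES)

# Hypothetical: a more compact serialization trims each payload by 30%.
trimmed_gb = total_transfer_gb(USERS, PAYLOAD_BYTES * 70 // 100)

print(f"baseline: {baseline_gb:.0f} GB, trimmed: {trimmed_gb:.0f} GB")
```

One request per user already adds up to about a terabyte, so even a modest reduction per message is worth hundreds of gigabytes.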

So to move your objects from server to client, you need to look at several serialization options. I will start with some standard ones, and then list some recent ones that sound interesting.

XML:

Human-readable and machine-parseable at the same time, but probably the most verbose serialization option we have. The human-readable advantage also fades quickly as the size of the XML file goes up.

JSON:

JSON (pronounced like "Jason") stands for JavaScript Object Notation. It's pretty popular with AJAX and JavaScript-based web libraries. It keeps the data compact and saves us from the verbosity of XML. JSON is a text format and doesn't have native support for binary data, so binary payloads have to be encoded as text first (Base64, for example).
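To make the XML versus JSON difference concrete, here is a minimal sketch using only the Python standard library. The record and its field names are made up for illustration, and the exact byte counts depend on how you choose to lay out the XML, so treat it as a rough comparison rather than a benchmark. The second half shows the usual workaround for binary data in JSON, which is to wrap it in Base64.

```python
import base64
import json
import xml.etree.ElementTree as ET

# Hypothetical record, used only for this comparison.
record = {"id": 42, "name": "Ada", "email": "ada@example.com"}

# Compact JSON: no spaces after separators.
json_bytes = json.dumps(record, separators=(",", ":")).encode("utf-8")

# One possible XML layout: a child element per field.
root = ET.Element("user")
for key, value in record.items():
    ET.SubElement(root, key).text = str(value)
xml_bytes = ET.tostring(root)

print(len(json_bytes), "bytes as JSON,", len(xml_bytes), "bytes as XML")

# JSON has no binary type, so raw bytes must travel as text.
blob = bytes(range(256))
wrapped = json.dumps({"blob": base64.b64encode(blob).decode("ascii")})
restored = base64.b64decode(json.loads(wrapped)["blob"])
```

Base64 turns every 3 bytes into 4 characters, so binary data carried this way grows by roughly a third. That overhead is one reason the binary protocols below are attractive.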

Hessian:

Hessian has been around for a while, and it is quite popular in the J2ME world because of its small dependencies and efficient binary protocol. It started with the Hessian 1.0 spec and has now reached Hessian 2.0. The Hessian 2.0 spec seems to be quite comparable with any of the recently released protocols.

Protocol Buffers:

Coming from Google, it is reasonable to expect good scalability and performance. It has both a text format and a binary format; the binary format is what gets sent across the wire. You first create an interface file (.proto) describing the fields, and compile it to classes in Java or any other supported language. Then you can serialize your objects to the binary format and deserialize them back. The main drawback is that you have to specify the interface and compile it before use, but having things statically compiled gives you some performance advantages. Binary data is supported as a field type in the message structure.
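To get a feel for why the binary format is compact, here is a hand-rolled sketch of how two fields would be laid out on the wire, as I understand the Protocol Buffers encoding. It is not how you would use the library in practice (normally the protoc compiler generates the classes for you); the message layout, field numbers, and function names are assumptions made up for this example.

```python
# Hypothetical message, equivalent to a .proto definition with
#   int32 a = 1;  string b = 2;
# Real code would use protoc-generated classes instead of these helpers.

def encode_varint(value: int) -> bytes:
    """Base-128 varint: 7 bits per byte, high bit set on all but the last byte."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative integers")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def encode_int_field(field_number: int, value: int) -> bytes:
    key = (field_number << 3) | 0  # wire type 0 = varint
    return encode_varint(key) + encode_varint(value)

def encode_bytes_field(field_number: int, payload: bytes) -> bytes:
    key = (field_number << 3) | 2  # wire type 2 = length-delimited
    return encode_varint(key) + encode_varint(len(payload)) + payload

message = encode_int_field(1, 150) + encode_bytes_field(2, b"testing")
print(message.hex())  # 089601120774657374696e67
```

Notice that field names never appear on the wire, only the field numbers from the .proto file. That is where much of the size saving over XML and JSON comes from, and also why both sides need the compiled definition.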

Apache Thrift:

Thrift was originally created and used within Facebook, and later released as an Apache open source project. It is pretty much similar to Google's Protocol Buffers, with the same define-compile-use cycle. You define the message structure in a .thrift file, compile it using the Thrift compiler, and use the generated code in your services/clients. Apache Thrift has poor documentation compared to the other protocols.

Apache Avro:

This is one of the sub-projects of Apache Hadoop, a framework for Java inspired by Google's MapReduce. Yahoo! contributes heavily to the project and is said to use it extensively in its infrastructure. One of Avro's design goals is to support dynamic typing, that is, to be able to exchange information without the compile-use cycle. The schema of the data structure is defined in JSON and exchanged during the initial interaction; for the rest of the transfers, the client uses that schema to read the data.
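Here is a minimal sketch of that idea: the schema is plain JSON, both sides hold a copy of it, and the payload then carries only the values. This is not the Avro library; the helpers are hand-written and cover just two types, following my understanding of Avro's binary encoding (zigzag varints for longs, length-prefixed UTF-8 for strings). The schema and record are hypothetical.

```python
import json

# Hypothetical schema for illustration; Avro schemas are written in JSON.
SCHEMA = json.loads("""
{"type": "record", "name": "User",
 "fields": [{"name": "id", "type": "long"},
            {"name": "name", "type": "string"}]}
""")

def write_long(n: int) -> bytes:
    """Zigzag-encode a signed 64-bit integer, then emit it as a varint."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)

def read_long(buf: bytes, pos: int):
    """Inverse of write_long; returns (value, next position)."""
    shift = z = 0
    while True:
        byte = buf[pos]
        pos += 1
        z |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            break
    return (z >> 1) ^ -(z & 1), pos

def encode(schema: dict, record: dict) -> bytes:
    """Write field values in schema order; no names or tags go on the wire."""
    out = bytearray()
    for field in schema["fields"]:
        value = record[field["name"]]
        if field["type"] == "long":
            out += write_long(value)
        elif field["type"] == "string":
            raw = value.encode("utf-8")
            out += write_long(len(raw)) + raw
        else:
            raise NotImplementedError(field["type"])
    return bytes(out)

def decode(schema: dict, buf: bytes) -> dict:
    """Walk the schema to know what to read next from the buffer."""
    record, pos = {}, 0
    for field in schema["fields"]:
        if field["type"] == "long":
            record[field["name"]], pos = read_long(buf, pos)
        elif field["type"] == "string":
            length, pos = read_long(buf, pos)
            record[field["name"]] = buf[pos:pos + length].decode("utf-8")
            pos += length
        else:
            raise NotImplementedError(field["type"])
    return record

user = {"id": 42, "name": "Ada"}
payload = encode(SCHEMA, user)
print(payload.hex(), decode(SCHEMA, payload))
```

The reader cannot make sense of the bytes without the schema, which is why it has to be exchanged up front. In return, nothing needs to be compiled: the schema is just data that can be loaded at runtime.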

BERT & BERT-RPC:

BERT stands for Binary ERlang Term. It is based on Erlang's binary serialization format. The author of this format is a founder of GitHub. The GitHub team posted an article on how they improved the performance of their site using this new protocol. Their main reason for not using Protocol Buffers or Thrift is the mundane define-compile-use cycle. Instead they created this protocol, which supports dynamic data format definition: the data itself carries meta-information about its structure, so the client can read it on the go. With GitHub being a huge repository of open source projects, with people forking branches and checking huge code bases in and out, we can imagine the traffic they handle; BERT must have held up well to be chosen as an alternative to Protocol Buffers and Thrift.
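To illustrate what "the data itself carries meta-information" means, here is a rough sketch of a tiny encoder in the style of Erlang's external term format, which BERT builds on. Every value is preceded by a tag byte saying what it is, so no schema is needed to read it back. This is my own hand-written illustration covering only a few types, not the GitHub team's library; the Atom class and function names are made up for the example.

```python
import struct

class Atom(str):
    """Marker type for Erlang atoms (hypothetical helper for this sketch)."""

def encode_term(term) -> bytes:
    """Encode one term, prefixed by a tag byte that identifies its type."""
    if isinstance(term, Atom):
        raw = term.encode("latin-1")
        return bytes([100]) + struct.pack(">H", len(raw)) + raw  # ATOM_EXT
    if isinstance(term, bool):
        raise TypeError("booleans are not covered by this sketch")
    if isinstance(term, int):
        if 0 <= term <= 255:
            return bytes([97, term])                              # SMALL_INTEGER_EXT
        if -2**31 <= term < 2**31:
            return bytes([98]) + struct.pack(">i", term)          # INTEGER_EXT
        raise ValueError("big integers are not covered by this sketch")
    if isinstance(term, bytes):
        return bytes([109]) + struct.pack(">I", len(term)) + term  # BINARY_EXT
    if isinstance(term, tuple):
        if len(term) > 255:
            raise ValueError("large tuples are not covered by this sketch")
        body = b"".join(encode_term(item) for item in term)
        return bytes([104, len(term)]) + body                     # SMALL_TUPLE_EXT
    raise TypeError(f"unsupported type: {type(term).__name__}")

def bert_encode(term) -> bytes:
    """Prefix the encoded term with the format's version byte (131)."""
    return bytes([131]) + encode_term(term)

encoded = bert_encode((Atom("ok"), 1))
print(list(encoded))  # [131, 104, 2, 100, 0, 2, 111, 107, 97, 1]
```

The tag bytes cost a little space compared to a schema-driven format, but a receiver can decode any message it is handed without a compiled definition. That is the trade-off the GitHub team was after.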

Let's see what improvements and comparison reports the future brings for these protocols.

Links:

Click on a protocol name in the article above to go to the relevant page. Some more links are below.

http://hessian.caucho.com/doc/hessian-serialization.html#anchor2

http://github.com/blog/531-introducing-bert-and-bert-rpc
