Friday, December 25, 2009

It's time to git

Every now and then a new technology comes along, but few gather momentum and finally get adopted by the masses. Git is certainly on the right track, and GitHub has certainly fueled the adoption of git by the masses.

Git is most effective, and fastest, when used at the command line. There are efforts to build UIs around it, like the Eclipse plugin, but they aren't complete yet. I am more comfortable at the terminal, so I haven't checked on the UI progress lately.

With agile practices like pair programming, combined with distributed development, people want a distributed source control system that is snappy and comes with good tooling.

A couple of interesting things I like about git:

Git-Daemon: The git-daemon utility bundled with the git release is a quick way to share your code across the network. Say you are at a barcamp or a coffee meetup with a friend; he can share his local git repository over the network just by running the command below (--export-all lets the daemon serve every repository under the base path)

git daemon --export-all --base-path=parent_path_to_the_repo

And you can clone his repository onto your machine with

git clone git://server-location/repo

Git-SVN: Those who use SVN as the production repository for their source code can still use git locally. git-svn lets you sync your local git commits into SVN directly. This is another reason for people to start using git locally and get all of its benefits, while still checking into SVN as the corporation requires.
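A rough sketch of that round trip (the SVN URL and project name are placeholders):

git svn clone http://svn.example.com/project/trunk project   # one-time import of the SVN history into a local git repo
cd project
git svn rebase    # pull new SVN revisions onto your local work
git svn dcommit   # push your local git commits back into SVN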

Git-Stash: A git stash can be thought of as a coding context. Say you have modified a couple of files to fix bug 121 - you can create a context that stores those changes, and git then reverts the working tree to the HEAD (clean) state, so you can attack bug 75 and commit it before bringing back the changes for bug 121. These contexts are easy to create, and it pays to label them clearly.
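A minimal example of that flow:

git stash save "bug 121 - half-done fix"   # set the changes aside; the working tree is clean again
git stash list                             # see the saved contexts with their labels
git stash pop                              # bring the latest stashed context back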

The Dilemma:

For those still saying - 'yeah, git is cool, but with the whole distributed thing - isn't there a chance that I lose control of the code my developers write for me? How do I track them?'

Checking code into the repository often is a matter of discipline; it can happen with any repository. With git you can ask your new developer to share his local git repository so you can review it, rather than waiting until he gets access to the central repository and checks in his crap. In fact git gives you the ability to pull code and give feedback earlier, instead of waiting until something gets checked in.
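For example (the remote name and address here are made up), a lead could pull the developer's daemon-shared repository straight into his own clone:

git remote add alice git://alices-laptop/project
git fetch alice
git log -p alice/master   # review the commits before anything touches the central repository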

Getting a developer access to the central repository is normally a long process in any corporation. Instead of waiting for that, the developer can start coding right away, and as a lead you can keep track of the progress.

Those looking for patterns to control the repository effectively should look at this presentation: http://www.slideshare.net/err/git-machine. Starting from slide no. 72, the author points out several patterns (Anarchy, Blessed, Lieutenant, & Centralized) for managing the repository.

With all that said, it's time everyone considered a distributed source control system - it enables developers, and with a pattern you choose to control your repository, it's a win-win.

More Links:
  • How Git Index/Staging Area simplifies commit - http://plasmasturm.org/log/gitidxpraise/
  • A Git Branching Model - http://nvie.com/archives/323

Friday, November 06, 2009

Serialization/Streaming Protocols: What we got?

It takes a huge effort to build a friendly API and to build a community around it. But once you have a popular service API, the next thing is handling the traffic. It doesn't have to be an external API; it can be your own web front-end posting requests to the backend service layer.

As the user base explodes, every bit saved is bandwidth and money saved. This applies to mobile clients as well. With things hosted in the cloud these days, it really does matter how much bandwidth you use and how few resources you consume.

Two things magnify the problem:

1) User base - if the user base is really large, then even transferring 1 MB per user over the wire is going to hit a wall. Imagine 1 million users trying to access your webpage: that is roughly a terabyte over the wire for a single page each.

2) Amount of data transferred - if you are transferring huge amounts of data, say your website is a cloud-based storage system or an online cloud database, then again you are going to hit a performance wall soon.

So to move your objects from server to client, you need to look at the serialization options available. I will start with some standard ones, and then list some recent ones that sound interesting.

XML:

Human readable and machine parseable at the same time, but probably the most verbose serialization option we have. The human-readability advantage also goes down very quickly as the size of the XML file grows.

JSON:

JSON (pronounced like 'Jason') stands for JavaScript Object Notation. It's pretty popular with AJAX and JavaScript-based web libraries. It keeps the data compact and saves us from the verbosity of XML. The JSON format supports only text data, and doesn't have native support for binary data.
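For a feel of the difference, here is the same made-up record in both formats:

<user><name>Alice</name><id>42</id></user>

{"name": "Alice", "id": 42}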

Hessian:

Hessian has been around for a while, and it is quite popular in the J2ME world because of its small dependency footprint and efficient binary protocol. Starting from the Hessian 1.0 spec, it has now come to Hessian 2.0. The Hessian 2.0 spec seems to be quite comparable with any of the recent protocols that have been released.

Protocol Buffers:

Coming from Google, we can definitely assume it has great scalability & performance. It supports both text and binary formats; the text representation is converted to a compact binary format before being sent across the wire. You first create an interface file (.proto) describing the fields, and compile it into Java (or any supported language) classes. Then you can serialize/deserialize between the binary format and objects in your language. The main drawback is having to specify the interface and compile it into objects, but having things statically compiled gives you some performance advantages. It supports binary data in the message structure as well.
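A minimal sketch of the define and compile steps (the message name and fields are made up):

// user.proto
message User {
  required string name   = 1;
  optional int32  id     = 2;
  optional bytes  avatar = 3;   // binary fields are supported natively
}

protoc --java_out=src/ user.proto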

Apache Thrift:

Thrift was originally created and used within Facebook, and was later released as an Apache open source project. It is pretty similar to Google's offering, with the same define-compile-use cycle: you define the message structure in a .thrift file, compile it with the thrift compiler, and use the generated classes in your services/clients. Apache Thrift has poor documentation compared to the other protocols.
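The equivalent sketch for Thrift (again, the structure is made up):

// user.thrift
struct User {
  1: string name,
  2: i32    id
}

thrift --gen java user.thrift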

Apache Avro:

This is one of the sub-projects of Apache Hadoop, the 'Google Map-Reduce' inspired framework for Java. The project is contributed to heavily by Yahoo!, who are said to use it extensively in their infrastructure. One of Avro's design goals is to support dynamic typing, that is, to exchange information without the compile-use cycle. The schema of the data structure is defined in JSON and exchanged during the initial interaction; for the rest of the transfers the client uses that schema to read the data.
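A made-up example of such a schema:

{"type": "record", "name": "User",
 "fields": [ {"name": "name", "type": "string"},
             {"name": "id",   "type": "int"} ]}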

BERT & BERT-RPC:

BERT stands for Binary ERlang Term. It is based on Erlang's binary serialization format, and its author is a founder of GitHub. The GitHub team posted an article on how they improved the performance of their site using this new protocol. Their main reason for not using Protocol Buffers or Thrift is the mundane define-compile-use cycle. Instead they created a protocol that supports dynamic data format definition, so the data itself carries meta-information about its structure (which the client can read on the fly). GitHub being a huge host of open source projects, with people forking branches and checking huge code bases in and out, we can imagine the traffic they handle; BERT must hold up well against Protocol Buffers & Thrift to be a better alternative for them.

Let's see what improvements and comparison reports the future brings for these protocols.

Links:

Click on the protocol name in the article above to go to the relevant page. Some more links below.

http://hessian.caucho.com/doc/hessian-serialization.html#anchor2

http://github.com/blog/531-introducing-bert-and-bert-rpc
