Tuesday, September 14, 2010

Behind the Instant


Google released its Instant Search feature on Sept 8, 2010, creating quite a frenzy across the web. Yahoo claimed it had done the same thing first, five years earlier. Creative hackers are already building Instant Search-like mash-ups on top of other Google services (YouTube Instant, Google Maps).

Google Instant Search redefines the way people search. People used to click a search button and manually scan through results to find what they wanted; with Instant Search, they can keep refining the query until they see relevant results.

This changes how SEO experts approach keyword optimization, and it influences ad placement revenue, since it reduces the cases where a user needs to visit the next page of search results.

Though Google Instant Search is essentially just ajaxifying the existing Google site, what really amazes me is how much they had to scale behind the scenes to deliver results as you type. I think of performance tuning and scalability as a kind of magic made up of a series of well-performed, repeatable tricks. The rest of this post is my attempt to explain the magic.

Some facts from Google:
  • Instant Search brings a 20x increase in query traffic (already at a billion queries a day)
  • They didn't scale just by adding more servers; instead, they found new ways to squeeze more out of their existing infrastructure.
  • They used 15 new technologies in building this infrastructure.

How it works:






As you type the query, the browser requests a list of suggestions from the server. As soon as what you type matches one of the suggestions, the actual search results are fetched for that suggestion and the web page is updated.

Let's say we are searching the web for 'man on wire'* using Google Instant Search. Here is a trace of the HTTP requests between the browser and the site.


*The query itself has no significance; it's just a random example.

The traffic falls into two categories: the JSON data and the fetches of the actual web page. As you can see, most of the traffic here is JSON data, corresponding to the AJAX calls that fire as you type the query into the search box. The JSON data consists of possible suggestions for the query you just started typing, and it includes search result data only if there is a strong match among the suggestions.

What really happens?

  • Only suggestions are fetched from the server as you type, not the real search results.
  • Search results are retrieved only for the top matching suggestion, not for the actual keyword you type. Even if you type just 'ma', the search results are fetched only for the top suggestion (say 'map quest'). The results for the keyword 'ma' itself are fetched only after you hit the 'Search' button.
  • Results are retrieved either when you pause for several milliseconds or when your query matches the top suggestion on the list (a word match, not just a character match).
  • Google most likely caches or pre-computes many interim results.
By restricting the retrieval of search results to suggestions, Google gets an incredible caching advantage: the cached results can be shared between many users. The permutations are now limited to the suggestions that can appear for a user's query, rather than the possible permutations of whatever the user could type next.
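
Here is a minimal sketch of that idea (the suggestion lists and fetch calls below are made up, not Google's API): because results are keyed by the suggestion rather than by whatever the user has typed so far, the same cached entry can serve many users.

// illustrative only: suggestion-keyed result fetching with a shared cache
def suggestionsFor = { prefix ->
    ['ma': ['map quest', 'man on wire'], 'man on w': ['man on wire']][prefix] ?: []
}
def resultCache = [:]
def resultsFor = { suggestion ->
    if (!resultCache.containsKey(suggestion)) {
        resultCache[suggestion] = "search results for '${suggestion}'"   // stand-in for the real backend call
    }
    resultCache[suggestion]
}

['ma', 'man on w'].each { typed ->
    def top = suggestionsFor(typed)[0]          // results are fetched only for the top suggestion
    if (top) println "typed '${typed}' -> showing ${resultsFor(top)}"
}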

Here is the HTTP trace log when searching for the same query, 'man on wire', a second time.


If you type the same query again within minutes, Google is smart enough to load it from the local cache rather than going back to the server. As you can see in the table above, there is no JSON traffic, just a 204 server response. 204 means 'the server successfully processed the request but is not returning any content'.

Layers of cache:


In order to scale, the search results have to be cached at various layers.




Google introduced various new layers of caching:
  • a cache for prioritized search queries (like hot trends)
  • a user-specific cache, so they can give you more personalized results
  • a general result cache
  • other miscellaneous layers of caching (even browser-level caching)
Caching is the most common contributor to instant search results, but it comes with the drawback of growing stale. Google must have massively improved its existing caching techniques in order to serve results instantly.
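
As a rough illustration of the layering idea (the layer names and lookups below are purely hypothetical, not Google's implementation), a query falls through the layers until one of them has an answer:

// illustrative sketch of layered result caching
def hotTrendsCache = ['world cup': 'precomputed results']   // prioritized/hot queries
def userCache      = [:]                                    // personalized, per-user results
def generalCache   = [:]                                    // shared result cache
def backendSearch  = { q -> "freshly computed results for '${q}'" }

def search = { String query ->
    hotTrendsCache[query] ?:
    userCache[query]      ?:
    generalCache[query]   ?:
    (generalCache[query] = backendSearch(query))            // miss on every layer: compute and cache
}

println search('world cup')      // served from the hot-trends layer
println search('man on wire')    // first hit goes to the backend, later hits come from the cache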

How does Google keep search results relevant?


The sooner results are fetched (as you type) and the more layers of cache they pass through, the more likely they are stale by the time you consume them. Google recently revamped its indexing infrastructure to provide more accurate, real-time results. Remember when they integrated real-time tweets into the search results?

Until recently, Google mostly used the MapReduce paradigm to build its index. MapReduce helps handle tons of data, but it processes data in large batches. Building a real-time search engine on MapReduce alone is not going to work well, so they added new indexing techniques with the announcement of Caffeine.

Summary:


Google must have vastly improved its indexing, caching, and the other layers of its search infrastructure (like the Google File System and its clustering techniques) to be able to serve us results instantly. I assume they may even have used expertise from the recently cancelled Google Wave project.

We can probably expect Google to release research papers on at least some of the 15 new technologies that power Google Instant Search.


Friday, February 26, 2010

Git - check your config

If you work on various projects, say a work project alongside a fun side project or an open-source project, it's important to get the author name and author email right when you commit.

You don't want marty@localhost or xxx1337.AT.yahoo.COM showing up as the author id in the corporate repository. Some open-source projects reject commits that don't come from a registered email id.

You have probably set the git global config to one username/email id and forgotten about it. So here are the tips:

1) Set the global config to the name/id matching the primary purpose of that machine: set your home machine to your personal id, and your office machine to your office id.

git config --global user.name "My name is..."
git config --global user.email my_email@domain.com


2) Use the git local config on sensitive projects to override the global config settings. The easiest way to override the config is to run
git config user.name "My name for this project is..."
in your local project folder.

You can easily edit the whole config file in vi (or your command-line editor of choice) by using:
git config --global -e
git config -e (to edit local config)


Finally, run git config -l in the project folder to see the merged config information that will be used when you commit.
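
The listing shows entries from the global file first and the repository's local file last, and the last value wins. With the overrides above, you would see something like this (the values are just the placeholders from this post):

user.name=My name is...
user.email=my_email@domain.com
user.name=My name for this project is...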

N.B.
Install bash completion for git if you are not using it already.
http://justamemo.com/2009/02/09/bash-completion-along-with-svn-and-git-tab-completion/

Friday, December 25, 2009

It's time to git

Every now and then a new technology comes along, but only a few gather momentum and finally get adopted by the masses. Git is certainly on the right track, and GitHub has certainly fueled its adoption.

Git is most effective, and fastest, when used at the command line. There are efforts to build UIs around it, like the Eclipse plugin, but they aren't complete yet. I am more comfortable at the terminal, so I haven't checked on the UI progress lately.

With agile practices like pair programming, combined with distributed development, people want a distributed source control system that is snappy and comes with good tooling.

A couple of interesting things I like in git:

Git-Daemon: the git-daemon utility bundled with the git release is a quick way to share your code across the network. Say you are at a barcamp or a coffee meet-up with a friend; he can share his local git repository over the network just by running

git daemon --export-all --base-path=parent_path_to_the_repo

And you can clone his repository locally with

git clone git://server-location/repo

Git-SVN: those using SVN as the production repository for their source code can still use git locally. git-svn helps you sync your current workspace directly into SVN. This is another reason to start using git locally and get all its benefits, while still checking into SVN as the company requires.
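
A typical round trip looks something like this (the SVN URL is just a placeholder):

git svn clone http://svn.example.com/project/trunk (one-time import of the SVN history into a local git repo)
git svn rebase (pull the latest SVN revisions and replay your local commits on top)
git svn dcommit (push your local git commits back to SVN as individual revisions)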

Git-Stash: a git stash can be thought of as a saved coding context. Say you have modified a couple of files to fix bug 121; you can create a context that stores the changed files. Git then reverts the working tree to a clean HEAD state, so you can attack bug 75 and commit it before bringing back the changes for bug 121. These contexts are easy to create, and it's convenient to label them clearly.
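
For example (the messages are illustrative):

git stash save "WIP: fix for bug 121" (shelve the current changes and go back to a clean HEAD)
git stash list (see the stashed contexts with their labels)
git stash pop (re-apply the most recent stash once bug 75 is committed)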

The Dilemma:

For those still saying, 'Yeah, git is cool, but with the whole distributed thing isn't there a chance that I lose control of the code my developers write for me? How do I track them?'

Checking code into the repository often is a matter of discipline, whatever repository you use. With git, you can ask a new developer to share his local git repository so you can review his work, rather than waiting until he gets access to the central repository and checks in his code. In fact, git lets you pull code and give feedback earlier, instead of waiting for something to be checked in.

Getting developer access to the central repository is normally a slow process in any corporation; instead of waiting for that, the developer can start coding right away, and as a lead you can keep track of the progress.

If you are looking for patterns to control the repository effectively, look at this presentation: http://www.slideshare.net/err/git-machine. Starting from slide 72, the author points out several patterns (Anarchy, Blessed, Lieutenant, and Centralized) for managing the repository.

With all that said, it's time everyone considered a distributed source control system: it enables developers, and with a pattern of your choice to control the repository, it's a win-win.

More Links:
  • How Git Index/Staging Area simplifies commit - http://plasmasturm.org/log/gitidxpraise/


  • A Git Branching Model - http://nvie.com/archives/323
Friday, November 06, 2009

Serialization/Streaming Protocols: What we got?

It takes a huge effort to build a friendly API and grow a community around it. But once you have a popular service API, the next challenge is handling the traffic. It doesn't have to be an external API; it can be your web front-end posting requests to the backend service layer.

As the user base explodes, every bit saved is bandwidth and money saved. This applies to mobile clients as well. With things hosted in the cloud these days, it matters how much bandwidth you use and how few resources you consume.

Two things magnify the problem:

1) User base: if the user base is really large, then even transferring 1 MB per user over the wire is going to hit a wall. Imagine 1 million users trying to access your web page.

2) Amount of data transferred: if you are transferring a huge amount of data, say your site is a cloud-based storage system or an online cloud database, then you are again going to hit a performance wall very soon.

So to move your objects from the server to the client, you need to look at the serialization options available. I will start with some standard ones and then list some recent ones that sound interesting.

XML:

Human readable and machine parseable at the same time, but probably the most verbose serialization option we have. The human-readability advantage also fades quickly as the size of the XML file grows.

JSON:

JSON (pronounced 'Jason') stands for JavaScript Object Notation. It is pretty popular with AJAX and JavaScript-based web libraries. It keeps the data compact and saves us from the verbosity of XML. The JSON format supports only text data and has no native support for binary data.

Hessian:

Hessian has been around for a while, and it is quite popular in the J2ME world because of its small dependencies and efficient binary protocol. Starting from the Hessian 1.0 spec, it has now reached Hessian 2.0, which seems quite comparable with any of the recent protocols.

Protocol Buffers:

Coming from Google, we can safely assume it was built with scalability and performance in mind. It supports both text and binary formats; your text representation is converted to a binary format before being sent across the wire. You first create an interface file (.proto) describing the fields and compile it into Java (or any other supported language) classes; then you can serialize/deserialize between the binary format and objects in your language. The main drawback is having to specify the interface and compile it into objects, but having things statically compiled gives you some performance advantages. It also supports binary data in the message structure.
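
As an illustration, assuming a Person message has been defined in a .proto file and compiled to a Java class (Person and its fields are hypothetical here), the generated API is used roughly like this (Groovy shown for brevity):

// 'Person' is a hypothetical class generated by the protobuf compiler from a .proto definition
def person  = Person.newBuilder().setName('Alice').setId(42).build()
byte[] wire = person.toByteArray()        // compact binary representation sent over the network
def decoded = Person.parseFrom(wire)      // rebuilt from the bytes on the receiving side
assert decoded.name == 'Alice'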

Apache Thrift:

Thrift was originally created and used within Facebook, and later released as an Apache open-source project. It is pretty similar to Google's approach, with the same define-compile-use cycle: you define the message structure in a .thrift file, compile it with the Thrift compiler, and use the generated code in your services and clients. Thrift's documentation is poor compared to the other protocols.

Apache Avro:

Avro is a sub-project of Apache Hadoop, the 'Google MapReduce'-inspired framework for Java. Yahoo! contributes heavily to the project and is said to use it extensively in its infrastructure. One of Avro's design goals is dynamic typing: being able to exchange information without the define-compile-use cycle. The schema of the data structure is defined in JSON and exchanged during the initial interaction; for the rest of the transfers, the client uses that schema to read the data.

BERT & BERT-RPC:

BERT stands for Binary ERlang Term. It is based on Erlang's binary serialization format, and its author is a founder of GitHub. The GitHub team posted an article on how they improved the performance of their site using this new protocol. Their main reason for not using Protocol Buffers or Thrift was the mundane define-compile-use cycle; instead they created a protocol that supports dynamic data format definition, so the data itself carries meta-information about its structure, which the client can read on the go. GitHub is a huge repository of open-source projects, with people forking branches and checking huge code bases in and out, so you can imagine the traffic they handle; BERT has to hold up against Protocol Buffers and Thrift under that load to be a better alternative.

Let's see what improvements and comparison reports the future brings for these protocols.

Links:

Click on a protocol name in the article above to go to the relevant page. Some more links below.

http://hessian.caucho.com/doc/hessian-serialization.html#anchor2

http://github.com/blog/531-introducing-bert-and-bert-rpc

Wednesday, July 09, 2008

using Spring Web Flow 2

I got an opportunity to work with Spring Web Flow 2 recently on a project; here I share my personal views on it.


Let me first tell you the nice things about the recent Spring stack (Spring 2.5 and above). Two things that improved a lot with the recent release are annotation support and specific namespaces.


Annotations let you spend more time writing code than wiring components through XML. Of course, Spring fails fast if you have messed up a configuration, but annotations are a lot better at avoiding that in the first place. With the improved @Repository, @Service and @Component stereotypes it's easy to configure beans with the right responsibilities by default.


Namespace improvements help keep the XML configuration minimal and free of typos. Schema definitions help validate your configuration as you type, and with the convention-over-configuration approach they have reduced the lines of XML needed to wire up objects. If you want to replace a component with your own implementation, sometimes it's easy using the auto-wire option; sometimes you have to configure it the old way (using the beans namespace and declaring most of the configuration manually), which is more painful once you are used to the new way.


With the Spring Test framework it's fairly easy to write integration tests. With a simple annotation, Spring automatically loads the application context at test start-up. With @Timed you can even clock a test method and make it fail if it exceeds the specified time. It also supports transactional tests with automatic rollback by default, so you can write tests that don't dirty the database.
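
Here is a minimal sketch of such a test in Groovy, assuming the standard Spring TestContext annotations; the class name and context location are made up:

import org.junit.Test
import org.junit.runner.RunWith
import org.springframework.test.annotation.Timed
import org.springframework.test.context.ContextConfiguration
import org.springframework.test.context.junit4.SpringJUnit4ClassRunner
import org.springframework.transaction.annotation.Transactional

@RunWith(SpringJUnit4ClassRunner)
@ContextConfiguration(locations = ['classpath:applicationContext.xml'])
@Transactional                      // changes made in each test are rolled back by default
class OrderRepositoryIntegrationTest {

    @Test
    @Timed(millis = 500L)           // fail the test if it takes longer than 500 ms
    void savesAndFindsAnOrder() {
        // exercise beans wired from the test application context here
    }
}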


Let's come back to the original topic, Spring Web Flow. Spring Web Flow works as advertised: it's for applications that have a natural business flow behind them, where the UI acts as a way to capture input for the flow and display something back. It is not for applications with requirements different from that.


Everything is a flow; each flow has a starting point and an end point, and can have any number of transitions in between. As part of a transition you can go to a sub-flow and come back to the original flow later, but these transitions can only happen at pre-defined places in the flow. It would be tough to implement a free-flow (random browsing) kind of application with it.


It serializes all the information you add to the flow context and restores it when you resume a flow after a UI interaction, so every object (entities, repositories, whatever) must implement Serializable. This restricts what you can share in the flow context.


Most transition decisions can be handled easily in the flow definition, which avoids creating Action classes that return just an outcome.


in the JSF UI:

<h:commandButton action="save" />

in the flow definition:

<view-state ...>
    <transition on="save" to="...">
        <evaluate expression="validator.validate(model)" />
    </transition>
</view-state>

As you can see, you don't need an Action class that returns the outcome 'save'; you can specify the transition directly on the command button. Now you might ask: what if 'save' should only fire under a certain condition (say, only after validation passes on the entity)? For that you can have an expression evaluated on the transition; the transition executes only if the validator returns true, and if it returns false the flow comes back to the same view. The expression accepts any EL method expression, not just a validator, so you can run any action before the transition. In effect, the method executions that used to live in Action classes move into the flow definition. This looks elegant only if the number of calls made at a transition is small, or your application is well thought out, shares little state, and keeps the method calls down. (Basically, it's a nice feature, but it would go awry for huge apps and for apps without a clear business flow behind them.)


Spring Web Flow also supports flow inheritance, so you can inherit common transition rules from a parent flow, which is a nice feature for keeping the definitions as DRY as possible.


What makes a flow definition look ugly? Having lots of trivial actions called in transitions just to set a variable, or to fetch something from flowScope and put it into viewScope, and so on. One thing I had to do multiple times in flow definitions was transform a List into a DataModel for the UI, so I could use listName.selectedRow to identify the item the user selected (see the snippet below).
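
In practice that meant something like the following inside the view-state (the service and variable names are made up), using the result-type conversion to wrap the list for JSF:

<on-render>
    <evaluate expression="itemService.findItems()" result="viewScope.items" result-type="dataModel" />
</on-render>

After that, items.selectedRow gives you the row the user clicked.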


Adding these kinds of non-business method executions and transformations to the flow definition makes it bulky and stops the flow from resembling the business definition, which defeats the very purpose of having a flow definition.


Web Flow provides convenient default variables such as resourceBundle, currentUser, and messageContext in the flow context; you can refer to them directly in the flow definition, pass them as arguments to bean action methods, or call actions on them.


When a root flow ends, all of its information is discarded. This is nice for cleaning unwanted data out of memory, but it also means you cannot share anything with the user after the flow has ended. Suppose I want to tell the user at the end of the flow that the order was placed successfully; I can't! You could ask why not keep the confirmation as part of the flow; well, that depends on when you commit the changes to the database and how you share the persistence context, and since it's just an end message, there shouldn't need to be another view interaction just to end the flow.


It's like redirecting to the home page after successfully placing the order and showing a banner saying "Thank you for shopping with us!", which is just not possible.


One last point: with a UrlMapper definition in the configuration you can make a simple URL the starting point of a flow, but otherwise you generally can't use a RESTful GET URL to reach a page in the middle of a flow.


What's your experience with Spring Web Flow?

Monday, June 09, 2008

Quick Groovy Scripting

Recently I had to port some data from a mainframe database to a SQL database for testing purposes. I started with some text report files generated from the mainframe. I'm fond of using unix awk and grep for this kind of data munging, and I've used Perl and Ruby for scripting in the past. But given that I had to do this on Windows, and with my fading knowledge of Perl, I thought of getting it done with Groovy. Since Eclipse also supports Groovy, it was easy to get started.

I quickly got something running that spits out SQL statements (using println) for every line of the input. Soon my Eclipse console started eating the output because of the console buffer size in my settings! Though the huge monolithic script worked fine, I couldn't get the output in a single shot; I had to rerun it in parts to collect the final output, which slowed down my tweaking of the script. Given that Eclipse didn't have much refactoring support for Groovy, I couldn't easily extract functions the way I would in Java. But I was able to use a more powerful tool: define a closure and redirect the input of the println statements to a file, without much change to the original script.
    println "insert into table_name (col1, col2, col3) into values (${col1},'${col2}', ${col3})"

    def file = new File( "C:\output.txt")

    def println = { line ->  file.append(line)}

Just adding these two lines saved me a lot of time; now I can easily switch between seeing the output on the command line and capturing it in a file.

Another thing that helped me get things done quickly is the ability to refer to variables inside a string directly, as in '${col2}'. This is especially useful where I have to wrap string-typed columns in quotes, which would otherwise take endless escaping and + concatenations!

For the next script, I started writing small classes rather than a single file, which made things easier to change at the last minute. Another gotcha for Groovy beginners is the use of '=='. Remember that in Groovy, 'a == b' is effectively converted to a.equals(b) before execution. I ran into endless self-recursive calls when I used == for reference comparison the way we do in Java.
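
The classic place to hit this is in your own equals() implementation; a quick illustration with a made-up class:

class Node {
    String id
    boolean equals(Object other) {
        // 'this == other' here would call equals() again and recurse forever;
        // use is() when you really mean a reference comparison
        if (this.is(other)) return true
        other instanceof Node && other.id == id
    }
    int hashCode() { id.hashCode() }
}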

As I completed the script, there were a lot of duplicate SQL statements in the output, which caused integrity-constraint errors in the database, so I had to find a way to remove them. On unix I would normally use `uniq` for this. Since I had to get it done quickly, I just looped through the output file, added each line to a Set, and dumped it back out to remove the duplicates.
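
Roughly like this (the file names are illustrative); a LinkedHashSet also preserves the original ordering:

def unique = new LinkedHashSet<String>()
new File('C:/output.txt').eachLine { unique << it }
new File('C:/output-unique.txt').text = unique.join('\n')    // write the de-duplicated lines back out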

Having used Perl and Ruby in the past, I know their library support is far bigger than Groovy's. But given that I have been using Java for years and had to work on Windows, Groovy was a life saver!

N.B. No data conversion is possible without effective use of regular expressions. I used regular expressions to format the input files before running the Groovy scripts against them, doing the find/replace in TextPad. The regular-expression support in the Eclipse editor's find/replace tool still needs improvement before it can be really useful.

Thursday, March 20, 2008

# tricks in url

We all know that the # symbol in HTML is used with anchors: it marks a particular anchor within a single HTML page.
For example, in the Seam reference doc (a single HTML page): http://docs.jboss.com/seam/2.1.0.A1/reference/en/html_single/#d0e336

In the URL, #d0e336 marks the section 'Understanding the code' within the whole HTML page. If you view the source, you can see that section of the page is marked with a matching anchor for #d0e336.

The URI combined with this # fragment points to a particular section of the page; this lets people bookmark the page and return to exactly that spot when they come back.

Let's get into the more interesting stuff with the # sign. Whenever you request a page with a # fragment at the end, the browser sends the GET request only with the URL up to the #; the part after the # sign is never sent to the server.
If you request http://mypage.com/page#profile, the browser sends the request as 'http://mypage.com/page', stripping off the # sign and the text after it. Once the page loads, the browser tries to locate the anchor matching '#profile' and positions the page there. If it cannot find the matching anchor, it just ignores the fragment and shows the page as-is.

Given that the text after the # concerns only the client, and that the browser takes no action if the particular anchor is missing from the markup, there are some potential uses for the # sign:

  • fancy url
  • could potentially be used to maintain client-side state!
  • generate a unique bookmark-able url


fancy url:
http://mail.google.com/mail/#inbox
http://mail.google.com/mail/#sent

As you can see, the server URL is just http://mail.google.com/mail/, but the #inbox the browser displays denotes that you are in the inbox view.

maintain client-state:

Say there are two tabs on a page, and the user wants to bookmark the page along with the tab he is currently working in, so that whenever he loads that bookmark, the page comes up with the same tab highlighted.

You can add an identifier after the # sign in the URL and use client-side JavaScript to parse the location and pick out the identifier, to determine which tab should be highlighted.

Some JavaScript libraries use this trick to generate part of the page in the browser. The iUI library, which generates iPhone-style web pages, uses the same trick: it maintains client state through this identifier and uses JavaScript to re-render part of the page as an iPhone-style mock-up.

http://m.linkedin.com/#_home

unique bookmark-able url:

Say you use Greasemonkey to customize a web page, and you have set that custom script to run for a particular URL/site. Now you want to test a new script against the same URL: you can add an identifier after the pound sign to create a unique URL and map the new script to that, so the same site is handled by different Greasemonkey scripts depending on the URL you load.

reference:
http://gbiv.com/protocols/uri/rfc/rfc3986.html#fragment
