
Tuesday, September 14, 2010

Behind the Instant


Google released its Instant Search feature on Sept 8, 2010, creating quite a frenzy over the web. Yahoo claimed it had done the same thing first, five years earlier. Creative hackers started building Instant-Search-like mash-ups over other Google services (YouTube Instant, Google Maps).

Google Instant Search redefines the way people use search. People used to click a search button and manually navigate across the results to find what they wanted; with Instant Search, they can keep refining the query until they see relevant results.

This changes how SEO experts approach keyword optimization for search engines, and it influences ad placement revenue, since it further reduces the cases where the user needs to visit the next page of search results.

Though Google Instant Search is essentially just ajaxifying the existing Google site, what really amazes me is how much they had to scale behind the scenes to deliver results as you type. I think performance tuning and scalability are like magic made up of a series of well-performed, repeatable tricks. The rest of this post is my attempt to explain the magic.

Some facts from Google:
  • Instant Search means roughly a 20x increase in query traffic (already at a billion queries a day)
  • They didn't scale just by increasing number of servers, instead they found new ways to squeeze more out of their existing infrastructure.
  • They have used 15 new technologies in building this infrastructure.

How it works:






As you type the query, the browser asks the server for a list of suggestions. As soon as what you type matches one of the suggestions, the actual search results for that suggestion are fetched and the web page is updated.

Let's say we are querying the web for 'man on wire'* using Google Instant Search. Here is the trace of HTTP requests between the browser and the site.


*The query itself has no significance; it is just a random example.

The traffic can be classified into JSON data and the fetching of the actual web page. As you can see, most of the traffic here is JSON data, which corresponds to the AJAX calls made as you type the query into the search box. The JSON data comprises possible suggestions for the query you just started typing, and it includes the search result data only if there is a potential match among the suggestions.

What really happens?

  • Only suggestions are fetched from the server as you type, not the real search results.
  • Search results are retrieved only for the top matching suggestion, not for the actual keyword you type. Even if you start typing 'ma', the search results are fetched only for the top suggestion (say 'map quest'). The search results for the keyword 'ma' are fetched only after you hit the 'Search' button.
  • Results are retrieved either when you pause for several milliseconds or when your query matches (a word match, not just a character match) the top suggestion on the list.
  • Google most likely caches or pre-caches many interim results.
By restricting the retrieval of search results to suggestions, Google gets an incredible advantage in caching search results that can be shared between many users. The permutations are now limited to the set of suggestions that can appear for a user's query, rather than the possible permutations of whatever the user could type next.
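
To make the idea concrete, here is a minimal sketch of that pattern (my own illustration, not Google's code): suggestions are fetched cheaply on every keystroke, but results are fetched and cached only per suggestion, so many different prefixes map onto one cached entry.

import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class InstantSearchSketch {

    // results cached per *suggestion*, never per raw keystroke
    private final Map<String, String> resultCache = new ConcurrentHashMap<>();

    // stand-ins for the real suggest/search back ends
    private List<String> fetchSuggestions(String prefix) { return List.of("map quest", "man on wire"); }
    private String fetchResults(String suggestion)       { return "<results for " + suggestion + ">"; }

    public String onKeystroke(String prefix) {
        List<String> suggestions = fetchSuggestions(prefix); // cheap call on every keystroke
        String top = suggestions.get(0);                     // results only for the top suggestion
        return resultCache.computeIfAbsent(top, this::fetchResults); // 'ma', 'map', 'map q' all hit one entry
    }
}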

HTTP Trace log when searching for the same query 'man on wire' next time.


If you type the same query again within minutes, Google is smart enough to load it from the local cache rather than going to the server to fetch it again. As you can see in the table above, there is no JSON traffic, just a 204 server response. 204 means 'the server successfully processed the request, but is not returning any content'.

Layers of cache:


In order to scale, the search results have to be cached at various layers.




Google introduced various new layers of cache:
  • a cache of prioritized search queries (like hot trends)
  • a user-specific cache, so they can give you more personalized results
  • a general result cache
  • miscellaneous other layers of caching (even browser-level caching)
Caching is the most common contributor to instant search results, but it comes with the drawback of becoming stale. Google must have massively improved its caching techniques in order to serve results instantly.
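
As a rough illustration of the layering (entirely my own sketch, with made-up layer roles), a lookup walks the layers in order and back-fills them on a miss:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

public class LayeredCache {

    // e.g. layer 0 = browser cache, layer 1 = user-specific cache, layer 2 = general cache
    private final List<Map<String, String>> layers = new ArrayList<>();
    private final Function<String, String> backend;

    public LayeredCache(int layerCount, Function<String, String> backend) {
        for (int i = 0; i < layerCount; i++) layers.add(new HashMap<>());
        this.backend = backend;
    }

    public String get(String query) {
        for (Map<String, String> layer : layers) {
            String hit = layer.get(query);
            if (hit != null) return hit;              // served from the closest layer that has it
        }
        String result = backend.apply(query);         // miss everywhere: compute once
        for (Map<String, String> layer : layers) layer.put(query, result); // back-fill all layers
        return result;
    }
}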

How does Google keep search results relevant?


The sooner results are fetched (as you type), and the more layers of cache they pass through, the more likely they are to be stale by the time you consume them. Google recently revamped its indexing infrastructure to provide more accurate, real-time results. Remember how they integrated real-time tweets into the search results?

Until recently, Google mostly used the MapReduce paradigm to build its index. MapReduce can handle tons of data, but it processes it in batches. Building a real-time search engine using MapReduce alone is not going to work well, so they added new indexing techniques with the announcement of Caffeine.

Summary:


Google must have vastly improved its indexing, caching, and the other layers of its search infrastructure (like the Google File System and its clustering techniques) to be able to serve us results instantly. I assume they may even have used expertise from the recently cancelled Google Wave project.

We can probably expect Google to release research papers for at least some of the 15 new technologies that power Instant Search.


Friday, December 25, 2009

It's time to git

Every now and then a new technology comes along, but only a few gather momentum and finally get adopted by the masses. Git is certainly on the right track, and GitHub has certainly fueled its adoption.

Git is most effective (and fastest) when used at the command line. There are efforts to build UIs around it, like the Eclipse plugin, but they aren't complete yet. I am more comfortable at the terminal, so I haven't checked on the UI progress lately.

With agile practices like pair programming, combined with distributed development, people want a distributed source control system that is snappy and comes with good tools.

A couple of interesting things I like in git:

Git-Daemon: the git-daemon utility bundled with the git release is a quick way to share your code across the network. Say you are at a barcamp or a coffee-shop meetup with a friend. He can share his local git repository over the network just by running (--export-all lets git-daemon serve repositories that aren't explicitly marked for export):

git-daemon --export-all --base-path=parent_path_to_the_repo

And you can clone his repository locally with

git clone git://server-location/repo

Git-SVN: those who use SVN as the production repository for their source code can still use git locally. git-svn helps you sync your current workspace code into SVN directly. This is another reason for people to start using git locally and get all its benefits, while still checking into SVN as the corporation requires.
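
The basic round trip looks something like this (the repository URL is just a placeholder):

git svn clone http://svn.example.com/repo/trunk
git svn rebase    # pull the latest SVN revisions into your local git history
git svn dcommit   # push your local git commits back to SVN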

Git-Stash: a git stash can be thought of as a saved coding context. Say you have modified a couple of files to fix bug 121: you can stash the changed files away, which reverts your working copy to a clean HEAD state, attack bug 75 and commit it, and then bring the bug 121 work back. These contexts are easy to create, and it helps to label them clearly.
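
For example (the stash message is just an illustrative label):

git stash save "wip: bug 121"    # park the half-done fix and go back to a clean HEAD
# ...fix bug 75 and commit it...
git stash list                   # see the parked contexts and their labels
git stash pop                    # restore the bug 121 work and continue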

The Dilemma:

For those still saying, 'Yeah, git is cool, but with the whole distributed thing isn't there a chance that I lose control of the code my developers write for me? How do I track them?'

Checking code into the repository often is a matter of discipline, whatever repository you use. With git you can ask a new developer to share his local repository so you can review his work, rather than waiting until he gets access to the central repository and checks in something half-baked. In fact, git gives you the ability to pull code and give feedback earlier than you could if you had to wait for a central check-in.

Getting a developer access to the central repository is normally a slow process in any corporation. Instead of waiting for that, the developer can start coding right away, and as a lead you can still keep track of progress.

If you are looking for patterns to control the repository effectively, look at this presentation: http://www.slideshare.net/err/git-machine. Starting from slide 72, the author points out several patterns (Anarchy, Blessed, Lieutenant, and Centralized) for managing the repository.

With all that said, it's time everyone considered a distributed source control system: it enables developers, and with a pattern of your choice to control your repository, it's a win-win.

More Links:
  • How Git Index/Staging Area simplifies commit - http://plasmasturm.org/log/gitidxpraise/


  • A Git Branching Model - http://nvie.com/archives/323
Friday, November 06, 2009

    Serialization/Streaming Protocols: What we got?

It takes a huge effort to build a friendly API and a community around it. But once you have a popular service API, the next concern is handling the traffic. It doesn't have to be an external API; it can be your own web front-end posting requests to the backend service layer.

As the user base explodes, every bit saved is bandwidth and money saved. This applies to mobile clients as well. With things hosted in clouds these days, it matters how much bandwidth you use and how few resources you consume.

Two things magnify the problem:

1) User base - if the user base is really large, then even transferring 1 MB per user over the wire is going to hit a wall. Imagine 1 million users trying to access your webpage.

2) Amount of data transferred - if you are transferring huge amounts of data, say your website is a cloud-based storage system or an online cloud database, then performance is again going to hit a wall soon.

So to move your objects from server to client, you need to weigh several serialization options. I will start with some standard ones and then list some recent ones that sound interesting.

    XML:

Human readable and machine parseable at the same time, but probably the most verbose serialization option we have. The human-readability advantage also goes down very quickly as the size of the XML file goes up.

    JSON:

JSON (pronounced 'Jason') stands for JavaScript Object Notation. It's pretty popular with AJAX and JavaScript-based web libraries. It keeps the data compact and saves us from the verbosity of XML. The JSON format supports only text data and doesn't have native support for binary data.

    Hessian:

Hessian has been around for a while, and it is quite popular in the J2ME world because of its small dependency footprint and efficient binary protocol. Starting from the Hessian 1.0 spec, it has now reached Hessian 2.0, which seems quite comparable with any of the new-age protocols released recently.

    Protocol Buffers:

Coming from Google, we can safely assume it has great scalability and performance. It supports both text and binary formats; the text representation is converted to a binary format before being sent across the wire. You first create an interface file (.proto) describing the fields, and compile it into Java (or any supported language) classes. Then you can serialize and deserialize between the binary format and objects in your language. The main drawback is having to specify the interface and compile it into objects, but having things statically compiled gives you some performance advantages. It also supports binary data in the message structure.
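
As a rough sketch of that define-compile-use cycle (the Item message and its fields are made up, and the generated class only exists after running protoc):

import com.google.protobuf.InvalidProtocolBufferException;

// item.proto (hypothetical):
//   message Item {
//     required string name  = 1;
//     optional int32  price = 2;
//   }
// `protoc --java_out=...` generates an Item class with a builder.

public class ProtoSketch {
    public static void main(String[] args) throws InvalidProtocolBufferException {
        Item item = Item.newBuilder().setName("book").setPrice(12).build();
        byte[] wire = item.toByteArray();   // compact binary form for the wire
        Item back  = Item.parseFrom(wire);  // deserialize on the other side
        System.out.println(back.getName());
    }
}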

    Apache Thrift:

Thrift was originally created and used within Facebook, and later released as an Apache open-source project. It is pretty similar to Google's offering, with the same define-compile-use cycle: you define the message structure in a .thrift file, compile it with the Thrift compiler, and use the generated code in your services and clients. Apache Thrift has poor documentation compared to the other protocols.

    Apache Avro:

This is one of the sub-projects of Apache Hadoop, the 'Google MapReduce'-inspired framework for Java. Yahoo! contributes heavily to the project and is said to use it extensively in their infrastructure. One of Avro's design goals is to support dynamic typing, that is, to exchange information without the compile-use cycle. The schema of the data structure is defined in JSON and exchanged on the initial interaction; for the rest of the transfers, the client uses that schema to read the data.

    BERT & BERT-RPC:

BERT stands for Binary ERlang Term. It is based on Erlang's binary serialization format, and its author is a founder of GitHub. The GitHub team posted an article on how they improved the performance of their site using this new protocol. Their main reason for not using Protocol Buffers or Thrift is that you have to go through the mundane define-compile-use cycle. Instead they created a protocol that supports dynamic data format definitions, so the data itself carries meta-information about its structure (which the client can read on the go). GitHub being a huge repository of open-source projects, with people forking branches and checking huge code bases in and out, we can imagine the traffic they handle; BERT must have been genuinely comparable to Protocol Buffers and Thrift to be a better alternative for them.

Let's see what improvements and comparison reports the future brings for these protocols.

    Links:

Click on a protocol name in the article above to go to the relevant page. Some more links are below.

    http://hessian.caucho.com/doc/hessian-serialization.html#anchor2

    http://github.com/blog/531-introducing-bert-and-bert-rpc

    Wednesday, July 09, 2008

    using Spring Web Flow 2

I got an opportunity to work with Spring Web Flow 2 recently in a project; here I share my personal views on it.


Let me first tell you about all the nice things in the recent Spring stack (Spring 2.5 and above). Two things that improved a lot with the recent release are annotation support and the specific namespaces.


Annotations let you spend more of your time writing code than wiring components through XML. Of course Spring fails fast if you have messed up a configuration, but annotations are a lot better at avoiding that in the first place. With the improved @Repository, @Service and @Component, it's easy to configure beans with the required responsibilities by default.


The namespace improvements help keep the XML configuration minimal and free of typos. Schema definitions help validate your configuration as you type, and with the convention-over-configuration approach they have reduced the lines of XML needed to wire up objects. If you want to replace a component with your own implementation, sometimes it's easy using the auto-wire option; sometimes you have to configure it the old way (using the beans namespace and manually declaring most of the configuration), which is more painful once you are used to the new way.


With the Spring test framework it's fairly easy to write integration tests. With a simple annotation, Spring automatically loads the application context at test start-up. With @Timed you can even clock a test method and make it fail if it exceeds a specified time. It also supports transactional tests with automatic rollback by default, so you can write tests that don't dirty the database.
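
A minimal sketch of such a test (OrderRepository, Order and the context file name are made-up placeholders):

import static org.junit.Assert.assertNotNull;

import org.junit.Test;
import org.junit.runner.RunWith;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.test.annotation.Timed;
import org.springframework.test.context.ContextConfiguration;
import org.springframework.test.context.junit4.SpringJUnit4ClassRunner;
import org.springframework.transaction.annotation.Transactional;

@RunWith(SpringJUnit4ClassRunner.class)
@ContextConfiguration("classpath:applicationContext-test.xml")
@Transactional                 // each test rolls back automatically
public class OrderRepositoryIntegrationTest {

    @Autowired
    private OrderRepository orderRepository;

    @Test
    @Timed(millis = 500)       // fail if the test takes longer than half a second
    public void savesAndReloadsAnOrder() {
        Long id = orderRepository.save(new Order("book"));
        assertNotNull(orderRepository.find(id));
    }
}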


Let's come back to the original topic, Spring Web Flow. Spring Web Flow works as advertised: it is for applications that have a natural business flow behind them, where the UI captures input for the flow and displays something back. It is not for applications with requirements different from that.


Everything is a flow. Each flow has a starting point and an end point, and can have any number of transitions in between. As part of a transition you can go to a sub-flow and come back to the original flow later, but these transitions can only happen at pre-defined places in the flow. It would be tough to implement a free-flow (random browsing) kind of application with it.


It serializes all the information you add to the flow context and restores it when you resume a flow after UI interaction, so every object involved (entities, repositories, and so on) should implement Serializable. This restricts what you can share in the flow context.


Most transition decisions can be handled directly in the flow definition, which avoids creating Action classes that return just an outcome.


    in JSF UI:


    <h:commandButton action="save" />



    in Flow definition:


<view-state ...>
    <transition on="save">
        <evaluate expression="validator.validate(model)" />
    </transition>
</view-state>



As you can see, you don't need an Action class that returns the outcome 'save'; you can specify the transition directly on the command button. Now you might ask: what if 'save' should fire only under a certain condition (say, only after validation passes on the entity)? For that you can have an expression executed on the transition: the transition proceeds only if the validator returns true, and if it returns false the user comes back to the same view. The expression accepts any EL method expression, not just a validator, so you can run any action before the transition. In effect, the method executions that lived in Action classes move into the flow definition. This looks elegant only if the number of calls made at a transition is small, or if your application is well thought out and designed to keep little information in state and few method calls. (Basically this is a nice feature, but it would go awry for huge apps, and for apps without a definite business flow behind them.)


Spring Web Flow also supports inheritance of flows, so you can inherit common transition rules from a parent flow, which is a nice feature for keeping the definitions as DRY as possible.


What makes a flow definition look ugly? Having a large number of trivial actions called in transitions just to set a variable, or to retrieve a variable from flowScope and set it back into viewScope, and so on. One thing I had to do multiple times in flow definitions was transform a List into a DataModel for the UI, so I could use listName.selectedRow to identify the item selected by the user.


Adding these kinds of non-business method executions and transformations to the flow definition makes it bulky and also stops the flow from resembling the business definition. That defeats the very purpose of having a flow definition.


Web Flow provides convenient default variables like resourceBundle, currentUser and messageContext in the flow context, which you can refer to directly in the flow definition, pass as arguments to bean action methods, or call actions on.


When a root flow ends, all of its information is discarded. This is nice for cleaning unwanted data out of memory, but it also means you cannot share anything with the user after the flow has ended. Suppose I would like to tell the user at the end of the flow that the order was placed successfully: I cannot do that! You could ask why not keep the confirmation as part of the flow; well, it depends on when you commit the changes to the database, how you share the persistence context, and the fact that it is just an end message, so there should be no further interaction from the view just to end the flow.


It's like redirecting to the home page after successfully placing the order and showing a banner saying "Thank you for shopping with us!", which is simply not possible.


One last point: with a URL mapper definition in the configuration you can make a simple URL the starting point of a flow, but otherwise you generally can't use a RESTful GET URL to reach a page inside the flow.


    What's your experience with Spring Web Flow?

    Monday, June 09, 2008

    Quick Groovy Scripting

Recently I had to port some data from a mainframe database to a SQL-based database for testing purposes. I started with some text report files generated from the mainframe. I am fond of using unix awk and grep for this kind of data munging, and I have also used Perl and Ruby for scripting in the past. But given that I had to do this on Windows, and with my fading knowledge of Perl, I thought of getting it done with Groovy. Since Eclipse also supports Groovy, it was easy to start with.

I got something running that spits out SQL statements (using println) for every line of the input. Soon my Eclipse console started eating the output because of the console buffer size in my settings. Though the huge monolithic script worked fine, I could not get the output in a single shot; I had to rerun it in parts to get the final collective output, which slowed down tweaking the final script. Given that Eclipse didn't have much refactoring support for Groovy, I couldn't easily extract functions as I could in Java. But I was able to use a more powerful tool: define a closure and redirect the input of the println statements to a file without many changes to the original script.
    println "insert into table_name (col1, col2, col3) into values (${col1},'${col2}', ${col3})"

    def file = new File( "C:\output.txt")

    def println = { line ->  file.append(line)}

Just adding these two lines saved me a lot of time; now I can easily switch between seeing the output on the command line and capturing it in a file.

Another thing that helped me get things done quickly is the ability to refer to variables directly inside a string, as in '${col2}'. This is especially useful where I have to wrap string-typed columns in quotes, which would otherwise require endless escaping and + concatenation.

For the next script I wrote, I started with small classes rather than a single file, which made things easier to change at the last minute. Another gotcha for Groovy beginners is the use of '=='. Remember that in Groovy '==' is translated to this.equals(that) before execution (reference equality is written a.is(b)). I ran into endless self-recursive calls because I used == for reference comparison as we do in Java.

Once the script was complete there were a lot of duplicate SQL statements in the output, which cause errors due to integrity constraints in the database, so I had to find a way to remove them. On unix I would normally use `uniq` for this. Since I had to get it done quickly, I just looped through the output file, added each line to a Set, and dumped the Set back out to remove the duplicates.

Having used Perl and Ruby in the past, I know their library support is far larger than Groovy's. But given that I have been using Java for the past several years and had to work on Windows, Groovy was a life saver!

N.B. No data conversion is possible without effective use of regular expressions. I used regular expressions to format the input files before running the Groovy scripts against them, using TextPad for find/replace. The regular expression support in the Eclipse editor's find/replace tool still needs improvement before it can be really useful.

    Thursday, March 20, 2008

    # tricks in url

We all know that the # symbol in HTML is used with anchors: it marks a particular anchor within a single HTML page.
    For example in the seam doc reference (single html page) http://docs.jboss.com/seam/2.1.0.A1/reference/en/html_single/#d0e336

In the URL, #d0e336 marks the section 'Understanding the code' within the whole HTML page. If you view the source, you can see that the section is marked with an anchor named d0e336.

A URI combined with this # mark points to a particular section of the page; this helps people bookmark the page and return to exactly the same spot when they come back.

Let's get into some more interesting stuff with the # sign. Whenever you request a page with a # mark at the end, the browser sends the GET request only with the URL up to the # mark; the part that comes after the # sign is never sent to the server.
If you request http://mypage.com/page#profile, the browser sends the request as 'http://mypage.com/page', stripping off the # sign and the text after it. Once the browser loads the page, it tries to locate the anchor matching '#profile' and positions the page there. If the browser cannot find the specified anchor, it just ignores it and shows the page as it is.
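
A quick way to see the split (just a sketch using java.net.URI with the example URL above):

import java.net.URI;

public class FragmentDemo {
    public static void main(String[] args) {
        URI uri = URI.create("http://mypage.com/page#profile");
        // only scheme + host + path go to the server; the fragment stays in the browser
        System.out.println(uri.getScheme() + "://" + uri.getHost() + uri.getPath()); // http://mypage.com/page
        System.out.println(uri.getFragment());                                       // profile
    }
}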

Given that the text after the # mark concerns only the client, and that the browser simply ignores it if the anchor is missing from the markup, there are some potential uses for the # sign:

    • fancy url

    • could be potentially used to maintain client-side state!

    • generate unique bookmark-able url


    fancy url:
    http://mail.google.com/mail/#inbox
    http://mail.google.com/mail/#sent

As you can see, the server URL is just http://mail.google.com/mail/, but the #inbox shown in the browser denotes that you are in the inbox view.

    maintain client-state:

Say there are two tabs on a page and the user wants to bookmark the page along with the tab he is currently working in, so that whenever he loads the saved bookmark, the page opens with that same tab highlighted.

    You could add an identifier with the # sign on the url, and use client side javascript to parse the location and pick the identifier to determine which tab should be highlighted.

Some JavaScript libraries use this trick to generate part of the page in the browser. The iUI library, which generates iPhone-style web pages, uses exactly this: it maintains client state in the identifier and uses JavaScript to re-render part of the page as an iPhone-style mock-up.

    http://m.linkedin.com/#_home

    unique bookmark-able url:

Say you use Greasemonkey to customize a webpage, and you have set up that custom script to run for a particular URL/site. Now you want to test a new script against the same URL: you can add an identifier after the pound sign to create a unique URL and map the new script to it, so the same site is handled by different Greasemonkey scripts depending on which URL you load.

    reference:
    http://gbiv.com/protocols/uri/rfc/rfc3986.html#fragment

    Thursday, February 14, 2008

    Understanding JBoss Seam

We are currently working on a project that uses JBoss Seam extensively. The interesting and key feature of JBoss Seam is conversations. Conversations combined with Seam's bijection feature make state management in web applications slicker and cleaner.


     


At first glance you may think that Seam just provides one more scope (like REQUEST, SESSION, etc.) for state management, but it provides a lot more. If you really want to see how conversations can fix some common issues with web applications (like back-buttoning), I would highly recommend this blog by Jacob Orshalick.


     


Jacob is also co-authoring the second edition of JBoss Seam: Simplicity and Power Beyond Java EE with Michael Yuan. The second edition of their book will be released this year.


     


A preview of some chapters of this upcoming book was recently released. Even if you are already using Seam in your projects, you will definitely find this book insightful.


     


    So better understand your conversations, before you are timed-out!


     


     

    Sunday, January 20, 2008

    listening on 0.0.0.0

After you start your Tomcat/Apache HTTPD server:

just go to the command line and use the netstat -an command to check the network statistics. You might have noticed
    foobar:~ nrs$ netstat -an | grep LISTEN
    tcp46 0 0 *.8080 *.* LISTEN
    tcp4 0 0 192.168.2.101.3873 *.* LISTEN

    that the listening port is listed as either *.8080 or 0.0.0.0:8080.

Basically this means that your server is listening for connections on all the network interfaces of your machine: if you have Wi-Fi, ethernet, or a couple of virtual-machine ethernet ports configured, you can reach the server through any of those interfaces (IP addresses).

You can reach the server using 127.0.0.1 (localhost) or the IP address of any of your network interfaces. So when you write socket programming code, use 0.0.0.0 as the server host address if you want your server to be reachable through all interfaces.
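
A minimal sketch of that in Java (binding to 127.0.0.1 instead would restrict the server to local connections):

import java.net.InetSocketAddress;
import java.net.ServerSocket;

public class WildcardBindDemo {
    public static void main(String[] args) throws Exception {
        ServerSocket server = new ServerSocket();
        server.bind(new InetSocketAddress("0.0.0.0", 8080)); // wildcard address: listen on every interface
        System.out.println("Listening on " + server.getLocalSocketAddress());
        server.close();
    }
}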

You can also use the same setting to control precisely how your application can be reached. When you start the server in production or other critical environments, it is better for the server to listen on the single IP address of the interface through which the service is expected to be reached.

In the JBoss application server you can control this either in the configuration file or through the system property jboss.bind.address. This property can also take multiple comma-separated values (e.g. jboss.bind.address=127.0.0.0,232.213.232.12), which helps control precisely through which interfaces your service is accessible.
    C:\jboss-home\bin>.\run.bat -Djboss.bind.address=0.0.0.0 -c default

    Sunday, January 13, 2008

    Continuous testing w/Ant

    UPDATED - It Works

As we write code, continuous feedback helps us know how we are progressing and what we are breaking as we add functionality. A way to run unit tests as we code and save Java files would be great!

I knew there was a plugin for Eclipse, Continuous Testing from MIT, so I immediately downloaded it and tried to integrate it with my Eclipse IDE. Unluckily the plugin didn't work with the version of Eclipse I use, and there seems to be no activity in its development. So I thought about simpler ways to get this going.

After checking other options with Eclipse, I learned that you can create a task in the Ant build.xml and assign it to execute as part of the build process (clean/rebuild) from within the IDE. (My bad, I hadn't realized that you can't trigger the Ant task on every save; you can trigger it only by a manual build. I went ahead and tried it out before checking this, so below I explain how far I got.) It works!


OK, so I made a simple Ant target with the JUnit task in it. It executes all the unit tests in the project. As this is time-consuming, you would never use it, so on its own it is not worth much. We could create test suites, each representing a single unit, to be executed on every save operation. That would be the best approach, since test suites can represent a behavior/specification and so give a larger perspective on what failed, with each package having a test suite that can be triggered. But I wanted something that works with my current setup.

So I thought: how about an Ant task that figures out by itself which test cases are affected by the file I am currently working on? All I need is to find the right test cases to execute and pass them on to the JUnit task. For this we don't even need to write an Ant task; we just need to create a custom file-selector component that can be used inside any Ant fileset.

I looked for an existing tool or task that could list all source files that use a given class file, but my impatient quick search didn't turn anything up. So I looked into bytecode libraries to trace whether the current file depends on the given file, and tried Apache BCEL, since some of the core Ant tasks use the same library for bytecode engineering.

    Here is the code for the custom selector.
import java.io.File;

import org.apache.tools.ant.BuildException;
import org.apache.tools.ant.types.selectors.BaseExtendSelector;

// Uses the BCEL copy bundled inside the Sun JDK, so no extra jar is needed.
import com.sun.org.apache.bcel.internal.Repository;
import com.sun.org.apache.bcel.internal.classfile.Constant;
import com.sun.org.apache.bcel.internal.classfile.JavaClass;

public class DependentClassSelector extends BaseExtendSelector {

    private String changedClassName;

    public void setChangedClassName(String changedClassName) {
        // store in the internal 'com/foo/Bar' form used by the constant pool
        this.changedClassName = changedClassName.replace('.', '/');
    }

    @Override
    public boolean isSelected(File basedir, String filename, File file)
            throws BuildException {
        boolean testable = filename.endsWith("Test.java");

        if (testable && changedClassName != null) {
            testable = false;

            // check whether this unit test depends on the changed class
            String className = filename.replace(".java", "");
            JavaClass javaClass = Repository.lookupClass(className);
            Constant[] constants = javaClass.getConstantPool().getConstantPool();

            for (Constant constant : constants) {
                // the constant pool holds an entry like 'Lcom/foo/Bar;' for every referenced class
                if (constant != null && constant.toString().contains("L" + changedClassName + ";")) {
                    testable = true;
                    break;
                }
            }
        }
        return testable;
    }
}

After you have coded the custom selector's isSelected method, just drop it on the classpath and add the typedef lines to the build.xml:
<property environment="env"/>

<typedef name="selected-tests"
         classname="org.countme.ant.tasks.DependentClassSelector"/>

<target name="continuous_testing">
    <junit printsummary="yes" haltonfailure="yes">
        <classpath>
            <pathelement path="${classpath}"/>
            <fileset dir=".">
                <include name="**/*.jar"/>
            </fileset>
            <pathelement location="bin"/>
            <dirset dir="bin">
                <include name="**/*.class"/>
            </dirset>
        </classpath>

        <batchtest fork="yes" todir="reports/">
            <fileset dir="src">
                <selected-tests changedClassName="${env.java_type_name}"/>
            </fileset>
        </batchtest>
    </junit>
</target>

The Ant script gets the file currently open in Eclipse through the environment variable java_type_name; to get this working, you must launch the Ant script from within Eclipse. The custom selector uses this information to decide whether a test should be passed to the JUnit task or not. This works amazingly well, but it still needs improvements: for example, the code handles only one level of dependency, not the whole dependency chain. Still, it looked like a good way to start.

Test result: when I ran the script after coding and saving changes in just a single file, BusinessDomain.java, it picked up the related test cases automatically.

    Eclipse continuous testing



Since I can't get this Ant script triggered on every file save within Eclipse, I lose its value whenever I forget to run it after a save. Unfortunately the Eclipse Ant builder can't be triggered on automatic builds (i.e. when Eclipse compiles your file). If you know a way to fix this, let me know; I will be happy to use it.

Otherwise, instead of starting from the files dependent on the current file, we could run all tests that depend on the files changed since the last run. That way I would still gain something over running the whole test suite.

Please comment on any continuous testing approach that has worked for you!

    UPDATE



After spending some more time today, I was able to get this working. You should be able to set the Ant builder to run any task on auto-build (i.e. on every save). If you get a NullPointerException, you are missing some library on the classpath. Also configure the builder to export java_type_name to the environment by adding it in the Environment tab of the Ant builder configuration. I will probably post screenshots in my next post.

The other feature I thought about is increasing the chaining depth of the dependency analysis, but it would cost too much to execute test cases that are more than one step away from your modified class file.

    Of the choices between:

    • don't execute test case as you modify a class

    • execute all of them

    • execute all the dependent test cases


the most pragmatic way for me sounds like this: on every change to your class file, execute the test cases that are one step away from your class.

    Monday, January 07, 2008

    Testing Naturally, and Agile

    Behaviour-Driven Development is what you are already doing, if you are doing Test-Driven Development properly...

Test-Driven Development is used everywhere, but the term 'test' makes people think of it as something extra on top of their coding activity. That also makes it hard to convince managers of the value of unit testing. And the term 'unit testing' means different things to different people.

How BDD helps: in BDD you capture every test you write under a specification of a requirement. User stories can be translated directly into test specifications. The greatest value is this grouping of unit tests under a behavior that matches a requirement from a user story, which makes test coverage more traceable from the business perspective. That business perspective helps the management team understand the value of testing.

The BDD concept has been talked about for years, but people misunderstood it as something where they don't have to do testing. It is really the same test-driven development, which in turn makes people think BDD has nothing new.

BDD can be considered a DSL for testing. It uses concepts such as Story and Behavior (terms that are common in agile practice and object-oriented design*) to describe the test cases we write. These test cases are described using common terms, called a Specification.

Dan North first described the term and took the effort to come up with a specification framework for Java called JBehave. If you check the API you will see that it is much like JUnit, just with different test-method naming conventions: in JUnit every test method starts with the 'test' keyword, here it starts with 'should'. As I said, the API isn't very different; it's the concepts and the domain terms you use to describe the test cases that make them better. Note that from JUnit 4 onwards you don't need method names starting with 'test', and later versions of JUnit also introduce higher-level domain terms such as Theory.
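
A tiny sketch of the naming idea (ordinary JUnit 4; only the method name changes):

import static org.junit.Assert.assertEquals;

import java.util.ArrayDeque;
import java.util.Deque;

import org.junit.Test;

public class StackBehaviourTest {

    @Test
    public void shouldBeEmptyWhenCreated() {   // reads as a behaviour, not a bare testXxx name
        Deque<String> stack = new ArrayDeque<String>();
        assertEquals(0, stack.size());
    }
}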

    rSpec is a ruby testing framework which uses behavioral driven development concepts.

Lately many such common abstractions have been coming up in development testing and practices, and eventually one of them is going to become the common way of organizing test cases.

OK, coming back to the point of what made me write about this now. What I like most about Ruby, or any dynamic language, is the human-readable form they give to code. But Java still gives us the more robust JVM platform, well tested and compatible with the middleware required to support production-level, high-volume transactions. Dynamic languages are what I reach for when writing tools and scripts that speed up development. I have used Perl to write monitoring scripts that act as a bot, logging into a production system as a test user, executing standard test cases, and checking the responses to confirm the functionality of the code in production. We have even used them to load-test applications on a small scale. With the rich libraries Perl and Ruby have, you can automate any such task in a few simple steps; using Java for these is overkill.

Unit testing can also be one such activity, handled well by these languages provided there is good interoperability between them. That said, Groovy and JRuby are the natural choices for testing our Java code.

Very recently, just weeks back in fact, some projects have started from this perspective.

easyb, a project from Andrew Glover, would be a good choice given that it's built on Groovy. Go check it out yourself; I have written up some of my first-hand experiences with easyb below.

JtestR, released by Ola Bini, a ThoughtWorks employee in the UK, is a framework for coding in Java and testing in Ruby, hence JtestR. I haven't got my hands dirty with it yet; I will try it soon and write about it.

easyb: easyb has very good documentation, so I am not going to repeat how to use it here. I will just show you a sample test-case story.
package com.bdd.test;

given "new IM instance", {
    im = new com.bdd.InstantMessenger()
}

when "somebody logs in", {
    im.login()
}

then "status message should show Available", {
    ensureThat(im.getStatusMessage().equals("Available"), eq(true))
}

    Test Result:
    .
    Time: 0.568s


    Total: 1. Success!

Instead of this simple output, we could strip the closure definitions and generate something like:
    Login Story: Success

    given new IM instance
    when somebody logs in

    then status message should show Available


    Since we use groovy closures to wrap a given definition, we can pass that around the Test Story; thereby reusing blocks of code.

As you can see, this way the test report becomes more natural, and the next time a test fails even your business team can see what's failing without digging into the method for details. Of course you can cheat, but with all the good intentions of a good developer, we won't!

    LINKS:
    Google Tech Talk from rSpec developer:
    http://video.google.com/videoplay?docid=8135690990081075324
    *Object Design: Roles, Responsibilities and Collaborations:
    http://www.amazon.com/Object-Design-Responsibilities-Collaborations-Addison-Wesley/dp/0201379430

    Wednesday, December 05, 2007

    How to get friendly urls

URLs are the face of a website: they are indexed by search engines, and they are what other sites link to. If your URLs change, you lose whatever search-ranking advantage you had.

With the technology behind websites changing often, what happens to the URLs?

The item-list page of a shopping site changes like:

    www.shopping.com/shoppinglist.html

    www.shopping.com/shoppinglist.jsp

    www.shopping.com/shoppinglist.php

    www.shopping.com/shoppinglist.jsf

    www.shopping.com/shoppinglist.seam

Also, most URLs are not readable, or are too long, with parameters tacked onto the URL: http://shopping.com/list.jsp?itemid=1234

It would be more readable as: http://shopping.com/list/item/1234

The advantage here is not just readable URLs; you are also abstracting your URLs from the implementation. That protects you from technology changes, and from exposing whether a page is implemented with HTTP GET (params in the URL) or HTTP POST.

    Samples:

    http://mail.google.com/mail/#inbox

    http://mail.google.com/mail/#sent

    also you can have permalinks like http://shopping.com/deals/today

Enough about the advantages; let's see how we can get this working.

The Apache HTTP server has a module named mod_rewrite (http://httpd.apache.org/docs/1.3/mod/mod_rewrite.html) which can transform URLs by rewriting them. It can match any expression in the incoming URL and replace it with a different pattern, using regular expressions to find and replace.
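
For instance, a rule roughly like this (a sketch for the hypothetical shopping URLs above) would map the friendly URL onto the real JSP:

RewriteEngine On
RewriteRule ^/list/item/([0-9]+)$ /list.jsp?itemid=$1 [L]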

    There is a similar implementation in java, which is http://tuckey.org/urlrewrite/

These modules are efficient, but at the same time complex to learn, so you can always implement your own URL-rewriting module instead.

In web.xml you can map a servlet with url-pattern=/ so that it acts as a front controller and dispatches to the appropriate resource.
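
A bare-bones sketch of such a front controller (the class name and paths are illustrative only):

import java.io.IOException;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class RewriteFrontController extends HttpServlet {

    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        String path = req.getRequestURI().substring(req.getContextPath().length());
        if (path.startsWith("/list/item/")) {
            String itemId = path.substring("/list/item/".length());
            // forward internally, so the friendly URL stays in the browser
            req.getRequestDispatcher("/list.jsp?itemid=" + itemId).forward(req, resp);
        } else {
            resp.sendError(HttpServletResponse.SC_NOT_FOUND);
        }
    }
}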

    There are some limitations in this approach.

    to be continued..

    Friday, September 14, 2007

    Feeds and REST-ful URL Schemes

Feeds are also known as RSS. They were once an extra feature, but now they are an integral part of content sharing between sites. With the improvements to the Google Reader interface, it's getting easier to read everything in one place rather than scattering across multiple browser tabs. Furthermore, you can share your favorite links as RSS feeds, so integrating this content into your blog or sharing it with friends is easier.

All this seems like a step towards the Semantic Web.

While on the subject of feeds, full-story feeds are what I prefer, as I don't have to leave my feed reader to get the full information; having just the headline in the feed is not worth the subscription. As Scoble said, Facebook is becoming a huge aggregator, and Google is also catching up with a lot of feed integration between their apps. Their own social network (Orkut) is catching up with feeds and other Facebook-like changes, and Yahoo has a service named Pipes which helps create custom feeds out of web pages.

As the number of feeds grows, one thing I have noticed is that REST-style URLs are commonly used. This might be because a REST-style URL supports natural customization of feeds by adding words to the URL, which is more intuitive than a complex URL:

    somesite.com/feed/all

    somesite.com/feed/history

    somesite.com/feed/today

    somsite.com/feed/userName/tag

All of them use this kind of URL rather than a complex query-parameter style. Check the Google Picasa Web feeds, del.icio.us feeds, etc.

Though REST as an architectural style is not yet widely adopted, REST-style URLs are widely accepted and are solving part of the problem on the web.

    Tuesday, June 12, 2007

    Faking the performance

Read this article to see how applications fake their performance:

    http://blogs.msdn.com/oldnewthing/archive/2005/03/11/394249.aspx

By pushing part of their work into system start-up, applications slow the start-up terribly. This is evil: the user loses time even if he is never going to use that application. Applications that want this kind of performance benefit should at least consider using idle CPU cycles to do it, so the user waits less before starting his own work.

Such an application would need a process that monitors CPU usage and triggers the appropriate pre-loader program so the application starts fast. If many applications want to preload themselves, there will be that many similar processes running just to watch for free CPU cycles.

It would be easier still if the OS provided an asynchronous loader that loads registered components when it finds idle CPU time. Shifting this work from the applications to the OS would be a real benefit, unlike the current situation.

    Wednesday, March 14, 2007

    With mashups, webapps becoming legacy

It might all have started with screen scraping of legacy systems. Screen scraping is a technique used to read a legacy system's user interface and use it as the input interface for a newly developed system.

If it's for legacy systems, then why is it being applied to websites these days? With the fast growth of the application development scene, web applications themselves become legacy!

Web scraping can be described as the process of extracting a piece of information that interests you from a webpage online. Recently, significant work has been going on around what is needed to take web apps to the next level.

Web scraping is a lot easier than screen scraping legacy systems: the output of a web app is HTML, which can be represented as a DOM tree and navigated easily by machines/bots.

Yes, it's easier to navigate, but is it easier to locate an item of interest? Not really. HTML is mostly about styling: how the data should appear to the user. A page usually contains little data and a lot of styling demarcation added for presentation, like <b> for bold and <u> for underline. Beyond these, plenty of other styling code is mixed in with the actual data the webpage is showing, so it's tough for a machine to separate the data from the style information.

Greasemonkey might have been the first tool released that helps people customize a webpage on the client side. If you don't like the blue background on the MSN home page, you can change it before the page renders in your browser. It's simple in functionality, but you need to know the DOM structure (the tree representation of the web page). Later, people started posting their scripts on the web (http://userscripts.org/).

Chickenfoot is another recent tool on the rise. Writing a script with it doesn't require knowledge of the DOM representation; read my earlier post on this. I had my hands on both of these some time back.

These are just the start of the road that leads to our dream, the Semantic Web. The Semantic Web is all about adding meaning to data, which today is mingled with style information on various websites. If a consultant puts his appointment list online, a web crawler scanning it should make sense of it rather than just seeing numbers and text: it should know that this is calendar data and that it belongs to him.

A webpage is usually seen as proprietary information of the website's owner, and extracting a part of it to use elsewhere is a copyright or legal issue. But lately this outlook is changing; owners are at least willing to share, even if not for free. Websites like Google Maps, Flickr, del.icio.us, and Amazon provide an alternative API that fetches the information you would usually get only by browsing their web pages.

These alternative APIs are the way for bots to extract the data they want from a website. This is one step towards the Semantic Web, where data is presented on the web in a directly machine-readable form; here, the alternative route to the data is provided as an API service. These API calls are generally SOAP calls, as part of a web service, and the debate between REST architecture and SOAP RPC goes on. This kind of API interaction within an enterprise system, when rightly modeled and built, is called SOA (Service-Oriented Architecture).

As more websites expose their data as web services via APIs, the remix style of application came online. They were called mashups: applications formed by mixing up data from various other applications. They generally don't have data of their own; they mix data from others into a complete view.

With APIs it is easier to extract data than with the previously used method, web scraping, which depends heavily on the current structure of the site and breaks on even minor changes in layout or style.

Mashups have grown enormously now; you see new mashups forming almost every day. See the Programmable Web page: according to this source, right now there are 1668 mashup applications, 395 services available as APIs, and almost 3 mashups constructed every day.

Most of these APIs are free, but some need a paid license. Amazon requires a special license if you want to use their book search API; but if you can bring reasonable revenue to Amazon via orders placed through your site, then you can make some money too.

On the marketing front, exposing your site's data as API services definitely increases your chance of higher revenue compared with selling all of it by yourself. Say a local Chinese portal takes your global data and shows translated versions to its users: that increases your global reach. On a popular site for classical music discussions, related artists' tracks sold right there have a higher chance of selling than on a showcase site of the record company.

A site that shows books matched to a user's personal interests is more lucrative than a huge one-size-fits-all showcase site. That kind of site is now easier to build with two API services: one from a site maintaining the user's personal interests (preferences collected manually, or even automatically from the user's browsing tastes), and other API calls to the Amazon book store.

If you are planning to launch a GPS website that pinpoints your position on the globe, you don't need to build a map of the world all by yourself, which would of course be very tedious. The alternative is borrowing the map service from Google and overlaying your positions on their map.


    Sunday, October 22, 2006

    making readable urls

    urls should be readable, sharable, memorable, ...

A site's best feature is recall-ability: if you navigate all through a site and locate a resource or link, losing it again is a pain.

A site URL that shows a catalog, for example, will typically look like
catalog.jsp?pageno=5

What you see on page 5 won't be there when you visit again. This kind of URL is not sharable.

A long URL is also tough to share with others, or even to remember.

REST (Representational State Transfer) is an architectural style that stresses simple URL schemes, putting actions in the URL itself rather than in request parameters. REST has various other aspects to it, so it would be wrong to reduce it to just its URL scheme here.
    http://countme.wordpress.com/2006/10/04/affordability-pricing/

The URL above speaks for itself; most people can tell when the post was written and how the site is organised.

    Tuesday, October 03, 2006

    Google TechTalks, where are they?

For the past few months, I have been watching the tech talk videos of presentations happening at Google. It's a great thing for a company to share its privileged videos of talks by various scholars of our time from around the world, and Google has done that.

Since then, watching them has been my favorite pastime at the office.

But since August 25, 2006 there have been no updates in the tech talk section of Google Video.

I have waited over a month now, and still there are no signs of new videos... :( It's quite a loss.

    Hoping to see them back soon...

    ----------------------------------------

    Extensive list of google tech videos...

    http://video.google.com/videosearch?q=Google+engEDU

    http://video.google.com/googleplex.html

    Friday, September 15, 2006

    email address validation

Recently this post talked about Google's feature called plus-addressing.
    Gmail has an interesting quirk where you can add a plus sign (+) after your Gmail address, and it’ll still get to your inbox. It’s called plus-addressing, and it essentially gives you an unlimited number of e-mail addresses to play with.

My immediate thought was: do most sites allow '+' as a valid character in an email address? Most sites will reject such an address. If this is standard, then our applications are broken, unable to accept a valid email address.
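
For what it's worth, a lenient check (just a sketch, nowhere near RFC-compliant) that at least doesn't reject plus-addressed mailboxes could look like this:

import java.util.regex.Pattern;

public class EmailCheck {
    // deliberately lenient: letters, digits and common punctuation (including '+')
    // in the local part, then '@' and a dotted domain
    private static final Pattern SIMPLE_EMAIL =
            Pattern.compile("^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}$");

    public static boolean looksLikeEmail(String address) {
        return SIMPLE_EMAIL.matcher(address).matches();
    }

    public static void main(String[] args) {
        System.out.println(looksLikeEmail("someone+shopping@gmail.com")); // true
    }
}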

Other mail servers like Fastmail also claim to have this feature.

Somebody countered that this is not a feature at all; it's part of the RFC 2822 standard for email, a way to send comments in-line with the email address.

So what would a regular expression to validate email addresses per the RFC 2822 standard look like? I'm not sure, but according to this page, a regular expression that validates according to the earlier RFC 822 for email addressing is:
    http://www.ex-parrot.com/~pdw/Mail-RFC822-Address.html

The grammar described in RFC 822 is surprisingly complex. Implementing validation with regular expressions somewhat pushes the limits of what it is sensible to do with regular expressions, although Perl copes well:

(The full RFC 822 address-matching regular expression runs to nearly a page of dense pattern text; see the Mail::RFC822::Address page linked above for the exact expression.)


    This regular expression will only validate addresses that have had any comments stripped and replaced with whitespace (this is done by the module).

    There could be ways of breaking this single expression into smaller modules, but still this is the one :-o

    Friday, August 18, 2006

    unit testing xslts

    Some XSLT testing frameworks

    XSLTUnit
    http://xsltunit.org

    Outdated
    Tough to setup the testing environment

    Tennison’s testing framework
    http://tennison-tests.sourceforge.net/
    http://www.jenitennison.com/xslt/utilities/unit-testing/

Test cases can be written in XML itself
Easy to write test cases
Supports XPath-based testing of nodes and values
Tests are more readable than XSLTUnit
Doesn't support setting global variables and params properly

    UTF-X
    (http://utf-x.sourceforge.net/)

Test cases can be written in XML itself
Supports template generation for writing test cases
Ant task support for running tests during builds
Needs Java 1.5
Supports JUnit
Doesn't support advanced XSLT testing needs

    Juxy
    (http://juxy.tigris.org/)

Java based
Needs knowledge of Java programming
Can be integrated with JUnit
Supports setting params and global variables for the XSLT, plus other XSLT testing options
Drawback: you need to know Java to write test cases

    Recommendations:

If you are OK with writing XSLT test cases in Java, then Juxy provides the more flexible framework.

Otherwise, if you need to stick with XML-based test cases (which can be written with knowledge of XML/XSLT alone), then use the Tennison testing framework.
