It might have all started with screen scraping of legacy systems. Screen scraping is a technique used to read a legacy system's user interface and use it as an input interface to a newly developed system.
If it's for legacy systems, then why is it being applied to websites these days? With the fast growth in application development, web applications themselves become legacy!
Web scraping could be termed as the process of extracting a piece of information of interest from a webpage online. Recently, significant work has been going on to take web apps to the next level.
Web scraping is a lot easier than screen scraping of legacy systems. The output of a web app is HTML code, which can be represented as a DOM tree and navigated easily by machines/bots.
Yes, it's easier to navigate, but is it easier to locate an item of interest? Not really. HTML code is mostly about styling, that is, how the data should appear to the user. Usually a page contains less data and more styling demarcations added for presentation, like <b> for bold and <u> for underline. Beyond these, a lot of other styling code is mixed in with the actual data the webpage is showing. So it's tougher for a machine to separate data from style information.
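To give a rough idea of what a scraping bot has to do, here is a minimal Python sketch (the page snippet and the choice of tags are made up for illustration) that walks the HTML and keeps only the text buried inside styling tags:

```python
from html.parser import HTMLParser

# Hypothetical page snippet: the interesting data is buried inside styling tags.
PAGE = "<html><body><p>Next appointment: <b>2024-03-12</b> at <u>10:30</u></p></body></html>"

class DataExtractor(HTMLParser):
    """Collects the text found inside <b> and <u> tags, ignoring everything else."""
    def __init__(self):
        super().__init__()
        self.capture = False
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag in ("b", "u"):
            self.capture = True

    def handle_endtag(self, tag):
        if tag in ("b", "u"):
            self.capture = False

    def handle_data(self, data):
        if self.capture:
            self.found.append(data.strip())

parser = DataExtractor()
parser.feed(PAGE)
print(parser.found)   # ['2024-03-12', '10:30']
```

The fragility is obvious: the moment the site owner swaps <b> for a styled <span>, the extractor above returns nothing.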
GreaseMonkey might be the first tool released that helps people customize webpages on the client side. Say the next time you don't like the blue background on the MSN home page, you can change it before the website loads in your browser. It's simple in functionality, but you need to know the DOM structure (the tree representation of the web page). Later, people started posting their scripts on the web (http://userscripts.org/).
Chickenfoot is another recent tool on the rise. Writing scripts here doesn't need knowledge of the DOM representation. Read my earlier post on this. I too had my hands on both of these, some time back.
These are just the start of the road that leads to our dream (the Semantic Web). The Semantic Web is all about adding meaning to data, which today is mingled with style information on various websites. If a consultant puts his appointment list online, a web crawler scanning it should make sense of it rather than just seeing numbers and text: that it's calendar data, and that it belongs to him.
A webpage is seen as proprietary information of the website's owner. Extracting a part of it and using it elsewhere is a copyright or legal issue. But lately this outlook is changing; at least they are willing to share, even if not for free. Websites like Google Maps, Flickr, del.icio.us, and Amazon provide an alternative API that fetches the information which you'd usually get only by browsing their web pages.
These alternate APIs are the way for bots to extract the data they want out of a website. This is one step towards the Semantic Web, where data is presented on the web in a directly machine-readable form. Here an alternate route for reading the data is provided as an API service. These API calls are generally SOAP calls, as part of a web service, though the debate between the REST architecture and SOAP RPC goes on. This kind of API interaction within an enterprise system, when rightly modeled and built, is called SOA (Service Oriented Architecture).
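As a rough sketch of how a bot consumes such a service instead of scraping the page (the endpoint and parameters here are hypothetical, not any real site's API), a REST-style call is little more than an HTTP request that comes back as structured data:

```python
import json
import urllib.parse
import urllib.request

# Hypothetical REST-style endpoint; real services (Amazon, Flickr, ...) differ in
# URL, parameters and authentication, but the shape of the interaction is the same.
BASE_URL = "https://api.example.com/books/search"

def search_books(keyword):
    """Call the (hypothetical) search API and return already-structured records."""
    query = urllib.parse.urlencode({"q": keyword, "format": "json"})
    with urllib.request.urlopen(f"{BASE_URL}?{query}") as response:
        return json.loads(response.read())   # machine-readable data, no HTML to untangle

if __name__ == "__main__":
    for book in search_books("classical music"):
        print(book.get("title"), "-", book.get("author"))
```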
As more websites exposed their data as web services via APIs, the re-mix style of applications came online. They were called mashups. Mashups are applications formed by mixing data from various other applications. They generally don't have data of their own; rather, they combine data from others to form a complete view.
With APIs it is easier to extract data than with the previously used method, web scraping, which depends heavily on the current structure of the site and breaks even for minor layout or style changes.
Mashups have grown tremendously now. You can see new mashups forming almost every day. See this page, the programmable web. According to this source, right now there are 1668 mashup applications, 395 services are available as APIs, and almost 3 mashups are constructed every day.
Most of these APIs are free, but some need a paid license. Amazon requires a special license if you need to use their book search API. Still, if you can bring reasonable revenue to Amazon via orders placed through your site, then you can make some money too.
On the marketing front, exposing your site's data as API services definitely increases your chance of higher revenue compared to selling all of it by yourself. Say a local Chinese portal takes your global data and shows translated versions to its users; this increases your global reach. A popular site for classical music discussions, selling related artists' tracks right there, has a higher chance of making a sale than a showcase site of the record company.
A site that shows books listed by a user's personal interests is more lucrative than a huge common-to-all showcase site. This kind of site is now easier to build with two API services: one from the site maintaining the user's personal interests (maybe manually collected preferences, or even collected automatically from the user's web browsing tastes), and other API calls to the Amazon book store.
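As a very rough sketch of that idea (both services and their endpoints below are hypothetical stand-ins, not real APIs), the mashup itself is just glue code that joins the two answers:

```python
import json
import urllib.parse
import urllib.request

# Both endpoints are hypothetical stand-ins for the two services described above.
INTERESTS_API = "https://api.example.com/users/{user}/interests"
BOOKS_API = "https://api.example.com/books/search"

def fetch_json(url):
    """Small helper: GET a URL and decode the JSON body."""
    with urllib.request.urlopen(url) as response:
        return json.loads(response.read())

def personalised_books(user):
    """Mashup: join the user's interests with book search results for each interest."""
    interests = fetch_json(INTERESTS_API.format(user=user))      # e.g. ["jazz", "sailing"]
    shelf = {}
    for topic in interests:
        query = urllib.parse.urlencode({"q": topic})
        shelf[topic] = fetch_json(f"{BOOKS_API}?{query}")         # books matching this interest
    return shelf   # the mashup holds no data of its own, only the combined view

if __name__ == "__main__":
    print(personalised_books("alice"))
```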
If you are planning to launch a GPS website that pinpoints your position on the globe, you don't need to build the map of the world all by yourself, which is of course very tedious work. The alternative would be borrowing the map service from Google and then overlaying your positions on the map.
So mashups are fun, faster, and fruitful too. :)