Tuesday, December 14, 2010

Web bug server & log analyzer

I got a task to build a web bug system and then analyze the saved information.

What is a web bug? In short, it's a tiny (usually invisible 1x1 pixel) image embedded in a page or email; every time that content is viewed, the browser requests the image from our server, and that request is what we log. See what Wikipedia has to say about it: http://en.wikipedia.org/wiki/Web_bug

While this might seem like an easy task, there were a few interesting challenges:

  • Enormous amount of data to handle (tens of millions of requests every day)

  • Make the system fully scalable

  • Make the system redundant

  • Implementation of the saving mechanism

  • Track and analyze the saving mechanism

  • Archive the data for a long period of time

  • Analyze where requests are coming from (countries)

  • Count unique users

  • Phew, is there some time for coffee???




So, after getting the requirements, scratching the head, making coffee (it appears there's always time for coffee), scratching the head some more, drawing some weird boxes on the whiteboard while mumbling some buzz words, scratching the head again and then sitting down to think, it's time for some decisions.

Starting with the seemingly easy decisions:
* How to save the data? - just throw it into the database
* Archive? - take the saved data and put it in some data store
* Where are requests coming from? - find a GEO IP service and use it while analyzing the data
* Unique users? - use cookies (see the sketch after this list)
* Coffee? - kitchen --> coffee machine
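
Just to make the cookie decision a bit more concrete, here is a rough sketch of how a visitor id could be assigned in a WSGI-style handler; the cookie name and lifetime below are placeholders, not anything from the real system:

import uuid
from http.cookies import SimpleCookie

def get_or_set_user_id(environ, response_headers):
    # Return the visitor id from an existing cookie, or mint a new one.
    cookies = SimpleCookie(environ.get("HTTP_COOKIE", ""))
    if "wb_uid" in cookies:
        return cookies["wb_uid"].value
    user_id = uuid.uuid4().hex
    # Long-lived cookie so the same browser is counted once across visits
    response_headers.append(
        ("Set-Cookie", "wb_uid=%s; Max-Age=31536000; Path=/" % user_id))
    return user_id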


Design #1 - Simple
1 web server
1 database server

Get all the traffic to the web server, where server-side code (PHP, Python, .Net, etc.) reads each request and saves it in the database.
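
As a rough illustration of what that server-side code looks like, here is a minimal sketch, assuming Python with nothing but the standard library, a local SQLite file standing in for the real database, and a made-up hits table:

import sqlite3
from datetime import datetime, timezone
from wsgiref.simple_server import make_server

# 1x1 transparent GIF returned as the web bug image
PIXEL = (b"GIF89a\x01\x00\x01\x00\x80\x00\x00\x00\x00\x00\xff\xff\xff"
         b"\x21\xf9\x04\x01\x00\x00\x00\x00"
         b"\x2c\x00\x00\x00\x00\x01\x00\x01\x00\x00\x02\x02\x44\x01\x00\x3b")

db = sqlite3.connect("webbug.db")
db.execute("""CREATE TABLE IF NOT EXISTS hits
              (ts TEXT, ip TEXT, user_agent TEXT, referer TEXT, query TEXT)""")
db.commit()

def app(environ, start_response):
    # Save one row per request - this is the part that melts down
    # under tens of millions of requests per day.
    db.execute("INSERT INTO hits VALUES (?, ?, ?, ?, ?)", (
        datetime.now(timezone.utc).isoformat(),
        environ.get("REMOTE_ADDR", ""),
        environ.get("HTTP_USER_AGENT", ""),
        environ.get("HTTP_REFERER", ""),
        environ.get("QUERY_STRING", "")))
    db.commit()
    start_response("200 OK", [("Content-Type", "image/gif"),
                              ("Content-Length", str(len(PIXEL)))])
    return [PIXEL]

if __name__ == "__main__":
    make_server("", 8080, app).serve_forever()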

Have a daily job which will get all the saved data and analyze it.
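
The daily job for this design can be little more than one aggregation query over the raw table (again just a sketch against the made-up hits table above) - and this GROUP BY over tens of millions of raw rows is exactly the part that takes eons:

import sqlite3

db = sqlite3.connect("webbug.db")
rows = db.execute("""
    SELECT substr(ts, 1, 10)   AS day,        -- YYYY-MM-DD
           COUNT(*)            AS requests,
           COUNT(DISTINCT ip)  AS unique_ips  -- crude stand-in for unique users
    FROM hits
    GROUP BY day
    ORDER BY day""").fetchall()

for day, requests, unique_ips in rows:
    print(day, requests, unique_ips)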

Pros:
* very easy to implement
* very short time to develop

Cons:
* Not an effective way to handle an enormous amount of requests -
analyzing tens of millions of raw records on an RDBMS takes eons!
* Not scalable and not redundant

Conclusion:
While it might be a very nice start for a small system, this solution fails the requirement of analyzing an enormous amount of data.
While testing this design, it took the web server less than 30 seconds to crash.



Back to the drawing board: draw more boxes, mumble a bit more and figure out how to improve the first idea. New decisions to make:
* Make it scalable - use more than one web server with load balancing
* Redundancy? - log the web requests. If data doesn't make it into the database, we can recover it from the log files.

Design #2 - Simple++
3 web servers
1 database server

Use DNS load balancing to split the traffic between the web servers. This solution makes scaling up very easy: we only need to put up another server and register it with the DNS system.
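
Under the hood this just means publishing several A records for the same hostname; a quick way to see the round-robin answer from the client side (the hostname below is made up):

import socket

# A load-balanced name resolves to all the web servers behind it;
# clients spread their requests across the returned addresses.
_, _, addresses = socket.gethostbyname_ex("bugs.example.com")
print(addresses)   # e.g. ['10.0.0.1', '10.0.0.2', '10.0.0.3']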

Turn on the web server's logging and the rest is like the first design:
Get all the traffic to the web servers, where server-side code (PHP, Python, .Net, etc.) reads each request and saves it in the database.

Have a daily job which will get all the saved data and analyze it.

Pros:
* very easy to implement
* very short time to develop
* Scalable
* Redundant

Cons:
* need some basic knowledge of how to set up DNS load balancing
* still doesn't address the analysis time, which still takes eons on an RDBMS

Conclusion:
We took one step forward in getting our system to work. The servers were able to withstand the amount of requests and even had good response times. But we are still stuck with the analysis time.




OK, so we really need to address this analysis time. New decisions to make:
* Improve analysis time - solution = reduce, reduce, reduce... yes, reduce wherever we can.

Design #3 - Smart Reduce
2 Web servers
1 Analysis server
1 database server

This design tries to reduce the data as early as possible while maintaining scalability and redundancy.
* Use the web server's log files instead of saving the requests directly to the database - we reduced the processing time for handling each request

* Add a module to the web server to track the users (with cookies) - we reduced the need for writing any code

* Add a module to the web server to analyze the request and get the GEO location - we reduced the need to do it when analyzing the data

* Limit every log file to a short amount of time (minutes)

* When a log file is done (the web server has moved on to a newer log file), ship it to an analysis queue on the analysis server - we reduced the amount of data which needs to be analyzed at a time, and if a failure occurs, it will only affect a small portion of the data. If we track this right, we can easily find the problem later and reanalyze the data (see the sketch after this list)

* Save the analyzed data to the database

* Have a daily job which will get all the saved data and summarize it if necessary
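
To tie these steps together, here is a condensed sketch of what the analysis server could do with each shipped log file. Everything in it is illustrative: it assumes the web server modules already wrote a tab-separated line of timestamp, cookie id and country, and traffic_summary is a made-up table name:

import sqlite3
from collections import defaultdict

def analyze_log(path, db):
    per_country_hits = defaultdict(int)
    per_country_users = defaultdict(set)

    with open(path, encoding="utf-8") as log:
        for line in log:
            _ts, cookie_id, country = line.rstrip("\n").split("\t")
            per_country_hits[country] += 1
            per_country_users[country].add(cookie_id)

    # Only the small, already-reduced summary reaches the database
    for country, hits in per_country_hits.items():
        db.execute(
            "INSERT INTO traffic_summary (log_file, country, hits, uniques) "
            "VALUES (?, ?, ?, ?)",
            (path, country, hits, len(per_country_users[country])))
    db.commit()

db = sqlite3.connect("webbug.db")
db.execute("""CREATE TABLE IF NOT EXISTS traffic_summary
              (log_file TEXT, country TEXT, hits INTEGER, uniques INTEGER)""")
analyze_log("access-20101214-1035.log", db)   # one short-interval log file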

Pros:
* Scalable
* Redundant
* More cost effective: fewer web servers can handle more requests
* Analysis time improvement: we can even have partially analyzed data during the day

Cons:
* Harder to implement
* Takes more time to develop


Conclusion:
This design works and meets the requirements. Fewer web servers are needed to handle the requests (first tests without accelerators showed the ability to answer 10 times more requests per server in this design than using server-side code).

We can always improve this design by replacing the DNS load balancing with a more robust load balancing solution, so that even if one server is down, its traffic will be transferred to the other web servers. And we can probably find smarter ways to reduce the processing of data in each layer.




More about archiving, tracking and making coffee later on.....
