Friday, February 14, 2014

Monitoring Production Deployment

Monitoring falls into two categories based on what is monitored:
  • System Monitoring 
  • Application Monitoring
From another angle, there are two types of monitoring based on how it is done:
  • Trend Monitoring
  • Event Monitoring

System Monitoring

System monitoring involves monitoring OS resources like CPU, memory, hard disk, network traffic, etc. It also involves monitoring resources used by middleware, database, load balancer, cache, etc.

Trend monitoring - It is very useful to have a graph or chart showing how a given resource was utilized over time. This helps with advance planning and with predicting potential overloads.

Event monitoring - Email or SMS alerts should be set up to fire when utilization of a resource rises above, or falls below, a set threshold.
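A threshold alert of this kind can be sketched in a few lines of Python. This is only an illustration, not the scripts we actually ran: the `Threshold` class, the `check_utilization` function, and the resource names are all hypothetical, and delivering the message by email or SMS is left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Threshold:
    """Allowed range for one resource; either bound may be left unset."""
    low: Optional[float] = None
    high: Optional[float] = None


def check_utilization(name: str, value: float, threshold: Threshold) -> Optional[str]:
    """Return an alert message if value is outside the allowed range, else None."""
    if threshold.high is not None and value > threshold.high:
        return f"ALERT {name}: {value} is above {threshold.high}"
    if threshold.low is not None and value < threshold.low:
        return f"ALERT {name}: {value} is below {threshold.low}"
    return None


# Hypothetical reading: disk is 95% full, and we alert above 90%.
message = check_utilization("disk_used_pct", 95.0, Threshold(high=90.0))
if message is not None:
    print(message)  # in practice, hand this to an email or SMS sender
```

Keeping both a lower and an upper bound in one structure reflects the point above that a resource dropping below a threshold can be as noteworthy as one exceeding it.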

Organizations typically use existing tools for system monitoring. Infrastructure-as-a-Service providers also provide their own monitoring tools. We used Cacti, Nagios, and AWS CloudWatch, and also wrote some simple bash scripts for alerts. In our case, the following system commands have been the most useful so far:
  • uptime - for load average on web servers
  • iostat - for disk I/O on database servers
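As an illustration of how the load average from uptime might feed an alert, here is a rough Python sketch (our own alerts were bash scripts, which are not reproduced here). The function names, the sample output line, and the rule of comparing the 1-minute load to the CPU count are assumptions. The regular expression targets the common "load average: a, b, c" format and may need adjusting on systems that print it differently.

```python
import re
from typing import Tuple

# Matches "load average: 4.20, 3.10, 2.05" and the "load averages: 4.20 3.10 2.05" variant.
_LOAD_RE = re.compile(r"load averages?:\s*([\d.]+),?\s+([\d.]+),?\s+([\d.]+)")


def parse_load_average(uptime_output: str) -> Tuple[float, float, float]:
    """Extract the 1, 5 and 15 minute load averages from uptime output."""
    match = _LOAD_RE.search(uptime_output)
    if match is None:
        raise ValueError("no load average found in uptime output")
    one, five, fifteen = (float(group) for group in match.groups())
    return one, five, fifteen


def is_overloaded(uptime_output: str, cpu_count: int, factor: float = 1.0) -> bool:
    """Assumed rule: overloaded when the 1-minute load exceeds cpu_count * factor."""
    one_minute, _, _ = parse_load_average(uptime_output)
    return one_minute > cpu_count * factor


# Hypothetical sample line, as uptime might print it on a Linux web server.
sample = " 10:15:01 up 12 days,  3:04,  2 users,  load average: 4.20, 3.10, 2.05"
print(parse_load_average(sample))
print(is_overloaded(sample, cpu_count=4))
```

On a live server the input string could come from `subprocess.run(["uptime"], capture_output=True, text=True).stdout`. On Unix systems `os.getloadavg()` returns the same three numbers without any parsing.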

Application Monitoring

Application monitoring involves monitoring data created in the application. It could just be the volume of data, or certain specific criteria in the data.

Trend monitoring - Email reports, and preferably web-based reports, provide a summarized view of the transactional data being created in the application. These reports can be drilled down further by certain criteria. For example, we have a report on payment attempts and payment success rate, which can be broken down further by geography. An unexpected trend calls for looking into the business logic and fine-tuning processes, UI, algorithms, etc.
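To make the payment report concrete, the following Python sketch computes a success rate per geography from a list of payment attempts. The data shape (a geography label paired with a success flag), the function name, and the sample values are made up for illustration. A real report would read from the application's database.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def success_rate_by_geography(attempts: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Map each geography to the fraction of its payment attempts that succeeded."""
    totals: Dict[str, int] = defaultdict(int)
    successes: Dict[str, int] = defaultdict(int)
    for geography, succeeded in attempts:
        totals[geography] += 1
        if succeeded:
            successes[geography] += 1
    # Every key in totals has at least one attempt, so the division is safe.
    return {geo: successes[geo] / totals[geo] for geo in totals}


# Hypothetical attempts: (geography, whether the payment succeeded).
sample_attempts = [("US", True), ("US", False), ("IN", True), ("IN", True)]
print(success_rate_by_geography(sample_attempts))
```

Running the same calculation per day or per week and charting the results would give the trend view described above.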

Event monitoring - An unexpected spike in the volume of data created, either upward or downward, should send alerts for immediate attention. A downward spike typically suggests an incorrect configuration somewhere in the application or, in our case, that a certain biller is unavailable. An upward spike may suggest a DoS attack or an automated test suite gone wrong. Both upward and downward events should be monitored.
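One simple way to detect such spikes is to compare the latest count against the average of recent periods. The Python sketch below assumes that approach. The function name, the default ratio of 2.0, and the sample counts are illustrative choices, not values taken from our system.

```python
from statistics import mean
from typing import Optional, Sequence


def detect_spike(history: Sequence[int], current: int, ratio: float = 2.0) -> Optional[str]:
    """Compare current against the mean of history.

    Returns "up" if current exceeds the mean by more than the given ratio,
    "down" if it falls below the mean divided by the ratio, else None.
    The ratio is assumed to be greater than 1.
    """
    if not history:
        return None  # nothing to compare against yet
    baseline = mean(history)
    if baseline == 0:
        # Any activity after a completely silent baseline counts as upward.
        return "up" if current > 0 else None
    if current > baseline * ratio:
        return "up"
    if current < baseline / ratio:
        return "down"
    return None


# Hypothetical hourly payment counts, followed by the latest hour.
recent_hours = [100, 110, 90]
print(detect_spike(recent_hours, 250))
print(detect_spike(recent_hours, 40))
```

Checking both directions in one function matches the point that upward and downward events both deserve an alert.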

Application monitoring is particularly crucial immediately after a release, to quickly catch issues that slipped through QA cycles.
