Friday, May 20, 2016

On REST APIs

If I join a company that has standardized on REST-style APIs, I will be a good citizen and follow the established standards.

However, I am not a big fan of REST APIs, and if starting from scratch, I will advocate against them. My reasons are as follows:

First, four verbs are not descriptive enough in all cases. They can be made to work, but it is often like hammering a square peg into a round hole. For example, consider actions like "assign", "approve", "merge", etc.

Second, I need to expose the object model in the URL path; I may not always want to do that, or the object model may change.

Third, I may want to combine multiple actions into a single request, for performance reasons or simply to reduce work. Examples: createOrUpdate (update if the object exists, otherwise create it) and createAndGet (create an object and return it with its backend-generated properties).
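
To make this concrete, here is a minimal sketch of an action-oriented (non-REST) endpoint that combines create and update and returns the stored object, assuming Flask and a hypothetical in-memory order store; the path and field names are illustrative, not a prescribed convention.

from flask import Flask, request, jsonify

app = Flask(__name__)
ORDERS = {}  # hypothetical store keyed by a client-supplied order id

@app.route("/orders/createOrUpdate", methods=["POST"])
def create_or_update_order():
    payload = request.get_json()
    order_id = payload["id"]
    created = order_id not in ORDERS
    # Update if the order exists, otherwise create it - one round trip.
    ORDERS.setdefault(order_id, {}).update(payload)
    # Return the stored object so the client also receives any
    # backend-generated properties, covering the createAndGet case too.
    return jsonify({"order": ORDERS[order_id], "created": created})

if __name__ == "__main__":
    app.run(port=5000)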

Last, HTTP status codes contain a lot of junk and most often are not exactly what I need. Status codes are too application-specific to be covered by a single generic list.

Friday, June 12, 2015

Analytics

Analytics can be hard, not only technically, but also operationally, in terms of quality, and in terms of usefulness.

  • Technically
    • making sure the right data is collected at the right time - no data missed and no duplicates
    • making sure the right tools and techniques are used for summarization
  • Operationally
    • making sure data is collected from all sources and nothing is missed or truncated
    • making sure summarization scripts run at the proper time on the proper data
    • making sure re-summarization can be performed on past data in case of any errors
  • Qualitatively
    • making sure results are correct - a big challenge when the data set is large
  • Usefulness
    • making sure results are actually useful to the users who need them, and not just an overload of numbers and graphs
Let's dissect this further from different angles:

What are the different types of analytics

  • Absolute numbers
    • like unique visitors to a site
  • Conversion funnels
    • like, among unique visitors, how many attempted to purchase an item, and among those, how many actually purchased
  • Cohort analysis
    • how many users who visited last month also visited this month
    • how many users who made a purchase attempt last month purchased the item this month
  • etc.
Cohort analysis requires a "self-join"-type operation and can be very time consuming; I will not discuss it in depth in this post.
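
Just for illustration, the kind of query involved might look like the sketch below, using an in-memory SQLite table; the visits table and its columns are assumptions.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (user_id INTEGER, visit_month TEXT)")
conn.executemany(
    "INSERT INTO visits VALUES (?, ?)",
    [(1, "2015-05"), (2, "2015-05"), (1, "2015-06"), (3, "2015-06")],
)

# Self-join: visitors seen last month who also appear this month.
returning = conn.execute(
    """
    SELECT COUNT(DISTINCT prev.user_id)
    FROM visits prev
    JOIN visits curr ON curr.user_id = prev.user_id
    WHERE prev.visit_month = '2015-05' AND curr.visit_month = '2015-06'
    """
).fetchone()[0]
print(returning)  # -> 1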

Report types


  • Time series graphs
    • Absolute numbers and conversion funnels are relatively simple, and both can be represented as time-series graphs, with absolute numbers having a single line and a conversion funnel having multiple lines, one per metric.
  • Histograms
    • Histograms can be useful in some cases, for example to show how a metric is distributed rather than how it trends over time.
  • Pie charts, funnels, etc.
    • For visual impact, pie charts and funnel charts are useful.

Filters


Almost always, the reports need to be filtered by some criteria, for example the number of visitors broken down by country.

Filters are too important to ignore, but too many filters, with too many values for each filter, cause the summarized data to bloat.

Pre-define the combinations of filters for each report; supporting all possible combinations is not at all useful.
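
One way to enforce this is a small whitelist of filter combinations per report, roughly like the sketch below; the report and filter names are made up for illustration.

# Pre-defined filter combinations per report; anything else is rejected.
REPORT_FILTERS = {
    "unique_visitors": [("country",), ("country", "device")],
    "purchase_funnel": [("country",), ("traffic_source",)],
}

def is_allowed(report, filters):
    """Accept only filter combinations that were agreed upon up front."""
    allowed = {tuple(sorted(combo)) for combo in REPORT_FILTERS.get(report, [])}
    return tuple(sorted(filters)) in allowed

print(is_allowed("unique_visitors", ["device", "country"]))  # True
print(is_allowed("unique_visitors", ["browser"]))            # False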

Collection: log file or database


Some data, like HTTP requests, is automatically collected in log files, but I am assuming that is not enough and that the application needs to proactively log additional data.

I am also assuming that the database is sharded if needed, and that it is possible to log either to the database or to the file system.

Transactional data involving money stays in the primary transactional database, and summarizations are run against a replicated slave of the primary database. This works pretty well.

Other types of data, like user activities, ad impressions, or product impressions, are very high volume and can be logged either in log files or in the database. Logging in the database has some great advantages (sketched after the list):
  • Ability to run SQL, in particular "group by" queries in relational databases; obviously not for petabyte-scale data
  • Ability to go back and re-summarize data for a specific time interval
  • Ability to go back and update records from the application when a specific event happens for a user
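
For a rough idea, these advantages translate into queries like the sketches below, assuming a hypothetical activity_log table with user_id, event, created_at, and purchased columns; the schema is illustrative only.

# "Group by" summarization directly in SQL.
DAILY_SUMMARY = """
SELECT DATE(created_at) AS day, event, COUNT(*) AS cnt
FROM activity_log
GROUP BY DATE(created_at), event
"""

# Re-summarize a specific time interval after fixing an error.
RESUMMARIZE_INTERVAL = """
SELECT DATE(created_at) AS day, event, COUNT(*) AS cnt
FROM activity_log
WHERE created_at BETWEEN %(start)s AND %(end)s
GROUP BY DATE(created_at), event
"""

# Update an existing log record when a later event happens for the user.
MARK_PURCHASED = """
UPDATE activity_log
SET purchased = 1
WHERE user_id = %(user_id)s AND event = 'product_impression'
"""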

Summarization techniques: real time or batched


If possible, perform real-time processing, with the ability to re-summarize if needed. Besides being real time, this has a huge operational benefit: no need to store or transfer data from one system to another.

In general, the fewer the steps for transferring, transforming and summarizing data, the lower the chances of errors.
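
A minimal sketch of the real-time approach, assuming events fit in memory for the example: counters are updated as events arrive, but the raw events are kept so any hour can be rebuilt; all names are illustrative.

from collections import defaultdict
from datetime import datetime

RAW_EVENTS = []                   # raw log kept for re-summarization
HOURLY_COUNTS = defaultdict(int)  # real-time summary

def record(event, ts=None):
    ts = ts or datetime.utcnow()
    RAW_EVENTS.append((ts, event))
    HOURLY_COUNTS[(ts.strftime("%Y-%m-%d %H"), event)] += 1  # real time

def resummarize(hour):
    """Rebuild one hour's counts from the raw events after a correction."""
    counts = defaultdict(int)
    for ts, event in RAW_EVENTS:
        if ts.strftime("%Y-%m-%d %H") == hour:
            counts[(hour, event)] += 1
    HOURLY_COUNTS.update(counts)
    return dict(counts)

record("page_view")
print(resummarize(datetime.utcnow().strftime("%Y-%m-%d %H")))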

Ensuring quality


A few ways to ensure correctness in reports:
  • generate reports in two different ways and then compare them
  • compare reports from two different dates; there should not be huge discrepancies unless there are known reasons
  • sample the actual data
We generate reports both from the database and from the log files, which gives us the best possible way to ensure correctness. Whenever we find a discrepancy, guess which method has some issues (always).
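
A minimal sketch of the first check, comparing a database-derived report with a log-derived one and flagging metrics that disagree beyond a small tolerance; the report dictionaries and the tolerance are illustrative.

def compare_reports(db_report, log_report, tolerance=0.01):
    """Return metrics whose two values differ by more than the tolerance."""
    issues = []
    for metric in set(db_report) | set(log_report):
        a, b = db_report.get(metric, 0), log_report.get(metric, 0)
        if max(a, b) and abs(a - b) / max(a, b) > tolerance:
            issues.append((metric, a, b))
    return issues

db_report = {"unique_visitors": 10120, "purchases": 342}
log_report = {"unique_visitors": 10118, "purchases": 298}
print(compare_reports(db_report, log_report))  # [('purchases', 342, 298)]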

Summary

Proposing to use databases for logging can raise eyebrows and even hostility. But some of the reporting requirements we received were so much easier to meet when databases were used for logging.

Going back to the original four points discussed:
  • Technical
    • if possible, log into NoSQL databases dedicated to logging, multiple if needed
      • integrity constraints can be enforced - no duplicates
      • ability to update records and add attribute values to the same record
      • ability to re-summarize specific intervals
    • if possible, log into relational databases dedicated to logging, multiple if needed
      • develop scripts in SQL in minutes rather than hours or days
    • if neither is possible, log into the file system and use map-reduce techniques to extract summaries
  • Operational
    • eliminate data transfers by logging directly into the target destination, and reduce data transforms to a single summarization step
    • can also log directly into a reliable external service like Amazon Kinesis
  • Qualitative
    • compare reports generated locally with those generated by an external service
  • Usefulness
    • target 5 to 10 meaningful reports with pre-defined filters, after long discussions with product managers, and cut the rest of the crap

Friday, October 24, 2014

Experiences with a Large Table

We recently gained some experience dealing with one of our large tables. It is not really at big-data scale; we use MySQL for storage and run SQL queries, so we don't use NoSQL databases.

The table is growing at the rate of 2.6 million records per day. Row size is about 2.3KB. At this rate the table grows by 6GB per day, or 180GB per month.

In one year, it would grow beyond 2TB, provided traffic remains constant, which it actually won't.

Now, this is not one of the primary tables that we deal with every day. The table is important, but we don't want to invest too much time and resources managing it. We have an SSD drive of 1.6TB, and want to avoid a server or disk addition/upgrade at this point.

What we did was reduce the row size to about 350 bytes by eliminating a bunch of diagnostic data that was being inserted. We moved the diagnostic data to log files and kept the data required for queries and analytics in the table.

With this change, the table would grow by 900MB per day, or 27GB per month, or 330GB per year.

With the 1.6TB SSD drive, we won't have to worry about disk space for 2-3 years.

A few points to remember when dealing with large tables (a minimal DDL sketch follows the list):

  • Always have an auto-increment numeric primary key. Even if the table has a UUID or other types of unique keys, an auto-increment numeric primary key helps with administration.
  • Always have a create-date column, which is of course a no-brainer, and index it. Archival or deletion will require the create-date column almost all the time.
  • As learned from the experience above, keep rows small and do not add columns that can be avoided.
  • Allocate enough disk space to take full backups.
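
A minimal DDL sketch reflecting the points above (auto-increment primary key, indexed create-date column, lean rows); the table and columns are only an example, shown as a MySQL statement held in a Python string.

CREATE_ACTIVITY_TABLE = """
CREATE TABLE activity (
    id          BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    user_id     BIGINT UNSIGNED NOT NULL,
    event_type  VARCHAR(32)     NOT NULL,
    created_at  DATETIME        NOT NULL,
    PRIMARY KEY (id),
    KEY idx_created_at (created_at)
) ENGINE=InnoDB
"""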

Friday, February 28, 2014

When To Start Partitioning Your Database

As traffic volume grows, the load on the primary transactional database becomes the main bottleneck, in particular disk I/O. In a 3- or 4-tier architecture, if the middle tiers are stateless, they can be scaled horizontally pretty easily. The main transactional database can also be horizontally or vertically partitioned and scaled, but that is not as easy.

We recently partitioned the database vertically and put two fairly independent modules into two separate databases, but it took considerable time and effort from the entire team and forced us to postpone certain other projects. Horizontal partitioning, or sharding, which we had done once before, also takes time to set up and monitor, although typically the entire team does not get involved.

Engineers are trigger-happy when talking about sharding, but for a small team, it is better to postpone such tasks as long as possible and focus on adding application functionality. One obvious alternative is to use higher-performance systems. We have started using AWS storage-optimized I2 instances and got amazing I/O throughput. From their website:

"I2 and HI1 instances provide very fast SSD-backed instance storage optimized for very high random I/O performance, and provide high IOPS at a low cost. We recommend using I2 and HI1 instances for NoSQL databases like Cassandra and MongoDB which benefit from very high random I/O performance and the low request latency of direct-attached SSDs, and are an excellent choice for transactional systems that can manage their own resiliency."

However, there is a limit on how long we can keep scaling with higher-performance systems, and it is not the IOPS limit. It is the backup.

Replication is a great way to back up databases, but alas it is not enough. If an unwanted delete or truncate or an incorrect update gets executed on the master database by mistake, it propagates to all slave databases, and the original data cannot be recovered. That is why point-in-time backup and recovery are important: they let us go back one hour or one day or whatever to get a correct snapshot of the data.

Backup strategy is another topic for another day, but suffice it to say that we periodically need to take a full backup of the database. That, combined with incremental or differential backups, lets us go back in time to get a correct snapshot of the data.

Based on how far we go back in time, some data loss could happen: if we go back one hour, one hour's data could be lost; if we go back one day, one day's data could be lost. (Hopefully all or some of the lost data can be recovered from replicated servers.)

How much data we are willing to lose in the worst case is directly dependent on how long it takes to take a full backup of the database. That really puts a cap on how far we can scale the database on higher-performance systems. After that, we need to start partitioning.

We need to decide how much data we are willing to lose in the worst case if something bad happens. As an example, let's say we decide on 6 hours. Then the full backup should complete well within 6 hours. On an m1.xlarge AWS instance, a MySQL database of 200GB took about 6 hours to complete a full backup, so the database needs to be partitioned as its size approaches 200GB.

The above is just an example. On a high-I/O system, backups are much faster, and in our case, 1TB is the cutoff for now.
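
The back-of-the-envelope arithmetic looks like this; the throughput figure comes from the m1.xlarge example above and would be re-measured on any other system.

backup_gb_per_hour = 200 / 6           # ~33 GB/hour observed on m1.xlarge
acceptable_window_hours = 6            # worst-case data loss we accept
max_db_size_gb = backup_gb_per_hour * acceptable_window_hours
print(round(max_db_size_gb))           # ~200 GB before partitioning is due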

Correction: Actually, the time required for a full backup need not be the cap on the database size. If the database is being replicated to multiple slaves, backups can be taken on each slave and strategically scheduled so that the data-loss window is brought down.

Friday, February 14, 2014

Monitoring Production Deployment

Monitoring can be categorized into two types based on what is monitored:
  • System Monitoring 
  • Application Monitoring
From another angle, monitoring is of two types based on how it is done:
  • Trend Monitoring
  • Event Monitoring

System Monitoring

System monitoring involves monitoring OS resources like CPU, memory, hard disk, network traffic, etc. It also involves monitoring resources used by middleware, database, load balancer, cache, etc.

Trend monitoring - It is very useful and important to have a graph or chart illustrating how a certain resource was utilized over time. This helps with capacity planning and predicting potential overloads.

Event monitoring - Email or SMS alerts need to be set up if utilization of a certain resource rises above, or falls below, a certain threshold.

Organizations typically use existing tools for system monitoring. Infrastructure-as-a-Service providers also provide their own monitoring tools. We used Cacti, Nagios, AWS CloudWatch, and also wrote some simple bash scripts for alerts. In our case, the following system commands have been the most useful so far (a sketch of such an alert check follows the list):
  • uptime - for load average on web servers
  • iostat - for disk I/O on database servers
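
Our alert scripts were in bash, but the idea is roughly the sketch below (Unix-only, using Python's load-average call); the threshold and the notification step are placeholders.

import os

LOAD_THRESHOLD = 8.0  # illustrative; tune per machine

def check_load():
    one_min, five_min, fifteen_min = os.getloadavg()
    if one_min > LOAD_THRESHOLD:
        # In practice this would send an email or SMS alert.
        print(f"ALERT: load average {one_min:.2f} exceeds {LOAD_THRESHOLD}")
    return one_min

check_load()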

Application Monitoring

Application monitoring involves monitoring data created in the application. It could just be the volume of data, or certain specific criteria in the data.

Trend monitoring - Email reports, and preferably web-based reports, provide a summarized view of the transactional data being created in the application. These reports can be further drilled down into by certain criteria. For example, we have a payment attempts and success rate report, which can further be dissected by geography. An unexpected trend requires looking into business logic and fine-tuning processes, UI, algorithms, etc.

Event monitoring - An unexpected spike in data created, either upward or downward, should send alerts for immediate attention. A downward spike typically suggests an incorrect configuration somewhere in the application, or, in our case, that a certain biller is unavailable. An upward spike may suggest a DOS attack, or an automated test suite gone wrong. Both upward and downward events should be monitored.
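
A minimal sketch of such a check: compare the current hour's count against the trailing average and flag large deviations in either direction; the ratio thresholds are illustrative.

def detect_spike(current_count, trailing_counts, low=0.5, high=2.0):
    """Return the kind of spike, if any, relative to the trailing average."""
    if not trailing_counts:
        return None
    baseline = sum(trailing_counts) / len(trailing_counts)
    if current_count < low * baseline:
        return "downward spike"   # e.g. misconfiguration, biller unavailable
    if current_count > high * baseline:
        return "upward spike"     # e.g. DOS attack, runaway test suite
    return None

print(detect_spike(40, [100, 110, 95]))   # downward spike
print(detect_spike(260, [100, 110, 95]))  # upward spike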

Application monitoring is particularly crucial immediately after a release, to quickly catch issues that were not caught in QA cycles.

Friday, December 6, 2013

Load Testing

This is a simplified recipe for load testing a web-based app like a web store.

There are three stages in load testing:
  1. Plan Test and Prepare Environment
  2. Run Test and Measure Variables
  3. Extract Results and Produce Reports

Plan Test and Prepare Environment

The load testing environment should mimic the production environment closely. Instance sizes and network topology should match. It is not practical to have 100-node clusters for load testing, so 2-4 nodes per cluster is okay.

Instead of having one client issue all the HTTP requests, it is ideal to have multiple clients.

Request "Mix"

  • Identify all URLs that will be accessed by the load test. If the system is already live, use the access log to identify the URLs.
  • Identify what percentage of users will access which URLs (the request "mix"). Once again, the percentages are best derived from the access log.
  • URLs should be accessed by different users or user sessions, typically identified by browser cookies. A user can access a sequence of URLs that represents a page flow, but the same user should not access the flow multiple times. A sketch of driving such a mix follows the list.
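
A sketch of driving the mix, assuming the requests library: URLs are chosen according to the agreed percentages and each simulated user gets its own session (and therefore its own cookies); the URLs, weights, and target address are illustrative.

import random
import requests

MIX = [("/home", 50), ("/product", 30), ("/cart", 15), ("/checkout", 5)]
BASE_URL = "http://loadtest-target.example.com"  # placeholder target

def simulate_user(num_requests=10):
    session = requests.Session()  # one session -> one set of cookies
    urls, weights = zip(*MIX)
    for _ in range(num_requests):
        path = random.choices(urls, weights=weights)[0]
        session.get(BASE_URL + path, timeout=10)

simulate_user()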

Data Volume

  • Data volume should closely match production data
  • Transaction, session, and user data should be in the millions to properly test data access

Run Test and Measure Variables

As mentioned above, and repeated here for its importance, the URLs should not all be accessed by the same user session.

"Runs"

  • Clients should access the identified URLs simultaneously at the identified percentages (request "mix")
  • Start with a number of requests per second that the system can easily handle, say 10 requests per second. Let it run for a certain duration, say 30 minutes.
  • Reset everything - delete data if too much new data was created, restart systems, etc.
  • Run with an increased number of requests per second, say 12 requests per second, for 30 minutes.
  • Repeat, gradually increasing the number of requests per second until the system fails, say at 50 requests per second. A sketch of this step-up schedule follows the list.
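
A sketch of the step-up schedule: hold each request rate for a fixed duration, then reset and move to the next level; send_request is a placeholder for whatever actually issues the HTTP calls.

import time

def run_level(requests_per_second, duration_seconds, send_request):
    """Issue roughly requests_per_second calls per second for the duration."""
    end = time.time() + duration_seconds
    while time.time() < end:
        start = time.time()
        for _ in range(requests_per_second):
            send_request()
        # Sleep off the remainder of the second to hold the target rate.
        time.sleep(max(0.0, 1.0 - (time.time() - start)))

for rps in range(10, 52, 2):              # 10, 12, ... up to 50 rps
    run_level(rps, 30 * 60, send_request=lambda: None)  # placeholder sender
    # Reset data and restart systems here before the next, higher level.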

Measurements

  • Ignore the first 5 minutes of each 30-minute run as warm-up time
  • For the last 25 minutes, measure the average response time and throughput of the requests
  • For each server the request goes through – load balancer, web server, database server – measure average CPU load, memory utilization, disk I/O, and network traffic
  • For database servers, measure the average response time of database accesses

Extract Results and Produce Reports

  • Produce graphs of response time, throughput, CPU load, disk I/O, etc. against the number of requests per second

Friday, October 18, 2013

Managing Large Transactional Tables Part I of III

As transactional tables grow, it becomes increasingly difficult to perform DDL operations on them, like adding columns or indexes. These operations take hours, and various gymnastics have to be performed to ensure the tables are not locked and the application remains available. I will cover the "gymnastics" in Part II; this post covers sharding strategies.

Sharding is the primary scalability strategy of the Internet giants, but it should be implemented by smaller applications as well for operational efficiency. There are excellent articles on sharding available on the Internet, like this one: http://www.codefutures.com/database-sharding/.

In summary, sharding involves using multiple database instances, where the reference data is replicated in all instances and the large transactional tables are horizontally partitioned across instances.

Reference data lives in smallish tables, not exceeding 100k rows per table and typically much less - in the hundreds or thousands. Transactional data lives in tables that grow to tens or hundreds of millions of rows and keep growing day by day.

The key deciding factor is the criterion by which the transactional tables are partitioned. The key constraint is that no SQL query should span database instances: the partitioned data, together with the reference data in an instance, should be self-contained enough to run the application.

In our case, a multi-tenant SaaS application, the decision is simple: the tenant account is the deciding criterion for partitioning. Each database instance holds transactional data for a group of tenant accounts, and the grouping is currently based on geography.
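
A sketch of routing by tenant account: each tenant maps to exactly one database instance, so every query stays within a single shard; the mapping and host names are illustrative.

SHARD_BY_TENANT = {
    "tenant_emea_001": "db-emea-1",
    "tenant_emea_002": "db-emea-1",
    "tenant_apac_001": "db-apac-1",
    "tenant_na_001":   "db-na-1",
}

def shard_for(tenant_account):
    """Return the database host holding this tenant's transactional data."""
    try:
        return SHARD_BY_TENANT[tenant_account]
    except KeyError:
        raise LookupError(f"no shard assigned to {tenant_account}")

print(shard_for("tenant_apac_001"))  # db-apac-1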

Currently we are targeting about 10-20 million rows per table per database instance. By the way, we do need certain relational database features in our application, like joins (both inner and outer), aggregation, sorting and searching on different columns, sub-queries, etc., so NoSQL databases are not an option.