Friday, June 12, 2015

Analytics

Analytics can be hard, not only technically, but also operationally, qualitatively, and in terms of usefulness.

  • Technically
    • making sure the right data is collected at the right time - nothing is missed and nothing is duplicated
    • making sure the right tools and techniques are used for summarization
  • Operationally
    • making sure data is collected from all sources and nothing is missed or truncated
    • making sure summarization scripts run at the proper time on the proper data
    • making sure re-summarizations can be performed on past data in case of any errors
  • Qualitatively
    • making sure results are correct - a big challenge when the data set is large
  • Usefulness
    • making sure results are actually useful to the users who need them, not just an overload of numbers and graphs
Let's dissect this further from different angles:

What are the different types of analytics?

  • Absolute numbers
    • like unique visitors to a site
  • Conversion funnels
    • like, among unique visitors, how many attempted to purchase an item, and among those, how many actually purchased
  • Cohort Analysis
    • how many users who visited last month, also visited this month
    • how many users who made a purchase attempt last month purchased the item this month
  • etc.
Cohort analysis requires a "self-join"-type operation and can be very time consuming. I will not discuss it further in this post beyond the brief sketch below.
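
Just to show why it is expensive, here is a minimal sketch of the self-join, assuming a hypothetical visits(user_id, month) table:

    import sqlite3

    # Minimal sketch: month-over-month retention on a hypothetical visits table.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE visits (user_id INTEGER, month TEXT)")
    conn.executemany("INSERT INTO visits VALUES (?, ?)",
                     [(1, "2015-05"), (1, "2015-06"), (2, "2015-05"), (3, "2015-06")])

    # Users who visited last month AND this month: the table is joined with itself,
    # which is why cohort queries get slow as the data grows.
    retained = conn.execute("""
        SELECT COUNT(DISTINCT prev.user_id)
        FROM visits AS prev
        JOIN visits AS curr ON curr.user_id = prev.user_id
        WHERE prev.month = '2015-05' AND curr.month = '2015-06'
    """).fetchone()[0]
    print(retained)  # -> 1 (only user 1 visited in both months)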

Report types


  • Time series graphs
    • Absolute numbers and conversion funnels are relatively simple, and both can be represented as time-series graphs, with absolute numbers having a single line and a conversion funnel having one line per metric (see the sketch after this list).
  • Histograms
    • Histograms can be useful in some cases, for example the distribution of order values or session durations.
  • Pie charts, funnels, etc.
    • For visual impact, pie charts and funnel diagrams are useful.
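
A minimal sketch of turning raw events into the time-series data behind such a graph, assuming a hypothetical events(day, event_type, user_id) table:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (day TEXT, event_type TEXT, user_id INTEGER)")
    conn.executemany("INSERT INTO events VALUES (?, ?, ?)", [
        ("2015-06-11", "visit", 1), ("2015-06-11", "visit", 2),
        ("2015-06-11", "purchase_attempt", 1), ("2015-06-11", "purchase", 1),
        ("2015-06-12", "visit", 3),
    ])

    # One row per day per funnel step; each column becomes one line on the graph.
    rows = conn.execute("""
        SELECT day,
               COUNT(DISTINCT CASE WHEN event_type = 'visit' THEN user_id END) AS visitors,
               COUNT(DISTINCT CASE WHEN event_type = 'purchase_attempt' THEN user_id END) AS attempts,
               COUNT(DISTINCT CASE WHEN event_type = 'purchase' THEN user_id END) AS purchases
        FROM events
        GROUP BY day
        ORDER BY day
    """).fetchall()
    for row in rows:
        print(row)  # ('2015-06-11', 2, 1, 1) then ('2015-06-12', 1, 0, 0)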

Filters


Almost always, the reports need to be filtered by some criteria, like the number of visitors broken down by country.

Filters are too important to ignore, but too many filters, and too many values for each filter, cause the summarized data to bloat: every additional filter dimension multiplies the number of summary rows.

Pre-define the combinations of filters for each report; supporting all possible combinations is not useful at all.
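
A minimal sketch of pre-aggregating only the filter combination a report actually needs (day and country here), assuming a hypothetical visits table that carries the filter columns:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE visits (day TEXT, country TEXT, device TEXT, user_id INTEGER)")
    conn.executemany("INSERT INTO visits VALUES (?, ?, ?, ?)", [
        ("2015-06-12", "US", "mobile", 1),
        ("2015-06-12", "US", "desktop", 2),
        ("2015-06-12", "IN", "mobile", 3),
    ])

    # The summary table holds only the (day, country) breakdown the report needs;
    # summarizing by every column would multiply the number of summary rows.
    conn.execute("""
        CREATE TABLE visits_by_day_country AS
        SELECT day, country, COUNT(DISTINCT user_id) AS visitors
        FROM visits
        GROUP BY day, country
    """)
    print(conn.execute("SELECT * FROM visits_by_day_country ORDER BY country").fetchall())
    # -> [('2015-06-12', 'IN', 1), ('2015-06-12', 'US', 2)]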

Collection: log file or database


Some data, like HTTP requests, is automatically collected in log files, but I am assuming that is not enough and the application needs to proactively log additional data.

I am also assuming that the database is sharded if needed, and that it is possible to log either to the database or to the file system.

Transactional data involving money stays in the primary transactional database, and summarizations are run against a replicated slave of the primary database. This works pretty well.

Other types of data, like user activities, ad impressions or product impressions, are very high volume and can be logged either to log files or to the database. Logging to the database has some great advantages (see the sketch after this list):
  • Ability to run SQL, in particular "group by"s, in relational databases; obviously not for petabyte-scale data
  • Ability to go back and re-summarize data in a specific time interval
  • Ability for the application to go back and update records when a specific event happens for a user
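
A minimal sketch of what database logging buys you, using a hypothetical impressions table: the primary key rejects duplicate writes, the application can update a record when a later event (a click) arrives, and any past interval can be re-summarized:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE impressions (
            impression_id TEXT PRIMARY KEY,   -- integrity constraint: no duplicates
            shown_at      TEXT,
            ad_id         INTEGER,
            clicked       INTEGER DEFAULT 0
        )
    """)

    # A duplicate write from a retry is ignored instead of being double counted.
    row = ("i-1", "2015-06-12T10:00", 7)
    conn.execute("INSERT OR IGNORE INTO impressions (impression_id, shown_at, ad_id) VALUES (?, ?, ?)", row)
    conn.execute("INSERT OR IGNORE INTO impressions (impression_id, shown_at, ad_id) VALUES (?, ?, ?)", row)

    # A later event for the same record: mark the impression as clicked.
    conn.execute("UPDATE impressions SET clicked = 1 WHERE impression_id = 'i-1'")

    # Re-summarize a specific interval at any time.
    print(conn.execute("""
        SELECT ad_id, COUNT(*) AS impressions, SUM(clicked) AS clicks
        FROM impressions
        WHERE shown_at >= '2015-06-12T00:00' AND shown_at < '2015-06-13T00:00'
        GROUP BY ad_id
    """).fetchall())  # -> [(7, 1, 1)]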

Summarization techniques: real time or batched


If possible, perform real-time processing, with the ability to re-summarize if needed. Beyond being real time, there is a huge operational benefit - no need to store or transfer data from one system to another.

In general, the fewer the steps for transferring, transforming and summarizing data, the lower the chances of errors.
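
A minimal sketch of real-time summarization that can still be re-run: each summary bucket is recomputed from the raw rows and replaced, so re-summarizing a past interval is just calling the same step again. Table and column names are hypothetical:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (hour TEXT, user_id INTEGER)")
    conn.execute("CREATE TABLE hourly_visitors (hour TEXT PRIMARY KEY, visitors INTEGER)")

    def summarize_hour(hour):
        """Recompute one hour from the raw rows; safe to re-run after a fix."""
        visitors = conn.execute(
            "SELECT COUNT(DISTINCT user_id) FROM events WHERE hour = ?", (hour,)
        ).fetchone()[0]
        conn.execute("INSERT OR REPLACE INTO hourly_visitors VALUES (?, ?)", (hour, visitors))

    # Real time: log the event, then refresh the affected bucket right away.
    conn.execute("INSERT INTO events VALUES ('2015-06-12T10', 1)")
    summarize_hour("2015-06-12T10")

    # Re-summarization: run the same function again over any past hour.
    summarize_hour("2015-06-12T10")
    print(conn.execute("SELECT * FROM hourly_visitors").fetchall())  # -> [('2015-06-12T10', 1)]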

Ensuring quality


A few ways to ensure correctness in reports:
  • generate reports in two different ways and compare them
  • compare reports from two different dates; there should not be huge discrepancies without a known reason
  • spot-check a sample of the actual data
We generate reports both from the database and from log files, which gives us the best possible way to ensure correctness. Whenever we find a discrepancy, guess which method has the issues (always the same one).
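
A minimal sketch of the comparison step, assuming two hypothetical daily-visitor reports, one from the database and one from log files; flag any day where the sources disagree by more than a small tolerance:

    # Hypothetical daily visitor counts from the two sources.
    from_db   = {"2015-06-11": 1042, "2015-06-12": 998}
    from_logs = {"2015-06-11": 1042, "2015-06-12": 953}

    def compare(a, b, tolerance=0.01):
        """Print the days where the two sources disagree beyond the tolerance."""
        for day in sorted(set(a) | set(b)):
            x, y = a.get(day, 0), b.get(day, 0)
            if abs(x - y) > tolerance * max(x, y, 1):
                print(f"{day}: db={x} logs={y}")

    compare(from_db, from_logs)  # -> 2015-06-12: db=998 logs=953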

Summary

Proposing to use databases for logging can raise eyebrows and even hostility. But some of the reporting requirements we received were much easier to meet when databases were used for logging.

Going back to the original four points discussed:
  • Technical
    • if possible, log into NoSQL databases dedicated to logging, multiple if needed
      • integrity constraints can be enforced - no duplicates
      • ability to update records and add attribute values to the same record
      • ability to re-summarize for specific intervals
    • if possible, log into relational databases dedicated to logging, multiple if needed
      • develop scripts in SQL in minutes rather than hours or days
    • if neither is possible, log to the file system and use map-reduce techniques to extract summaries
  • Operational
    • eliminate data transfers by logging directly into the target destination, and reduce data transforms to a single summarization step
    • can also log directly into a reliable external service like Amazon Kinesis (see the sketch after this list)
  • Qualitative
    • compare reports generated locally with those generated by an external service
  • Usefulness
    • target 5 to 10 meaningful reports with pre-defined filters, after long discussions with product managers, and cut the rest of the crap
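
For the external-service option above, a minimal sketch using boto3 to push one event record into a Kinesis stream; the stream name is hypothetical and AWS credentials are assumed to be configured:

    import json
    import boto3

    # Assumes credentials are configured and the stream already exists.
    kinesis = boto3.client("kinesis")

    event = {"type": "product_impression", "user_id": 42, "product_id": 7}
    kinesis.put_record(
        StreamName="analytics-events",          # hypothetical stream name
        Data=json.dumps(event).encode("utf-8"),
        PartitionKey=str(event["user_id"]),     # spreads records across shards
    )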
