- Technically
- making sure the right data is collected at the right time, with no data missed and no duplicates
- making sure the right tools and techniques are used for summarization
- Operationally
- making sure data is collected from all sources and nothing is missed or truncated
- making sure summarization scripts run at the proper time on the proper data
- making sure re-summarization can be performed on past data in case of errors
- Qualitatively
- making sure results are correct - a big challenge when the data set is large
- Usefulness
- making sure results are actually useful to the users who need them, and not just an overload of numbers and graphs
Let's dissect this further from different angles:
What are the different types of analytics
- Absolute numbers
- like unique visitors to a site
- Conversion funnels
- like among unique visitors, how many attempted to purchase an item, and among them, how many actually purchased (a funnel computation is sketched below)
- Cohort Analysis
- how many users who visited last month, also visited this month
- how many users who made a purchase attempt last month purchased the item this month
- etc.
Cohort analysis requires a "self-join"-type operation and can be very time consuming. I will not discuss it further in this post.
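For concreteness, here is a minimal sketch of computing a conversion funnel from raw events. The event names and structure are hypothetical, not from any specific system:

```python
# A conversion funnel: count distinct users at each step.
events = [
    ("u-1", "visit"), ("u-2", "visit"), ("u-3", "visit"),
    ("u-1", "purchase_attempt"), ("u-2", "purchase_attempt"),
    ("u-1", "purchase"),
]

funnel_steps = ["visit", "purchase_attempt", "purchase"]
users_per_step = {step: set() for step in funnel_steps}
for user_id, event_type in events:
    if event_type in users_per_step:
        users_per_step[event_type].add(user_id)

for step in funnel_steps:
    print(step, len(users_per_step[step]))   # visit 3, purchase_attempt 2, purchase 1
```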
Report types
- Time series graphs
- Absolute numbers and conversion funnels are relatively simple; both can be represented as time-series graphs, with absolute numbers as a single line and a conversion funnel as multiple lines, one per metric.
- Histograms
- Histograms can be useful in some cases, for example to show the distribution of session lengths or order values.
- Pie charts, funnels, etc.
- For a strong visual impact, pie charts and funnel diagrams are useful.
Filters
Almost always, reports need to be filtered by some criteria, like the number of visitors broken down by country.
Filters are too important to ignore, but too many filters, and too many values for each filter, cause the summarized data to bloat.
Pre-define the combinations of filters for each report (a sketch below); supporting all possible combinations is not at all useful.
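A minimal sketch of this idea, with hypothetical report and dimension names: each report declares up front which filter combinations it supports, and everything else is rejected.

```python
# Pre-defined filter combinations per report; ad-hoc combinations are refused.
REPORT_FILTERS = {
    "unique_visitors": [("day",), ("day", "country")],
    "purchase_funnel": [("day",), ("day", "country"), ("day", "traffic_source")],
}

def allowed(report, dimensions):
    """Accept only the combinations that were agreed on up front."""
    return tuple(dimensions) in REPORT_FILTERS.get(report, [])

print(allowed("unique_visitors", ["day", "country"]))      # True
print(allowed("unique_visitors", ["country", "browser"]))  # False
```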
Collection: log file or database
Some data, like HTTP requests, automatically gets collected in log files, but I am assuming that is not enough and the application needs to proactively log additional data.
I am also assuming that the database is sharded if needed, and that it is possible to log either to the database or to the file system.
Transactional data involving money stays in the primary transactional database, and summarizations run against a replicated slave of the primary database. This works pretty well.
Other types of data, like user activities, ad impressions, or product impressions, are very high volume and can be logged either to log files or to the database. Logging to the database has some great advantages (see the sketch after this list):
- Ability to run SQL, in particular "group by", in relational databases; obviously not for petabyte-scale data
- Ability to go back and re-summarize data for a specific time interval
- Ability to go back and update records from the application when a specific event happens for a user
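A minimal sketch of these advantages using SQLite; the table name, columns, and constraint are hypothetical. A primary key enforces no-duplicate logging, and a plain "group by" re-summarizes any past interval.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE event_log (
        event_id   TEXT PRIMARY KEY,   -- enforces no-duplicate logging
        user_id    TEXT NOT NULL,
        event_type TEXT NOT NULL,      -- e.g. 'visit', 'purchase_attempt'
        country    TEXT,
        event_time TEXT NOT NULL       -- ISO-8601 timestamp
    )
""")

# Duplicate events are silently dropped thanks to the primary key.
db.execute("INSERT OR IGNORE INTO event_log VALUES (?, ?, ?, ?, ?)",
           ("evt-1", "u-42", "visit", "US", "2014-01-15T10:00:00"))

# Re-summarize any past interval with a plain GROUP BY.
rows = db.execute("""
    SELECT event_type, COUNT(DISTINCT user_id)
    FROM event_log
    WHERE event_time BETWEEN ? AND ?
    GROUP BY event_type
""", ("2014-01-01", "2014-02-01")).fetchall()
print(rows)   # [('visit', 1)]
```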
Summarization techniques: real time or batched
If possible, perform real-time processing, with the ability to re-summarize if needed. Besides being real time, there is a huge operational benefit: no need to store or transfer data from one system to another (a sketch below).
In general, the fewer the steps transferring, transforming, and summarizing data, the lower the chances of errors.
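A minimal sketch of real-time summarization, assuming events arrive one at a time as dicts with hypothetical field names; counters are updated as each event arrives, so there is no separate transfer or transform step:

```python
from collections import Counter, defaultdict

daily_uniques = defaultdict(set)   # day -> set of user ids
daily_events = Counter()           # (day, event_type) -> count

def on_event(event):
    """Update summaries incrementally as each event arrives."""
    day = event["event_time"][:10]
    daily_uniques[day].add(event["user_id"])
    daily_events[(day, event["event_type"])] += 1

on_event({"user_id": "u-42", "event_type": "visit",
          "event_time": "2014-01-15T10:00:00"})
print(len(daily_uniques["2014-01-15"]),
      daily_events[("2014-01-15", "visit")])   # 1 1
```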
Ensuring quality
A few ways to ensure correctness of reports:
- generate reports in 2 different ways and then compare (a sketch below)
- compare reports from 2 different dates; there should not be huge discrepancies unless there are known reasons
- sample the actual data and cross-check the summarized results
We generate reports both from the database and from log files, which gives us the best possible way to ensure correctness. Whenever we find a discrepancy, guess which method turns out to have issues (every single time).
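A minimal sketch of the comparison step, assuming each report is a dict of metric name to value (names are hypothetical); any metric diverging beyond a tolerance is flagged:

```python
def compare_reports(report_a, report_b, tolerance=0.01):
    """Return the metrics whose values diverge by more than the tolerance."""
    problems = []
    for metric in sorted(set(report_a) | set(report_b)):
        a, b = report_a.get(metric, 0), report_b.get(metric, 0)
        if abs(a - b) > tolerance * max(abs(a), abs(b), 1):
            problems.append((metric, a, b))
    return problems

from_db = {"unique_visitors": 10230, "purchases": 412}
from_logs = {"unique_visitors": 10230, "purchases": 398}
print(compare_reports(from_db, from_logs))   # [('purchases', 412, 398)]
```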
Summary
Proposing to use databases for logging can raise eyebrows and even hostility. But some of the reporting requirements we received were much easier to meet when databases were used for logging.
Going back to the original 4 points discussed:
- Technical
- if possible, log into NoSQL databases dedicated to logging, multiple if needed
- integrity constraints can be enforced - no duplicates
- ability to update records and add attribute values to the same record
- ability to re-summarize specific intervals
- if possible, log into relational databases dedicated to logging, multiple if needed
- develop scripts in SQL in minutes rather than hours or days
- if neither is possible, log into the file system and use map-reduce techniques to extract summaries
- Operational
- eliminate data transfers by logging directly into the target destination, and reduce data transforms to a single summarization step
- can also log directly into a reliable external service like Amazon Kinesis (a sketch after this list)
- Qualitative
- compare reports generated locally with those generated by an external service
- Usefulness
- target 5 to 10 meaningful reports with pre-defined filters, after long discussions with product managers, and cut the rest of the crap
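For the Kinesis option, a minimal sketch using boto3, assuming a stream named "event-log" already exists; the stream name and event fields are hypothetical:

```python
import json
import boto3

# Log an event directly to Amazon Kinesis instead of a local destination.
kinesis = boto3.client("kinesis", region_name="us-east-1")

event = {"user_id": "u-42", "event_type": "visit",
         "event_time": "2014-01-15T10:00:00"}
kinesis.put_record(
    StreamName="event-log",
    Data=json.dumps(event).encode("utf-8"),
    PartitionKey=event["user_id"],   # keeps a user's events in order
)
```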