Wednesday, January 1, 2014

Keep your data aligned!

Many big data systems analyse large, periodic data streams.

These streams are sometimes event based (e.g. add an entry whenever a user visits a page, performs an operation, etc.), and sometimes sample based (e.g. measure the CPU level every 5 seconds).

Sometimes your sampling is unreliable - for example, when monitoring activity over a WAN.
You then get 'holes' in your data stream, and these holes cause problems when you analyse the data.

Several companies I know have developed utility functions that periodically go over the stored streams and 'fix' these holes. These functions are usually costly: finding the holes in a large dataset is complicated and performance demanding, and fixing them (especially when it means an 'update' operation) is costly as well.
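To make the cost concrete, here is a rough Python sketch of such a batch repair pass. It assumes samples are stored as (timestamp, value) pairs taken every 5 seconds; the names and the storage layout are made up for illustration, not taken from any specific system. Note that it has to walk the entire stored stream just to discover where the gaps are, and every gap then costs extra writes.

# A rough sketch of the periodic batch repair described above, assuming
# samples are stored as (timestamp, value) pairs taken every 5 seconds.
# All names here are illustrative; this is not any specific system's API.

INTERVAL = 5  # expected seconds between samples

def backfill_holes(rows):
    """Scan the whole stream, find missing ticks, and return filler rows.

    The cost grows with the full dataset: every row is read, and every
    gap produces additional writes (or worse, updates) against the store.
    """
    fillers = []
    for (prev_ts, prev_val), (ts, _) in zip(rows, rows[1:]):
        t = prev_ts + INTERVAL
        while t < ts:                        # one filler per missed tick
            fillers.append((t, prev_val))
            t += INTERVAL
    return fillers

if __name__ == "__main__":
    stored = [(0, 41), (5, 42), (20, 44)]    # readings at 10 and 15 are missing
    print(backfill_holes(stored))            # [(10, 42), (15, 42)]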

My suggestion: fix the problem before it arises. Keep your data aligned before you insert it into the database - whenever a reading is missed, fix it when the next reading arrives by keeping track of the last reading's timestamp.
You might find it cheaper to run a small pre-input processing machine for this than to buy a huge database server just because it needs to align data overnight.
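As a minimal sketch of what I mean, again assuming a fixed 5-second interval and illustrative names: the writer remembers the last reading's timestamp and, when a reading arrives late, emits filler rows (here it simply repeats the last value) before the new one, so the stream is already aligned by the time it reaches the database.

# A minimal sketch of fixing holes before insertion. The collector keeps
# the last reading's timestamp; when a reading arrives late, it backfills
# the missed ticks before writing the new row. The filler strategy
# (repeating the last value) and all names are illustrative assumptions.

INTERVAL = 5  # expected seconds between samples

class AlignedWriter:
    def __init__(self, insert):
        self.insert = insert              # callback that writes one row to the DB
        self.last_ts = None
        self.last_value = None

    def write(self, ts, value):
        if self.last_ts is not None:
            expected = self.last_ts + INTERVAL
            while expected < ts:          # backfill every missed tick
                self.insert(expected, self.last_value)
                expected += INTERVAL
        self.insert(ts, value)
        self.last_ts, self.last_value = ts, value

if __name__ == "__main__":
    rows = []
    writer = AlignedWriter(lambda ts, v: rows.append((ts, v)))
    writer.write(0, 41)
    writer.write(5, 42)
    writer.write(20, 44)                  # readings at 10 and 15 were missed
    print(rows)  # [(0, 41), (5, 42), (10, 42), (15, 42), (20, 44)]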

