Monday, September 22, 2014

From one hosting provider to another...

This weekend, we moved our cluster from Rackspace to Amazon.

It was a very tense process, as we needed to transport gigabytes of data seamlessly from one server to another, as well as shift our data-writing server from the old database to the new one, while losing as little data as possible.
Eventually, it was done in less than two hours, most of which were spent exporting data from the old server and importing it into the new one.

My main tips for performing such a move:
- Sit down beforehand, and write the process down step by step: preparing the new server, exporting and transporting the data, updating DNS records, testing. Discuss and review it with your team.
- Review the process, and estimate risks and contingencies. What happens if it takes too long to transport the data? What happens if you have to roll back? What happens if the new server crashes?
- In case of DNS updates, it is faster and more reliable to add the new server's IP to /etc/hosts instead of waiting for DNS to refresh, which might take time (see the example after this list). Don't forget to jot down the IPs of your old servers, you might need them :)
- In our case, we also had some on-premise sensors with no access to DNS, which meant we had to keep both the old and new stacks active, both working with the new database, until we were able to access those sites and change the destination IP.
- After you've got your plan laid down, do a dry run. There's nothing like a dry run to sort out bugs and add missing steps.
- Be 100% clear about who performs which step and when, but keep one person in charge of the whole process.
- Keep everyone in the loop - The upgrade was performed on a Friday night, so we decided to do it from home. We were a team of three, using Google Hangouts to communicate while everyone was online.
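
For the /etc/hosts tip above, here's a minimal sketch - the hostname and IPs are made up for illustration:

    # /etc/hosts on each application server during the move
    # (hostname and IPs below are made up for illustration)
    54.210.10.20    db.example.com    # new Amazon server
    # 104.130.5.6   db.example.com    # old Rackspace server, kept for rollback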

In the end, the process went quite well. Amazon's SSD servers are fast and zippy, and we're ready for our next scale challenge, which will have us scaling to tables larger than 100M records while keeping our high performance standards.



Thursday, August 21, 2014

Shifting weights... And Cluster Performance.

At the beginning, there was one machine.

One server ran our database and web application.

But as our database began to scale, we needed to separate them.
Separating the servers is easy - Just define an 'internal' network, and let the machines communicate via the internal leg.

However, this caused another problem - we began to experience long lags when transferring large datasets between the machines.
When a dataset was larger than X items, performance began to degrade, and was substantially slower due to network lag. Below that, performance was actually better.

Part of it was due to node's mysql drivers. They're just slow (compared to native mysql drivers).
The other part was network lag.

So how do you solve it?
Part of my original design was using mysql as a plain data store - only simple SQL, no arithmetic operations or aggregations. This allowed us to move from mysql to mongodb easily, and eased the CPU load on the database.

But now it's time to shift the weight, because bandwidth is our new bottleneck.
So we're moving some of the heavier processing into the database so that a smaller dataset would transfer faster. Much faster.
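
To make "shifting the weight" concrete, here's a minimal sketch of letting mongo aggregate instead of shipping raw records to node - the collection and field names are made up for illustration:

    // Hypothetical 'readings' collection and fields.
    // Before: fetch every raw reading over the wire, then aggregate in node.
    // After: group and average inside the database, so only the small
    // result set crosses the network.
    db.readings.aggregate([
        { $match: { sensorId: 42, ts: { $gte: ISODate("2014-08-01") } } },
        { $group: { _id: "$sensorId",
                    avgValue: { $avg: "$value" },
                    count: { $sum: 1 } } }
    ]);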

Performance is a delicate art of balance. Always remember that paradigms change as scale changes. Don't be afraid to shift weight in order to keep your system balanced.


Sunday, July 27, 2014

Recruitment - A different take

Growing as a company can sometimes be difficult.
You need to be able to recruit people (or outsource, which won't be handled in this post), train them and get them to be productive as fast as you can.

For a small startup, this is even harder - every position counts, and every worker might mean the difference between great success and miserable failure. This can make the recruitment decision a very critical one.

For our next recruit, who should be a tier 3 support engineer, we've decided to handle the process a bit differently.

First, we decided to describe the typical workday for the position. What the person would do when he comes to the office, what his tasks would be, how he would work. We write that down.

After that analysis, we focus on what he (or she) needs to know in order to perform those tasks. It could be deep knowledge of our product, SQL know-how, MS Office skills or whatever. We divide those skills into skills he would need to bring "from home", generic skills that could be taught, and specific (our product) know-how.

Once we know that, we can actually start looking for the right person. The search becomes much more focused, as a lot of skills you might have considered prerequisites get shed during the process.
We would also have a clear training plan for that person.

In the end, I love to say that I'm looking for excellent PEOPLE - as most skills can be taught.

Happy hunting!


Monday, May 19, 2014

Behold, the future!

This week, I've spent two days at the Google TLV Campus, participating in an amazing workshop given by Prime design studio.
Though I'll probably cover the workshop in another post (waiting for 'official' photos), I would like to share one significant insight I gained:

During brainstorming for our workshop project's specs and features, Omri (Prime's CEO) sat with us, and while we were looking at a certain concept, asked us to look forward to the future of that product. What would such a product do 10 years from now? 20 years?

So we sat down, throwing out sci-fi inspired ideas ranging from laser grids and quadcopters to terraforming robots (yes, it was THAT crazy).

[Side note] The two basic rules of brainstorming are: Write down everything, and never argue. Anything is possible during brainstorm, even the most ridiculous ideas.

After that, we looked at the result, and understood what we'd be working on. And though it was different from the original project and looked a bit like a moonshot, it started to seem possible, and the end result (and presentation) was awesome.

After contemplating the whole process, I've come to realise that we almost never look at our product 10 years from now.

And we should, because a lot of those features could be implemented today. 

Successful companies are great at this, because they create the future of products now. The best example is Apple, but on a smaller scale, Waze, Nest and 23andMe are great examples as well.

Show your customers the future they want, and you will own it.

The future, according to us. The envisioned product is at the top left.

Thursday, May 8, 2014

No, you are not like us.

There's a time in a person's career when he is either managed by younger people, or managing younger ones.
Nearing 40, you begin to develop a gap with some of your surroundings. Either you need to recruit younger talent, work with (or report to) younger people, or manage them. This has all happened to me before; nevertheless, a gap of more than 10 years means a generational gap.
I might hang out on the same social networks. Maybe I even go out to the same cool places, or like the same music. I may even look younger than my real age (of course I do, and so do you! :)).

But still, I'm older. Which means that (If you look at it in reverse), I might as well be 120 to them.
I won't go into sociological terms like the 'x'/ 'y'/ whatever generation, but as far as I know, the two best things you can do are:

- Don't think you are 'one of them'. Acting, behaving or chasing their habits won't make you younger. It would be like my mom getting into Instagram and starting to post comments on every one of my kids' pictures, going back a year. Just don't.

- Do try to understand the culture. The habits. The language. Much like keeping up with the latest buzzy tech tool or design language, you do need to keep up - just do it on your own terms, at your own pace and with your own perception.




Wednesday, April 23, 2014

Reading time #6!


All right, 6th time around, some good reads and links.

- Was Better Place a revolutionary startup, or a huge flop just waiting to happen? (Personally, I was sceptical from the beginning... but that's just me.) This is the most amazing article documenting the huge car business that never happened.

- Here's my friend Moshe Kaplan's view of our mysql to mongo migration process. The article was favoured by MongoLabs themselves. Respect.

- If any of you are into creating cool stuff with the HTML5 canvas, you MUST check out obelisk.js. So cool.

- I've been talking a lot about node and mongodb - Feel like getting started? Here's one great step by step tutorial on writing your first node/mongodb web service.

- A nice little service for creating and sharing youtube mixtapes - this one is in beta, but looks promising.

And, finally (And a little off topic), two shows to watch:
- If you've ever been near a startup, you MUST watch Silicon Valley. Made by Mike Judge (Yes, the one who created Beavis and Butthead).

- Being a teen at the beginning of the 90s wasn't easy. Surviving Jack shows how teens communicated BEFORE mobile tech and social networks existed.

Previous reads: #5, #4, #3, #2, #1

Friday, April 4, 2014

Clear off the table

[ taking some time off the ongoing mysql->mongo series ]

Been visualising data a lot lately.
Scatter charts, realtime indicators, good ol' line charts and, of course, tables.
It's really easy to fill the screen with loads of information, screaming colours and numbers floating all around. This makes orientation difficult for the user, especially non-technical users.

Some of our users simply need insights and recommendations, but need the graphs for credibility.

This is a great post from darkhorseanalytics explaining what a good table looks like. [Hint - not your good ol' Office 1998 Excel highlight table]

http://darkhorseanalytics.com/blog/clear-off-the-table/


Sunday, March 30, 2014

Mysql to MongoDb, chapter 2: Diving in

[This is the second part in an ongoing series; part 1 is here]

So, after making the call to go mongo and doing a mongo 101 crash course, we started working on two major fronts (with only two developers):
1. DB layer rewrite - This was pretty much straightforward. We had about 100 functions to rewrite, but a lot of them were simple CRUD functions. We decided to use the native mongo node driver, as I don't like to use frameworks in my code. (There's mongoose, which is a nice ODM layer, but, as I've said, I'd rather use native stuff, unless there's a performance advantage.)
Major points you need to consider when transforming your code:

- Mysql has an auto-increment feature for unique ids in a table. Mongo has a unique id per object in the database. If you're not using one single unique id (and you're not) for each record in your mysql database, you need some sort of applicative counter solution for inserts in mongo. This is also very useful for returning the insert id of new entities. (See the sketch after this list.)

- Type checks and conversion: Rather than using a framework, I've decided to implement a simple hash table for field names and types.

- Logging: Just like printing mysql statements to the log, write a function which logs your query objects in mongo native format (like db.users.find({name:"yuval"})). It makes debugging much easier.
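
For the applicative counter mentioned above, here's a minimal sketch in the mongo shell, along the lines of the classic 'counters collection' pattern - the collection and field names are illustrative, not necessarily what we used:

    // Illustrative 'counters' collection: one document per sequence.
    // findAndModify atomically increments and returns the new value,
    // so concurrent inserts never collide.
    function getNextSequence(name) {
        var ret = db.counters.findAndModify({
            query: { _id: name },
            update: { $inc: { seq: 1 } },
            new: true,
            upsert: true
        });
        return ret.seq;
    }

    // Usage - emulating mysql's auto-increment on insert:
    db.users.insert({ _id: getNextSequence("users"), name: "yuval" });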

THIS ONE IS REALLY IMPORTANT:

Do not break mysql support in order to support mongo! Make your fixes in both the mysql and mongo db layers, not in your application layer! Make it work, and don't do irreversible things that would break mysql support. Support switching from mysql to mongo with a single configuration flag, so you can compare performance (a sketch of such a switch follows).
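
A minimal sketch of that single-flag switch in node - the module names and config key are hypothetical, not our actual code:

    // config.js might contain: module.exports = { dbEngine: 'mongo' };
    // (module names and the dbEngine key are hypothetical)
    var config = require('./config');

    // Both modules expose the exact same functions,
    // e.g. getUser(id, callback), insertUser(user, callback).
    var db = config.dbEngine === 'mongo'
        ? require('./db-mongo')
        : require('./db-mysql');

    module.exports = db;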

2. Data conversion - We used mongify, a neat ruby tool which translates sql databases into mongo. Performance was a bit dodgy for huge tables, so we contributed some code, which also upped the performance by ~20 times.

Important note for people using open source software - Don't just report bugs. You can fix stuff and contribute to the community. 
[Especially if you're using it for commercial purposes]

Some things we've encountered during our conversion process:
- Dry run your conversion process. Dump your mysql database, reload it into a vanilla server with mongo installed, and do the dry runs from that server. The operational conversion is something you only need to do once, so it's OK to leave things for manual tinkering later!

- You'll see your application is working slower. Don't worry about it. There's a lot of tuning to do.

- Indexing: You need to take good care of this. Use explain({verbose:1}) on your big queries to find out why they're slow. Indexing in mongo will solve a lot of your performance problems.

- Large sorts won't work, even with indexes. In our case, it was a sort over a set of more than 130,000 records. Instead of implementing paging, we moved the sorting to the application (works really fast, thank you). We will need to implement paging eventually, because we've just postponed the inevitable...

- Uniqueness: Like mysql, mongo can enforce uniqueness via a unique option on index creation. We decided to add indexes manually and not automatically (see the snippet below).
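
A minimal sketch of the indexing and uniqueness points above, in mongo 2.x shell syntax - the collection and field names are illustrative:

    // Illustrative collections/fields throughout.
    // Compound index to serve the big queries:
    db.readings.ensureIndex({ sensorId: 1, ts: -1 });

    // Uniqueness enforced at index creation:
    db.users.ensureIndex({ email: 1 }, { unique: true });

    // Check whether a query actually uses the index:
    db.readings.find({ sensorId: 42 }).sort({ ts: -1 }).explain();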


The next chapter will deal with more sophisticated tuning methods post conversion. Stay tuned :)



Thursday, March 27, 2014

From mysql to mongo in less than a week, chapter one

This week, we finally took the plunge at Lightapp, and migrated our database into mongo.

In the coming weeks I'll be writing a series of posts about the experience, along with some insights and tips, and (Of course) the end result, in numbers.

The first step was deciding we need mongo instead of mysql. This one was pretty simple. Our company's product reads data from numerous sources, then aligns and analyses it.
Since we're using big EAV tables to store all that data, mongodb is a far more suitable solution than mysql.
The timing was right, too, as our database only contains tens of millions of records, right before we migrate all our current customers (which will scale it towards hundreds of millions). So, with the help of my good friend (and scale pro) Moshe Kaplan, we started the process.

To make things easier, we also decided on a '1 to 1' conversion. This means we only switch our db layer, without touching our application. Luckily, we designed our application just like that - only one module was responsible for db connection and querying, so it was simple to write a parallel mongo db module and switch between the two seamlessly.

So the first major tip I recommend: 
Build your db layer as a 'replaceable' layer. This means no SQL queries or DB connections outside your layer. If you currently have an application with lots of 'history', just search for all the queries, and replace them with function calls (and filter parameters) into one file (a sketch of such a layer follows below).
You might even find that you can get rid of some duplicated code in the process :)
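
Here's a minimal sketch of what such a layer can look like in node - the module, function and table names are illustrative, not our actual code:

    // db-mysql.js (illustrative names) - all SQL lives here; the
    // application only ever calls functions and passes filter parameters.
    var mysql = require('mysql');
    var pool = mysql.createPool({ host: 'localhost', user: 'app', database: 'appdb' });

    exports.getUserByName = function (name, callback) {
        pool.query('SELECT * FROM users WHERE name = ?', [name], callback);
    };

    // db-mongo.js - a parallel module exposing the same signature:
    // exports.getUserByName = function (name, callback) {
    //     db.collection('users').find({ name: name }).toArray(callback);
    // };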

Second tip:
Fiddle with mongo a bit. You can use a tool like pentaho to pull data from your mysql db into mongo, then log into mongo and run some queries - learn the whole query and CRUD mechanism in mongo (a quick taste below). It's pretty simple, and has lots of great documentation all over the web.
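
For example, the basic CRUD cycle in the mongo shell - the collection and fields are made up:

    // Made-up 'users' collection and fields:
    db.users.insert({ name: "yuval", visits: 1 });                // Create
    db.users.find({ name: "yuval" });                             // Read
    db.users.update({ name: "yuval" }, { $inc: { visits: 1 } });  // Update
    db.users.remove({ name: "yuval" });                           // Delete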

Third tip:
Don't expect miracles. As we only switched our query functionality in the first stage, we were not expecting a huge performance gain. Optimisation takes time, and the big value we were expecting from mongo would come from moving our processing engine (data bucketing and aggregation) down from our application layer into the db layer.

The next thing was to approach the conversion process itself, but that will be elaborated in the next post of the series.

Stay tuned for more :)

[edit:] continue to chapter 2.



Wednesday, March 19, 2014

My take on Logging Levels

A long time ago, while developing a realtime server for Verint, the system's architect devised a great document called 'logging policy'.

It explained what a log message should look like, which information it should contain, and elaborated the rules for each log message level. For example: "There is no such thing as a 'Good' warning. Do NOT put 'Server is now up' messages in warning level".
Back then, the common practice spanned five levels or more, from critical to verbose.

At least at the beginning of development, I've found that, though it's nice to have more, the most effective number of logging levels is two: debug mode and production mode.
Production mode should contain high-level log events (flow events, processing metrics) and errors; debug mode gets the rest (a minimal sketch follows).
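
Here's a minimal sketch of such two-level logging in node - the function names and environment check are my own, not from a library:

    // Names and the NODE_ENV check are my own convention, not a library API.
    var DEBUG = process.env.NODE_ENV !== 'production';

    function logProd(msg) {      // flow events, metrics, errors - always logged
        console.log(new Date().toISOString() + ' ' + msg);
    }

    function logDebug(msg) {     // everything else - only outside production
        if (DEBUG) logProd('[debug] ' + msg);
    }

    logProd('Server is now up');           // high-level flow event
    logDebug('cache miss for user 42');    // noise we only want while debugging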

Logging is one of the strongest maintainability tools for your server. However, keep it simple, or you'll drown in misleading information.



Tuesday, February 25, 2014

Way too lean.

Everyone's going lean these days. Work lean and cheap, go to market as quickly as possible, fail fast.
While this approach works great in our day and age, it causes a lot of products (Jelly, for example) to come out half-baked.
The philosophy behind it is correct. Test the market as fast as you can, to validate your idea and gain ground. If you fail enough times, you might even succeed once. And one good success is all you need, right? (Yes, I am being ironic.)

However, unless your service goes ballistic right after launch, you might never know if it works or not. Hell, even if it does - What's your measure of success?

99% of apps never pass the 10,000 download barrier. So what does success mean? You might be failing your product a bit too fast there. You might not even know why it failed.

My advice would be to be a bit more thorough about your product. Get real feedback from as many people as you can (second-hand friends are even better, as they don't owe you anything; you can even pay them with a symbolic token like a t-shirt or something).

Don't launch too early, don't fail too fast. Thinking twice might not align with the cranky, wham-bam lean approach, but it might save you a lot of time and heartache.


Wednesday, February 5, 2014

Deliberate Mistake, correct end result

I spent my army service days as an infantry soldier.
A lot of my unit's field practice included night navigation - where you learn a path (by heart), and are sent to navigate a few miles at night and collect a few waypoints.

The most challenging part of these drills was not walking with 35 pounds of gear on, or even seeing what's ahead of you to avoid pitfalls. It was memorising the path, and walking through it - if you don't know everything by heart (sometimes you only have 20 minutes to prepare your gear and learn the path), you're bound to get lost. If you get lost, you walk more. If you walk more, you get more tired. And so begins a downward spiral no one likes.

I found that the most effective way for me to study these paths was the deliberate mistake method: you choose a point on the map that's easy to recognise and reach, and from there walk to the nearest waypoint. This sometimes means walking more than the 'direct' approach, but you're less likely to forget a turn or a hill, and there are fewer 'marks' to count when you're walking, at night, with your gear on.

Nowadays, when I look at decisions I need to make, I sometimes take the 'wrong' decision in order to reach a larger goal sooner - like coding some ugly patch. I know I can correct the mistake later, but I also know I'll reach my goals on time, with confidence.


Tuesday, January 14, 2014

You can stop developing your product now, thanks.

Once a startup's MVP has been released, there's always the question of what's next.
A lot of startups tend to use the time after sorting out the quirks and the bugs to develop more features and enrich the product.

But sometimes, adding more features to your product before even finding out what works will only distract your new users, who are trying to get to know your product.

A good example might be Google Plus vs. Twitter.

Twitter has remained essentially the same since its inception. Google Plus has been completely overhauled more than once, and is crammed with tons of features. While Google stuffed its social network with more and more features (sharing and video and an amazing image viewer and more menus and whatnot), trying to implement them differently than Facebook, Twitter focused on refining its mobile and web experience.

So next time your R&D department gets some spare time, have them improve performance, clean up code and take care of automation and scale, rather than add more buttons and features that might just make your product too rich and redundant.

Your users will appreciate waiting 1 second instead of 5 for a response far more than another copycat feature.

Sunday, January 12, 2014

DO NOT UPGRADE!

Too often, when an application fails on memory or performs poorly, the immediate solution is to upgrade the machine.
Add more cores, more memory - and assume this sweeps the problem under the rug.

This strategy actually causes even more damage - As you will have to deal with the real issue (YOUR CODE...) later, when the system has more data, more angry clients, and more code that breaks.

If your application consumes too much CPU, you should profile it and solve the problem.
If your application leaks memory, you should find out where.
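
As an example of "finding out where", here's a minimal sketch in node - the interval and output format are my own, not a library API:

    // My own sketch, not a library API: log memory usage once a minute;
    // a heapUsed figure that climbs without ever dropping back points to a leak.
    setInterval(function () {
        var mem = process.memoryUsage(); // rss, heapTotal, heapUsed (bytes)
        console.log('[mem] rss=' + Math.round(mem.rss / 1048576) + 'MB' +
                    ' heapUsed=' + Math.round(mem.heapUsed / 1048576) + 'MB');
    }, 60000);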

Taking the lazy approach will cause you credibility issues once you really do need that upgrade.

Wednesday, January 1, 2014

Keep your data aligned!

Many big data systems analyse large, periodical data streams.

These data streams are sometimes event based (e.g. add an entry whenever a user visits a page, performs an operation etc.), and sometimes 'sample' based (e.g. measure the CPU level every 5 seconds).

Sometimes your sampling can be unreliable - for example, when monitoring activity over a WAN.
Then you get 'holes' in your data stream, and these holes cause problems when analysing your data.

Several companies I know have developed utility functions which periodically go over the data streams and 'fix' these holes. These functions are usually costly, as finding such holes can get complicated and performance-demanding in large datasets. Fixing them (especially if it's an 'update' operation) is also costly.

My suggestion: fix the problem before it arises. Keep your data aligned before you insert it into the database - whenever there's a missed reading, fix it on the next reading by keeping track of your last reading's timing (a sketch follows).
You might find it cheaper to keep a pre-input processing machine for this than to require a huge server for your database just because it needs to align data overnight.
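
A minimal sketch of that pre-insert alignment - the 5-second interval, field names and backfill strategy (repeating the last value) are illustrative assumptions:

    // Illustrative interval and field names throughout.
    var INTERVAL_MS = 5000;     // expected sampling period
    var lastSeen = {};          // sensorId -> { ts, value } of last stored reading

    function insertAligned(sensorId, ts, value, store) {
        var prev = lastSeen[sensorId];
        if (prev) {
            // Backfill every missed slot with the last known value
            // (interpolation would work here too), marked as filled.
            for (var t = prev.ts + INTERVAL_MS; t < ts; t += INTERVAL_MS) {
                store({ sensorId: sensorId, ts: t, value: prev.value, filled: true });
            }
        }
        store({ sensorId: sensorId, ts: ts, value: value, filled: false });
        lastSeen[sensorId] = { ts: ts, value: value };
    }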