Making Crawler Fast and Reliable

So, as I mentioned in the previous article, the basic version of the Crawler worked well and proved to be usable. The problem: it was slow and unstable.

To make it fast we need to run it on multiple machines (about 5-20). And to make it stable we need to figure out how to build a reliable system from unreliable components.

The unreliable component: the Crawler uses a Browser Emulator (Selenium) with JavaScript enabled to properly render the content of sites. It consumes lots of resources and frequently crashes and hangs (it is not very stable by itself and, what's worse, some sites contain invalid HTML or JS that can crash or hang it).

Going Distributed

Multiple machines instead of just one make things a bit more complex, because a couple of issues arise:

  • Provisioning and deploying a couple of tens of machines. I don't want to do it by hand.
  • Handling machine crashes and healing after them. The Crawler should be robust and continue to work if one of its nodes crashes, and pick it up again when the node gets fixed or a new node gets added.
  • Detecting if one of the nodes hangs and needs to be rebooted.
  • Evenly distributing work between nodes (some sites are small, some huge).
  • Making it easy to add new nodes.
  • Understanding what's going on in the system: whether all nodes are properly loaded and whether you need to stop some or add new ones.

So, how to do it? At first I was tempted to build some cool and smart architecture (like the ones described in papers from Google or Amazon). But thankfully I know what to do when I'm visited by this sort of thought: you need to lie on the couch and relax for a while, or go for a walk, and these silly ideas about the ultimate architecture will leave you.

Constraints & Sacrifices

Instead of trying to build a superior system, I tried to come up with the simplest possible design. Let's review the requirements, take advantage of our constraints and think about what features we really need and what can be sacrificed.

It's impossible to be good at everything. You will sacrifice something anyway, and it's better when it's your explicit and conscious choice.

The most important question: do we really need to be truly distributed?

Seems like no: in our case the Browser Emulators consume 95% of the resources, while the Crawler itself just coordinates the workflow and doesn't perform any resource-consuming tasks.

Manager & Workers

So, we can take advantage of this constraint and use a simpler architecture: one Manager and N Workers.

The Worker machines with Browser Emulators can be killed and added on the fly without stopping the crawler. The Manager node is critical, it is the single point of failure: if it goes down, everything goes down. But that is ok in our case, because the Manager node doesn't contain unreliable parts and is stable enough, and even if it does crash once in a while, we can afford a couple of hours of downtime.

So, we solved the biggest problem: the heart of the Crawler won't be truly distributed, it will be just a single process on one machine (yes, it wouldn't be able to handle thousands of machines, but for our purposes a couple of tens of nodes is enough). How do we handle the instability of the Worker nodes with emulators? We could send heartbeats or use other techniques to check if a Worker is alive, but there's a simpler way (a code sketch of this scheme follows the list):

  • When a Worker machine starts, it registers itself with the Manager in the list of available Workers.
  • Every call to a Worker comes with a timeout, so if the Worker dies the call doesn't hang, it ends with a timeout error. In case of such an error the Manager removes the problematic Worker from the list of Workers.
  • Every half an hour the Worker machines get rebooted by a cron task. So even if some zombie Workers hang, they will be reloaded and automatically re-registered with the Manager on startup.
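
To make this concrete, here is a minimal sketch of the Manager side in node.js. It is my own illustration, not the actual Crawler code: the JSON-over-HTTP protocol, endpoint names, ports and timeout value are all assumptions.

    // Minimal sketch of the Manager (hypothetical protocol: Workers register via
    // POST /register, the Manager calls Workers over plain HTTP with a timeout).
    var http = require('http');

    var workers = []; // addresses of available Workers, e.g. 'http://10.0.0.5:8080'

    // Registration endpoint: a Worker announces itself on startup.
    http.createServer(function(req, res) {
      if (req.method == 'POST' && req.url == '/register') {
        var body = '';
        req.on('data', function(chunk) { body += chunk; });
        req.on('end', function() {
          var address = JSON.parse(body).address;
          if (workers.indexOf(address) == -1) workers.push(address);
          res.end('ok');
        });
      } else {
        res.writeHead(404);
        res.end();
      }
    }).listen(3000);

    // Every call to a Worker has a timeout; on timeout or any other error the
    // Worker is dropped from the list, so dead or hanging Workers disappear
    // automatically and come back only when they re-register after a reboot.
    function callWorker(address, path, callback) {
      var done = false;
      function finish(err, body) {
        if (done) return; // make sure the callback fires only once
        done = true;
        if (err) {
          var i = workers.indexOf(address);
          if (i != -1) workers.splice(i, 1); // drop the problematic Worker
        }
        callback(err, body);
      }
      var req = http.get(address + path, function(res) {
        var body = '';
        res.on('data', function(chunk) { body += chunk; });
        res.on('end', function() { finish(null, body); });
      });
      req.setTimeout(60 * 1000, function() {
        req.abort(); // abort the hung call
        finish(new Error('worker timed out: ' + address));
      });
      req.on('error', function(err) { finish(err); });
    }

The half-hour reboot is just a cron entry on every Worker machine, something like "*/30 * * * * /sbin/reboot" in root's crontab (the exact command depends on the OS).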

So, now our system is reliable: Workers can be created and crashed without affecting the system itself (there's only a slight fluctuation in performance). Some resources will be lost to idle time (a Worker can stay dead or be a zombie for a while), but the loss is relatively small and we can accept it.

Deployment is also easy: we create just two machines by hand, the Manager and the first Worker. All other Workers are created by cloning the first one, and they register automatically on startup.

Even load is achieved by randomly assigning URLs to Workers (there are also some measurements and rules on how much time each site can consume, but those are minor details).
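
Continuing the hypothetical sketch above (it reuses the workers list and callWorker; the /crawl endpoint is again my own assumption), the assignment itself is tiny:

    // Pick a random Worker from the registered list for each URL.
    function assignUrl(url, callback) {
      if (workers.length == 0) return callback(new Error('no Workers available'));
      var worker = workers[Math.floor(Math.random() * workers.length)];
      callWorker(worker, '/crawl?url=' + encodeURIComponent(url), callback);
    }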

And for analytics you just use Amazon EC2 Monitoring to see how much CPU your Workers consume and whether you need to add more.

Callbacks vs. Actors

I used node.js for development, and it was pretty performant and easy to start with. But it was very hard to make things work robustly using Callbacks, Events and State Machines. When the system stopped working, or what's worse, worked slightly differently in hard-to-reproduce cases, it wasn't easy to figure out what went wrong.

So, I ended up using a different programming model: Actors, Messages and Queues. It uses Callbacks and Fibers internally, but as it turned out it's more predictable and easier to work with, at least for me.
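
To give an idea of what that looks like, here is a minimal sketch of the actor pattern in node.js. It is my own simplified illustration (without the Fibers part), not the actual implementation: each actor owns a mailbox queue and processes one message at a time, so there is a single, easy-to-trace flow of control instead of a web of interleaved callbacks.

    // A minimal actor: a mailbox queue processed one message at a time.
    // `handler(message, done)` is asynchronous and calls `done()` when finished.
    function Actor(handler) {
      this.handler = handler;
      this.mailbox = [];
      this.processing = false;
    }

    Actor.prototype.send = function(message) {
      this.mailbox.push(message);
      this.process();
    };

    Actor.prototype.process = function() {
      if (this.processing || this.mailbox.length == 0) return;
      this.processing = true;
      var self = this;
      this.handler(this.mailbox.shift(), function() {
        self.processing = false;
        self.process(); // take the next message, if any
      });
    };

    // Usage: the Manager could be an actor reacting to messages like
    // { type: 'workerRegistered', address: ... } or { type: 'pageCrawled', url: ... }.
    var manager = new Actor(function(message, done) {
      console.log('processing', message.type);
      done();
    });
    manager.send({ type: 'workerRegistered', address: 'http://10.0.0.5:8080' });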