Making the Crawler Fast and Reliable
So, as I mentioned in the previous article, the basic version of the Crawler worked well and proved to be usable. The problem: it was slow and unstable.
To make it fast, we need to run it on multiple machines (roughly 5 to 20). And to make it stable, we need to figure out how to build a reliable system from unreliable components.
The unreliable component is the browser: the Crawler uses a browser emulator (Selenium) with JavaScript enabled to properly render site content. It consumes a lot of resources and frequently crashes or hangs (it is not very stable by itself, and what's worse, some sites contain invalid HTML or JS that can crash or hang it).
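To give an idea of how this can be contained, here is a minimal sketch (assuming Python and the Selenium WebDriver bindings; the helper names like `fetch_page` are mine, not the actual Crawler code) of loading a page with a hard timeout and throwing the browser away when it crashes or hangs:

```python
from selenium import webdriver
from selenium.common.exceptions import TimeoutException, WebDriverException

def new_driver():
    # Start a fresh headless browser with a hard page-load timeout,
    # so a single bad site can't hang the worker forever.
    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    driver.set_page_load_timeout(30)
    return driver

def fetch_page(driver, url):
    # Hypothetical helper: returns (driver, html), where html is None on failure.
    try:
        driver.get(url)
        return driver, driver.page_source
    except (TimeoutException, WebDriverException):
        # The browser hung or crashed: discard it and start a new one.
        try:
            driver.quit()
        except WebDriverException:
            pass
        return new_driver(), None
```

The key design choice is to treat the browser as disposable: instead of trying to recover a wedged Selenium session, kill it and start a clean one.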
Going distributed
Running on multiple machines instead of just one makes things a bit more complex, because a couple of issues arise:
- Provisioning and deploying a couple dozen machines. I don't want to do it by hand.
- Handling machine crashes and healing afterwards. The Crawler should be robust: it should keep working if one of its nodes crashes, and pick the node up again when it gets fixed or a new node is added.
- Detecting when one of its nodes has hung and needs to be rebooted (see the sketch after this list).
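One common way to detect hung nodes is a heartbeat plus a watchdog. The sketch below is only an illustration of that idea, not the actual Crawler code: every node periodically records a timestamp (in reality this would live in shared storage, e.g. a database or Redis, rather than an in-process dict), and a watchdog treats nodes whose heartbeat is too old as hung and due for a reboot.

```python
import time

HEARTBEAT_INTERVAL = 30   # seconds between heartbeats sent by each node
HANG_THRESHOLD = 5 * 60   # no heartbeat for 5 minutes => consider the node hung

heartbeats = {}            # node_id -> last heartbeat time (shared storage in reality)

def beat(node_id):
    # Called by every node from its main work loop; if the loop is stuck,
    # the heartbeat stops and the watchdog notices.
    heartbeats[node_id] = time.time()

def find_hung_nodes():
    # Called periodically by the watchdog; returns nodes to reboot.
    now = time.time()
    return [node for node, last in heartbeats.items()
            if now - last > HANG_THRESHOLD]
```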