Architecture of our Infrastructure
My main goal is to create a system that scales easily and lets us use commodity infrastructure elements. Each of these elements has its own advantages and disadvantages, and the architecture aims to exploit the advantages while minimizing the disadvantages. Here is a list of the different functions performed by our system:
- Public User Interface
- Collection
- Processing
- Storing
Public User Interface
The Public User Interface (PUI) runs on the Cherokee web server. Cherokee is a high-efficiency web server that has low resource requirements and can handle very high loads. This element can be served sufficiently well by an inexpensive virtualized web server running in the cloud.
Collection
The Collection element of the system will be performed by Scrapy spiders, which will get their tasks from the Task Server and store their results in the Beecoop Data Store.
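The pull-and-store loop described above can be sketched roughly as follows. This is a minimal illustration, not our actual implementation: `TaskServer`, `DataStore`, `run_collector`, and `fake_fetch` are hypothetical names, both services are replaced by in-memory stand-ins, and the fetch step is a placeholder for the crawl a Scrapy spider would perform.

```python
from collections import deque
from typing import Callable, Optional


class TaskServer:
    """Hypothetical in-memory stand-in for the Task Server."""

    def __init__(self, urls):
        self._pending = deque(urls)

    def next_task(self) -> Optional[str]:
        # Hand out one URL at a time; None means no work is left.
        return self._pending.popleft() if self._pending else None


class DataStore:
    """Hypothetical in-memory stand-in for the Beecoop Data Store."""

    def __init__(self):
        self.records = {}

    def put(self, key: str, value: str) -> None:
        self.records[key] = value


def run_collector(tasks: TaskServer, store: DataStore,
                  fetch: Callable[[str], str]) -> int:
    """Pull tasks until the server is empty, storing each fetched result."""
    done = 0
    while (url := tasks.next_task()) is not None:
        store.put(url, fetch(url))
        done += 1
    return done


def fake_fetch(url: str) -> str:
    # Placeholder for the crawl a Scrapy spider would perform.
    return f"content of {url}"


tasks = TaskServer(["http://example.com/a", "http://example.com/b"])
store = DataStore()
collected = run_collector(tasks, store, fake_fetch)
```

Because each collector only asks the Task Server for its next task, more collectors can be started on additional machines without any of them needing to know about the others.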
Processing
Processing is done by workers that, like the Collection spiders, get their tasks from the Task Server and store their data in either the Beecoop Data Store or the PUI database.
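As a rough sketch of how a worker might route its output between the two stores: the routing rule below (a `public` flag on the task) is purely an assumption made for illustration, as are the names `process` and `run_worker` and the in-memory lists standing in for the Beecoop Data Store and the PUI database.

```python
from typing import Dict, List, Tuple

# Hypothetical in-memory stand-ins for the two storage targets.
beecoop_data_store: List[dict] = []
pui_database: List[dict] = []


def process(task: Dict) -> Tuple[str, Dict]:
    """Return (destination, record) for one task.

    Assumed rule, for illustration only: tasks flagged "public" go to
    the PUI database, everything else to the Beecoop Data Store.
    """
    record = {"id": task["id"], "summary": task["raw"].strip().lower()}
    destination = "pui" if task.get("public") else "beecoop"
    return destination, record


def run_worker(tasks: List[Dict]) -> None:
    """Process each task and append the record to the chosen store."""
    for task in tasks:
        destination, record = process(task)
        target = pui_database if destination == "pui" else beecoop_data_store
        target.append(record)


run_worker([
    {"id": 1, "raw": "  Hello ", "public": True},
    {"id": 2, "raw": "WORLD"},
])
```

Keeping the routing decision inside `process` means the worker loop stays the same if the rule for choosing a destination changes.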
Both Processing and Collection require a lot of RAM and CPU. We’re going to attempt to use commodity hardware to fulfill these functions. Core Networks has inexpensive hardware that we can try first. If that fails, we can fall back to Amazon EC2.
Storing
The Collection element will capture many different kinds of data. In one week of test execution, we collected over 500,000 units of data. To handle this volume of information we’re going to use Google Data Store, because it gives us a huge database that we will not have to scale ourselves.
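To put the test figure in perspective, here is a back-of-the-envelope calculation. It treats 500,000 units per week as a lower bound and assumes, only for the sake of the estimate, that the test-week rate stays constant over a year.

```python
UNITS_PER_WEEK = 500_000               # lower bound from the one-week test run
SECONDS_PER_WEEK = 7 * 24 * 60 * 60

units_per_day = UNITS_PER_WEEK / 7
units_per_second = UNITS_PER_WEEK / SECONDS_PER_WEEK
units_per_year = UNITS_PER_WEEK * 52   # assumes the test-week rate holds all year

print(f"{units_per_day:,.0f} units/day")
print(f"{units_per_second:.2f} units/second")
print(f"{units_per_year:,} units/year")
```

The sustained write rate is under one unit per second, so the concern is less about throughput than about the total volume that accumulates over time, which is the part we do not want to scale ourselves.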
Visual Representation of our Infrastructure

