I spent a considerable amount of my time while working on my BS and MS at NJIT doing contract work I found on Upwork. My entry-level work was limited to a lot of manual and automated QA testing, but this helped me accumulate many hours of work, which made my Upwork profile stronger. After a while I branched out into more challenging opportunities. This was the first real project that I found on my own on Upwork.

My source code can be found here

The Assignment

The task was to crawl Amazon to detect discrepancies for books where the trade-in value exceeded the first available purchase price:

[Image: purchasePrice]
This image shows the first available purchase price after a user selects the option to view 'More Buying Choices'.

I had already built a web crawler for my graduate course in Information Retrieval, so my plan was to extend that application to fit my client's needs as described here.

The Problems

The task required me to take an input .txt file containing a list of keywords provided by my client. The web crawler application would parse the keywords from this file, search Amazon for books matching each keyword, and examine the aforementioned prices for each result on the first page. The application then had to find the instances where the trade-in value exceeded the purchase price by a specific amount ($). The desired output would be rendered in a CSV file, so that my client could look at it in an Excel spreadsheet rather than deal with any of my code directly.

Distribution of Work

I determined the steps to be the following:

Crawl Manager nodeJS Instance

  1. Parse input keywords.txt file
  2. Inject keywords into the Amazon URL to be crawled
  3. Push URL on the Queue of URLs to be crawled
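
The manager steps above can be sketched roughly as follows. This is a Python illustration, not the original Node.js manager. The one-keyword-per-line file format, the search URL template (including the `i=stripbooks` parameter), and all function names are assumptions made for the example, and a plain list stands in for the Redis-backed queue.

```python
from urllib.parse import urlencode

# Assumed search URL shape; the real crawler's template may differ.
SEARCH_BASE = "https://www.amazon.com/s"


def parse_keywords(text):
    """Return one keyword phrase per non-empty line (assumed file format)."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def build_search_url(keyword):
    """Inject a keyword into the (assumed) book-search URL."""
    return SEARCH_BASE + "?" + urlencode({"k": keyword, "i": "stripbooks"})


def enqueue_searches(text, queue):
    """Push one search URL per keyword onto the queue (a plain list here)."""
    for keyword in parse_keywords(text):
        queue.append(build_search_url(keyword))
    return queue


if __name__ == "__main__":
    # Hypothetical contents of keywords.txt.
    for url in enqueue_searches("linear algebra\n\norganic chemistry\n", []):
        print(url)
```

Using `urlencode` instead of string concatenation keeps multi-word keywords and special characters safely escaped in the query string.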

Crawl Worker nodeJS Instances

  • Pop URL off queue to crawl
    • If the URL is the result of searching for books:
      • Push URL of subsequent individual books onto queue to be crawled
    • If the URL is an individual book:
      • Extract trade-in value
      • Push URL of 'More Buying Choices' onto queue to be crawled
    • If the URL is the 'More Buying Choices' page of an individual book:
      • Extract the first available purchase price
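
The worker's branching can be expressed as a single dispatch function. The sketch below is in Python rather than Node.js and rests on two assumptions: each job carries a `kind` tag next to its URL (the original may have inspected the URL itself), and `page` is a dictionary of values already scraped from the fetched HTML. The field names `book_urls`, `offers_url`, `trade_in`, and `first_price` are hypothetical.

```python
def handle_job(job, page):
    """Process one crawl job and return (new_jobs, result).

    job:  dict with a 'kind' and a 'url' (assumed job shape).
    page: dict of values already extracted from the fetched page.
    """
    kind = job["kind"]
    if kind == "search":
        # Search results page: queue every book found on the first page.
        new_jobs = [{"kind": "book", "url": u} for u in page["book_urls"]]
        return new_jobs, None
    if kind == "book":
        # Book page: record the trade-in value, then queue the offers page.
        offers_job = {"kind": "offers", "url": page["offers_url"], "book": job["url"]}
        return [offers_job], {"book": job["url"], "trade_in": page["trade_in"]}
    if kind == "offers":
        # 'More Buying Choices' page: record the first available price.
        return [], {"book": job["book"], "purchase_price": page["first_price"]}
    raise ValueError(f"unknown job kind: {kind!r}")


if __name__ == "__main__":
    # Made-up URLs and values, for illustration only.
    jobs, result = handle_job(
        {"kind": "book", "url": "https://example.com/book/1"},
        {"offers_url": "https://example.com/book/1/offers", "trade_in": 12.5},
    )
    print(jobs, result)
```

Returning the new jobs instead of pushing them directly keeps the function free of queue details, so the same logic could sit behind any queue implementation.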

Python CSV Formatting Script

  1. Calculate profitable instances using extracted values from nodeJS Worker Instances
  2. Format results into an easy to read comma separated values file for client to view
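
A minimal sketch of these two steps is shown below. It is not the original script: the record fields (`title`, `trade_in`, `purchase_price`), the column order, and the sample values are assumptions chosen for illustration.

```python
import csv
import io


def profitable_rows(records, min_profit):
    """Keep records whose trade-in value exceeds the purchase price by more than min_profit."""
    rows = []
    for rec in records:
        profit = round(rec["trade_in"] - rec["purchase_price"], 2)
        if profit > min_profit:
            rows.append({**rec, "profit": profit})
    return rows


def to_csv(rows):
    """Render rows as CSV text with a header line."""
    fields = ["title", "trade_in", "purchase_price", "profit"]
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fields, extrasaction="ignore",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


if __name__ == "__main__":
    # Made-up sample data.
    sample = [
        {"title": "Book A", "trade_in": 12.50, "purchase_price": 8.00},
        {"title": "Book B", "trade_in": 5.00, "purchase_price": 4.00},
    ]
    print(to_csv(profitable_rows(sample, min_profit=3.0)))
```

Writing through `csv.DictWriter` rather than joining strings by hand means titles containing commas or quotes are escaped correctly when the file is opened in Excel.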

Distributed Architecture

The client had given me access to his AWS EC2 management console. I was familiar with deploying VPS instances on AWS and was able to use them to create my worker and manager instances. I opted to use Redis, hosted on AWS ElastiCache, together with the Kue library as a centralized job queue.

I chose Redis over other job queues such as RabbitMQ (Celery) because it had a simple, easy-to-understand API, did not carry many excess features, and was lightweight and easy to set up. ElastiCache made my Redis instance available to all EC2 nodes in my VPC (Virtual Private Cloud). The advantage of architecting a web crawler around a centralized queue is that the URLs to be crawled are simple units of work. These units can be consumed by as many workers as necessary in order to reduce overall task completion time.
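
The idea of many workers draining one shared queue can be illustrated on a single machine. In the sketch below, Python's in-process `queue.Queue` and threads stand in for the Redis queue and the EC2 worker instances; this is an analogy under that assumption, not the Kue-based setup described above, and `crawl` is a hypothetical callback. It also assumes all URLs are queued up front, so a worker can safely stop when the queue is empty.

```python
import queue
import threading


def run_workers(urls, num_workers, crawl):
    """Drain a shared queue of URLs with num_workers threads and return the results."""
    jobs = queue.Queue()
    for url in urls:
        jobs.put(url)

    results = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                url = jobs.get_nowait()
            except queue.Empty:
                return  # nothing left to do
            outcome = crawl(url)
            with lock:
                results.append(outcome)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    # A stand-in crawl function that just measures the URL length.
    print(sorted(run_workers(["a", "bb", "ccc"], num_workers=2, crawl=len)))
```

Because each URL is an independent unit of work, changing `num_workers` changes only how fast the queue drains, not the set of results.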

The Output

The input file followed this same format. The keywords were processed as command line arguments (argv) and through the node file module (fs). The manager file followed the logic described in the pseudocode outline above. The worker file followed its respective outline as well, and used cheerio to scrape the DOM for the appropriate link(s) to crawl or values to return.

The manager file kept a reference count of the total number of jobs and, when they had all completed (https://gitlab.com/yolo/nodecrawlin/blob/amazonCrawlin/priceCompManager.js#L68), wrote a pretty-printed JSON structure to a .txt file.

A Python script then processed these results and carried out the calculations the client wanted, reducing the final set to only the records where profit was above a specified level (in this demo, $3). The final results were printed out into a .CSV file.
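
The manager's job counting can be pictured with a small tracker like the one below. This is a Python sketch of the general idea, not a port of the linked priceCompManager.js; the class name, method names, and sample values are hypothetical.

```python
import json


class JobTracker:
    """Counts outstanding jobs and collects results until none remain."""

    def __init__(self):
        self.pending = 0
        self.results = []

    def job_added(self):
        self.pending += 1

    def job_done(self, result=None):
        """Record a finished job; return True once every job has finished."""
        if result is not None:
            self.results.append(result)
        self.pending -= 1
        return self.pending == 0

    def dump(self):
        """Pretty-print the collected results as JSON text."""
        return json.dumps(self.results, indent=2, sort_keys=True)


if __name__ == "__main__":
    tracker = JobTracker()
    tracker.job_added()
    tracker.job_added()
    tracker.job_done({"book": "A", "trade_in": 12.5})
    if tracker.job_done({"book": "A", "purchase_price": 8.0}):
        print(tracker.dump())
```

Incrementing the count whenever a job is queued and decrementing it on completion lets the manager know the crawl is finished even though workers keep adding follow-up jobs along the way.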