
How to build a serverless web crawler

Jun 08, 2023


Use serverless to scale an old concept for the modern era

Recently, a client project involved the need to crawl a large media site to generate a list of URLs and site assets. Given the large size of the site, the traditional approach to crawling would have exceeded the Lambda function’s timeout — so we looked at changing the approach to a serverless model.

Here’s what I learned on this project about designing serverless functions.

Let’s crawl before we run

There are many reasons to crawl a website — and crawling is different from scraping. When crawling a site, we land on a web page (usually the home page), search the page for URLs, and then recursively explore those URLs.

Scraping might be the reason for crawling — especially if you want to store a copy of the content of those pages, or you could simply have some secondary reason for indexing the pages.

Crawling is the interesting part because you can quickly generate a vast list of URLs, or control what you collect by implementing some rules. For example, maybe you only explore URLs with the same domain name, and remove query parameters to reduce duplicates.

Then you have to consider the rate of crawl — do you process one URL at a time or explore many concurrently? How do you treat HTTP versus HTTPS if you find mixed usage in the HTML?
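To make this concrete, here is a minimal sketch of the kind of normalization rule described above. The function name and the example domain are illustrative and not taken from the project.

// Sketch of a URL normalization rule: same domain only, query strings stripped,
// and http folded into https to avoid mixed-scheme duplicates.
function normalizeURL (href, baseDomain) {
  const url = new URL(href)
  if (url.hostname !== baseDomain) return null   // ignore external domains
  url.protocol = 'https:'                        // treat http and https the same
  url.search = ''                                // drop query parameters
  url.hash = ''                                  // drop fragments
  return url.toString()
}

console.log(normalizeURL('http://mytestsite.com/news?page=2', 'mytestsite.com'))
// -> 'https://mytestsite.com/news'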

For Node users, there’s a package called website-scraper that does this elegantly, and it has plenty of configuration options to handle those questions, plus a lot of other features.

Running this tool is pretty easy — you can visit my GitHub repo for the full example, but here are the important parts:

const crawl = require('website-scraper')  // npm install website-scraper

const myTargetSite = 'https://mytestsite.com'
const URLs = []

const options = {
  urls: [myTargetSite],
  directory: '/temp/',
  prettifyUrls: true,
  recursive: true,
  filenameGenerator: 'bySiteStructure',
  urlFilter: (url) => url.startsWith(myTargetSite),           // stay on the same domain
  onResourceSaved: (resource) => URLs.push(resource.url),     // collect each saved URL
  onResourceError: (resource, err) => console.error(`${resource}: ${err}`),
  requestConcurrency: 10                                      // fetch up to 10 pages at once
}

const result = await crawl(options)
console.log('# of URLs:', URLs.length)

The package is mostly configuration-driven. We specify a target website and have it recursively search for URLs matching the urlFilter. In this example, the filter is set to include any URL in the same domain, requesting up to 10 URLs concurrently.

When a resource is saved, we push its URL onto an array; when there’s an error, we log it. At the end of the process, all the discovered resources are stored in the URLs array. There’s more code in the full script, but these are the essentials.

This is all great — but we ran into some major issues when crawling a site with over 10,000 URLs. It effectively runs as one atomic job, fetching URLs, saving resources, and managing an internal list of URLs to explore:

  • It can run for hours and if it fails, there’s no way to pick up where it stopped. All the state is managed internally and lost if there’s a problem.
  • The memory consumed can be substantial as the internal map grows, so you must ensure the instance used has enough RAM allocated.
  • The temporary disk space used can run into hundreds of gigabytes for large sites, so you’ll need to make sure there’s enough local space available.

None of this is the fault of the website-scraper package — it’s purely that everything changes at scale. The package works perfectly for 99% of the websites on the Internet, but not for crawling the New York Times or Wikipedia.

Given the size of the client’s site, we needed to rethink this approach while also leveraging the advantages of serverless.

Serverless Crawler — Version 1.0

The first step was to simply package the web crawler code into a Lambda function in a basic ‘lift-and-shift’. This attempt behaved as expected — timing out at 15 minutes because the site exploration could not be finished in time.

At the end of 15 minutes, we had little idea of the progress other than what we could glean from looking at the ashes of our log files. To make this work, we had to ensure that the size of the site was not fundamentally related to how long the function runs.

In the new version, the function retrieves the page, finds URLs that match our filters, stores those somewhere and then terminates. When new URLs are stored, this fires up the same function and the process starts over.

Put simply, we’re going to extract the recursion out of the code and replicate it with our serverless box of toys. If we store the URLs in DynamoDB, we can use the streams to trigger the crawler every time a URL is written to the database:

We built this so the entire process is triggered by writing the homepage URL as a record into DynamoDB. This is what happens:

  1. Write “https://mytestsite.com” into DynamoDB.
  2. The stream causes Lambda to start with the incoming event “https://mytestsite.com”.
  3. It fetches the page, finds 20 URLs and saves them as 20 records in DynamoDB.
  4. The stream for 20 new records causes multiple Lambdas to start. Each loads a page, finds another 20 URLs and writes these into the database.
  5. The stream for 400 new records causes multiple Lambdas to start, crashing the website we are crawling, and all the unfinished Lambdas throw errors.

Ok, so what went wrong?

It’s easy to forget how efficiently Lambda can scale up for you — and that can result in an unintentional and self-induced Denial of Service (DoS).

No matter how large the site, there is a concurrency level which will tank the webserver. We needed to make adjustments to how this worked to prevent the onslaught of Lambdas trying dutifully to get their work done. We needed to make the web crawler, well… crawlier.

Serverless Web Crawler 2.0 — Now Slower!

Our previous problem was caused by Lambda doing its job. If many records are written to DynamoDB simultaneously, this will fire up more concurrent Lambdas to get the work done.

There are times when you don’t want this behavior — and this is one of those times. So how can you slow it down? Generally, there are a few strategies that can help put the brakes on:

  • Introduce an SQS Delay Queue between the Lambda function and the write into DynamoDB — this adds a configurable delay of up to 15 minutes.
  • Change the concurrency setting on the Lambda function from the unreserved default (1000) to a much lower number. Setting this value to 1 will prevent all concurrency and effectively result in serial ingestion. In our case we found 10 was adequate.
  • Change the batch size, which affects how many records are sent from the stream per invocation. A lower number here results in more concurrent invocations, whereas a larger number will cause a single Lambda to handle more items from the stream if many records are added simultaneously.

Between these three levers, we now have a reasonable amount of control to smooth out the processing speed if a large number of events appear in the DynamoDB stream.
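The second and third levers live in the serverless.yaml configuration. The fragment below is only a sketch: the function name, handler, placeholder ARN and the exact values are examples rather than the project’s real settings.

# Sketch of the relevant serverless.yaml settings (values are examples only)
functions:
  crawler:
    handler: handler.crawl
    reservedConcurrency: 10        # cap concurrent executions to protect the target site
    events:
      - stream:
          type: dynamodb
          arn: <your DynamoDB stream ARN>
          batchSize: 5             # records handed to each invocation
          startingPosition: LATEST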

Show me the code!

First, before you run anything, make sure you have a small website you either own or have permission to crawl — preferably a development site with a handful of pages. Crawling third party sites without explicit permission often violates terms of use and you don’t want to run afoul of the AWS acceptable use policy.

Ok, let’s look at the files in the package:

  • handler.js: the default entry point for Lambda. It executes all incoming events concurrently — if the batch size is 5 URLs, it fetches them all at the same time. The Promise.all mechanism handles the complexity of making this happen (a simplified sketch follows this list).
  • processURL.js: fetches the page, extracts the discovered URLs, and batches these into DynamoDB in groups of 25 items.
  • crawl.js: the actual crawling work happens here — it fetches the page and then uses the Cheerio package to discover URLs within the HTML. There is some logic to validate URLs and eliminate duplicates.
  • dynamodb.js: uses batchWriteItem to upload 25 items at a time to the DynamoDB table.
  • test.js: a minimal test harness that simulates a new item event from testEvent.json. If you run node test.js, it fires up the whole process for the URL in testEvent.json.
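For orientation, here is a minimal sketch of the shape of handler.js. It is not a copy of the repo: the exported function name, the url attribute name and the INSERT-only filter are assumptions.

const processURL = require('./processURL')

// Sketch only: handle each new record from the DynamoDB stream concurrently.
exports.crawl = async (event) => {
  const inserts = event.Records.filter((record) => record.eventName === 'INSERT')
  await Promise.all(
    inserts.map((record) => processURL(record.dynamodb.NewImage.url.S))
  )
}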

To use the code, you will need to create a DynamoDB table called crawler. At this point, when the code runs, it will only work for a single URL.

To make it fire when new URLs are added to DynamoDB, you must activate the stream on the table — go to the ‘Overview’ tab, enable the stream, and copy the stream ARN into the serverless.yaml file.

Yes, this could all have been automated in the repo. Due to the recursive nature of how the Lambda and the table interact, I did not want anyone to download the code, run it against Amazon.com and wonder why their AWS bill is sky-high. Once you connect the DynamoDB stream to the Lambda function, you will have a recursive serverless loop so be very careful.

There is one interesting point that isn’t immediately obvious but helps the crawling process. In our DynamoDB table, the URL is the primary partition key, so it must be unique. When batchWriteItem tries to write a duplicate URL, no new item is created, so the write doesn’t trigger another round of crawling for that URL. This is critical because many of the URLs discovered in each iteration will already be present in the table.
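For reference, here is a rough sketch of that batch write, assuming the table is named crawler and the partition key attribute is called url (the attribute name is an assumption).

const { DynamoDB } = require('aws-sdk')
const dynamodb = new DynamoDB()

// Sketch only: write up to 25 discovered URLs in a single batchWriteItem call.
// Because 'url' is the partition key, duplicates do not create new items.
async function saveURLs (urls) {
  const requests = urls.slice(0, 25).map((url) => ({
    PutRequest: { Item: { url: { S: url } } }
  }))
  return dynamodb.batchWriteItem({ RequestItems: { crawler: requests } }).promise()
}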

The overall flow in the end looks like this:

Serverless Web Crawler 3.0

The code provided is just the shell. For our client’s project, we also implemented more of the website-scraper logic that saves resources to S3 with each invocation of Lambda. We also added additional attributes to the DynamoDB table, including state (e.g. ‘New’, ‘Processing’, ‘Error’), timestamps, and the ability to follow up on failed pages using a cron job.

Working within the serverless environment had a number of beneficial side-effects:

  • Storing objects on S3 means there is no size limit, compared with storing files on an EC2 instance. It works for any size of site.
  • The amount of RAM needed for each invocation isn’t influenced by the overall size of the crawl. Each Lambda function only cares about downloading a single page.
  • If the target site becomes unavailable or the overall process is paused for some reason, you know the state of the crawl because it’s stored in the DynamoDB table. We extracted the recursion state from the process and put it into a database.
  • Extending the functionality is trivial. For example, treating images differently from HTML or CSS files doesn’t impact the original function.

For me, the most interesting insight from this project was realizing just how many long-lived processes that are not suitable for serverless might be able to leverage the same pattern.

For example, it might be possible to identify some iterative element in these applications which can be broken down so the state is maintained in DynamoDB. The concurrency capability of serverless could dramatically improve the performance of these tasks at potentially much lower cost.