
Resilient Concurrency and Rate-Limiting for LLM Callbacks

In this Code Lab, you'll build a rate-limited concurrency system for handling thousands of LLM API callbacks. You'll implement queue-based concurrency control, handle rate-limit responses, manage asynchronous arrays efficiently, and maintain system resilience. When finished, you'll have production-ready patterns for safely orchestrating high-volume LLM operations.

Lab Info
Level: Advanced
Last updated: Jun 17, 2026
Duration: 29m

Table of Contents
  1. Challenge

    Introduction

    Welcome to the Resilient Concurrency and Rate-Limiting for LLM Callbacks Code Lab.

    In this lab, you will build the orchestration layer that lets a backend safely send a large batch of LLM callbacks to a rate-limited endpoint: capping how many run at once with a promise pool, backing off when the service pushes back with a 429, retrying the failures worth retrying, and settling the whole batch into a clean success-and-failure report instead of crashing on the first rejection.

    About the tools and concepts
    • Rate limit: A rate limit is the endpoint's contract: it accepts only so many requests per unit of time and rejects the overflow with an HTTP 429 Too Many Requests, often alongside a hint for how long to wait. Sending an unbounded burst at such an endpoint is the fast path to a wall of 429s and wasted work.

    • Avoiding API starvation is the goal. When a naive job sends every request at once, the endpoint rejects almost all of them and useful throughput collapses toward zero: the work starves even though the service is available.

    • A promise pool prevents this by capping how many callbacks are in flight at the same instant. Rather than hand-rolling a semaphore, production code reaches for a small, battle-tested library (here, p-limit). You create a limiter with a fixed concurrency, then wrap each unit of work in it: the limiter runs up to the ceiling immediately and holds the rest in an internal queue, releasing a queued caller the moment a running one finishes. This keeps offered load near the endpoint's sustainable rate instead of spiking far above it.

    • p-limit also exposes live counters, activeCount for callbacks executing right now and pendingCount for those waiting in its queue, so you can observe backpressure as the batch runs.

    • Backoff strategy: Capped exponential backoff is the standard answer to a 429. Each successive retry waits roughly twice as long as the last, so a struggling endpoint gets exponentially more breathing room, but the delay is capped at a ceiling so late retries stay bounded instead of ballooning into minutes. Adding jitter, a random offset on top of the computed delay, keeps a fleet of retrying callers from resynchronizing into a new thundering herd.

    • Promise.allSettled is the primitive for awaiting large arrays of promises resiliently. Unlike Promise.all, which rejects the moment any single promise rejects, allSettled waits for every promise to finish and reports each outcome as either fulfilled with a value or rejected with a reason, so one failed record never discards the results of the other ninety-nine.

    • A typed error lets the retry layer make decisions. By throwing a dedicated RateLimitError instead of a generic Error, the code that catches it can retry rate-limit rejections specifically while letting genuinely broken requests fail fast.

    Prerequisites

    Before starting this lab, you should have:

    • Understanding of promises and async/await: composing, awaiting, and resolving promises
    • Familiarity with arrays and array methods: .map, .filter, and .reduce
    • Knowledge of API rate limits and backoff strategies: what a 429 means and why retries need delay
    • Basic understanding of the event loop and concurrency: what runs in parallel versus what merely interleaves
    • Experience with HTTP requests and callbacks: sending a request and handling its response

    The lab environment is ready to use, and the project dependencies are already installed. Run node --version from inside the workspace folder at any time to confirm the runtime.


    The scenario

    You are a backend engineer at CarvedRock building a system that processes customer data records requiring LLM transformation. The LLM service enforces a rate limit of 10 requests per second, and your application has 100 pending requests to process. Naive parallel execution exhausts the rate limit and triggers 429 responses.

    Your task is to implement queue-based concurrency control that respects the rate limit, handle 429 responses with exponential backoff, manage the asynchronous request array efficiently, and maintain system reliability so every record is accounted for.
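
    To see why the naive approach fails, here is a minimal sketch of a rate-limited endpoint hit by an unbounded burst. The createMockEndpoint function is a hypothetical stand-in, and the fixed one-second window is an assumption; the lab's mockLlmServer.js may enforce its limit differently.

```javascript
// Hypothetical stand-in for a rate-limited endpoint (not the lab's mock server).
// Assumption: a fixed window that accepts `limitPerWindow` requests, then returns 429.
function createMockEndpoint(limitPerWindow, windowMs = 1000, now = Date.now) {
  let windowStart = now();
  let used = 0;

  return function handleRequest() {
    const current = now();
    if (current - windowStart >= windowMs) {
      // A new window has begun: reset the counter.
      windowStart = current;
      used = 0;
    }
    used += 1;
    return used <= limitPerWindow ? { status: 200 } : { status: 429 };
  };
}

// Naive burst: all 100 requests arrive inside the same window.
const burstEndpoint = createMockEndpoint(10);
const burstStatuses = Array.from({ length: 100 }, () => burstEndpoint().status);

const accepted = burstStatuses.filter((status) => status === 200).length;
const rejected = burstStatuses.filter((status) => status === 429).length;
console.log({ accepted, rejected });
```

    Under these assumptions the burst gets 10 requests accepted and 90 rejected, which is the wasted work the rest of the lab is designed to avoid.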

    The application structure

    Key files in the lab environment
    • workspace/src/config.js: the shared knobs (concurrency ceiling, retry budget, base and maximum backoff delay) every module reads from one place
    • workspace/src/mockLlmServer.js: the mock endpoint that enforces the rate limit and returns 429s with instant responses; treat it as a black box
    • workspace/src/llmClient.js: the client wrapper that calls the endpoint and raises a typed RateLimitError on a 429
    • workspace/src/backoff.js: the capped exponential-backoff delay with jitter
    • workspace/src/processRecord.js: the per-record retry loop that ties the client, backoff, and retry budget together
    • workspace/src/batchProcessor.js: the orchestrator that creates the p-limit pool, gates every record through it, monitors backpressure, and settles the results
    • workspace/src/logger.js: the shared stage logger
    • workspace/runPipeline.js: the end-to-end runner that dispatches all 100 records and prints the summary
    • workspace/data/records.js: the 100 pending records

    Complete the tasks in order. Each task builds on the previous one.

    Run the full workload from the workspace directory at any point with:

    node runPipeline.js
    
  2. Challenge

    Establishing the client and the rate-limit contract

    Setting the system's limits in one place

    Every resilient batch starts from a handful of numbers: how many callbacks may run at once, how many times a single record may try before giving up, and how the backoff delay grows and where it stops.

    Centralizing these in config.js means the pool, the retry loop, and the backoff function all read from one source of truth, and tuning the system later is a one-line change rather than a hunt across modules.

    The values you set here matter: a concurrency ceiling that sits just under the endpoint's rate limit keeps the pipeline busy without tripping it, and a retry budget large enough to outlast the queue keeps records from failing before their turn comes.
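
    As a rough illustration, a shared config could look like the sketch below. MAX_CONCURRENCY, MAX_RETRIES, and MAX_DELAY_MS are the names this lab uses; BASE_DELAY_MS and every value shown are assumptions chosen for illustration, not the lab's actual settings.

```javascript
// Illustrative config sketch. All values are assumptions, not the lab's settings.
const config = Object.freeze({
  // Sits just under the endpoint's limit of 10 requests per second.
  MAX_CONCURRENCY: 8,
  // Total attempts a single record may make before it is reported as failed.
  MAX_RETRIES: 10,
  // Hypothetical name: the delay before the first retry, in milliseconds.
  BASE_DELAY_MS: 100,
  // Ceiling for the exponential backoff delay, in milliseconds.
  MAX_DELAY_MS: 2000,
});

// Export `config` in whatever way the project's module system expects.
console.log(config);
```

    Freezing the object is a design choice rather than a requirement: it keeps one module from silently changing a limit that another module depends on.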

    Calling the endpoint and naming the rate-limit failure

    The client wrapper is the single point where your code touches the model service, so it is also the right place to translate the endpoint's response into something the retry layer understands. A 429 is not a normal error: it is a "retry me later" signal, and the rest of the system needs to recognize it as distinct from a malformed request or a broken record.

    Raising a dedicated RateLimitError, rather than a generic one, lets the retry loop catch rate-limit rejections specifically and back off, while letting other failures surface immediately.
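
    One way this translation could look is sketched below. RateLimitError is the name the lab uses; the callLlm function, the send parameter, the retryAfterMs field, and the response shape ({ status, body }) are all hypothetical and only stand in for whatever llmClient.js actually does.

```javascript
// Typed error: lets callers tell "retry later" apart from "broken request".
class RateLimitError extends Error {
  constructor(message, retryAfterMs = null) {
    super(message);
    this.name = 'RateLimitError';
    this.retryAfterMs = retryAfterMs; // hypothetical field: optional wait hint
  }
}

// Hypothetical client wrapper. `send` stands in for whatever performs the
// request; it is assumed to resolve to { status, body, retryAfterMs? }.
async function callLlm(record, send) {
  const response = await send(record);

  if (response.status === 429) {
    // Rate limited: raise the typed error so the retry layer can back off.
    throw new RateLimitError(
      `rate limited on record ${record.id}`,
      response.retryAfterMs ?? null,
    );
  }
  if (response.status !== 200) {
    // Anything else is treated as a genuine failure and surfaces immediately.
    throw new Error(`request failed with status ${response.status}`);
  }
  return response.body;
}
```

    A caller can then branch on error instanceof RateLimitError instead of parsing status codes or message strings.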

  3. Challenge

    Controlling concurrency with a promise pool

    Capping in-flight work with p-limit

    The pool is the mechanism that keeps the workload from overwhelming the endpoint. Instead of hand-rolling a semaphore, you use p-limit, the library advanced teams actually reach for in production: it is small, well-tested, and removes the queue bookkeeping you would otherwise own and have to maintain.

    You create a limiter bound to the concurrency ceiling, then gate every record's processing through it. p-limit runs up to the ceiling immediately and parks the overflow in an internal queue, admitting the next queued caller the instant a running one finishes, so the number of callbacks in flight stays pinned at the ceiling until the queue drains.
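
    The gating pattern can be sketched as follows. The lab itself uses p-limit; to keep this sketch runnable without dependencies, createLimiter is a hypothetical, simplified stand-in with the same call shape (limit(fn) returns a promise). It is not p-limit's implementation, and runGated is likewise an illustrative name.

```javascript
// Simplified stand-in for illustration only. With p-limit installed you would
// create the limiter with pLimit(concurrency) instead of this function.
function createLimiter(concurrency) {
  const queue = [];
  let active = 0;

  const startNext = () => {
    if (active >= concurrency || queue.length === 0) return;
    const { fn, resolve, reject } = queue.shift();
    active += 1;
    Promise.resolve()
      .then(fn)
      .then(resolve, reject)
      .finally(() => {
        active -= 1;
        startNext(); // a slot opened up: admit the next queued caller
      });
  };

  return (fn) =>
    new Promise((resolve, reject) => {
      queue.push({ fn, resolve, reject });
      startNext();
    });
}

// Gate every record through the limiter and track the peak in-flight count.
async function runGated(records, worker, concurrency) {
  const limit = createLimiter(concurrency);
  let inFlight = 0;
  let peak = 0;

  const tasks = records.map((record) =>
    limit(async () => {
      inFlight += 1;
      peak = Math.max(peak, inFlight);
      try {
        return await worker(record);
      } finally {
        inFlight -= 1;
      }
    }),
  );

  const results = await Promise.all(tasks);
  return { results, peak };
}
```

    The important detail is that every record is wrapped in limit(...) before it starts, so the pool, not the caller, decides when each callback runs.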

    Watching the queue and the in-flight load

    A pool you cannot see into is hard to operate. p-limit exposes two live counters: activeCount, the callbacks executing at this instant, and pendingCount, the callbacks waiting in its internal queue.

    Reading them as the batch runs turns the pool from a black box into an observable system: you can watch the active count sit pinned at the ceiling while the pending count drains toward zero, which is exactly the backpressure signal an operator needs to confirm the limiter is doing its job and to reason about whether the ceiling is set correctly.
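
    A small monitor along these lines could surface the counters. It accepts any limiter that exposes activeCount and pendingCount, as p-limit's limiter does. The poolSnapshot and monitorPool helpers and the exact log format are assumptions; the lab's [POOL] lines may be formatted differently.

```javascript
// Formats one snapshot from a limiter exposing activeCount and pendingCount.
// The log format here is an assumption, not the lab's exact output.
function poolSnapshot(limit) {
  return `[POOL] active=${limit.activeCount} queued=${limit.pendingCount}`;
}

// Hypothetical helper: logs a snapshot on a fixed interval.
// Returns a function that stops the monitor when called.
function monitorPool(limit, intervalMs, log = console.log) {
  const timer = setInterval(() => log(poolSnapshot(limit)), intervalMs);
  return () => clearInterval(timer);
}
```

    In use, you would start the monitor just before dispatching the batch and call the returned stop function once the batch settles, so the timer does not keep the process alive.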

  4. Challenge

    Handling rate limits with exponential backoff

    Waiting longer each time, capped, with jitter

    Even a well-tuned concurrency ceiling will occasionally draw a 429: bursts overlap, windows roll, and the endpoint pushes back. The right response is to wait, and to wait progressively longer on each successive attempt so a struggling service gets exponentially more room to recover.

    Doubling the delay per attempt is the standard curve, but unbounded doubling quickly produces excessive waits, so you cap the delay at a ceiling.

    On its own, a fixed curve also makes every retrying caller wake at the same instant, recreating the burst that caused the 429. Adding jitter, a random offset layered on the computed delay, spreads those wakeups out so the retries arrive smoothly instead of in a synchronized wave.
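
    The three ingredients above (doubling, a cap, and jitter) can be sketched in a few lines. The backoffDelay function, its parameter names, the zero-based attempt numbering, and the choice to size the jitter by the base delay are all assumptions for illustration; backoff.js in the lab may make different choices.

```javascript
// Capped exponential backoff with additive jitter (illustrative sketch).
// Assumption: `attempt` is zero-based, so the first retry waits about `baseMs`.
function backoffDelay(attempt, { baseMs, maxMs, random = Math.random }) {
  // Double the delay for each successive attempt.
  const exponential = baseMs * 2 ** attempt;
  // Cap the computed delay so late retries stay bounded.
  const capped = Math.min(exponential, maxMs);
  // Random offset layered on top, in the range [0, baseMs).
  // Note: the total can therefore exceed maxMs by up to baseMs.
  const jitter = random() * baseMs;
  return capped + jitter;
}
```

    With baseMs of 100 and maxMs of 1000, the computed delays before jitter run 100, 200, 400, 800, and then hold at 1000. Passing random as a parameter is a convenience for testing; production code can rely on the Math.random default.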

    Retrying the right failures and failing hard on the rest

    Backoff is the "how" of a retry; the retry loop is the "when" and the "how many".

    A robust loop draws a sharp line between failures worth retrying and failures that should stop the record cold. A RateLimitError is transient: wait and try again. The loop treats anything else as non-retryable and re-throws it at once, because retrying a genuinely broken request just wastes the budget. And the budget is finite: once a record exhausts its attempts, the loop raises a clear, final error so the batch layer can record a clean failure instead of letting the record hang or silently vanish.
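
    A retry loop with these three behaviors might be sketched like this. RateLimitError is the lab's name for the typed error; processWithRetry, its options, and the reading of maxRetries as the total number of attempts are assumptions, not the contents of processRecord.js.

```javascript
// Minimal typed error so the sketch is self-contained.
class RateLimitError extends Error {}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Illustrative retry loop. Assumption: `maxRetries` is the total number of
// attempts allowed, and `delayFor(attempt)` returns the wait in milliseconds.
async function processWithRetry(task, { maxRetries, delayFor, wait = sleep }) {
  for (let attempt = 0; attempt < maxRetries; attempt += 1) {
    try {
      return await task();
    } catch (error) {
      // Non-retryable failure: re-throw at once without spending the budget.
      if (!(error instanceof RateLimitError)) throw error;
      // Last attempt used: skip the pointless wait and fall through.
      if (attempt === maxRetries - 1) break;
      // Transient failure: back off, then try again.
      await wait(delayFor(attempt));
    }
  }
  // Budget exhausted: raise a clear, final error for the batch layer.
  throw new Error(`retry budget exhausted after ${maxRetries} attempts`);
}
```

    Because the final error is an ordinary rejection, the batch layer can record it as a failed record rather than leaving the promise pending.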

  5. Challenge

    Managing asynchronous arrays and resilience

    Settling every callback instead of bailing on the first

    With the pool, backoff, and retry loop in place, the array of pool-governed promises is already in flight. The choice of how you await that array is what makes the batch resilient.

    Promise.all would abandon the entire run the instant any one record threw its final error, discarding the ninety-nine that succeeded alongside the one that failed.

    Promise.allSettled instead waits for all of them and reports each outcome independently, which is exactly the behavior a batch job needs when you expect a few records to fail.
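
    The difference is easy to see on a tiny batch. The sketch below awaits the same three outcomes both ways; compareAwaitStrategies and the record labels are illustrative names only.

```javascript
// Awaits the same three outcomes with Promise.all and Promise.allSettled.
async function compareAwaitStrategies() {
  const makeTasks = () => [
    Promise.resolve('record-1 ok'),
    Promise.reject(new Error('record-2 failed')),
    Promise.resolve('record-3 ok'),
  ];

  // Promise.all rejects with the first rejection; the two successes are lost.
  let allOutcome;
  try {
    allOutcome = await Promise.all(makeTasks());
  } catch (error) {
    allOutcome = `rejected: ${error.message}`;
  }

  // Promise.allSettled never rejects; it reports one outcome object per task.
  const settled = await Promise.allSettled(makeTasks());

  return { allOutcome, settled };
}
```

    Here allOutcome ends up as the string "rejected: record-2 failed", while settled holds three entries with statuses fulfilled, rejected, and fulfilled, in input order.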

    Splitting the outcomes into success and failure

    A settled array is a list of outcome objects, not results: each entry reports a status of either fulfilled or rejected, with the real value or the error tucked inside.

    The final step reshapes that into the report a caller actually wants: the transformed values that succeeded and the reasons that failed, separated. Partitioning the settled array by status gives the runner a complete, honest picture of the batch: how many records made it through and exactly which ones did not, which is the difference between a job you can operate and one you can only guess at.
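
    One possible shape for that partition step is sketched below. The partitionSettled name and the decision to keep each failure's index alongside its reason are assumptions; the lab's report format may differ.

```javascript
// Splits a Promise.allSettled result into succeeded values and failure reasons.
// Keeping the index lets the caller map a failure back to its input record.
function partitionSettled(settled) {
  const succeeded = [];
  const failed = [];

  settled.forEach((outcome, index) => {
    if (outcome.status === 'fulfilled') {
      succeeded.push(outcome.value);
    } else {
      failed.push({ index, reason: outcome.reason });
    }
  });

  return { succeeded, failed };
}
```

    The lengths of the two arrays give the succeeded and failed counts for the summary directly.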

  6. Challenge

    Run the full pipeline

    Now that every task is complete, run the end-to-end workload to watch the orchestration layer absorb the rate limit and settle all 100 callbacks.

    1. Confirm the runtime is available:

      node --version
      
    2. Start the workload from the workspace directory against the full batch of 100 records:

      node runPipeline.js
      
    3. Watch the log stream print an [INIT] line as the dispatch begins, then a series of [POOL] lines reporting the live active and queued counts, and finally a [DONE] line reporting how many callbacks settled and how long the run took.

    4. Notice the [POOL] snapshots: the active count holds at the ceiling you set while the queued count falls steadily, visible proof the pool is pacing the work rather than sending it all at once.

    5. Confirm the final [DONE] summary reports succeeded: 100 and failed: 0. Every record made it through despite the endpoint's rate limit, because the pool held concurrency at the ceiling and capped backoff absorbed the 429s the endpoint returned. Because the endpoint responds instantly, the run's pace is set entirely by the rate limit and your backoff, so it settles in roughly ten to thirteen seconds.

    Expected result: Every layer you built is visible in one run: the pool caps in-flight callbacks and exposes its queue, the client raises typed rate-limit errors, capped backoff spaces out the retries, the retry loop recovers the transient failures, and Promise.allSettled settles the whole batch into a clean 100 succeeded, 0 failed report instead of collapsing under a wall of 429s.

  7. Challenge

    Conclusion

    Congratulations on completing the Resilient Concurrency and Rate-Limiting for LLM Callbacks lab!

    You have built the orchestration layer that turns a naive parallel burst into a production-grade batch: capping concurrency with a p-limit pool, backing off on rate limits, retrying the failures worth retrying, and settling every callback into a complete success-and-failure report. These are the patterns you need to safely orchestrate high-volume LLM operations.

    What you have accomplished

    1. Set the concurrency, retry, and backoff limits: Defined the concurrency ceiling, retry budget, and backoff bounds once in a shared config every module reads from.
    2. Wired up the client and detected rate-limit responses: Routed every record through one client wrapper that raises a typed RateLimitError on a 429.
    3. Created the pool and gated the workload: Built a p-limit pool at the concurrency ceiling and routed every record through it so in-flight work stays pinned at the limit.
    4. Surfaced backpressure with the pool's live counters: Read activeCount and pendingCount to make the queue and in-flight load observable as the batch runs.
    5. Implemented capped exponential backoff: Waited a bounded, exponentially growing, jittered interval before each retry so a struggling endpoint recovers and retries never resynchronize.
    6. Added bounded retry and error recovery: Retried rate-limit failures within a budget, failed fast on non-retryable errors, and raised a clear final error on exhaustion.
    7. Awaited settlement of the whole batch: Used Promise.allSettled so one failure never cancels the run.
    8. Partitioned results for a complete report: Split the settled array into succeeded values and failure reasons for an honest, operable summary.

    Key takeaways

    • A promise pool such as p-limit keeps offered load near the endpoint's sustainable rate, which avoids the API starvation a naive parallel burst causes and prevents far more 429s than any retry strategy can clean up after.
    • The pool's activeCount and pendingCount counters turn concurrency into something you can observe and reason about in production, not just configure and hope.
    • Capped exponential backoff with jitter is the standard pairing for rate-limit recovery: the exponential curve gives the endpoint room, the cap keeps late retries bounded, and the jitter stops a fleet of callers from retrying in lockstep.
    • A typed error turns retry logic into a clear decision: retry the transient failures, fail fast on the rest.
    • Promise.allSettled is the resilient way to await a batch: it reports every outcome instead of abandoning the run on the first rejection.

    Experiment before you go

    You still have time in the lab environment. Try these explorations:

    • Lower MAX_CONCURRENCY and rerun the workload: watch the [POOL] active count drop and the total time climb as fewer callbacks run at once. Then raise it well above the endpoint's ceiling and watch the 429s and retries multiply.
    • Lower MAX_RETRIES toward 5 and watch the failed bucket fill: with instant responses, a record that runs out of attempts before the queue clears fails, which is exactly the starvation the retry budget prevents.
    • Raise MAX_DELAY_MS and observe how a higher cap lengthens the tail of the run as late retries wait longer.
    • Add a [RETRY] log line inside the retry loop so the timeline shows each backoff as it happens, then watch where the retries cluster during a run.
    • Explore p-limit's clearQueue() method: imagine a fatal condition partway through the batch and reason about how draining the pending queue would let you abort the remaining work cleanly.
About the author

Angel Sayani is a Certified Artificial Intelligence Expert®, CEO of IntellChromatics, author of two books in cybersecurity and IT certifications, world record holder, and a well-known cybersecurity and digital forensics expert.
