Managing High Volume Data Streams with JavaScript
Real systems fall over when they try to load an entire dataset into memory at once. In this lab, you will build a command-line log processor that reads a multi-million-line event log without ever holding more than a sliver of it in memory. You will construct a ReadableStream over a huge file, consume it line by line with for await...of, decode and split raw byte chunks into clean lines with a TransformStream, and fold everything into a compact summary that stays flat in memory no matter how large the input grows. Along the way, you will diagnose a deliberately leaky baseline, apply backpressure so a fast producer cannot overwhelm a slow consumer, and guarantee that file handles and stream readers are always released, even on early exit or mid-stream errors. You build everything on Node.js against the standard Web Streams API, so the finished code also runs unchanged on other runtimes, such as Bun. By the end of the lab, you will know how to process unbounded data in JavaScript with predictable, constant memory.
Challenge
Introduction
Large data does not fit in memory. A log file with millions of lines read all at once forces the heap to hold every byte at the same time, and the process slows, balloons or crashes outright. Streaming is the answer. You pull data through in small pieces, process each piece and let it go, so memory stays flat no matter how big the input grows. In this lab, you build a command-line log-aggregation tool called the Globomantics Stream Processor. It reads a very large newline-delimited event log as a ReadableStream, consumes it line by line with for await...of and produces a small summary report. Along the way you turn a naive buffer-everything baseline into a constant-memory pipeline that honours backpressure and tears down its resources cleanly.
Because everything you write targets the standard WHATWG ReadableStream, the finished code also runs unchanged on other runtimes such as Bun. This lab uses Node throughout so the commands and the constant-memory proof stay consistent.
You join Globomantics as an engineer tasked with taming a log pipeline that falls over on large files. The workspace is a small Node project so you can focus on the streaming logic rather than the plumbing.
- scripts/generate-log.js writes a large data/events.log on first run, giving you a high-volume input without any network access. You do not edit this file.
- src/processor.js is the entry point you build out. It exports createLineStream, summarize and run as stubs with their signatures and JSDoc in place. run orchestrates the work and summarize owns the single consumption loop.
- src/naive.js is a deliberately leaky baseline that reads the whole file into one string. It exists only to demonstrate the memory problem. You never edit it.
- src/cli.js holds the CLI chrome already wired up. Argument parsing (including the --limit N flag), the report formatting and exit-code handling live here.
- src/types.js defines the shared EventSummary shape and any constants.
- A pre-placed file helper opens the log file and exposes byte-range reads, so you author only the streaming source logic rather than low-level file-handle management.
Each task can be validated individually by clicking on the Validate button next to it.
If you get stuck, every task has a Task Solution section you can expand to reveal the answer. This can be found under the FEEDBACK/CHECKS section of every task.
info> The solutions/ folder at the top of the workspace contains the final state of every task.
A failed task will list one or more failed checks under its Checks section, each with a specific message describing what went wrong.
The starting point of the lab is a directory named "processor". The current directory of the built-in terminal will be set to the processor/ directory. You can use the Terminal to run the tool with Node.
Click on the Next step arrow to get started.
Challenge
Analyzing the memory problem
When you read a whole file into a string, every byte of that file lives in memory at the same time. Split that string into an array and you now hold a second full copy. For a small file, nobody notices. For a multi-million-line log, the heap grows until the process slows to a crawl or crashes.
Streaming flips the model. Instead of pulling the entire dataset in at once, you pull it through in small chunks, do your work on each chunk and let it go. Memory stays flat because only one small piece is in hand at any moment.
Here is the contrast on a tiny throwaway example. The eager version keeps every reading forever:

    const readings = [];
    for (const value of sensorFeed) {
      readings.push(value); // every reading stays in memory
    }
    const average = readings.reduce((a, b) => a + b, 0) / readings.length;

The streaming version keeps only a running total and a count, so memory does not grow with the number of readings:

    let sum = 0;
    let count = 0;
    for (const value of sensorFeed) {
      sum += value; // nothing is retained
      count += 1;
    }
    const average = sum / count;

The rest of this lab applies that same idea to a real event log. The start state ships a baseline at src/naive.js that does the job the wrong way. It reads the entire log into one string and then splits it into an array of every line. Two full copies of a huge file in memory at once.
Run it against the log you just generated, but cap the heap so the problem shows up quickly:

    node --max-old-space-size=64 src/naive.js data/events.log

On a large log this either crawls or throws a heap allocation error. That is the memory problem you are here to fix.
The fix is to stop loading the file and start streaming it. The standard way to model a source you can pull data through is a ReadableStream. Your tool now pulls data through a ReadableStream one chunk at a time instead of loading the whole file.
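As a rough illustration of that shape, the sketch below wraps an in-memory list of byte chunks in a ReadableStream and consumes it with for await...of. The names chunkStream and countBytes are hypothetical, and the array stands in for the lab's file helper, whose API is not shown in this text. The source enqueues eagerly in start, which is the simplest form a source can take.

```javascript
// Hypothetical helper: wraps byte chunks in a standard ReadableStream.
// In the lab the chunks would come from the file helper, not an array.
function chunkStream(chunks) {
  return new ReadableStream({
    start(controller) {
      for (const chunk of chunks) {
        controller.enqueue(chunk); // hand each chunk to the stream's queue
      }
      controller.close(); // signal that no more chunks are coming
    },
  });
}

// Consumes the stream one chunk at a time, keeping only a running total.
async function countBytes(stream) {
  let total = 0;
  for await (const chunk of stream) {
    total += chunk.byteLength; // the chunk itself is not retained
  }
  return total;
}
```

For example, countBytes(chunkStream([new Uint8Array(3), new Uint8Array(5)])) resolves to 8.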
Challenge
Decoding chunks into lines without buffering
Your stream yields raw byte chunks straight from the file reader. A chunk is just whatever bytes happened to fit in the reader's buffer, so a chunk can stop in the middle of a line and the next chunk can begin partway through the following line. You cannot treat one chunk as one line.
The fix is a carry buffer. You decode each chunk into text, split that text on newlines and emit the complete lines you find. Whatever sits after the last newline is an unfinished line, so you hold it in a small variable and prepend it to the next chunk. When the stream ends you emit whatever is left in the carry.
Here is the idea on a throwaway example that splits a stream of comma-separated digits into complete groups. Notice that only one partial group is ever held in memory at a time:

    let carry = '';

    function handleChunk(text, emit) {
      carry += text;
      const parts = carry.split(',');
      carry = parts.pop(); // the last piece may be incomplete, so keep it
      for (const part of parts) {
        emit(part); // every complete piece flows out immediately
      }
    }

    function handleEnd(emit) {
      if (carry.length > 0) emit(carry); // flush the final piece
    }

Memory stays flat because the carry only ever holds one unfinished piece, never the whole input.
Decoded text chunks still do not align to lines. One chunk might hold two and a half lines, the next might finish that half line and start another. To turn this into clean lines you use the carry-buffer pattern in a TransformStream.
A TransformStream sits in the middle of a pipeline. It receives chunks in its transform method, does work and enqueues whatever it wants to pass on. It also has a flush method that runs once when the input ends, which is exactly where you emit the final partial line.
The plan for your line splitter is:
- Keep a carry string that holds the unfinished trailing line.
- In transform, prepend the carry to the incoming chunk, split on the newline, hold the last piece back as the new carry and enqueue every complete line before it.
- In flush, enqueue whatever is left in the carry so the last line is not lost when the file has no trailing newline.
Because the carry only ever holds one unfinished line, memory stays flat no matter how large the file is.
Your pipeline now turns a file of any size into a stream of complete log lines. Bytes flow in from the file reader, a TextDecoderStream turns them into text and your splitLines transform cuts that text into lines while holding only one unfinished line in memory. The same for await...of loop in summarize that used to count byte chunks now counts lines, with no change to the loop itself.
The key win is that memory never grows with the file. Whether the log has a thousand lines or fifty million, only one chunk and one carry are ever in hand at once.
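The plan above might be sketched as follows. This is one possible shape, not the lab's reference solution: splitLines is the name the text uses, but its real signature is not shown, and bytesFrom and collectLines are hypothetical helpers added only to make the sketch runnable. It splits on "\n" only and does not strip "\r".

```javascript
// Carry-buffer line splitter as a TransformStream (text in, lines out).
function splitLines() {
  let carry = ''; // holds only the unfinished trailing line
  return new TransformStream({
    transform(text, controller) {
      const parts = (carry + text).split('\n');
      carry = parts.pop(); // the last piece may be incomplete, keep it back
      for (const line of parts) controller.enqueue(line);
    },
    flush(controller) {
      // Emit the final line when the input has no trailing newline.
      if (carry.length > 0) controller.enqueue(carry);
    },
  });
}

// Hypothetical byte source whose chunk boundaries ignore line boundaries.
function bytesFrom(chunks) {
  return new ReadableStream({
    start(controller) {
      for (const chunk of chunks) controller.enqueue(chunk);
      controller.close();
    },
  });
}

// Hypothetical driver: decodes, splits and gathers the lines.
async function collectLines(byteStream) {
  const lines = byteStream
    .pipeThrough(new TextDecoderStream()) // bytes to text, safe across chunk splits
    .pipeThrough(splitLines()); // text to complete lines
  const out = []; // collected only to show the result; the lab folds instead
  for await (const line of lines) out.push(line);
  return out;
}
```

Placing TextDecoderStream ahead of the splitter matters because a multi-byte character can straddle two byte chunks. The decoder holds the partial character back until the rest arrives, so the splitter only ever sees well-formed text.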
Challenge
Summarizing in constant memory
Your pipeline now hands summarize one complete log line per iteration and the loop counts those lines. A count is a start, but the tool is meant to report more than a number. It reports how many events of each level you saw and the timestamp of the most recent one. To do that you first turn each raw line into a small structured object you can read fields off.
Parsing is just pulling the pieces out of a known format. You do it on one line at a time and you never keep the line around afterwards, so it costs no extra memory.
Here is an example that parses simple key=value pairs into an object:

    function parse(text) {
      const result = {};
      for (const pair of text.split(' ')) {
        const [key, value] = pair.split('=');
        if (key) result[key] = value;
      }
      return result;
    }

    parse('id=42 status=ok'); // { id: '42', status: 'ok' }

A real parser also has to cope with input that does not fit the format. A line that is empty or malformed must not throw, because one bad line in a multi-million-line file should not crash the whole run.
Now that each line becomes an event, you have a choice about what to do with it. The tempting move is to push every event into an array and work on the array at the end. That is exactly the trap this lab exists to avoid. An array of every event grows with the file and defeats the whole point of streaming.
The streaming move is to fold each event into a small accumulator that never grows. You hold one summary object with a running total, a count for each level and the latest timestamp. Each event updates those numbers and is then forgotten. The accumulator is the same size whether the file has ten lines or fifty million.
Here is an example of the pattern that tallies word lengths without keeping the words:

    const totals = { count: 0, longest: 0 };
    for (const word of stream) {
      totals.count += 1;
      totals.longest = Math.max(totals.longest, word.length);
      // word is dropped after this
    }

Notice that nothing holds the words. You apply the same idea to your event summary, updating the total, the per-level count and the latest timestamp as each event flows past.
It is worth being precise about what you just avoided. If summarize had pushed every parsed event into an array and computed the totals at the end, the tool would still print the right numbers on a small file. Every test on a small fixture would pass. The bug would hide until someone ran it on a real multi-million-line log, at which point the array would grow until the heap ran out and the process crashed.
That is the danger of the buffer-everything approach. It is correct on small inputs and catastrophic on large ones, so it survives casual testing and fails in production. The fold-into-an-accumulator approach you wrote has the same output on small inputs and stays flat on large ones.
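To make the parse-then-fold idea concrete, here is a minimal sketch. The line format "<ISO timestamp> <LEVEL> <message>" is an assumption, since the text does not show the real log format, and the summary fields (total, levels, latest) are a guess at the EventSummary shape. parseLine is a hypothetical name, and this summarize is an illustration that may not match the lab's stub signature.

```javascript
// Assumed line format (hypothetical): "<ISO timestamp> <LEVEL> <message>"
const ISO_UTC = /^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d+)?Z$/;

// Returns a small event object, or null for empty or malformed lines.
function parseLine(line) {
  const [timestamp, level, ...rest] = line.trim().split(/\s+/);
  if (!timestamp || !level) return null; // too few fields: skip, do not throw
  if (!ISO_UTC.test(timestamp)) return null;
  const time = Date.parse(timestamp);
  if (Number.isNaN(time)) return null; // e.g. month 99 passes the regex
  return { timestamp, time, level, message: rest.join(' ') };
}

// Folds any async iterable of lines into one fixed-size accumulator.
async function summarize(lines) {
  const summary = { total: 0, levels: {}, latest: null };
  let latestTime = -Infinity;
  for await (const line of lines) {
    const event = parseLine(line);
    if (!event) continue; // bad lines are ignored
    summary.total += 1;
    summary.levels[event.level] = (summary.levels[event.level] ?? 0) + 1;
    if (event.time > latestTime) {
      latestTime = event.time;
      summary.latest = event.timestamp;
    }
    // The event is unreachable after this iteration; nothing grows per line.
  }
  return summary;
}
```

The accumulator grows only with the number of distinct levels, not with the number of lines, which is what keeps memory flat.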
In the Terminal, you can try running the finished tool on the full log and confirm it prints a summary:

    node src/cli.js data/events.log

Then you can run it again under a low heap ceiling on the same large file to prove it stays flat where the naive baseline crashes:

    node --max-old-space-size=64 src/cli.js data/events.log
Challenge
Using backpressure, cleanup and mitigating leaks
Your source still reads the whole file as fast as it can. It loops in start, enqueuing every chunk before anyone reads it. On a slow consumer those chunks pile up inside the stream's internal queue, and that queue is memory you did not account for. The producer is running ahead of the consumer.
Backpressure is the stream telling the producer to wait. A ReadableStream does this through the pull model. Instead of pushing everything in start, you give the source a pull(controller) method. The stream calls pull only when it wants another chunk, which is when the consumer has taken what was there and the internal queue has room. You produce one unit per pull call and then stop until you are asked again.
Here is an example: a number generator. It hands out one number each time it is pulled, never racing ahead:

    let next = 0;
    new ReadableStream({
      pull(controller) {
        controller.enqueue(next); // exactly one item per pull
        next += 1;
      },
    });

You can convert your eager file source to this demand-driven shape. The byte reader stays the same, you just call it from pull instead of looping in start.
Sometimes you do not want the whole file. A tool might show the first matching events and stop, or a user might ask for a capped number of results. The moment you stop reading, you have a resource to think about. The file handle is still open and the source is still ready to produce. If you just walk away, the handle leaks.
A for await...of loop has a clean way to stop. When you break out of it, the loop calls the async iterator's teardown, which cancels the underlying stream. Cancelling the stream runs the source's cancel() method, which is exactly where you close the file handle. So a plain break triggers the whole teardown chain for you, with no manual lock handling.
Here is an example: a counter. Breaking out of the loop is all it takes to trigger cancellation of the source behind it:

    let seen = 0;
    for await (const value of someStream) {
      seen += 1;
      if (seen >= 10) break; // cancels the stream, which runs its cancel()
    }

This could be wired into summarize, breaking when the run hits its configured limit.
Breaking out of the loop is a clean stop you control. Errors are not. A line might fail to parse in a way you did not expect, the disk might hiccup or downstream code might throw. When the body of a for await...of loop throws, the loop still calls the iterator's teardown on its way out, so a single throw tears the source down the same way a break does. That is good, but it only covers exits that flow through the loop itself.
To make teardown unconditional you wrap the consumption in try...finally. The finally block runs whether the loop finishes normally, breaks early or throws, so it is the one place that always executes. Inside it you call stream.cancel(). Cancelling a stream that has already finished or already been cancelled is safe and does nothing the second time, so calling it in finally never causes a problem on a clean run.
You can add try...finally around the fold loop. After this, every run of the tool closes its file handle no matter how the run ends.
You built the whole pipeline. You can now run it end to end. In the Terminal, generate a fresh sample log if you do not already have one, then run the tool against it:

    node scripts/generate-log.js
    node src/cli.js data/events.log

You see a small report like this, produced from a multi-million-line file while the heap stayed flat:

    events: 2000000
    levels: INFO=500000 WARN=500000 ERROR=500000 DEBUG=500000
    latest: 2022-01-24T03:33:19.000Z

Try a capped run to see the early exit you wired in. It stops reading as soon as it hits the limit instead of walking the whole file:

    node src/cli.js data/events.log --limit 5

You turned a leaky one-string baseline into a streaming pipeline that holds constant memory at any input size. The pieces fit together like this:
- A ReadableStream source over the file that produces one byte chunk per pull, so it never races ahead of the consumer and fills the queue. This is how you avoided the memory leak that started the lab.
- A decode-and-split pipeline that turns raw bytes into complete lines while holding only a one-line carry, so partial lines across chunk boundaries stitch together correctly.
- A for await...of fold loop that collapses millions of events into one fixed-size summary, never growing an array.
- Demand-driven production through pull, so the source honours backpressure and produces at the consumer's pace.
- An early break that stops a capped run almost immediately, and a try...finally that calls cancel() on every exit path so the file handle is always released, whether the run finishes, stops early or throws.
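The pull-driven source and the unconditional teardown from the list above can be sketched together. Treat this as an illustration under assumptions: fakeHandle is a hypothetical in-memory stand-in for the lab's file helper, whose real API is not shown in this text, and pullSource and consume are illustrative names, not the lab's exports.

```javascript
// Hypothetical stand-in for a file handle: serves lines from an array and
// records how often it was read and whether it was closed.
function fakeHandle(lines) {
  let next = 0;
  return {
    reads: 0,
    closed: false,
    async read() {
      this.reads += 1;
      return next < lines.length ? lines[next++] : null; // null means end of input
    },
    async close() {
      this.closed = true;
    },
  };
}

// Demand-driven source: one read per pull, and cancel() releases the handle.
function pullSource(handle) {
  return new ReadableStream({
    async pull(controller) {
      const chunk = await handle.read();
      if (chunk === null) {
        await handle.close(); // normal end of input
        controller.close();
      } else {
        controller.enqueue(chunk); // exactly one chunk per pull
      }
    },
    async cancel() {
      await handle.close(); // runs when the consumer stops early
    },
  });
}

// Consumes up to `limit` lines and cancels the stream on every exit path.
async function consume(stream, onLine, limit = Infinity) {
  let seen = 0;
  try {
    for await (const line of stream) {
      onLine(line);
      seen += 1;
      if (seen >= limit) break; // early exit; the iterator cancels the stream
    }
    return seen;
  } finally {
    // Runs after normal completion, break or throw. An errored stream rejects
    // cancel() with the error that is already propagating, so that rejection
    // is ignored here.
    await stream.cancel().catch(() => {});
  }
}
```

With the default queue size the stream asks for roughly one chunk ahead of the consumer, so a capped run such as consume(pullSource(handle), print, 2) performs only a handful of reads before the handle is closed, however long the input is.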
Every line of streaming code you wrote targets the standard Web Streams API. ReadableStream, TransformStream, TextDecoderStream and async iteration are part of the platform, not a Node-only library. The same src/processor.js runs unchanged on other runtimes such as Bun and in modern browsers. You used Node here so the commands and the memory measurements stayed consistent, but the pattern travels.
As next steps, you could:
- Add a second TransformStream to the pipeline that filters events by level before the fold, so a run can summarise only errors.
- Replace the file source with a network source that produces chunks from an HTTP response body, which is itself a ReadableStream, and watch the rest of the pipeline keep working untouched.
- Stream the summary out incrementally instead of returning it once at the end, so a long run can report progress as it goes.
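As a starting point for the first idea, a filtering stage could look something like the sketch below. It assumes the events flowing through the pipeline are objects with a level field, which is a guess at the lab's event shape, and filterByLevel is a hypothetical name.

```javascript
// Hypothetical filter stage: passes through only events with a matching level.
function filterByLevel(level) {
  return new TransformStream({
    transform(event, controller) {
      if (event.level === level) {
        controller.enqueue(event); // matching events continue downstream
      }
      // Non-matching events are dropped and never buffered.
    },
  });
}
```

A stage like this would sit between the parser and the fold, for example eventStream.pipeThrough(filterByLevel('ERROR')), where eventStream is whatever stream of parsed events your pipeline produces.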
Congratulations! You now have a streaming pipeline that reads any size of input in constant memory and tears itself down cleanly. That is the foundation for almost every high-volume data tool you will build.