Tim Bray has a fun programming challenge that I thought I'd relax with tonight.
I just hacked up the most obvious sequential implementation using C# 3.0. [ed: jcheng has an even more compact version over here.]
Total elapsed time over the ~225MB log file on my 6 month old X60 (2ghz Core Duo) was 3.85 seconds.
I'll put the code up here and trust that the folks within Microsoft “solving the multi/many-core problem“ will offer up solutions that crush this one (which I spent 20 minutes on after a two beer dinner at Taj Palace).
In looking at the solutions out there, it's obvious I need to learn enough Erlang to figure out whether the file:read_file solutions are buffering the entire file in memory or are doing streaming. The program I wrote does line-by-line processing to avoid buffering the entire file in memory. I'm not sure that loading the entire file into virtual memory (either by hand or using mmap) is going to yield faster results even with more cores and a more parallized approach. [ed: after thinking about it for ten minutes, I came up with the obvious solution described here]
Anyway, here's the code:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;
using System.Text.RegularExpressions;
using System.Diagnostics;
class Program
{
static void Main(string[] args)
{
var regex = new Regex(@"GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+)");
var grouped = from line in ReadLinesFromFile(@"C:\temp\bray.txt")
let match = regex.Match(line)
where match.Success
let url = match.Value
group url by url;
var ordered = from g in grouped
let count = g.Count()
orderby count descending
select new { Count = count, Key = g.Key };
foreach (var item in ordered.Take(10))
Console.WriteLine("{0}: {1}", item.Count, item.Key);
}
// LINQ-compatible streaming I/O helper
public static IEnumerable<string> ReadLinesFromFile(string filename)
{
using (StreamReader reader = new StreamReader(filename)) {
while (true)
{
string s = reader.ReadLine();
if (s == null)
break;
yield return s;
}
}
}
}
Posted
Oct 09 2007, 07:41 PM
by
don-box