Wide Finder in C# - the Naive implementation

Don Box's Spoutlet

Syndication

Tim Bray has a fun programming challenge that I thought I'd relax with tonight.

I just hacked up the most obvious sequential implementation using C# 3.0. [ed: jcheng has an even more compact version over here.]

Total elapsed time over the ~225MB log file on my 6 month old X60 (2ghz Core Duo) was 3.85 seconds.

I'll put the code up here and trust that the folks within Microsoft “solving the multi/many-core problem“ will offer up solutions that crush this one (which I spent 20 minutes on after a two beer dinner at Taj Palace).

In looking at the solutions out there, it's obvious I need to learn enough Erlang to figure out whether the file:read_file solutions are buffering the entire file in memory or are doing streaming. The program I wrote does line-by-line processing to avoid buffering the entire file in memory.  I'm not sure that loading the entire file into virtual memory (either by hand or using mmap) is going to yield faster results even with more cores and a more parallized approach.  [ed: after thinking about it for ten minutes, I came up with the obvious solution described here]

Anyway, here's the code:

using System;

using System.Collections.Generic;

using System.Linq;

using System.Text;

using System.IO;

using System.Text.RegularExpressions;

using System.Diagnostics;

 

class Program

{

    static void Main(string[] args)

    {

        var regex = new Regex(@"GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+)");

 

        var grouped = from line in ReadLinesFromFile(@"C:\temp\bray.txt")

                      let match = regex.Match(line)

                      where match.Success

                      let url = match.Value

                      group url by url;

 

 

        var ordered = from g in grouped

                      let count = g.Count()

                      orderby count descending

                      select new { Count = count, Key = g.Key };

        

        foreach (var item in ordered.Take(10))

            Console.WriteLine("{0}: {1}", item.Count, item.Key);

    }

 

    // LINQ-compatible streaming I/O helper

    public static IEnumerable<string> ReadLinesFromFile(string filename)

    {

        using (StreamReader reader = new StreamReader(filename)) {

            while (true)

            {

                string s = reader.ReadLine();

                if (s == null)

                    break;

                yield return s;

            }

        }

    }

 

}

 

 

 


Posted Oct 09 2007, 07:41 PM by don-box

Comments

Igor Ostrovsky wrote re: Wide Finder in C# - the Naive implementation
on 10-09-2007 9:28 PM
Out of curiosity, how long does it take to just read the input? If much of the time is spent reading the input, then there isn't much processing there to parallelize.
Lars Wilhelmsen wrote re: Wide Finder in C# - the Naive implementation
on 10-09-2007 10:10 PM
Hi Don,

Wouldn't an obvious easy optimization be to stick in the RegexOptions.Compiled when you create the regex instance?

--larsw
Lasse Vågsæther Karlsen wrote re: Wide Finder in C# - the Naive implementation
on 10-10-2007 5:32 AM
I notice that your helper doesn't close the stream after being done with it.

Is there any reason to this, except for the simplicity of the example program?

The reason I'm asking is that I've been having problems with using NCover to get code coverage for the finally section of a enumerator method, even though I know (and can observe) that it executes, as long as you use foreach or make sure you dispose of the enumerator yourself, and I just want to know if using try/finally in such a method supported...
Thanos Baskous wrote re: Wide Finder in C# - the Naive implementation
on 10-10-2007 9:44 AM
@Lasse:

The "using" keyword is a shortcut that the C# compiler recognizes and then generates the code that closes the stream for you by calling the dispose method on the object in a finally block.

What the above is really doing is something like this (I'm not sure if formatting will show up correctly):

StreamReader reader = new StreamReader(filename);
try {
while (true)
{
string s = reader.ReadLine();
if (s == null)
break;
yield return s;
}
}
finally {
if(reader != null)
reader.Dispose();
}
Sean wrote re: Wide Finder in C# - the Naive implementation
on 10-11-2007 5:32 PM
Lars: I had the same thought, but the up front cost of compiling the regex might actually make it take longer for the input size.

He would have to test the two approaches.
midk wrote re: Wide Finder in C# - the Naive implementation
on 10-15-2007 8:33 PM
Would be a powerful demonstration to show PLINQ on this. May need to increase the file size to a few tens of GB to make it worthwhile.
cd wrote re: Wide Finder in C# - the Naive implementation
on 12-21-2007 4:26 PM
Interesting use of Linq on a regex expression. I am currently working on parsing emails and this implementation should make it go a lot quicker.

Add a Comment

(required)  
(optional)
(required)  
Remember Me?