HD Regular Expression Search

I am working on a project for my computer security class and I have a couple questions. I had an idea to write a program that would search the whole hard drive looking for email addresses. I am just looking for addresses stored in plain text since it would be hard to find anything otherwise. I figured the best way to find addresses would be to use a regular expression.

I wrote an application in C# that works fairly well but it I would like to see if anyone has any better ideas. I am completely up for writing this in another language since I'm assuming C# isn't the best for this type of thing. So far the application I created just starts at the C:/ and recursively locates all files on the drive skipping those that aren't accessible. It also skips all common image, video, audio, compressed, and files over 512mb. This speeds it up quite a bit but there is a small chance that a large file could contain something useful. It takes about 12 seconds to generate the list of files and I'm guessing about an hour to check them all. One downside is that it uses about 50% cpu while scanning.

I'm looking for ideas on how to improve the search. Is there a faster way, a more efficient way, a more thorough way, things like that? I was trying to think if there was any way that you could tell if the file would contain plain text strings or not. Just let me know if you have any cool ideas. Thanks.

13.10.2009 20:04:43
Don't forget the famous xkcd take on this problem: xkcd.com/208
Tim 13.10.2009 20:11:08

To be honest, the easiest existing way to do this is to use grep. As you improve your program, compare your speeds to it, and when you get close, stop worrying about optimizing. Alternatively, take a look at its source for an example of an existing product that does what you're looking for.

13.10.2009 20:09:38
Noting that C# is the app (and thus making a sweeping assumption by ignoring Mono) on Windows, findstr would be the equivalent without resorting to installing a Win32 port of grep.
Chris J 13.10.2009 20:28:56

you should just use grep + find. grep is optimized for searching files fast, and find is optimized for providing lists of appropriate files for things like this. people have spent a long time optimizing these tools - no need to reinvent the wheel.

13.10.2009 20:07:43

As noted elsewhere, tools already exist for this if you install Win32 ports of UNIX tools. Alternatively, the Windows equivalent is:

for /r c:\ %i in (*.*) do findstr /i /r "regular expression" "%i"
13.10.2009 20:34:00