How To: Searching and scanning files

Posted: (EET/GMT+2)

 

Level: Beginner to Intermediate


The ability to search for files is a basic requirement for a computer with a complex file system. Even a small 2 gigabyte hard disk can contain tens of thousands of files, so it's no wonder if you forget where you put one of them.

It is also often necessary to find a file that contains a particular string. I often need to find files which contain simple Double Byte Character Strings (DBCS's); 32-bit EXE files, for instance, contain such characters in their resource strings. Windows 95's Find command doesn't help here, because if I enter "ABC", it scans for that exact byte sequence. In the EXE file, however, every character takes two bytes, and "ABC" is stored as "#0A#0B#0C", where "#0" is a null character.

For this reason, I sat down for a few hours and typed in the following program. And it works... ;-)


Searching for files

In the Win32 API, there are functions to enumerate the files and directories in a single directory, but no function that searches a whole drive. Thus, we need some programming, and I mean recursive programming.

Recursive functions are functions that call themselves. Recursion is never absolutely necessary, but some elegant algorithms are recursive. Searching for files is one example.

Our algorithm is:

    1. Scan given directory for all files & dirs (ScanDir)
    2. Call ourself (ScanDir) recursively for any found directories
    3. Scan given directory for files matching filespec
    4. [Optional] Scan for any file matching specs for the given string.
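
Steps 1 to 3 can be tried out in isolation before the Win32 version is built. The following is only a sketch under assumptions: it uses the FindFirst/FindNext/FindClose routines of the SysUtils unit instead of the Win32 calls used in this article, the name Walk is made up for illustration, and it has no protection against directory links that point back up the tree.

    program WalkSketch;
    {$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
    {$APPTYPE CONSOLE}

    uses
      SysUtils;

    { Hypothetical helper: prints every file below Dir that matches Spec. }
    { Dir must end with a path delimiter. }
    procedure Walk(const Dir, Spec: string);
    var
      SR: TSearchRec;
    begin
      { Steps 1 and 2: find the subdirectories and recurse into each one. }
      if FindFirst(Dir + '*', faAnyFile, SR) = 0 then
      begin
        repeat
          if ((SR.Attr and faDirectory) <> 0) and
             (SR.Name <> '.') and (SR.Name <> '..') then
            Walk(Dir + SR.Name + PathDelim, Spec);
        until FindNext(SR) <> 0;
        FindClose(SR);
      end;
      { Step 3: list the files in this directory that match the wildcard. }
      if FindFirst(Dir + Spec, faAnyFile, SR) = 0 then
      begin
        repeat
          if (SR.Attr and faDirectory) = 0 then
            WriteLn(Dir, SR.Name);
        until FindNext(SR) <> 0;
        FindClose(SR);
      end;
    end;

    begin
      { Example call: list all *.txt files below the current directory. }
      Walk(IncludeTrailingPathDelimiter(GetCurrentDir), '*.txt');
    end.

Each level of the recursion has its own search record, which is why an inner search cannot disturb the outer one.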

ScanDir in practice

The following is the implementation of the ScanDir function:

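    Procedure ScanDir(Dir : String);
    Var
      DirSearch : THandle;
      DirFD     : TWin32FindData;
      S         : String;
    
    Begin
      { Dir must end with a backslash, for example 'C:\' }
      DirSearch := FindFirstFile(PChar(Dir+'*.*'),DirFD);
      If (DirSearch <> Invalid_Handle_Value) Then Begin
        Repeat
          Inc(Scanned); { Scanned: global counter, declared elsewhere }
          S := DirFD.cFileName;
          { recurse into real subdirectories only; comparing the whole name
            skips "." and ".." but not names that merely start with a dot }
          If (((DirFD.dwFileAttributes And File_Attribute_Directory) <> 0) And
               (S <> '.') And (S <> '..')) Then Begin
            S := Dir+S;
            If (S[Length(S)] <> '\') Then S := S+'\';
            ScanDir(S);
            ScanFiles(S);
          End;
        Until (Not FindNextFile(DirSearch,DirFD));
        { close the handle only if the search was actually opened; the unit
          prefix avoids a clash with SysUtils.FindClose }
        Windows.FindClose(DirSearch);
      End;
    End;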
    
Here, the directory to be searched is initially the root directory (\), but it changes as the function gets called recursively.

To find files using Win32 API calls, we must first start a search. This is done using FindFirstFile: as parameters, we give what we want to search for and a record that receives the result, and we get back a search handle. Subsequent calls (FindNextFile) only need the handle and the record. When the search is finished, the handle is released with FindClose.

To find a directory, we must scan for all files with the "*.*" wildcard. Of course, this finds all the files too, but we can separate files from directories by checking the file attributes. We also need to skip the "current" and "parent" directory entries ("." and ".."), or the recursion would never end.

After we've found a directory, a recursive call occurs. We pass the newly found directory as the new "base" directory, so the new recursion doesn't check for the same files as the previous level.


Searching for files

After directories, we obviously need to scan for the files. Remember, the directories were scanned using the "*.*" wildcard. But, the user can specify another wildcard for the files, for example "*.txt". For this reason, we need an additional bundle of FindFirstFile/FindNextFile/FindClose calls. The code looks like this:

    Procedure ScanFiles(Dir : String);
    Var
      FileSearch : THandle;
      FileFD     : TWin32FindData;
    
    Begin
      FileSearch := FindFirstFile(PChar(Dir+FileSpec),FileFD);
      If (FileSearch <> Invalid_Handle_Value) Then Begin
        Repeat
          With FileFD do Begin
            { report plain files only; if a search string was given
              (ScanLen <> 0), the file must also contain it. A file that
              does not match is skipped, and the loop goes on. }
            If (((dwFileAttributes And File_Attribute_Directory) = 0) And
                ((ScanLen = 0) Or ScanFileForText(Dir+cFileName))) Then Begin
              Inc(Matching);
              If (cAlternateFileName[0] = #0) Then WriteLn(Dir,cFileName)
              Else WriteLn(cAlternateFileName,' ',Dir,cFileName);
            End;
          End;
        Until (Not FindNextFile(FileSearch,FileFD));
        Windows.FindClose(FileSearch);
      End;
    End;
    
As you can see, the main difference compared to ScanDir is that we need to check if the user entered a string to search for (if so, ScanLen does not equal zero). Also, note the result output using WriteLn. Sometimes you need the short (8.3) filename of a long filename, for example because a DOS program needs one. The cAlternateFileName field of the TWin32FindData record gives this name; it is empty when the name in cFileName already fits the 8.3 format.


Searching for strings

The only thing left is searching for strings. If the user gave a string to search for on the command line, we convert it to a DBCS string by adding the null char (#0) in front of every char the user gave. For example, if the user enters "XY", our program searches for "#0X#0Y". The code to search for a string in a file is:

    Function ScanFileForText(Filename : String) : Boolean;
    Var
      F     : File;
      D     : PChar; { data }
      I,J,S : Integer;
    
    Begin
      Result := True;
      S := 0; D := nil;
      Try
        FileMode := 0; { open read-only, so read-only files can be scanned too }
        Assign(F,Filename);
        {$I-}
        Reset(F,1);
        {$I+}
        If (IOResult <> 0) Then Begin
          WriteLn('Can''t scan: ',Filename);
          Result := False;
          Exit;
        End;
        S := FileSize(F);
        GetMem(D,S);
        BlockRead(F,D^,S);
        Close(F);
        { brute force: try to match the whole string at every offset;
          this assumes ScanLen = Length(ScanString) }
        For I := 0 to S-ScanLen do Begin
          J := 0;
          While ((J < ScanLen) And (D[I+J] = ScanString[J+1])) do Inc(J);
          If (J = ScanLen) Then Exit; { found, Result is still True }
        End;
      Finally
        FreeMem(D,S);
      End;
      Result := False;
    End;
    
First, we check if we can open the given file. Note that we switch off automatic I/O checking with the "$I-" compiler directive, so that a failed Reset does not raise an error, and check the result (IOResult) ourselves. For example, opening a file can fail if another program has it open.
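
The open-and-check step can also be tried on its own. This is a minimal sketch, not part of the original program: the name CanOpen is made up, and the sketch additionally sets the FileMode variable to 0, which requests read-only access for Reset, and restores the old value afterwards.

    program OpenSketch;
    {$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
    {$APPTYPE CONSOLE}

    { Hypothetical helper: True if Filename can be opened for reading. }
    function CanOpen(const Filename: string): Boolean;
    var
      F: File;
      OldMode: Byte;
    begin
      OldMode := FileMode;
      FileMode := 0;            { 0 = read-only access for Reset }
      AssignFile(F, Filename);
      {$I-}                     { no automatic I/O checking for the next call }
      Reset(F, 1);
      {$I+}
      Result := (IOResult = 0); { IOResult returns the status of Reset }
      if Result then CloseFile(F);
      FileMode := OldMode;      { restore the previous access mode }
    end;

    begin
      { Example call: try to open the file named on the command line. }
      if CanOpen(ParamStr(1)) then WriteLn('can open: ', ParamStr(1))
      else WriteLn('cannot open: ', ParamStr(1));
    end.

Reading IOResult also resets the stored status, so it should be read exactly once after each unchecked call.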

Next, we just scan the file sequentially to see if the user-specified string occurs in it. The search algorithm used is the slowest one available, but it is the easiest to implement. One could, for example, use the Boyer-Moore algorithm instead of the "brute-force" approach used here.
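
To give an idea of what a faster search could look like, here is a sketch of the Horspool simplification of Boyer-Moore (not the full Boyer-Moore algorithm). It works on strings in memory rather than on the file buffer used in this article, and the names HorspoolPos, Pat and Hay are made up for illustration.

    program HorspoolSketch;
    {$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
    {$APPTYPE CONSOLE}

    { Returns the 1-based position of Pat in Hay, or 0 if it does not occur. }
    function HorspoolPos(const Pat, Hay: AnsiString): Integer;
    var
      Shift: array[AnsiChar] of Integer;
      C: AnsiChar;
      M, N, I, J: Integer;
    begin
      Result := 0;
      M := Length(Pat);
      N := Length(Hay);
      if (M = 0) or (M > N) then Exit;
      { By default a mismatch lets us skip the whole pattern length. }
      for C := Low(AnsiChar) to High(AnsiChar) do Shift[C] := M;
      { Chars that occur in the pattern (except its last one) allow less. }
      for I := 1 to M - 1 do Shift[Pat[I]] := M - I;
      I := M;                      { index in Hay aligned with Pat[M] }
      while I <= N do
      begin
        J := 0;                    { number of chars matched from the right }
        while (J < M) and (Hay[I - J] = Pat[M - J]) do Inc(J);
        if J = M then
        begin
          Result := I - M + 1;
          Exit;
        end;
        Inc(I, Shift[Hay[I]]);     { skip according to the last aligned char }
      end;
    end;

    begin
      WriteLn(HorspoolPos('XY', 'AXXYB'));              { prints 3 }
      WriteLn(HorspoolPos(#0'X'#0'Y', 'Q'#0'X'#0'Y'));  { prints 2 }
      WriteLn(HorspoolPos('AB', 'AAA'));                { prints 0 }
    end.

The gain comes from the shift table: when the char under the end of the pattern does not occur in the pattern at all, the search jumps ahead by the full pattern length instead of by one position.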

Conclusion

Having read this article, you should be able to create your own file search engines, with or without the file scanning. The example here is very simple: for instance, there is no support for multiple drives, and a nice GUI is missing. But download the code and examine it. Then extend it!