Tuesday, May 19, 2009

Solutions to Chapter 8 (p. 205)

#1. Simply evaluating globToRegex "[" gives the answer: error is called. Two things are worth noting here. First, two different places can flag this error: globToRegex', when nothing follows the '[' (or the "[!") that opens the class; and charClass, when the character class is nonempty but never terminated. Second, and more interesting: because of lazy evaluation, the error is raised only after the part of the regexp string that precedes it has been generated and printed. See, for example, the response from ghci here:

GlobRegex> globToRegex "whatnow?["
"^whatnow.*** Exception: unterminated character class

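The lazy behaviour can be reproduced without the GlobRegex module at all. The following is only a minimal sketch: lazyRegex is a hypothetical stand-in for the result of globToRegex "whatnow?[", and prefixOnly and forceString are names made up for this illustration. It shows that consuming just the prefix never hits the error, and that walking the whole string first makes the error surface before anything is printed.

```haskell
import Control.Exception (ErrorCall (..), evaluate, try)

-- Hypothetical stand-in for the result of globToRegex "whatnow?[":
-- a well-defined prefix followed by a call to error.
lazyRegex :: String
lazyRegex = "^whatnow." ++ error "unterminated character class"

-- Consuming only the nine-character prefix never touches the error.
prefixOnly :: String
prefixOnly = take 9 lazyRegex

-- Walk the whole string before handing it back, so that a hidden error
-- surfaces here instead of halfway through printing.
forceString :: String -> IO (Either String String)
forceString s = do
  result <- try (evaluate (length s `seq` s))
  return $ case result of
    Left (ErrorCall msg) -> Left msg
    Right ok             -> Right ok
```

In ghci, forceString lazyRegex should come back as Left "unterminated character class" with no partial output, while prefixOnly is simply "^whatnow.".
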
#2. POSIX regular expressions can be compiled case-insensitively (the REG_ICASE flag), but of course that's not the interesting way to solve this.

Let's add a new Bool parameter, which should be True if case is to be ignored. We'll also add a new helper function, escapeCase, which returns "xX" for any letter x when case is ignored, and just the character itself otherwise. escapeCase is used in two different contexts: inside a character class (a [...] block) a simple replacement is enough, but outside of one we need to create a class, i.e., replace x with [xX]. Because escape processes every character outside of a character class, it is the perfect place to handle this. Here goes:

module GlobRegex (globToRegex, matchesGlob, matchesGlobIgnoreCase) where

import Text.Regex.Posix ((=~))
import Data.Char (toUpper, toLower)

globToRegex :: String -> Bool -> String
globToRegex cs ign = '^' : globToRegex' cs ign ++ "$"

globToRegex' :: String -> Bool -> String
globToRegex' "" _ = ""

globToRegex' ('*':cs) ign = ".*" ++ globToRegex' cs ign
globToRegex' ('?':cs) ign = "." ++ globToRegex' cs ign

globToRegex' ('[':'!':c:cs) ign = "[^" ++ (escapeCase c ign) ++ (charClass cs ign)
globToRegex' ('[':c:cs) ign = "[" ++ (escapeCase c ign) ++ (charClass cs ign)
globToRegex' ('[':_) _ = error "unterminated character class"

globToRegex' (c:cs) ign = (escape c ign) ++ globToRegex' cs ign

escape :: Char -> Bool -> String
escape c _ | c `elem` regexChars = '\\' : [c]
  where regexChars = "\\+()^$.{}]|"
escape c False = [c]
escape c True = '[' : (escapeCase c True) ++ "]"

escapeCase :: Char -> Bool -> String
escapeCase c True | lowerC /= upperC = [lowerC, upperC]
  where upperC = toUpper c
        lowerC = toLower c
escapeCase c _ = [c]

charClass :: String -> Bool -> String
charClass (']':cs) ign = ']' : globToRegex' cs ign
charClass (c:cs) ign = escapeCase c ign ++ charClass cs ign
charClass _ _ = error "unterminated character class"

matchesGlob :: FilePath -> String -> Bool
f `matchesGlob` g = f =~ globToRegex g False

matchesGlobIgnoreCase :: FilePath -> String -> Bool
f `matchesGlobIgnoreCase` g = f =~ globToRegex g True

Trying it out in ghci:

Prelude> :load "GlobRegex.hs"
[1 of 1] Compiling GlobRegex        ( GlobRegex.hs, interpreted )
Ok, modules loaded: GlobRegex.
*GlobRegex> globToRegex "hello" False
*GlobRegex> globToRegex "hello" True
*GlobRegex> globToRegex "HELLO" True
*GlobRegex> globToRegex "foo[bar]" True

Working through the definitions by hand, these four calls should evaluate to "^hello$", "^[hH][eE][lL][lL][oO]$", "^[hH][eE][lL][lL][oO]$" and "^[fF][oO][oO][bBaArR]$", respectively.

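One caveat with this solution, as far as tracing it by hand shows: charClass expands each character on its own, so a range such as [a-c] turns into [aA-cC]. In an ASCII-ordered locale the A-c in the middle is itself a range, and it covers far more than the three intended letters. Below is a rough sketch of one way to expand the body of a class while keeping ranges together. classIgnoreCase, single and caseRange are hypothetical names, not part of the module above, and the sketch assumes both endpoints of a range are letters of the same case or non-letters.

```haskell
import Data.Char (toLower, toUpper)

-- Hypothetical helper: rewrite the body of a character class (the text
-- between '[' and ']') so that it ignores case, keeping x-y ranges intact.
classIgnoreCase :: String -> String
classIgnoreCase (a:'-':b:rest) = caseRange a b ++ classIgnoreCase rest
classIgnoreCase (c:rest)       = single c ++ classIgnoreCase rest
classIgnoreCase []             = []

-- A single letter becomes both of its cases; anything else is unchanged.
single :: Char -> String
single c
  | lowerC /= upperC = [lowerC, upperC]
  | otherwise        = [c]
  where lowerC = toLower c
        upperC = toUpper c

-- A range is emitted once in lower case and once in upper case, or only
-- once when case conversion leaves it unchanged (e.g. 0-9).
caseRange :: Char -> Char -> String
caseRange a b
  | lowerR == upperR = lowerR
  | otherwise        = lowerR ++ upperR
  where lowerR = [toLower a, '-', toLower b]
        upperR = [toUpper a, '-', toUpper b]
```

With these definitions, classIgnoreCase "a-c" gives "a-cA-C", classIgnoreCase "bar" gives "bBaArR" (the same as the module above), and classIgnoreCase "0-9" stays "0-9". Wiring this into charClass would still require collecting the class body up to the closing ']' first.
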