Thursday, August 24, 2006

A split function in Haskell

Splitting a string into parts based on a token delimiter is a very common operation in some problem domains. Languages such as Perl or Java provide a split function in their standard library to execute this algorithm, yet I'm often surprised to see how many languages do not have one. As far as I can tell neither C++ nor Haskell have it so I have coded such a function in the past multiple times in both languages. (This is not exactly true: Haskell has the words function which splits a string by whitespace characters. Nevertheless I didn't know this when I wrote my custom implementation.)

When I implemented a custom split function in Haskell I was really amazed to see how easy and clean the resulting code was. I'm sure there is some better and even cleaner way to write it because I'm still a Haskell newbie! Here is it:
split :: String -> Char -> [String]
split [] delim = [""]
split (c:cs) delim
| c == delim = "" : rest
| otherwise = (c : head rest) : tail rest
where
rest = split cs delim
The above code starts by declaring the function's type; this is optional because Haskell's type system is able to automatically deduce it. It then uses pattern matching to specify the algorithm's base and recursive cases. At last, the recursive case is defined by parts, just as you do in mathematics. Oh, and why recursivity? Because iteration does not exist in functional programming in the well-known sense of imperative languages. Also note the lack of variables (except for the input ones) and that everything is an evaluable expression.

Let's now compare the above code with two implementations in C++. A first approach to the problem following common imperative programming thinking results in an iterative algorithm:
std::deque< std::string >
split_iterative(const std::string& str, char delim)
{
std::deque< std::string > parts;

std::string word;
for (std::string::const_iterator iter = str.begin();
iter != str.end(); iter++) {
if (*iter == delim) {
parts.push_back(word);
word.clear();
} else
word += *iter;
}
parts.push_back(word);

return parts;
}
This is certainly uglier and much more difficult to prove right; iteration is a complex concept in that sense. In this code we have variables that act as acumulators, temporary objects, commands, etc. Be glad that I used C++ and not C to take advantage of STL containers.

OK, to be fair the code should be implemented in a recursive way to be really comparable to the Haskell sample function. Let's attempt it:
std::deque< std::string >
split_recursive(const std::string& str, char delim)
{
std::deque< std::string > parts;

if (!str.empty()) {
std::string str2 = str;
parts = split_recursive(str2.erase(0, 1), delim);

if (str[0] == delim)
parts.push_front("");
else
parts[0] = str[0] + parts[0];
} else
parts.push_front("");

return parts;
}
This split_recursive function follows the same algorithm as the split written in Haskell. I find that it is still harder to read and more delicate (I had some segmentation fault s until I got it right).

Of course Haskell is not appropriate for everything (which is true for every language out there). I have yet to write a big and useful program in Haskell to really see its power and to be able to relly compare it to other languages. All I can do at the moment is to compare trivial stuff as the above.

14 comments:

  1. Your definition is fine for all standards. I find the following one using span a bit easier to understand:
    split delim s
    | [] <- rest = [token]
    | otherwise = token : split delim (tail rest)
    where (token,rest) = span (/=delim) s

    Both definitions use explicit recursion. In general it is better to avoid it and instead use combinators capturing recursion patterns (think of map, foldr and the like). One of the lesser known ones is 'unfoldr', which builds a list given a term x and a function to deconstruct it incrementally.

    unfoldr :: (a -> Maybe (b, a)) -> a -> [b]

    With unfoldr split can be defined as

    split delim = takeWhile (not . null) . unfoldr (Just . (second $ drop 1) . break (==delim))

    In this case this definition is not that easy to understand (this comes from the Blow your Mind Idioms wiki page), so I prefer to use explicit recursion here. What it does is to build an infinite list of tokens and take only the non empty ones.

    As you mention Haskell code is easier to prove right. But it is easier to test too! We can state some obvious properties that the function 'split' should honor and ask Quickcheck to test them:

    propSplit1 x delim = concat (split x delim) == filter (/=delim) x

    propSplit2 x delim = split (x'++delim:x') == [x',x']
    where x' = filter (/=delim) x

    Try it and you will see that the definition based on unfoldr fails to honor one of the properties.


    Or even more usefully! we can prove that the split version based on span is equivalent to the explicit recursion one:
    propSplit3 x delim = split x delim = splitSpan delim x

    *Main> quickCheck propSplit3
    OK, passed 100 tests.


    God, I'll better stop it here. I just can't get tired of selling Haskell!

    ReplyDelete
  2. Your C++ variant rather inefficient,
    you copy all content of container on exit,
    and copy symbol by symbol not chunks of memory, here is more efficient variant
    and rather simple one:

    template<class Container>
    void split(const std::string& str, char sep, Container& res)
    {
    std::string::size_type prev_pos = 0, pos;

    while ((pos = str.find(sep, prev_pos)) != std::string::npos) {
    res.push_back(std::string(str, prev_pos, pos - prev_pos));
    prev_pos = pos + 1;
    }
    res.push_back(std::string(str, prev_pos, str.length() - prev_pos));
    }

    Variant on Haskell looks like yacc gramatics for me, and far from readable,
    C++ reflect algorithm by itself, may be it is far from human language, but give opportunity easy optimize code.

    ReplyDelete
  3. There's a bug in the "blow your mind" version of split above. If it encounters two delimiters in a row with nothing in between it will stop there and not split the rest of the string.

    I'm afraid my Haskell isn't good enough to fix it.

    ReplyDelete
  4. Here's an unfoldr based version that should work for multiple delims

    It's a bit longer than the one-liner but is also a bit more readable.

    import List

    split :: Char -> String -> [String]
    split = unfoldr . split'

    split' :: Char -> String -> Maybe (String, String)
    split' c l
      | null l = Nothing
      | otherwise = Just (h, drop 1 t)
      where (h, t) = span (/=c) l

    ReplyDelete
  5. I'm sorry, but your comparison seems a bit unfair, since your code is needlessly complicated.

    void split(const string& s, vector<string>& v) {
    for(string::iterator it(s.begin()),end = it;end != s.end();it = end + 1){
    end = find(it,s.end(),' ');
    v.push_back(string(it, end));
    }
    }

    ReplyDelete
  6. Pepe,
    I found your split function to be very concise and impressive.

    I tried to compile it and was told:

    Warning: accepting non-standard pattern guards (use -XPatternGuards to suppress this message). Using PatternGuards helped but then I tried a slight variation on your code:

    split delim s
    | [] == rest = [token]
    | otherwise = token : split delim (tail rest)
    where (token,rest) = span (/=delim) s

    Notice the == instead of the <- on the second line. This worked too and did not require the PatternGuards extension.

    Does using the <- have an advantage over using ==?

    Thanks

    ReplyDelete
  7. Anonymous said...

    I'm sorry, but your comparison seems a bit unfair, since your code is needlessly complicated.

    void split(const string& s, vector& v) {
    for(string::iterator it(s.begin()),end = it;end != s.end();it = end + 1){
    end = find(it,s.end(),' ');
    v.push_back(string(it, end));
    }
    }
    ----------
    The C++ version may be the same number of lines but it is so UGLY and it still looks complicated compared the the haskell version. I don't think anyone would argue that C++/STL code is just BUTT UGLY!

    ReplyDelete
  8. This comment has been removed by the author.

    ReplyDelete
  9. split :: Char -> [Char] -> [[Char]]
    split delim = foldr f []
    where f x y | x == delim = []:y
    | null y = [[x]]
    | otherwise = (x:(head y)):(tail y)

    Nevermind, my last code was broke, I wrote assuming that the string ended in a delimiter. Thus, an error occurred from using head in such a case. Though it's not one line anymore, this version maintains the method, just uses a more educated function in lieu of the quick little lambda.

    ReplyDelete
  10. split sep = takeWhile (not . null) . unfoldr (Just . span (/= sep) . dropWhile (== sep))

    This should work with multiple delimiters.

    ReplyDelete
  11. I found this post was very inspirational when I was writing my multi-character deliminated split function a few days ago. It uses many of the same principles as yours but allows a wide deliminater to be used. Please take a look and see what you think.

    ReplyDelete
  12. Haskell has a library split function which is words.

    words :: String -> [String]
    words s = case dropWhile Char.isSpace s of
    "" -> []
    s' -> w : words s''
    where (w, s'') = break Char.isSpace s'

    ReplyDelete
  13. Since this thread has seen comments even in 2011 I just thought it a good idea to note here that there has been a truly excellent "split" package on Hackage (the Haskell library repository) for some years now.
    http://hackage.haskell.org/packages/archive/split/0.1.4.2/doc/html/Data-List-Split.html
    It fulfill any and every splitting need (short of regexes, but you find a splitRegex in the regex libraries)

    ReplyDelete
  14. Very cool, I had some trouble understanding how this works, until I traced the execution on paper. I also came up with my own version that is slightly different:

    otherTok :: String -> [Char] -> [[Char]]
    otherTok [] _ = []
    otherTok cs delim = foldl(\acc c -> if c `elem` delim then [] : acc else (head acc ++ [c]) : (tail acc) ) [""] cs

    However this produces output in reverse order and adjacent delimiters result in empty lists being inserted into the result, which of course can all be taken care of with a reverse and a filter

    ReplyDelete