When I implemented a custom split function in Haskell I was really amazed to see how easy and clean the resulting code was. I'm sure there is some better and even cleaner way to write it because I'm still a Haskell newbie! Here is it:
split :: String -> Char -> [String]The above code starts by declaring the function's type; this is optional because Haskell's type system is able to automatically deduce it. It then uses pattern matching to specify the algorithm's base and recursive cases. At last, the recursive case is defined by parts, just as you do in mathematics. Oh, and why recursivity? Because iteration does not exist in functional programming in the well-known sense of imperative languages. Also note the lack of variables (except for the input ones) and that everything is an evaluable expression.
split [] delim = [""]
split (c:cs) delim
| c == delim = "" : rest
| otherwise = (c : head rest) : tail rest
where
rest = split cs delim
Let's now compare the above code with two implementations in C++. A first approach to the problem following common imperative programming thinking results in an iterative algorithm:
std::deque< std::string >This is certainly uglier and much more difficult to prove right; iteration is a complex concept in that sense. In this code we have variables that act as acumulators, temporary objects, commands, etc. Be glad that I used C++ and not C to take advantage of STL containers.
split_iterative(const std::string& str, char delim)
{
std::deque< std::string > parts;
std::string word;
for (std::string::const_iterator iter = str.begin();
iter != str.end(); iter++) {
if (*iter == delim) {
parts.push_back(word);
word.clear();
} else
word += *iter;
}
parts.push_back(word);
return parts;
}
OK, to be fair the code should be implemented in a recursive way to be really comparable to the Haskell sample function. Let's attempt it:
std::deque< std::string >This split_recursive function follows the same algorithm as the split written in Haskell. I find that it is still harder to read and more delicate (I had some segmentation fault s until I got it right).
split_recursive(const std::string& str, char delim)
{
std::deque< std::string > parts;
if (!str.empty()) {
std::string str2 = str;
parts = split_recursive(str2.erase(0, 1), delim);
if (str[0] == delim)
parts.push_front("");
else
parts[0] = str[0] + parts[0];
} else
parts.push_front("");
return parts;
}
Of course Haskell is not appropriate for everything (which is true for every language out there). I have yet to write a big and useful program in Haskell to really see its power and to be able to relly compare it to other languages. All I can do at the moment is to compare trivial stuff as the above.
Your definition is fine for all standards. I find the following one using span a bit easier to understand:
ReplyDeletesplit delim s
| [] <- rest = [token]
| otherwise = token : split delim (tail rest)
where (token,rest) = span (/=delim) s
Both definitions use explicit recursion. In general it is better to avoid it and instead use combinators capturing recursion patterns (think of map, foldr and the like). One of the lesser known ones is 'unfoldr', which builds a list given a term x and a function to deconstruct it incrementally.
unfoldr :: (a -> Maybe (b, a)) -> a -> [b]
With unfoldr split can be defined as
split delim = takeWhile (not . null) . unfoldr (Just . (second $ drop 1) . break (==delim))
In this case this definition is not that easy to understand (this comes from the Blow your Mind Idioms wiki page), so I prefer to use explicit recursion here. What it does is to build an infinite list of tokens and take only the non empty ones.
As you mention Haskell code is easier to prove right. But it is easier to test too! We can state some obvious properties that the function 'split' should honor and ask Quickcheck to test them:
propSplit1 x delim = concat (split x delim) == filter (/=delim) x
propSplit2 x delim = split (x'++delim:x') == [x',x']
where x' = filter (/=delim) x
Try it and you will see that the definition based on unfoldr fails to honor one of the properties.
Or even more usefully! we can prove that the split version based on span is equivalent to the explicit recursion one:
propSplit3 x delim = split x delim = splitSpan delim x
*Main> quickCheck propSplit3
OK, passed 100 tests.
God, I'll better stop it here. I just can't get tired of selling Haskell!
Your C++ variant rather inefficient,
ReplyDeleteyou copy all content of container on exit,
and copy symbol by symbol not chunks of memory, here is more efficient variant
and rather simple one:
template<class Container>
void split(const std::string& str, char sep, Container& res)
{
std::string::size_type prev_pos = 0, pos;
while ((pos = str.find(sep, prev_pos)) != std::string::npos) {
res.push_back(std::string(str, prev_pos, pos - prev_pos));
prev_pos = pos + 1;
}
res.push_back(std::string(str, prev_pos, str.length() - prev_pos));
}
Variant on Haskell looks like yacc gramatics for me, and far from readable,
C++ reflect algorithm by itself, may be it is far from human language, but give opportunity easy optimize code.
There's a bug in the "blow your mind" version of split above. If it encounters two delimiters in a row with nothing in between it will stop there and not split the rest of the string.
ReplyDeleteI'm afraid my Haskell isn't good enough to fix it.
Here's an unfoldr based version that should work for multiple delims
ReplyDeleteIt's a bit longer than the one-liner but is also a bit more readable.
import List
split :: Char -> String -> [String]
split = unfoldr . split'
split' :: Char -> String -> Maybe (String, String)
split' c l
| null l = Nothing
| otherwise = Just (h, drop 1 t)
where (h, t) = span (/=c) l
I'm sorry, but your comparison seems a bit unfair, since your code is needlessly complicated.
ReplyDeletevoid split(const string& s, vector<string>& v) {
for(string::iterator it(s.begin()),end = it;end != s.end();it = end + 1){
end = find(it,s.end(),' ');
v.push_back(string(it, end));
}
}
Pepe,
ReplyDeleteI found your split function to be very concise and impressive.
I tried to compile it and was told:
Warning: accepting non-standard pattern guards (use -XPatternGuards to suppress this message). Using PatternGuards helped but then I tried a slight variation on your code:
split delim s
| [] == rest = [token]
| otherwise = token : split delim (tail rest)
where (token,rest) = span (/=delim) s
Notice the == instead of the <- on the second line. This worked too and did not require the PatternGuards extension.
Does using the <- have an advantage over using ==?
Thanks
Anonymous said...
ReplyDeleteI'm sorry, but your comparison seems a bit unfair, since your code is needlessly complicated.
void split(const string& s, vector& v) {
for(string::iterator it(s.begin()),end = it;end != s.end();it = end + 1){
end = find(it,s.end(),' ');
v.push_back(string(it, end));
}
}
----------
The C++ version may be the same number of lines but it is so UGLY and it still looks complicated compared the the haskell version. I don't think anyone would argue that C++/STL code is just BUTT UGLY!
This comment has been removed by the author.
ReplyDeletesplit :: Char -> [Char] -> [[Char]]
ReplyDeletesplit delim = foldr f []
where f x y | x == delim = []:y
| null y = [[x]]
| otherwise = (x:(head y)):(tail y)
Nevermind, my last code was broke, I wrote assuming that the string ended in a delimiter. Thus, an error occurred from using head in such a case. Though it's not one line anymore, this version maintains the method, just uses a more educated function in lieu of the quick little lambda.
split sep = takeWhile (not . null) . unfoldr (Just . span (/= sep) . dropWhile (== sep))
ReplyDeleteThis should work with multiple delimiters.
I found this post was very inspirational when I was writing my multi-character deliminated split function a few days ago. It uses many of the same principles as yours but allows a wide deliminater to be used. Please take a look and see what you think.
ReplyDeleteHaskell has a library split function which is words.
ReplyDeletewords :: String -> [String]
words s = case dropWhile Char.isSpace s of
"" -> []
s' -> w : words s''
where (w, s'') = break Char.isSpace s'
Since this thread has seen comments even in 2011 I just thought it a good idea to note here that there has been a truly excellent "split" package on Hackage (the Haskell library repository) for some years now.
ReplyDeletehttp://hackage.haskell.org/packages/archive/split/0.1.4.2/doc/html/Data-List-Split.html
It fulfill any and every splitting need (short of regexes, but you find a splitRegex in the regex libraries)
Very cool, I had some trouble understanding how this works, until I traced the execution on paper. I also came up with my own version that is slightly different:
ReplyDeleteotherTok :: String -> [Char] -> [[Char]]
otherTok [] _ = []
otherTok cs delim = foldl(\acc c -> if c `elem` delim then [] : acc else (head acc ++ [c]) : (tail acc) ) [""] cs
However this produces output in reverse order and adjacent delimiters result in empty lists being inserted into the result, which of course can all be taken care of with a reverse and a filter