This is the story of what happened when I once tried to split a string using Java and its built-in split(String, int) method. In an ebook reader app, it was used to obtain a nine-word string from the beginning of a book page to serve as a label for a bookmark. Thus, the string came from one source and the code from another, and neither of them was mine, but the resulting problem certainly was.
The story starts with a string, of course, and, split into words, it should have looked like this:
{"This", "is", "the", "beginning", "of", "a", "long", "story."}
It was split with this code:
    String delimiters = "\\s+"; // one or more whitespace characters
    String[] chunks = string.split(delimiters, 10); // apply pattern at most 9 times, return an array of max length 10
The delimiters string is a regular expression which, according to the Java documentation, represents a series of one or more whitespace characters. It is passed to the String object's split() method, which uses it to divide the string into chunks, cutting at every point where it finds a delimiter.
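To make the role of the limit argument concrete, here is a minimal sketch. The sentence literal is an assumption reconstructed from the array above, not the actual book page, and the class and method names are made up for the illustration.

```java
import java.util.Arrays;

class SplitLimitDemo {
    // Splits on runs of whitespace; the pattern is applied at most (limit - 1) times.
    static String[] splitWords(String text, int limit) {
        return text.split("\\s+", limit);
    }

    public static void main(String[] args) {
        // Assumed input: a clean sentence without leading or trailing whitespace.
        String text = "This is the beginning of a long story.";

        // The limit is larger than the word count, so every word gets its own slot.
        System.out.println(Arrays.toString(splitWords(text, 10)));
        // prints [This, is, the, beginning, of, a, long, story.]

        // With limit 3 the pattern is applied twice; the rest stays in the last slot.
        System.out.println(Arrays.toString(splitWords(text, 3)));
        // prints [This, is, the beginning of a long story.]
    }
}
```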
But here is what I got:
{"", "This", "is", "the", "beginning", "of", "a", "long", "story."}
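In general, an empty first element like this appears whenever the delimiter pattern matches at the very start of the input: split() then reports the (empty) text in front of that match as the first chunk. A small sketch with a made-up input string:

```java
import java.util.Arrays;

class LeadingDelimiterDemo {
    static String[] splitWords(String text) {
        return text.split("\\s+", 10);
    }

    public static void main(String[] args) {
        // Hypothetical input that starts with two ordinary spaces.
        String[] chunks = splitWords("  This is");

        // The match at index 0 leaves an empty string in front of it.
        System.out.println(chunks.length);            // prints 3
        System.out.println(chunks[0].isEmpty());      // prints true
        System.out.println(Arrays.toString(chunks));  // prints [, This, is]
    }
}
```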
This could have been the end of the story, but it wasn't. For two reasons. The first one is that the code, which was already in production, tried to glue the chunks back together with one space between each chunk, in this for loop:
    for (int i = 0; i < nChunks; ++i) {
        if (chunks[i].length() <= 0) {
            // fails if regularExpression does not account for consecutive whitespace
            throw new AssertionError("Regular expression does not account for consecutive whitespace.");
        }
        result += chunks[i];
        if (i != nChunks - 1) {
            result += " ";
        }
    }
So an AssertionError was thrown. I have access to this code, so I could have simply replaced the throw with continue, but since I had found the reason for the empty string, I preferred to trim() the input string before splitting it instead:
    string = string.trim();
This time, there should be no empty string in the string array. But alas, there was! And that's the second circumstance that intrigued me enough to explore this. The array resulting from split()ing a trim()ed version of the string was still
{"", "This", "is", "the", "beginning", "of", "a", "long", "story."}
It's the non-breaking space.
Why wasn't the non-breaking space trimmed away? Here is java.lang.String's trim() method:
    public String trim() {
        int len = count;
        int st = 0;
        while ((st < len) && (charAt(st) <= ' ')) {
            st++;
        }
        while ((st < len) && (charAt(len - 1) <= ' ')) {
            len--;
        }
        return ((st > 0) || (len < count)) ? substring(st, len) : this;
    }
The condition charAt(st) <= ' ' isn't met for the non-breaking space in our string, which is \xA0: its code, 0xA0, is greater than that of the ordinary space, 0x20.
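The comparison can be checked in isolation. In this sketch the helper trimWouldStrip is a hypothetical name that merely mirrors the condition in the listing above, and the padded string is made up:

```java
class TrimNbspDemo {
    static final char NBSP = '\u00A0';

    // Mirrors the test used by trim(): only characters up to U+0020 are stripped.
    static boolean trimWouldStrip(char c) {
        return c <= ' ';
    }

    public static void main(String[] args) {
        System.out.println(trimWouldStrip(' '));   // prints true
        System.out.println(trimWouldStrip('\t'));  // prints true
        System.out.println(trimWouldStrip(NBSP));  // prints false

        // A string padded with a non-breaking space comes back from trim() unchanged.
        String padded = NBSP + "This is";
        System.out.println(padded.trim().equals(padded)); // prints true
    }
}
```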
And why isn't it stripped off with the \s+ regex pattern? Because the \s pattern is defined as [ \t\n\x0B\f\r]. The non-breaking space isn't in there. (\x0B is a vertical tabulation).
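A quick way to check which characters \s accepts, assuming a standard JDK regex engine with the default character class definition quoted above (the helper name is made up for the illustration):

```java
class RegexWhitespaceDemo {
    // True if the single character c is matched by the predefined class \s.
    static boolean matchesBackslashS(char c) {
        return String.valueOf(c).matches("\\s");
    }

    public static void main(String[] args) {
        System.out.println(matchesBackslashS(' '));          // prints true
        System.out.println(matchesBackslashS('\t'));         // prints true
        System.out.println(matchesBackslashS((char) 0x0B));  // prints true (vertical tab)
        System.out.println(matchesBackslashS((char) 0xA0));  // prints false (non-breaking space)
    }
}
```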
But we don't need to dwell on the fact that the non-breaking space isn't counted as whitespace; Java offers another way to get at it, namely \p{javaWhitespace}. This pattern is said to be equivalent to java.lang.Character.isWhitespace(), so I thought that should be fine.
But this time, printing out the strings in the array gave me this:
{" ", "This", "is", "the", "beginning", "of", "a", "long", "story."}
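What looks like an ordinary space in that first element is the non-breaking space itself, which the pattern did not treat as a delimiter. A sketch that reproduces the effect, assuming an input that starts with a non-breaking space followed by an ordinary space (the actual page text is not known):

```java
class JavaWhitespaceDemo {
    static String[] splitWords(String text) {
        return text.split("\\p{javaWhitespace}+", 10);
    }

    public static void main(String[] args) {
        // Character.isWhitespace() does not accept the non-breaking space.
        System.out.println(Character.isWhitespace('\u00A0')); // prints false

        // Assumed input: U+00A0, then an ordinary space, then the words.
        String[] chunks = splitWords("\u00A0 This is");
        System.out.println(chunks.length);                   // prints 3
        System.out.println(chunks[0].equals("\u00A0"));      // prints true
    }
}
```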
Character.isWhitespace() explicitly excludes the three non-breaking spaces (\u00A0, \u2007 and \u202F), so the fix for this is easy: we simply include them in our splitter regexp, thus: [\\u00A0\\u2007\\u202F\\p{javaWhitespace}]+. But then we're back to the leading empty string in our result array. So the complete procedure to follow to fix this issue would be
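One way to combine these observations is sketched below. It is an illustration under assumptions, not the code that ended up in the app: the class and method names are hypothetical, stripping the separators from both ends with replaceAll() is my choice of technique, and String.join() requires Java 8.

```java
import java.util.Arrays;

class BookmarkLabelSketch {
    // Whitespace as Java sees it, plus the three non-breaking spaces it excludes.
    static final String SEPARATORS = "[\\u00A0\\u2007\\u202F\\p{javaWhitespace}]+";

    // Hypothetical helper: the first maxWords words of page, joined by single spaces.
    static String label(String page, int maxWords) {
        // Remove separators at both ends so split() cannot yield empty edge chunks.
        String stripped = page.replaceAll("^" + SEPARATORS + "|" + SEPARATORS + "$", "");
        // One extra slot collects the unsplit remainder of the page.
        String[] chunks = stripped.split(SEPARATORS, maxWords + 1);
        int nWords = Math.min(maxWords, chunks.length);
        return String.join(" ", Arrays.copyOf(chunks, nWords));
    }

    public static void main(String[] args) {
        // Assumed input with a leading non-breaking space and a trailing space.
        String page = "\u00A0 This is the beginning of a long story. ";
        System.out.println(label(page, 9));
        // prints This is the beginning of a long story.
        System.out.println(label(page, 3));
        // prints This is the
    }
}
```

Stripping both ends first means the joining step never sees an empty chunk, so an assertion like the one in the production loop would no longer fire.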
On a side note: if the string is terminated by whitespace characters (and not trim()ed), the two split() methods in Java's String class behave differently. The one that accepts only a regular expression returns, as expected, an array of words. But the one that takes both a regular expression and an integer limit applies the pattern at most limit - 1 times and puts whatever remains into the last element. If the search reaches the end of the string before the limit is exhausted, the trailing delimiter is matched too, and the empty string appears at the end of the result array. The only way to avoid this is to set the integer to 0, which gives the same behaviour as if the other method were used. But this is documented behaviour.
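The difference is easy to reproduce. In this sketch the input string and the helper name are made up for the demonstration:

```java
import java.util.Arrays;

class TrailingEmptyDemo {
    static String[] splitWords(String text, int limit) {
        return text.split("\\s+", limit);
    }

    public static void main(String[] args) {
        // Hypothetical input that ends with a space.
        String text = "a long story. ";

        // One-argument form: trailing empty strings are discarded.
        System.out.println(Arrays.toString(text.split("\\s+")));
        // prints [a, long, story.]

        // Positive limit: the empty string after the last delimiter is kept.
        System.out.println(Arrays.toString(splitWords(text, 10)));
        // prints [a, long, story., ]

        // Limit 0 behaves like the one-argument form.
        System.out.println(Arrays.toString(splitWords(text, 0)));
        // prints [a, long, story.]
    }
}
```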
January 2017