Splitting strings with non-breaking spaces in Java / Android

This is the story of what happened when I once tried to split a string using Java and its built-in split(String, int) method. In an ebook reader app, this was to be used to obtain a nine-word string from the beginning of a book page, to serve as the label for a bookmark. Thus, the string came from one source and the code from another, and neither of them was mine, but the resulting problem certainly was.

The story starts with a string, of course, and it looked like this:

  This is the beginning of a long story.
What was expected to come out was this String array:
{"This", "is", "the", "beginning", "of", "a", "long", "story."}
The code provided was this:
  String delimiters = "\\s+";  // one or more whitespace characters
  String[] chunks = string.split(delimiters, 10); // apply pattern 9 times, return an array of max length 10
The delimiters string is a regular expression which, according to the Java documentation, represents a series of one or more whitespace characters. One passes it to the String object's split() method, which uses it to divide the string into chunks, cutting at every point where it finds a delimiter.

But here is what I got:

{"", "This", "is", "the", "beginning", "of", "a", "long", "story."}
Why the empty string at the start of the array? Well, it turns out that the input string starts with whitespace. And the JDK 8 docs clearly state that in such cases, this is expected: "When there is a positive-width match at the beginning of this string then an empty leading substring is included at the beginning of the resulting array." I missed this because I was developing for Android and read a previous version of the docs, which didn't have this text. One could still figure it out, though, from this: "The array returned by this method contains each substring of this string that is terminated by ...". I missed the consequences of this. If there are leading spaces in the string, the empty string sits at position 0 and is terminated by the whitespace, hence it becomes the first string in the returned array. Mystery solved!
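To see the leading empty string for yourself, here is a minimal sketch. The class and method names are made up for illustration, and the behaviour shown is what I would expect from a current desktop JDK, not necessarily from every Android version.

```java
import java.util.Arrays;

class LeadingEmptyDemo {
    // Same pattern and limit as in the code I was given.
    static String[] splitWords(String text) {
        return text.split("\\s+", 10);
    }

    public static void main(String[] args) {
        // The input starts with ordinary spaces, so the first match is at index 0.
        String[] chunks = splitWords("  This is the beginning");
        System.out.println(Arrays.toString(chunks));
        System.out.println("first chunk is empty: " + chunks[0].isEmpty());
    }
}
```

The first element is the empty substring that sits in front of the leading whitespace.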

This could have been the end of the story, but it wasn't, for two reasons. The first one is that the code, which was already in production, tried to glue the chunks back together with one space between each chunk, in this for loop:

  for (int i = 0; i < nChunks; ++i) {
    if (chunks[i].length() <= 0) {
      // fails if regularExpression does not account for consecutive whitespace
      throw new AssertionError("Regular expression does not account for consecutive whitespace.");
    }
    result += chunks[i];
    if (i != nChunks - 1) {
      result += " ";
    }
  }
So an AssertionError was thrown. Since I had access to this code, I could simply have replaced the throw with a continue, but having found the reason for the empty string, I preferred to trim() the input string before splitting it instead:
string = string.trim();
This time, there should be no empty string in the string array. But alas, there was! And that's the second circumstance that made me want to explore this. The array resulting from split()ing a trim()ed version of the string was still
{"", "This", "is", "the", "beginning", "of", "a", "long", "story."}
So what was this rebellious string? What was sitting at the start of it? Printing out the first four characters as ints, I got 160, 10, 32, 84. The 84 is the 'T' starting the visible part of the string. We know 10 to be a line feed and 32 to be a space, but what is that 160 character again?
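Printing the character codes takes only a few lines. The sketch below uses a hypothetical helper, firstCodes, and a test string that I rebuilt from the four values above; it is not the original book text.

```java
import java.util.Arrays;

class CharCodeDemo {
    // Returns the numeric values of the first n chars of s (or fewer if s is shorter).
    static int[] firstCodes(String s, int n) {
        int count = Math.min(n, s.length());
        int[] codes = new int[count];
        for (int i = 0; i < count; i++) {
            codes[i] = s.charAt(i);
        }
        return codes;
    }

    public static void main(String[] args) {
        // Non-breaking space, line feed, space, then the visible text.
        String text = "\u00A0\n This is the beginning";
        System.out.println(Arrays.toString(firstCodes(text, 4))); // prints [160, 10, 32, 84]
    }
}
```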

It's the non-breaking space.

Why wasn't the non-breaking space trimmed away? Here is java.lang.String's trim() method:

  public String trim() {
    int len = count;
    int st = 0;

    while ((st < len) && (charAt(st) <= ' ')) {
      st++;
    }
    while ((st < len) && (charAt(len - 1) <= ' ')) {
      len--;
    }
    return ((st > 0) || (len < count)) ? substring(st, len) : this;
  }
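The condition charAt(st) <= ' ' isn't met for the non-breaking space in our string: it is \xA0, or 160, which is well above the 32 of the ordinary space. So the first loop stops immediately and nothing is trimmed from the start.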
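A quick way to check which characters trim() actually removes is to put one in front of a marker and see whether it survives. This is only a sketch; trimRemoves is a name I made up.

```java
class TrimDemo {
    // True if trim() strips the given character from the start of a string.
    static boolean trimRemoves(char c) {
        String padded = c + "x";
        return padded.trim().equals("x");
    }

    public static void main(String[] args) {
        System.out.println("space:              " + trimRemoves(' '));      // true
        System.out.println("line feed:          " + trimRemoves('\n'));     // true
        System.out.println("non-breaking space: " + trimRemoves('\u00A0')); // false
    }
}
```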

And why isn't it stripped off with the \s+ regex pattern? Because the \s pattern is defined as [ \t\n\x0B\f\r]. The non-breaking space isn't in there. (\x0B is the vertical tab.)
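The same kind of check works for the regex class. The sketch below, with a made-up helper name, reflects the documented default behaviour of java.util.regex on a desktop JDK. Other regex engines, possibly including the one shipped with a given Android version, may classify characters differently, so it is worth running the check on the platform you target.

```java
class RegexWhitespaceDemo {
    // True if the single character c is matched by the default \s class.
    static boolean matchesBackslashS(char c) {
        return String.valueOf(c).matches("\\s");
    }

    public static void main(String[] args) {
        System.out.println("space:              " + matchesBackslashS(' '));
        System.out.println("tab:                " + matchesBackslashS('\t'));
        System.out.println("line feed:          " + matchesBackslashS('\n'));
        System.out.println("non-breaking space: " + matchesBackslashS('\u00A0'));
    }
}
```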

But we don't need to dwell on the fact that the non-breaking space isn't counted as whitespace; Java offers another way to get at it, namely \p{javaWhitespace}. This pattern is said to be equivalent to java.lang.Character.isWhitespace(), so I thought that should be fine.

But this time, printing out the strings in the array gave me this:

{" ", "This", "is", "the", "beginning", "of", "a", "long", "story."}
So why is the leading empty string now something that looks like a space character? And why is it included in the array; shouldn't it be excluded as a delimiter? Not surprisingly this time, it's the non-breaking space again. The java.lang.Character documentation explains that the three non-breaking spaces '\u00A0', '\u2007' and '\u202F' are not counted as whitespace. So logically, the leading non-breaking space is not a delimiter, but a substring terminated by the following line break.
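Both halves of this can be checked directly: what Character.isWhitespace() says about the three characters, and what the split leaves in the first slot. The class and method names below are made up, and the test string is the one I reconstructed from the character codes, not the original page.

```java
class JavaWhitespaceDemo {
    static String[] splitOnJavaWhitespace(String text) {
        return text.split("\\p{javaWhitespace}+", 10);
    }

    public static void main(String[] args) {
        char[] nonBreaking = {'\u00A0', '\u2007', '\u202F'};
        for (char c : nonBreaking) {
            System.out.printf("U+%04X isWhitespace: %b%n", (int) c, Character.isWhitespace(c));
        }

        String[] chunks = splitOnJavaWhitespace("\u00A0\n This is the beginning");
        // The first chunk looks like a space when printed, but it is the non-breaking space.
        System.out.println("first chunk code: " + (int) chunks[0].charAt(0)); // prints 160
    }
}
```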

The fix for this is easy: we simply include the three non-breaking spaces in our splitter regexp, thus: [\\u00A0\\u2007\\u202F\\p{javaWhitespace}]+. But then we're back to the leading empty string in our result array. So the complete procedure to follow to fix this issue would be

  1. Replace any non-breaking spaces with spaces: string = string.replaceAll("[\\u00A0\\u2007\\u202F]+", " ");
  2. Trim the input string
  3. Split the string and re-assemble it with no risk of leading spaces or empty strings
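Put together, the three steps could look roughly like the sketch below. This is not the app's production code: firstWords and maxWords are names I made up, and a StringBuilder stands in for the original string concatenation. Empty chunks are skipped instead of throwing, in case some other Unicode whitespace survives trim() at the start.

```java
class BookmarkLabel {
    // Returns up to maxWords words from the start of text, joined by single spaces.
    static String firstWords(String text, int maxWords) {
        // Step 1: turn the three non-breaking spaces into ordinary spaces.
        String normalized = text.replaceAll("[\\u00A0\\u2007\\u202F]+", " ");
        // Step 2: trim, now that the leading characters are ones trim() recognises.
        normalized = normalized.trim();
        if (normalized.isEmpty() || maxWords <= 0) {
            return "";
        }
        // Step 3: split; the extra slot in the limit holds the unsplit remainder.
        String[] chunks = normalized.split("\\p{javaWhitespace}+", maxWords + 1);
        int n = Math.min(chunks.length, maxWords);
        StringBuilder result = new StringBuilder();
        for (int i = 0; i < n; i++) {
            if (chunks[i].isEmpty()) {
                continue; // defensive: skip instead of throwing
            }
            if (result.length() > 0) {
                result.append(' ');
            }
            result.append(chunks[i]);
        }
        return result.toString();
    }

    public static void main(String[] args) {
        String page = "\u00A0\n This is the beginning of a long story.";
        System.out.println(firstWords(page, 3)); // prints "This is the"
    }
}
```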
So why don't we do all of this in only one step? Why not simply replace all (breaking and non-breaking) whitespace character groups with one single whitespace and that would be it? We could, of course. But for picking out the first nine words of a very long string, the procedure above is OK.
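For comparison, the one-step variant might be sketched like this. The helper name is made up, and note that it rewrites the whole string, which is more work than needed when only the first few words of a long page are wanted.

```java
class CollapseWhitespace {
    // Replaces every run of breaking or non-breaking whitespace with one space.
    static String collapse(String text) {
        return text.replaceAll("[\\u00A0\\u2007\\u202F\\p{javaWhitespace}]+", " ").trim();
    }

    public static void main(String[] args) {
        String page = "\u00A0\n This  is\u2007the beginning \t";
        System.out.println("[" + collapse(page) + "]"); // prints "[This is the beginning]"
    }
}
```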

On a side note: if the string is terminated by whitespace characters (and not trim()ed), the two split() methods in Java's String class behave differently. The one that accepts only a regular expression returns, as expected, an array of words. But the one that takes both a regular expression and an integer limit applies the pattern at most limit - 1 times. If it reaches the end of the string within that budget, an empty string will appear at the end of the result array. The only way to avoid this is to set the integer to 0, which gives the same behaviour as the other method. But this is documented behaviour.
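The difference is easy to reproduce with a string that ends in a space. A small sketch, with a made-up helper name:

```java
import java.util.Arrays;

class TrailingEmptyDemo {
    // Number of chunks produced by splitting on whitespace with the given limit.
    static int chunkCount(String text, int limit) {
        return text.split("\\s+", limit).length;
    }

    public static void main(String[] args) {
        String text = "a b ";
        System.out.println(Arrays.toString(text.split("\\s+")));     // prints [a, b]
        System.out.println(Arrays.toString(text.split("\\s+", 10))); // prints [a, b, ]
        System.out.println(Arrays.toString(text.split("\\s+", 0)));  // prints [a, b]
    }
}
```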

January 2017