# Java Feature Spotlight: Text Blocks

### Key Takeaways

• Java SE 13 (Sept 2019) introduced text blocks as a preview feature, aimed at reducing the pain of declaring and using multi-line string literals in Java. The feature was subsequently refined in a second preview, with minor changes, and is scheduled to become a permanent feature of the Java language in Java SE 15 (Sept 2020).
• String literals in Java programs are not limited to short strings like "yes" and "no"; they often correspond to entire "programs" in structured languages such as HTML, SQL, XML, JSON, or even Java.
• Text blocks are string literals that can comprise multiple lines of text and use triple quotes (""") as their opening and closing delimiters.
• A text block can be thought of as a two-dimensional block of text embedded in a Java program.
• Being able to preserve the two-dimensional structure of that embedded program, without having to muck it up with escape characters and other linguistic intrusions, is less error-prone and results in more readable programs.

Preview Features

Given the global reach and high compatibility commitments of the Java platform, the cost of a design mistake in a language feature is very high. In the context of a language misfeature, the commitment to compatibility not only means it is very difficult to remove or significantly change the feature, but existing features also constrain what future features can do -- today's shiny new features are tomorrow's compatibility constraints.

The ultimate proving ground for language features is actual use; feedback from developers who have actually tried them out on real codebases is essential to ensure that the feature is working as intended. When Java had multi-year release cycles, there was plenty of time for experimentation and feedback. To ensure adequate time for experimentation and feedback under the newer rapid release cadence, new language features go through one or more rounds of preview, in which they are part of the platform but must be separately opted into and are not yet permanent, so that they can still be adjusted based on developer feedback without breaking mission-critical code.

In Java Futures at QCon New York, Java Language Architect Brian Goetz took us on a whirlwind tour of some of the recent and future features in the Java Language. In this article, he dives into Text Blocks.

Java SE 13 (Sept 2019) introduced text blocks as a preview feature, aimed at reducing the pain of declaring and using multi-line string literals in Java.

The feature was subsequently refined in a second preview, with minor changes, and is scheduled to become a permanent feature of the Java language in Java SE 15 (Sept 2020).

Text blocks are string literals that can comprise multiple lines of text. A text block looks like:

String address = """
                 25 Main Street
                 Anytown, USA, 12345
                 """;

In this simple example, the variable address will contain a two-line string, with line terminators after each line. Without text blocks, we would have to have written:

String address = "25 Main Street\n" +
                 "Anytown, USA, 12345\n";
or

String address = "25 Main Street\nAnytown, USA, 12345\n";

As every Java developer already knows, these alternatives are cumbersome to write. But, more importantly, they are also more error-prone (it's easy to forget a \n and not realize it), and more difficult to read (because the language syntax is intermixed with the contents of the string). Since a text block is usually free of escape characters and other linguistic interruptions, it lets the language get out of the way so it is easier for readers to see the contents of the string.

The most commonly escaped character in string literals is newline (\n), and text blocks eliminate the need for these by allowing multi-line strings to be expressed directly. After newline, the next most commonly escaped character is double quote (\"), which must be escaped because it conflicts with the string literal delimiter. Text blocks eliminate the need for these as well, because a lone double-quote character does not conflict with the triple-quote text block delimiter.
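As a small check of the equivalences described above, the following sketch compares text blocks with their traditional counterparts. It assumes Java 15 or later; the class name `AddressDemo` and the JSON snippet are illustrative and not part of the original article.

```java
public class AddressDemo {
    // The text block from the example above; the indentation before each line is incidental.
    static final String BLOCK = """
            25 Main Street
            Anytown, USA, 12345
            """;

    // The same value written with escapes and concatenation.
    static final String CONCAT = "25 Main Street\n" +
                                 "Anytown, USA, 12345\n";

    // Double quotes need no escaping inside a text block (hypothetical JSON content).
    static final String QUOTED_BLOCK = """
            {"city": "Anytown"}
            """;

    static final String QUOTED_LITERAL = "{\"city\": \"Anytown\"}\n";

    public static void main(String[] args) {
        System.out.println(BLOCK.equals(CONCAT));                // true
        System.out.println(QUOTED_BLOCK.equals(QUOTED_LITERAL)); // true
    }
}
```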

### Why the funny name?

One might think that this feature would have been called "multi-line string literals" (and, probably, that's what a lot of people will call it.) But, we chose a different name, text blocks, to highlight the fact that a text block is not merely an unrelated collection of lines, but instead better thought of as a two-dimensional block of text that is embedded in a Java program. To illustrate what we mean by "two-dimensional", let's take a slightly more structured example, where our text block is a snippet of XML. (The same considerations apply to strings that are snippets of "programs" in some other "language", such as SQL, HTML, JSON, or even Java, that are embedded as literals in a Java program.)

void m() {
    System.out.println("""
                     <person>
                         <firstName>Bob</firstName>
                         <lastName>Jones</lastName>
                     </person>
                     """);
}

What does the author expect this to print? While we can't read their minds, it seems unlikely that the intent was that the XML block should be indented by 21 spaces; it is far more likely that these 21 spaces are there solely to line up the text block with the surrounding code. On the other hand, it is almost certainly the author's intent that the second line of output should be indented by four more spaces than the first. Further, even if the author did want exactly 21 spaces of indentation, what happens when the program is modified and the indentation for the surrounding code changes? We wouldn't want the indentation of the output to change just because the source code was reformatted -- nor would we want for the text block to look "out of place" relative to the surrounding code because it doesn't line up in a sensible way.

From this example, we can see that the natural indentation of a multi-line block of text embedded in our program source derives both from the desired relative indentation between the lines of the block, and from the relative indentation between the block and the surrounding code. We want our string literals to line up with our code (because they would look out of place if they didn't), and we want the lines of our string literals to reflect the relative indentation between the lines, but these two sources of indentation -- which we can call incidental and essential -- are necessarily intermixed in the source representation of the program. (Traditional string literals do not have this problem because they cannot span lines, so there is no temptation to put extra leading spaces inside the literal just to make things line up.)

One way to address this problem is with a library method we can apply to multi-line string literals, such as Kotlin's trimIndent method, and Java does indeed provide such a method: String::stripIndent. But because this is such a common problem, Java goes further, automatically stripping the incidental indentation at compile time.
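To illustrate the library route, here is a minimal sketch that applies String::stripIndent to an ordinary string literal whose lines share four spaces of indentation. It assumes Java 15 or later; the class name `StripIndentDemo` and the XML content are illustrative only.

```java
public class StripIndentDemo {
    // An ordinary string literal; every line starts with at least four spaces.
    static final String INDENTED =
            "    <person>\n" +
            "        <name>Bob</name>\n" +
            "    </person>";

    // Removes the indentation common to all lines, keeping the relative indentation.
    static String stripped() {
        return INDENTED.stripIndent();
    }

    public static void main(String[] args) {
        System.out.println(stripped());
    }
}
```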

To disentangle the incidental and essential indentation, we can imagine drawing the smallest possible rectangle around the XML snippet that contains the entire snippet, and treating the contents of this rectangle as a two-dimensional block of text. This "magic rectangle" is the contents of the text block, and reflects the relative indentation between the lines of the block but ignores any indentation that is an artifact of how the program is indented.

This "magic rectangle" analogy may help motivate how text blocks work, but the details are a little more subtle because we may want finer control over which indentation is deemed incidental vs essential. The balance of incidental vs essential indentation can be adjusted using the position of the trailing delimiter relative to the contents.

### The details

A text block uses triple-quotes (""") as its opening and closing delimiter, and the remainder of the line with the opening delimiter must be blank. The content of the text block begins on the next line, and continues up until the closing delimiter. The compile-time processing of the block's contents has three phases:

• Line terminators are normalized. All line terminators are replaced with the LF (\u000A) character. This prevents the value of a text block from being silently affected by the newline conventions of whatever platform the code was last edited on. (Windows uses CR + LF to terminate lines; Unix systems use LF only, and there are even other schemes in use as well.)
• Incidental leading white space, and all trailing white space, is removed from each line. Incidental white space is determined as follows:
    • Compute a set of determining lines, which are all the non-blank lines of the result of the previous step, as well as the last line (the line that contains the closing delimiter) even if it is blank;
    • Compute the common whitespace prefix of all determining lines;
    • Remove the common whitespace prefix from each determining line.
• Escape sequences in the content are interpreted. Text blocks use the same set of escape sequences as do string and character literals. Performing these last means that escapes like \n, \t, \s, and \<eol> do not affect the whitespace processing. (Two new escape sequences have been added to the set as part of JEP 368; \s for an explicit space, and \<eol> as a continuation indicator.)
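The first two phases above can be sketched as ordinary Java operating on the raw text between the delimiters. This is a simplified illustration, not the compiler's implementation: it counts leading whitespace characters instead of comparing prefixes character by character (so it assumes indentation does not mix spaces and tabs), and it omits the escape-processing phase. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class IncidentalWhitespaceSketch {

    // Number of leading whitespace characters on a line.
    static int leadingWhitespace(String line) {
        int i = 0;
        while (i < line.length() && Character.isWhitespace(line.charAt(i))) {
            i++;
        }
        return i;
    }

    // rawContent is everything after the opening delimiter's line, up to the closing delimiter.
    static String strip(String rawContent) {
        // Phase 1: normalize all line terminators to LF.
        String normalized = rawContent.replace("\r\n", "\n").replace('\r', '\n');
        // A limit of -1 keeps the final line (the one holding the closing delimiter) even if empty.
        String[] lines = normalized.split("\n", -1);

        // Phase 2: the determining lines are the non-blank lines plus the last line.
        int common = Integer.MAX_VALUE;
        for (int i = 0; i < lines.length; i++) {
            boolean isLast = (i == lines.length - 1);
            if (isLast || !lines[i].isBlank()) {
                common = Math.min(common, leadingWhitespace(lines[i]));
            }
        }

        // Remove the common indentation and all trailing whitespace from each line.
        List<String> result = new ArrayList<>();
        for (String line : lines) {
            int cut = Math.min(common, leadingWhitespace(line));
            result.add(line.substring(cut).stripTrailing());
        }
        return String.join("\n", result);
    }

    public static void main(String[] args) {
        // Content indented by 8 spaces, closing delimiter indented by 4: four spaces survive.
        String raw = "        <a>\n            <b/>\n        </a>\n    ";
        System.out.print(strip(raw));
    }
}
```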

In our XML example, all the whitespace would be removed from the first and last line, and the middle two lines would be indented by four spaces, because there are five determining lines in this example -- the four lines containing XML code and the line containing the closing delimiter -- and the lines are all indented by at least as much whitespace as the first line of content. Often enough, this indentation is what is expected, but sometimes we might not want to strip all the leading indentation. If, for example, we wanted to have the whole block indented by four spaces, we could do so by moving the closing delimiter to the left by four spaces:

void m() {
    System.out.println("""
                     <person>
                         <firstName>Bob</firstName>
                         <lastName>Jones</lastName>
                     </person>
                 """);
}

Because the last line is also a determining line, the common whitespace prefix is now the amount of whitespace before the closing delimiter in the last line of the block, and this is the amount that is removed from each line, leaving the whole block indented by four. We could also manage the indentation programmatically, via the instance method String::indent, which takes a multi-line string (whether it comes from a text block or not), and indents every line by a fixed number of spaces:

void m() {
    System.out.println("""
                     <person>
                         <firstName>Bob</firstName>
                         <lastName>Jones</lastName>
                     </person>
                     """.indent(4));
}

In the extreme case, if no whitespace stripping is desired, the closing delimiter could be moved all the way back to the left margin:

void m() {
    System.out.println("""
                     <person>
                         <firstName>Bob</firstName>
                         <lastName>Jones</lastName>
                     </person>
""");
}

Alternately, we could achieve the same effect by moving the entire body of the text block back to the margin:

void m() {
    System.out.println("""
<person>
    <firstName>Bob</firstName>
    <lastName>Jones</lastName>
</person>
""");
}
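The effect of the closing delimiter's position can be compared side by side. The sketch below is illustrative (Java 15 or later; the class and field names are hypothetical, and the XML is shortened): the content is always indented by twelve spaces in the source, and only the closing delimiter moves.

```java
public class ClosingDelimiterDemo {
    // Closing delimiter aligned with the content: all twelve spaces are incidental.
    static final String ALIGNED = """
            <person>
                <firstName>Bob</firstName>
            </person>
            """;

    // Closing delimiter four columns to the left: four spaces of indentation survive.
    static final String SHIFTED = """
            <person>
                <firstName>Bob</firstName>
            </person>
        """;

    // The same result as SHIFTED, produced at run time with String::indent.
    static final String INDENTED = ALIGNED.indent(4);

    // Closing delimiter at the left margin: no leading whitespace is stripped.
    static final String UNSTRIPPED = """
            <person>
                <firstName>Bob</firstName>
            </person>
""";

    public static void main(String[] args) {
        System.out.println(ALIGNED.startsWith("<person>"));                       // true
        System.out.println(SHIFTED.equals(INDENTED));                             // true
        System.out.println(UNSTRIPPED.startsWith(" ".repeat(12) + "<person>"));   // true
    }
}
```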

These rules may sound somewhat complicated at first, but the rules were chosen to balance the various competing concerns of wanting to be able to indent the text block relative to the surrounding program while not generating variable amounts of incidental leading whitespace, and providing an easy way to adjust or opt out of whitespace stripping if the default algorithm is not what is wanted.

### Embedded expressions

Java's string literals do not support interpolation of expressions, as some other languages do; text blocks do not either. (To the extent that we may consider this feature at some point in the future, it would not be specific to text blocks, but applied equally to string literals.) Historically, parameterized string expressions were built with ordinary string concatenation (+); in Java 5, String::format was added to support "printf" style string formatting.

Because of the global analysis surrounding whitespace, getting the indentation right when combining text blocks with string concatenation can be tricky. But, a text block evaluates to an ordinary string, so we can still use String::format to parameterize the string expression. Additionally, we can use the new String::formatted method, which is an instance version of String::format:

String person = """
                <person>
                    <firstName>%s</firstName>
                    <lastName>%s</lastName>
                </person>
                """.formatted(first, last);

(Unfortunately, this method could not also be called format because we cannot overload static and instance methods with the same name and parameter lists.)
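A self-contained version of this pattern might look like the following sketch, which also checks that String::formatted and String::format produce the same result. It assumes Java 15 or later; the class and method names are hypothetical.

```java
public class FormattedDemo {
    // A text block used as a format string; %s placeholders are filled in by formatted().
    static final String TEMPLATE = """
            <person>
                <firstName>%s</firstName>
                <lastName>%s</lastName>
            </person>
            """;

    static String person(String first, String last) {
        return TEMPLATE.formatted(first, last);
    }

    public static void main(String[] args) {
        String viaInstance = person("Bob", "Jones");
        String viaStatic = String.format(TEMPLATE, "Bob", "Jones");
        System.out.print(viaInstance);
        System.out.println(viaInstance.equals(viaStatic)); // true
    }
}
```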

## Precedents and history

While string literals are, in some sense, a "trivial" feature, they are used frequently enough that small irritations can add up. So it should be no surprise that the lack of multi-line strings has been one of the most common complaints about Java in recent years, and that many other languages have multiple forms of string literals to support different use cases.

What may be surprising is the number of different ways that such a feature is expressed in popular languages. It's easy to say "we want multi-line strings", but when we survey other languages, we find a surprisingly diverse range of approaches in both syntax and goals. (And, of course, a comparably broad range of developer opinions about the "right" way to do it.) While no two languages are the same, for most features that are common to a broad range of languages (such as for loops) there are generally a few common approaches that languages pick from; it is unusual to find fifteen different interpretations of a feature in fifteen languages, but that's exactly what we found when it comes to multi-line and raw string literals.

The following table shows (some of) the options for string literals in various languages. In each, the ... is considered the content of the string literal, which may or may not be processed for escape sequences and embedded interpolations, xxx represents a user-chosen nonce that is guaranteed to not conflict with the contents of the string, and ## represents a variable number of # symbols (which may be zero.)

| Language | Syntax | Notes |
|---|---|---|
| Bash | `'...'` | [span] |
| Bash | `$'...'` | [esc] [span] |
| Bash | `"..."` | [esc] [interp] [span] |
| C | `"..."` | [esc] |
| C++ | `"..."` | [esc] |
| C++ | `R"xxx(...)xxx"` | [span] [delim] |
| C# | `"..."` | [esc] |
| C# | `$"..."` | [esc] [interp] |
| C# | `@"..."` | |
| Dart | `'...'` | [esc] [interp] |
| Dart | `"..."` | [esc] [interp] |
| Dart | `'''...'''` | [esc] [interp] [span] |
| Dart | `"""..."""` | [esc] [interp] [span] |
| Dart | `r'...'` | [prefix] |
| Go | `"..."` | [esc] |
| Go | `` `...` `` | [span] |
| Groovy | `'...'` | [esc] |
| Groovy | `"..."` | [esc] [interp] |
| Groovy | `'''...'''` | [esc] [span] |
| Groovy | `"""..."""` | [esc] [interp] [span] |
| Java | `"..."` | [esc] |
| JavaScript | `'...'` | [esc] [span] |
| JavaScript | `"..."` | [esc] [span] |
| JavaScript | `` `...` `` | [esc] [interp] [span] |
| Kotlin | `"..."` | [esc] [interp] |
| Kotlin | `"""..."""` | [interp] [span] |
| Perl | `'...'` | |
| Perl | `"..."` | [esc] [interp] |
| Perl | `<<'xxx'` | [here] |
| Perl | `<<"xxx"` | [esc] [interp] [here] |
| Perl | `q{...}` | [span] |
| Perl | `qq{...}` | [esc] [interp] [span] |
| Python | `'...'` | [esc] |
| Python | `"..."` | [esc] |
| Python | `'''...'''` | [esc] [span] |
| Python | `"""..."""` | [esc] [span] |
| Python | `r'...'` | [esc] [prefix] |
| Python | `f'...'` | [esc] [interp] [prefix] |
| Ruby | `'...'` | [span] |
| Ruby | `"..."` | [esc] [interp] [span] |
| Ruby | `%q{...}` | [span] [delim] |
| Ruby | `%Q{...}` | [esc] [interp] [span] [delim] |
| Ruby | `<<-xxx` | [here] [interp] |
| Ruby | `<<~xxx` | [here] [interp] [strip] |
| Rust | `"..."` | [esc] [span] |
| Rust | `r##"..."##` | [span] [delim] |
| Scala | `"..."` | [esc] |
| Scala | `"""..."""` | [span] |
| Scala | `s"..."` | [esc] [interp] |
| Scala | `f"..."` | [esc] [interp] |
| Scala | `raw"..."` | [interp] |
| Swift | `##"..."##` | [esc] [interp] [delim] |
| Swift | `##"""..."""##` | [esc] [interp] [delim] [span] |

Legend:

• esc. Some degree of escape sequence processing, where escapes are usually derived from the C style (e.g., \n);
• interp. Some support for interpolation of either variables or arbitrary expressions.
• span. Multi-line strings can be expressed by simply spanning multiple source lines.
• here. A "here-doc", where the following lines, up until a line that contains only the user-selected nonce, are treated as the body of the string literal.
• prefix. The prefix form is valid with all the other forms of string literals; these combinations have been omitted for brevity.
• delim. The delimiter is customizable to some degree, whether by inclusion of a nonce (C++), a varying number of # characters (Rust, Swift), or swapping curly braces for other matched brackets (Ruby).
• strip. Some degree of stripping of incidental indentation is supported.

While this table gives a flavor of the diversity in approaches to string literals, it only scratches the surface, as the subtleties in how languages interpret string literals are too varied to be captured in such a simple form. While most languages use an escape language inspired by C, they vary in exactly which escapes they support, whether and how they support Unicode escapes (e.g., \unnnn), and whether forms that don't support the full escape language still support some limited form of escaping for delimiter characters (such as using two quotes for an embedded quote instead of ending the string). The table also leaves out a number of other forms (such as the various prefixes in C++ to control character encoding) for brevity.

The most obvious axis of variation across languages is the choice of delimiters, and how different delimiters signal different forms of string literals (with or without escapes, single or multiple lines, with or without interpolation, choice of character encodings, etc.) But reading between the lines, we can see how these syntactic choices often reflect philosophical differences about language design -- how to balance the various goals such as simplicity, expressiveness, and user convenience.

Not surprisingly, the scripting languages (bash, Perl, Ruby, Python) have made "user choice" their first priority, with many forms of literals which may vary in non-orthogonal ways (and often, multiple ways to express the same thing). But in general, languages are all over the map in how they encourage users to think about string literals, how many forms they expose, and how orthogonal those forms are. We also see several philosophies about strings that span multiple lines. Some (like JavaScript and Go) treat line terminators as just another character, allowing all forms of string literals to span multiple lines; some (such as C++) treat them as a special case of "raw" strings; others (such as Kotlin) divide strings into "simple" and "complex", and put multi-line strings into the "complex" bucket; and others offer so many options that they defy even these simple classifications. Similarly, they vary in their interpretation of "raw string". True raw-ness requires some form of user-controllable delimiter (as C++, Swift, and Rust have), though others call their strings "raw" but still reserve some form of escaping for their closing (fixed) delimiter.

Despite the range of approaches and opinions, from the perspective of balancing principled design with expressiveness, there is a clear "winner" from this survey: Swift. It manages to support escaping, interpolating, and true raw-ness with a single, flexible mechanism (in both single- and multi-line variants). It should not be surprising that the newest language in the group has the cleanest story, as it had the benefit of hindsight and could learn from the successes and mistakes of others. (The key innovation here is that the escape delimiter varies in lockstep with the string delimiter, avoiding the need to choose between "cooked" and "raw" modes, while still sharing the escape language across all forms of string literal -- an approach which warrants the high praise of "obvious in hindsight".) While Java could not adopt the Swift approach wholesale because of existing language constraints, the Java approach took as much inspiration as it could from the good work that the Swift community did -- and left room to take more in the future.

Text blocks were not the first iteration of this feature; the first iteration was raw string literals (JEP 326). Like Rust's raw strings, it used a variable-sized delimiter (any number of backtick characters) and didn't interpret the contents at all. This proposal was withdrawn after it was fully designed and prototyped, as it was judged that, while it was sound enough, it felt too "nailed on the side" -- it had too little in common with traditional string literals, and therefore, if we wanted to extend the feature in the future, there was not a path to extending them together. (Because of the rapid-release cadence, this only delayed the feature by six months, and resulted in a substantially better feature.)

One major objection to the JEP 326 approach is that raw strings worked differently in every way from traditional string literals; different delimiter characters, varying vs fixed delimiters, single- vs multi-line, escaping vs non-escaping. Invariably, someone is going to want some different combination of choices, and there will be calls for more different forms, leading us down the road that Bash took. On top of that, it didn't do anything to address the "incidental indentation" problem, which was obviously going to be a source of brittleness in Java programs. Learning from this experience, text blocks share much more with traditional string literals (delimiter syntax, escape language), varying only in one crucial aspect -- whether the string is a one-dimensional sequence of characters, or a two-dimensional block of text.

## Style guidance

Jim Laskey and Stuart Marks, of the Java team at Oracle, have published a programmer's guide outlining the details, and style recommendations, for text blocks.

Use text blocks when they improve the clarity of code. Concatenation, escaped newlines, and escaped quote delimiters obfuscate the contents of a string literal; text blocks get "out of the way" so the contents are more obvious, but they are syntactically heavier than traditional string literals. Use them where the benefits pay for the extra costs; if a string fits on a single line and has no escaped newlines, it is probably best to stick with traditional string literals.

Avoid in-line text blocks within complex expressions. While text blocks are string-valued expressions, and therefore can be used anywhere a string is expected, it is not always best to nest text blocks within complex expressions; it is sometimes better to pull it out into a separate variable. In the following example, the text block breaks up the flow of the code when reading, forcing readers to mentally switch gears:

String poem = new String(Files.readAllBytes(Paths.get("jabberwocky.txt")));
String middleVerses = Pattern.compile("\\n\\n")
                             .splitAsStream(poem)
                             .filter(verse -> !"""
                                   ’Twas brillig, and the slithy toves
                                   Did gyre and gimble in the wabe;
                                   All mimsy were the borogoves,
                                   And the mome raths outgrabe.
                                   """.equals(verse))
                             .collect(Collectors.joining("\n\n"));

If we pull the text block into its own variable, it is easier for readers to follow the flow of the computation:

String firstLastVerse = """
    ’Twas brillig, and the slithy toves
    Did gyre and gimble in the wabe;
    All mimsy were the borogoves,
    And the mome raths outgrabe.
    """;
String middleVerses = Pattern.compile("\\n\\n")
                             .splitAsStream(poem)
                             .filter(verse -> !firstLastVerse.equals(verse))
                             .collect(Collectors.joining("\n\n"));

Avoid mixing spaces and tabs in the indentation of a text block. The algorithm for stripping incidental indentation computes a common whitespace prefix, and therefore will still work if lines are consistently indented with a combination of spaces and tabs. However, this is obviously brittle and error-prone, so it is best to avoid mixing them -- use one or the other.

Align text blocks with the neighboring Java code. Since incidental whitespace is automatically stripped, we should take advantage of this to make code easier to read. While we might be tempted to write:

void printPoem() {
    String poem = """
’Twas brillig, and the slithy toves
Did gyre and gimble in the wabe;
All mimsy were the borogoves,
And the mome raths outgrabe.
""";
    System.out.print(poem);
}

because we don't want any leading indentation in our strings, most of the time we should write:

void printPoem() {
    String poem = """
        ’Twas brillig, and the slithy toves
        Did gyre and gimble in the wabe;
        All mimsy were the borogoves,
        And the mome raths outgrabe.
        """;
    System.out.print(poem);
}

Don't feel obligated to line up the text with the opening delimiter. We can choose to line up the text block contents with the opening delimiter:

String poem = """
              ’Twas brillig, and the slithy toves
              Did gyre and gimble in the wabe;
              All mimsy were the borogoves,
              And the mome raths outgrabe.
              """;

This might seem attractive, but can be cumbersome if the lines are long or the delimiter starts far from the left margin, because now the text will be sticking all the way into the right margin. But this form of indentation isn't required; we can use any continuation indentation, as long as we do so consistently:

String poem = """
    ’Twas brillig, and the slithy toves
    Did gyre and gimble in the wabe;
    All mimsy were the borogoves,
    And the mome raths outgrabe.
    """;

When a text block contains an embedded triple quote, only escape the first quote. While it is allowable to escape every quote, it is not necessary, and interferes needlessly with readability; escaping only the first quote is needed:

String code = """
    String source = \"""
        String message = "Hello, World!";
        System.out.println(message);
        \""";
    """;

Consider splitting very long lines with \. Along with text blocks, we get two new escape sequences, \s (for a literal space) and \<newline> (a continuation line indicator). If we have literals with very long lines, we can use \<newline> to put a line break in the source code that is removed during the compile-time escape processing of the string.
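The escaping rules from the last two recommendations can be checked with a short sketch. It assumes Java 15 or later; the class name, field names, and sample text are illustrative.

```java
public class TextBlockEscapesDemo {
    // Escaping only the first quote of an embedded triple quote is enough.
    static final String FIRST_QUOTE_ESCAPED = """
            String s = \"""
                hi
                \""";
            """;

    // Escaping every quote yields the same string, but is harder to read.
    static final String ALL_QUOTES_ESCAPED = """
            String s = \"\"\"
                hi
                \"\"\";
            """;

    // A backslash at the end of a line joins it with the next line.
    static final String JOINED = """
            Lorem ipsum dolor sit amet, \
            consectetur adipiscing elit.
            """;

    // \s becomes a space after trailing whitespace has been stripped,
    // so it can be used to keep significant trailing spaces.
    static final String PADDED = """
            red  \s
            green\s
            blue \s
            """;

    public static void main(String[] args) {
        System.out.println(FIRST_QUOTE_ESCAPED.equals(ALL_QUOTES_ESCAPED)); // true
        System.out.println(JOINED.lines().count());                         // 1
        System.out.println(PADDED.equals("red   \ngreen \nblue  \n"));      // true
    }
}
```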

## Wrap up

String literals in Java programs are not limited to short strings like "yes" and "no"; they often correspond to entire "programs" in structured languages such as HTML, SQL, XML, JSON, or even Java. Being able to preserve the two-dimensional structure of that embedded program, without having to muck it up with escape characters and other linguistic intrusions, is less error-prone and results in more readable programs.

Brian Goetz is the Java Language Architect at Oracle, and was the specification lead for JSR-335 (Lambda Expressions for the Java Programming Language.) He is the author of the best-selling Java Concurrency in Practice, and has been fascinated by programming since Jimmy Carter was President.
