Author:
Raimo Niskanen <raimo(at)erlang(dot)org> , Kiko Fernandez-Reyes <kiko(at)erlang(dot)org>
Status:
Accepted/27-w To be implemented in OTP version 27.0 with a warning in OTP version 26.1
Type:
Standards Track
Created:
07-Jun-2023
Erlang-Version:
OTP-27

EEP 64: Triple-Quoted Strings

Abstract

This EEP proposes the introduction of Triple-Quoted Strings, and defines their semantics. The main benefit is to allow multi-line strings in an easy and useful way, i.e. with indentation, similar to other languages, e.g. Elixir.

Their first use case is in-module documentation attributes containing Markdown or similarly formatted text. Verbatim text is desirable there, since any documentation text format has its own notion of escape sequences, which will collide with Erlang’s escape sequences.

Rationale

Today (June 2023), writing multi-line strings is awkward and arguably ugly. They may contain escape sequences and have no concept of indentation:

foo () ->
    case bar() of
         ok ->
             X = "First line
Second line with \"\\*not emphasized\\* Markdown\"
Third line",
             {ok, X}
    end.

The content’s indentation cannot adhere to the surrounding code’s, and the backslash before each * has to be doubled to get a \* character sequence into the actual content.

In a documentation attribute as suggested in EEP 59, the indentation problem is not that pronounced because the documentation attribute itself is not much indented:

-doc "
First line
Second line with \"\\*not emphasized\\* Markdown\"
Third line".

But it sure looks better with indentation and not having to quote the backslashes:

-doc """
    First line
    Second line with "\*not emphasized\* Markdown"
    Third line
    """.

The main reason to consider this EEP is for documentation attributes, where not having to worry about escape sequences is this EEP’s most attractive property. Introducing a new string format, however, will also require defining how it shall behave in Erlang code.

Having a string format that is only allowed in attributes would simply be very strange and the one suggested in this EEP would also be useful in Erlang code.

Design Decisions

An attribute is an Erlang form in the source code that consists of a - token, an atom, one value term and a full stop (dot). The value term may be enclosed in parentheses (which is not very interesting for documentation attributes).

-doc "  Badly formatted
documentation paragraph
/-\\
\\-/".

A documentation attribute should have a string as its content term, and here we want to use our new and more convenient triple-quoted string instead of a normal string:

-doc """
      Better formatted
    documentation paragraph
    /-\
    \-/
    """.

Verbatim Strings

We want the strings to be verbatim since then they are sure to not clash with any documentation text format. In Markdown (and AsciiDoc) it is the use of the backslash character (\) that collides since it has a meaning in normal Erlang strings, so every backslash would need to be escaped into double backslash if the string accepts escape sequences.

This is in many cases not a big problem, though. Elixir also uses Markdown as documentation format and mostly ignores this problem. But in some modules, it is a problem, such as the regular expression module, and there Sigils for verbatim text are used to avoid quoting all backslashes.

We could also do like Elixir and choose both, but then we have to implement Sigils or something like that before getting triple-quoted strings that are useful enough.

So if we have to choose one format, it should be the verbatim one, since that is the more general format. This also holds in code, where you may simply want to define a string for some purpose that is impossible to foresee.

This does not close the door on Sigils. We can state that we have merely chosen that the default for triple-quoted strings is verbatim, and for normal strings it is escaped. Although it is a bit annoying that we then do not have the same defaults as Elixir, there are other more annoying subtle differences between strings in the languages. See Comparison with Elixir.

Because the strings are verbatim, and we want to use them to define any string in code, we cannot have the limitation that there always is a newline at the end. Therefore the last newline (the newline on the line that precedes the string ending line) should be stripped. It is easy to add a newline if an ending newline is needed.

Elixir does not strip the last newline and dodges this problem by escaping newlines. If you do not want the last newline you put a backslash last on the last line. This does not work in their verbatim strings, though. So you have to choose between either verbatim with ending newline or having to escape backslashes.

A slightly annoying consequence of verbatim strings with a fixed end marker is that there is no possibility to create a string containing such an end marker. Therefore this EEP, as a possible extension, suggests allowing not only triple-quoted strings, but 3-or-more-quoted strings. Then a start and end marker can be chosen that is not part of the string, and any string can be created. This is a not so pretty solution for a very rare corner case.
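Should 3-or-more-quoted strings be allowed, a tool that generates source code would need to pick a delimiter that cannot be mistaken for an ending inside the content. The following is a minimal sketch, not part of this EEP. It assumes that a line ends the string when it starts with optional white-space (space or tab) followed by at least as many double quotes as the start delimiter has. The module and function names are hypothetical.

```erlang
-module(tqs_delimiter_sketch).
-export([quotes_needed/1]).

%% Return the number of double quotes a delimiter needs so that no
%% line of Content can be taken for the end of the string.
%% Assumption: LF-only newlines in Content.
quotes_needed(Content) ->
    Lines = string:split(Content, "\n", all),
    Longest = lists:max([0 | [leading_quotes(L) || L <- Lines]]),
    %% Never fewer than three quotes, and one more than the longest
    %% run of quotes found at the start of any content line.
    max(3, Longest + 1).

%% Count the double quotes that follow the leading white-space of a line.
leading_quotes(Line) ->
    Trimmed = lists:dropwhile(fun(C) -> C =:= $\s orelse C =:= $\t end, Line),
    length(lists:takewhile(fun(C) -> C =:= $\" end, Trimmed)).
```

Content whose lines never start with a run of quotes needs only the plain three-quote delimiter; a line that starts with three quotes pushes the delimiter to four.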

Triple-Quoted String Scanner Token

A triple-quoted string must be a token that the scanner recognizes as a string, which makes it suitable for a documentation attribute value term. It starts and ends with three double quotes: """.

Double quotes, ", are chosen because normal Erlang strings use them and this is just a new variant. Since double quotes are used, a triple-quoted string shall, like a normal string, produce a list of characters (Unicode code-points).

It would be more convenient if a triple-quoted string produced a UTF-8 binary, but that would be a surprising feature for double quotes, and the documentation build process can work around this by converting the character list into the needed binary chunk.
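As an illustration of that workaround, a build step could convert the character list with unicode:characters_to_binary/1 from the standard library. The module and function names below are made up for this sketch and are not part of this EEP or of any documentation tool.

```erlang
-module(doc_chunk_sketch).
-export([to_utf8/1]).

%% Convert a string (a list of Unicode code points) to a UTF-8 binary.
%% unicode:characters_to_binary/1 returns an error or incomplete tuple
%% for invalid input, so anything but a binary is treated as a failure.
to_utf8(Chars) when is_list(Chars) ->
    case unicode:characters_to_binary(Chars) of
        Bin when is_binary(Bin) -> Bin;
        Other -> error({invalid_unicode, Other})
    end.
```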

In source code a triple-quoted string is valid within a binary, so producing a Unicode binary is reasonably straightforward:

X = <<"""
    Line 1
    Line 2
    """/utf8>>

The 2 + 7 characters (“<<” + “/utf8>>”) of overhead are not excessive since we are targeting multi-line strings.

As a future expansion it has been proposed to use Sigils (prefixes) for specialized strings such as regular expressions, interpolated variables (PR-7343), Unicode binary strings, etc. For example: X = ~u"Tschüß" for a UTF-8 encoded binary.

Triple-Quoted String Start

After the starting """ only white-space is allowed up to the end of the line.

As a possible future expansion we might allow text here that shouldn’t be part of the string content, but could be a hint for e.g. syntax highlighting and indentation handling in an editor / pretty printer.

-doc """ md
    Markdown content
    * Bullet list
    """.

The scanner does not need to have any special treatment of the characters on the line after the starting """ except that it should not search for an ending """.

A later step strips the characters up to and including the newline from the string content.

If any of these characters is not white-space, a syntax error is reported.

Triple-Quoted String End

All characters are collected as they are (verbatim) and become the string content.

A triple-quoted string ends with newline followed by optional white-space and then """. This completes the scanner token.

A later step uses the white-space on the ending line as the definition of the string’s indentation and strips that particular white-space sequence from every line in the string, and strips the newline preceding the ending line.

If any of the lines does not start with the defined indentation, either because the line is too short or because the prefix differs, a syntax error is reported. For convenience, and to adhere to editor conventions, an empty line may however be completely empty instead of indented; but if it starts with white-space characters that are not a newline, they must be the defined indentation.

Requiring that all lines (except empty lines) must have exactly the same indentation characters is a simple solution to not have to define how indentation white-space (tab vs. space) normalization should be done, and also seems like a reasonable requirement.
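The start line, end line and indentation rules above can be modelled as a small function. This is only a sketch under stated assumptions, not the scanner's implementation. The module name tqs_sketch, the function strip/1 and the error terms are invented here. The input is assumed to be every character between the opening and the closing delimiter, and newlines are assumed to be LF-only.

```erlang
-module(tqs_sketch).
-export([strip/1]).

%% Raw holds every character between the opening and the closing
%% delimiter, e.g. "\n    X\n    " for content indented by four spaces.
%% Assumption: LF-only newlines, and the scanner has already made sure
%% that only white-space precedes the closing delimiter on its line.
strip(Raw) ->
    case string:split(Raw, "\n", all) of
        [_NoNewline] ->
            {error, missing_newline};
        [StartRest | Lines0] ->
            %% The last piece is the white-space on the ending line.
            Indent = lists:last(Lines0),
            Lines = lists:droplast(Lines0),
            case is_blank(StartRest) of
                true -> strip_lines(Lines, Indent, []);
                false -> {error, text_after_start_delimiter}
            end
    end.

strip_lines([], _Indent, Acc) ->
    %% Joining with "\n" drops the newline preceding the ending line.
    {ok, lists:append(lists:join("\n", lists:reverse(Acc)))};
strip_lines([[] | Lines], Indent, Acc) ->
    %% A completely empty line is accepted without indentation.
    strip_lines(Lines, Indent, [[] | Acc]);
strip_lines([Line | Lines], Indent, Acc) ->
    case lists:prefix(Indent, Line) of
        true ->
            Stripped = lists:nthtail(length(Indent), Line),
            strip_lines(Lines, Indent, [Stripped | Acc]);
        false ->
            {error, {bad_indentation, Line}}
    end.

%% True if the characters are only spaces and tabs (or there are none).
is_blank(Chars) ->
    lists:all(fun(C) -> C =:= $\s orelse C =:= $\t end, Chars).
```

In this model, a content line that is shorter than the indentation but not empty is rejected in the same way as a line with a differing prefix, since neither has the indentation as its prefix.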

CR, LF and White-Space

The characters CR (Unicode code-point 13) and LF (code-point 10), and white-space as defined in the Erlang scanner today, are handled as usual by the scanner, except that if the line preceding the ending line ends in CR LF then the CR is also interpreted as part of the newline and is stripped along with the LF. This is a convenience for systems with CR LF newlines.

Other than that, CR, LF and white-space within the string are passed through as-is.

This means that the examples in the following text, where a normal string is used as matching reference, assume that the source code has LF-only newlines, such as:

"""

X
""" = "\nX"

If the source code has CR LF newlines that example instead becomes:

"""

X
""" = "\r\nX"

This example works in both cases but may be harder to read:

"""

X
""" = "
X"

Leading and trailing newline

The rules above strip one leading and one trailing newline. This is a simple convention that also gives control over the string’s content:

Example 1:

"""

  X

""" = "\n  X\n"

Example 2:

"""
X
""" = "X"

Example 3:

"""
 
""" = ""

Note that the following could be regarded as a syntax error, a too short multi-line string: a trailing newline shall be stripped both from the start line and from the last content line, so the content could be seen as less than empty. It is more convenient, however, to regard that single newline as doubly stripped, and thereby allow this as an empty string too:

"""
""" = ""

Indentation

The rules above facilitate indentation of the content to adhere to the surrounding code. The ending line determines the indentation.

Example 1:

"""
This string
is not indented
""" =
    "This string\nis not indented"

Example 2:

"""
    This string
    is indented
    """ =
    "This string\nis indented"

Example 3:

"""
      This indented string
    has an indented first line
    """ =
    "  This indented string\nhas an indented first line"

Example 4:

foo() ->
    X =
        """
          This indented string
        has an indented first line

        and an empty line that is not indented
        """,
    %% That content line 3 is empty instead of indented
    %% is only visible if you "touch" the text
    %% with the cursor or the mouse
    X =
        "  This indented string\n"
        "has an indented first line\n"
        "\n"
        "and an empty line that is not indented".

Example 5:

"""
This is a syntax error (incorrect indentation)
    """

Example 6:

""" This is a syntax error
(non-white-space on start line)
"""

Example 7:

"""
This is an incomplete string so the scanner will search forward
for the end, and the shell will block waiting for more lines,
since these quote characters are not a valid string ending: """

Backwards incompatibility

This is valid today:

X = """
    X
    """

It is equivalent to:

X = "" "
    X
    " ""

Which is equivalent to:

X = "
    X
    "

Which is equivalent to:

X = "\n    X\n"

But with the suggested triple-quoted strings the first code snippet would instead be equivalent to:

X = "X"

Also, this is valid today:

X = """ xxx
  X
    """

But according to this EEP it would be two syntax errors:

  1. The start line has got non-white-space after """.
  2. The first content line has incorrect indentation.

There are many other similar constructions that also would be syntax errors.

  • It is unlikely that anyone has deliberately used """ in source code to mean an empty string concatenated to another string.
  • Most combinations with """ that are allowed today will cause syntax errors. Only a few will have a subtly changed behaviour (string content).
  • Users can simply grep for """ in their source code. The same sequence created e.g. through macros would be harder to find; the worst problem would not be new syntax errors (hard to miss), but changed behaviour. And the changed behaviour would be a slightly different string content.
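The search suggested in the last point could also be done from Erlang. The following is a minimal sketch with invented module and function names. It reports the line numbers of a source text that contain three consecutive double quotes and, like grep, it does not distinguish code from comments.

```erlang
-module(tqs_grep_sketch).
-export([suspect_lines/1]).

%% Return the 1-based numbers of the lines in Source that contain
%% three consecutive double quote characters.
%% Assumption: Source is a flat string with LF-only newlines.
suspect_lines(Source) ->
    Lines = string:split(Source, "\n", all),
    Numbered = lists:zip(lists:seq(1, length(Lines)), Lines),
    [N || {N, Line} <- Numbered,
          string:find(Line, "\"\"\"") =/= nomatch].
```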

Allowing 3-or-more-quoted strings, as suggested in the next chapter, could cause larger backwards incompatibilities. Here is an example:

X = """"
    ++ foo() ++
    """"

That code is valid today and will prepend the empty string before and append the empty string after the return value of foo(). When introducing 3-or-more-quoted strings it will instead become:

X = "++ foo() ++"

The code example is really strange, though.

Therefore, it should be very unlikely that anyone encounters a real backwards incompatibility problem from the suggestions in this EEP.

But, to help users find accidental uses of 3-or-more-quoted strings, a compiler warning should be introduced early in the current Erlang/OTP release, and this EEP may be implemented in the next.

Such a warning may, unfortunately, be more complicated to implement than this EEP, because the scanner cannot emit warnings. Nor can the parser.

A warning can be implemented by letting the scanner emit a special dummy token that the parser strips and ignores. The preprocessor can then peek at the scanned token stream and emit the warning. This way there is no change in the parser output so the rest of the preprocessor and compiler as well as e.g. parse transforms are unaffected.
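The token-stream approach described above can be illustrated with plain tuples. This is a hedged sketch only. The token tag tqs_candidate, the module name and the warning text are invented for illustration, and the real scanner and preprocessor are not involved. For brevity one function plays both roles: it collects the warnings as the preprocessor would and drops the dummy tokens as the parser would.

```erlang
-module(tqs_warning_sketch).
-export([peek_warnings/1]).

%% Tokens is a list of scanner-style tuples. A hypothetical dummy token
%% {tqs_candidate, Location} marks a spot where today's code would be
%% read as a triple-quoted string once this EEP is implemented.
%% Returns the warnings together with the token list without dummies,
%% so whatever consumes the tokens afterwards sees no difference.
peek_warnings(Tokens) ->
    {Dummies, Rest} = lists:partition(fun is_dummy/1, Tokens),
    Warnings = [{Loc, "triple-quoted string candidate"}
                || {tqs_candidate, Loc} <- Dummies],
    {Warnings, Rest}.

is_dummy({tqs_candidate, _Loc}) -> true;
is_dummy(_Token) -> false.
```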

Quoting of """

With the rules above there is no possibility to have """ first on a line in a triple-quoted string.

This would be allowed:

-doc """
    A triple-quoted string starts with: """
    and ends with: """
    """.

As long as """ isn’t first on a line. Unfortunately that is a lie since the ending according to this EEP should be first on a line…

It would be possible to work around this in Erlang code:

X = """
    A triple-quoted string starts with: """
    and ends with:
    
    """ "\"\"\"".

That is ugly.

We can either ignore this quirk since it is only when placed first on a line that """ becomes a problem, or we can use the GitHub Flavored Markdown trick to allow 3 or more start characters and matching end characters so this would be valid:

X = """"
    A triple-quoted string starts with: """
    and ends with:
    """
    """"

Comparison with Elixir

Elixir has got triple-quoted strings and names them heredocs.

They are delimited with """ as in this suggestion and have got very nearly the same rules for start line, end line and indentation. Here are the known differences:

  • The last end of line is not stripped.
  • They produce UTF-8 encoded binaries, just like " quoted strings also do in Elixir.
  • They accept escape sequences.
  • Newlines can be escaped.
  • There is an escape sequence "\a".
  • They do not allow more than 3 " characters as string start, which is a suggestion in this EEP.
  • They accept Sigils, which allows the string to be verbatim but then the final newline cannot be avoided.

This EEP suggests triple-quoted strings that are like Elixir’s heredocs without interpolation and escaping:

~S"""
Heredoc without interpolation and escaping
"""

The Elixir string has a final newline that this EEP suggests should always be removed, since without escape sequences there is no possibility to remove it.

Copyright

This document is placed in the public domain or under the CC0-1.0-Universal license, whichever is more permissive.