flowchart LR
T["Text in R"] --> CR["Create <br> 'hello' / character(n) / readLines()"]
T --> CO["Concatenate <br> paste / paste0 / cat / sprintf"]
T --> SU["Substring <br> substr / substring / startsWith / endsWith"]
T --> CL["Clean <br> toupper / tolower / trimws"]
style T fill:#e3f2fd,stroke:#1976D2
style CR fill:#fff3e0,stroke:#F57C00
style CO fill:#fff3e0,stroke:#F57C00
style SU fill:#f3e5f5,stroke:#8E24AA
style CL fill:#e8f5e9,stroke:#388E3C
16 Strings: Creation, Concatenation, and Substrings
This chapter is a thorough introduction to working with text in R. You will learn how character values are created and stored in character vectors, the two styles of quoting (" and '), the common escape sequences (\n, \t, \", \\), and the length metrics length() and nchar(), which do not mean the same thing. You will see the four everyday concatenation tools (paste(), paste0(), cat(), sprintf()), the substring-extraction functions substr() and substring(), and the prefix and suffix tests startsWith() and endsWith(). You will also learn how to change case with toupper(), tolower(), and tools::toTitleCase(), how to strip whitespace with trimws(), and how to split strings with strsplit(). The chapter sticks to base R; Chapter 17 covers the richer stringr toolkit and regular expressions. By the end of this chapter you will be able to build, inspect, extract, and clean text reliably.
16.1 Character Values and Character Vectors
R has no separate “character” and “string” types. A single letter and a full paragraph are both character vectors of length 1. Multiple strings sit side by side in a longer character vector.
length(x) reports how many strings the vector holds. nchar(x) reports how many characters are inside each string. They answer different questions, and confusing the two is a classic beginner bug.
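A minimal sketch of the difference, using a made-up vector named fruits (the variable names are illustrative, not from the chapter):

```r
# Hypothetical example data.
fruits <- c("apple", "kiwi", "banana")

length(fruits)   # 3: the vector holds three strings
nchar(fruits)    # 5 4 6: characters inside each string

# The classic bug: asking length() how long one word is.
single <- "banana"
length(single)   # 1: one string, regardless of its contents
nchar(single)    # 6: the number you probably wanted
```

If a loop or condition is meant to depend on how many letters a word has, nchar() is the function to reach for.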
16.2 Quotes and Escape Sequences
Double quotes (") and single quotes (') both delimit a string; pick whichever avoids escaping the quotes inside the text. The R community and tidyverse style guide prefer double quotes by default.
| Sequence | Meaning |
|---|---|
| \n | Newline. |
| \t | Tab. |
| \\ | A literal backslash. |
| \" / \' | A literal double / single quote inside the matching delimiter. |
| \u00e9 | A Unicode code point (here, é). |
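The following sketch shows the two quoting styles and a few of the escapes from the table in action. The strings and variable names are invented for illustration.

```r
# Pick the delimiter that avoids escaping the quote inside the text.
apostrophe <- "It's fine"
quoted     <- 'She said "hi"'
escaped    <- "She said \"hi\""
identical(quoted, escaped)   # TRUE: same string, different source spelling

two_lines <- "first\nsecond"
print(two_lines)        # print() shows the escaped form
cat(two_lines, "\n")    # cat() writes a real line break

# An escape sequence is stored as a single character.
nchar("\\")   # 1
nchar("\t")   # 1
```

The contrast between print() and cat() is worth remembering: print() shows how the string would be typed, cat() shows what it actually contains.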
16.3 Building Strings Programmatically
Besides sep, which goes between the pieces being joined, paste() takes a collapse argument that joins the elements of the result into a single string after the element-wise concatenation. This is the right way to turn a character vector into one line of text.
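A short sketch of how sep and collapse differ; the vector words is a hypothetical example.

```r
paste("Hello", "world")        # "Hello world" (default sep is a space)
paste("a", "b", sep = "-")     # "a-b"
paste0("file", 1:3, ".csv")    # "file1.csv" "file2.csv" "file3.csv"

# collapse folds the whole result into one string.
words <- c("red", "green", "blue")   # hypothetical data
paste(words, collapse = ", ")        # "red, green, blue"
paste0("x", 1:3, collapse = "+")     # "x1+x2+x3"
```

Without collapse the result has one element per input position; with collapse it always has length 1.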
sprintf() for Precise Formatting
sprintf() uses C-style format specifiers for controlled width, padding, and decimals. You saw it in Chapter 5; it is the right tool whenever the exact visual layout matters.
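A few common specifiers, as a sketch. The names and scores in the last call are made-up data.

```r
sprintf("%5.2f", 3.14159)    # " 3.14": width 5, two decimals
sprintf("%03d", 7L)          # "007": zero-padded integer
sprintf("%-10s|", "left")    # "left      |": left-aligned in 10 columns

# sprintf() is vectorised over its arguments; %% prints a literal percent sign.
player <- c("Asha", "Ben")   # hypothetical data
score  <- c(42L, 37L)
pct    <- c(84, 74)
report <- sprintf("%s scored %d (%.1f%%)", player, score, pct)
report   # "Asha scored 42 (84.0%)" "Ben scored 37 (74.0%)"
```

Note that %d expects an integer value, which is why the scores carry the L suffix.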
16.4 Length Metrics
nchar()
nchar() reports characters by default, not bytes. In UTF-8, a string with accented letters or emojis takes more bytes than characters. You can ask for a byte count with nchar(x, type = "bytes") when it matters, for example for file size estimates.
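A small sketch of characters versus bytes. It assumes the text is held as UTF-8, which the call to enc2utf8() makes explicit; in other encodings the byte counts can differ.

```r
# Force UTF-8 so the byte count does not depend on the session locale.
x <- enc2utf8(c("cafe", "caf\u00e9"))

nchar(x)                   # 4 4: both words have four characters
nchar(x, type = "bytes")   # 4 5: the accented letter takes two bytes in UTF-8
```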
16.5 Extracting Substrings
substr(): Positions as Start and Stop
substr(x, start, stop) returns the slice from position start to position stop, inclusive on both ends, 1-indexed. It is vectorised.
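A sketch of inclusive, 1-indexed slicing. The codes vector is an invented example of fixed-layout identifiers.

```r
x <- "statistics"
substr(x, 1, 4)    # "stat": positions 1 through 4, both included
substr(x, 5, 10)   # "istics"

# Vectorised: the same positions are applied to every element.
codes <- c("IN-DEL-001", "US-NYC-042")   # hypothetical data
substr(codes, 4, 6)    # "DEL" "NYC"
substr(codes, 8, 10)   # "001" "042"
```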
substring(): Allows Multiple Start Points
substring(x, first, last) is the older cousin. Its last argument defaults to 1000000L, so substring(x, 3) means "from position 3 to the end of the string". Usefully, it also accepts a vector of start (and stop) positions, producing several substrings from a single input string.
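A sketch of the default last argument and of multiple start positions, using a throwaway string:

```r
x <- "abcdef"

substring(x, 3)          # "cdef": from position 3 to the end
substring(x, 1:3)        # "abcdef" "bcdef" "cdef": three start points
substring(x, 1:3, 3:5)   # "abc" "bcd" "cde": paired starts and stops

# One common idiom: split a string into its individual characters.
pos <- seq_len(nchar(x))
chars <- substring(x, pos, pos)
chars                    # "a" "b" "c" "d" "e" "f"
```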
substr<-
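Assigning to substr(x, start, stop) replaces the slice in place. The total length of the string never changes. If the replacement is shorter than the slice, only that many characters are overwritten and the rest of the slice is left as it was; if it is longer, only its first characters are used.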
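The three cases (same length, shorter, longer) can be sketched on a throwaway string:

```r
x <- "abcdef"
substr(x, 2, 4) <- "XYZ"    # same length as the slice
x                           # "aXYZef"

y <- "abcdef"
substr(y, 2, 4) <- "Z"      # shorter: only one character is overwritten
y                           # "aZcdef"

z <- "abcdef"
substr(z, 2, 4) <- "WXYZ"   # longer: only the first three characters are used
z                           # "aWXYef"
```

In every case the result still has six characters, so this is a tool for overwriting, not for inserting or deleting.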
16.6 Prefix, Suffix, and Case
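startsWith(x, prefix) and endsWith(x, suffix) test whether each string begins or ends with a fixed piece of text. Both functions are vectorised and return a logical vector.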
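A sketch of prefix and suffix tests used as a filter. The file names are invented for illustration.

```r
files <- c("report_2023.csv", "report_2024.xlsx", "notes.txt")   # hypothetical data

startsWith(files, "report")   # TRUE TRUE FALSE
endsWith(files, ".csv")       # TRUE FALSE FALSE

# Logical vectors combine with & and index the original vector.
wanted <- files[startsWith(files, "report") & endsWith(files, ".csv")]
wanted                        # "report_2023.csv"
```

Both functions match literal text, so the dot in ".csv" is an ordinary character here rather than a pattern.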
For changing case and translating characters, four functions cover the everyday cases.
| Function | Purpose |
|---|---|
| toupper(x) | Convert to uppercase. |
| tolower(x) | Convert to lowercase. |
| tools::toTitleCase(x) | Title Case. |
| chartr(old, new, x) | Fixed-position character translation (1-to-1). |
Before any string comparison or grouping, decide on a canonical case (often lowercase) and apply it once at the point where the data enters your program. “New Delhi”, “new delhi”, and “NEW DELHI” are three different values until you normalise them.
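A sketch of the four functions and of normalising before comparison. The cities vector reuses the spellings mentioned above; everything else is illustrative.

```r
toupper("New Delhi")                # "NEW DELHI"
tolower("New Delhi")                # "new delhi"
tools::toTitleCase("new delhi")     # "New Delhi"
chartr("abc", "xyz", "aabbcc")      # "xxyyzz": a->x, b->y, c->z

# Three spellings of one city count as three values until normalised.
cities <- c("New Delhi", "new delhi", "NEW DELHI")
length(unique(cities))              # 3
canonical <- tolower(cities)
length(unique(canonical))           # 1
```

Lowercasing first and applying Title Case only for display keeps the stored values consistent.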
16.7 Removing Whitespace
trimws() Strips Leading and Trailing Spaces
trimws() takes a which argument with three modes: "both" (the default) strips both ends, "left" strips only leading whitespace, and "right" strips only trailing whitespace.
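The three modes, sketched on a padded string:

```r
x <- "  hello  "

trimws(x)                    # "hello"
trimws(x, which = "left")    # "hello  "
trimws(x, which = "right")   # "  hello"
```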
trimws() Only Touches the Ends
Double spaces in the middle of a string survive trimws(). If you need to collapse them, use gsub("\\s+", " ", x) (covered in Chapter 17).
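A sketch of the two-step clean-up, trimming the ends first and then collapsing interior runs of whitespace, on an invented string:

```r
x <- "  too   many    spaces  "

trimmed <- trimws(x)
trimmed                              # "too   many    spaces": interior runs survive

collapsed <- gsub("\\s+", " ", trimmed)
collapsed                            # "too many spaces"
```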
16.8 Splitting Strings with strsplit()
strsplit(x, split) breaks each string on the separator and returns a list with one element per input string. This reflects that different strings may split into different numbers of pieces.
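A sketch of why the result is a list, using two made-up strings that split into different numbers of pieces. Because split is treated as a regular expression by default, the last line passes fixed = TRUE to split on a literal dot.

```r
x <- c("a,b,c", "d,e")
parts <- strsplit(x, ",")

length(parts)     # 2: one list element per input string
lengths(parts)    # 3 2: the pieces per string differ
parts[[1]]        # "a" "b" "c"
parts[[2]][1]     # "d"

# A dot is a regex metacharacter, so match it literally.
dotted <- strsplit("a.b", ".", fixed = TRUE)[[1]]
dotted            # "a" "b"
```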
strsplit() Is for Unknown-Length Cases
If you know every string splits into the same number of fields (for example, well-formed delimited text with a fixed number of columns), do.call(rbind, strsplit(...)) gives you a character matrix with one row per string. For real CSV data, use read.csv() or readr::read_csv() rather than hand-splitting strings.
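A sketch of the matrix idiom, assuming every record has exactly three pipe-separated fields. The records themselves are invented.

```r
# Hypothetical records, three fields each.
rows <- c("Asha|29|Pune", "Ben|34|Leeds")

fields <- strsplit(rows, "|", fixed = TRUE)   # "|" is a regex metacharacter
m <- do.call(rbind, fields)                   # stack the vectors as matrix rows

dim(m)     # 2 3
m[, 1]     # "Asha" "Ben"
m[2, 3]    # "Leeds"
```

Every cell is still a character value, so numeric columns need an explicit as.integer() or as.numeric() afterwards.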
16.9 A Worked Example: Tidying a Messy Name List
Every core technique of the chapter appears: trimws() to drop edge whitespace, case normalisation with tolower() and toTitleCase(), strsplit() to break each name into words, substr() to pull first characters, and paste0() with collapse to assemble the initials.
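One way such a tidy-up could look, as a sketch rather than a definitive recipe. The input names are invented, and the use of vapply() to loop over the split names is an assumption of this sketch, not something the chapter prescribes.

```r
# Hypothetical input: inconsistent case and stray edge whitespace.
raw <- c("  ada LOVELACE ", "alan turing", "GRACE hopper  ")

trimmed <- trimws(raw)                           # drop edge whitespace
named   <- tools::toTitleCase(tolower(trimmed))  # lowercase first, then Title Case
words   <- strsplit(named, " ", fixed = TRUE)    # list: one word vector per name

# First character of each word, collapsed into one string per name.
initials <- vapply(
  words,
  function(w) paste0(substr(w, 1, 1), collapse = ""),
  character(1)
)

named      # "Ada Lovelace" "Alan Turing" "Grace Hopper"
initials   # "AL" "AT" "GH"
```

Lowercasing before toTitleCase() matters because the function is designed for title text and does not reliably lowercase words that arrive in capitals.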
16.10 Summary
| Concept | Description |
|---|---|
| Foundations | |
| Strings Are Character Vectors | A single string is a character vector of length 1 |
| Quotes and Escapes | Prefer double quotes; \n, \t, \\, \" are the common escapes |
| length() vs nchar() | length() counts strings; nchar() counts characters inside each string |
| Build and Format | |
| paste() and paste0() | Join pieces into strings; the collapse argument folds a vector into one string |
| sprintf() | Precise formatting for widths, decimals, and padding using C-style specifiers |
| Slice and Test | |
| substr() and substring() | Slice by start and stop positions; both are vectorised |
| startsWith() and endsWith() | Vectorised prefix and suffix tests returning logical vectors |
| Clean and Split | |
| Case Conversion | toupper, tolower, tools::toTitleCase normalise case |
| trimws() | Strips leading and trailing whitespace from each element |
| strsplit() | Splits each string on a separator; returns a list, one vector per input |
Text is the messiest kind of data you will handle, and the cost of skipping a cleanup step compounds rapidly downstream. Make trimws(), case normalisation, and consistent separators a routine first pass on any character column. The next chapter extends these tools to the richer world of regular expressions, pattern matching, and the stringr package, the toolkit you need as soon as simple position-based slicing stops being enough.