20 Factors: Creation, Levels, and Reordering

What this chapter covers

A factor is R’s data type for categorical variables: values drawn from a fixed, known set of categories such as "Low" / "Medium" / "High" or "Pass" / "Fail". Internally a factor stores integer codes plus a lookup table of labels called levels, which makes it compact and aware of category order. This chapter covers creating factors, inspecting and renaming levels, working with ordered factors, and changing the level order so that summaries and plots come out the way you want.

20.1 Why a separate type for categories?

A column of grades stored as text is just a set of strings with no inherent order; "A", "B", "C" sort sensibly only because alphabetical order happens to coincide with grade order. A factor lets you state explicitly: “these are the only valid categories, and this is their order.” That information then drives:

which categories summary() and table() count (including those with zero observations),
the order categories appear in plots and group-by output,
modelling code that needs categorical predictors with a known reference level.
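A minimal sketch of these points, using a made-up vector of grades (the data and the variable name grades are illustrative assumptions, not taken from a real dataset):

```r
# Hypothetical grades; "D" is a valid category that nobody received.
grades <- factor(c("B", "A", "C", "A", "B"),
                 levels = c("A", "B", "C", "D"))

grades          # prints the values, then a "Levels:" line
table(grades)   # counts every level, including "D" with a count of 0
```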

Notice the printout: the values appear without quotes and a Levels: line shows the category list.

20.2 Creating factors

factor() accepts a vector and, by default, infers the levels by sorting its unique values (alphabetically for character data).
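As a small illustration of the default, assuming a hypothetical vector of clothing sizes:

```r
# Hypothetical data: sizes entered in no particular order.
sizes <- c("Medium", "Small", "Large", "Small")

f <- factor(sizes)
levels(f)   # "Large" "Medium" "Small": sorted alphabetically, not by size
```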

That alphabetical default is rarely what you want. Specify levels = to take control.
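A sketch of taking control of the order, again with a made-up sizes vector:

```r
# Hypothetical data: the same sizes, now with an explicit level order.
sizes <- c("Medium", "Small", "Large", "Small")

f <- factor(sizes, levels = c("Small", "Medium", "Large"))
levels(f)   # "Small" "Medium" "Large"
table(f)    # counts are reported in level order
```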

Pass labels = to rename categories at the same time:
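For instance, assuming the raw data uses one-letter codes (a hypothetical encoding), levels = names the values as they appear in the data and labels = gives the names to display, matched by position:

```r
# Hypothetical one-letter codes as they might arrive in raw data.
codes <- c("S", "M", "L", "M")

f <- factor(codes,
            levels = c("S", "M", "L"),               # values in the data
            labels = c("Small", "Medium", "Large"))  # names to show instead
f
levels(f)   # "Small" "Medium" "Large"
```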

Values not in levels become NA

If a value in your data isn’t listed in levels, R silently turns it into NA. Use unique() first to be sure your level list covers everything.
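A short sketch of the behaviour, with a hypothetical sizes vector that contains an unexpected "XL":

```r
# Hypothetical data with one value ("XL") missing from the level list.
sizes   <- c("S", "M", "XL", "L")
allowed <- c("S", "M", "L")

unique(sizes)                    # check what is actually in the data
setdiff(unique(sizes), allowed)  # "XL": values the level list would miss

f <- factor(sizes, levels = allowed)
f                                # the third element prints as <NA>
sum(is.na(f))                    # 1
```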

The "XL" value becomes <NA>. That is a useful safety net when you want to flag unknown categories, but a trap if you did not intend it.

20.3 Inspecting factors

as.integer() exposes the factor’s secret: each value is really an integer pointing into the levels vector. That is why factors are so cheap and so fast for grouping operations.
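The main inspection functions can be sketched on a small, made-up factor:

```r
# Hypothetical factor with an explicit level order.
f <- factor(c("Low", "High", "Medium", "Low"),
            levels = c("Low", "Medium", "High"))

levels(f)      # "Low" "Medium" "High"
nlevels(f)     # 3
table(f)       # frequency of each level
as.integer(f)  # 1 3 2 1: positions in the levels vector
typeof(f)      # "integer": the underlying storage type
```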

20.4 Adding, renaming, and dropping levels

Renaming all levels at once:
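A minimal sketch with hypothetical one-letter codes. The new names are matched to the existing levels by position, so it is worth printing levels(f) first:

```r
# Hypothetical codes; the default levels are sorted: "f", "m".
f <- factor(c("m", "f", "f", "m"))
levels(f)                         # "f" "m"

# Replacement names are assigned positionally: "f" -> "Female", "m" -> "Male".
levels(f) <- c("Female", "Male")
f
```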

Adding a level that does not appear in the data yet is useful for plots that should always reserve space for an empty category:
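One way to do this, shown here as a sketch with a hypothetical "Critical" level, is to pass the factor back through factor() with an extended level vector:

```r
# Hypothetical severity data in which "Critical" has not occurred yet.
f <- factor(c("Low", "High", "Low"), levels = c("Low", "High"))

# Extend the level vector; the existing values are left untouched.
f2 <- factor(f, levels = c(levels(f), "Critical"))
levels(f2)   # "Low" "High" "Critical"
table(f2)    # "Critical" appears with a count of 0
```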

Dropping unused levels after a filter:
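A small sketch with made-up data: subsetting a factor keeps the full level list, so the filtered-out category lingers until it is dropped explicitly.

```r
# Hypothetical factor; we filter out category "C".
f <- factor(c("A", "B", "C", "A"))

kept <- f[f != "C"]
levels(kept)              # "A" "B" "C": the unused level is still there

kept <- droplevels(kept)
levels(kept)              # "A" "B"
```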

droplevels() is the cleanup function to know.

20.5 Ordered factors

Some categorical variables have a natural order, such as Low < Medium < High or grades D < C < B < A. Setting ordered = TRUE in factor() records that order so that comparisons with < and > work.
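A minimal sketch, assuming a hypothetical severity variable:

```r
# Hypothetical severity ratings with a meaningful order.
sev <- factor(c("Low", "High", "Medium", "Low"),
              levels  = c("Low", "Medium", "High"),
              ordered = TRUE)

sev            # the Levels line now reads: Low < Medium < High
sev > "Low"    # FALSE TRUE TRUE FALSE
max(sev)       # High
```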

Use ordered factors for measurement scales and severity grades, but stick with regular factors for categories that have no inherent order (region, department, colour). Ordered factors change how some modelling functions treat the variable (lm(), for example, uses polynomial contrasts for them by default), so don’t reach for them by default.

20.6 Reordering levels

The level order, not the alphabetical order of the labels, drives every summary and plot. There are three common ways to change it.

By hand: pass the factor back through factor(..., levels = ...) with the level vector re-stated explicitly:
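A sketch of the manual approach on a made-up factor. Note that this changes only the order of the levels; the values themselves stay the same, unlike levels(f) <- ..., which renames them.

```r
# Hypothetical factor created with the alphabetical default.
f <- factor(c("Medium", "Low", "High", "Low"))
levels(f)   # "High" "Low" "Medium"

# Re-state the levels in the order we actually want.
f <- factor(f, levels = c("Low", "Medium", "High"))
levels(f)   # "Low" "Medium" "High"
table(f)    # counts now follow the new order
```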

Move one level to the front with relevel(), which is handy for setting a regression reference category:
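For example, with a hypothetical treatment variable in which "placebo" should be the baseline:

```r
# Hypothetical treatment arms; alphabetical order puts "drugA" first.
treatment <- factor(c("drugA", "placebo", "drugB", "placebo"))
levels(treatment)   # "drugA" "drugB" "placebo"

# Make "placebo" the first (reference) level; the others keep their order.
treatment <- relevel(treatment, ref = "placebo")
levels(treatment)   # "placebo" "drugA" "drugB"
```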

By another variable’s value with forcats::fct_reorder(): the easiest way to make a bar chart sort by bar height. forcats is part of the tidyverse and runs in webr.
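A minimal sketch, assuming a made-up sales data frame (the column names region and revenue are illustrative). The call is wrapped in a requireNamespace() check only so that the snippet still runs where forcats is not installed:

```r
# Hypothetical data: one revenue figure per region.
sales <- data.frame(
  region  = factor(c("North", "South", "East", "West")),
  revenue = c(120, 340, 90, 210)
)
levels(sales$region)   # "East" "North" "South" "West" (alphabetical)

if (requireNamespace("forcats", quietly = TRUE)) {
  # Order the region levels by ascending revenue.
  sales$region <- forcats::fct_reorder(sales$region, sales$revenue)
  levels(sales$region)   # "East" "North" "West" "South"
}
```

When a level has several rows, fct_reorder() summarises the second variable per level (the median by default) before sorting; passing .desc = TRUE reverses the order.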

Two more forcats helpers worth memorising:

fct_infreq(x): order levels by how often each appears (most common first).
fct_relevel(x, "Foo", "Bar"): move the named levels to the front, in the order given.
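Both helpers can be sketched on a small, made-up factor (again guarded so the snippet runs even where forcats is not installed):

```r
# Hypothetical data: "Foo" appears 3 times, "Baz" twice, "Bar" once.
x <- factor(c("Bar", "Foo", "Baz", "Foo", "Foo", "Baz"))
levels(x)   # "Bar" "Baz" "Foo" (alphabetical)

if (requireNamespace("forcats", quietly = TRUE)) {
  by_freq <- forcats::fct_infreq(x)
  levels(by_freq)   # "Foo" "Baz" "Bar": most common first

  by_hand <- forcats::fct_relevel(x, "Foo", "Bar")
  levels(by_hand)   # "Foo" "Bar" "Baz": named levels first, rest unchanged
}
```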

20.7 Worked example: student performance

Consider a small dataset of student scores. We want a frequency table with the categories in pedagogical order (F < D < C < B < A) rather than alphabetical order, plus the top performer.
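One possible version of this analysis is sketched below. The student names, scores, and grade boundaries are invented for illustration; only cut(), table(), and max() are the tools the chapter relies on.

```r
# Hypothetical scores out of 100.
students <- data.frame(
  name  = c("Ana", "Ben", "Chloe", "Dev", "Eli", "Fay"),
  score = c(92, 58, 74, 85, 67, 88)
)

# Assumed grade boundaries: [0,60) F, [60,70) D, [70,80) C, [80,90) B, [90,100] A.
students$grade <- cut(
  students$score,
  breaks         = c(0, 60, 70, 80, 90, 100),
  labels         = c("F", "D", "C", "B", "A"),
  right          = FALSE,   # intervals are closed on the left
  include.lowest = TRUE,    # so that a score of exactly 100 counts as an A
  ordered_result = TRUE     # return an ordered factor: F < D < C < B < A
)

table(students$grade)                                # F D C B A, in that order
students[students$grade == max(students$grade), ]    # top performer(s)
students$name[students$grade >= "B"]                 # everyone with B or better
```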

Two things to notice. First, cut() is the workhorse for turning a numeric variable into a factor with custom bins. Second, because grade is an ordered factor, max() and comparisons against it (== max(grade), or >= a given grade) give meaningful answers; calling max() on an unordered factor is an error. This is exactly what ordered factors are for.

20.8 Summary

Summary of concepts introduced in this chapter
Concept	Description
Create
factor()	Create a factor from a vector; specify levels to control category order
ordered = TRUE	Record an intrinsic order so comparisons and min/max work
cut()	Bin a numeric vector into a factor with custom break points and labels
Inspect
levels()	Return the category vector behind a factor
nlevels()	Count how many categories a factor has
table()	Frequency table including levels with zero observations
Rename and Drop
levels(f) <- ...	Rename all levels in one assignment
droplevels()	Remove levels that no longer appear in the data after a filter
Reorder
factor(f, levels = ...)	Re-state the level order explicitly to change grouping and plot order
relevel()	Move one level to the front, useful as a regression reference category
fct_reorder()	forcats helper that orders levels by another variable's value
fct_infreq()	forcats helper that orders levels by frequency, most common first