# How to Use pandas get_dummies for One-Hot Encoding in DataFrames

> Learn how to use pandas get_dummies for one-hot encoding in DataFrames. Convert categorical data into binary indicator matrices perfect for machine learning preparation.

- Repository: [pandas-dev/pandas](https://github.com/pandas-dev/pandas)
- Tags: how-to-guide
- Published: 2026-02-14

---

**pandas get_dummies converts categorical variables into a binary indicator matrix (one-hot encoding), creating one new indicator column for each unique category to prepare data for machine learning algorithms.**

The `pandas.get_dummies` function in the pandas-dev/pandas library transforms categorical text data into a numeric format suitable for statistical modeling. Located in [`pandas/core/reshape/encoding.py`](https://github.com/pandas-dev/pandas/blob/main/pandas/core/reshape/encoding.py), this utility generates dummy variables that represent categorical levels as binary flags. Most machine learning algorithms require numeric inputs, which makes this transformation a common step in data preprocessing pipelines.

## Core Functionality of pandas get_dummies

The function creates a **binary indicator matrix** where each unique category in the specified columns becomes its own column marking presence or absence (`True`/`False` with the default `bool` dtype in pandas 2.0 and later, or `1`/`0` when an integer `dtype` is requested). Internally, `pandas.core.reshape.encoding.get_dummies` handles the conversion logic, supporting both DataFrame and Series inputs. When `columns` is omitted, the implementation encodes every column with object, string, or category dtype; an explicit `columns` argument targets specific variables instead.
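The behavior described above can be sketched with a small example. The data and variable names here are made up for illustration; the point is that a Series is encoded directly, while a DataFrame without a `columns` argument encodes only its text column.

```python
import pandas as pd

# A Series is encoded directly; the output columns are the category labels.
colors = pd.Series(["red", "green", "red"])
series_dummies = pd.get_dummies(colors)

# With no `columns` argument, only the text column of a DataFrame is encoded;
# the numeric column passes through unchanged and appears first.
df = pd.DataFrame({"color": ["red", "green", "red"], "count": [1, 2, 3]})
frame_dummies = pd.get_dummies(df)

print(list(series_dummies.columns))  # ['green', 'red']
print(list(frame_dummies.columns))   # ['count', 'color_green', 'color_red']
```

Note that the category labels are sorted, so `green` comes before `red` regardless of the order in which the values appear.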

## Source Code Architecture

The primary implementation resides in **[`pandas/core/reshape/encoding.py`](https://github.com/pandas-dev/pandas/blob/main/pandas/core/reshape/encoding.py)**, where the main `get_dummies` function processes the `data`, `prefix`, `prefix_sep`, `dummy_na`, `columns`, `sparse`, `drop_first`, and `dtype` parameters. For string-specific operations, **[`pandas/core/strings/accessor.py`](https://github.com/pandas-dev/pandas/blob/main/pandas/core/strings/accessor.py)** exposes `Series.str.get_dummies`, a separate method that splits delimiter-separated strings into dummy variables. As a result, one-hot encoding is available both as a top-level function (`pd.get_dummies`) and as a Series string method.

## Essential Parameters for DataFrame Encoding

Several parameters control how `pandas get_dummies` constructs the output matrix:

- **`columns`** specifies which DataFrame columns to encode. When omitted, the function selects all columns with object, string, or category dtype.

- **`prefix`** and **`prefix_sep`** control column naming. The `prefix` argument accepts strings, lists, or dictionaries mapping original column names to custom prefixes, while `prefix_sep` defines the separator (default is underscore).

- **`dummy_na`** adds a separate indicator column for missing values when set to `True`, ensuring `NaN` entries are explicitly represented in the encoded output.

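- **`drop_first`** removes the first level of each categorical variable (the first in sorted order, or in category order for categorical columns) to avoid **multicollinearity**. This matters for linear regression models with an intercept, where the full set of dummy columns sums to one in every row and is therefore perfectly collinear with the intercept.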

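- **`dtype`** specifies the data type of the resulting dummy columns. The default is **`bool`** in pandas 2.0 and later; earlier releases defaulted to `uint8`. Pass `dtype=int` or `dtype="uint8"` to get numeric `0`/`1` columns.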
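As a minimal sketch of how the naming and dtype parameters above combine, the following example uses a dictionary `prefix`, a custom `prefix_sep`, and an integer `dtype`. The column names and prefixes are illustrative choices, not conventions from the pandas source.

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Paris", "London"],
    "plan": ["free", "pro"],
})

encoded = pd.get_dummies(
    df,
    prefix={"city": "c", "plan": "p"},  # per-column prefixes
    prefix_sep="=",                     # separator between prefix and level
    dtype=int,                          # integer 0/1 instead of the default dtype
)
print(list(encoded.columns))  # ['c=London', 'c=Paris', 'p=free', 'p=pro']
```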

## Practical Usage Examples

### Basic One-Hot Encoding

Convert categorical columns into binary indicators using the default configuration:

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Paris", "London", "Paris", "Berlin"],
    "temperature": [22, 15, 23, 10]
})

# Encode only the "city" column; "temperature" passes through unchanged.
df_encoded = pd.get_dummies(df, columns=["city"])
print(df_encoded)
```

This generates separate columns for `city_Berlin`, `city_London`, and `city_Paris` while preserving the numeric `temperature` column.

### Handling Missing Values with dummy_na

Explicitly encode null entries as a separate category:

```python
import pandas as pd

df = pd.DataFrame({
    "fruit": ["apple", None, "banana", "apple"]
})

# dummy_na=True adds a "fruit_nan" indicator column for the missing entry.
df_dummy = pd.get_dummies(df, dummy_na=True)
print(df_dummy)
```

Setting `dummy_na=True` creates a `fruit_nan` column that flags the rows where the value is missing.

### Avoiding Multicollinearity with drop_first

Remove the first dummy column to prevent the dummy variable trap in regression analysis:

```python
import pandas as pd

df = pd.DataFrame({
    "size": ["S", "M", "L", "XL"]
})

# drop_first=True omits the first level in sorted order.
df_dummies = pd.get_dummies(df, prefix="size", drop_first=True)
print(df_dummies)
```

Because string levels are sorted lexicographically (`L`, `M`, `S`, `XL`), this returns only `size_M`, `size_S`, and `size_XL`, using `size_L` as the implicit baseline reference category.
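When a specific baseline is wanted, one option is to declare the level order explicitly. The sketch below assumes `S` should be the reference level and relies on `get_dummies` following the category order of a categorical column rather than sorting the labels.

```python
import pandas as pd

# Declare the level order explicitly so that "S" is the first category.
sizes = pd.Categorical(["S", "M", "L", "XL"], categories=["S", "M", "L", "XL"])
df = pd.DataFrame({"size": sizes})

# The first declared category ("S") is dropped and becomes the baseline.
df_dummies = pd.get_dummies(df, drop_first=True)
print(list(df_dummies.columns))  # ['size_M', 'size_L', 'size_XL']
```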

### String Splitting with Series.str.get_dummies

Process delimiter-separated values using the string accessor method defined in [`pandas/core/strings/accessor.py`](https://github.com/pandas-dev/pandas/blob/main/pandas/core/strings/accessor.py):

```python
import pandas as pd

s = pd.Series(["dog,cat", "cat", "dog", None])

# The default separator is "|", so the comma must be passed explicitly.
dummy = s.str.get_dummies(sep=",")
print(dummy)
```

This splits each string on commas and creates binary columns for `cat` and `dog`, handling missing values as rows of zeros.

## Summary

- **`pandas get_dummies`** in [`pandas/core/reshape/encoding.py`](https://github.com/pandas-dev/pandas/blob/main/pandas/core/reshape/encoding.py) converts categorical variables into a binary indicator matrix essential for machine learning preprocessing.
- The function supports **custom prefixing**, **missing value handling** via `dummy_na`, and **multicollinearity prevention** through `drop_first`.
- **String accessor integration** via `Series.str.get_dummies` provides specialized handling for delimiter-separated text data.
- Default output uses **`bool`** dtype in pandas 2.0 and later (earlier releases used `uint8`); the `dtype` parameter overrides it.

## Frequently Asked Questions

### What is the difference between pandas get_dummies and sklearn OneHotEncoder?

pandas get_dummies operates directly on DataFrames and returns a new DataFrame (dense by default) in a single call, while sklearn's OneHotEncoder is a transformer object that must be fitted, remembers the categories it saw during fitting, and produces sparse matrices by default. Use pandas get_dummies for exploratory data analysis and quick preprocessing, and sklearn OneHotEncoder for production pipelines that require consistent encoding across train and test sets.
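The consistency issue can be seen in a pandas-only sketch: encoding train and test data separately yields different columns when their categories differ. One common workaround, shown here with made-up data, is to reindex the test columns against the training columns. This is an illustration of the problem, not a replacement for a fitted encoder.

```python
import pandas as pd

train = pd.DataFrame({"city": ["Paris", "London", "Berlin"]})
test = pd.DataFrame({"city": ["Paris", "Rome"]})  # "Rome" was not seen in training

train_encoded = pd.get_dummies(train, dtype=int)
test_encoded = pd.get_dummies(test, dtype=int)

# Align the test columns to the training columns: columns missing from the
# test set are added as 0, and levels unseen during training are dropped.
test_aligned = test_encoded.reindex(columns=train_encoded.columns, fill_value=0)
print(test_aligned)
```

After alignment, the unseen `Rome` row is all zeros, which is similar in spirit to what `OneHotEncoder(handle_unknown="ignore")` does for unknown categories.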

### How does pandas get_dummies handle missing values by default?

By default, `pandas get_dummies` ignores missing values and does not create indicator columns for `NaN` entries, so a row with a missing value is all zeros (`False`) across that variable's indicator columns. Set `dummy_na=True` to explicitly generate a column representing missing values, which is useful when the absence of data carries meaningful information for your model.
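A short comparison of the two settings on the same Series, using illustrative data and `dtype=int` so the output reads as `0`/`1`:

```python
import pandas as pd

s = pd.Series(["apple", None, "banana"])

default = pd.get_dummies(s, dtype=int)
with_na = pd.get_dummies(s, dummy_na=True, dtype=int)

# Row 1 holds the missing value.
print(default.iloc[1].tolist())  # [0, 0]
print(with_na.iloc[1].tolist())  # [0, 0, 1]
```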

### Can I specify which columns to encode in pandas get_dummies?

Yes, use the `columns` parameter to pass a list of column names you want to encode. If you omit this parameter, the function automatically selects all columns with object, string, or category dtype, which can be verified in the source code at [`pandas/core/reshape/encoding.py`](https://github.com/pandas-dev/pandas/blob/main/pandas/core/reshape/encoding.py).

### What is the purpose of the drop_first parameter in pandas get_dummies?

The `drop_first` parameter removes the first level of each categorical variable to avoid the dummy variable trap, where including all dummy columns creates perfect multicollinearity in linear models. For ordinary least squares regression with an intercept, dropping one level per variable is what keeps the design matrix full rank.
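The rank argument can be checked numerically. The sketch below builds a design matrix with an explicit intercept column and compares its column count to its rank; `design_matrix` is a hypothetical helper written for this illustration and is not part of pandas.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"size": ["S", "M", "L", "S", "M", "L"]})

def design_matrix(drop_first):
    """Hypothetical helper: intercept column followed by the dummy columns."""
    dummies = pd.get_dummies(df, drop_first=drop_first, dtype=float)
    intercept = np.ones((len(df), 1))
    return np.hstack([intercept, dummies.to_numpy()])

full = design_matrix(drop_first=False)    # intercept + 3 dummies
reduced = design_matrix(drop_first=True)  # intercept + 2 dummies

print(full.shape[1], np.linalg.matrix_rank(full))        # 4 3
print(reduced.shape[1], np.linalg.matrix_rank(reduced))  # 3 3
```

With all three dummies present, the matrix has four columns but rank three, because the dummies sum to the intercept column. Dropping one level removes that redundancy.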