How to Use pandas get_dummies for One-Hot Encoding in DataFrames

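pandas get_dummies converts categorical variables into a binary indicator matrix (one-hot encoding), creating one new column per unique category to prepare data for machine learning algorithms. Each column flags membership with True/False by default since pandas 2.0, or with 0/1 when an integer dtype is requested.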

The pandas.get_dummies function in the pandas-dev/pandas library transforms categorical text data into a numeric format suitable for statistical modeling. Located in pandas/core/reshape/encoding.py, this utility generates dummy variables that represent categorical levels as binary flags. Most machine learning algorithms require numeric inputs, which makes this transformation a routine step in data preprocessing pipelines.

Core Functionality of pandas get_dummies

The function creates a binary indicator matrix where each unique category in the specified columns becomes its own column marking presence or absence (True/False by default since pandas 2.0; 1/0 with an integer dtype). Internally, pandas.core.reshape.encoding.get_dummies handles the conversion logic and supports both DataFrame and Series inputs. For a DataFrame, the implementation automatically selects object, string, or category dtype columns for encoding, or accepts an explicit columns parameter to target specific variables. Columns that are not encoded pass through unchanged.
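As a minimal sketch of the automatic selection described above, the following compares a DataFrame input with a Series input. The column names (color, grade, score) are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "color": ["red", "blue", "red"],           # text column: encoded
    "grade": pd.Categorical(["a", "b", "a"]),  # category dtype: encoded
    "score": [1.5, 2.0, 3.5],                  # numeric: passed through unchanged
})

# No columns argument: only the text and category columns are expanded.
auto_encoded = pd.get_dummies(df)
print(list(auto_encoded.columns))

# A Series input yields one column per distinct value, with no prefix.
series_encoded = pd.get_dummies(df["color"])
print(list(series_encoded.columns))
```

With a DataFrame, the generated columns are prefixed with the source column name; with a Series, the category values themselves become the column labels.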

Source Code Architecture

The primary implementation resides in pandas/core/reshape/encoding.py, where the main get_dummies function processes the data, prefix, prefix_sep, dummy_na, columns, sparse, drop_first, and dtype parameters. For string-specific operations, pandas/core/strings/accessor.py exposes Series.str.get_dummies, a separate method that splits delimiter-separated strings into dummy variables. This dual exposure lets users reach one-hot encoding both as a top-level pandas function and as a Series string method.

Essential Parameters for DataFrame Encoding

Several parameters control how pandas get_dummies constructs the output matrix:

  • columns specifies which DataFrame columns to encode. When omitted, the function selects all columns with object, string, or category dtype.

  • prefix and prefix_sep control column naming. The prefix argument accepts strings, lists, or dictionaries mapping original column names to custom prefixes, while prefix_sep defines the separator (default is underscore).

  • dummy_na adds a separate indicator column for missing values when set to True, ensuring NaN entries are explicitly represented in the encoded output.

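  • drop_first removes the first level of each categorical variable (the first in sorted order for plain strings, or the first declared category for categorical data) to avoid multicollinearity. This matters for linear regression models with an intercept, where a full set of dummy columns sums to the intercept column and makes the design matrix rank-deficient.

  • dtype specifies the data type of the resulting dummy columns. It defaults to bool since pandas 2.0; earlier releases defaulted to uint8.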
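The parameters above can be combined in a single call. This is an illustrative sketch rather than a recommended configuration; the column names and the prefixes c and p are arbitrary.

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Paris", "London", None],
    "plan": ["free", "pro", "free"],
})

encoded = pd.get_dummies(
    df,
    columns=["city", "plan"],            # encode only these columns
    prefix={"city": "c", "plan": "p"},   # per-column prefixes
    prefix_sep="=",                      # separator between prefix and level
    dummy_na=True,                       # add an indicator column for missing values
    dtype=int,                           # integer 0/1 instead of the default dtype
)
print(encoded)
```

Note that dummy_na=True adds a missing-value column for every encoded column, including plan, which has no missing entries here.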

Practical Usage Examples

Basic One-Hot Encoding

Convert categorical columns into binary indicators using the default configuration:

import pandas as pd

df = pd.DataFrame({
    "city": ["Paris", "London", "Paris", "Berlin"],
    "temperature": [22, 15, 23, 10]
})

df_encoded = pd.get_dummies(df, columns=["city"])
print(df_encoded)

This generates separate columns for city_Berlin, city_London, and city_Paris while preserving the numeric temperature column.

Handling Missing Values with dummy_na

Explicitly encode null entries as a separate category:

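import pandas as pd

# None is stored as a missing value in the "fruit" column
df = pd.DataFrame({
    "fruit": ["apple", None, "banana", "apple"]
})

# dummy_na=True adds a fruit_nan indicator column
df_dummy = pd.get_dummies(df, dummy_na=True)
print(df_dummy)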

Setting dummy_na=True creates a fruit_nan column that marks the location of missing values with True (or 1 when an integer dtype is requested).

Avoiding Multicollinearity with drop_first

Remove the first dummy column to prevent the dummy variable trap in regression analysis:

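import pandas as pd

df = pd.DataFrame({
    "size": ["S", "M", "L", "XL"]
})

# String levels are sorted (L, M, S, XL); drop_first removes the first one, size_L
df_dummies = pd.get_dummies(df, prefix="size", drop_first=True)
print(df_dummies)

This returns only size_M, size_S, and size_XL. Because plain string levels are sorted lexicographically, size_L comes first and becomes the implicit baseline reference category, not size_S.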
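Which level drop_first removes depends on the order of the levels. One way to choose the baseline yourself, sketched here with a made-up size ordering, is to convert the column to a categorical with an explicit category order before encoding.

```python
import pandas as pd

sizes = ["S", "M", "L", "XL"]
df = pd.DataFrame({"size": sizes})

# Plain strings: levels are sorted as L, M, S, XL, so L is the dropped baseline.
default_cols = list(pd.get_dummies(df, drop_first=True).columns)

# Explicit category order: S is declared first, so S becomes the dropped baseline.
df["size"] = pd.Categorical(df["size"], categories=sizes)
ordered_cols = list(pd.get_dummies(df, drop_first=True).columns)

print(default_cols)
print(ordered_cols)
```

Declaring the categories also keeps the output columns in a meaningful order instead of alphabetical order.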

String Splitting with Series.str.get_dummies

Process delimiter-separated values using the string accessor method defined in pandas/core/strings/accessor.py:

import pandas as pd

# Each entry holds zero or more comma-separated labels
s = pd.Series(["dog,cat", "cat", "dog", None])

dummy = s.str.get_dummies(sep=",")
print(dummy)

This splits each string on commas and creates binary columns for cat and dog, handling missing values as rows of zeros.

Summary

  • pandas get_dummies in pandas/core/reshape/encoding.py converts categorical variables into a binary indicator matrix essential for machine learning preprocessing.
  • The function supports custom prefixing, missing value handling via dummy_na, and multicollinearity prevention through drop_first.
  • String accessor integration via Series.str.get_dummies provides specialized handling for delimiter-separated text data.
  • Default output uses bool dtype since pandas 2.0 (uint8 in earlier releases); pass dtype to request integer or float flags instead.

Frequently Asked Questions

What is the difference between pandas get_dummies and sklearn OneHotEncoder?

pandas get_dummies is a stateless function that operates directly on a DataFrame or Series and returns a dense DataFrame by default, with columns derived from whatever categories appear in the input. sklearn's OneHotEncoder is a transformer object that is fitted once, remembers the categories it saw, and produces sparse matrices by default. Use pandas get_dummies for exploratory data analysis and quick preprocessing, and sklearn OneHotEncoder for production pipelines that need consistent encoding across train and test sets.
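If you stay with pandas, one common workaround for the train/test consistency problem is to reindex the encoded test frame onto the training columns. The sketch below assumes the training columns define the target layout; the frame names train and test are hypothetical.

```python
import pandas as pd

train = pd.DataFrame({"city": ["Paris", "London", "Berlin"]})
test = pd.DataFrame({"city": ["Paris", "Madrid"]})  # Madrid is unseen in train

train_encoded = pd.get_dummies(train, columns=["city"])

# Encode test on its own, then force it onto the training column layout:
# columns missing from test are added as False, and unseen ones
# (city_Madrid) are dropped.
test_encoded = pd.get_dummies(test, columns=["city"]).reindex(
    columns=train_encoded.columns, fill_value=False
)
print(test_encoded)
```

The unseen category ends up as a row with no flag set, so whether silently dropping it is acceptable depends on the model.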

How does pandas get_dummies handle missing values by default?

By default, pandas get_dummies does not create an indicator column for NaN entries, so a row with a missing value has no flag set in any of the dummy columns generated from that variable. Set dummy_na=True to explicitly generate a column representing missing values, which is useful when the absence of data carries meaningful information for your model.
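A small side-by-side comparison, using a made-up Series, shows the difference between the default behavior and dummy_na=True.

```python
import pandas as pd

s = pd.Series(["apple", None, "banana"])

# Default: no column for the missing value; its row has no flag set.
default = pd.get_dummies(s)

# dummy_na=True: an extra column flags the missing row.
with_na = pd.get_dummies(s, dummy_na=True)

print(default)
print(with_na)
```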

Can I specify which columns to encode in pandas get_dummies?

Yes, use the columns parameter to pass a list of column names you want to encode. If you omit this parameter, the function automatically selects all columns with object, string, or category dtype, which can be verified in the source code at pandas/core/reshape/encoding.py.

What is the purpose of the drop_first parameter in pandas get_dummies?

The drop_first parameter removes the first level of each categorical variable to avoid the dummy variable trap, where including all dummy columns creates perfect multicollinearity in linear models. This matters for ordinary least squares regression with an intercept, where the design matrix must be full rank for the coefficients to be uniquely determined.
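The rank argument can be checked numerically. The sketch below builds a design matrix with an intercept column using NumPy; the helper design_matrix is hypothetical and exists only for this illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"size": ["S", "M", "L", "S", "M", "L"]})

def design_matrix(drop_first: bool) -> np.ndarray:
    """Return an intercept column followed by the dummy columns for size."""
    dummies = pd.get_dummies(df["size"], drop_first=drop_first, dtype=float)
    intercept = np.ones((len(df), 1))
    return np.hstack([intercept, dummies.to_numpy()])

full = design_matrix(drop_first=False)    # intercept + all three dummy columns
reduced = design_matrix(drop_first=True)  # intercept + two dummy columns

# Compare the number of columns with the matrix rank in each case.
print(full.shape[1], np.linalg.matrix_rank(full))
print(reduced.shape[1], np.linalg.matrix_rank(reduced))
```

With all dummy columns present, the columns sum to the intercept, so the rank is one less than the column count. Dropping one level restores full column rank.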
