How to Drop Columns by Name in pandas When the Names Contain a Substring
Use df.loc[:, ~df.columns.str.contains('pattern')] for vectorized boolean indexing, or df.drop([c for c in df.columns if 'pattern' in c], axis=1) to pass an explicit label list to DataFrame.drop. Both approaches inspect each column label once, so the matching cost scales with the number of columns rather than the number of rows.
When working with wide datasets in pandas (developed in the pandas-dev/pandas repository), you often need to drop columns by name based on partial string matches rather than exact label matches. The library stores column labels in a specialized Index object that supports vectorized string operations, allowing you to remove columns containing specific substrings without Python-level iteration over rows.
Understanding the Internal Column Drop Architecture
Column removal in pandas is handled through index manipulation. When you call DataFrame.drop, the method defined in pandas/core/frame.py forwards the request to NDFrame.drop in pandas/core/generic.py. This generic implementation resolves the labels, normalizes the axis argument, and invokes the underlying Index.drop method found in pandas/core/indexes/base.py.
According to the pandas source code, Index.drop returns a new Index instance without the specified labels, and raises KeyError for labels that are not present unless errors='ignore' is passed. Because this step works on the column labels only, its cost depends on how many columns the DataFrame has, not on how many rows it holds.
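The behavior of Index.drop described above can be observed by calling it on df.columns directly. This is a minimal sketch on a made-up frame; the column names are illustrative assumptions, not taken from the pandas source.

```python
import pandas as pd

# Hypothetical frame used only for illustration
df = pd.DataFrame({"a_price": [1.0], "b_qty": [2], "misc": [3]})

# Index.drop returns a new Index; the original columns are left untouched
remaining = df.columns.drop(["a_price"])
print(list(remaining))
print(list(df.columns))

# A missing label raises KeyError unless errors="ignore" is passed
try:
    df.columns.drop(["not_a_column"])
except KeyError as exc:
    print("KeyError:", exc)

unchanged = df.columns.drop(["not_a_column"], errors="ignore")
print(list(unchanged))
```

DataFrame.drop accepts the same errors argument, which is convenient when a label list may contain names that are not in the frame.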
Method 1: Vectorized Boolean Masking with str.contains
The most direct way to drop columns by name using substring matching is to apply vectorized string operations to the column Index itself. The str.contains method builds a boolean array in a single pass, and the tilde operator (~) inverts the mask to select only the columns that do not match the pattern. Note that str.contains interprets the pattern as a regular expression by default.
import pandas as pd
# Sample DataFrame with substring patterns in column names
df = pd.DataFrame({
'apple_qty': [10, 20],
'banana_qty': [5, 15],
'apple_price': [1.2, 1.3],
'banana_price': [0.8, 0.9],
'misc': [0, 1]
})
# Drop every column containing the substring 'price'
df_filtered = df.loc[:, ~df.columns.str.contains('price')]
print(df_filtered)
This approach avoids creating intermediate Python lists and performs the selection in a single vectorized pass through the column Index.
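Two str.contains options are worth knowing when the substring is not a plain lowercase word. The sketch below assumes hypothetical column names that contain parentheses and mixed case; regex=False matches the pattern as literal text, and case=False ignores letter case.

```python
import pandas as pd

# Hypothetical column names chosen to show literal and case-insensitive matching
df = pd.DataFrame({
    "price(usd)": [1.2],
    "Price_EUR": [1.1],
    "qty": [10],
})

# regex=False treats "(usd)" as literal text instead of a regex group
literal_mask = df.columns.str.contains("(usd)", regex=False)
without_usd = df.loc[:, ~literal_mask]
print(list(without_usd.columns))

# case=False makes the match case-insensitive
ci_mask = df.columns.str.contains("price", case=False)
without_price = df.loc[:, ~ci_mask]
print(list(without_price.columns))
```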
Method 2: Pre-computing Labels for DataFrame.drop
Alternatively, you can compute the list of columns to remove using a list comprehension, then pass that list to DataFrame.drop. While the list comprehension executes in Python, it runs only once, and the subsequent drop call leverages the optimized Index.drop routine implemented in pandas/core/indexes/base.py.
# Compute columns to drop using substring matching
cols_to_drop = [col for col in df.columns if 'price' in col]
# Use drop with axis=1 for column removal
df_dropped = df.drop(cols_to_drop, axis=1)
print(df_dropped)
In practice, this method performs comparably to boolean masking: the list comprehension visits each column name once, which is a small one-time cost for typical column counts, and the drop call then operates on the column labels rather than on individual rows.
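Because the label list is an ordinary Python list, it can be logged or reused before the drop. The following sketch is one way to do that; the isinstance guard is an addition for frames whose column labels are not all strings, and the "weight" substring is a hypothetical pattern that matches nothing.

```python
import pandas as pd

df = pd.DataFrame({
    "apple_qty": [10, 20],
    "apple_price": [1.2, 1.3],
    "misc": [0, 1],
})

# Guard against non-string labels, for which the `in` test would raise TypeError
cols_to_drop = [c for c in df.columns if isinstance(c, str) and "price" in c]
print(f"Dropping {len(cols_to_drop)} column(s): {cols_to_drop}")

# columns= is equivalent to passing the labels with axis=1
df_dropped = df.drop(columns=cols_to_drop)

# An empty list is a no-op, so no special case is needed when nothing matches
no_match = [c for c in df.columns if isinstance(c, str) and "weight" in c]
df_same = df.drop(columns=no_match)
print(df_same.equals(df))
```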
Method 3: Regex Filtering with DataFrame.filter
For advanced pattern matching, you can use DataFrame.filter with a regular expression that excludes matching columns. This method returns a new DataFrame containing only columns whose names match the regex, effectively dropping everything else.
# Keep only columns that do NOT end with 'price' using negative lookahead
df_regex = df.filter(regex='^(?!.*price$)')
print(df_regex)
All three approaches produce identical output:
   apple_qty  banana_qty  misc
0         10           5     0
1         20          15     1
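The regex in Method 3 is anchored with $, so it excludes only names that end in 'price'. On the sample data that gives the same result as a substring match, but the two differ when 'price' appears elsewhere in a name. A small sketch with a hypothetical price_note column shows the difference:

```python
import pandas as pd

# Hypothetical frame: "price" appears at the end of one name and the start of another
df = pd.DataFrame({
    "apple_price": [1.2],
    "price_note": ["n/a"],
    "misc": [0],
})

# Anchored with $: only names ending in "price" are excluded
ends_with = df.filter(regex=r"^(?!.*price$)")
print(list(ends_with.columns))

# Without $: any name containing "price" is excluded
contains = df.filter(regex=r"^(?!.*price)")
print(list(contains.columns))
```

Dropping the $ anchor makes the filter behave like the substring checks in Methods 1 and 2.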
Performance Considerations
Both the boolean masking and pre-computed drop methods avoid row-wise iteration and keep operations column-wise, which is the most performant way to reshape a DataFrame. The choice between them depends on your specific workflow:
- Boolean masking (df.loc[:, ~mask]) is slightly more concise and avoids the overhead of method dispatch within drop().
- Pre-computed drop (df.drop(list, axis=1)) is explicit and useful when you need the list of dropped columns for logging or further processing.
According to the source code in pandas/core/generic.py, the drop method includes additional validation logic for the labels parameter, which may make the loc approach marginally faster for simple filtering tasks. In both cases the label handling grows with the number of columns, so the difference is rarely significant.
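Relative timings depend on the pandas version, the hardware, and the shape of the data, so it is worth measuring on your own frame. The sketch below is one way to compare the two approaches with timeit; the 2,000-column frame and its naming scheme are assumptions made for the benchmark, and no particular result is implied.

```python
import timeit

import pandas as pd

# Hypothetical wide frame: 2,000 columns, every second one named "*_price"
names = [f"c{i}_{'price' if i % 2 else 'qty'}" for i in range(2000)]
df = pd.DataFrame([list(range(2000))], columns=names)


def with_mask():
    return df.loc[:, ~df.columns.str.contains("price")]


def with_drop():
    return df.drop(columns=[c for c in df.columns if "price" in c])


# Both routes should select exactly the same columns
assert with_mask().equals(with_drop())

for fn in (with_mask, with_drop):
    seconds = timeit.timeit(fn, number=20)
    print(f"{fn.__name__}: {seconds / 20 * 1e3:.3f} ms per call")
```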
Summary
- Boolean masking with df.loc[:, ~df.columns.str.contains()] provides the most concise syntax for dropping columns by substring.
- Pre-computing label lists for df.drop() relies on the Index.drop implementation in pandas/core/indexes/base.py and works well when you need to reference the exclusion list later.
- Regex filtering via df.filter() offers advanced pattern matching capabilities for complex naming conventions.
- All approaches maintain column-wise operations, avoiding the performance penalty of row-wise iteration.
Frequently Asked Questions
Which is faster: drop() or loc[] with a boolean mask?
Both methods examine every column label once, so their cost grows with the number of columns rather than the number of rows. df.loc[] with a boolean mask may be marginally faster because it bypasses the label validation and error handling logic found in the drop implementation in pandas/core/generic.py. For most datasets the difference is negligible: loc is convenient for simple filtering, while drop is better when you need explicit control over the exclusion list.
Can I use regular expressions directly with the drop() method?
No, DataFrame.drop() expects exact label names (a single label or a list-like of labels), as implemented in pandas/core/frame.py. To use regex patterns, first identify the matching columns with df.columns.str.contains(), which treats the pattern as a regex by default (regex=True), and pass them to drop. Alternatively, use df.filter() with an appropriate regex pattern to select the columns you want to keep, effectively dropping the others.
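A short sketch of the first option, combining a regex match on the column Index with drop. The pattern and the price_list_id column are hypothetical and chosen to show that only names ending in '_price' are removed.

```python
import pandas as pd

df = pd.DataFrame({
    "apple_qty": [10],
    "apple_price": [1.2],
    "price_list_id": [7],
})

# Regex: names that end in "_price"
pattern = r"_price$"
matching = df.columns[df.columns.str.contains(pattern)]

# drop accepts the resulting Index of exact labels
df_dropped = df.drop(columns=matching)
print(list(df_dropped.columns))
```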
How do I keep only the columns that contain a specific substring?
Remove the tilde (~) operator from the boolean mask to invert the selection: df.loc[:, df.columns.str.contains('substring')]. Alternatively, use df.filter(like='substring'), which is a convenience wrapper for partial string matching that returns only matching columns.
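Both options can be compared side by side. In this sketch the frame is illustrative; note that filter(like=...) matches the text literally, whereas str.contains treats it as a regex unless regex=False is passed.

```python
import pandas as pd

df = pd.DataFrame({
    "apple_qty": [10],
    "apple_price": [1.2],
    "misc": [0],
})

# Keep only the columns whose names contain "apple"
kept_mask = df.loc[:, df.columns.str.contains("apple")]
kept_like = df.filter(like="apple")

print(list(kept_like.columns))
print(kept_mask.equals(kept_like))
```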
Does this approach work with MultiIndex column names?
Yes, but you must adjust the logic to target specific levels of the MultiIndex. Use df.columns.get_level_values(level).str.contains() to check a specific level, then apply the resulting boolean mask to the DataFrame columns with df.loc[:, ~mask]. If you go through DataFrame.drop instead, its level parameter specifies the level from which the labels are removed.
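A minimal sketch of the level-based mask, assuming a two-level column MultiIndex with a fruit name on level 0 and a measure on level 1 (both made up for this example):

```python
import pandas as pd

# Hypothetical two-level columns: (fruit, measure)
columns = pd.MultiIndex.from_tuples([
    ("apple", "qty"), ("apple", "price"),
    ("banana", "qty"), ("banana", "price"),
])
df = pd.DataFrame([[10, 1.2, 5, 0.8]], columns=columns)

# Test only level 1 for the substring, then invert the mask
mask = df.columns.get_level_values(1).str.contains("price")
df_filtered = df.loc[:, ~mask]
print(df_filtered.columns.tolist())
```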