How to Get a Random Sample of a Subset of a DataFrame Using pandas.sample

To obtain a random sample of a subset of a DataFrame using pandas.sample, first filter the DataFrame to your target subset using boolean indexing or .loc, then chain the .sample() method to draw random rows from that filtered result.

pandas exposes random sampling through the DataFrame.sample method (pandas has no top-level pandas.sample function; the name here is shorthand for that method). Sampling a subset is a two-step workflow: isolate the population you care about, then let .sample() draw rows from it.

How pandas.sample Works Under the Hood

The DataFrame.sample method is inherited from NDFrame.sample in pandas/core/generic.py. This base method validates the parameters, resolves random_state into a NumPy random number generator (an integer seed yields a np.random.RandomState; an existing RandomState or Generator can also be passed in), and performs the final indexing operation.

When you call .sample() on a filtered subset, pandas draws integer row positions with the generator's choice method and then selects those rows positionally with take. In recent pandas versions the size and weight handling lives in helper functions in pandas/core/sample.py, though the exact source layout can change between releases. The draw operates on the filtered subset only, so its cost depends on the subset's size rather than on the original DataFrame.
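As a rough mental model of the steps above (a sketch, not the library's actual code), unweighted sampling amounts to drawing row positions and then taking them. The helper name sample_rows_sketch is hypothetical, and its output is not guaranteed to match DataFrame.sample row for row:

```python
import numpy as np
import pandas as pd


def sample_rows_sketch(frame: pd.DataFrame, n: int, seed: int) -> pd.DataFrame:
    """Hypothetical helper: draw n distinct row positions, then take them."""
    rng = np.random.RandomState(seed)
    # Positions are 0..len(frame)-1, independent of the index labels.
    positions = rng.choice(len(frame), size=n, replace=False)
    return frame.take(positions)


df = pd.DataFrame({
    "department": ["Sales", "Engineering", "Sales", "HR"],
    "salary": [50000, 120000, 55000, 45000],
})
subset = df[df["department"] == "Sales"]
picked = sample_rows_sketch(subset, n=1, seed=0)
print(picked)
```

Because the draw is over positions within the subset, the rows filtered out earlier can never be selected.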

Sampling from a Filtered Subset

Any standard pandas indexing technique can isolate the target population before sampling. The two most common are boolean indexing and .loc.

Boolean Indexing Approach

The most common method uses boolean masks to filter rows before sampling:

import pandas as pd

df = pd.DataFrame({
    "department": ["Sales", "Engineering", "Sales", "HR", "Engineering"],
    "salary": [50000, 120000, 55000, 45000, 110000],
    "years": [2, 5, 3, 10, 4]
})

# Filter to the Sales department, then sample 2 random employees
sales_sample = df[df["department"] == "Sales"].sample(n=2, random_state=42)
print(sales_sample)

Explanation: The boolean mask df["department"] == "Sales" isolates the rows belonging to the Sales department, and .sample(n=2) draws two of those rows at random. This toy DataFrame contains exactly two Sales rows, so both are returned (in random order); on a larger subset the same call returns two of the matching rows.

Using loc for Conditional Sampling

The .loc indexer accepts the same boolean masks and is convenient when you also want to select columns in the same step:

# Sample 50% of employees with more than 3 years of experience
experienced = df.loc[df["years"] > 3].sample(frac=0.5, random_state=1)
print(experienced)
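Filters often combine several conditions. The following sketch uses illustrative data (a slightly larger DataFrame than the one above) to show a compound mask with .loc, selecting two columns while filtering:

```python
import pandas as pd

df = pd.DataFrame({
    "department": ["Sales", "Engineering", "Sales", "HR", "Engineering", "Sales"],
    "salary": [50000, 120000, 55000, 45000, 110000, 61000],
    "years": [2, 5, 3, 10, 4, 7],
})

# Each condition needs its own parentheses because & binds tighter than ==.
mask = (df["department"] == "Sales") & (df["years"] >= 3)

# Keep only the columns of interest while filtering, then sample one row.
picked = df.loc[mask, ["department", "years"]].sample(n=1, random_state=0)
print(picked)
```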

Controlling Sample Size and Proportion

DataFrame.sample offers two mutually exclusive parameters for determining sample size: n for exact counts and frac for proportional sampling. If neither is given, a single row is returned.

Exact Number of Rows with n

Specify n to return a precise number of rows from your subset:

# Get exactly 2 random rows from the Engineering subset (it has only 2 rows here)
eng_subset = df[df["department"] == "Engineering"]
sample_two = eng_subset.sample(n=2, random_state=99)

Note: If n exceeds the subset size and replace=False (the default), pandas raises a ValueError.
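When the subset size is not known in advance, two defensive patterns are possible. This is a minimal sketch; the variable names requested, capped, and raised are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "department": ["Sales", "Engineering", "Sales", "HR", "Engineering"],
    "salary": [50000, 120000, 55000, 45000, 110000],
})
eng_subset = df[df["department"] == "Engineering"]  # 2 rows

requested = 3

# Option 1: cap the request at the subset size.
capped = eng_subset.sample(n=min(requested, len(eng_subset)), random_state=99)

# Option 2: let pandas raise and handle the error explicitly.
try:
    eng_subset.sample(n=requested, random_state=99)
    raised = False
except ValueError:
    raised = True

print(len(capped), raised)
```

Capping silently returns fewer rows than requested, so it suits exploratory work; catching the error is safer when a short sample would invalidate later steps.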

Fractional Sampling with frac

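Use frac to sample a fraction of your subset. The resulting row count is frac times the subset size, rounded to a whole number, so a small fraction of a small subset can yield zero rows:

# Sample 50% of all rows where salary > 100000 (2 rows match, so 1 is returned)
high_earners = df[df["salary"] > 100000]
half_sample = high_earners.sample(frac=0.5, random_state=42)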
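To see how frac translates into a row count, the sketch below uses an illustrative 10-row DataFrame so the arithmetic is easy to follow:

```python
import pandas as pd

# Ten salaries from 90,000 to 180,000 in steps of 10,000.
df = pd.DataFrame({"salary": range(90000, 190000, 10000)})

high_earners = df[df["salary"] > 100000]  # 8 rows match
quarter = high_earners.sample(frac=0.25, random_state=42)

# 0.25 * 8 rows = 2 rows in the sample.
print(len(high_earners), len(quarter))
```

Checking len() of the filtered subset before choosing frac helps avoid an unexpectedly empty sample.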

Advanced Sampling Techniques

Beyond simple random selection, pandas.sample supports weighted probabilities and sampling with replacement for complex statistical scenarios.

Weighted Random Sampling

Pass a weights parameter to influence selection probability. A weights Series is aligned to the subset on its index, and pandas normalizes weights that do not sum to 1, so the explicit division below is optional:

# Sample 1 Sales employee, weighted by years of experience (seniority bias)
subset = df[df["department"] == "Sales"]
weights = subset["years"] / subset["years"].sum()
weighted = subset.sample(n=1, weights=weights, random_state=5)
print(weighted)
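For a DataFrame, weights can also be given as a column name. The sketch below uses illustrative data in which one row has a weight of zero, which means it can never be drawn:

```python
import pandas as pd

df = pd.DataFrame({
    "department": ["Sales", "Sales", "Sales", "Sales", "HR"],
    "years": [0, 2, 3, 5, 10],
})
subset = df[df["department"] == "Sales"]

# Draw one row per seed; the "years" column supplies the weights.
picks = [
    subset.sample(n=1, weights="years", random_state=seed).index[0]
    for seed in range(50)
]

# Index label 0 has years == 0, so it should never appear.
print(sorted(set(picks)))
```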

Sampling with Replacement

Set replace=True to allow the same row to appear multiple times in your sample:

# Bootstrap sample: 10 draws from the 2-row Sales subset
small_subset = df[df["department"] == "Sales"]
bootstrap = small_subset.sample(n=10, replace=True, random_state=2024)
print(bootstrap)
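A common use of sampling with replacement is a bootstrap estimate. The following is a minimal sketch with illustrative data, resampling a subset 200 times to get a rough interval for its mean salary:

```python
import pandas as pd

df = pd.DataFrame({
    "department": ["Sales", "Engineering", "Sales", "HR", "Sales", "Sales"],
    "salary": [50000, 120000, 55000, 45000, 61000, 58000],
})
sales = df[df["department"] == "Sales"]

# Each resample has the same size as the subset and may repeat rows.
boot_means = [
    sales.sample(n=len(sales), replace=True, random_state=seed)["salary"].mean()
    for seed in range(200)
]

# Percentile interval over the resampled means.
low, high = pd.Series(boot_means).quantile([0.025, 0.975])
print(low, high)
```

With only four rows the interval is crude; the point is the resampling pattern, not the statistical quality of the estimate.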

Summary

  • Filter first, sample second: To get a random sample of a subset of a DataFrame using pandas.sample, apply boolean indexing or .loc to isolate your target population before calling .sample().
  • Core implementation: The method is inherited from NDFrame.sample in pandas/core/generic.py, which uses NumPy's random number generation to draw row positions and then selects those rows positionally.
  • Size control: Use n for exact counts or frac for proportions, but never both simultaneously.
  • Advanced options: Leverage weights for probability-based selection and replace=True for bootstrap sampling or small subset scenarios.

Frequently Asked Questions

How do I sample from a subset without resetting the index?

The sample method preserves the original index by default. When you filter a DataFrame to create a subset, the resulting sample maintains those original index labels unless you explicitly call .reset_index() on the result.
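The sketch below illustrates both behaviors on the article's example data: the sample keeps the labels of the original rows, and reset_index(drop=True) renumbers them. Recent pandas versions also accept ignore_index=True in sample for the same effect, which is not used here to keep the example version-agnostic:

```python
import pandas as pd

df = pd.DataFrame({
    "department": ["Sales", "Engineering", "Sales", "HR", "Engineering"],
    "salary": [50000, 120000, 55000, 45000, 110000],
})
sales = df[df["department"] == "Sales"]  # index labels 0 and 2

# The sample keeps the labels from the original DataFrame.
kept = sales.sample(n=2, random_state=42)

# Renumber from 0 and discard the old labels.
renumbered = kept.reset_index(drop=True)

print(sorted(kept.index), list(renumbered.index))
```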

Can I use pandas.sample on a MultiIndex DataFrame subset?

Yes. When you filter a MultiIndex DataFrame to create a subset, sample operates on whole rows, not on the values of any single index level. Each row is an equally likely candidate (unless weights are given), and the returned DataFrame retains the complete MultiIndex labels of the sampled rows.
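A small sketch with illustrative data shows the MultiIndex surviving both the filter and the sample. The level names department and year are assumptions made for this example:

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [["Sales", "Engineering"], [2022, 2023, 2024]],
    names=["department", "year"],
)
df = pd.DataFrame({"headcount": [10, 12, 15, 20, 25, 31]}, index=index)

# Filter on one index level with a boolean mask; both levels are kept.
sales = df[df.index.get_level_values("department") == "Sales"]

# Sampling picks whole rows, each identified by a (department, year) pair.
picked = sales.sample(n=2, random_state=0)
print(picked)
```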

What happens if I request more samples than available in my subset?

If you specify n larger than the subset size and leave replace=False (the default), pandas raises a ValueError: Cannot take a larger sample than population when 'replace=False'. To sample more rows than exist in your subset, set replace=True to enable sampling with replacement.

Is pandas.sample deterministic if I use random_state?

Yes. Passing an integer to random_state ensures reproducible results across multiple runs. According to the implementation in pandas/core/generic.py, the random_state parameter seeds NumPy's random number generator, making the sequence of selected indices identical whenever you use the same seed value on the same subset.
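A quick way to confirm this on your own data is to draw twice with the same seed and compare. This is a minimal sketch using the article's example columns:

```python
import pandas as pd

df = pd.DataFrame({
    "department": ["Sales", "Engineering", "Sales", "HR", "Engineering"],
    "years": [2, 5, 3, 10, 4],
})
subset = df[df["years"] > 2]  # 4 rows

first = subset.sample(n=2, random_state=7)
second = subset.sample(n=2, random_state=7)

# Same seed, same subset: the two samples are identical.
print(first.equals(second))
```

Reproducibility holds only if the subset itself is identical, including row order, so changes to the filter or to upstream sorting can change which rows a given seed selects.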
