I have a data frame column with numeric values:
df['percentage'].head() 46.5 44.2 100.0 42.12
I want to see the column as bin counts:
bins = [0, 1, 5, 10, 25, 50, 100]
How can I get the result as bins with their value counts?
[0, 1] bin amount [1, 5] etc [5, 10] etc ...
Answers:
Thank you for visiting the Q&A section on Magenaut. Please note that all the answers may not help you solve the issue immediately. So please treat them as advisements. If you found the post helpful (or not), leave a comment & I’ll get back to you as soon as possible.
Method 1
You can use pandas.cut:
bins = [0, 1, 5, 10, 25, 50, 100] df['binned'] = pd.cut(df['percentage'], bins) print (df) percentage binned 0 46.50 (25, 50] 1 44.20 (25, 50] 2 100.00 (50, 100] 3 42.12 (25, 50]
bins = [0, 1, 5, 10, 25, 50, 100] labels = [1,2,3,4,5,6] df['binned'] = pd.cut(df['percentage'], bins=bins, labels=labels) print (df) percentage binned 0 46.50 5 1 44.20 5 2 100.00 6 3 42.12 5
bins = [0, 1, 5, 10, 25, 50, 100] df['binned'] = np.searchsorted(bins, df['percentage'].values) print (df) percentage binned 0 46.50 5 1 44.20 5 2 100.00 6 3 42.12 5
…and then value_counts or groupby and aggregate size:
s = pd.cut(df['percentage'], bins=bins).value_counts() print (s) (25, 50] 3 (50, 100] 1 (10, 25] 0 (5, 10] 0 (1, 5] 0 (0, 1] 0 Name: percentage, dtype: int64
s = df.groupby(pd.cut(df['percentage'], bins=bins)).size() print (s) percentage (0, 1] 0 (1, 5] 0 (5, 10] 0 (10, 25] 0 (25, 50] 3 (50, 100] 1 dtype: int64
By default cut returns categorical.
Series methods like Series.value_counts() will use all categories, even if some categories are not present in the data, operations in categorical.
Method 2
Using the Numba module for speed up.
On big datasets (more than 500k), pd.cut can be quite slow for binning data.
I wrote my own function in Numba with just-in-time compilation, which is roughly six times faster:
from numba import njit
@njit
def cut(arr):
bins = np.empty(arr.shape[0])
for idx, x in enumerate(arr):
if (x >= 0) & (x < 1):
bins[idx] = 1
elif (x >= 1) & (x < 5):
bins[idx] = 2
elif (x >= 5) & (x < 10):
bins[idx] = 3
elif (x >= 10) & (x < 25):
bins[idx] = 4
elif (x >= 25) & (x < 50):
bins[idx] = 5
elif (x >= 50) & (x < 100):
bins[idx] = 6
else:
bins[idx] = 7
return bins
cut(df['percentage'].to_numpy()) # array([5., 5., 7., 5.])
Optional: you can also map it to bins as strings:
a = cut(df['percentage'].to_numpy())
conversion_dict = {1: 'bin1',
2: 'bin2',
3: 'bin3',
4: 'bin4',
5: 'bin5',
6: 'bin6',
7: 'bin7'}
bins = list(map(conversion_dict.get, a))
# ['bin5', 'bin5', 'bin7', 'bin5']
Speed comparison:
# Create a dataframe of 8 million rows for testing dfbig = pd.concat([df]*2000000, ignore_index=True) dfbig.shape # (8000000, 1)
%%timeit cut(dfbig['percentage'].to_numpy()) # 38 ms ± 616 µs per loop (mean ± standard deviation of 7 runs, 10 loops each)
%%timeit bins = [0, 1, 5, 10, 25, 50, 100] labels = [1,2,3,4,5,6] pd.cut(dfbig['percentage'], bins=bins, labels=labels) # 215 ms ± 9.76 ms per loop (mean ± standard deviation of 7 runs, 10 loops each)
Method 3
We could also use np.select:
bins = [0, 1, 5, 10, 25, 50, 100]
df['groups'] = (np.select([df['percentage'].between(i, j, inclusive='right')
for i,j in zip(bins, bins[1:])],
[1, 2, 3, 4, 5, 6]))
Output:
percentage groups 0 46.50 5 1 44.20 5 2 100.00 6 3 42.12 5
All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0