Here is the snippet:
test = pd.DataFrame({'days': [0,31,45]})
test['range'] = pd.cut(test.days, [0,30,60])
Output:
   days     range
0     0       NaN
1    31  (30, 60]
2    45  (30, 60]
I am surprised that 0 is not in (0, 30]. What should I do to categorize 0 as (0, 30]?
Answers:
Method 1
test['range'] = pd.cut(test.days, [0,30,60], include_lowest=True)
print (test)
   days           range
0     0  (-0.001, 30.0]
1    31    (30.0, 60.0]
2    45    (30.0, 60.0]
See difference:
test = pd.DataFrame({'days': [0,20,30,31,45,60]})
test['range1'] = pd.cut(test.days, [0,30,60], include_lowest=True)
#30 value is in [30, 60) group
test['range2'] = pd.cut(test.days, [0,30,60], right=False)
#30 value is in (0, 30] group
test['range3'] = pd.cut(test.days, [0,30,60])
print (test)
days range1 range2 range3
0 0 (-0.001, 30.0] [0, 30) NaN
1 20 (-0.001, 30.0] [0, 30) (0, 30]
2 30 (-0.001, 30.0] [30, 60) (0, 30]
3 31 (30.0, 60.0] [30, 60) (30, 60]
4 45 (30.0, 60.0] [30, 60) (30, 60]
5 60 (30.0, 60.0] NaN (30, 60]
Or use numpy.searchsorted, but the array of bin edges has to be sorted in ascending order:
arr = np.array([0,30,60])
test['range1'] = arr.searchsorted(test.days)
test['range2'] = arr.searchsorted(test.days, side='right') - 1
print (test)
   days  range1  range2
0     0       0       0
1    20       1       0
2    30       1       1
3    31       2       1
4    45       2       1
5    60       2       2
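As a minimal sketch of the searchsorted approach with side='right': np.searchsorted requires the array being searched (the bin edges here) to be sorted ascending, and the values being looked up can come in any order. The unsorted sample data and the names edges and bin are assumptions for illustration, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Bin edges; np.searchsorted requires this array to be sorted ascending.
edges = np.array([0, 30, 60])

# Deliberately unsorted sample values (hypothetical data).
test = pd.DataFrame({'days': [45, 0, 60, 20, 31, 30]})

# side='right' minus 1 gives the index of the last edge <= value,
# so 0 lands in bin 0 and 30 lands in bin 1.
test['bin'] = edges.searchsorted(test['days'].to_numpy(), side='right') - 1
print(test)
```

Note that a value equal to the last edge (60) gets index 2, which is one past the last real bin, so it may need separate handling.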
Method 2
According to the pd.cut documentation, you can pass the parameter right=False so that each bin includes its left edge instead of its right edge:
test = pd.DataFrame({'days': [0,31,45]})
test['range'] = pd.cut(test.days, [0,30,60], right=False)
test
days range
0 0 [0, 30)
1 31 [30, 60)
2 45 [30, 60)
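One caveat with right=False, sketched here on a hypothetical three-value Series: the bins become closed on the left and open on the right, so 0 is categorized but a value equal to the last edge (60 here) is not. The names days and left_closed are only for illustration.

```python
import pandas as pd

# Hypothetical values sitting exactly on the bin edges.
days = pd.Series([0, 30, 60])

# Left-closed bins: [0, 30) and [30, 60).
left_closed = pd.cut(days, [0, 30, 60], right=False)
print(left_closed)

# 60 is outside [30, 60), so it comes back as NaN.
print(left_closed.isna().tolist())
```

If the maximum value has to be kept as well, include_lowest=True with the default right=True may be the better fit.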
Method 3
You can also pass labels to pd.cut().
The following example contains the grades of students in the range 0-10.
We're adding a new column called 'grade_cat' to categorize the grades.
bins holds the interval edges: (0, 4] is one interval, (4, 6] is the next, and so on.
The corresponding labels are "poor", "normal", etc.
bins = [0, 4, 6, 10]
labels = ["poor","normal","excellent"]
student['grade_cat'] = pd.cut(student['grade'], bins=bins, labels=labels)
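The answer does not show the student DataFrame, so here is a runnable sketch with made-up grades. The data is an assumption, and include_lowest=True is added so that a grade of exactly 0 is not left as NaN, which ties back to the original question.

```python
import pandas as pd

# Hypothetical data: the original answer does not define `student`.
student = pd.DataFrame({'grade': [0, 3, 4, 5, 6, 9, 10]})

bins = [0, 4, 6, 10]
labels = ["poor", "normal", "excellent"]

# include_lowest=True keeps a grade of exactly 0 in the first bin.
student['grade_cat'] = pd.cut(
    student['grade'], bins=bins, labels=labels, include_lowest=True
)
print(student)
```

The number of labels has to be one less than the number of bin edges, since each label names one interval.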
Method 4
A sample of how pd.cut works:
s=pd.Series([168,180,174,190,170,185,179,181,175,169,182,177,180,171])
pd.cut(s,3)
#To add labels to bins
pd.cut(s,3,labels=["Small","Medium","Large"])
This can be used directly on a range
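To see how many values end up in each of the three labelled bins, the result can be counted with value_counts. This is only a sketch built on the same Series; the variable names sizes and counts are illustrative.

```python
import pandas as pd

s = pd.Series([168, 180, 174, 190, 170, 185, 179, 181, 175, 169, 182, 177, 180, 171])

# Three equal-width bins over the range of s, named by the labels.
sizes = pd.cut(s, 3, labels=["Small", "Medium", "Large"])

# Count how many values landed in each labelled bin.
counts = sizes.value_counts()
print(counts)
```

Because the bins are equal in width rather than equal in count, the groups can be uneven; pd.qcut is the quantile-based alternative when roughly equal group sizes are wanted.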
Method 5
@jezrael has explained almost all the use cases of pd.cut().
One use case that I would like to add is the following:
pd.cut(np.array([1,2,3,4,5,6]),3)
The number of bins is decided by the second parameter, so we get the following output:
[(0.995, 2.667], (0.995, 2.667], (2.667, 4.333], (2.667, 4.333], (4.333, 6.0], (4.333, 6.0]]
Categories (3, interval[float64]): [(0.995, 2.667] < (2.667, 4.333] < (4.333, 6.0]]
Similarly, if we pass 2 as the number of bins (the second parameter), the output is:
[(0.995, 3.5], (0.995, 3.5], (0.995, 3.5], (3.5, 6.0], (3.5, 6.0], (3.5, 6.0]]
Categories (2, interval[float64]): [(0.995, 3.5] < (3.5, 6.0]]
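The 0.995 in these outputs comes from how the edges are computed when an integer is passed. As a rough sketch, retbins=True returns the edges so they can be compared with plain equal-width edges from np.linspace. The names values, cats, edges and expected are illustrative only.

```python
import numpy as np
import pandas as pd

values = np.array([1, 2, 3, 4, 5, 6])

# retbins=True also returns the bin edges that pd.cut computed.
cats, edges = pd.cut(values, 3, retbins=True)
print(edges)

# Plain equal-width edges over [min, max] for comparison.
expected = np.linspace(values.min(), values.max(), 4)
print(expected)
```

The two arrays agree except for the first edge, which pd.cut places slightly below the minimum so that the value 1 falls inside the first right-closed bin.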
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.