Pandas: how to use pd.cut()

Here is the snippet:

import pandas as pd

test = pd.DataFrame({'days': [0,31,45]})
test['range'] = pd.cut(test.days, [0,30,60])

Output:

    days    range
0   0       NaN
1   31      (30, 60]
2   45      (30, 60]

I am surprised that 0 is not in (0, 30]. What should I do to categorize 0 into the first bin?

Answers:


Method 1

test['range'] = pd.cut(test.days, [0,30,60], include_lowest=True)
print (test)
   days           range
0     0  (-0.001, 30.0]
1    31    (30.0, 60.0]
2    45    (30.0, 60.0]

Compare the three variants:

test = pd.DataFrame({'days': [0,20,30,31,45,60]})

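# include_lowest=True: the first bin also includes its left edge, so 0 is kept
test['range1'] = pd.cut(test.days, [0,30,60], include_lowest=True)
# right=False: bins are closed on the left, so 30 falls in [30, 60) and 60 is dropped
test['range2'] = pd.cut(test.days, [0,30,60], right=False)
# default (right=True): bins are closed on the right, so 30 falls in (0, 30] and 0 is dropped
test['range3'] = pd.cut(test.days, [0,30,60])
print(test)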
   days          range1    range2    range3
0     0  (-0.001, 30.0]   [0, 30)       NaN
1    20  (-0.001, 30.0]   [0, 30)   (0, 30]
2    30  (-0.001, 30.0]  [30, 60)   (0, 30]
3    31    (30.0, 60.0]  [30, 60)  (30, 60]
4    45    (30.0, 60.0]  [30, 60)  (30, 60]
5    60    (30.0, 60.0]       NaN  (30, 60]
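
If left-closed bins are wanted but the top value must not be lost, one possible workaround (my own assumption, not part of the original answer) is to replace the last edge with infinity. Note that this makes the last bin open-ended, so it is only a sketch of the idea:

```python
import numpy as np
import pandas as pd

test = pd.DataFrame({'days': [0, 20, 30, 31, 45, 60]})

# Left-closed bins; the last edge is +inf so that 60 still lands in a bin.
# Replacing 60 with np.inf is an illustrative choice, not the original bins.
test['range'] = pd.cut(test.days, [0, 30, np.inf], right=False)

# Count the values that fell outside every bin (should be none here).
unbinned = int(test['range'].isna().sum())
print(test)
print("unbinned values:", unbinned)
```

With these edges 0 and 20 fall in the first bin, and 30 through 60 fall in the second.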

Or use numpy.searchsorted, which returns integer bin positions instead of intervals. The array of bin edges has to be sorted; the values being looked up do not:

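import numpy as np

arr = np.array([0,30,60])
# side='left' (default): index of the first edge >= value
test['range1'] = arr.searchsorted(test.days)
# side='right' minus 1: index of the last edge <= value
test['range2'] = arr.searchsorted(test.days, side='right') - 1
print(test)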
   days  range1  range2
0     0       0       0
1    20       1       0
2    30       1       1
3    31       2       1
4    45       2       1
5    60       2       2
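
A small self-contained check of the searchsorted approach, assuming the same edges as above. The days are deliberately shuffled here to show that only the edge array needs to be sorted; the variable names are mine:

```python
import numpy as np
import pandas as pd

edges = np.array([0, 30, 60])           # bin edges, must be sorted ascending
days = pd.Series([45, 0, 60, 30, 20])   # values to look up, in any order

# Index of the first edge that is >= each value.
left = edges.searchsorted(days.to_numpy())
# Index of the last edge that is <= each value.
right = edges.searchsorted(days.to_numpy(), side='right') - 1

print(pd.DataFrame({'days': days, 'left': left, 'right': right}))
```

The indices for each value match the table above, regardless of the order of days.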

Method 2

According to the pd.cut documentation, you can pass right=False to make the bins closed on the left and open on the right:

test = pd.DataFrame({'days': [0,31,45]})
test['range'] = pd.cut(test.days, [0,30,60], right=False)

test

   days     range
0     0   [0, 30)
1    31  [30, 60)
2    45  [30, 60)

Method 3

You can pass labels to pd.cut() as well.
The following example uses student grades in the range 0-10.
We add a new column called 'grade_cat' to categorize the grades.

bins holds the bin edges: with the default right=True the intervals are (0, 4], (4, 6] and (6, 10], so a grade of exactly 0 is not binned unless include_lowest=True is passed.
The corresponding labels are "poor", "normal" and "excellent".

bins = [0, 4, 6, 10]
labels = ["poor","normal","excellent"]
student['grade_cat'] = pd.cut(student['grade'], bins=bins, labels=labels)
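
The snippet above does not define student, so here is a runnable sketch with a made-up DataFrame (the grades are hypothetical). I also pass include_lowest=True, which is my addition, so that a grade of 0 gets a label instead of NaN:

```python
import pandas as pd

# Hypothetical sample data, only for illustration.
student = pd.DataFrame({'grade': [0, 3, 4, 5, 6, 9, 10]})

bins = [0, 4, 6, 10]
labels = ["poor", "normal", "excellent"]

# One label per interval: [0, 4], (4, 6], (6, 10].
student['grade_cat'] = pd.cut(student['grade'], bins=bins, labels=labels,
                              include_lowest=True)
print(student)
```

Note that there must be exactly one label fewer than there are bin edges.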

Method 4

A sample of how pd.cut works:

import pandas as pd

s = pd.Series([168,180,174,190,170,185,179,181,175,169,182,177,180,171])
# Split the values into 3 equal-width bins
pd.cut(s, 3)
# To add labels to the bins
pd.cut(s, 3, labels=["Small","Medium","Large"])

With an integer as the second argument, pd.cut splits the range of the data directly into that many equal-width bins.
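
To see how many values end up under each label, something like the following should work. It is a sketch using the same series, with value_counts added by me:

```python
import pandas as pd

s = pd.Series([168, 180, 174, 190, 170, 185, 179, 181, 175, 169, 182, 177, 180, 171])

# Three equal-width bins over the range 168..190, each with a label.
binned = pd.cut(s, 3, labels=["Small", "Medium", "Large"])

# Number of values per label.
counts = binned.value_counts()
print(counts)
```

Equal-width bins do not give equal counts: here the top bin only holds the two largest values.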

Method 5

@jezrael has explained almost all the use cases of pd.cut().

One use case that I would like to add is the following:

pd.cut(np.array([1,2,3,4,5,6]),3)

The number of bins is set by the second parameter, so we get the following output:

[(0.995, 2.667], (0.995, 2.667], (2.667, 4.333], (2.667, 4.333], (4.333, 6.0], (4.333, 6.0]]
Categories (3, interval[float64]): [(0.995, 2.667] < (2.667, 4.333] < (4.333, 6.0]]

Similarly, if we pass 2 as the number of bins (the second parameter), the output is:

[(0.995, 3.5], (0.995, 3.5], (0.995, 3.5], (3.5, 6.0], (3.5, 6.0], (3.5, 6.0]]
Categories (2, interval[float64]): [(0.995, 3.5] < (3.5, 6.0]]
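
The 0.995 in the output comes from how the edges are built: as far as I understand it, pandas spaces the edges evenly between the minimum and maximum and then lowers the first edge by 0.1% of the range so that the minimum is included. A sketch that checks this with retbins=True (the manual calculation is my reconstruction, not code taken from pandas):

```python
import numpy as np
import pandas as pd

values = np.array([1, 2, 3, 4, 5, 6])

# retbins=True also returns the computed bin edges.
binned, edges = pd.cut(values, 3, retbins=True)

# Manual reconstruction: evenly spaced edges, first edge lowered by 0.1% of the range.
expected = np.linspace(values.min(), values.max(), 4)
expected[0] -= 0.001 * (values.max() - values.min())

print(edges)
print(np.allclose(edges, expected))
```
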


All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
