How to set the value of a pandas column as list

I want to set the value of a pandas column to a list of strings. However, my attempts failed because pandas treats the value as an iterable, and I get: ValueError: Must have equal len keys and value when setting with an iterable.

Here is an MWE:

>> df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
>> df
   col1  col2
0     1     4
1     2     5
2     3     6

>> df['new_col'] = None
>> df.loc[df.col1 == 1, 'new_col'] = ['a', 'b']
ValueError: Must have equal len keys and value when setting with an iterable

I also tried setting the dtype to list with df.new_col = df.new_col.astype(list), but that didn’t work either.

I am wondering what would be the correct approach here.

EDIT

The answer provided here: Python pandas insert list into a cell using at didn’t work for me either.

Answers:

Method 1

This is not easy. One possible solution is to create a helper Series, which pandas aligns to the selected rows by index:

df.loc[df.col1 == 1, 'new_col'] = pd.Series([['a', 'b']] * len(df))
print (df)
   col1  col2 new_col
0     1     4  [a, b]
1     2     5     NaN
2     3     6     NaN
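
One caveat with the helper Series above: [['a', 'b']] * len(df) repeats a reference to the same list object, so a later in-place change to one cell would show up in every cell that shares it. Below is a minimal sketch of a variant that builds a separate list for each matching row. The function name set_list_where is hypothetical (it is not part of pandas), and the sketch assumes the target column already exists with object dtype.

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
# Create the target column up front as an object column of None values
df['new_col'] = pd.Series([None] * len(df), index=df.index, dtype=object)


def set_list_where(frame, mask, column, value):
    """Hypothetical helper: store a copy of `value` in `column` where `mask` is True."""
    rows = frame.index[mask.to_numpy()]
    # One separate list per matching row, labelled with that row's index
    helper = pd.Series([list(value) for _ in rows], index=rows, dtype=object)
    # .loc aligns the helper Series to the selected rows by index label
    frame.loc[mask, column] = helper


set_list_where(df, df['col1'] == 1, 'new_col', ['a', 'b'])
print(df)
```

Rows that do not match the mask keep whatever value they had before, which is None in this sketch.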

Another solution, if you also need to set the remaining rows to an empty list, is to use a list comprehension:

#df['new_col'] = [['a', 'b'] if x == 1 else np.nan for x in df['col1']]

df['new_col'] = [['a', 'b'] if x == 1 else [] for x in df['col1']]
print (df)
   col1  col2 new_col
0     1     4  [a, b]
1     2     5      []
2     3     6      []

But then you lose the vectorised functionality which goes with using NumPy arrays held in contiguous memory blocks.

Method 2

Don’t do this.

Pandas was never designed to hold lists in series / columns. You can concoct expensive workarounds, but these are not recommended.

The main reason holding lists in series is not recommended is you lose the vectorised functionality which goes with using NumPy arrays held in contiguous memory blocks. Your series will be of object dtype, which represents a sequence of pointers, much like list. You will lose benefits in terms of memory and performance, as well as access to optimized Pandas methods.
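
To make the object-dtype point concrete, here is a small sketch with made-up data (the variable names are illustrative only). It compares the dtype of a plain numeric series with a series of lists, then uses Series.explode to flatten the lists into one value per row, a layout that vectorised methods can work with.

```python
import pandas as pd

numbers = pd.Series([1, 2, 3, 4])
lists = pd.Series([[1, 2], [3, 4]])

# A numeric series is backed by a typed array; a series of lists is not
print(numbers.dtype)
print(lists.dtype)

# explode() gives one row per list element and repeats the original index;
# the result is still object dtype, so cast it back to integers
flat = lists.explode().astype('int64')
print(flat)

# Vectorised aggregation per original row now works on the flat series
totals = flat.groupby(level=0).sum()
print(totals)
```

Whether a flat layout suits your data is a design choice; the sketch only illustrates the dtype difference described above.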

See also What are the advantages of NumPy over regular Python lists? The arguments in favour of Pandas are the same as for NumPy.

That said, since you are going against the purpose and design of Pandas, there are many who face the same problem and have asked similar questions:

Method 3

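You can try the code below:

import numpy as np
import pandas as pd
list1 = [1, 2, 3]
list2 = [4, 5, 6]
# Join each list into a comma-separated string to use as a column name
col = [",".join(map(str, list1)), ",".join(map(str, list2))]
# high is exclusive and must exceed low; high=1 fills the frame with zeros
df = pd.DataFrame(np.random.randint(low=0, high=1, size=(5, 2)), columns=col)
print(df)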

Hope this is the expected output:

Method 4

Also using np.where:

df['new_col'] = np.where(df.col1 == 1,  pd.Series([['a', 'b']]) , np.nan)
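
A hedged variant of the same idea, as a sketch: instead of relying on how np.where converts the one-element Series, it builds a one-element NumPy object array explicitly (the name holder is illustrative) and lets it broadcast against the condition. Because of broadcasting, every matching row would refer to the same list object.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})

# One-element object array that holds the list as a single item
holder = np.empty(1, dtype=object)
holder[0] = ['a', 'b']

# Where the condition is True take the list, otherwise NaN
df['new_col'] = np.where(df['col1'] == 1, holder, np.nan)
print(df)
```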

All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
