I want to find all values in a Pandas dataframe that contain whitespace (any arbitrary amount) and replace those values with NaNs.
Any ideas how this can be improved?
Basically I want to turn this:
                   A    B   C
2000-01-01 -0.532681  foo   0
2000-01-02  1.490752  bar   1
2000-01-03 -1.387326  foo   2
2000-01-04  0.814772  baz
2000-01-05 -0.222552        4
2000-01-06 -1.176781  qux
Into this:
                   A    B    C
2000-01-01 -0.532681  foo    0
2000-01-02  1.490752  bar    1
2000-01-03 -1.387326  foo    2
2000-01-04  0.814772  baz  NaN
2000-01-05 -0.222552  NaN    4
2000-01-06 -1.176781  qux  NaN
I’ve managed to do it with the code below, but man is it ugly. It’s not Pythonic and I’m sure it’s not the most efficient use of pandas either. I loop through each column and do boolean replacement against a column mask generated by applying a function that does a regex search of each value, matching on whitespace.
for i in df.columns:
    df[i][df[i].apply(lambda i: True if re.search(r'^\s*$', str(i)) else False)]=None
It could be optimized a bit by only iterating through fields that could contain empty strings:
if df[i].dtype == np.dtype('object')
But that’s not much of an improvement.
And finally, this code sets the target strings to None, which works with Pandas’ functions like fillna(), but it would be nice for completeness if I could actually insert a NaN directly instead of None.
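For reference, the loop described above can be tightened without changing its overall shape. This is only a sketch on a made-up three-row frame: it skips numeric columns, builds the mask with a plain string check instead of a regex, and assigns np.nan through .loc rather than chained indexing.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Illustrative data, not the frame from the question
df = pd.DataFrame({
    "A": [-0.53, 1.49, -1.39],
    "B": ["foo", " ", "qux"],
    "C": [0, "   ", 2],
})

for col in df.columns:
    if is_numeric_dtype(df[col]):
        continue  # purely numeric columns cannot hold blank strings
    # True where the cell is a string that is empty or whitespace-only
    is_blank = df[col].map(lambda v: isinstance(v, str) and v.strip() == "")
    df.loc[is_blank, col] = np.nan  # a real NaN, not None

print(df)
```

Using .loc with a boolean mask avoids the chained assignment in the original loop, which pandas may apply to a temporary copy.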
Answers:
Method 1
I think df.replace() does the job, since pandas 0.13:
import numpy as np
import pandas as pd

df = pd.DataFrame([
    [-0.532681, 'foo', 0],
    [1.490752, 'bar', 1],
    [-1.387326, 'foo', 2],
    [0.814772, 'baz', ' '],
    [-0.222552, ' ', 4],
    [-1.176781, 'qux', ' '],
], columns='A B C'.split(), index=pd.date_range('2000-01-01','2000-01-06'))
# replace field that's entirely space (or empty) with NaN
print(df.replace(r'^\s*$', np.nan, regex=True))
Produces:
                   A    B    C
2000-01-01 -0.532681  foo    0
2000-01-02  1.490752  bar    1
2000-01-03 -1.387326  foo    2
2000-01-04  0.814772  baz  NaN
2000-01-05 -0.222552  NaN    4
2000-01-06 -1.176781  qux  NaN
As Temak pointed out, use df.replace(r'^\s+$', np.nan, regex=True) in case your valid data contains white spaces.
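The difference between the two anchored patterns is easiest to see side by side. A minimal sketch on a made-up one-column frame (the column name and values are illustrative only):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"B": ["foo", "", "   ", "fo o"]})

# \s* also matches the empty string; \s+ needs at least one whitespace character
star = df.replace(r"^\s*$", np.nan, regex=True)
plus = df.replace(r"^\s+$", np.nan, regex=True)

print(star["B"].isna().tolist())
print(plus["B"].isna().tolist())
```

Because both patterns are anchored with ^ and $, a value such as 'fo o' that merely contains a space is left alone in either case.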
Method 2
If you want to replace empty strings as well as records containing only spaces, the correct answer is:
df = df.replace(r'^\s*$', np.nan, regex=True)
The accepted answer
df.replace(r'\s+', np.nan, regex=True)
does not replace an empty string. You can try it yourself with the given example, slightly updated:
df = pd.DataFrame([
    [-0.532681, 'foo', 0],
    [1.490752, 'bar', 1],
    [-1.387326, 'fo o', 2],
    [0.814772, 'baz', ' '],
    [-0.222552, ' ', 4],
    [-1.176781, 'qux', ''],
], columns='A B C'.split(), index=pd.date_range('2000-01-01','2000-01-06'))
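Note also that 'fo o' is not replaced with NaN, even though it contains a space.
Note further that a simple:
df.replace(r'', np.nan)
does not work either: without regex=True it only matches cells that are exactly the empty string, so whitespace-only cells are left untouched. Try it out.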
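To try the last point out, here is a small sketch comparing the literal replacement with the regex one on made-up data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"B": ["foo", "", "   "]})

# Without regex=True, replace() compares whole cell values for equality
literal = df.replace("", np.nan)
# With the anchored regex, whitespace-only cells are caught as well
pattern = df.replace(r"^\s*$", np.nan, regex=True)

print(literal["B"].isna().tolist())
print(pattern["B"].isna().tolist())
```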
Method 3
How about:
d = d.applymap(lambda x: np.nan if isinstance(x, basestring) and x.isspace() else x)
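The applymap function applies a function to every cell of the dataframe. Note that basestring exists only in Python 2; in Python 3 use str instead. Since pandas 2.1, applymap is deprecated in favor of DataFrame.map.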
Method 4
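I did this:
df = df.apply(lambda x: x.str.strip()).replace('', np.nan)
or, if some columns hold non-string values:
df = df.apply(lambda col: col.map(lambda x: x.strip() if isinstance(x, str) else x)).replace('', np.nan)
You can strip all str values, then replace the empty str with np.nan. The first form only works when every column contains strings, because the .str accessor raises on other dtypes; the second checks each cell individually. Note that both also strip leading and trailing whitespace from the values you keep.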
Method 5
If you are importing the data from a CSV file, it can be as simple as this:
df = pd.read_csv(file_csv, na_values=' ')
This creates the data frame and treats fields equal to ' ' as NaN while reading.
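If the blank fields in the file can hold any number of spaces, one option is to read the file normally and then apply the regex replacement from the earlier answers. A sketch, using an in-memory string in place of a real file (the CSV content is made up for illustration):

```python
import io

import numpy as np
import pandas as pd

# Row 2 has a whitespace-only field, row 3 has a truly empty field
csv_text = "A,B\n1,foo\n2,   \n3,\n4,bar\n"

df = pd.read_csv(io.StringIO(csv_text))
before = df["B"].isna().tolist()  # only the empty field is NaN by default

df = df.replace(r"^\s*$", np.nan, regex=True)
after = df["B"].isna().tolist()   # the whitespace-only field is NaN as well

print(before)
print(after)
```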
Method 6
Simplest of all solutions:
df = df.replace(r'^\s+$', np.nan, regex=True)
Method 7
For a very fast and simple solution where you check equality against a single value, you can use the mask method.
df.mask(df == ' ')
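The equality check above only catches cells that are exactly one space. If the amount of whitespace varies, the same mask method can be fed a broader boolean frame. The following is a sketch on made-up data, not part of the original answer:

```python
import pandas as pd

df = pd.DataFrame({
    "A": [0.5, 1.5, 2.5],
    "B": ["foo", "  ", ""],
    "C": [0, " ", 2],
})

# Convert every cell to text, strip it, and flag the ones that end up empty
blank = df.astype(str).apply(lambda col: col.str.strip() == "")

# mask() puts NaN wherever the condition is True
cleaned = df.mask(blank)
print(cleaned)
```

The astype(str) step is only used to build the mask, so the values kept in cleaned retain their original types.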
Method 8
These are all close to the right answer, but I wouldn’t say any solve the problem while remaining most readable to others reading your code. I’d say that answer is a combination of BrenBarn’s Answer and tuomasttik’s comment below that answer. BrenBarn’s answer utilizes isspace builtin, but does not support removing empty strings, as OP requested, and I would tend to attribute that as the standard use case of replacing strings with null.
I rewrote it so you can call it on a pd.Series (with .apply) or on a pd.DataFrame (with .applymap, renamed DataFrame.map in pandas 2.1). DataFrame.apply would pass whole columns to the lambda rather than single cells, so the isinstance check would never match.
Python 3:
To replace empty strings or strings of entirely spaces:
df = df.applymap(lambda x: np.nan if isinstance(x, str) and (x.isspace() or not x) else x)
To replace strings of entirely spaces:
df = df.applymap(lambda x: np.nan if isinstance(x, str) and x.isspace() else x)
To use this in Python 2, you’ll need to replace str with basestring.
Python 2:
To replace empty strings or strings of entirely spaces:
df = df.applymap(lambda x: np.nan if isinstance(x, basestring) and (x.isspace() or not x) else x)
To replace strings of entirely spaces:
df = df.applymap(lambda x: np.nan if isinstance(x, basestring) and x.isspace() else x)
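For readability, the same cell-wise check can be wrapped in a named function. This is a sketch for Python 3; blank_to_nan and clean_frame are hypothetical helper names, not pandas functions. It picks DataFrame.map when the installed pandas has it and falls back to applymap otherwise.

```python
import numpy as np
import pandas as pd


def blank_to_nan(value):
    """Return NaN for empty or whitespace-only strings, else the value unchanged."""
    if isinstance(value, str) and value.strip() == "":
        return np.nan
    return value


def clean_frame(df):
    """Apply blank_to_nan to every cell of a DataFrame."""
    # DataFrame.map was added in pandas 2.1; older versions only have applymap
    method = "map" if hasattr(pd.DataFrame, "map") else "applymap"
    return getattr(df, method)(blank_to_nan)


frame = pd.DataFrame({"B": ["foo", " ", ""], "C": [0, "  ", 2]})
print(clean_frame(frame))

# On a single column, Series.apply works cell by cell
print(pd.Series(["a", ""]).apply(blank_to_nan))
```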
Method 9
This worked for me.
When I import my CSV file I add na_values=' '. Spaces are not included in the default NaN values.
df = pd.read_csv(filepath, na_values=' ')
Method 10
print(df.isnull().sum())  # check the number of null values in each column
modifiedDf = df.fillna("NaN")  # replace null values with the string "NaN"
# modifiedDf = df.dropna()  # remove rows with null values
print(modifiedDf.isnull().sum())  # check the number of null values in each column
Method 11
This is not an elegant solution, but what does seem to work is saving to XLSX and then importing it back. The other solutions on this page did not work for me, unsure why.
data.to_excel(filepath, index=False)
data = pd.read_excel(filepath)
Method 12
This should work
df.loc[df.Variable == '', 'Variable'] = 'Value'
or
df.loc[df.Variable1 == '', 'Variable2'] = 'Value'
Method 13
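You can also use a filter (a boolean mask) to do it.
df = pd.DataFrame([
    [-0.532681, 'foo', 0],
    [1.490752, 'bar', 1],
    [-1.387326, 'foo', 2],
    [0.814772, 'baz', ' '],
    [-0.222552, ' ', 4],
    [-1.176781, 'qux', ' '],
])
df[df == ' '] = np.nan  # assign a real NaN rather than the string 'nan'
df[2] = df[2].astype(float)  # column 2 is purely numeric once the blanks are NaN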
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.