I am using pandas in Python to compare two files like this:
import pandas as pd

# sep and delimiter are aliases, so only one of them is passed here
df1 = pd.read_csv('input1.csv', sep=',', encoding="utf-8")
df2 = pd.read_csv('input2.csv', sep=',', encoding="utf-8")
df3 = pd.merge(df1, df2, on='employee_id', how='right')
df3.to_csv('output.csv', encoding='utf-8', index=False)
Currently I run the files through a script beforehand that strips spaces from the employee_id column.
An example of employee_ids:
37 78973 3 23787 2 22 3 123
Is there a way to do this while reading the files and save me a step?
Answers:
Method 1
You can strip() an entire Series in Pandas using .str.strip():
df1['employee_id'] = df1['employee_id'].str.strip()
df2['employee_id'] = df2['employee_id'].str.strip()
This removes leading and trailing whitespace from the employee_id column in both df1 and df2. Spaces inside a value are left alone.
Alternatively, you can modify your read_csv lines to also use skipinitialspace=True, which skips spaces that follow a delimiter:
df1 = pd.read_csv('input1.csv', sep=',', encoding="utf-8", skipinitialspace=True)
df2 = pd.read_csv('input2.csv', sep=',', encoding="utf-8", skipinitialspace=True)
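To see what skipinitialspace does and does not cover, here is a small sketch. The sample rows and the name column are made up for illustration, and dtype=str is only there to keep the ids as text:

```python
import io
import pandas as pd

# Hypothetical sample data: a space follows each comma, and the second
# employee_id also has spaces inside the value itself.
raw = "name,employee_id\nAnn, 123\nBob, 2 22 3\n"

df = pd.read_csv(io.StringIO(raw), skipinitialspace=True, dtype=str)

ids = df["employee_id"].tolist()
print(ids)  # the space after the comma is gone, the inner spaces remain
```

So this option helps with padding after the delimiter, but not with spaces in the middle of an id.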
It looks like you are attempting to remove spaces in a string containing numbers. You can do this by:
df1['employee_id'] = df1['employee_id'].str.replace(" ","")
df2['employee_id'] = df2['employee_id'].str.replace(" ","")
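A quick side-by-side of the two approaches, as a rough sketch. The values below are hypothetical ids modelled on the question, and regex=False is passed so the space is treated as a literal string:

```python
import pandas as pd

# Hypothetical ids: some padded at the ends, some with inner spaces
ids = pd.Series(["37 78973 3", " 23787 ", "2 22 3", "123"])

# .str.strip() only trims the ends of each value
stripped = ids.str.strip().tolist()

# .str.replace(" ", "") removes every space, including inner ones
no_spaces = ids.str.replace(" ", "", regex=False).tolist()

print(stripped)
print(no_spaces)
```

If the ids must match across both files after the spaces are removed, the replace version is probably the one you want.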
Method 2
You can do the strip() in pandas.read_csv() as:
pandas.read_csv(..., converters={'employee_id': str.strip})
And if you need to only strip leading whitespace:
pandas.read_csv(..., converters={'employee_id': str.lstrip})
And to remove all spaces:
def strip_spaces(a_str_with_spaces):
    return a_str_with_spaces.replace(' ', '')
pandas.read_csv(..., converters={'employee_id': strip_spaces})
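Putting the converter together with the merge from the question might look like the sketch below. The CSV contents, the name and dept columns, and the use of io.StringIO in place of real files are all assumptions made for the example:

```python
import io
import pandas as pd

def strip_spaces(value):
    # Converters receive each cell as a string before type inference
    return value.replace(" ", "")

# Hypothetical file contents standing in for input1.csv and input2.csv
left_csv = "employee_id,name\n37 78973 3,Ann\n2 22 3,Bob\n"
right_csv = "employee_id,dept\n37789733,Sales\n2223,Ops\n"

conv = {"employee_id": strip_spaces}
df1 = pd.read_csv(io.StringIO(left_csv), converters=conv)
df2 = pd.read_csv(io.StringIO(right_csv), converters=conv)

# Both key columns now hold space-free strings, so the keys line up
df3 = pd.merge(df1, df2, on="employee_id", how="right")
print(df3)
```

Applying the same converter to both files also keeps employee_id as a string on both sides, which avoids merging a text column against a numeric one.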
Method 3
Df['employee']=Df['employee'].str.strip()
Method 4
A simple way to remove leading and trailing whitespace from a column in a pandas DataFrame is:
df1 = pd.read_csv('input1.csv')
df1["employee_id"] = df1["employee_id"].str.strip()
That’s it
Method 5
A DataFrame (df) may have several column names that contain spaces. A general and easy way to remove them is shown here. Note that this changes the column names, not the values in the columns:
df.columns = df.columns.str.replace(' ', '')
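For example, with a made-up DataFrame whose headers contain spaces (the column names here are hypothetical), something like this should work:

```python
import pandas as pd

# Hypothetical headers with padding and inner spaces
df = pd.DataFrame({" employee id ": ["123"], "first name": ["Ann"]})

# Trim only the ends of each header
trimmed = df.columns.str.strip().tolist()

# Or drop every space from the headers
df.columns = df.columns.str.replace(" ", "", regex=False)
cols = df.columns.tolist()

print(trimmed)
print(cols)
```

Use .str.strip() instead if you only want to drop padding around the names and keep spaces between words.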
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.