So my dataset has some information by location for n dates. The problem is each date is actually a different column header. For example the CSV looks like
location  name    Jan-2010  Feb-2010  March-2010
A         "test"  12        20        30
B         "foo"   18        20        25
What I would like is for it to look like
location  name    Date        Value
A         "test"  Jan-2010    12
A         "test"  Feb-2010    20
A         "test"  March-2010  30
B         "foo"   Jan-2010    18
B         "foo"   Feb-2010    20
B         "foo"   March-2010  25
My problem is that I don't know how many date columns there are (though I know they will always come after name).
Answers:
Method 1
UPDATE
From pandas v0.20, melt is also available as a DataFrame method, so you can now use:
df.melt(id_vars=["location", "name"],
        var_name="Date",
        value_name="Value")
  location    name        Date  Value
0        A  "test"    Jan-2010     12
1        B   "foo"    Jan-2010     18
2        A  "test"    Feb-2010     20
3        B   "foo"    Feb-2010     20
4        A  "test"  March-2010     30
5        B   "foo"  March-2010     25
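Because only the id columns are named, the call does not need to know how many date columns there are. A minimal, self-contained sketch, assuming the sample data from the question is built by hand and that the id columns are always the first two columns (the names wide, id_cols and long_df are illustrative, not from the original answer):

```python
import pandas as pd

# Sample data from the question, built by hand for illustration.
wide = pd.DataFrame({
    "location": ["A", "B"],
    "name": ["test", "foo"],
    "Jan-2010": [12, 18],
    "Feb-2010": [20, 20],
    "March-2010": [30, 25],
})

# Assumption: the id columns are always the first two columns,
# so every remaining column is treated as a date column.
id_cols = list(wide.columns[:2])

# No date column is named explicitly; melt unpivots everything not in id_vars.
long_df = wide.melt(id_vars=id_cols, var_name="Date", value_name="Value")
print(long_df)
```

Adding or removing date columns in wide should not require any change to the melt call.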
OLD(ER) VERSIONS: <0.20
You can use pd.melt to get most of the way there, and then sort:
>>> df
  location  name  Jan-2010  Feb-2010  March-2010
0        A  test        12        20          30
1        B   foo        18        20          25
>>> df2 = pd.melt(df, id_vars=["location", "name"],
...               var_name="Date", value_name="Value")
>>> df2
  location  name        Date  Value
0        A  test    Jan-2010     12
1        B   foo    Jan-2010     18
2        A  test    Feb-2010     20
3        B   foo    Feb-2010     20
4        A  test  March-2010     30
5        B   foo  March-2010     25
>>> df2 = df2.sort(["location", "name"])
>>> df2
  location  name        Date  Value
0        A  test    Jan-2010     12
2        A  test    Feb-2010     20
4        A  test  March-2010     30
1        B   foo    Jan-2010     18
3        B   foo    Feb-2010     20
5        B   foo  March-2010     25
(Might want to throw in a .reset_index(drop=True), just to keep the output clean.)
Note: pd.DataFrame.sort was deprecated in favour of pd.DataFrame.sort_values and is no longer available in current pandas, so use df2.sort_values(["location", "name"]) there.
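On current pandas the same steps can be written with sort_values. This is a sketch, assuming the sample frame from the question. The ordered categorical is an addition that is not part of the original answer; it is used here so that the dates sort in their original column order rather than alphabetically.

```python
import pandas as pd

df = pd.DataFrame({
    "location": ["A", "B"],
    "name": ["test", "foo"],
    "Jan-2010": [12, 18],
    "Feb-2010": [20, 20],
    "March-2010": [30, 25],
})

# Remember the original left-to-right order of the date columns.
date_cols = list(df.columns[2:])

df2 = pd.melt(df, id_vars=["location", "name"],
              var_name="Date", value_name="Value")

# Assumption: the column order in the wide frame is the desired date order.
df2["Date"] = pd.Categorical(df2["Date"], categories=date_cols, ordered=True)

# sort_values replaces the old DataFrame.sort; reset_index tidies the row labels.
df2 = df2.sort_values(["location", "name", "Date"]).reset_index(drop=True)
print(df2)
```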
Method 2
Use set_index with stack to get a Series with a MultiIndex, then turn it back into a DataFrame with reset_index and rename the generated column:
df1 = (df.set_index(["location", "name"])
         .stack()
         .reset_index(name='Value')
         .rename(columns={'level_2':'Date'}))
print (df1)
  location  name        Date  Value
0        A  test    Jan-2010     12
1        A  test    Feb-2010     20
2        A  test  March-2010     30
3        B   foo    Jan-2010     18
4        B   foo    Feb-2010     20
5        B   foo  March-2010     25
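A variant of the same idea, sketched under the assumption that your pandas version supports rename_axis(columns=...): if the column axis is given a name before stacking, the stacked level is already called Date and the rename of level_2 is not needed.

```python
import pandas as pd

df = pd.DataFrame({
    "location": ["A", "B"],
    "name": ["test", "foo"],
    "Jan-2010": [12, 18],
    "Feb-2010": [20, 20],
    "March-2010": [30, 25],
})

df1 = (df.set_index(["location", "name"])
         .rename_axis(columns="Date")   # name the column axis; stack() reuses it as the level name
         .stack()
         .reset_index(name="Value"))
print(df1)
```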
Method 3
pd.wide_to_long
You can add a prefix to your date columns and then feed the frame directly to pd.wide_to_long. I won't pretend this is efficient, but in certain situations it may be more convenient than pd.melt, e.g. when your columns already have an appropriate prefix.
df.columns = np.hstack((df.columns[:2], df.columns[2:].map(lambda x: f'Value{x}')))
# suffix=r'.+' is needed where the default suffix pattern only matches digits
res = (pd.wide_to_long(df, stubnames=['Value'], i='name', j='Date', suffix=r'.+')
         .reset_index()
         .sort_values(['location', 'name']))
print(res)
   name        Date location  Value
0  test    Jan-2010        A     12
2  test    Feb-2010        A     20
4  test  March-2010        A     30
1   foo    Jan-2010        B     18
3   foo    Feb-2010        B     20
5   foo  March-2010        B     25
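For the case mentioned above where the columns already carry a suitable prefix, pd.wide_to_long can be applied without renaming anything. A sketch with hypothetical column names (Value_2010, Value_2011) that are not part of the question's data:

```python
import pandas as pd

# Hypothetical frame whose value columns already share the "Value_" prefix.
df = pd.DataFrame({
    "location": ["A", "B"],
    "name": ["test", "foo"],
    "Value_2010": [12, 18],
    "Value_2011": [20, 25],
})

# sep="_" splits "Value_2010" into the stub "Value" and the suffix "2010";
# the default suffix pattern matches numeric suffixes like these.
res = (pd.wide_to_long(df, stubnames="Value",
                       i=["location", "name"], j="Year", sep="_")
         .reset_index())
print(res)
```

Here i takes both id columns, so the combination of location and name identifies each row.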
Method 4
I guess I found a simpler solution:
temp1 = pd.melt(df1, id_vars=["location"], var_name='Date', value_name='Value')
temp2 = pd.melt(df1, id_vars=["name"], var_name='Date', value_name='Value')
Then add temp2's name column to temp1:
temp1['new_column'] = temp2['name']
You now have what you asked for, except that temp1 also contains rows where Date is "name" (because name was not passed in id_vars); drop those rows to get the final result.
Method 5
Adding a link to a notebook which you can duplicate, demonstrating @DMS’s answer using pandas.melt:
df.melt(id_vars=["location", "name"],
        var_name="date",
        value_name="value")
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.