Creating new columns that contain the value of a specific index

I have tried multiple methods that get me to a point close to but not exactly where I want to be with the final output. I am trying to first create a few columns that contain a specific within the raw dataframe based on it’s position, afterwards I am trying to make a particular row the header row and skip all the rows that were above it.

Raw input:

    |           NA            |  NA_1 |  NA_2  |  NA_3 |
0   | 12-Month Percent Change |  NaN  |  NaN   |  NaN  |
1   | Series Id: CUUR0000SAF1 |  NaN  |  NaN   |  NaN  |
2   |       Item: Food        |  NaN  |  NaN   |  NaN  |
3   |           Year          |  Jan  |  Feb   |  Mar  |
4   |           2010          | -0.4  | -0.2   |  0.2  |
5   |           2011          |  1.8  |  2.3   |  2.9  |

Code used:

df1['View Description'] = df1.iat[0,0]
df1['Series ID'] = df1.iat[1,1]
df1['Series Name'] = df1.iat[2,1]
df1

Resulted to:

    NA  NA.1    NA.2    NA.3    NA.4    NA.5    NA.6    NA.7    View Description    Series ID   Series Name
0   12-Month Percent Change NaN NaN NaN NaN NaN NaN NaN 12-Month Percent Change CUUR0000SAF1    Food
1   Series Id:  CUUR0000SAF1    NaN NaN NaN NaN NaN NaN 12-Month Percent Change CUUR0000SAF1    Food
2   Item:   Food    NaN NaN NaN NaN NaN NaN 12-Month Percent Change CUUR0000SAF1    Food
3   Year    Jan Feb Mar Apr May Jun Jul 12-Month Percent Change CUUR0000SAF1    Food
4   2010    -0.4    -0.2    0.2 0.5 0.7 0.7 0.9 12-Month Percent Change CUUR0000SAF1    Food
5   2011    1.8 2.3 2.9 3.2 3.5 3.7 4.2 12-Month Percent Change CUUR0000SAF1    Food
6   2012    4.4 3.9 3.3 3.1 2.8 2.7 2.3 12-Month Percent Change CUUR0000SAF1    Food
7   2013    1.6 1.6 1.5 1.5 1.4 1.4 1.4 12-Month Percent Change CUUR0000SAF1    Food

Last thing is I want to make the header the row 3 and remove all the rows above it. BUT still keep the three columns at the end: 1) View Description, Series ID, Series Name.

Any suggestions with an efficient way that this can be done as next I want to scale it up with a for loop or something that would do this process for x10 files.

Thanks in advance!

Contents hide

Answers:

Method 1

Answers:

Thank you for visiting the Q&A section on Magenaut. Please note that all the answers may not help you solve the issue immediately. So please treat them as advisements. If you found the post helpful (or not), leave a comment & I’ll get back to you as soon as possible.

Method 1

Here’s a way to do what I believe your question is asking:

# Parse and store the first 3 values in column 0 so that we can use them 
# as values for 3 new columns later.
new_columns = [x.split(':')[-1].strip() for x in df1.iloc[0:3,0].to_list()]

# Transpose so that we can use set_index() to replace the index 
# (the columns from the original df1) to ['Item: Food', NaN, NaN, NaN], 
# then transpose back so that the new index becomes the columns.
df1 = df1.T.set_index(3).T

# Use reset_index() to replace the index with a fresh range 
# index (0, 1, 2, ...) so we can use iloc() to discard the 
# first 3 unwanted rows, then call reset_index() again.
df1 = df1.reset_index(drop=True).iloc[3:].reset_index(drop=True)

# Get rid of vestigial name for columns.
df1.columns.names = [None]

# Add the three new columns set to the values saved earlier.
df1[['View Description', 'Series ID', 'Series Name']] = new_columns

Here is full test case (with the above annotated code compressed into fewer lines):

import pandas as pd
s = [
'    |           NA            |  NA_1 |  NA_2  |  NA_3 |',
'0   | 12-Month Percent Change |  NaN  |  NaN   |  NaN  |',
'1   | Series Id: CUUR0000SAF1 |  NaN  |  NaN   |  NaN  |',
'2   |       Item: Food        |  NaN  |  NaN   |  NaN  |',
'3   |           Year          |  Jan  |  Feb   |  Mar  |',
'4   |           2010          | -0.4  | -0.2   |  0.2  |',
'5   |           2011          |  1.8  |  2.3   |  2.9  |']

df1 = pd.DataFrame(
    [[x.strip() for x in y.split('|')[1:-1]] for y in s[1:]],
    columns = [x.strip() for x in s[0].split('|')[1:-1]],
)
print(df1)
new_columns = [x.split(':')[-1].strip() for x in df1.iloc[0:3,0].to_list()]
df1 = df1.T.set_index(3).T.reset_index(drop=True).iloc[3:].reset_index(drop=True)
df1.columns.names = [None]
df1[['View Description', 'Series ID', 'Series Name']] = new_columns
print(df1)

Output:

                        NA  NA_1  NA_2 NA_3
0  12-Month Percent Change   NaN   NaN  NaN
1  Series Id: CUUR0000SAF1   NaN   NaN  NaN
2               Item: Food   NaN   NaN  NaN
3                     Year   Jan   Feb  Mar
4                     2010  -0.4  -0.2  0.2
5                     2011   1.8   2.3  2.9
   Year   Jan   Feb  Mar         View Description     Series ID Series Name
0  2010  -0.4  -0.2  0.2  12-Month Percent Change  CUUR0000SAF1        Food
1  2011   1.8   2.3  2.9  12-Month Percent Change  CUUR0000SAF1        Food

UPDATE: This is code that allows us to configure (1) the cell coordinates of each of 3 cells to be used for new column values (new_col_coords) and (2) the header_row above which rows are discarded:

import pandas as pd
s = [
'    |           NA            |  NA_1 |  NA_2  |  NA_3 |',
'0   | 12-Month Percent Change |  NaN  |  NaN   |  NaN  |',
'91  | To be discarded         |  NaN  |  NaN   |  NaN  |',
'1   | Series Id: CUUR0000SAF1 |  Abc  |  NaN   |  NaN  |',
'92  | To be discarded         |  NaN  |  NaN   |  NaN  |',
'93  | To be discarded         |  NaN  |  NaN   |  NaN  |',
'94  | To be discarded         |  NaN  |  NaN   |  NaN  |',
'2   |       Item: Food        |  Xyz  |  NaN   |  NaN  |',
'95  | To be discarded         |  NaN  |  NaN   |  NaN  |',
'96  | To be discarded         |  NaN  |  NaN   |  NaN  |',
'97  | To be discarded         |  NaN  |  NaN   |  NaN  |',
'98  | To be discarded         |  NaN  |  NaN   |  NaN  |',
'3   |           Year          |  Jan  |  Feb   |  Mar  |',
'4   |           2010          | -0.4  | -0.2   |  0.2  |',
'5   |           2011          |  1.8  |  2.3   |  2.9  |']

df1 = pd.DataFrame(
    [[x.strip() for x in y.split('|')[1:-1]] for y in s[1:]],
    columns = [x.strip() for x in s[0].split('|')[1:-1]],
)
print(df1)

# parse and store the 3 values at specified coordinates so that we can use them as values for 3 new columns later
new_col_coords = [[0,0], [2,1], [6,1]]
new_columns = [x.split(':')[-1].strip() for x in [df1.iloc[i, j] for i, j in new_col_coords]]

header_row = 11

# Here's how to do everything that follows in one line of code:
#df1 = df1.T.set_index(header_row).T.reset_index(drop=True).iloc[header_row:].reset_index(drop=True)

# Transpose so that we can use set_index() to change the index to ['Item: Food', NaN, NaN, NaN], then transpose back so that index becomes the columns
df1 = df1.T.set_index(header_row).T

# Use reset_index() to replace the index with a fresh range index (0, 1, 2, ...) so we can use iloc() to discard the unwanted rows above header_row, then call reset_index() again
df1 = df1.reset_index(drop=True).iloc[header_row:].reset_index(drop=True)

# Get rid of vestigial name for columns
df1.columns.names = [None]

# Add the three new columns set to the values saved earlier
df1[['View Description', 'Series ID', 'Series Name']] = new_columns

print(df1)

Output:

                         NA  NA_1  NA_2 NA_3
0   12-Month Percent Change   NaN   NaN  NaN
1           To be discarded   NaN   NaN  NaN
2   Series Id: CUUR0000SAF1   Abc   NaN  NaN
3           To be discarded   NaN   NaN  NaN
4           To be discarded   NaN   NaN  NaN
5           To be discarded   NaN   NaN  NaN
6                Item: Food   Xyz   NaN  NaN
7           To be discarded   NaN   NaN  NaN
8           To be discarded   NaN   NaN  NaN
9           To be discarded   NaN   NaN  NaN
10          To be discarded   NaN   NaN  NaN
11                     Year   Jan   Feb  Mar
12                     2010  -0.4  -0.2  0.2
13                     2011   1.8   2.3  2.9
   Year   Jan   Feb  Mar         View Description Series ID Series Name
0  2010  -0.4  -0.2  0.2  12-Month Percent Change       Abc         Xyz
1  2011   1.8   2.3  2.9  12-Month Percent Change       Abc         Xyz

All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0

0 0 votes

Article Rating