Regression coefficients do not match conditional means

You can download the following data set from this repo.

Y CONST T X1 X1T X2 X2T
0 2.31252 1 1 0 0 1 1
1 -0.836074 1 1 1 1 1 1
2 -0.797183 1 0 0 0 1 0

I have a dependent variable (Y) and three binary columns (T, X1 and X2). From this data we can create four groups:

  1. X1 == 0 and X2 == 0
  2. X1 == 0 and X2 == 1
  3. X1 == 1 and X2 == 0
  4. X1 == 1 and X2 == 1

Within each group, I want to calculate the difference in the mean of Y between observations with T == 1 and T == 0.

I can do so with the following code:

# Libraries
import pandas as pd

# Group by T, X1, X2 and get the mean of Y
t = df.groupby(['T','X1','X2'])['Y'].mean().reset_index()

# Reshape the result and rename the columns
t = t.pivot(index=['X1','X2'], columns='T', values='Y')
t.columns = ['Teq0','Teq1']

# I want to replicate these differences with a regression
t['Teq1'] - t['Teq0']

> X1  X2
> 0   0     0.116175
>     1     0.168791
> 1   0    -0.027278
>     1    -0.147601

Problem

I want to recreate these results with the following regression model (m).

# Libraries
from statsmodels.api import OLS

# Fit regression with interaction terms
m = OLS(endog=df['Y'], exog=df[['CONST','T','X1','X1T','X2','X2T']]).fit()

# Estimated values
m.params[['T','X1T','X2T']]

> T      0.162198
> X1T   -0.230372
> X2T   -0.034303

I was expecting the coefficients:

  1. T = 0.116175
  2. T + X1T = 0.168791
  3. T + X2T = -0.027278
  4. T + X1T + X2T = -0.147601

Question

Why don’t the regression coefficients match the results from the first chunk’s output (t['Teq1'] - t['Teq0'])?

Answers:

Thank you for visiting the Q&A section on Magenaut. Please note that all the answers may not help you solve the issue immediately. So please treat them as advisements. If you found the post helpful (or not), leave a comment & I’ll get back to you as soon as possible.

Method 1

Thanks to @Josef for noticing that T, X1 and X2 have eight different combinations while my regression model has six parameters. I was therefore missing two interaction terms (and thus two parameters).

Namely, the regression model needs to account for the interaction between X1 and X2 as well as the interaction between X1, X2 and T.

This can be done by declaring the missing interaction columns and fitting the model:

# Declare missing columns
df = df.assign(X1X2 = df['X1'].multiply(df['X2']),
               X1X2T = df['X1'].multiply(df['X2T']))

# List of independent variables
cols = ['CONST','T','X1','X1T','X2','X2T','X1X2','X1X2T']

# Fit model
m = OLS.fit(endog=df['Y'], exog=df[cols]).fit()

Alternatively, we can use the formula interface:

# Declare formula
f = 'Y ~ T + X1 + I(X1*T) + X2 + I(X2*T) + I(X1*X2) + I(X1*X2*T)'

# Fit model
m = OLS.from_formula(formula=f, data=df).fit()


All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0

0 0 votes
Article Rating
Subscribe
Notify of
guest

0 Comments
Oldest
Newest Most Voted
Inline Feedbacks
View all comments
0
Would love your thoughts, please comment.x
()
x