# semopy

## Intercepts

Although rarely necessary, a researcher might sometimes be interested in the means of the observed variables. The Model class, however, does not provide intercept estimates per se. Furthermore, when considering Wishart ML, for instance, there is no place for a mean component in the distribution: the Wishart likelihood can be seen as a natural reparametrization of the multivariate normal, so means are not really useful for estimating the variance components.

It is straightforward, however, to estimate the means once the variance components have been estimated. After obtaining the variance components, we can estimate the means via a GLS procedure by calling the `estimate_means` function:

Python script:
```
from semopy.examples import multivariate_regression
from semopy.means import estimate_means
from semopy import Model

desc = multivariate_regression.get_model()
data = multivariate_regression.get_data()

model = Model(desc)
model.fit(data)
print(estimate_means(model))
```
Output:
```
  lval op rval  Estimate
0   x1  ~    1 -0.043750
1   x2  ~    1  0.073704
2   x3  ~    1 -0.056712
3   y1  ~    1 -1.456624
4   y2  ~    1  0.929309
5   y3  ~    1  0.992039
```
Notice that the estimated means of the exogenous variables x1, x2 and x3 are equal to their sample means, as expected; this can be verified by calling `print(data.mean())`:
```
y1   -1.461698
y2    1.105265
y3    1.104522
x1   -0.043750
x2    0.073704
x3   -0.056712
```
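
For intuition, the GLS step can be sketched in plain NumPy. The snippet below is a generic illustration of generalized least squares with an assumed known covariance matrix, not semopy's actual internals: given a design matrix X and covariance Σ, the GLS estimate is β̂ = (XᵀΣ⁻¹X)⁻¹XᵀΣ⁻¹y.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Pretend the variance components have already been estimated,
# yielding a known covariance matrix Sigma (here a toy diagonal one).
Sigma = np.diag(rng.uniform(0.5, 2.0, size=n))

X = np.ones((n, 1))  # intercept-only design: the GLS coefficient is a mean
y = 3.0 + rng.multivariate_normal(np.zeros(n), Sigma)

# GLS estimator: beta = (X' Sigma^-1 X)^-1 X' Sigma^-1 y
Si = np.linalg.inv(Sigma)
beta = np.linalg.solve(X.T @ Si @ X, X.T @ Si @ y)
print(beta[0])  # close to the true mean of 3.0
```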

## Mean components model

semopy provides another model class, ModelMeans, which differs from Model in that it estimates the mean components simultaneously with the variance components, as in classical multivariate-normal MLE. Under the hood, we went even further and moved all observed exogenous variables, together with their regression coefficients, out of the covariance structure and into the mean component along with the intercepts. This can be an advantageous tradeoff in terms of performance and numerical stability, as it shrinks the model-implied covariance matrix (which is inverted every time the objective function is evaluated); on the other hand, the complexity of the resulting problem now also depends on the number of individuals in the dataset.

Example of running ModelMeans:

Python script:
```
from semopy.examples import multivariate_regression
from semopy import ModelMeans

desc = multivariate_regression.get_model()
data = multivariate_regression.get_data()

model = ModelMeans(desc)
model.fit(data)
print(model.inspect())
```
Output:
```
   lval  op rval  Estimate  Std. Err    z-value       p-value
0    y1   ~   x1 -1.389720  0.073417 -18.929022  0.000000e+00
1    y1   ~   x2 -1.138398  0.087966 -12.941384  0.000000e+00
2    y1   ~   x3 -0.317953  0.072576  -4.380966  1.181546e-05
3    y2   ~   x1 -0.745740  0.097969  -7.611965  2.708944e-14
4    y2   ~   x2  1.074526  0.117383   9.154020  0.000000e+00
5    y2   ~   x3 -1.130938  0.096847 -11.677611  0.000000e+00
6    y3   ~   x1  0.702778  0.064269  10.934870  0.000000e+00
7    y3   ~   x2  1.235047  0.077005  16.038535  0.000000e+00
8    y3   ~   x3 -0.920454  0.063533 -14.487837  0.000000e+00
9    y1   ~    1 -1.456621  0.080268 -18.147061  0.000000e+00
10   y2   ~    1  0.929294  0.107110   8.676035  0.000000e+00
11   y3   ~    1  0.992040  0.070266  14.118332  0.000000e+00
12   y2  ~~   y2  1.135630  0.160602   7.071068  1.537437e-12
13   y3  ~~   y3  0.488724  0.069116   7.071068  1.537437e-12
14   y1  ~~   y1  0.637755  0.090192   7.071068  1.537437e-12
```

In semopy syntax, the character 1 stands for an intercept. It can be manipulated and specified in the model formulation the same way as any other variable; in ModelMeans, however, all endogenous observed variables are assumed to have intercepts by default. This behaviour can be disabled by setting the intercepts argument of the ModelMeans constructor to False, i.e. `model = ModelMeans(desc, intercepts=False)`. In most cases this will produce the same estimates for the other parameters if the data was centered beforehand (for example, by running `data -= data.mean()`):

```
   lval  op rval  Estimate  Std. Err    z-value       p-value
0    y1   ~   x1 -1.389724  0.073417 -18.929222  0.000000e+00
1    y1   ~   x2 -1.138377  0.087965 -12.941246  0.000000e+00
2    y1   ~   x3 -0.317948  0.072575  -4.380928  1.181750e-05
3    y2   ~   x1 -0.745742  0.097971  -7.611849  2.708944e-14
4    y2   ~   x2  1.074503  0.117385   9.153669  0.000000e+00
5    y2   ~   x3 -1.130923  0.096848 -11.677248  0.000000e+00
6    y3   ~   x1  0.702783  0.064270  10.934773  0.000000e+00
7    y3   ~   x2  1.235043  0.077006  16.038235  0.000000e+00
8    y3   ~   x3 -0.920444  0.063534 -14.487450  0.000000e+00
9    y2  ~~   y2  1.135669  0.160608   7.071068  1.537437e-12
10   y3  ~~   y3  0.488740  0.069118   7.071068  1.537437e-12
11   y1  ~~   y1  0.637744  0.090191   7.071068  1.537437e-12
```
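
The centering trick itself can be demonstrated without semopy. The snippet below uses a small hypothetical DataFrame as a stand-in; in practice `data` would come from `multivariate_regression.get_data()` and then be passed to `ModelMeans(desc, intercepts=False).fit(data)`.

```python
import pandas as pd

# Hypothetical stand-in for the example dataset.
data = pd.DataFrame({
    "y1": [1.2, -0.4, 2.1, 0.7],
    "x1": [0.3, -1.0, 0.8, 0.1],
})

# Center every column at zero so that intercepts become unnecessary.
data -= data.mean()

print(data.mean().abs().max())  # numerically zero
```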

Compare this to the estimates provided by Model:

```
   lval  op rval  Estimate  Std. Err    z-value       p-value
0    y1   ~   x1 -1.389754  0.073417 -18.929470  0.000000e+00
1    y1   ~   x2 -1.138405  0.087966 -12.941462  0.000000e+00
2    y1   ~   x3 -0.317893  0.072576  -4.380132  1.186073e-05
3    y2   ~   x1 -0.745837  0.097974  -7.612623  2.686740e-14
4    y2   ~   x2  1.074436  0.117388   9.152855  0.000000e+00
5    y2   ~   x3 -1.130890  0.096851 -11.676597  0.000000e+00
6    y3   ~   x1  0.702778  0.064270  10.934755  0.000000e+00
7    y3   ~   x2  1.235044  0.077006  16.038334  0.000000e+00
8    y3   ~   x3 -0.920469  0.063534 -14.487925  0.000000e+00
9    y2  ~~   y2  1.135729  0.160616   7.071068  1.537437e-12
10   y3  ~~   y3  0.488735  0.069118   7.071068  1.537437e-12
11   y1  ~~   y1  0.637755  0.090192   7.071068  1.537437e-12
```

ModelMeans also has one arguable advantage over Model: since the exogenous variables are moved out into the mean component, no distributional assumptions have to be made about them.

## REML and ML

The idea behind Restricted Maximum Likelihood (REML) is to find a data transform (preferably full-rank) that strips the mean component from the model. This makes it possible to separate the parameters into two independent subsets and run an optimization procedure on each of them, which in turn produces unbiased estimates of the variance parameters. It can also ease the optimization routine, since two smaller problems are solved instead of one big problem, as in ML.
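
The "strip the mean component" step can be illustrated with NumPy. This is a generic sketch of the REML-style transform, not semopy's implementation: for a design matrix X, any matrix K whose rows span the left null space of X satisfies KX = 0, so the transformed data Ky has zero mean and its likelihood depends only on the variance parameters.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 6, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # toy design: intercept + covariate

# Rows of K span the left null space of X (obtained via SVD), so K @ X == 0.
U, s, Vt = np.linalg.svd(X, full_matrices=True)
K = U[:, p:].T  # shape (n - p, n), full row rank

print(np.allclose(K @ X, 0.0))  # True
```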

In our tests, ML tends to produce an insignificantly smaller RMSE of the parameter estimates than REML; REML, however, managed to converge on a slightly larger number of models, and it is also much faster. To switch to the REML estimator, pass `"REML"` as the `obj` argument of the `fit` method:

Python script:
```
from semopy.examples import multivariate_regression
from semopy import ModelMeans

desc = multivariate_regression.get_model()
data = multivariate_regression.get_data()

model = ModelMeans(desc)
model.fit(data, obj="REML")
print(model.inspect())
```
Output:
```
   lval  op rval  Estimate  Std. Err    z-value       p-value
0    y1   ~   x1 -1.389720  0.074935 -18.545763  0.000000e+00
1    y1   ~   x2 -1.138386  0.089784 -12.679228  0.000000e+00
2    y1   ~   x3 -0.317946  0.074076  -4.292161  1.769426e-05
3    y2   ~   x1 -0.745751  0.099991  -7.458152  8.770762e-14
4    y2   ~   x2  1.074511  0.119805   8.968802  0.000000e+00
5    y2   ~   x3 -1.130937  0.098845 -11.441470  0.000000e+00
6    y3   ~   x1  0.702782  0.065594  10.714186  0.000000e+00
7    y3   ~   x2  1.235053  0.078591  15.714844  0.000000e+00
8    y3   ~   x3 -0.920453  0.064842 -14.195351  0.000000e+00
9    y1   ~    1 -1.456627  0.081926 -17.779709  0.000000e+00
10   y2   ~    1  0.929302  0.109321   8.500674  0.000000e+00
11   y3   ~    1  0.992041  0.071714  13.833349  0.000000e+00
12   y2  ~~   y2  1.182987  0.167300   7.071068  1.537437e-12
13   y3  ~~   y3  0.509070  0.071993   7.071068  1.537437e-12
14   y1  ~~   y1  0.664386  0.093958   7.071068  1.537437e-12
```