Prediction and imputation

All semopy models are equipped with a predict method that performs a SEM regression onto missing data given at least some of the observed variables. The regression is computed as the conditional expectation of the vector of missing variables given the observed ones.
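To make the idea of a conditional expectation concrete, here is a minimal NumPy sketch. It assumes the variables are jointly normal with a known mean vector and covariance matrix, in which case E[x_m | x_o] = mu_m + Sigma_mo Sigma_oo^{-1} (x_o - mu_o). This is an illustration only and not semopy's implementation, which works with model-implied moments and also handles latent variables; the function name conditional_mean and the numbers below are made up for the example.

```python
import numpy as np

def conditional_mean(x, mu, sigma):
    """Fill NaN entries of x with E[x_missing | x_observed] under N(mu, sigma).

    Hypothetical helper for illustration; not part of semopy.
    """
    x = np.asarray(x, dtype=float)
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    missing = np.isnan(x)
    observed = ~missing
    filled = x.copy()
    if not missing.any():
        return filled
    if not observed.any():
        # Nothing to condition on: fall back to the unconditional mean.
        filled[missing] = mu[missing]
        return filled
    sigma_oo = sigma[np.ix_(observed, observed)]
    sigma_mo = sigma[np.ix_(missing, observed)]
    # Solve the linear system rather than forming an explicit inverse.
    weights = np.linalg.solve(sigma_oo, x[observed] - mu[observed])
    filled[missing] = mu[missing] + sigma_mo @ weights
    return filled

# Toy example with assumed moments: two variables, the second one is missing.
mu = np.array([0.0, 1.0])
sigma = np.array([[1.0, 0.5],
                  [0.5, 2.0]])
row = np.array([1.0, np.nan])
filled = conditional_mean(row, mu, sigma)
print(filled)  # the second entry becomes 1.0 + 0.5 * (1.0 - 0.0) = 1.5
```

The observed entries are returned unchanged; only the gaps are replaced by their conditional means.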

The predict method always returns a DataFrame with all the variables present in the model. The DataFrame passed to predict only tells the method which data are available; everything that is lacking in it is estimated by the predict routine. You can pass a DataFrame that contains an arbitrary subset of the model's variables, and those variables may themselves have missing values. Furthermore, if the DataFrame has missing values and the semopy model in use supports missing data, predict can be thought of as an imputation scheme. Example:

from semopy.examples import political_democracy
from semopy import ModelMeans
import numpy as np

desc = political_democracy.get_model()
data = political_democracy.get_data()

i, v = 0, 'x1'
x = data[v].values[i]

# Hide the true value; positional assignment also works under pandas copy-on-write.
data.iloc[i, data.columns.get_loc(v)] = np.nan
model = ModelMeans(desc)
# The model has to be fitted before it can predict.
model.fit(data)
preds = model.predict(data)
diff = np.abs((x - preds[v].values[i])/x)
print('{:.2f}%'.format(diff * 100))
Here, the relative error is 2.17%. However, whether this is actually a good way to predict missing data is debatable: in practice, the sample variance of the difference between the filled-in values and the true values tends to be equal to the estimated variance of the corresponding variable.

Fast prediction with exogenous variables

If you use ModelMeans, ModelEffects or ModelGeneralizedEffects and don't need to impute missing data, you can quickly predict observed endogenous and latent variables from a known (at least partially) set of exogenous variables by calling the predict_exo method. Unlike predict, it uses only the exogenous variables for prediction. If any values in the exogenous data are missing, or if some exogenous variables are absent altogether, the missing values are imputed with zeros.
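The zero-filling behaviour described above can be mimicked in plain pandas, which may help when deciding whether zero is an acceptable stand-in for a given variable. The sketch below is not semopy's code and does not call predict_exo; the helper fill_exogenous and the column names age and income are hypothetical.

```python
import numpy as np
import pandas as pd

def fill_exogenous(data, exo_names):
    """Return a frame with every exogenous column present and gaps set to zero.

    Hypothetical helper that imitates the described behaviour; not part of semopy.
    """
    # reindex adds absent columns as NaN; fillna then replaces every NaN with 0.
    return data.reindex(columns=exo_names).fillna(0.0)

# 'age' has one missing value, 'income' is absent from the input altogether.
exo = pd.DataFrame({'age': [30.0, np.nan, 45.0]})
prepared = fill_exogenous(exo, ['age', 'income'])
print(prepared)
```

If zero is not a sensible value for some variable, it may be safer to fill the gaps yourself before passing the data on.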