semopy


The EFA procedure will be explained in the upcoming paper. Until then, use it with caution.

EFA in semopy

It's often the case that the latent structure is not crystal clear, and a researcher must turn to statistical software for guidance towards a proper SEM model. semopy has some methods that might help extract latent structure from the data, namely efa.explore_cfa_model and efa.explore_pine_model.

CFA model

As the name suggests, explore_cfa_model helps to retrieve a Confirmatory Factor Analysis (CFA) model from the dataset, i.e. a model with latent factors that covary with each other. The function has the following arguments of interest:
  1. data — dataset in the form of pandas DataFrame;
  2. min_loadings — expected minimal number of indicators per latent factor. The default is 2. It is better left at 2: the procedure rarely finishes its work with fewer than 3 indicators per factor anyway, yet a higher value might penalize it too much;
  3. pval — the p-value cutoff. All regression coefficients in the resulting model are guaranteed to be at least as significant as this cutoff. The default is 0.01.
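
Because the cutoff affects which loadings survive, it can be useful to compare the models suggested at several cutoffs. The following is a minimal sketch, not part of semopy: the helper name explore_cfa_over_cutoffs is hypothetical, and it assumes only that explore_cfa_model accepts min_loadings and pval as keyword arguments, as listed above.

```python
def explore_cfa_over_cutoffs(data, pvals=(0.01, 0.05), min_loadings=2):
    """Run explore_cfa_model once per p-value cutoff.

    Hypothetical convenience helper (not part of semopy). Returns a dict
    that maps each cutoff to the model description suggested for it.
    """
    # Imported inside the function so the helper can be defined
    # even in an environment where semopy is not installed.
    import semopy

    return {
        pval: semopy.efa.explore_cfa_model(
            data, min_loadings=min_loadings, pval=pval
        )
        for pval in pvals
    }
```

Calling explore_cfa_over_cutoffs(data) on a pandas DataFrame gives one suggested model per cutoff, which makes it easier to see how sensitive the structure is to the choice of pval.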

Pine model

The Pine model is a generalization of the CFA model. In the Pine setting, latent factors of a CFA model are allowed to be indicators of some other "higher level" latent factor. explore_pine_model has the following arguments of interest:
  1. data — dataset in the form of pandas DataFrame;
  2. min_loadings — expected minimal number of indicators per latent factor. The default is 2. It is better left at 2: the procedure rarely finishes its work with fewer than 3 indicators per factor anyway, yet a higher value might penalize it too much;
  3. pval — the p-value cutoff. All regression coefficients in the resulting model are guaranteed to be at least as significant as this cutoff. The default is 0.01;
  4. levels — number of levels. If 1, the result will be the same as explore_cfa_model. Higher values allow for a more hierarchical model. The default is 2.
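
To see what the levels argument changes, one can run the procedure for each depth and compare the suggested models side by side. This is a sketch under stated assumptions: explore_pine_by_levels is a hypothetical helper, not a semopy function, and it relies only on the keyword arguments described in the list above.

```python
def explore_pine_by_levels(data, max_levels=2, min_loadings=2, pval=0.01):
    """Run explore_pine_model for levels = 1 .. max_levels.

    Hypothetical convenience helper (not part of semopy). Returns a dict
    that maps each number of levels to the suggested model description.
    """
    # Imported inside the function so the helper can be defined
    # even in an environment where semopy is not installed.
    import semopy

    return {
        levels: semopy.efa.explore_pine_model(
            data, min_loadings=min_loadings, pval=pval, levels=levels
        )
        for levels in range(1, max_levels + 1)
    }
```

According to the description of levels above, the entry for levels=1 should coincide with the output of explore_cfa_model, so any additional structure in the other entries comes from the higher-level factors.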

Example

Python script:

import semopy
import numpy as np
import pandas as pd

np.random.seed(123)

N = 100
eta1 = np.random.normal(size=N)
eta2 = np.random.normal(size=N)
eta1 += 0.3 * eta2

y1 = np.random.normal(size=N, scale=0.5) + eta1
y2 = np.random.normal(size=N, scale=0.5) + 2 * eta1
y3 = np.random.normal(size=N, scale=0.5) + 3 * eta1 + eta2
y4 = np.random.normal(size=N, scale=0.5) - eta2
y5 = np.random.normal(size=N, scale=0.5) + 1.5 * eta2
x = np.random.normal(size=N)
data = pd.DataFrame([y1, y2, y3, y4, y5, x],
                    index=['y1', 'y2', 'y3', 'y4', 'y5', 'x']).T

print(semopy.efa.explore_cfa_model(data))

Result:

eta1 =~ y2 + y3 + y1
eta2 =~ y4 + y5 + y3

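This is exactly the model that was used for data generation: y1, y2 and y3 load on eta1, y3, y4 and y5 load on eta2, and the pure-noise variable x is left out.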
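
To check a suggested model programmatically, the printed description can be turned into a mapping from factors to indicators. The sketch below assumes the result is a plain string in the format shown above, with one "factor =~ indicator + indicator" line per factor. The function names are hypothetical, and the parser ignores any line without the =~ operator.

```python
def parse_measurement_model(description):
    """Map each latent factor to the list of its indicators.

    Only lines of the form 'factor =~ a + b + c' are handled;
    every other line is skipped.
    """
    factors = {}
    for line in description.splitlines():
        if "=~" not in line:
            continue
        factor, indicators = line.split("=~", 1)
        factors[factor.strip()] = [name.strip() for name in indicators.split("+")]
    return factors


def shared_indicators(factors):
    """Return the indicators that load on more than one factor."""
    loaded_on = {}
    for factor, indicators in factors.items():
        for name in indicators:
            loaded_on.setdefault(name, []).append(factor)
    return {name: fs for name, fs in loaded_on.items() if len(fs) > 1}


result = """eta1 =~ y2 + y3 + y1
eta2 =~ y4 + y5 + y3"""

factors = parse_measurement_model(result)
print(factors)                     # {'eta1': ['y2', 'y3', 'y1'], 'eta2': ['y4', 'y5', 'y3']}
print(shared_indicators(factors))  # {'y3': ['eta1', 'eta2']}
```

Here y3 is reported as shared between eta1 and eta2, which matches the generating script, where y3 depends on both factors.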