The sparseGFM package provides methods for fitting sparse generalized factor models with various penalty functions. It is designed for high-dimensional data and adapts to weak factor scenarios, making it suitable for a wide range of applications in statistics and machine learning.
The package is now available on CRAN under the name sparseGFM.
On GitHub, the development version is hosted under the repository name sparseGFM.
install.packages("sparseGFM")
# Load the package
library(sparseGFM)
You can also install the development version of sparseGFM from GitHub:
# Install devtools if you haven't already
install.packages("devtools")
# Install sparseGFM from GitHub
devtools::install_github("zjwang1013/sparseGFM")
# Load the package
library(sparseGFM)
- Sparse Loading Matrix Estimation: Efficiently estimates row-sparse loading matrices
- Multiple Penalty Functions: Supports lasso, adaptive group lasso (aglasso), and other penalties
- Weak Factor Adaptation: Automatically adapts to scenarios with weak factor structures
- Cross-Validation: Built-in cross-validation for optimal parameter selection
- Factor Number Selection: Automatic selection of the number of factors using multiple information criteria
- sparseGFM(): Main function for fitting sparse generalized factor models
- cv.sparseGFM(): Cross-validation for lambda selection
- facnum.sparseGFM(): Information criterion-based selection of the number of factors
- add_identifiability(): Apply identifiability constraints to factor/loading matrices
- evaluate_performance(): Evaluate variable selection performance metrics
The package implements 12 different penalty functions:
Penalty | Description | Adaptive Version |
---|---|---|
lasso | L1 penalty | alasso |
SCAD | Smoothly Clipped Absolute Deviation | agSCAD |
MCP | Minimax Concave Penalty | agMCP |
group/glasso | Group Lasso | agroup/aglasso |
gSCAD | Group SCAD | agSCAD |
gMCP | Group MCP | agMCP |
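Switching between penalties only requires changing the penalty argument. A minimal sketch, assuming a numeric data matrix x such as the one simulated in the Quick Start below:

# Sketch: the same fit with the group MCP penalty instead of aglasso
fit_gmcp <- sparseGFM(x, type = "continuous", q = 2,
                      penalty = "gMCP", lambda = 0.1, C = 5)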
The package automatically handles missing values.
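For example, NA entries can be left in the data matrix as-is. A minimal sketch, again using the simulated matrix x from the Quick Start below:

# Sketch: introduce 5% missing entries at random; sparseGFM() handles
# the NAs internally, so no manual imputation is needed
x_na <- x
x_na[sample(length(x_na), size = round(0.05 * length(x_na)))] <- NA
fit_na <- sparseGFM(x_na, type = "continuous", q = 2,
                    penalty = "aglasso", lambda = 0.1, C = 5)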
library(sparseGFM)
set.seed(123)
# Parameters
n <- 200 # number of observations
p <- 200 # number of variables
a_param <- 0.9 # sparsity parameter
q <- 2 # number of factors
# Generate sparse structure
s <- ceiling(p^a_param) # number of non-zero loadings
# Generate factor matrix
FF <- matrix(runif(n * q, min = -3, max = 3), nrow = n, ncol = q)
# Generate sparse loading matrix (row-sparse)
BB <- rbind(matrix(runif(s * q, min = 1, max = 2), nrow = s, ncol = q),
matrix(0, nrow = (p - s), ncol = q))
# Generate intercepts
alpha_true <- runif(p, min = -1, max = 1)
# Add identifiability constraints
ident_res <- add_identifiability(FF, BB, alpha_true)
FF0 <- ident_res$H
BB0 <- ident_res$B
alpha0 <- ident_res$mu
# Generate data matrix
mat_para <- FF0 %*% t(BB0) + as.matrix(rep(1, n)) %*% t(as.matrix(alpha0))
x <- mat_para + matrix(rnorm(n * p), nrow = n, ncol = p)  # add unit-variance Gaussian noise
# True variable selection indicator
index_true <- numeric(p)
if (s > 0 && s <= p) {
index_true[1:s] <- 1
}
# Perform cross-validation to select optimal lambda
cv_result <- cv.sparseGFM(x,
type = "continuous",
q = 2,
penalty = "aglasso",
C = 5,
lambda_range = seq(0.1, 1, by = 0.1),
verbose = FALSE)
# Extract optimal model
optimal_model <- cv_result$optimal_model
BB_hat <- optimal_model$BB_hat
FF_hat <- optimal_model$FF_hat
alpha_hat <- optimal_model$alpha_hat
# Evaluate model performance
mat_para_hat <- FF_hat %*% t(BB_hat) + as.matrix(rep(1, n)) %*% t(as.matrix(alpha_hat))
relative_error <- base::norm((mat_para_hat - mat_para), type = "F") / base::norm(mat_para, type = "F")
print(paste("Optimal lambda:", cv_result$optimal_lambda))
print(paste("Relative estimation error:", round(relative_error, 4)))
# Variable selection performance
index_pred <- rep(1, p)
index_pred[optimal_model$index] <- 0  # index holds the zeroed (unselected) variables
performance <- evaluate_performance(index_true, index_pred)
print(performance)
# Fit sparse GFM with fixed lambda
result <- sparseGFM(x,
type = "continuous",
q = 2,
penalty = "aglasso",
lambda = 0.1,
C = 5)
# Extract results
BB_direct <- result$BB_hat
FF_direct <- result$FF_hat
alpha_direct <- result$alpha_hat
# View selected variables
selected_vars <- setdiff(1:p, result$index)
print(paste("Number of selected variables:", length(selected_vars)))
print(paste("Number of zero loadings:", length(result$index)))
# Calculate space angles for evaluation
BB_space <- eval.space(BB_direct, BB0)
FF_space <- eval.space(FF_direct, FF0)
BB_vcc <- BB_space[1]
BB_tcc <- BB_space[2]
FF_vcc <- FF_space[1]
FF_tcc <- FF_space[2]
print(paste("Loading matrix VCC:", round(BB_vcc, 4)))
print(paste("Loading matrix TCC:", round(BB_tcc, 4)))
# Select optimal number of factors using multiple criteria
facnum_result <- facnum.sparseGFM(x,
type = "continuous",
q_range = 1:5,
penalty = "aglasso",
lambda_range = c(0.1),
sic_type = "sic1",
C = 6,
verbose = FALSE)
# Extract information criteria results
df_dd <- facnum_result$df_dd
df_as <- facnum_result$df_as
# Get optimal factor numbers from different criteria
optimal_q_sic1 <- facnum_result$optimal_q
# q_range starts at 1, so which.min() returns the factor number directly
optimal_q_sic2 <- which.min(facnum_result$sic2)
optimal_q_sic3 <- which.min(facnum_result$sic3)
optimal_q_sic4 <- which.min(facnum_result$sic4)
print("Optimal number of factors:")
print(paste("SIC1:", optimal_q_sic1))
print(paste("SIC2:", optimal_q_sic2))
print(paste("SIC3:", optimal_q_sic3))
print(paste("SIC4:", optimal_q_sic4))
# Plot the information criteria
plot(1:5, facnum_result$sic1, type = "b", col = "red",
     ylim = range(facnum_result$sic1, facnum_result$sic2,
                  facnum_result$sic3, facnum_result$sic4),
     xlab = "Number of Factors", ylab = "Information Criterion",
     main = "Determine the Number of Factors")
lines(1:5, facnum_result$sic2, type = "b", col = "blue")
lines(1:5, facnum_result$sic3, type = "b", col = "green")
lines(1:5, facnum_result$sic4, type = "b", col = "purple")
legend("topright", c("SIC1", "SIC2", "SIC3", "SIC4"),
col = c("red", "blue", "green", "purple"), lty = 1)
Key arguments for sparseGFM():

- x: Data matrix (n × p)
- type: Data type ("continuous" for Gaussian data)
- q: Number of factors
- penalty: Penalty type ("lasso", "aglasso", etc.)
- lambda: Regularization parameter
- C: Constraint constant

Additional arguments for cv.sparseGFM():

- lambda_range: Range of lambda values for cross-validation
- verbose: Whether to print progress information

Additional arguments for facnum.sparseGFM():

- q_range: Range of factor numbers to consider
- sic_type: Type of information criterion for primary selection
The sparseGFM algorithm employs:
- Initialization: Using the GFM package for initial estimates
- Alternating minimization: Iteratively updating factors (F) and loadings (B); see the sketch after this list
- Sparsity induction: Applying various penalty functions to achieve variable selection
- Identifiability: Ensuring unique solutions through SVD-based constraints
- Convergence: Monitoring objective function changes until convergence
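As a conceptual sketch only (not the package's actual implementation), one alternating update for Gaussian data could look like the following, where the loading step applies row-wise group soft-thresholding to induce row sparsity:

# Conceptual sketch of one alternating update for Gaussian data
# (intercepts omitted for brevity; illustration only, not the
# package's internal code)
# X: n x p data, FF: n x q factors, lambda: penalty level
alternating_step <- function(X, FF, lambda) {
  # Loading update: least squares, then row-wise group soft-thresholding,
  # which zeroes out entire rows (variables) at once
  BB <- t(solve(crossprod(FF), crossprod(FF, X)))     # p x q
  row_norms <- sqrt(rowSums(BB^2))
  BB <- BB * pmax(0, 1 - lambda / row_norms)          # group shrinkage
  # Factor update given the new loadings (plain least squares)
  FF <- t(solve(crossprod(BB), crossprod(BB, t(X))))  # n x q
  list(FF = FF, BB = BB)
}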
The sparseGFM package is specifically designed to handle:
- High-dimensional data where the number of variables may exceed the number of observations
- Sparse loading structures with row-wise sparsity patterns
- Weak factor scenarios where factors have relatively small eigenvalues
- Flexible penalty structures including adaptive penalties that can provide better variable selection
The adaptive group lasso (aglasso) penalty is particularly effective for row-sparse loading matrices, as it can select entire rows (variables) rather than individual elements.
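This row-wise behavior is easy to see in isolation, as in the following standalone sketch:

# Standalone sketch: group shrinkage acts on row norms, so each variable's
# entire loading row is either kept or set exactly to zero
B <- rbind(c(1.5, -2.0), c(0.2, 0.1), c(0.0, 0.3))
lambda <- 0.5
B_shrunk <- B * pmax(0, 1 - lambda / sqrt(rowSums(B^2)))
B_shrunk  # rows 2 and 3 collapse to zero; row 1 is shrunk but retained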
- R (≥ 3.5.0)
- GFM
- MASS
- irlba
- stats
Please report any bugs or issues on the GitHub Issues page. When reporting, include:
- A clear description of the issue
- A reproducible code example
- Your session information (sessionInfo())
- Any relevant error messages
Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss proposed modifications.
For more detailed information about the methodology and theoretical properties, please refer to the associated research papers and documentation. The related papers are currently under review; we will update the information as soon as they are accepted and published.
This package is licensed under GPL-3. See the LICENSE file for details.
For bug reports, feature requests, or questions, please visit the GitHub repository issues page.