| Title: | Machine Learning for Integrating Partially Overlapped Genetic Datasets |
|---|---|
| Description: | Tools to simulate genetic distance matrices, align and compare them via multidimensional scaling (MDS) and Procrustes, and evaluate imputation with the Bootstrapping Evaluation for Structural Missingness Imputation (BESMI) framework. Methods align with Zhu et al. (2025) <doi:10.3389/fpls.2025.1543956> and the associated software resource Zhu (2025) <doi:10.26188/28602953>. |
| Authors: | Jiashuai Zhu [aut, cre] (ORCID: <https://orcid.org/0000-0002-9916-9732>, affiliation: Faculty of Science, The University of Melbourne, Parkville, VIC, Australia; Agriculture Victoria, AgriBio Centre, Bundoora, VIC, Australia), The University of Melbourne [cph], Agriculture Victoria [cph] |
| Maintainer: | Jiashuai Zhu <[email protected]> |
| License: | GPL-3 |
| Version: | 1.3.2 |
| Built: | 2026-06-08 07:44:28 UTC |
| Source: | https://github.com/jiashuaiz/datafusion-gdm |
Procrustes alignment and mapping back to distances
apply_procrustes(X_base, Y_base, Y)

| Argument | Description |
|---|---|
| X_base | Base coordinates for target alignment |
| Y_base | Base coordinates for source alignment |
| Y | Full source coordinates to transform |

Transformed coordinates matrix
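
To illustrate the idea behind the entry above, here is a minimal base R sketch of a Procrustes fit on shared base points that is then applied to the full source coordinates. The helper name `procrustes_sketch` is hypothetical. Allowing uniform scaling and reflection is an assumption; the package's own implementation may differ.

```r
# Hypothetical helper, not the package's code: fit a similarity transform
# (rotation/reflection, uniform scale, translation) on the base points,
# then apply it to every row of Y.
procrustes_sketch <- function(X_base, Y_base, Y) {
  mx <- colMeans(X_base)
  my <- colMeans(Y_base)
  Xc <- sweep(X_base, 2, mx)     # centre the target base points
  Yc <- sweep(Y_base, 2, my)     # centre the source base points
  sv <- svd(crossprod(Yc, Xc))   # SVD of t(Yc) %*% Xc
  R <- sv$u %*% t(sv$v)          # least-squares orthogonal matrix
  s <- sum(sv$d) / sum(Yc^2)     # least-squares uniform scale
  sweep(s * (sweep(Y, 2, my) %*% R), 2, mx, "+")
}

# Toy data: X is a rotated, scaled and shifted copy of Y.
set.seed(1)
Y <- matrix(rnorm(20), ncol = 2)
theta <- pi / 6
rot <- matrix(c(cos(theta), sin(theta), -sin(theta), cos(theta)), 2, 2)
X <- sweep(2 * (Y %*% rot), 2, c(3, -1), "+")

# Fit on the first four rows only, then transform all rows of Y.
aligned <- procrustes_sketch(X[1:4, ], Y[1:4, ], Y)
```

Because the toy transform is exact, the rows of `aligned` that were not used for fitting should also land on the corresponding rows of `X`.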
Run BESMI imputation for a list of dataset paths
besmi_batch_impute(
  dataset_paths,
  the_method = "lasso.norm",
  max_iter = 5,
  imputation_convergence_threshold = 1e-06,
  propagation_convergence_threshold = 1e-06,
  distance_metric = "mae",
  output_dir = file.path(tempdir(), "DataFusionGDM_imputation"),
  k_filter = NULL,
  full_dataset_path = NULL
)

| Argument | Description |
|---|---|
| dataset_paths | Character vector of RDS paths to masked matrices |
| the_method | Imputation method (e.g., 'lasso.norm' or 'KNN') |
| max_iter | Maximum iterations for iterative methods |
| imputation_convergence_threshold | Convergence threshold for imputation metric |
| propagation_convergence_threshold | Convergence threshold for propagation metric |
| distance_metric | Distance metric for evaluation ('mae', 'ssd', 'rmse', 'correlation') |
| output_dir | Output directory for imputed matrices (defaults to a temporary location) |
| k_filter | Optional numeric filter for k value |
| full_dataset_path | Optional path to a full matrix RDS used as ground truth |

Data frame of metrics for all datasets
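
The distance_metric options named above ('mae', 'ssd', 'rmse', 'correlation') are not defined in this reference. The following base R sketch assumes the usual textbook definitions, evaluated only at masked positions. The helper `imputation_error_sketch` is hypothetical, and the package may compute or scale these quantities differently.

```r
# Hypothetical helper: compare imputed and true values at masked cells only.
imputation_error_sketch <- function(truth, imputed, mask, metric = "mae") {
  d <- imputed[mask] - truth[mask]
  switch(metric,
    mae = mean(abs(d)),                             # mean absolute error
    ssd = sum(d^2),                                 # sum of squared differences
    rmse = sqrt(mean(d^2)),                         # root mean squared error
    correlation = cor(imputed[mask], truth[mask]),  # Pearson correlation
    stop("unknown metric: ", metric)
  )
}

# Toy 3 x 3 example with four masked cells, each off by exactly 1.
truth <- matrix(c(0, 1, 2, 1, 0, 3, 2, 3, 0), 3, 3)
mask <- matrix(FALSE, 3, 3)
mask[1, 3] <- mask[3, 1] <- mask[2, 3] <- mask[3, 2] <- TRUE
imputed <- truth
imputed[mask] <- truth[mask] + c(1, -1, 1, -1)

mae_value <- imputation_error_sketch(truth, imputed, mask, "mae")
ssd_value <- imputation_error_sketch(truth, imputed, mask, "ssd")
rmse_value <- imputation_error_sketch(truth, imputed, mask, "rmse")
```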
Create masked matrices for BESMI
besmi_create_masked_matrices(full_matrix, k, seed = NULL)

| Argument | Description |
|---|---|
| full_matrix | Full symmetric matrix |
| k | Number of populations to mask (as U) |
| seed | Optional seed for reproducibility |

List with masked_matrix, mask_position, group_u, group_s, masked_percentage
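
This reference does not spell out which entries are hidden for group U. As a rough illustration only, the base R sketch below assumes that the pairwise distances among the k selected populations are set to NA and that the percentage is taken over off-diagonal cells. The helper `mask_block_sketch` is hypothetical; it borrows the documented field names for readability and is not the package's algorithm.

```r
# Hypothetical helper: hide the U-by-U block of a symmetric matrix.
mask_block_sketch <- function(full_matrix, k, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  n <- nrow(full_matrix)
  group_u <- sort(sample.int(n, k))        # populations chosen for masking
  group_s <- setdiff(seq_len(n), group_u)  # populations left intact
  mask <- matrix(FALSE, n, n)
  mask[group_u, group_u] <- TRUE           # assumption: mask the U-by-U block
  diag(mask) <- FALSE                      # self-distances stay known
  masked <- full_matrix
  masked[mask] <- NA
  list(
    masked_matrix = masked,
    mask_position = mask,
    group_u = group_u,
    group_s = group_s,
    masked_percentage = 100 * sum(mask) / (n * (n - 1))
  )
}

# Six toy populations, three of them masked.
full <- as.matrix(dist(matrix(seq_len(12), ncol = 2)))
out <- mask_block_sketch(full, k = 3, seed = 7)
```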
Impute a single dataset from masked matrix path
besmi_impute_single_dataset(
  input_path,
  method = "lasso.norm",
  max_iterations = 5,
  imputation_convergence_threshold = 0.001,
  propagation_convergence_threshold = 0.001,
  distance_metric = "mae",
  output_dir = file.path(tempdir(), "DataFusionGDM_imputation"),
  full_dataset_path = NULL
)

| Argument | Description |
|---|---|
| input_path | Path to masked matrix RDS |
| method | Imputation method ('lasso.norm' or 'KNN') |
| max_iterations | Maximum iterations for iterative methods |
| imputation_convergence_threshold | Convergence threshold for imputation metric |
| propagation_convergence_threshold | Convergence threshold for propagation metric |
| distance_metric | Distance metric name |
| output_dir | Output directory for results (defaults to a temporary location) |
| full_dataset_path | Optional path to a full matrix RDS used as ground truth |

Data frame of per-iteration metrics
Iterative imputation with MICE (tails-chain)
besmi_iterative_imputation(
  M_input,
  M_mask,
  M_real = NULL,
  method = "lasso.norm",
  max_iterations = 5,
  imputation_convergence_threshold = 0.001,
  propagation_convergence_threshold = 0.001,
  distance_metric = "mae",
  k = NA,
  bs_i = NA
)

| Argument | Description |
|---|---|
| M_input | Matrix with NAs to impute |
| M_mask | Logical mask matrix (TRUE indicates masked positions) |
| M_real | Optional ground truth matrix |
| method | MICE method (e.g., 'lasso.norm') |
| max_iterations | Maximum number of outer iterations |
| imputation_convergence_threshold | Threshold for imputation distance |
| propagation_convergence_threshold | Threshold for propagation distance |
| distance_metric | Distance metric name |
| k | Dataset parameter k (for logging) |
| bs_i | Bootstrap index (for logging) |

List with final_matrix, metrics, tails_chain
KNN imputation sweep (uses VIM::kNN)
besmi_knn_impute(
  M_input,
  M_mask,
  M_real = NULL,
  distance_metric = "mae",
  k = NA,
  bs_i = NA
)

| Argument | Description |
|---|---|
| M_input | Matrix with NAs |
| M_mask | Logical mask matrix |
| M_real | Optional ground truth |
| distance_metric | Distance metric name |
| k | Dataset parameter k |
| bs_i | Bootstrap index |

List with final_matrix, metrics, tails_chain
Prepare full GDM dataset from CSV or RData
besmi_prepare_full_dataset(input_path)

| Argument | Description |
|---|---|
| input_path | Path to CSV or RData file containing the full distance matrix |

Symmetric numeric matrix
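
As a rough picture of the CSV branch, the base R sketch below reads a square distance table into a numeric matrix and checks its symmetry. It assumes the first CSV column holds the population names, which this reference does not confirm. The helper `read_distance_csv_sketch` is hypothetical, and RData input is not covered.

```r
# Hypothetical helper: load a square distance table and validate it.
read_distance_csv_sketch <- function(path) {
  df <- read.csv(path, row.names = 1, check.names = FALSE)
  m <- as.matrix(df)
  stopifnot(is.numeric(m), nrow(m) == ncol(m))  # square numeric table
  stopifnot(isSymmetric(unname(m)))             # distances must be symmetric
  m
}

# Round trip through a temporary file with three toy populations.
tmp <- tempfile(fileext = ".csv")
D <- as.matrix(dist(rbind(p1 = c(0, 0), p2 = c(3, 4), p3 = c(0, 8))))
write.csv(D, tmp)
M <- read_distance_csv_sketch(tmp)
unlink(tmp)
```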
Convert coordinate matrix to distance matrix
coords_to_distances(coords)

| Argument | Description |
|---|---|
| coords | Numeric coordinate matrix |

Symmetric distance matrix
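
For orientation, the same kind of conversion can be written in base R with `dist()`. The sketch assumes Euclidean distance, which this reference does not state explicitly, and the helper name `coords_to_distances_sketch` is hypothetical.

```r
# Hypothetical helper: pairwise Euclidean distances as a full matrix.
coords_to_distances_sketch <- function(coords) {
  as.matrix(dist(coords, method = "euclidean"))
}

# Three collinear points spaced along a 3-4-5 direction.
coords <- rbind(a = c(0, 0), b = c(3, 4), c = c(6, 8))
D <- coords_to_distances_sketch(coords)
```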
Returns a ggplot heatmap of the distance matrix using ggplot2 only (no Bioconductor dependencies).
create_distance_heatmap(dist_matrix, pop_info)

| Argument | Description |
|---|---|
| dist_matrix | Symmetric numeric distance matrix with row/column names |
| pop_info | Data frame with at least |

A ggplot object
Create MDS plot of genetic distances
create_mds_plot(dist_matrix, pop_info)

| Argument | Description |
|---|---|
| dist_matrix | Symmetric numeric distance matrix |
| pop_info | Data frame with metadata columns |

A ggplot object
Export a simulated GDM to CSV
export_simulated_gdm(
  output_file = tempfile("gdm_", fileext = ".csv"),
  scenario = "default",
  n_pops = 30,
  verbose = TRUE,
  seed = NULL
)

| Argument | Description |
|---|---|
| output_file | Output CSV filename (defaults to a session-scoped temporary path) |
| scenario | Scenario name |
| n_pops | Number of populations |
| verbose | Verbose output |
| seed | Optional seed forwarded to run_genetic_scenario() |

Invisibly, the normalized path to the written CSV
tmp <- export_simulated_gdm(verbose = FALSE)
if (file.exists(tmp)) unlink(tmp)
Perform MDS on a pair of distance matrices
perform_mds(A, B)

| Argument | Description |
|---|---|
| A | First distance matrix |
| B | Second distance matrix |

A list with coordinates X, Y, optimal dimension d_opt, and variance info
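
How the optimal dimension is chosen is not described in this reference. The base R sketch below uses classical MDS (`cmdscale`) and assumes a simple rule: take the smallest number of axes whose positive eigenvalues reach a variance threshold in both matrices. The helper `mds_pair_sketch` and its `var_threshold` argument are hypothetical.

```r
# Hypothetical helper: embed two distance matrices in a shared dimension.
mds_pair_sketch <- function(A, B, var_threshold = 0.9) {
  # Smallest number of axes whose positive eigenvalues reach the threshold.
  dims_needed <- function(D) {
    ev <- pmax(cmdscale(D, k = 1, eig = TRUE)$eig, 0)
    which(cumsum(ev) / sum(ev) >= var_threshold)[1]
  }
  d_opt <- max(dims_needed(A), dims_needed(B))
  list(
    X = cmdscale(A, k = d_opt),  # coordinates for the first matrix
    Y = cmdscale(B, k = d_opt),  # coordinates for the second matrix
    d_opt = d_opt
  )
}

# Two toy matrices built from nearly identical 2-D point sets.
set.seed(42)
pts <- matrix(rnorm(16), ncol = 2)
A <- as.matrix(dist(pts))
B <- as.matrix(dist(pts + rnorm(16, sd = 0.05)))
res <- mds_pair_sketch(A, B, var_threshold = 0.999)
```

The two coordinate sets come from separate embeddings, so they generally differ by a rotation or reflection and need an alignment step such as Procrustes before they can be compared.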
Run simulation with predefined biological scenarios
run_genetic_scenario(
  scenario = "default",
  n_pops = 30,
  output_file = NULL,
  seed = NULL,
  verbose = TRUE
)

| Argument | Description |
|---|---|
| scenario | Scenario name: 'default', 'island', 'stepping_stone', 'admixture', 'ancient_divergence', 'simple' |
| n_pops | Number of populations |
| output_file | Optional CSV path to write the distance matrix |
| seed | Optional seed forwarded to run_genetic_simulation() |
| verbose | Print diagnostic information |

Same structure as run_genetic_simulation()
Run a high-level genetic simulation with configurable model
run_genetic_simulation(
  n_pops = 30,
  n_major_groups = 4,
  n_subgroups = 8,
  model = "mixed",
  geo_dims = NULL,
  isolation_factor = NULL,
  genetic_dims = NULL,
  group_separation = 15,
  subgroup_separation = NULL,
  pop_dispersion = 0.5,
  admixture_prob = 0.15,
  bottleneck_prob = 0.1,
  use_subgroups = TRUE,
  use_genetic_dims = NULL,
  use_admixture = TRUE,
  use_bottlenecks = TRUE,
  use_isolation_by_distance = NULL,
  use_nonlinear = TRUE,
  use_noise = TRUE,
  seed = NULL,
  output_file = NULL,
  verbose = TRUE
)

| Argument | Description |
|---|---|
| n_pops | Number of populations |
| n_major_groups | Number of major groups |
| n_subgroups | Number of subgroups |
| model | One of "mixed", "geography", "genetics", or "custom" |
| geo_dims | Geographic dimensions (overrides default based on model if set) |
| isolation_factor | Geography-genetics balance (overrides default based on model if set) |
| genetic_dims | Genetic dimensions (overrides default based on model if set) |
| group_separation | Separation between major groups |
| subgroup_separation | Separation between subgroups (default: group_separation/3 when NULL) |
| pop_dispersion | Within-subgroup dispersion |
| admixture_prob | Proportion of admixed populations |
| bottleneck_prob | Proportion of bottlenecked populations |
| use_subgroups | Whether to create subgroups |
| use_genetic_dims | Whether to include genetic dimensions |
| use_admixture | Whether to include admixture |
| use_bottlenecks | Whether to include bottlenecks |
| use_isolation_by_distance | Whether to weight geographic distance |
| use_nonlinear | Whether to apply nonlinear transformation |
| use_noise | Whether to add noise |
| seed | Optional seed forwarded to simulate_genetic_distances() |
| output_file | Optional CSV file path to write the distance matrix |
| verbose | Print diagnostics |

List with results and plots (functions to print plots)
Generates a synthetic genetic distance matrix and metadata using hierarchical population structure, admixture and bottleneck options.
simulate_genetic_distances(
  n_pops = 50,
  n_major_groups = 5,
  n_subgroups = 12,
  geo_dims = 2,
  genetic_dims = 2,
  group_separation = 15,
  subgroup_separation = 5,
  pop_dispersion = 0.5,
  isolation_factor = 0.8,
  admixture_prob = 0.1,
  bottleneck_prob = 0.05,
  noise_level = 0.1,
  nonlinear_factor = 0.7,
  use_subgroups = TRUE,
  use_genetic_dims = TRUE,
  use_admixture = TRUE,
  use_bottlenecks = TRUE,
  use_isolation_by_distance = TRUE,
  use_nonlinear = TRUE,
  use_noise = TRUE,
  seed = NULL,
  verbose = TRUE
)

| Argument | Description |
|---|---|
| n_pops | Number of populations |
| n_major_groups | Number of major groups |
| n_subgroups | Number of subgroups |
| geo_dims | Geographic dimensions |
| genetic_dims | Additional genetic drift dimensions |
| group_separation | Separation between major groups |
| subgroup_separation | Separation between subgroups |
| pop_dispersion | Within-subgroup dispersion |
| isolation_factor | Weight for geography in isolation-by-distance model (0-1) |
| admixture_prob | Proportion of admixed populations |
| bottleneck_prob | Proportion of bottlenecked populations |
| noise_level | Noise level in transformation |
| nonlinear_factor | Nonlinearity factor in transformation |
| use_subgroups | Whether to create subgroups |
| use_genetic_dims | Whether to include genetic dimensions |
| use_admixture | Whether to include admixture |
| use_bottlenecks | Whether to include bottlenecks |
| use_isolation_by_distance | Whether to weight geographic distance |
| use_nonlinear | Whether to apply nonlinear transformation |
| use_noise | Whether to add noise |
| seed | Optional seed for reproducibility (NULL leaves the RNG state unchanged) |
| verbose | Print diagnostics |

A list with distance_matrix, population_info, position_matrix, and parameters.
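
To give a feel for the hierarchical idea, here is a heavily simplified base R toy. It draws group centres, scatters populations around them, and returns Euclidean distances. It ignores subgroups, admixture, bottlenecks, nonlinearity and noise, so it is not the package's model. The helper `toy_distance_simulation` and the `population_info` column names are hypothetical.

```r
# Hypothetical toy simulator: groups of populations in a 2-D space.
toy_distance_simulation <- function(n_pops = 12, n_major_groups = 3,
                                    group_separation = 15,
                                    pop_dispersion = 0.5, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)  # a NULL seed leaves the RNG untouched
  group <- rep_len(seq_len(n_major_groups), n_pops)
  # Group centres are spread widely; populations sit close to their centre.
  centres <- matrix(rnorm(n_major_groups * 2, sd = group_separation), ncol = 2)
  positions <- centres[group, , drop = FALSE] +
    matrix(rnorm(n_pops * 2, sd = pop_dispersion), ncol = 2)
  rownames(positions) <- paste0("pop", seq_len(n_pops))
  list(
    distance_matrix = as.matrix(dist(positions)),
    population_info = data.frame(population = rownames(positions),
                                 group = group),
    position_matrix = positions
  )
}

# The same seed should reproduce the same matrix.
sim_a <- toy_distance_simulation(seed = 123)
sim_b <- toy_distance_simulation(seed = 123)
```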
Create plotting handles for simulation results
visualize_results(sim_results)

| Argument | Description |
|---|---|
| sim_results | A list returned by simulate_genetic_distances() or run_genetic_simulation() |

A list with heatmap and mds functions that print plots when called