Tutorial

EpiMethEx 2.0 package – Functions

The EpiMethEx 2.0 tool can be launched for CG clustering analyses using the following functions:

**CGclustAnn**
The CGclustAnn function performs clustering analysis on consecutive CG probesets using a data frame that contains the CG probeset annotations (cgAnnotation). This data frame should include the probeset IDs, genomic coordinates, and the associated RefGene_Name, RefGene_Accession, and RefGene_Region. Positional CG clustering can be performed based on genomicPosition, RefGene_Name, or RefGene_Accession annotations. Before running the positional clustering analysis, the parameters nCGclusters (minimum number of CG probesets per CG cluster) and CGclusterSize (maximum distance, in base pairs, between adjacent CG probesets) can be configured according to the user's requirements. The analysis returns a new data frame containing the original CG probeset annotation details, along with additional information for each identified positional CG cluster: the cluster identifier (clusterName), number of CG probesets (nCG), size in base pairs (size), chromosome (chrcluster), start position (startcluster), end position (stopcluster), and type of clustering analysis (clusterType).
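The positional clustering rule described above can be sketched in a few lines of Python. This is only an illustration of the logic, not the package's implementation: the function names, the default values (3 probesets, 200 bp), the cluster naming scheme, and the definition of `size` as the start-to-stop span are all assumptions made for the example.

```python
from itertools import groupby


def positional_clusters(probes, n_cg_clusters=3, cg_cluster_size=200):
    """Group consecutive probes into positional clusters.

    probes: iterable of (probe_id, chromosome, position) tuples.
    n_cg_clusters: minimum number of probes per cluster (illustrative default).
    cg_cluster_size: maximum bp distance between adjacent probes (illustrative default).
    """
    ordered = sorted(probes, key=lambda p: (p[1], p[2]))
    clusters = []
    for chrom, group in groupby(ordered, key=lambda p: p[1]):
        run = []
        for probe in group:
            # A gap larger than cg_cluster_size closes the current run.
            if run and probe[2] - run[-1][2] > cg_cluster_size:
                _keep_if_large_enough(run, chrom, n_cg_clusters, clusters)
                run = []
            run.append(probe)
        _keep_if_large_enough(run, chrom, n_cg_clusters, clusters)
    return clusters


def _keep_if_large_enough(run, chrom, n_cg_clusters, clusters):
    """Append the run as a cluster only if it has enough probes."""
    if len(run) < n_cg_clusters:
        return
    start, stop = run[0][2], run[-1][2]
    clusters.append({
        "clusterName": f"{chrom}_{start}_{stop}",  # hypothetical naming scheme
        "nCG": len(run),
        "size": stop - start,  # assumption: size is the start-to-stop span
        "chrcluster": chrom,
        "startcluster": start,
        "stopcluster": stop,
        "probes": [p[0] for p in run],
    })


example_probes = [
    ("cg1", "chr1", 100),
    ("cg2", "chr1", 150),
    ("cg3", "chr1", 220),
    ("cg4", "chr1", 5000),  # too far from cg3, and alone: dropped
    ("cg5", "chr2", 100),   # single probe on chr2: dropped
]
example_clusters = positional_clusters(example_probes)
```

With these toy coordinates, only cg1, cg2, and cg3 form a cluster; cg4 and cg5 are isolated and fall below the minimum cluster size.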

**methDNAcluster**
Starting from positional CG clusters based on genomicPosition or RefGene_Accession annotations, the methDNAcluster function performs clustering analysis on consecutive CG probesets according to concordant median methDNA levels (Hypomethylated = median Beta values < 0.2; Partially methylated = 0.2 ≤ median Beta values ≤ 0.6; Hypermethylated = median Beta values > 0.6). To perform methDNA clustering, the following data frames are required: cgAnnotation (a data frame containing the CG probeset annotations); CGclusterAnn (a data frame representing the CG cluster annotation based on genomicPosition or RefGene_Accession); methDNAmatrix (a data frame with the median methDNA value of each CG probeset across all samples). Before executing the methDNA CG clustering analysis, the parameter nCGclusters (minimum number of CG probesets for each CG cluster) can be configured according to the user's requirements. The analysis returns a new data frame that includes the original CG cluster annotation details (based on genomicPosition or RefGene_Accession), along with additional information for each identified methDNA CG cluster.
The resulting data frame includes: median methDNA values for each CG probeset (medianvalue), cluster identifier (cgclusterName), ratio between the number of CG probesets in each methDNA CG cluster and the corresponding CG cluster annotation (ClusterCoverage), number of CG probesets (cgN), number of base pairs (CGsize), starting position (CGstartcluster), ending position (CGstopcluster), cumulative median methDNA levels for the cluster (CG.median.cluster), and interquartile range of median methDNA levels (CG.IQR.cluster).
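As a rough illustration of the methDNA step, the sketch below labels each probeset with the three classes given above and keeps runs of consecutive probesets that share a class. It assumes the input probesets all belong to one positional cluster and are already in genomic order; the function names and the default minimum of 3 probesets are hypothetical, and the quartile method of Python's `statistics.quantiles` may differ from the one the package uses.

```python
from itertools import groupby
from statistics import median, quantiles


def methylation_class(beta):
    """Classify a median Beta value with the thresholds stated in the text."""
    if beta < 0.2:
        return "Hypomethylated"
    if beta <= 0.6:
        return "Partially methylated"
    return "Hypermethylated"


def methdna_clusters(median_betas, n_cg_clusters=3):
    """median_betas: list of (probe_id, median_beta) in genomic order,
    all taken from a single positional CG cluster."""
    total = len(median_betas)
    clusters = []
    for label, run in groupby(median_betas, key=lambda p: methylation_class(p[1])):
        run = list(run)
        if len(run) < n_cg_clusters:
            continue  # run too short to form a methDNA cluster
        values = [beta for _, beta in run]
        if len(values) >= 2:
            q1, _, q3 = quantiles(values, n=4)
            iqr = q3 - q1
        else:
            iqr = 0.0  # a single value has no spread
        clusters.append({
            "class": label,
            "cgN": len(run),
            "ClusterCoverage": len(run) / total,
            "CG.median.cluster": median(values),
            "CG.IQR.cluster": iqr,
        })
    return clusters


example_betas = [
    ("cg1", 0.05), ("cg2", 0.10), ("cg3", 0.15),                 # hypomethylated run
    ("cg4", 0.45),                                               # isolated, dropped
    ("cg5", 0.80), ("cg6", 0.85), ("cg7", 0.90), ("cg8", 0.95),  # hypermethylated run
]
example_methdna = methdna_clusters(example_betas)
```

Here ClusterCoverage is computed as the number of probesets in the methDNA cluster divided by the number of probesets in the positional cluster, following the description of that column.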

**betaDiffcluster**
Starting from positional CG clusters based on genomicPosition or RefGene_Accession annotations, the betaDiffcluster function performs clustering analysis on consecutive CG probesets by concordant Betadiff values computed between comparison groups (Strongly methylated = median Betadiff values > 0.5; Weakly methylated = 0 ≤ median Betadiff values ≤ 0.5; Strongly demethylated = median Betadiff values < -0.5; Weakly demethylated = -0.5 ≤ median Betadiff values ≤ 0). To perform Betadiff clustering, the following data frames are required: cgAnnotation (a data frame containing the CG probeset annotations); CGclusterAnn (a data frame representing the CG cluster annotation based on genomicPosition or RefGene_Accession); methDNAmatrix (a data frame with the median methDNA value of each CG probeset computed across all samples); betaDiffmatrix (a data frame containing the Betadiff values and corresponding p-values of CG probesets computed between the comparison groups). Before executing the Betadiff CG clustering analysis, the betaDiffmatrix must be filtered based on a p-value threshold of ≤ 0.05. Moreover, the parameter nCGclusters (minimum number of CG probesets for each CG cluster) can be configured according to the user's requirements. The analysis returns a new data frame that includes the original CG cluster annotation details (based on genomicPosition or RefGene_Accession), along with additional information for each identified Betadiff CG cluster.
The resulting data frame includes: median methDNA values for each CG probeset (median value), Betadiff value for each CG probeset computed between comparison groups (beta_diff), significance of Betadiff analysis for each CG probeset (pValue), cluster identifier (cgclusterName), ratio between the number of CG probesets in each Betadiff CG cluster and the corresponding CG cluster annotation (ClusterCoverage), number of CG probesets (cgN), number of base pairs (CGsize), starting position (CGstartcluster), ending position (CGstopcluster), cumulative median Betadiff levels for the cluster (BetaDiff.median.cluster), and interquartile range of median Betadiff levels (CG.IQR.cluster). Furthermore, the column “gene”, indicating the genes associated with the identified Betadiff CG clusters, is included only for Betadiff CG clusters based on the genomicPosition clustering option.
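The p-value filter and the four Betadiff classes can be illustrated with a short Python sketch. It is not the package's code: the function names and the dictionary-based input are hypothetical, and because the stated ranges for the weak classes both include 0, the sketch assumes that a Betadiff of exactly 0 is labelled "Weakly methylated".

```python
def betadiff_class(beta_diff):
    """Classify a Betadiff value with the thresholds stated in the text."""
    if beta_diff > 0.5:
        return "Strongly methylated"
    if beta_diff >= 0:
        return "Weakly methylated"  # assumption: exactly 0 is assigned here
    if beta_diff >= -0.5:
        return "Weakly demethylated"
    return "Strongly demethylated"


def filter_and_label(rows, p_threshold=0.05):
    """Keep rows with pValue <= p_threshold and attach a Betadiff class.

    rows: list of dicts with keys 'id', 'beta_diff', and 'pValue'.
    """
    return [
        {**row, "class": betadiff_class(row["beta_diff"])}
        for row in rows
        if row["pValue"] <= p_threshold
    ]


example_rows = [
    {"id": "cg1", "beta_diff": 0.62, "pValue": 0.001},
    {"id": "cg2", "beta_diff": 0.30, "pValue": 0.04},
    {"id": "cg3", "beta_diff": -0.10, "pValue": 0.20},  # removed by the filter
    {"id": "cg4", "beta_diff": -0.55, "pValue": 0.01},
    {"id": "cg5", "beta_diff": -0.50, "pValue": 0.05},  # on both boundaries, kept
]
example_labelled = filter_and_label(example_rows)
```

Consecutive probesets that end up with the same label would then be grouped into Betadiff CG clusters, subject to the nCGclusters minimum.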

**corrCluster**
Starting from positional CG clusters based on RefGene_Name or RefGene_Accession annotations, the corrCluster function performs clustering analysis on consecutive CG probesets by concordant correlation values (Pearson’s r) between the methDNA levels at each CG probeset and the expression of the associated genes (Strongly positively correlated = r > 0.7; Moderately positively correlated = 0.3 ≤ r ≤ 0.7; Weakly positively correlated = 0 ≤ r < 0.3; Strongly negatively correlated = r < -0.7; Moderately negatively correlated = -0.7 ≤ r ≤ -0.3; Weakly negatively correlated = -0.3 < r ≤ 0). To perform Corr clustering, the following data frames are required: cgAnnotation (a data frame containing the CG probeset annotations); CGclusterAnn (a data frame representing the CG cluster annotation based on RefGene_Name or RefGene_Accession); corrMatrix (a data frame containing the Corr values and corresponding p-values computed between each CG probeset and its associated genes). Before executing the Corr CG clustering analysis, the corrMatrix must be filtered based on a p-value threshold of ≤ 0.05. Moreover, the parameter nCGclusters (minimum number of CG probesets for each CG cluster) can be configured according to the user's requirements. The analysis returns a new data frame that includes the original CG cluster annotation details (based on RefGene_Name or RefGene_Accession), along with additional information for each identified Corr CG cluster.
The resulting data frame includes: median Corr values for each CG probeset (cor), p-value associated with Corr analysis for each CG probeset (p), cluster identifier (corrClusterName), ratio between the number of CG probesets in each Corr CG cluster and the corresponding CG cluster annotation (ClusterCoverage), number of CG probesets (cgN), number of base pairs (CGsize), starting position (CGstartcluster), ending position (CGstopcluster), cumulative median Corr levels for the cluster (Corr.median.cluster), and interquartile range of median Corr levels (Corr.IQR.cluster). Furthermore, the column “refGeneIso”, indicating the RefGene_Accession associated with the identified Corr CG clusters, is included only for Corr CG clusters based on the RefGene_Accession clustering option.
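To make the correlation classes concrete, the following Python sketch computes Pearson's r between the methylation levels of one probeset and the expression of its gene, then assigns one of the six classes listed above. The function names and the toy data are illustrative only; since the two weak ranges both include 0, the sketch assumes that r = 0 is labelled "Weakly positively correlated".

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def correlation_class(r):
    """Classify r with the thresholds stated in the text."""
    if r > 0.7:
        return "Strongly positively correlated"
    if r >= 0.3:
        return "Moderately positively correlated"
    if r >= 0:
        return "Weakly positively correlated"  # assumption: r = 0 is assigned here
    if r > -0.3:
        return "Weakly negatively correlated"
    if r >= -0.7:
        return "Moderately negatively correlated"
    return "Strongly negatively correlated"


# Toy data: methylation rises across samples while expression falls.
example_methylation = [0.1, 0.2, 0.3, 0.4, 0.5]
example_expression = [10.0, 8.0, 6.0, 4.0, 2.0]
example_r = pearson_r(example_methylation, example_expression)
example_class = correlation_class(example_r)
```

In this toy case the relationship is perfectly linear and inverse, so the probeset falls in the strongly negative class.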

**integrCluster**
The integrCluster function integrates the CG clusters obtained from the methDNA, Betadiff, and Corr clustering analyses based on the RefGene_Accession annotation. This function enables the identification of consecutive CG probesets that show concordance for methDNA, Betadiff, and Corr values. To generate Integrated CG clusters, the following data frames are required: methDNAmatrix (a data frame with the median methDNA value of each CG probeset across all samples); betaDiffmatrix (a data frame containing the Betadiff values and corresponding p-values of CG probesets computed between the comparison groups); corrMatrix (a data frame containing the Corr values and corresponding p-values computed between each CG probeset and its associated genes); diffExprMatrix (a data frame containing the differential analysis of gene expression between comparison groups, with Fold Change and p-value); CGclusterAnn (a data frame representing the CG cluster annotation based on RefGene_Accession); methDNAclusters (a data frame containing the methDNA CG clusters derived from clustering based on the RefGene_Accession CG cluster annotation); betaDiffclusters (a data frame containing the Betadiff CG clusters derived from clustering based on the RefGene_Accession CG cluster annotation); corrClusters (a data frame containing the Corr CG clusters derived from clustering based on the RefGene_Accession CG cluster annotation). Before executing the Integrated clustering analysis, the betaDiffmatrix and corrMatrix must be filtered based on a p-value threshold of ≤ 0.05. Moreover, the parameter nCGclusters (minimum number of CG probesets for each CG cluster) can be configured according to the user's requirements. The analysis returns a new data frame that includes the original CG cluster annotation details based on RefGene_Accession, along with additional information for each identified Integrated CG cluster.
The resulting data frame includes: cluster identifier (integratedCluster), genomic coordinates (ClusterGenCoordinate), median methDNA levels for each CG probeset (CG.median.methDNA), median Corr levels for each CG probeset (CG.median.corr), p-value associated with Corr analysis for each CG probeset (CG.coor.pValue), median Betadiff levels for each CG probeset (CG.median.betaDiff), p-value associated with Betadiff analysis for each CG probeset (CG.betaDiff.pValue), median methDNA levels for the cluster (cluster.median.methDNA), median Corr levels for the cluster (cluster.median.corr), median Betadiff levels for the cluster (cluster.median.betaDiff), mean expression levels of the associated gene in the reference group (mean.expr.Ref), LogFC of gene expression between comparison groups (logFC_Ref.vs.CTRL), and p-value associated with the differential gene expression analysis between comparison groups (pValue_logFC.expr).
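One possible reading of the integration step is sketched below in Python. It assumes that "concordance" means consecutive probesets sharing the same methDNA, Betadiff, and Corr class at once, and that a probeset missing from any of the three analyses (for example, removed by the p-value filter) interrupts a run. Both points are interpretations made for the example rather than statements about the package's internals, and the function name and input layout are hypothetical.

```python
from itertools import groupby


def integrated_clusters(probes, n_cg_clusters=3):
    """Find runs of consecutive probes with identical class triples.

    probes: list of (probe_id, methdna_class, betadiff_class, corr_class)
    in genomic order for one RefGene_Accession; a class is None when the
    probe is absent from that analysis.
    """
    clusters = []
    for labels, run in groupby(probes, key=lambda p: p[1:]):
        ids = [p[0] for p in run]
        # Skip runs with a missing class or with too few probes.
        if None in labels or len(ids) < n_cg_clusters:
            continue
        clusters.append({
            "probes": ids,
            "methDNA": labels[0],
            "betaDiff": labels[1],
            "corr": labels[2],
        })
    return clusters


HYPER = "Hypermethylated"
STRONG_METH = "Strongly methylated"
STRONG_NEG = "Strongly negatively correlated"

example_integr_input = [
    ("cg1", HYPER, STRONG_METH, STRONG_NEG),
    ("cg2", HYPER, STRONG_METH, STRONG_NEG),
    ("cg3", HYPER, STRONG_METH, STRONG_NEG),
    ("cg4", HYPER, None, STRONG_NEG),         # no significant Betadiff: breaks the run
    ("cg5", HYPER, STRONG_METH, STRONG_NEG),
    ("cg6", HYPER, STRONG_METH, STRONG_NEG),  # run of 2 after the break: too short
]
example_integrated = integrated_clusters(example_integr_input)
```

Under these assumptions only cg1 to cg3 form an Integrated CG cluster; the two probesets after cg4 are concordant but do not reach the minimum cluster size.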