Research Article - (2023) Volume 14, Issue 4
Received: 01-Aug-2023, Manuscript No. jbmbs-23-110237;
Editor assigned: 03-Aug-2023, Pre QC No. P-110237;
Reviewed: 17-Aug-2023, QC No. Q-110237;
Revised: 22-Aug-2023, Manuscript No. R-110237;
Published: 29-Aug-2023, DOI: 10.37421/2155-6180.2023.14.179
Citation: Sanghera, Prabhjot, François Belzile, Waldiodio Seck and Pierre Dutilleul. “Cluster Analysis of 137 Soybean Lines Based on Root System Architecture Traits Measured in Rhizoboxes.” J Biom Biosta 14 (2023): 179.
Copyright: © 2023 Sanghera P, et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
The reported study was motivated by the necessity to select 30 soybean lines from a total of 137 for a sophisticated 3-D phenotyping analysis of the Root System Architecture (RSA), in which not all the lines could be included and replicated. A representative subset of size 30 was found after performing four cluster analyses and comparing the results of two of them in particular. These two cluster analyses are based, respectively, on the data for 12 RSA-related traits previously collected in 2D on three replicates of the 137 soybean lines, and on the first six principal components representing 95% of the total dispersion after data standardization in a preliminary Principal Component Analysis (PCA). The two cluster analysis procedures provided 16 soybean lines that were the closest to the centroid of their respective cluster in both cases. Fourteen more were found to be common and at a distance from the centroid below a pre-set threshold value without being the closest. The final selection of 30 excludes two soybean lines that were the second member selected from their cluster, and includes instead two soybean lines that are the closest and second closest to their respective centroid in the cluster analysis after PCA on standardized data, but are not well represented in the other cluster analysis. In conclusion, the 93.3% overlap between the two sets of results shows a robust clustering structure in RSA 2-D phenotyping in soybean. Our statistical approaches and procedures can be followed and applied in biological frameworks other than plant phenotyping.
Cluster analysis • Data standardization • Distance to the centroid • Plant phenotyping • Principal component analysis • Root system architecture
One of the main difficulties in experimental research on biological systems is the bidirectional relationship between genotype and phenotype. Researchers in the omics sciences [1-7], including phenomics [8], are continuously developing new technologies that produce enormous amounts of data, which help improve our understanding of the complexity of living organisms provided they are analysed appropriately. To enable biologically relevant conclusions to be drawn, statistical methods, among others [9-12], must be optimized in parallel. Raw data from omics experiments are shared by presenting them in figures and visualizing them with meaningful representations. The primary goal of agricultural phenomics, or field omics [13], is to measure and compare phenotypes of crop plants. With the interpretation of dendrograms and proximity to centroids, cluster analysis is a potentially very effective means to meet that objective. Different clustering algorithms exist that can, for given criteria, group individuals and identify them as cluster members [14].
Phenotypic variation in a germplasm pool is necessary for plant breeders to progress through selection. In this study, we have analysed phenotypic data for the Root System Architecture (RSA) of 137 soybean lines; source of data: [15]. The primary or tap root is the first organ formed by hypophysis in germinating seeds [16]. The thick soybean primary root produces primordia from the pericyclic cells, which grow into lateral roots [17]. Numerical variables, such as the quantity of secondary lateral roots, average root diameter, and root length, typically describe the size and abundance of the root system components. In other measured variables, the focus is on the topology or structure of the root system, like the type and angle of root connections [18]. Here, 12 RSA-related traits had previously been measured from 2-D images of the content of rhizoboxes in which soybean seedlings were grown: Total Length of Roots (TLR), Length of Primary Root (LPR), Length of Secondary Roots (LSR), Distribution of Total Root Length (DTLR), Total Number of Roots (TNR), Median number of roots (Med), Maximum Number of Roots (Max), Depth of Root System (DRS), Width of Root System (WRS), Surface of Root System (SRS), Diameter of Primary Roots (DR), and Surface Area of primary Root (SAR) [15].
We first performed cluster analysis on the dataset introduced above in four ways: without vs. with data standardization, combined or not with the application of a Principal Component Analysis (PCA) to reduce data dimensionality, and then focused on two ways called “Approach 1” and “Approach 2”. In doing so, our motivation was to best answer two questions: How to analyse RSA multivariate data to objectively define a given number (e.g., 30) of clusters? How can a relevant member (i.e., a soybean line) be identified for each of the 30 clusters? These questions are addressed while keeping in mind that the resulting 30 soybean lines would later be used for a sophisticated, time-consuming RSA phenotyping in 3D. We used the SAS software, Version 9.4 for Windows (SAS Institute Inc., Cary, NC, USA), to design and perform our cluster analyses.
Source of experimental data
The dataset used in the multivariate analyses described below consists of the mean values of phenotypic data collected for three seedlings per line (N=3) from 137 lines of soybean grown in Canada. The seeds were first germinated in Petri dishes filled with fine vermiculite and then transplanted into custom-designed rhizoboxes filled with vermiculite. After 10 days of growth, images of the roots were taken using a camera. The Automatic Root Image Analysis (ARIA) software was used to extract the RSA-related traits from each 2-D image: TLR, LPR, LSR, DTLR, TNR, Med, Max, DRS, WRS, SRS, DR, and SAR [15].
Cluster analysis
This multivariate statistical method is aimed at identifying “clusters”, or groups of individuals, and their “members” for given criteria of proximity in the multidimensional space of a quantitative dataset. In the plethora of existing cluster analysis procedures, clustering depends on the definition of proximity and the type of distance or similarity involved; see, e.g., [14]. In all cases, the basic principles of the method are the same: grouping individuals that are more similar in the same cluster around a “centroid”, in a way that maximizes the separation among clusters while minimizing the distances between members within clusters. We applied cluster analysis to obtain 30 clusters from 137 soybean lines (1 individual=1 soybean line). As a starting point in a given approach, we identified the soybean lines with greatest proximity to the centroid as representatives of the clusters. Our motivation is to select objectively 30 soybean lines for future research work that is practically impossible to undertake with all the 137 soybean lines (i.e., RSA phenotyping based on computed tomography scanning).
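The selection of a representative as the member closest to its cluster centroid can be sketched in a few lines of Python. This is a minimal illustration and not the authors' SAS code: the function names, the toy line names and the two-trait vectors are all hypothetical, and the cluster assignment is assumed to be already available.

```python
import math

def centroid(points):
    """Coordinate-wise mean of a list of equal-length numeric tuples."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))

def euclidean(a, b):
    """Euclidean distance between two points of the same dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def representatives(data, assignment):
    """Return {cluster id: (name, distance)} for the member closest to each centroid.

    data: {name: trait vector}; assignment: {name: cluster id}.
    """
    clusters = {}
    for name, cid in assignment.items():
        clusters.setdefault(cid, []).append(name)
    result = {}
    for cid, names in clusters.items():
        c = centroid([data[n] for n in names])
        best = min(names, key=lambda n: euclidean(data[n], c))
        result[cid] = (best, euclidean(data[best], c))
    return result

# Hypothetical toy data: four "lines", two traits each, already assigned to clusters.
data = {"A": (1.0, 1.0), "B": (2.0, 1.0), "C": (6.0, 1.0), "D": (10.0, 10.0)}
assignment = {"A": 1, "B": 1, "C": 1, "D": 2}
reps = representatives(data, assignment)
print(reps)
```

In this sketch, the only member of a singleton cluster coincides with its centroid and is therefore reported at distance 0.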
In this study, we performed disjoint cluster analyses with the SAS procedure FASTCLUS, in which a nearest centroid sorting algorithm is implemented. We used it without the option of cluster seeds as first guess for centroids, so that the algorithm initially considered each individual as a separate cluster. Distances between two individuals, between one individual and the centroid of one cluster with more than one member, and between two centroids of clusters with several members were computed based on the values of the input variables (using means when centroids of non-singleton clusters are involved); see the VAR statement in SAS scripts A1 and A3 in the appendix. By default, the Euclidean distance is used to assess the proximity among individuals and clusters. The algorithm merges the two closest clusters at each step until the desired number of clusters (MAXC) is reached. Unlike the SAS procedure CLUSTER, PROC FASTCLUS assigns each individual to a single cluster without organization in a hierarchical tree structure.
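The general idea of nearest centroid sorting with the Euclidean distance can be illustrated as follows: each individual is assigned to its closest centroid, and centroids are then recomputed as the means of their members until assignments stop changing. The sketch below is a generic illustration only. It does not reproduce the internals of PROC FASTCLUS or the authors' SAS scripts, and the function names, the toy points, the explicit starting seeds and the iteration cap are assumptions made for the example.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two points of the same dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def assign(points, centroids):
    """For each point, the index of its nearest centroid."""
    return [min(range(len(centroids)), key=lambda k: euclidean(p, centroids[k]))
            for p in points]

def update(points, labels, centroids):
    """Recompute each centroid as the mean of its members (kept as-is if empty)."""
    new = []
    for k, old in enumerate(centroids):
        members = [p for p, lab in zip(points, labels) if lab == k]
        if members:
            new.append(tuple(sum(col) / len(members) for col in zip(*members)))
        else:
            new.append(old)
    return new

def nearest_centroid_sort(points, seeds, max_iter=100):
    """Alternate assignment and centroid update until labels are stable."""
    centroids = list(seeds)
    labels = assign(points, centroids)
    for _ in range(max_iter):
        centroids = update(points, labels, centroids)
        new_labels = assign(points, centroids)
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids

# Hypothetical toy data: five individuals, two variables, two requested clusters.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (9.0, 9.0), (10.0, 9.0)]
seeds = [points[0], points[3]]  # assumed starting centroids, for illustration only
labels, centroids = nearest_centroid_sort(points, seeds)
print(labels, centroids)
```

Each individual ends up in exactly one cluster, with no hierarchical tree, which is the disjoint structure described above.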
We developed and followed two approaches for clustering.
Approach 1: Cluster analysis with the 12 RSA-related traits. In SAS script A1, "MAXC=30" specifies the requested number of clusters, and the final cluster assignments are saved as output in "work.fastclus_scores".
Approach 2: Cluster analysis with 6 principal components (Prin1-Prin6). In this approach, results of a preliminary PCA are used; see the text below and SAS scripts A2 and A3. The input variables VAR in A3 are "Prin1-Prin6". These were chosen for cluster analysis after PCA (see below) showed that they accounted for 95% of the variability in the data table after column standardization. Prior to standardization, the data table (with 137 rows and 12 columns) contained the mean values (N=3) per soybean line for each of the 12 RSA-related traits. The other options in A3 (i.e., MAXC, OUT) are the same as in A1.
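Column standardization of a data table, as applied before the PCA in Approach 2, can be sketched as below. This is an illustration under stated assumptions: the function name and the small two-column table are hypothetical, and the sample standard deviation (denominator n-1) is assumed.

```python
import statistics

def standardize_columns(table):
    """Z-score each column: subtract the column mean, divide by the sample SD (n-1)."""
    columns = list(zip(*table))
    means = [statistics.fmean(c) for c in columns]
    sds = [statistics.stdev(c) for c in columns]
    return [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in table]

# Hypothetical table: 4 rows (lines) by 2 columns (traits) on very different scales.
table = [[10.0, 0.5], [12.0, 0.7], [14.0, 0.9], [16.0, 1.1]]
z = standardize_columns(table)
for row in z:
    print(row)
```

After this transformation every column has mean 0 and sample variance 1, so no trait dominates distances merely because of its measurement unit.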
Principal component analysis
This multivariate statistical method can be performed on the same dataset as cluster analysis, but has a different aim. PCA is used to examine the relationships among quantitative variables observed on a number of individuals in order to reduce the dimensionality of the data space [14]. Matrix algebra tools applied to the sample correlation matrix (with ones as diagonal entries and standardized covariances off the diagonal) provide “principal components” based on eigenvalues and associated orthogonal eigenvectors. By performing PCA, we aimed to identify structural patterns in the association of the 12 RSA-related traits over the 137 soybean lines and to assess differences in cluster analysis results obtained with well-defined principal components (Approach 2) vs. with no data standardization and no dimensionality reduction (Approach 1).
In SAS script A2 in the appendix, the procedure PRINCOMP is called with "DATA=PCA_Seck_et_al_2020" to specify the input dataset and the option STANDARD to perform PCA on the 12 × 12 sample correlation matrix (i.e., after transforming the data for each variable to a sample variance of 1.0). The latter option facilitates the interpretation of results by focusing on associations among variables via correlations, while avoiding scale effects related to data dispersion and measurement units if the 12 × 12 sample variance-covariance matrix was used.
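The scale effect mentioned above can be made concrete with a small numerical check: changing the unit of one variable changes its covariances but leaves its correlations untouched. The sketch below is illustrative only; the function names and the two short series of values are hypothetical and are not taken from the dataset.

```python
import statistics

def covariance(x, y):
    """Sample covariance (denominator n-1)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)

def correlation(x, y):
    """Sample correlation: covariance divided by the product of sample SDs."""
    return covariance(x, y) / (statistics.stdev(x) * statistics.stdev(y))

# Hypothetical values for two traits measured on five lines.
length_cm = [12.0, 15.0, 11.0, 18.0, 14.0]
n_roots = [20.0, 26.0, 19.0, 30.0, 22.0]
length_mm = [10.0 * v for v in length_cm]  # same trait, different unit

cov_cm = covariance(length_cm, n_roots)
cov_mm = covariance(length_mm, n_roots)
r_cm = correlation(length_cm, n_roots)
r_mm = correlation(length_mm, n_roots)
print(cov_cm, cov_mm, r_cm, r_mm)
```

The covariance is multiplied by 10 when centimetres become millimetres, whereas the correlation is identical in both units, which is why a PCA on the correlation matrix does not depend on measurement units.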
The first 6 principal components (out of a maximum of 12; there are 12 variables provided by the 12 soybean root traits) explain about 95% of the variability in the data table (Figure 1, top left panel). Several of the RSA-related traits are redundant; see SAR, DRS, DTLR, LSR, TLR and WRS, RS in the PCA biplots (Figure 1, other panels). The latter result confirms the correlation analysis results reported in Seck W, et al. [15].
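The rule of retaining the first components that together reach a target share of the variability can be sketched as follows. The eigenvalues below are hypothetical and chosen only so that they sum to 12, as the eigenvalues of a 12 × 12 correlation matrix do; they are not the eigenvalues obtained in the study, and the function names are illustrative.

```python
from itertools import accumulate

def cumulative_proportions(eigenvalues):
    """Cumulative share of the total variability, components in descending order."""
    total = sum(eigenvalues)
    return [c / total for c in accumulate(eigenvalues)]

def components_needed(eigenvalues, target):
    """Smallest number of leading components whose cumulative share reaches target."""
    for k, prop in enumerate(cumulative_proportions(eigenvalues), start=1):
        if prop >= target:
            return k
    return len(eigenvalues)

# Hypothetical eigenvalues (NOT the study's values), in descending order, summing to 12.
eigenvalues = [5.4, 2.4, 1.5, 1.0, 0.6, 0.56, 0.19, 0.15, 0.1, 0.05, 0.03, 0.02]
k = components_needed(eigenvalues, 0.95)
print(k, cumulative_proportions(eigenvalues)[k - 1])
```

With these illustrative values, five components fall short of 95% and the sixth crosses the threshold.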
Figure 1. Principal Component Analysis (PCA) results. Top left panel: Percentage of the variability in the data table explained by the 12 principal components, cumulative or not. Other panels: Biplots of Prin2 against Prin1, Prin3 against Prin1, and Prin3 against Prin2; Prin1, Prin2, Prin3 denote the first three principal components in descending order of the associated eigenvalues.
In a PCA with standardization of the data table, which is equivalent to performing the PCA on the sample correlation matrix [14], “variance”, “dispersion”, “variation”, and “variability” tend to mean the same thing.
Using the criterion of greatest proximity or smallest distance to the centroid, 16 soybean lines are found to be common to the lists of 30 names obtained in the cluster analyses along Approach 1 and Approach 2; see the yellow highlights in Table 1. Loosening the required proximity to a maximum difference of 0.15 with the smallest distance to the centroid on both sides, 14 more lines were found to be common and at a distance from the centroid below 0.15 without being the closest. The final selection of 30 (Table 2) excludes two soybean lines (Madoc, McCall) that were the second member selected from their cluster, and includes instead two soybean lines (Mandarin, Maple Arrow) that are the closest and second closest to their respective centroid in Approach 2, but are not well represented in Approach 1.
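One possible reading of the loosened criterion is sketched below: within each cluster, a member is a candidate if its distance to the centroid exceeds the smallest distance in that cluster by at most 0.15, and a line is retained if it is a candidate in both approaches. This interpretation, the function names, and the line names and distances in the example are assumptions for illustration, not the authors' procedure or data.

```python
def candidates(distances, tolerance=0.15):
    """Members whose distance is within `tolerance` of the cluster's smallest distance.

    distances: {name: distance to the centroid} for ONE cluster.
    """
    smallest = min(distances.values())
    return {name for name, d in distances.items() if d - smallest <= tolerance}

def common_candidates(clusters_a, clusters_b, tolerance=0.15):
    """Names that are candidates in some cluster of both analyses."""
    set_a = set().union(*(candidates(c, tolerance) for c in clusters_a))
    set_b = set().union(*(candidates(c, tolerance) for c in clusters_b))
    return set_a & set_b

# Hypothetical distances to the centroid, two clusters per analysis.
approach_1 = [{"L1": 1.00, "L2": 1.10, "L3": 1.40}, {"L4": 0.90, "L5": 1.00}]
approach_2 = [{"L2": 0.95, "L3": 1.00, "L1": 1.30}, {"L4": 0.80, "L5": 1.20}]
common = common_candidates(approach_1, approach_2)
print(sorted(common))
```

Here L4 is the closest member in both analyses, while L2 is retained because it is closest in one analysis and within the tolerance in the other.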
Cluster (12 RSA-related traits) | Soybean line (12 RSA-related traits) | Distance to the centroid (12 RSA-related traits) | Cluster (Prin1-Prin6) | Soybean line (Prin1-Prin6) | Distance to the centroid (Prin1-Prin6) |
---|---|---|---|---|---|
1 | 4004P4J | 1.232596 | 1 | 4004P4J | 1.330548 |
2 | 4005_24j | 0 | 2 | 4005_24j | 0 |
3 | PS44 | 0.969116 | 3 | PS44 | 0.903904 |
4 | Jari | 1.385713 | 4 | OAC 7-26C | 1.124655 |
5 | Tundra | 0 | 5 | Gretna | 1.020093 |
6 | Delta | 0 | 6 | Madoc | 0.844417 |
7 | OAC 7-26C | 1.222566 | 7 | OAC Prudence | 1.121011 |
8 | Casino | 1.379225 | 8 | OAC Wallace | 0.944889 |
9 | 5055_43G | 0 | 9 | 5055_43G | 1.239143 |
10 | Costaud | 1.251672 | 10 | Costaud | 1.10524 |
11 | Madoc | 1.357312 | 11 | Mandarin | 0.929683 |
12 | Maple Ambr | 1.10394 | 12 | Venus | 0 |
13 | OAC 8-21C | 0.898254 | 13 | OAC 7-6C | 0 |
14 | Woodstock | 0 | 14 | Maple Glen | 1.025438 |
15 | S05-T6 | 1.191081 | 15 | Bravor | 1.111696 |
16 | Albinos | 1.447583 | 16 | Tundra | 0 |
17 | OAC 9-35C | 1.088592 | 17 | SECAN8-1 | 1.026367 |
18 | Clinton | 1.169068 | 18 | Woodstock | 0 |
19 | Maple Isle | 1.057824 | 19 | Jutra | 0.775305 |
20 | OAC Oxford | 1.118181 | 20 | OT94-47 | 0.728379 |
21 | S14-P6 | 1.081495 | 21 | Alta | 0.651441 |
22 | McCall | 1.27531 | 22 | McCall | 1.090354 |
23 | Gentleman | 1.505267 | 23 | 4067P17j | 1.107881 |
24 | Flambeau | 1.091571 | 24 | S03-W4 | 0 |
25 | OAC 7-6C | 0 | 25 | Roland | 0.975369 |
26 | OAC Wallace | 0.954529 | 26 | Maple Belle | 1.015101 |
27 | S03-W4 | 0 | 27 | OAC 7-4C | 0.812049 |
28 | Venus | 0 | 28 | S14-P6 | 1.002491 |
29 | Gaillard | 0.844357 | 29 | Mario | 0.924578 |
30 | OAC 7-4C | 0.998249 | 30 | OT05-20 | 1.209572 |
No. | Soybean line |
---|---|
1 | 4004P4J |
2 | 4005_24J |
3 | 5055_43G |
4 | AC2001 |
5 | Albinos |
6 | Casino |
7 | Clinton |
8 | Costaud |
9 | Delta |
10 | Elora |
11 | Gaillard |
12 | Gentleman |
13 | Mandarin |
14 | Maple Arrow |
15 | OAC 7-26C |
16 | OAC 7-4C |
17 | OAC 7-6C |
18 | OAC 8-21C |
19 | OAC 9-22C |
20 | OAC 9-35C |
21 | OAC Oxford |
22 | OAC Wallace |
23 | PS44 |
24 | Proteus |
25 | S03-W4 |
26 | S14-P6 |
27 | SECAN7-27 |
28 | Tundra |
29 | Venus |
30 | Woodstock |
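The count of 16 soybean lines common to the two lists of Table 1 can be re-derived directly from the table. The short Python check below simply transcribes the two "Soybean line" columns and intersects them; the variable names are illustrative.

```python
# "Soybean line" column of Table 1, analysis with the 12 RSA-related traits.
approach_1 = [
    "4004P4J", "4005_24j", "PS44", "Jari", "Tundra", "Delta", "OAC 7-26C",
    "Casino", "5055_43G", "Costaud", "Madoc", "Maple Ambr", "OAC 8-21C",
    "Woodstock", "S05-T6", "Albinos", "OAC 9-35C", "Clinton", "Maple Isle",
    "OAC Oxford", "S14-P6", "McCall", "Gentleman", "Flambeau", "OAC 7-6C",
    "OAC Wallace", "S03-W4", "Venus", "Gaillard", "OAC 7-4C",
]

# "Soybean line" column of Table 1, analysis with Prin1-Prin6.
approach_2 = [
    "4004P4J", "4005_24j", "PS44", "OAC 7-26C", "Gretna", "Madoc",
    "OAC Prudence", "OAC Wallace", "5055_43G", "Costaud", "Mandarin", "Venus",
    "OAC 7-6C", "Maple Glen", "Bravor", "Tundra", "SECAN8-1", "Woodstock",
    "Jutra", "OT94-47", "Alta", "McCall", "4067P17j", "S03-W4", "Roland",
    "Maple Belle", "OAC 7-4C", "S14-P6", "Mario", "OT05-20",
]

# Lines listed as closest to a centroid in both analyses.
common = sorted(set(approach_1) & set(approach_2))
print(len(common), common)
```

The intersection contains Madoc and McCall, the two lines that the text reports as excluded from the final selection.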
The reported overlap of 93.3% [i.e., (16+14–2)/30=0.933] shows a robust clustering structure in RSA 2-D phenotyping in soybean. Thus, we compiled, in a rational way, a list of 30 representative soybean lines with distinct RSA patterns that provide a good basis for 3-D investigation. Of course, germination tests with available seed banks as well as preliminary tests with growing media other than vermiculite may justify adjustments to that list later. It is worth mentioning that OAC Bayfield readily provides a substitute to OAC 7-26C if required, as these soybean lines belong to the same cluster with two members in both approaches (Tables B1 and B2); they are therefore at equal distance from the centroid and either can be randomly picked. A comparison with genomic clustering results falls beyond the scope of a Brief Report, but could be the topic of another, broader study.
The selected 30 soybean lines will be used in RSA phenotyping with state-of-the-art equipment, followed by sophisticated 3-D data and image analyses. Selecting representative lines that showcase the diversity in root system architecture and possess biological relevance is crucial. The soybean lines in Table 2 are objective starting points for further investigation into the functionality of specific RSA-related traits on plant performance and adaptation. Our cluster analysis results provide insight into phenotypic variation within the germplasm pool. Understanding root system diversity is crucial for breeders aiming to progress through selection. Advanced 3-D phenotypic analyses, e.g., based on computed tomography scanning, are expected to deepen our understanding of the RSA and its impact on plant productivity and stress tolerance.