Cluster Analysis of 137 Soybean Lines Based on Root System Architecture Traits Measured in Rhizoboxes

doi:10.37421/2155-6180.2023.14.179

Research Article - (2023) Volume 14, Issue 4

Prabhjot Sanghera¹, François Belzile², Waldiodio Seck² and Pierre Dutilleul¹^*

^*Correspondence: Pierre Dutilleul, Department of Plant Science, McGill University, Macdonald Campus, Sainte-Anne-de-Bellevue, QC, Canada, Email:

Author information

¹Department of Plant Science, McGill University, Macdonald Campus, Sainte-Anne-de-Bellevue, QC, Canada
²Département de phytologie et Institut de biologie intégrative et des systèmes, Université Laval, Québec, QC, Canada

Received: 01-Aug-2023, Manuscript No. jbmbs-23-110237; Editor assigned: 03-Aug-2023, Pre QC No. P-110237; Reviewed: 17-Aug-2023, QC No. Q-110237; Revised: 22-Aug-2023, Manuscript No. R-110237; Published: 29-Aug-2023, DOI: 10.37421/2155-6180.2023.14.179
Citation: Sanghera, Prabhjot, François Belzile, Waldiodio Seck and Pierre Dutilleul. “Cluster Analysis of 137 Soybean Lines Based on Root System Architecture Traits Measured in Rhizoboxes.” J Biom Biosta 14 (2023): 179.
Copyright: © 2023 Sanghera P, et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Abstract

The reported study was motivated by the necessity to select 30 soybean lines from a total of 137 for a sophisticated 3-D phenotyping analysis of the Root System Architecture (RSA), which could not accommodate all the lines with replication. A representative subset of size 30 was found after performing four cluster analyses and comparing the results of two of them in particular. These two cluster analyses are based, respectively, on the data for 12 RSA-related traits previously collected in 2-D on three replicates of the 137 soybean lines, and on the first six principal components representing 95% of the total dispersion after data standardization in a preliminary Principal Component Analysis (PCA). The two cluster analysis procedures provided 16 soybean lines that were the closest to the centroid of their respective cluster in both cases. Fourteen more were found to be common and at a distance from the centroid below a pre-set threshold value without being the closest. The final selection of 30 excludes two soybean lines that were the second member selected from their cluster, and includes instead two soybean lines that are the closest and second closest to their respective centroid in the cluster analysis after PCA on standardized data, but are not well represented in the other cluster analysis. In conclusion, the 93.3% overlap between the two sets of results shows a robust clustering structure in RSA 2-D phenotyping in soybean. Our statistical approaches and procedures can be followed and applied in biological frameworks other than plant phenotyping.

Keywords

Cluster analysis • Data standardization • Distance to the centroid • Plant phenotyping • Principal component analysis • Root system architecture

Introduction

One of the main difficulties in experimental research on biological systems is the bidirectional relationship between genotype and phenotype. Researchers in the omics sciences [1-7], including phenomics [8], are continuously developing new technologies that produce enormous amounts of data, which help improve our understanding of the complexity of living organisms provided they are analysed appropriately. To enable drawing biologically relevant conclusions, statistical methods, among others [9-12], must be optimized in parallel. Raw data from omics experiments are shared by presenting them in figures and visualizing them with meaningful representations. The primary goal of agricultural phenomics, or field omics [13], is to measure and compare phenotypes of crop plants. With the interpretation of dendrograms and proximity to centroids, cluster analysis represents a potentially very effective means to meet that objective. Different clustering algorithms exist that can, for given criteria, group individuals and identify them as cluster members [14].

Phenotypic variation in a germplasm pool is necessary for plant breeders to progress through selection. In this study, we have analysed phenotypic data for the Root System Architecture (RSA) of 137 soybean lines; source of data: [15]. The primary or tap root is the first organ formed by hypophysis in germinating seeds [16]. The thick soybean primary root produces primordia from the pericyclic cells, which grow into lateral roots [17]. Numerical variables, such as the quantity of secondary lateral roots, average root diameter, and root length, typically describe the size and abundance of the root system components. In other measured variables, the focus is on the topology or structure of the root system, like the type and angle of root connections [18]. Here, 12 RSA-related traits had previously been measured from 2-D images of the content of rhizoboxes in which soybean seedlings were grown: Total Length of Roots (TLR), Length of Primary Root (LPR), Length of Secondary Roots (LSR), Distribution of Total Root Length (DTLR), Total Number of Roots (TNR), Median Number of Roots (Med), Maximum Number of Roots (Max), Depth of Root System (DRS), Width of Root System (WRS), Surface of Root System (SRS), Diameter of Primary Roots (DR), and Surface Area of Primary Root (SAR) [15].

We first performed cluster analysis on the dataset introduced above in four ways: without vs. with data standardization, combined or not with the application of a Principal Component Analysis (PCA) to reduce data dimensionality, and then focused on two ways called “Approach 1” and “Approach 2”. In doing so, our motivation was to best answer two questions: How can RSA multivariate data be analysed to objectively define a given number (e.g., 30) of clusters? How can a relevant member (i.e., a soybean line) be identified for each of the 30 clusters? These questions are addressed while keeping in mind that the resulting 30 soybean lines would later be used for a sophisticated, time-consuming RSA phenotyping in 3-D. We used the SAS software, Version 9.4 for Windows (SAS Institute Inc., Cary, NC, USA), to design and perform our cluster analyses.

Materials and Methods

Source of experimental data

The dataset used in the multivariate analyses described below consists of the mean values of phenotypic data collected for three seedlings per line (N=3) from 137 lines of soybean grown in Canada. The seeds were first germinated in Petri dishes filled with fine vermiculite and then transplanted into custom-designed rhizoboxes filled with vermiculite. After 10 days of growth, images of the roots were taken using a camera. The Automatic Root Image Analysis (ARIA) software was used to extract the RSA-related traits from each 2-D image: TLR, LPR, LSR, DTLR, TNR, Med, Max, DRS, WRS, SRS, DR, and SAR [15].

Cluster analysis

This multivariate statistical method is aimed at identifying “clusters”, or groups of individuals, and their “members” for given criteria of proximity in the multidimensional space of a quantitative dataset. In the plethora of existing cluster analysis procedures, clustering depends on the definition of proximity and the type of distance or similarity involved; see, e.g., [14]. In all cases, the basic principles of the method are the same: grouping individuals that are more similar in the same cluster around a “centroid”, in a way that maximizes the separation among clusters while minimizing the distances between members within clusters. We applied cluster analysis to obtain 30 clusters from 137 soybean lines (1 individual=1 soybean line). As a starting point in a given approach, we identified the soybean lines with greatest proximity to the centroid as representatives of the clusters. Our motivation is to select objectively 30 soybean lines for future research work that is practically impossible to undertake with all the 137 soybean lines (i.e., RSA phenotyping based on computed tomography scanning).

In this study, we performed disjoint cluster analyses with the SAS procedure FASTCLUS, in which a nearest centroid sorting algorithm is implemented. We used it without the option of cluster seeds as first guess for centroids, so that the algorithm initially considered each individual as a separate cluster. Distances between two individuals, between one individual and the centroid of one cluster with more than one member, and between two centroids of clusters with several members were computed based on the values of the input variables (using means when centroids of non-singleton clusters are involved); see the VAR statement in SAS scripts A1 and A3 in the appendix. By default, the Euclidean distance is used to assess the proximity among individuals and clusters. The algorithm merges the two closest clusters at each step until the desired number of clusters (MAXC) is reached. Unlike the SAS procedure CLUSTER, PROC FASTCLUS assigns each individual to a single cluster without organization in a hierarchical tree structure.
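The idea of nearest centroid sorting with Euclidean distances can be illustrated with a minimal Python sketch. It is not PROC FASTCLUS itself and does not reproduce its seed selection or options: the function names, the toy 2-D points, and the hand-picked seeds are all assumptions made for illustration only.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two points given as equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def assign_to_nearest(points, centroids):
    """Index of the nearest centroid for each point (first one wins on ties)."""
    return [min(range(len(centroids)), key=lambda k: euclidean(p, centroids[k]))
            for p in points]

def nearest_centroid_sort(points, seeds, max_iter=20):
    """Alternate assignment and centroid update until assignments stop changing."""
    centroids = [list(s) for s in seeds]
    labels = None
    for _ in range(max_iter):
        new_labels = assign_to_nearest(points, centroids)
        if new_labels == labels:
            break
        labels = new_labels
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = centroid(members)
    return labels, centroids

# Hypothetical toy data: five individuals described by two variables.
points = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0], [5.0, 20.0]]
seeds = [points[0], points[2], points[4]]  # hand-picked seeds (assumption)
labels, centroids = nearest_centroid_sort(points, seeds)
distances = [euclidean(p, centroids[k]) for p, k in zip(points, labels)]
```

In this toy example, the last individual forms a cluster on its own, so its distance to the centroid is 0, while each member of a two-member cluster lies at the same distance from the mean of the pair.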

We developed and followed two approaches for clustering.

Approach 1: Cluster analysis with the 12 RSA-related traits. In SAS script A1, "MAXC=30" specifies the requested number of clusters, and the final cluster assignments are saved as output in "work.fastclus_scores".

Approach 2: Cluster analysis with 6 principal components (Prin1-Prin6). In this approach, results of a preliminary PCA are used; see the text below and SAS scripts A2 and A3. The input variables VAR in A3 are "Prin1-Prin6". These were chosen for cluster analysis after PCA (see below) showed that they accounted for 95% of the variability in the data table after column standardization. Prior to standardization, the data table (with 137 rows and 12 columns) contained the mean values (N=3) per soybean line for each of the 12 RSA-related traits. The other options in A3 (i.e., MAXC, OUT) are the same as in A1.

Principal component analysis

This multivariate statistical method can be performed on the same dataset as cluster analysis, but has a different aim. PCA is used to examine the relationships among quantitative variables observed on a number of individuals in order to reduce dimensionality of the data space [14]. Matrix algebra tools applied to the sample correlation matrix (with ones as diagonal entries and standardized covariances off the diagonal) provide “principal components” based on eigenvalues and associated orthogonal eigenvectors. By performing PCA, we aimed to identify structural patterns of association among the 12 RSA-related traits over the 137 soybean lines and assess differences in cluster analysis results obtained with well-defined principal components (Approach 2) vs. with no data standardization and no dimensionality reduction (Approach 1).
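The steps of a PCA on the sample correlation matrix can be summarized in generic notation. The symbols below are introduced here for illustration and do not appear in the original text: x_ij is the mean value of trait j for soybean line i, with n = 137 lines and p = 12 traits; the sample mean and standard deviation of trait j are written with the usual bar and s.

```latex
% Column standardization of the n x p data table
z_{ij} = \frac{x_{ij} - \bar{x}_j}{s_j},
\qquad
\bar{x}_j = \frac{1}{n}\sum_{i=1}^{n} x_{ij},
\qquad
s_j^2 = \frac{1}{n-1}\sum_{i=1}^{n} \left(x_{ij} - \bar{x}_j\right)^2 .

% Sample correlation matrix and its eigendecomposition
R = \frac{1}{n-1}\, Z^{\top} Z,
\qquad
R\, v_k = \lambda_k v_k,
\qquad
\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_p \ge 0 .

% Score of line i on the k-th principal component
\mathrm{Prin}k_i = \sum_{j=1}^{p} v_{jk}\, z_{ij} .

% Because R has ones on its diagonal, the eigenvalues sum to p,
% so the proportion of variability explained by component k is
\frac{\lambda_k}{\sum_{l=1}^{p} \lambda_l} = \frac{\lambda_k}{p} .
```

With p = 12, each eigenvalue divided by 12 gives the share of the total variability carried by the corresponding principal component.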

In SAS script A2 in the appendix, the procedure PRINCOMP is called with "DATA=PCA_Seck_et_al_2020" to specify the input dataset and the option STANDARD to perform PCA on the 12 × 12 sample correlation matrix (i.e., after transforming the data for each variable to a sample variance of 1.0). The latter option facilitates the interpretation of results by focusing on associations among variables via correlations, while avoiding the scale effects related to data dispersion and measurement units that would arise if the 12 × 12 sample variance-covariance matrix were used.

Results and Discussion

The first 6 principal components (out of a maximum of 12; there are 12 variables provided by the 12 soybean root traits) explain about 95% of the variability in the data table (Figure 1, top left panel). Several of the RSA-related traits are redundant; see SAR, DRS, DTLR, LSR, TLR and WRS, RS in the PCA biplots (Figure 1, other panels). The latter result confirms the correlation analysis results reported in Seck W, et al. [15].
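The rule of retaining the smallest number of principal components whose cumulative share of variability reaches a threshold (95% here) can be sketched in a few lines of Python. The function name and the four eigenvalues below are hypothetical and chosen only to make the arithmetic easy to follow; they are not the eigenvalues of the soybean dataset.

```python
def n_components_for(eigenvalues, threshold):
    """Smallest number of leading components whose cumulative share of
    the total variability is at least `threshold` (a value in (0, 1])."""
    total = sum(eigenvalues)
    cumulative = 0.0
    for k, value in enumerate(sorted(eigenvalues, reverse=True), start=1):
        cumulative += value
        if cumulative / total >= threshold:
            return k
    return len(eigenvalues)

# Hypothetical eigenvalues of a 4 x 4 correlation matrix (they sum to 4).
toy_eigenvalues = [2.4, 1.0, 0.45, 0.15]
k95 = n_components_for(toy_eigenvalues, 0.95)  # shares: 0.60, 0.85, 0.9625, 1.0
k80 = n_components_for(toy_eigenvalues, 0.80)
```

In this toy case, three components are needed to reach 95%, whereas two suffice for 80%.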

Figure 1. Principal Component Analysis (PCA) results. Top left panel: Percentage of the variability in the data table explained by the 12 principal components, cumulative or not. Other panels: Biplots of Prin2 against Prin1, Prin3 against Prin1, and Prin3 against Prin2; Prin1, Prin2, Prin3 denote the first three principal components in descending order of the associated eigenvalues.

In a PCA with standardization of the data table, which is equivalent to performing the PCA on the sample correlation matrix [14], “variance”, “dispersion”, “variation” and “variability” tend to mean the same thing.

Using the criterion of greatest proximity or smallest distance to the centroid, 16 soybean lines are found to be common to the lists of 30 names obtained in the cluster analyses along Approach 1 and Approach 2; see the yellow highlights in Table 1. Loosening the required proximity to a maximum difference of 0.15 with the smallest distance to the centroid on both sides, 14 more lines were found to be common and at a distance from the centroid below 0.15 without being the closest. The final selection of 30 (Table 2) excludes two soybean lines (Madoc, McCall) that were the second member selected from their cluster, and includes instead two soybean lines (Mandarin, Maple Arrow) that are the closest and second closest to their respective centroid in Approach 2, but are not well represented in Approach 1.
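One possible reading of the two-step comparison described above can be sketched in Python: first keep the lines that are the closest to their centroid in both approaches, then add the lines that are common to both approaches and lie within a tolerance (0.15 in the text) of the smallest distance in their cluster. The data structure, the function names, and the toy lines "A" to "E" are hypothetical, and the tolerance rule as coded is an interpretation rather than the authors' exact procedure.

```python
def cluster_minima(results):
    """Smallest distance to the centroid within each cluster.
    `results` maps a line name to a (cluster, distance) pair."""
    minima = {}
    for cluster, distance in results.values():
        minima[cluster] = min(distance, minima.get(cluster, distance))
    return minima

def within_tolerance(results, tol=0.0):
    """Lines whose distance exceeds their cluster's minimum by at most `tol`.
    With tol=0.0 this returns the closest line(s) of every cluster."""
    minima = cluster_minima(results)
    return {line for line, (cluster, distance) in results.items()
            if distance - minima[cluster] <= tol}

# Hypothetical results for five lines under two clustering approaches.
approach1 = {"A": (1, 0.90), "B": (1, 1.00), "C": (2, 0.80),
             "D": (2, 1.30), "E": (3, 0.00)}
approach2 = {"A": (1, 1.10), "B": (1, 1.05), "C": (2, 0.70),
             "D": (3, 0.95), "E": (3, 1.20)}

# Step 1: lines that are the closest to their centroid in both approaches.
common_closest = within_tolerance(approach1) & within_tolerance(approach2)

# Step 2: additional common lines once the proximity requirement is loosened.
common_near = within_tolerance(approach1, 0.15) & within_tolerance(approach2, 0.15)
additional = common_near - common_closest
```

Here "C" is retained in step 1, and "A" and "B" are added in step 2. Since "A" and "B" belong to the same cluster in both toy approaches, a further rule of at most one member per cluster would still have to be applied, in the spirit of the exclusion of second members described in the text.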

**Table 1:** A summary of the initial cluster analysis results obtained in Approach 1 (Analysis with the 12 RSA-related traits) and Approach 2 (Analysis with Prin1-Prin6). Only the soybean lines that are the closest to the centroid of the cluster to which they belong are listed. Those that are highlighted in yellow appear in both lists. Complete results are given in Tables B1 and B2 in the appendix.
Analysis with the 12 RSA-related traits			Analysis with Prin1-Prin6
Cluster	Soybean line	Distance to the centroid	Cluster	Soybean line	Distance to the centroid
1	4004P4J	1.232596	1	4004P4J	1.330548
2	4005_24j	0	2	4005_24j	0
3	PS44	0.969116	3	PS44	0.903904
4	Jari	1.385713	4	OAC 7-26C	1.124655
5	Tundra	0	5	Gretna	1.020093
6	Delta	0	6	Madoc	0.844417
7	OAC 7-26C	1.222566	7	OAC Prudence	1.121011
8	Casino	1.379225	8	OAC Wallace	0.944889
9	5055_43G	0	9	5055_43G	1.239143
10	Costaud	1.251672	10	Costaud	1.10524
11	Madoc	1.357312	11	Mandarin	0.929683
12	Maple Ambr	1.10394	12	Venus	0
13	OAC 8-21C	0.898254	13	OAC 7-6C	0
14	Woodstock	0	14	Maple Glen	1.025438
15	S05-T6	1.191081	15	Bravor	1.111696
16	Albinos	1.447583	16	Tundra	0
17	OAC 9-35C	1.088592	17	SECAN8-1	1.026367
18	Clinton	1.169068	18	Woodstock	0
19	Maple Isle	1.057824	19	Jutra	0.775305
20	OAC Oxford	1.118181	20	OT94-47	0.728379
21	S14-P6	1.081495	21	Alta	0.651441
22	McCall	1.27531	22	McCall	1.090354
23	Gentleman	1.505267	23	4067P17j	1.107881
24	Flambeau	1.091571	24	S03-W4	0
25	OAC 7-6C	0	25	Roland	0.975369
26	OAC Wallace	0.954529	26	Maple Belle	1.015101
27	S03-W4	0	27	OAC 7-4C	0.812049
28	Venus	0	28	S14-P6	1.002491
29	Gaillard	0.844357	29	Mario	0.924578
30	OAC 7-4C	0.998249	30	OT05-20	1.209572

**Table 2:** Final selection of 30 soybean lines based on their membership of one of the 30 clusters identified in Approach 1 (Analysis with 12 root traits) and Approach 2 (Analysis with Prin1-Prin6) and their distance from the centroid. The 14 soybean lines highlighted in yellow here were also highlighted in yellow in Table 1; see text and Tables B1 and B2 for the selection of the other 16 soybean lines. In particular, Madoc and McCall, which are highlighted in yellow in Table 1, were eventually discarded to keep not more than one member per cluster after merging the two sets of cluster analysis results.
No.	Soybean line
1	4004P4J
2	4005_24J
3	5055_43G
4	AC2001
5	Albinos
6	Casino
7	Clinton
8	Costaud
9	Delta
10	Elora
11	Gaillard
12	Gentleman
13	Mandarin
14	Maple Arrow
15	OAC 7-26C
16	OAC 7-4C
17	OAC 7-6C
18	OAC 8-21C
19	OAC 9-22C
20	OAC 9-35C
21	OAC Oxford
22	OAC Wallace
23	PS44
24	Proteus
25	S03-W4
26	S14-P6
27	SECAN7-27
28	Tundra
29	Venus
30	Woodstock

The reported overlap of 93.3% [i.e., (16+14–2)/30=0.933] shows a robust clustering structure in RSA 2-D phenotyping in soybean. Thus, we compiled, in a rational way, a list of 30 representative soybean lines with distinct RSA patterns that provide a good basis for 3-D investigation. Of course, germination tests with available seed banks as well as preliminary tests with growing media other than vermiculite may justify adjustments to that list later. It is worth mentioning that OAC Bayfield readily provides a substitute to OAC 7-26C if required, as these soybean lines belong to the same cluster with two members in both approaches (Tables B1 and B2); they are therefore at equal distance from the centroid and either can be randomly picked. A comparison with genomic clustering results falls beyond the scope of a Brief Report, but could be the topic of another, broader study.

Conclusion

The selected 30 soybean lines will be used in RSA phenotyping with state-of-the-art equipment, followed by sophisticated 3-D data and image analyses. Selecting representative lines that showcase the diversity in root system architecture and possess biological relevance is crucial. The soybean lines in Table 2 are objective starting points for further investigation into the functionality of specific RSA-related traits on plant performance and adaptation. Our cluster analysis results provide insight into phenotypic variation within the germplasm pool. Understanding root system diversity is crucial for breeders aiming to progress through selection. Advanced 3-D phenotypic analyses, e.g., based on computed tomography scanning, are expected to deepen our understanding of the RSA and its impact on plant productivity and stress tolerance.

References

Mochida, Keiichi and Kazuo Shinozaki. "Genomics and bioinformatics resources for crop improvement." Plant Cell Physiol 51 (2010): 497-523.
