Quantitative Genetics Methodology of Advanced Generation Seed Orchards

Published in

Ph.D. stories

11 min readApr 18, 2023

1. Summary:

The objective is to conduct original research in the quantitative genetic methodology of seed orchards. The main issue is that the existing seed orchard theory is built on several assumptions that are not realistic in advanced-generation seed orchards. The idea is to jointly review the existing body of research along with the current opportunities of mathematical programming. While it is possible to maximize genetic gain and constrain the effective population size, the general solution provided by Meuwissen (1997) is erroneous (formulation of the Lagrangian multipliers). My general research objective is to present a novel linearization of the basic quadratic formulation by Meuwissen (1997). Alongside this implementation, I utilize the AMPL software (Gay, 2015) to integrate additional variables important to operational forestry. In turn, I should deliver a new methodology suitable to advanced generations, maintaining all the benefits of the general mathematical-programming model (declaration of the optimality tolerance, etc.). The existing methods (1st gen seed orchards) can be considered a special case of the general solution. The AMPL describes optimization problems in a declarative language based directly on familiar algebraic terminology, making the optimization intuitive. The primary model should be presented in a highly-ranking 1Q journal within the Forestry category (e.g., Annals of Forest Science). Besides the theoretical work, it is crucial to involve additional experts willing to contribute with their expertise and datasets, as this should become a work in progress: my supervisor, prof. Lstibůrek set up an international research team involving Prof. Dr. Kang, College of Agriculture and Life Sciences, Seoul National University, Dr. Steffernen (NIBIO, Norway), Dr. Almquist (SkogForsk, Sweden), and Dr. Billir (Isparta University of Applied Sciences, Turkey).

2. The current state of the topic:

Seed orchards connect tree breeding programs and operational forestry. They are the source of genetically improved seed for afforestation. In the initial breeding cycle, phenotypically superior individuals (designated as plus trees) are identified in forest stands. They are typically grafted to establish the first-generation (clonal) seed orchards. As breeding progresses through a typical sequence of breeding activities (mating, testing, and selection), new seed orchards are planted with grafted top-ranking genetically improved individuals. Due to repeated selection cycles, advanced-generation seed orchards are associated with higher genetic gains in important quantitative and qualitative traits (White et al. 2007). The bulk of the quantitative-genetic theory of seed orchard production is limited to the initial breeding cycles (Kang 2001).

Several limitations are related to the current quantitative genetic models of seed orchard production: (#1) the presence of pair-wise and self-co-ancestries expel the problem from linear formulation, and complications may arise with singularities of the quadratic constraint; (#2) deriving optimum clonal proportions from breeding values relies on somewhat arbitrary weighting (e.g., see constant c in the “Model and mathematics” section in Lindgren and Matheson 1986); (#3) breeding values are typically treated as constants in calculating optimum proportions in animal and plant breeding (e.g., Meuwissen 1997); however, they are statistical predictions of cumulative additive gene effects; (#4) genetic gain is the product of clonal proportions and the respective breeding values conditional on random mating (equal likelihood of mating among all genetic entries across the grid); (#5) actual gametic contributions differ from calculated “ideal” proportions (complex reproductive biology in variable environmental conditions).

With the DNA markers and high-throughput phenotyping, one is no longer limited to inaccurate predictions of the orchard’s reproductive output (Funda and El-Kassaby 2013). Quantitative genetic models should be fine-tuned to a much more detailed description of the reproductive dynamics. Such methodology is currently missing, and current seed orchard management is inefficient.

3. Objectives:

The primary scientific goal of this project is to create a quantitative genetic model for maximizing genetic gain in seed orchards with consideration of previously mentioned limiting conditions and other inputs (genomic and high-throughput physiological data). The research is divided into two parts, i.e., analytical derivation of the optimization algorithm and pilot implementation study enabling verification of the algorithm on real datasets in the Czech Republic and the USA (ongoing collaboration with NC State CAMCORE cooperative).

More general objectives of the doctoral study program include: (1.) becoming an expert in quantitative forest genetics and seed orchards, (2.) gaining knowledge to conduct independent theoretical and applied research, (3.) networking with experts from international forest genetic community, (4.) teaching experience in forest genetic classes, and (4.) cultural experience while living in the Czech Republic.

4. Conceptual approaches and methodology:

4.1. Analytical derivation of the optimization model, linearization of the general selection algorithm, considering realized kinship relationships in the population, incorporating additional parameters (lower and upper limits of clonal sizes, number of selected individuals, declared constraint on genetic diversity).

4.2. The next step is to optimize the integration of the genomic kinship matrix into the optimization framework. The first question is about choosing the most suitable DNA marker type. The next step will be creating filter protocols, considering adaptive and non-adaptive DNA polymorphism.

4.3. Finally, additional variables (expert estimates of reproductive biology) will be integrated into the optimization framework.

5. Expected Outputs

As the first author, I am anticipating four scientific publications in peer-reviewed journals (WoS). Further, I intend to attend one or two international conferences as a speaker. I aim to develop publicly available software tools to be used by tree breeders worldwide. I also plan to present my research at meetings with the forest industry in the Czech Republic and possibly abroad.

6. Interim Results of Optimization of the founder population.

R script for founder simulation model for LD decay and Allele frequency spectrum.

set.seed(123)
library(MoBPS)
library(MoBPSmaps)
library(RandomFieldsUtils)
library(miraculix)
popnew <- creating.diploid(nsnp = 50000,
                           nindi = 5000,
                           chromosome.length = 100,
                           freq = "beta",
                           beta.shape1 = 0.2,
                           beta.shape2 = 1.1,
                           dataset = "random",
                           sex.s = "random")


popnew = optimize.cores(popnew)

# simulate for 5, 10, 20 generations
for(index in 1:30){
    print(index)
    popnew <- breeding.diploid(popnew, breeding.size = 5000,
                               selection.size = c(0, 0))
    
    if(index%%10==0){
        popnew = new.base.generation(popnew, base.gen = get.ngen(popnew))
        
    }
}

# allele frequency
genotype.check1 <- get.geno(popnew, gen = length(popnew$breeding))
p_i1 <- rowMeans(genotype.check1)/2
p_i1 <- data.frame(p_i1, row.names = NULL, check.rows = FALSE,
                   check.names = TRUE, fix.empty.names = TRUE,
                   stringsAsFactors = FALSE)
write.csv(p_i1, file = "p_i1new.csv",row.names = FALSE)

# genetic distance between two markers in cM
ldinfo1 <- ld.decay(popnew, genotype.dataset = genotype.check1,
                    step = 1, max = 500)
ldinfo1 <- data.frame(ldinfo1, row.names = NULL, check.rows = FALSE,
                      check.names = TRUE, fix.empty.names = TRUE,
                      stringsAsFactors = FALSE)
write.csv(ldinfo1, file = "ldinfo1new.csv",row.names = FALSE)

popnew <- creating.diploid(nsnp = 50000,
                           nindi = 5000,
                           chromosome.length = 100,
                           freq = "beta",
                           beta.shape1 = 0.2,
                           beta.shape2 = 1.1,
                           dataset = "random",
                           sex.s = "random")


popnew = optimize.cores(popnew)


for(index in 1:40){
    print(index)
    popnew <- breeding.diploid(popnew, breeding.size = 5000,
                               selection.size = c(0, 0))
    
    if(index%%10==0){
        popnew = new.base.generation(popnew, base.gen = get.ngen(popnew))
        
    }
}

# allele frequency
genotype.check2 <- get.geno(popnew, gen = length(popnew$breeding))
p_i2 <- rowMeans(genotype.check2)/2
p_i2 <- data.frame(p_i2, row.names = NULL, check.rows = FALSE,
                   check.names = TRUE, fix.empty.names = TRUE,
                   stringsAsFactors = FALSE)
write.csv(p_i2, file = "p_i2new.csv",row.names = FALSE)

# genetic distance between two markers in cM
ldinfo2 <- ld.decay(popnew, genotype.dataset = genotype.check2,
                    step = 1, max = 500)
ldinfo2 <- data.frame(ldinfo2, row.names = NULL, check.rows = FALSE,
                      check.names = TRUE, fix.empty.names = TRUE,
                      stringsAsFactors = FALSE)
write.csv(ldinfo2, file = "ldinfo2new.csv",row.names = FALSE)

popnew <- creating.diploid(nsnp = 50000,
                           nindi = 5000,
                           chromosome.length = 100,
                           freq = "beta",
                           beta.shape1 = 0.2,
                           beta.shape2 = 1.1,
                           dataset = "random",
                           sex.s = "random")


popnew = optimize.cores(popnew)


for(index in 1:50){
    print(index)
    popnew <- breeding.diploid(popnew, breeding.size = 5000,
                               selection.size = c(0, 0))
    
    if(index%%10==0){
        popnew = new.base.generation(popnew, base.gen = get.ngen(popnew))
        
    }
}

# allele frequency
genotype.check3 <- get.geno(popnew, gen = length(popnew$breeding))
p_i3 <- rowMeans(genotype.check3)/2
p_i3 <- data.frame(p_i3, row.names = NULL, check.rows = FALSE,
                   check.names = TRUE, fix.empty.names = TRUE,
                   stringsAsFactors = FALSE)
write.csv(p_i3, file = "p_i3new.csv",row.names = FALSE)

# genetic distance between two markers in cM
ldinfo3 <- ld.decay(popnew, genotype.dataset = genotype.check3,
                    step = 1, max = 500)
ldinfo3 <- data.frame(ldinfo3, row.names = NULL, check.rows = FALSE,
                      check.names = TRUE, fix.empty.names = TRUE,
                      stringsAsFactors = FALSE)
write.csv(ldinfo3, file = "ldinfo3new.csv",row.names = FALSE)

popnew <- creating.diploid(nsnp = 100000,
                           nindi = 5000,
                           chromosome.length = 100,
                           freq = "beta",
                           beta.shape1 = 0.2,
                           beta.shape2 = 1.1,
                           dataset = "random",
                           sex.s = "random")


popnew = optimize.cores(popnew)


for(index in 1:30){
    print(index)
    popnew <- breeding.diploid(popnew, breeding.size = 5000,
                               selection.size = c(0, 0))
    
    if(index%%10==0){
        popnew = new.base.generation(popnew, base.gen = get.ngen(popnew))
        
    }
}

# allele frequency
genotype.check4 <- get.geno(popnew, gen = length(popnew$breeding))
p_i4 <- rowMeans(genotype.check4)/2
p_i4 <- data.frame(p_i4, row.names = NULL, check.rows = FALSE,
                   check.names = TRUE, fix.empty.names = TRUE,
                   stringsAsFactors = FALSE)
write.csv(p_i4, file = "p_i4new.csv",row.names = FALSE)

# genetic distance between two markers in cM
ldinfo4 <- ld.decay(popnew, genotype.dataset = genotype.check4,
                    step = 1, max = 500)
ldinfo4 <- data.frame(ldinfo4, row.names = NULL, check.rows = FALSE,
                      check.names = TRUE, fix.empty.names = TRUE,
                      stringsAsFactors = FALSE)
write.csv(ldinfo4, file = "ldinfo4new.csv",row.names = FALSE)


popnew <- creating.diploid(nsnp = 100000,
                           nindi = 5000,
                           chromosome.length = 100,
                           freq = "beta",
                           beta.shape1 = 0.2,
                           beta.shape2 = 1.1,
                           dataset = "random",
                           sex.s = "random")


popnew = optimize.cores(popnew)


for(index in 1:40){
    print(index)
    popnew <- breeding.diploid(popnew, breeding.size = 5000,
                               selection.size = c(0, 0))
    
    if(index%%10==0){
        popnew = new.base.generation(popnew, base.gen = get.ngen(popnew))
        
    }
}

# allele frequency
genotype.check5 <- get.geno(popnew, gen = length(popnew$breeding))
p_i5 <- rowMeans(genotype.check5)/2
p_i5 <- data.frame(p_i5, row.names = NULL, check.rows = FALSE,
                   check.names = TRUE, fix.empty.names = TRUE,
                   stringsAsFactors = FALSE)
write.csv(p_i5, file = "p_i5new.csv",row.names = FALSE)

# genetic distance between two markers in cM
ldinfo5 <- ld.decay(popnew, genotype.dataset = genotype.check5,
                    step = 1, max = 500)
ldinfo5 <- data.frame(ldinfo5, row.names = NULL, check.rows = FALSE,
                      check.names = TRUE, fix.empty.names = TRUE,
                      stringsAsFactors = FALSE)
write.csv(ldinfo5, file = "ldinfo5new.csv",row.names = FALSE)

popnew <- creating.diploid(nsnp = 100000,
                           nindi = 5000,
                           chromosome.length = 100,
                           freq = "beta",
                           beta.shape1 = 0.2,
                           beta.shape2 = 1.1,
                           dataset = "random",
                           sex.s = "random")


popnew = optimize.cores(popnew)


for(index in 1:50){
    print(index)
    popnew <- breeding.diploid(popnew, breeding.size = 5000,
                               selection.size = c(0, 0))
    
    if(index%%10==0){
        popnew = new.base.generation(popnew, base.gen = get.ngen(popnew))
        
    }
}

# allele frequency
genotype.check6 <- get.geno(popnew, gen = length(popnew$breeding))
p_i6 <- rowMeans(genotype.check6)/2
p_i6 <- data.frame(p_i6, row.names = NULL, check.rows = FALSE,
                   check.names = TRUE, fix.empty.names = TRUE,
                   stringsAsFactors = FALSE)
write.csv(p_i6, file = "p_i6new.csv",row.names = FALSE)

# genetic distance between two markers in cM
ldinfo6 <- ld.decay(popnew, genotype.dataset = genotype.check6,
                    step = 1, max = 500)
ldinfo6 <- data.frame(ldinfo6, row.names = NULL, check.rows = FALSE,
                      check.names = TRUE, fix.empty.names = TRUE,
                      stringsAsFactors = FALSE)
write.csv(ldinfo6, file = "ldinfo6new.csv",row.names = FALSE)

ld1 <- read.csv("ldinfo1new.csv",header = TRUE,sep = ",", dec = ".", na.strings = c(".","","x",NA))
ld1 <- ld1[-c(1), -c(3,4)]
names = c('dist','ld1new')
colnames(ld1) <- names
ld1$dist <- as.numeric(ld1$dist/10)
summary(ld1)

ld2 <- read.csv("ldinfo2new.csv",header = TRUE,sep = ",", dec = ".", na.strings = c(".","","x",NA))
ld2 <- ld2[-c(1), -c(3,4)]
names = c('dist','ld2new')
colnames(ld2) <- names
ld2$dist <- as.numeric(ld2$dist/10)
summary(ld2)

ld3 <- read.csv("ldinfo3new.csv",header = TRUE,sep = ",", dec = ".", na.strings = c(".","","x",NA))
ld3 <- ld3[-c(1), -c(3,4)]
names = c('dist','ld3new')
colnames(ld3) <- names
ld3$dist <- as.numeric(ld3$dist/10)
head(ld3)
summary(ld3)

ld4 <- read.csv("ldinfo4new.csv",header = TRUE,sep = ",", dec = ".", na.strings = c(".","","x",NA))
ld4 <- ld4[-c(1), -c(3,4)]
names = c('dist','ld4new')
colnames(ld4) <- names
ld4$dist <- as.numeric(ld4$dist/10)
summary(ld4)

ld5 <- read.csv("ldinfo5new.csv",header = TRUE,sep = ",", dec = ".", na.strings = c(".","","x",NA))
ld5 <- ld5[-c(1), -c(3,4)]
names = c('dist','ld5new')
colnames(ld5) <- names
ld5$dist <- as.numeric(ld5$dist/10)
summary(ld5)

ld6 <- read.csv("ldinfo6new.csv",header = TRUE,sep = ",", dec = ".", na.strings = c(".","","x",NA))
ld6 <- ld6[-c(1), -c(3,4)]
names = c('dist','ld6new')
colnames(ld6) <- names
ld6$dist <- as.numeric(ld6$dist/10)
summary(ld6)

png(filename = "ldplot1-3new.png")
ldplot1 <- plot(ld1$dist, ld1$ld1new, pch = 16, col = "red", 
                xlab = "Distance in cM", ylab = expression(r^2),
                ylim = c(0, 0.0040),
                xlim = c(0, 50),
                main = "LD structure on chromosome 1",
                sub = "5,000 founders with 50,000 SNP")
points(ld2$dist, ld2$ld2new, pch = 17, col = "blue")
points(ld3$dist, ld3$ld3new, pch = 18, col = "green")
legend(x = "topright", text.font = 1.0, pch = c(16,17,18), 
       col = c("red", "blue", "green"),
       legend = c("5-Gen", "10-Gen", "20-Gen"))
dev.off()

png(filename = "ldplot4-6new.png")
ldplot2 <- plot(ld4$dist, ld4$ld4new, pch = 16, col = "red", 
                xlab = "Distance in cM", ylab = expression(r^2),
                ylim = c(0, 0.0040),
                xlim = c(0, 50),
                main = "LD structure on chromosome 1",
                sub = "5,000 founders with 100,000 SNP")
points(ld5$dist, ld5$ld5new, pch = 17, col = "blue")
points(ld6$dist, ld6$ld6new, pch = 18, col = "green")
legend(x = "topright", text.font = 1.0, pch = c(16,17,18), 
       col = c("red", "blue", "green"),
       legend = c("5-Gen", "10-Gen", "20-Gen"))
dev.off()

p_i1new <- read.csv("p_i1new.csv",header = TRUE,sep = ",", dec = ".", na.strings = c(".","","x",NA))
p_i2new <- read.csv("p_i2new.csv",header = TRUE,sep = ",", dec = ".", na.strings = c(".","","x",NA))
p_i3new <- read.csv("p_i3new.csv",header = TRUE,sep = ",", dec = ".", na.strings = c(".","","x",NA))
png(filename = "AlleleFreqScen01-03new.png")
par(mfrow = c(1, 3))
hist(p_i1new$p_i1,
     xlim = c(0, 1.0),
     ylim = c(0, 60000),
     xlab = "Allele Spectrum",
     ylab = "Frequency",
     col = "#33FFFF",
     main = "",
     sub = "Scenario-01-snp50K, gen-5")
text(0.4, 60000, "beta1=0.2, beta2=1.1")
hist(p_i2new$p_i2,
     xlim = c(0, 1.0),
     ylim = c(0, 60000),
     xlab = "Allele Spectrum",
     ylab = "",
     col = "#33FFFF",
     main = "Allele spectrum of founder",
     sub = "Scenario-02-snp50K, gen-10")
text(0.4, 60000, "beta1=0.2, beta2=1.1")
hist(p_i3new$p_i3,
     xlim = c(0, 1.0),
     ylim = c(0, 60000),
     xlab = "Allele Spectrum",
     ylab = "",
     col = "#33FFFF",
     main = "",
     sub = "Scenario-03-snp50K, gen-20")
text(0.4, 60000, "beta1=0.2, beta2=1.1")
dev.off()

p_i4new <- read.csv("p_i4new.csv",header = TRUE,sep = ",", dec = ".", na.strings = c(".","","x",NA))
p_i5new <- read.csv("p_i5new.csv",header = TRUE,sep = ",", dec = ".", na.strings = c(".","","x",NA))
p_i6new <- read.csv("p_i6new.csv",header = TRUE,sep = ",", dec = ".", na.strings = c(".","","x",NA))
png(filename = "AlleleFreqScen04-06new.png")
par(mfrow = c(1, 3))
hist(p_i4new$p_i4,
     xlim = c(0, 1.0),
     ylim = c(0, 60000),
     xlab = "Allele Spectrum",
     ylab = "Frequency",
     col = "#33FFFF",
     main = "",
     sub = "Scenario-04-snp100K, gen-5")
text(0.4, 60000, "beta1=0.2, beta2=1.1")
hist(p_i5new$p_i5,
     xlim = c(0, 1.0),
     ylim = c(0, 60000),
     xlab = "Allele Spectrum",
     ylab = "",
     col = "#33FFFF",
     main = "Allele spectrum of founder",
     sub = "Scenario-05-snp100K, gen-10")
text(0.4, 60000, "beta1=0.2, beta2=1.1")
hist(p_i6new$p_i6,
     xlim = c(0, 1.0),
     ylim = c(0, 60000),
     xlab = "Allele Spectrum",
     ylab = "",
     col = "#33FFFF",
     main = "",
     sub = "Scenario-06-snp100K, gen-20")
text(0.4, 60000, "beta1=0.2, beta2=1.1")
dev.off()

Fig.1: Showing Allele frequency spectrum of the founder population of Picea abies.

Fig.2: Showing Allele frequency spectrum of the founder population of Picea abies.

Fig.3: Showing the Linkage Disequilibrium decay simulation model for Picea abies

Fig.4: Showing the Linkage Disequilibrium decay simulation model for Pinus sylvestris

6. References:

1. Funda T, El-Kassaby YA (2013). Seed orchard genetics. Plant Science Rev 2012:21–43.

2. Kang KS (2001). Genetic gain and gene diversity of seed orchard crops. Swedish University of Agricultural Sciences, Ph.D. dissertation.

3. Lindgren D, Matheson AC (1986). An algorithm for increasing the genetic quality of seed from seed orchards by using the better clones in higher proportions. Silvae Genetica 35:173–177.

4. Meuwissen THE (1997). Maximizing the response of selection with a predefined rate of inbreeding. Journal Animal Science 75:934–940.

5. White TL, Adams WT, Neale DB (2007). Forest genetics. CAB International, Wallingford, Oxfordshire, UK, 682 p.

Written by csagariya