This is a note for the book Genome-Scale Algorithm Design: Biological Sequence Analysis in the Era of High-Throughput Sequencing (2nd edition).
Exercise 1.3
In a given organism, some codons pairs occur less frequently than others.
(a) Given the set of all exons of an organism, write a program that computes the ratio z(XY) between the observed and the expected number of occurrences of every pair of consecutive codons XY . Note that the expected number of occurrences of pair XY depends on the frequency with which X and Y are used to encode their corresponding amino acids.
(b) Given a DNA sequence S that encodes a protein P, let f(S) be the average of z(XY) over all consecutive pairs of codons XY in S. Write a program that computes, if it exists, a permutation S’ of the codons of S, such that S’ still encodes P, but f(S’) < f(S).
After reviewing the paper titled Virus Attenuation by Genome-Scale Changes in Codon Pair Bias and exploring other information on the internet, I have gained an understanding that this relates to the phenomenon of codon usage bias.
A solution for (a):
1. Collect the data. Both the FASTA (Genome sequences) and GFF (Annotation features) files for Genome assembly ASM584v2 (Escherichia coli str. K-12 substr. MG1655) were downloaded from the NCBI.
2. Filter data. “Codon usage bias refers to differences in the frequency of occurrence of synonymous codons in coding DNA. — wiki)” A Bash script designed to extract the coding sequences (CDS) from the FASTA file, utilizing the GFF file in conjunction with two additional tools: bedtools and seqtk.
#!/bin/bash
# Filter CDS and positive strand
awk '
7 == "+" { print }' genomic.gff > positive-strand-cds.gff
# Extract positive strand sequences
bedtools getfasta -fi GCF_000005845.2_ASM584v2_genomic.fna -bed positive-strand-cds.gff -fo positive-strand-cds.fa
# Filter CDS and negative strand
awk '
7 == "-" { print }' genomic.gff > negative-strand-cds.gff
# Extract nagetive strand sequences
bedtools getfasta -fi GCF_000005845.2_ASM584v2_genomic.fna -bed negative-strand-cds.gff -fo negative-strand-cds.fa
# Get complement DNA sequences for nagetive strand sequences
seqtk seq -r negative-strand-cds.fa > complement-negative-strand-cds.fa
# Merge the sequences
cat positive-strand-cds.fa complement-negative-strand-cds.fa > cds.fa
In total, 4,337 sequences were obtained.
3. Calculate codon pair scores (excluding Stop codons). The calculation comprises the following steps:
3.1 Eliminate sequences lacking Start or Stop codons and those whose sequence length is not divisible by 3. (48 sequences were removed, which represents approximately 1% of the entire dataset consisting of 4,337 sequences, which is acceptable.)
3.2 Determine the count of codons, codon pairs, amino acids, and amino acid pairs.
3.3 Calculate codon pair scores using the following formula:
The equation used to calculate the CPS of a given codon pair independent of codon usage and amino acid bias, thus its relative expected frequency, where the codon pair AB encodes for amino acid pair XY and F denotes frequency (number of occurrences). This CPS score for a given pair determines if the pair is over-represented (+) or under-represented (-) in the human genome.[1]
4. Save the results to a CSV file.
Full code (two external Java libraries are employed: BioJava and Apache Commons CSV):
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVPrinter;
import org.biojava.bio.BioException;
import org.biojava.bio.seq.DNATools;
import org.biojava.bio.seq.RNATools;
import org.biojava.bio.seq.Sequence;
import org.biojava.bio.seq.SequenceIterator;
import org.biojava.bio.seq.db.SequenceDB;
import org.biojava.bio.seq.io.SeqIOTools;
import org.biojava.bio.symbol.*;
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.FileWriter;
import java.io.IOException;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
public class CalculateCodonPairScores {
static void calculateCodons(String sequence, HashMap<String, Long> codonsCount, HashMap<String, Long> codonPairsCount,
HashMap<String, Long> aminoAcidsCount, HashMap<String, Long> aminoAcidPairsCount) throws IllegalAlphabetException, IllegalSymbolException {
String firstCodon = null;
for (int i = 0; i < sequence.length(); i += 3) {
String codon = sequence.substring(i, i + 3);
if(hasStopCodon(codon)) continue; // in case of Stop codon in the middle
if (!codonsCount.containsKey(codon)) {
codonsCount.put(codon, 1L);
} else {
codonsCount.replace(codon, codonsCount.get(codon), codonsCount.get(codon) + 1);
}
String seq = translateDNAToProtein(codon);
calculateAminoAcids(seq, aminoAcidsCount);
if (firstCodon != null) {
calculateCodonPairs(firstCodon, codon, codonPairsCount);
seq = translateDNAToProtein(firstCodon.concat(codon));
calculateAminoAcids(seq, aminoAcidPairsCount);
}
firstCodon = codon;
}
}
static void calculateCodonPairs(String firstCodon, String secondCodon, HashMap<String, Long> codonsCount) {
String codonPair = firstCodon.concat(secondCodon);
if (!codonsCount.containsKey(codonPair)) {
codonsCount.put(codonPair, 1L);
} else {
codonsCount.replace(codonPair, codonsCount.get(codonPair), codonsCount.get(codonPair) + 1);
}
}
static String translateDNAToProtein(String sequence) throws IllegalSymbolException, IllegalAlphabetException {
SymbolList symL = DNATools.createDNA(sequence);
symL = DNATools.toRNA(symL);
symL = RNATools.translate(symL);
return symL.seqString();
}
static void calculateAminoAcids(String sequence, HashMap<String, Long> aminoAcidsCount) {
if (!aminoAcidsCount.containsKey(sequence)) {
aminoAcidsCount.put(sequence, 1L);
} else {
aminoAcidsCount.replace(sequence, aminoAcidsCount.get(sequence), aminoAcidsCount.get(sequence) + 1);
}
}
static boolean hasStartCodon(String sequence) {
if (sequence.length() < 3) {
return false;
}
List<String> startCodons = List.of("TTG", "ATG", "GTG");
String firstCodon = sequence.substring(0, 3);
return startCodons.contains(firstCodon);
}
static boolean hasStopCodon(String sequence) {
if (sequence.length() < 3) {
return false;
}
List<String> stopCodons = List.of("TAA", "TGA", "TAG");
String lastCodon = sequence.substring(sequence.length() - 3);
return stopCodons.contains(lastCodon);
}
static boolean isDivisibleByThree(String sequence) {
return sequence.length() % 3 == 0;
}
public static void main(String[] args) throws IOException, BioException {
String filename = "/home/yan/Genome/ecoli/ncbi_dataset/data/GCF_000005845.2/cds.fa";
BufferedInputStream is = new BufferedInputStream(new FileInputStream(filename));
Alphabet alpha = AlphabetManager.alphabetForName("DNA");
SequenceDB db = SeqIOTools.readFasta(is, alpha);
SequenceIterator it = db.sequenceIterator();
HashMap<String, Long> codonsCount = new HashMap<>();
HashMap<String, Long> codonPairsCount = new HashMap<>();
HashMap<String, Long> aminoAcidsCount = new HashMap<>();
HashMap<String, Long> aminoAcidsPairsCount = new HashMap<>();
while (it.hasNext()) {
Sequence sequence = it.nextSequence();
String seq = sequence.seqString().toUpperCase();
if (hasStartCodon(seq) && hasStopCodon(seq) && isDivisibleByThree(seq)) {
seq = seq.substring(0, seq.length() - 3); // remove Stop codon
calculateCodons(seq, codonsCount, codonPairsCount, aminoAcidsCount, aminoAcidsPairsCount);
}
}
String csvFileName = "/home/yan/Genome/ecoli/ncbi_dataset/data/GCF_000005845.2/cps.csv";
FileWriter fileWriter = new FileWriter(csvFileName);
CSVPrinter csvPrinter = new CSVPrinter(fileWriter, CSVFormat.DEFAULT);
csvPrinter.printRecord("AA pair", "Codon pair", "Expected", "Observed", "Observed/Expected", "CPS");
for (Map.Entry<String, Long> codonPairs : codonPairsCount.entrySet()) {
String aaPair = translateDNAToProtein(codonPairs.getKey());
long observed = codonPairs.getValue();
String firstCodon = codonPairs.getKey().substring(0, 3);
String secondCodon = codonPairs.getKey().substring(3);
double expected = (double) (codonsCount.get(firstCodon) * codonsCount.get(secondCodon) * aminoAcidsPairsCount.get(translateDNAToProtein(codonPairs.getKey()))) /
(aminoAcidsCount.get(translateDNAToProtein(firstCodon)) * aminoAcidsCount.get(translateDNAToProtein(secondCodon)));
double cps = Math.log10(observed / expected);
csvPrinter.printRecord(aaPair,codonPairs.getKey(), expected, observed, observed/expected, cps);
csvPrinter.flush();
}
}
}
Result (the first 10 records):

A solution for (B):
This task reminded me of predicting RNA secondary structure, which involves utilizing dynamic programming algorithm to search the minimum free energy. I’ve come across a relevant paper addressing this issue: COSMO: A dynamic programming algorithm for multicriteria codon optimization . Given the challenges, I developed a “random pick” algorithm as a solution to this problem.
1. Create random coding DNA with 100,000 codons using an online tool: Random Coding DNA .
2. Calculate the codon-pair bias using the following formula:
CPB is the arithmetic mean of the individual codon pair scores of all pairs making up the ORF. [1]
3. Calculate new codon-pair bias by randomly altering a codon (from the first codon in the sequence).
Full code:
static double calculateCPB(HashMap<String, Long> codonsCount, HashMap<String, Long> codonPairsCount, HashMap<String, Long> aminoAcidsCount, HashMap<String, Long> aminoAcidsPairsCount) throws IllegalAlphabetException, IllegalSymbolException {
double sum = 0d;
for (Map.Entry<String, Long> codonPairs : codonPairsCount.entrySet()) {
String firstCodon = codonPairs.getKey().substring(0, 3);
String secondCodon = codonPairs.getKey().substring(3);
double expected = (double) (codonsCount.get(firstCodon) * codonsCount.get(secondCodon) * aminoAcidsPairsCount.get(translateDNAToProtein(codonPairs.getKey()))) /
(aminoAcidsCount.get(translateDNAToProtein(firstCodon)) * aminoAcidsCount.get(translateDNAToProtein(secondCodon)));
long observed = codonPairs.getValue();
double cps = Math.log10(observed / expected);
sum += cps;
}
return sum / codonPairsCount.size();
}
static double calculateNewCPB(String sequence, double cpb, int retries, HashMap<String, String> mapCodonToAminoAcid)
throws IllegalAlphabetException, IllegalSymbolException {
String proteinSeq = translateDNAToProtein(sequence);
int aaCount = proteinSeq.length();
int i = 1;
double oldCpb = cpb;
double result = cpb;
do {
String protein = proteinSeq.substring(i, i + 1);
String codon = sequence.substring(i*3, i*3+3);
String randomCodon = mapCodonToAminoAcid.entrySet().stream()
.filter(a -> (a.getValue().equals(protein) && !a.getKey().equals(codon)) ||
(a.getValue().equals(protein) && isUnique(mapCodonToAminoAcid.values(), a.getValue())))
.findAny().get().getKey();
sequence = sequence.substring(0, i) + randomCodon + sequence.substring(i + 3);
HashMap<String, Long> codonsCount = new HashMap<>();
HashMap<String, Long> codonPairsCount = new HashMap<>();
HashMap<String, Long> aminoAcidsCount = new HashMap<>();
HashMap<String, Long> aminoAcidsPairsCount = new HashMap<>();
calculateCodons(sequence, codonsCount, codonPairsCount, aminoAcidsCount, aminoAcidsPairsCount, mapCodonToAminoAcid);
double newCpb = calculateCPB(codonsCount, codonPairsCount, aminoAcidsCount, aminoAcidsPairsCount);
i++;
if (newCpb >= oldCpb) {
retries--;
sequence = sequence.substring(0, i) + codon + sequence.substring(i + 3);
} else {
oldCpb = newCpb;
result = newCpb;
}
} while(i < aaCount && retries>0);
return result;
}
static boolean isUnique(Collection<String> collection, String aminoAcid) {
long count = collection.stream().filter(c -> c.equals(aminoAcid)).count();
return count == 1;
}
Test result:
old cps:-0.007466135979402607
new cps:-0.007469375711616729