Primer Pooler

This is a program I wrote for cancer researchers and others who want to use Multiplex PCR to study DNA samples, and wish to optimise their combinations of primers to minimise the formation of dimers.

Primer Pooler can:

If your CPU is modern enough to have them, Primer Pooler will take advantage of 64-bit registers and multiple cores. But it also runs on older equipment.

Download Primer Pooler

If your lab runs Windows, you'll want:

The 64-bit Mac version should work on most modern Macs (OS X 10.7 or newer).

For all other systems (GNU/Linux, older Mac, Solaris...) please compile from source (below).

Example primers file

If you need it, here is an example primers file which references the human genome (download hg38.2bit from UCSC). Primer names in the examples file have been changed (the lab wouldn't like the real names released before they publish, so I replaced all their labels with obscure hex codes).

Source code

To build from source, you will need: Download the source code, unpack it, and type make or make win-crosscompile

Usage

The easiest way to run Primer Pooler for first-time users is to run it interactively. To do this, simply launch the program file (pooler or pooler64), and it should ask you a series of questions to take you through what you want to do.

Questions asked by Primer Pooler when running interactively:

Would you like to run interactively? (y/n):
You should answer y to this question, otherwise Primer Pooler will merely display the command-line help (see below) and exit.
Please enter the name of the primers file to read.
As the program further explains, it is expecting a text file in multiple-sequence FASTA format, such as:
>toySet1-F
AGCTGCTGCTGCGATCT
>toySet1-R
GGCTGAGCGCTCAGTTT
>toySet2-F
ACGGCTTGACACCGTTCGACTG
>toySet2-R
CAGACGTTCAG
(this example does not represent real primers). Degenerate bases are allowed using the normal letters, and both upper and lower case is allowed. Names of amplicons' primers should end with F or R (for Forward and Reverse), and otherwise match. Optionally include tags to apply to all primers (also called tailed primers or barcoding) using >tagF and >tagR. If you also have Taq probes or other primers that don't themselves make amplicons, you can include these ending with other letters, e.g. >toySet1-P---any set of names differing in only the last letter will be kept in the same pool, but you must use F for forward and R or B for reverse (backward) if you also want to check primer-pairs for overlaps in the genome.
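The naming rule above can be sketched in a few lines of Python. This is only an illustration, not Primer Pooler's own parser: the function names `read_fasta` and `group_into_sets` are made up for this example, and it assumes that a set is identified by the primer name with its last letter removed.

```python
from collections import defaultdict

def read_fasta(text):
    """Parse multiple-sequence FASTA text into a {name: sequence} dict."""
    primers, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            primers[name] = ""
        elif name is not None:
            primers[name] += line.upper()   # upper and lower case are both accepted
    return primers

def group_into_sets(primers):
    """Group primers whose names differ only in the last letter."""
    sets = defaultdict(dict)
    for name, seq in primers.items():
        if name in ("tagF", "tagR"):        # tags apply to all primers, they are not a set
            continue
        sets[name[:-1]][name[-1]] = seq     # e.g. "toySet1-" -> {"F": ..., "R": ...}
    return dict(sets)

example = """>toySet1-F
AGCTGCTGCTGCGATCT
>toySet1-R
GGCTGAGCGCTCAGTTT
>toySet2-F
ACGGCTTGACACCGTTCGACTG
>toySet2-R
CAGACGTTCAG
"""
sets = group_into_sets(read_fasta(example))
for prefix in sorted(sets):
    print(prefix, sorted(sets[prefix]))
```

A probe named >toySet1-P would land in the same group as toySet1-F and toySet1-R, which is how it ends up in the same pool.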
Do you want to use deltaG? (y/n):
As the program explains, it will need to be told the temperature and concentration settings if you want it to use deltaG. Alternatively you can use the faster and simpler "score" method, but this is less accurate.
  • If you opt to use Score when your primers and/or tags are very long, you will be asked if you are really sure you don't want to use deltaG instead.
  • If you opt for deltaG, the following questions will be asked:
    Temperature:
    Enter a number (decimal fractions are allowed). You can enter it in Celsius, Kelvin, Fahrenheit or Rankine. Do not enter the suffix C or K or F or R---Primer Pooler will determine for itself which unit was meant, and ask you to confirm.
    Magnesium concentration in mM (0 for no correction):
    Enter your concentration of magnesium in millimoles per litre (decimal fractions are allowed). Enter 0 if you don't mind the deltaG figures not being corrected for magnesium concentration.
    Monovalent cation (e.g. sodium) concentration in mM:
    Enter your concentration of sodium etc in millimoles per litre (decimal fractions are allowed). If in doubt, try 50.
    dNTP concentration in mM (0 for no correction):
    Enter your concentration of deoxynucleotide (dNTP) in millimoles per litre (decimal fractions are allowed). Enter 0 if you don't mind the deltaG figures not being corrected for dNTP concentration.
Shall I count how many pairs have what score/deltaG range? (y/n):
Answer "y" if you want a fast summary of how many pairs of primers (in the entire collection, before pooling) have what range of interaction strengths. This could be used for example to check a pool that you have already chosen manually, or if you want a rough idea of the worst-case scenario that pooling aims to avoid.
  • If you answered yes to this question, the summary will be displayed on screen, and you will be asked if you also want to save it to a file. If you answer yes to this, you will be asked for a filename.
  • These up-front counts will include self-interactions (a primer interacting with itself), and interactions between the pair of primers in any given set. Self-interactions and in-set interactions are not counted when summarizing the counts of each pool (below).
Do you want to see the highest bonds of the whole file? (y/n):
Similar to the above question, this can be useful for checking a manual selection or for a rough idea. If you answer Yes, you will be asked for a deltaG or score threshold, and all interactions worse than that threshold will be displayed on-screen with bonds diagrams such as:
 5'-GGCTGAGCGCTCAGTTT-3'
    xx||||||||||||xx
3'-TTTGACTCGCGAGTCGG-5'
and you will then be asked if you wish to save it to a file, and, if so, what file name. You will then be asked if you would like to try another threshold.
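To show how such a diagram is read, here is a minimal Python sketch that redraws it. The function `bonds_diagram` and its `shift` parameter are hypothetical names for this illustration, it handles only plain A, C, G and T, and it is not how Primer Pooler itself produces the display. The first primer is written 5' to 3', the second is written backward (3' to 5') underneath, and each facing pair of bases is marked | if it bonds and x if it does not.

```python
PAIRS = {("A", "T"), ("T", "A"), ("C", "G"), ("G", "C")}

def bonds_diagram(top, bottom_5to3, shift):
    """Draw one primer 5'->3' above another primer written 3'->5'.

    shift is how many columns the top primer is moved to the right
    relative to the bottom one (a negative value moves it left).
    """
    bottom = bottom_5to3[::-1]              # write the second primer backward
    top_col = max(shift, 0)
    bot_col = max(-shift, 0)
    width = max(top_col + len(top), bot_col + len(bottom))
    marks = []
    for col in range(width):
        i, j = col - top_col, col - bot_col
        if 0 <= i < len(top) and 0 <= j < len(bottom):
            marks.append("|" if (top[i], bottom[j]) in PAIRS else "x")
        else:
            marks.append(" ")               # a base dangling off one end
    return "\n".join([
        " " * top_col + "5'-" + top + "-3'",
        " " * 3 + "".join(marks).rstrip(),
        " " * bot_col + "3'-" + bottom + "-5'",
    ])

primer = "GGCTGAGCGCTCAGTTT"                # toySet1-R from the toy primers file
print(bonds_diagram(primer, primer, 1))
```

With these arguments the sketch reproduces the diagram above, which is toySet1-R interacting with a second copy of itself.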
Shall I split this into pools for you? (y/n):
Most users will want to say y here, unless you merely wanted to check a batch of primers that you picked some other way. If you say No, Primer Pooler will forget about the primers at hand and ask you if you want to start the program again or exit.
Shall I check the amplicons for overlaps in the genome? (y/n):
If you answer yes to this, Primer Pooler will prompt you for a genome file, either in .2bit format as supplied by UCSC, or in .fa (FASTA) format.
To obtain a .2bit file from UCSC:
  1. Go to http://hgdownload.cse.ucsc.edu/downloads.html
  2. Choose a species (e.g. Human)
  3. Choose "Full data set"
  4. Scroll down to the links, and choose the one that ends .2bit (e.g. hg38.2bit)
It will then ask for a maximum amplicon length (in base pairs): this is the maximum length of the product---the number does not include the length of any tag sequences you have added to the primers. Then it will scan through the genome data to detect where your amplicons start and finish, and which ones overlap.
  • After the overlap scan is complete, Primer Pooler will then have enough data to write an input file for MultiPLX if you wish to run that software as well for comparison. If you decline this, it will ask if you want it to write a simple text file with the locations of all amplicons, which you may accept or decline.
  • If you do not opt to check for overlaps in the genome, then Primer Pooler will not take overlaps into account when generating its pools. This is rarely useful unless you have already ensured there are no overlaps in the set of amplicons under consideration. Even then, I would recommend performing a scan anyway, just to double-check: an early version found 11 overlaps in a supposedly overlap-free batch drawn up by an experienced academic---we all make mistakes. But bypassing the overlap check might be useful if you are sure there are no overlaps and you don't want to download a very large genome file to the workstation you're using.
How many pools?
Enter a number of pools. Before answering this question, you will be given a "computer suggestion", which is the approximate lowest number of pools needed to achieve no worse than a deltaG of -7 (or a score of 7) in each. If you're not sure how many pools, just pick a number and see. You will be allowed to come back to this question later and try a different number if you weren't happy with the result.
Do you want to set a maximum size of each pool? (y/n):
As the program explains, setting a maximum size of each pool can make the pools more even. If you decide to set a maximum, you will be asked to set the maximum number of primer-sets in each pool. Before answering this question you will be given a computer suggestion and a lower limit.

You will not be allowed to set the maximum size of each pool lower than the average size of each pool, since that would make it logically impossible to fit all primer-sets into the pools. It is not advisable to set it just above the average either, since being overly strict about the evenness of the pools could hinder Primer Pooler from finding a solution with lower dimer formation. You might want to experiment with different maxima---you will be able to come back to this question and try again.
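The arithmetic behind the lower limit is simple enough to sketch. The helper below is hypothetical, and it assumes that the limit is the average pool size rounded up to a whole number of primer-sets; the exact figure Primer Pooler displays may be computed differently.

```python
import math

def lowest_allowed_maximum(num_sets, num_pools):
    """Smallest per-pool maximum that still lets every primer-set fit somewhere."""
    return math.ceil(num_sets / num_pools)

# 10 primer-sets in 3 pools: a maximum of 3 would leave room for only 9 sets
print(lowest_allowed_maximum(10, 3))
```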

Do you want to give me a time limit? (y/n):
If you answer y, you will be asked to set a time limit in minutes. Normally 1 or 2 is enough, although you may wish to let it run a long time to see if it can find better solutions. You don't have to set a time limit: you may manually interrupt the pooling process at any time and have it give the best solution it has found so far, whether a time limit is in place or not. Additionally, Primer Pooler will stop automatically when it detects better solutions are unlikely to be found.
Do you want my "random" choices to be 100% reproducible for demonstrations? (y/n):
If you answer y, Primer Pooler's random choices will be generated in a way that merely looks random but is in fact completely reproducible. This is useful for demonstration purposes---you'll know how long it will take to find the solution you want. Otherwise, the random choices will be less predictable, as a different sequence will be chosen depending on the exact time at which the pooling was started.
Pooling display
While pooling is in progress, Primer Pooler will periodically display a brief summary of the best solution found so far, showing the pool sizes, and the counts of interactions (by deltaG range or score) within each pool. As instructed on screen, you may press Ctrl-C (i.e. hold down Ctrl while pressing and releasing C, then release Ctrl) to cancel further exploration and use the best solution found so far.
Do you want to see the statistics of each pool? (y/n):
After the pooling is complete, or after you have interrupted it (by pressing Ctrl-C as instructed on screen), you will be asked if you wish to see the interaction counts of each pool (rather than a simple summary of all pools as appeared during pooling). If you want this, you will also be asked if you wish to save them to a file, and, if so, what file name.
Do you want to see the highest bonds of these pools? (y/n):
If you answer Yes, you will be asked for a deltaG or score threshold, and all interactions worse than that threshold will be displayed on-screen with bonds diagrams such as:
 5'-GGCTGAGCGCTCAGTTT-3'
    xx||||||||||||xx
3'-TTTGACTCGCGAGTCGG-5'
and you will then be asked if you wish to save it to a file, and, if so, what file name. You will then be asked if you would like to try another threshold.
Shall I write each pool to a different result file? (y/n):
If you answer y to this, you will be asked for a prefix, which will be used to name the individual results files. Otherwise, you will be asked if you wish to save all results to a single file. If you decline saving all results to a single file, the results will not be saved at all---this is for when you weren't happy with the solution and want to go back to try a different number of pools or a different maximum pool size.
Do you want to try a different number of pools? (y/n):
This question is self-explanatory. You can go back as many times as you like, trying different numbers of pools. But many researchers have a pretty good idea of how many pools they want to use, or else are happy with the computer's initial suggestion.
Would you like another go? (y/n):
If you answered No to trying a different number of pools, or if you didn't want the program to do pooling at all, then you will be asked if you want to start the program again. Answering No to this question will exit.

Command-line usage

Besides running interactively (see above), it is also possible to run Primer Pooler with command-line arguments. This section assumes familiarity with the concept of running programs from the command line.

The only mandatory argument (if not running interactively) is a filename for the primers file. This should be a text file in multiple-sequence FASTA format, such as:

>toySet1-F
AGCTGCTGCTGCGATCT
>toySet1-R
GGCTGAGCGCTCAGTTT
>toySet2-F
ACGGCTTGACACCGTTCGACTG
>toySet2-R
CAGACGTTCAG
(this example does not represent real primers). Degenerate bases are allowed using the normal letters, and both upper and lower case is allowed. Names of amplicons' primers should end with F or R, and otherwise match. Optionally include tags (tails, barcoding) to apply to all primers: >tagF and >tagR.

Processing options should be placed before this filename. Options are as follows:

--help or /help or /?
Show a brief help message and exit.
--counts
Show score or deltaG-range pair counts for the whole input. deltaG will be used if the --dg option is set (see below). This option produces a fast summary of how many primer pairs (in the entire collection, before pooling) have what range of interaction strengths. This could be used for example to check a pool that you have already chosen manually, or if you want a rough idea of the worst-case scenario that pooling aims to avoid.
--self-omit
Causes the --counts option to avoid counting self-interactions (a primer interacting with itself), and interactions between the pair of primers in any given set.
--print-bonds=THRESHOLD
Similar to --counts, this can be useful for checking a manual selection or for a rough idea. All interactions worse than the given threshold (deltaG if --dg is in use, otherwise score) will be written to standard output, with bonds diagrams.
--dg[=temperature[,mg[,cation[,dNTP]]]]
Set this option to use deltaG instead of score. Optional parameters are the temperature (default is human blood heat), the concentration of magnesium (default 0), the concentration of monovalent cation (e.g. sodium, default 50), and the concentration of deoxynucleotide (dNTP, default 0). Decimal fractions are allowed in all of these. Temperature is specified in kelvin, and all concentrations are specified in millimoles per litre (mM).
--suggest-pools
Outputs a suggested number of pools. This is the approximate lowest number of pools needed to achieve no worse than a deltaG of -7 (or a score of 7) in each.
--pools[=NUM[,MINS[,PREFIX]]]
Splits the primers into pools. Optional parameters are the number of pools (if omitted or set to ? then the suggested number will be calculated and used), a time limit in minutes, and a prefix for the filenames of each pool (set this to - to write all to standard output).
--max-count=NUM
Set the maximum number of pairs per pool. This is optional but can make the pools more even. A maximum lower than the average is not allowed, and it's usually best to allow a generous margin above the average.
--genome=PATH
Check the amplicons for overlaps in the genome, and avoid these overlaps during pooling. The genome file may be in .2bit format as supplied by UCSC, or in .fa (FASTA) format.
--amp-max=LENGTH
Sets maximum amplicon length for the overlap check. The default is 220.
--multiplx=FILE
Write a MultiPLX input file after the --genome stage, to assist comparisons with MultiPLX's pooling etc.
--seedless
Don't seed the random number generator, so that the "random" choices made during pooling are reproducible from one run to the next.
--version
Just show the program version number and exit.
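If you drive Primer Pooler from a script, the options above can be assembled into an argument list. The sketch below is a hypothetical helper, not part of Primer Pooler: it only builds the list (with every option placed before the primers filename) and does not run anything. The executable name pooler comes from the Usage section; the file names primers.fa and pool, and the temperature of 330 kelvin, are invented for the example.

```python
import shlex

def pooler_command(primers_file, executable="pooler", dg=None, pools=None,
                   genome=None, amp_max=None, max_count=None, seedless=False):
    """Return an argument list with all options placed before the primers file."""
    def joined(values):
        return ",".join(str(v) for v in values)

    args = [executable]
    if dg is not None:          # (temperature in K, Mg mM, cation mM, dNTP mM); may be shorter
        args.append("--dg=" + joined(dg) if dg else "--dg")
    if genome is not None:
        args.append("--genome=" + genome)
    if amp_max is not None:
        args.append("--amp-max=%d" % amp_max)
    if max_count is not None:
        args.append("--max-count=%d" % max_count)
    if pools is not None:       # (NUM, MINS, PREFIX); may be shorter
        args.append("--pools=" + joined(pools) if pools else "--pools")
    if seedless:
        args.append("--seedless")
    args.append(primers_file)   # the primers filename goes after the options
    return args

cmd = pooler_command("primers.fa", dg=(330, 0, 50, 0), genome="hg38.2bit",
                     pools=(4, 2, "pool"), seedless=True)
print(shlex.join(cmd))          # shlex.join needs Python 3.8 or newer
```

The resulting list could be handed to subprocess.run, but check it against the program's own --help output first, since that is the authority on what your version accepts.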

Changes

Defects fixed

A defective "Version 1.0" was on this site for only 2 days, but I have no access to the download logs so I have no idea if anybody got it. If you did, I strongly recommend re-downloading the current version and re-running your calculation, because Version 1.0 had important bugs that can affect results:
  1. an error in incremental-update logic sometimes had the effect of generating suboptimal solutions (in particular, pools could be unnecessarily empty, and/or full beyond any limit that was set);
  2. an error in the user-interface loop meant that if you use tags, run interactively, and answer "yes" to the question "Do you want to try a different number of pools", the second run will have been done without the tags, and its results will have been de-tagged twice, removing some bases from the output; moreover, the resulting truncated versions of your primers will have made it into the interaction calculations for any third run.
These bugs have now been fixed. In addition, Versions 1.1 through 1.13 had a bug related to the second fix, which would cause interaction-checking for pooling purposes to be performed without tags when running in interactive mode (command-line mode was not affected). I therefore recommend re-running in the latest version.

Versions prior to 1.17 also had a display bug: the concentrations for the deltaG calculation are in millimoles per litre, not nanomoles as stated on-screen in interactive mode (please ignore the on-screen instruction and enter millimoles, or upgrade to the latest version which fixes that instruction).

Versions prior to 1.34 would round down any decimal fraction you type when in interactive mode (for deltaG temperature, concentration and threshold settings). Internal calculation and command-line use were not affected by this bug.

Versions prior to 1.37 did not ignore whitespace characters after FASTA labels.

Notable additions

Version 1.2 added the MultiPLX output option, and Version 1.33 fixed a bug when MultiPLX output was used with tags and multiple chromosomes. Version 1.3 added genome reading from FASTA (not just 2bit), auto-open browser, and suggest number of pools.

Version 1.36 clarified the use of Taq probes, and allowed these to be in the input file during the overlap check. It's consequently stricter about the requirement that reverse primers must end with R or B: previous versions would accept any letter other than F for these.

Glossary

Base
The nitrogenous base part of a nucleotide in a DNA sequence, represented by A, C, G or T. Informally, "base" can also be used to refer to the entire nucleotide.
Complement
What the base binds with. T binds with A and C binds with G. Complementing a sequence means swapping A for T and C for G throughout.
Degenerate base
A base we're not sure about because of genetic variation in a population. We can use extra letters to specify which bases are allowable.
IUPAC/IUBMB degenerate-base codes
K   G or T
Y   C or T
S   C or G
W   A or T
R   A or G
M   A or C
B   any except A
D   any except C
H   any except G
V   any except T
N   any
Primer or Oligo
A short string of bases (actually nucleotides) that's used to start copying from the strand of DNA we're testing. The primer matches up with the start of a section of DNA we want to copy. There are also extra structures at the two ends of the primer that set its direction: these are written as 5' (for the phosphate start) and 3' (for the hydroxyl end). The actual copying occurs from the complementary strand but we can ignore this. Primers are special cases of molecules called oligonucleotides.
Degenerate primer
A primer that has one or more degenerate bases. In practice, this means we manufacture separate primers for each combination of allowable bases and mix them together. So we have to make worst-case assumptions about these when checking for dimers or overlaps.
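As an illustration of "each combination of allowable bases", the following Python sketch expands a degenerate primer using the IUPAC/IUBMB codes listed above. The function name `expand_degenerate` is made up for this example, and Primer Pooler itself need not enumerate the combinations this way.

```python
from itertools import product

# IUPAC/IUBMB codes: each letter maps to the bases it allows
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "K": "GT", "Y": "CT", "S": "CG", "W": "AT", "R": "AG", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def expand_degenerate(primer):
    """List every concrete primer that a degenerate primer stands for."""
    choices = [IUPAC[base] for base in primer.upper()]
    return ["".join(combo) for combo in product(*choices)]

print(expand_degenerate("ACRTY"))   # ['ACATC', 'ACATT', 'ACGTC', 'ACGTT']
```

The number of combinations multiplies with every degenerate base (a primer with five N bases already stands for 1024 sequences), which is why large numbers of degenerate bases slow down the calculation.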
Amplicon
A section of the DNA we're interested in amplifying (producing copies of). Primers are designed to copy it.
Primer set
Two primers, corresponding to the start and end of an amplicon. They must be kept in the same pool. Sometimes called a "primer pair", but this might be confused with the two participants of a dimer (below) so I think "set" is better. The two primers in a set are called "forward" and "reverse" primers, but the reverse primer is not a backward copy of the forward one---if you're reading my code, you have to be aware of the distinction between backward, which is just a flipped-over copy of any sequence, and reverse, which is the second primer of a set. With assistance from an enzyme called polymerase, the forward primer begins copying from the start of the amplicon, while the reverse primer begins from the end of the amplicon. Although these initial copies continue for an indeterminate number of bases (probably not the whole chromosome, but longer than the region we want), the second cycle will apply the forward primer to the 'end' section of what the reverse primer produced, and conversely the reverse primer to the 'start' section of what the forward primer produced, in both cases resulting in exactly the amplicon we want (which is then reduplicated in subsequent cycles).
Negative strand
The complement of the normal (positive) sequence in the genome. If a primer is designed to match the negative strand then you need to complement it and read it backwards to match the (positive) genome data. In a set, one of the two primers will be a negative-strand primer, but the primer file won't tell us which one (it's not necessarily the "reverse" primer: when a chromosome has a gene on its negative strand, primers are typically labelled in the other direction so we'll see the "reverse" primer on the positive strand followed by the "forward" primer on the negative). You can't put both primers on the same strand because collisions would occur during copying.
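"Complement it and read it backwards" is the operation usually called the reverse complement. A minimal Python sketch, with a made-up function name and handling only the four plain bases (degenerate codes would need a larger translation table):

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Complement every base, then read the result backwards."""
    return seq.translate(COMPLEMENT)[::-1]

print(reverse_complement("AACG"))   # CGTT
```

Applying the function twice gives back the original sequence, which is a handy sanity check.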
Pool or Subpool or Group or Tube or Primer set combination (PSC)
A bunch of primer-sets all drifting around in the same mixture. When that mixture is added to some of our sample of DNA, the amplicons whose primer-sets are in that pool are copied (amplified) so we can measure them. If we can reduce the number of different pools we need, we can finish the testing more quickly and use up less of the sample, but on the other hand we want to avoid combinations that overlap or form dimers.
Overlap
Two primer-sets that access overlapping sections of the genome. If they are placed in the same pool, an unwanted shorter amplicon is produced. Consider the following toy example:
....1..2..3..4....
    A-----B
       C-----D
       C--B
If primers A and B are designed to obtain an amplicon from position 1 to 3, and C and D are designed to obtain an amplicon from 2 to 4, then placing them in the same pool will result in excessive pairings between C and B, producing a short amplicon from 2 to 3 at the expense of the other two. This is very bad news and we have to pick our pools to avoid it.
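Once the start and end position of each amplicon is known, spotting overlaps is an interval comparison. The sketch below uses the toy positions from the example above plus one invented extra amplicon EF; the function name and the data layout (one chromosome, a name mapped to a start and end position) are assumptions for illustration only.

```python
from itertools import combinations

def overlapping_pairs(amplicons):
    """amplicons: {name: (start, end)} on one chromosome, with start <= end."""
    result = []
    for (a, (s1, e1)), (b, (s2, e2)) in combinations(sorted(amplicons.items()), 2):
        if s1 <= e2 and s2 <= e1:       # the two ranges share at least one position
            result.append((a, b))
    return result

# AB spans 1 to 3 and CD spans 2 to 4, as in the toy example; EF is elsewhere
print(overlapping_pairs({"AB": (1, 3), "CD": (2, 4), "EF": (5, 6)}))   # [('AB', 'CD')]
```

Any pair reported here must go into different pools.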
Dimer
Two primers stuck to each other. This is bad news because, if they're stuck to each other, they're not helping us test the sample. But a dimer is not as bad as an overlap: just because two primers can form a dimer doesn't mean they will, and the experiment might run anyway on the fraction of primers that didn't get stuck. But it's better if each pool can have a combination of primers that tends to produce as few dimers as possible.
Score
A number that gives a rough idea of how likely it is that two primers will make a dimer. It's just the number of bases that bond, minus the number of bases that don't, and ignoring any bases that are left dangling off either end. This is repeated for all positions and the worst case is taken.
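The description above translates almost directly into code. This Python sketch is a straightforward, slow reading of that description, not Primer Pooler's bit-pattern implementation; the function name is made up, both primers are given 5' to 3', and degenerate bases are not handled.

```python
PAIRS = {("A", "T"), ("T", "A"), ("C", "G"), ("G", "C")}

def score(primer1, primer2):
    """Highest (bonding bases minus non-bonding bases) over all alignments."""
    top = primer1.upper()
    bottom = primer2.upper()[::-1]          # second primer written 3'->5'
    best = None
    # try every position in which at least one base of each primer faces the other
    for shift in range(-(len(top) - 1), len(bottom)):
        total = 0
        for i, base in enumerate(top):
            j = i + shift
            if 0 <= j < len(bottom):        # dangling bases are ignored
                total += 1 if (base, bottom[j]) in PAIRS else -1
        best = total if best is None else max(best, total)
    return best

print(score("AAAA", "TTTT"))                # 4: every base bonds
print(score("GGCTGAGCGCTCAGTTT", "GGCTGAGCGCTCAGTTT"))
```

In the bonds diagrams shown earlier, 12 bases bond and 4 do not, so that alignment alone contributes 12 - 4 = 8 to the search for the worst case.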
Delta G (dG)
The change in Gibbs free energy when two primers make a dimer. The more negative this is, the more likely dimers will form. This thermodynamics calculation gives better results than score, while being only a little slower (unless you have ridiculous numbers of degenerate bases). It does need to know the temperature and amounts of various chemicals, but if you don't know these, the defaults should still be reasonable for comparisons.
Genome
All the DNA in the cell (most species have hundreds of megabytes at the very least). We need data about the whole genome to work out which amplicons will overlap. If some parts are still unknown, we ignore those and hope for the best.
Tag or index sequence or barcode or tail
A constant set of extra bases added to the beginning (5'---actually the end on the complementary strand) of every forward or reverse primer. This is used for fishing the results out of the pool. If you tell Primer Pooler what tags you are using, it takes them into account when checking for dimers, while ignoring them when checking the genome for amplicon overlaps.
Efficiency
The rate at which amplicons are copied, as a fraction of the ideal rate. Particularly important in quantitative PCR (qPCR) as you need to know the copy rate for the final counts to be meaningful. Efficiency is improved with dimer reduction, but it can also depend on manufacturing quality and equipment quality, so each batch needs to be checked experimentally.
Massive(ly) parallel sequencing or next-generation sequencing or second-generation sequencing or high-throughput sequencing
Base-by-base reading of thousands of short sections of a genome in parallel. Less expensive machines in smaller labs typically need the relevant sections of the genome to be amplified first. If a reference copy of the genome has already been sequenced and we want to re-sequence specific sections to check them for alterations, then we can use multiplex PCR to pull out these sections. This may involve dealing with far more amplicons than is the case with PCR for detecting or counting genes.
AutoDimer
A 2004 program to check a single pool for dimers. AutoDimer was coded in Visual Basic 6 and its dimer search is several thousand times slower than Primer Pooler's; re-pooling must be done manually, as must the handling of degenerate bases.
Thresholding
A simple and fast way of grouping primer sets: "don't add a set to a pool if the interaction badness would exceed some threshold" (usually dG<-7 or overlap). The total number of pools required is discovered by the computer, not chosen by the user. Primer Pooler uses thresholding to suggest a number of pools, but allows the user to override it for minimisation.
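A minimal sketch of thresholding in Python, assuming a simple first-fit strategy: each primer set goes into the first pool where it clashes with nothing, and a new pool is opened when there is none. The function name and the `bad` callback are hypothetical, and the way Primer Pooler arrives at its suggested number of pools may differ in detail.

```python
def threshold_pools(sets, bad):
    """First-fit thresholding.

    sets: primer-set names in the order they should be placed.
    bad(a, b): True if sets a and b must not share a pool
               (e.g. their dG is below -7, or their amplicons overlap).
    """
    pools = []
    for s in sets:
        for pool in pools:
            if not any(bad(s, other) for other in pool):
                pool.append(s)
                break
        else:                       # no existing pool was acceptable
            pools.append([s])
    return pools

# Invented clashes between four sets, purely for illustration
clashes = {frozenset(p) for p in [("s1", "s2"), ("s2", "s3"), ("s1", "s3"), ("s1", "s4")]}
pools = threshold_pools(["s1", "s2", "s3", "s4"],
                        lambda a, b: frozenset((a, b)) in clashes)
print(len(pools), pools)            # 3 [['s1'], ['s2', 's4'], ['s3']]
```

Note that the number of pools is an output here rather than an input, which is the difference from minimisation.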
Minimisation
Method used by Primer Pooler to group primer sets into a user-specified number of pools, seeking to minimise the interactions within each pool.
MPprimer
A 2009 GPLd Perl+Python program for finding optimal PSCs by thresholding. Slower than our C bit-patterns code and cannot cope with degenerate primers.
MultiPLX
A 2004 C++ program for grouping primer-sets by thresholding. No overlap checking: you are expected to divide the batches yourself and run them separately. MultiPLX can score on differences between melting temperatures, and also on unwanted extra interactions between primer and product-amplicon (which isn't normally a concern when large numbers of primers are involved); its interaction calculations are slower than ours and it makes up for this by giving you the option of not checking for every kind of interaction. Primer Pooler has an option to output your primers and their products (after genome search) in MultiPLX's input format if you wish to compare with MultiPLX's scoring.
Bit patterns
A computer programming technique that involves writing information about different items into different binary digits of the same number, loading that number into the computer's calculation circuitry, and getting it to do something to all its digits in one operation, thus processing many items together. This is even more effective on newer CPUs, because their wider registers can take even more digits at a time. Primer Pooler uses bit-pattern techniques for its bonding calculations.
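The idea can be shown with a small Python sketch. It assumes one particular encoding (one integer per base letter, with bit i set where the sequence has that base at position i), which is only an illustration and not necessarily the encoding Primer Pooler uses. Python integers are not CPU registers, so this shows the logic rather than the speed.

```python
def masks(seq):
    """One integer per base; bit i is set where seq[i] is that base."""
    m = {"A": 0, "C": 0, "G": 0, "T": 0}
    for i, base in enumerate(seq.upper()):
        m[base] |= 1 << i
    return m

def shifted(value, shift):
    """Slide a bit pattern so that bit (i + shift) lines up with bit i."""
    return value >> shift if shift >= 0 else value << -shift

def bond_count(top, bottom, shift):
    """Count A-T and C-G bonds where top[i] faces bottom[i + shift]."""
    t, b = masks(top), masks(bottom)
    bonds = 0
    for x, y in (("A", "T"), ("T", "A"), ("C", "G"), ("G", "C")):
        bonds |= t[x] & shifted(b[y], shift)   # all positions tested in one AND
    return bin(bonds).count("1")               # number of set bits

primer = "GGCTGAGCGCTCAGTTT"
print(bond_count(primer, primer[::-1], 1))     # 12, as in the bonds diagrams
```

Each AND compares every position of the two primers at once; in C the same operations act on whole registers, and the count of set bits can be done by a single CPU instruction where available.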
C compiler
A computer program that takes something written in the C programming language and converts it into machine code that the CPU can run quickly. Modern C compilers can be frighteningly good at this, so a well-written C program can easily outpace what can be done in more "beginner-friendly" languages. This doesn't usually matter if you just want to show things on the screen and wait for input, but you will notice the difference when big calculations are involved.
C++
A computer language inspired by C but with many extra features which, if used well, can make programs easier to manage. In theory, well-written C++ can equal the speed of well-written C. In practice there can be problems with some C++ compilers. Since I was handling register-level bit patterns and builtins for specific CPU opcodes, I decided not to risk it and stick with C even though I could have done it in C++.
Command line
A way of interacting with the computer that involves typing commands on the keyboard and seeing the computer's response written below. It might not look as nice as a modern graphical desktop, but it can be quite efficient when you get used to it; moreover, if you're writing in C then the command line tends to be the easiest interface to write for, freeing up the programmer to concentrate on the calculation part instead of having to spend all their time making it look pretty. Sometimes another programmer who specialises in pretty front-ends will come along later and add one. (I'm more of a "back-end" than a "front-end" programmer.)
CRISPR
Naturally occurring DNA fragments in unicellular immune systems that have been repurposed for genetic engineering. Widely hailed as the "next big thing" after PCR, but doesn't replace PCR: PCR is about reading/testing, while CRISPR is about writing/changing like a Unix sed command---you script the edits but don't see them happen.

Citation

Silas S. Brown, Yun-Wen Chen, Ming Wang, Alexandra Clipson, Eguzkine Ochoa, and Ming-Qing Du (2017). PrimerPooler: automated primer pooling to prepare library for targeted sequencing. Biology Methods and Protocols. Oxford University Press. 2(1). doi:10.1093/biomethods/bpx006 

Thanks

I've lost track of how many giants I've stood on the shoulders of for this, but they include:

License

Primer Pooler is free software licensed under the GNU General Public License, version 3 or later. (Contact me if you need a different license.)
All material © Silas S. Brown unless otherwise stated.