Introduction
The RNAblueprint library stochastically samples RNA/DNA sequences that are compatible with multiple structural constraints. It only creates sequences that fulfill all base pairs specified in each of the given input structures. Furthermore, sequence constraints can be specified in IUPAC notation. Solutions are sampled uniformly from the whole solution space, so there is no bias towards certain sequences.
The library is written in C++ with SWIG scripting interfaces for Python and Perl. Please cite the software as specified at the bottom of the page!
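To make the notion of a compatible sequence concrete, the following brute-force sketch enumerates every sequence that fulfills all base pairs of several toy structures plus an IUPAC constraint, and then draws one of them uniformly at random. It is only an illustration for very short inputs and says nothing about how RNAblueprint samples internally. The helper names (`base_pairs`, `compatible_sequences`) and the reduced IUPAC table are assumptions made for this example.

```python
import itertools
import random

# Allowed RNA base pairs, including the G-U wobble pair.
PAIRS = {"AU", "UA", "GC", "CG", "GU", "UG"}
# Subset of the IUPAC nucleotide codes, sufficient for this sketch.
IUPAC = {"A": "A", "C": "C", "G": "G", "U": "U",
         "R": "AG", "Y": "CU", "S": "CG", "W": "AU",
         "K": "GU", "M": "AC", "N": "ACGU"}


def base_pairs(structure):
    """Return the (i, j) index pairs of a dot-bracket string."""
    stack, pairs = [], []
    for i, char in enumerate(structure):
        if char == "(":
            stack.append(i)
        elif char == ")":
            pairs.append((stack.pop(), i))
    return pairs


def compatible_sequences(structures, constraint):
    """Enumerate all sequences that fulfill every base pair and the constraint."""
    pairs = [p for s in structures for p in base_pairs(s)]
    choices = [IUPAC[c] for c in constraint]
    return ["".join(seq) for seq in itertools.product(*choices)
            if all(seq[i] + seq[j] in PAIRS for i, j in pairs)]


# Two toy structures of length 3 and an unconstrained sequence (all N).
solutions = compatible_sequences(["(.)", "()."], "NNN")
print(len(solutions))            # size of the solution space
print(random.choice(solutions))  # one uniformly drawn solution
```

Enumeration grows as 4^n and is therefore only usable for toy inputs, which is exactly the limitation a dedicated sampling library is meant to avoid.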
Dependencies
Required:
- GNU Automake
- Boost Graph Library
- C++ Standard Library
Optional:
- Boost Program Options (default: on)
- SWIG for interfaces (default: on)
- Python for interface (default: on)
- Perl for interface (default: on)
- ExtUtils::Embed module for Perl interface (default: on)
- Doxygen for documentation
- LaTeX for PDF documentation
- libGMP for multiprecision integers
- Boost Unit Test Framework
Installation
Just call these commands:
./autogen.sh
./configure
make
make install
In case of a local installation, please do not forget to adapt your path variables such as PATH, LD_LIBRARY_PATH, CPLUS_INCLUDE_PATH, PYTHONPATH, and PERL5LIB.
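After a local installation, a quick way to confirm that PYTHONPATH is set up correctly is to ask Python whether it can locate the module. The snippet below is a minimal sketch using only the standard library; the function name `module_available` is made up for this example.

```python
import importlib.util


def module_available(name):
    """Return True if Python can locate the named top-level module."""
    return importlib.util.find_spec(name) is not None


if module_available("RNAblueprint"):
    print("RNAblueprint Python interface found")
else:
    print("RNAblueprint not found; check that PYTHONPATH includes the install location")
```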
The most important configure options are:
- --prefix Specify an installation path prefix
- --with-boost Specify the installation directory of the boost library
- --disable-program Disable RNAblueprint program compilation
- --disable-swig Disable all SWIG scripting interfaces
- --enable-libGMP Enable the calculation of big numbers with multiprecision
TIP: You might want to call ./configure --help for all install options!
Interface Examples
Python example
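import RNAblueprint as rd
# define structures and construct the dependency graph
structures = ['(((((....)))))', '(((....)))....']
dg = rd.DependencyGraphMT(structures)
# print the initially sampled sequence
print(dg.get_sequence())
# sample 1000 neighbors of the initial sequence
for i in range(0, 1000):
    dg.sample_clocal()
    print(dg.get_sequence())
    # revert to the previous sequence
    dg.revert_sequence()
print('Maximal number of solutions: ' + str(dg.number_of_sequences()))
print('Number of Connected Components: ' + str(dg.number_of_connected_components()))
# make a deep copy of the dependency graph
dg1 = rd.DependencyGraphMT(dg)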
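The number of connected components printed above refers to a graph whose vertices are sequence positions and whose edges are the base pairs of all input structures. The sketch below rebuilds such a graph for the two example structures in plain Python and counts its components with a small union-find. It illustrates the concept only and is not the library's implementation; the function name `count_components` is made up for this example, and unpaired positions are counted as components of size one here.

```python
def count_components(structures):
    """Count connected components of the graph induced by all base pairs."""
    n = len(structures[0])
    parent = list(range(n))  # union-find forest, one vertex per position

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for structure in structures:
        stack = []
        for i, char in enumerate(structure):
            if char == "(":
                stack.append(i)
            elif char == ")":
                # position i and its pairing partner depend on each other
                parent[find(stack.pop())] = find(i)

    return len({find(i) for i in range(n)})


print(count_components(['(((((....)))))', '(((....)))....']))
```

Positions in the same component cannot be chosen independently, which is why resampling operates on whole components rather than on single bases.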
Perl example
#!/usr/bin/perl
# This script is an example implementation on how to use the Perl
# interface. It generates 1000 neighbors of an initially sampled
# random sequence.
use RNAblueprint;
# define structures
@structures = ['(((((....)))))', '(((....)))....'];
# construct dependency graph with these structures
$dg = new RNAblueprint::DependencyGraphMT(@structures);
# print this sequence
print $dg->get_sequence()."\n";
# resample a connected component 1000 times and print
for($i=0; $i<1000; $i++) {
    $dg->sample_clocal();
    print $dg->get_sequence()."\n";
    # revert to the previous sequence
    $dg->revert_sequence();
}
# print the amount of solutions
print 'Maximal number of solutions: '.$dg->number_of_sequences()."\n";
# print the amount of connected components
print 'Number of Connected Components: '.$dg->number_of_connected_components()."\n";
# make a deep copy of the dependency graph
$dg1 = new RNAblueprint::DependencyGraphMT($dg);
C++ example
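#include <cmath>
#include <cstdlib>
#include <vector>
#include <string>
#include <iostream>
#include <exception>
#include <random>
// Assumption: the library header is RNAblueprint.h
#include "RNAblueprint.h"
extern "C" {
    #include "ViennaRNA/fold.h"
    #include "ViennaRNA/part_func.h"
}
// std::string wrappers around the ViennaRNA C functions of the same name
float energy_of_structure(std::string& sequence, std::string& structure) {
    float energy = energy_of_structure(sequence.c_str(), structure.c_str(), 0);
    return energy;
}
float fold(std::string& sequence, std::string& structure) {
    char* structure_cstr = new char[sequence.length()+1];
    float energy = fold(sequence.c_str(), structure_cstr);
    structure = structure_cstr;
    delete[] structure_cstr;
    return energy;
}
float pf_fold(std::string& sequence, std::string& structure) {
    char* structure_cstr = new char[sequence.size()+1];
    float energy = pf_fold(sequence.c_str(), structure_cstr);
    structure = structure_cstr;
    delete[] structure_cstr;
    return energy;
}
float objective_function(std::string& sequence, std::vector<std::string>& structures) {
    // number of structures as float to avoid integer division below
    float M = structures.size();
    // energy of the sequence in each target structure
    std::vector<float> eos;
    for (auto& s : structures) {
        eos.push_back(energy_of_structure(sequence, s));
    }
    // ensemble free energy from the partition function
    std::string pf_fold_struct;
    float gibbs = pf_fold(sequence, pf_fold_struct);
    // sum of pairwise energy differences between the target structures
    float objective_difference_part = 0.0;
    for (unsigned int i=0; i < eos.size(); i++) {
        for (unsigned int j=i+1; j < eos.size(); j++) {
            objective_difference_part += std::abs(eos[i] - eos[j]);
        }
    }
    float eos_sum = 0;
    for (float e : eos)
        eos_sum += e;
    return 1/M * (eos_sum - M * gibbs) + 0.5 * 2/(M * (M-1)) * objective_difference_part;
}
int main () {
    std::vector<std::string> structures;
    std::cout << "Input structures in dot-bracket (end with empty line): " << std::endl;
    while (true) {
        std::string structure;
        std::getline(std::cin, structure);
        if (structure.empty())
            break;
        else
            structures.push_back(structure);
    }
    // Assumption: the dependency graph is the class template
    // design::DependencyGraph<R>, constructed from the structures, a
    // sequence constraint string (empty here) and a random number generator.
    std::string constraints = "";
    std::mt19937 rand_gen(std::random_device{}());
    design::DependencyGraph<std::mt19937>* dependency_graph;
    try {
        dependency_graph = new design::DependencyGraph<std::mt19937>(structures, constraints, rand_gen);
    } catch (std::exception& e) {
        std::cout << "ERROR: " << e.what() << std::endl;
        exit (EXIT_FAILURE);
    }
    // perform 10 independent optimization runs
    for (unsigned int n=0; n<10; n++) {
        // start every run from a freshly sampled sequence
        dependency_graph->sample();
        std::string result_sequence = dependency_graph->get_sequence();
        float score = objective_function(result_sequence, structures);
        for (unsigned int i=0; i<10000; i++) {
            // resample one connected component
            dependency_graph->sample_clocal();
            std::string current_sequence = dependency_graph->get_sequence();
            float this_score = objective_function(current_sequence, structures);
            if (this_score < score) {
                // keep the improved sequence
                score = this_score;
                result_sequence = current_sequence;
            } else {
                // no improvement: go back to the previous sequence
                dependency_graph->revert_sequence();
            }
        }
        std::cout << result_sequence << "\t" << score << std::endl;
    }
    delete dependency_graph;
    exit (EXIT_SUCCESS);
}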
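The objective function evaluated in the C++ example above can be written compactly as follows. The symbols are notation chosen for this note, not taken from the library: x is the sequence, s_1, ..., s_M are the target structures, E(x, s_i) is the value returned by energy_of_structure, G(x) is the ensemble free energy returned by pf_fold, and the weight 0.5 is the constant that appears in the code.

```latex
f(x) = \frac{1}{M} \sum_{i=1}^{M} \bigl( E(x, s_i) - G(x) \bigr)
     + 0.5 \cdot \frac{2}{M (M-1)} \sum_{i < j} \bigl| E(x, s_i) - E(x, s_j) \bigr|
```

The first term rewards sequences whose target structures are close to the ensemble free energy, and the second term penalizes energy differences between the target structures. The expression is only defined for M > 1.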
This file holds the external representation of the DependencyGraph, the main construct for designing ...
Dependency Graph which holds all structural constraints.
- std::string get_sequence(): Get the current RNA sequence as a string.
- SolutionSizeType sample_clocal(int min_num_pos, int max_num_pos): Randomly chooses a connected component with the given size and samples a new sequence for the whole c...
- SolutionSizeType number_of_sequences(): Returns the number of solutions given the dependency graph and sequence constraints.
- SolutionSizeType sample(): Resets all bases in the whole dependency graph and samples a new sequence randomly.
- bool revert_sequence(): Reverts the sequence to the previous one.
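The calls described above combine naturally into an adaptive walk: resample a component, keep the new sequence if the objective improves, and revert otherwise. The following Python sketch shows that control flow under stated assumptions. `adaptive_walk` and `gc_deficit` are names made up for this example, and `ToyGraph` is a hypothetical stand-in that only mimics the three methods used, so that the sketch runs without the library installed. An object created via the Python interface could be passed instead, provided it offers the same method names.

```python
import random


def adaptive_walk(dg, objective, steps):
    """Keep resampled sequences that lower the objective, revert the others."""
    best = objective(dg.get_sequence())
    for _ in range(steps):
        dg.sample_clocal()
        score = objective(dg.get_sequence())
        if score < best:
            best = score
        else:
            dg.revert_sequence()
    return dg.get_sequence(), best


class ToyGraph:
    """Hypothetical stand-in for a dependency graph (no structural constraints)."""

    def __init__(self, length, seed=0):
        self.rng = random.Random(seed)
        self.seq = [self.rng.choice("ACGU") for _ in range(length)]
        self.prev = list(self.seq)

    def get_sequence(self):
        return "".join(self.seq)

    def sample_clocal(self):
        # remember the current sequence, then change one random position
        self.prev = list(self.seq)
        self.seq[self.rng.randrange(len(self.seq))] = self.rng.choice("ACGU")

    def revert_sequence(self):
        self.seq = list(self.prev)


def gc_deficit(sequence):
    """Toy objective: number of bases that are neither G nor C."""
    return sum(base not in "GC" for base in sequence)


sequence, score = adaptive_walk(ToyGraph(20), gc_deficit, 500)
print(sequence, score)
```

In a real design run the toy objective would be replaced by an energy-based score such as the one in the C++ example.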
Testing
Unit tests are available for many functions of the library. Please call make check to run these tests!
How to cite
Stefan Hammer, Birgit Tschiatschek, Christoph Flamm, Ivo L. Hofacker, and Sven Findeiß. “RNAblueprint: Flexible Multiple Target Nucleic Acid Sequence Design.” Bioinformatics, 2017. doi:10.1093/bioinformatics/btx263.