./src/barnacle.pl -lib SIM06 -lib_dir /media/ben/Drive2/Bioinformatics/Capstone/barnacle-1.0.0/sample_data/SIM06 -config /media/ben/Drive2/Bioinformatics/Capstone/barnacle-1.0.0/sample_data/SIM06/project.cfg -identify_candidates
./src/barnacle.pl -lib SIM06 -lib_dir /media/ben/Drive2/Bioinformatics/Capstone/barnacle-1.0.0/sample_data/pre_assembled/SIM06/ -config /media/ben/Drive2/Bioinformatics/Capstone/barnacle-1.0.0/sample_data/SIM06/project.cfg -identify_candidates
ERROR: No BLAT command path found in config file
Added:
blat = /usr/bin/blat
to config file
ERROR: Cannot find 2bit genome file: /home/ben/barnacle-1.0.0/annotations/hg19.2bit
Downloaded that file from ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/
ERROR: No SAMtools command path found in config file: /media/ben/Drive2/Bioinformatics/Capstone/barnacle-1.0.0/sample_data/SIM06/project.cfg
Added to config
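One way to find the paths to put in the config is to ask the shell where each tool lives. This is only a sketch: `find_tool` is a helper name made up for these notes, and the config key used for SAMtools is left out because it was not recorded above.

```shell
# find_tool is a hypothetical helper, not part of Barnacle.
# Prints the location of a command as resolved by the shell,
# or a note on stderr (and a non-zero status) if it is not on PATH.
find_tool() {
    if tool_path=$(command -v "$1"); then
        printf '%s\n' "$tool_path"
    else
        printf '%s not found on PATH\n' "$1" >&2
        return 1
    fi
}

# Report where the two tools Barnacle complained about are installed.
for tool in blat samtools; do
    find_tool "$tool" || true
done
```

The printed paths are what went into the config file, for example /usr/bin/blat for BLAT on this machine.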
ERROR: Cannot find gene annotations file: /home/ben/barnacle-1.0.0/annotations/UCSC_genes_ref.txt
Found a file called setup_annotations.sh which seems to download required references
Running this did download several files.
ERROR: Cannot find gene feature coordinates file: /home/ben/barnacle-1.0.0/annotations/UCSC_genes_ref.exons.introns.std_chr.bed
Googling for this file leads me to a version of Barnacle available on GitHub. Haven't found the exact reference yet, but it does mention running setup.py so trying that. After some trial-and-error, determined that it wants a login to a cluster as an argument:
python setup.py befulton@mason.indiana.edu
This fails on Ubuntu because warnings are treated as errors. Asking a question on Biostars led me to this line:
export CFLAGS="-O2 -U_FORTIFY_SOURCE"
Decided to start over by forking the GitHub project, which has a more detailed README.
Committed a few modifications to Git to allow compilation on Ubuntu. Now, following the README, run:
./src/barnacle.pl -lib SIM06 -lib_dir sample_data/SIM06 -config sample_data/sample.cfg -identify_candidates
Now it appears that input information comes from the Assembly dir inside the lib_dir option. That directory can contain a subdirectory called "current" which will be analyzed automatically, or a specific version can be analyzed with the assembly_ver option. Created a directory called new_stuff, then copied in the directory sample_data/pre_assembled/abyss-1.3.2/merge. Then:
./src/barnacle.pl -lib SIM06 -lib_dir sample_data/SIM06 -config sample_data/sample.cfg -identify_candidates -debug -assembly_ver new_stuff
Seems to work.
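The staging of an assembly version described above can be sketched as a small shell function. It assumes the version directory sits directly under Assembly inside the lib_dir and that the copied directory keeps the name merge, which is what the error paths later in these notes suggest; `stage_assembly` is a made-up helper name.

```shell
# stage_assembly is a hypothetical helper, not part of Barnacle.
# Usage: stage_assembly LIB_DIR VERSION MERGE_DIR
# Copies MERGE_DIR to LIB_DIR/Assembly/VERSION/merge.
stage_assembly() {
    dest="$1/Assembly/$2"
    # Refuse to overwrite: cp -r into an existing merge dir would nest it.
    if [ -e "$dest/merge" ]; then
        printf '%s/merge already exists\n' "$dest" >&2
        return 1
    fi
    mkdir -p "$dest" && cp -r "$3" "$dest/merge"
}

# Example from these notes (run from the top of the Barnacle checkout):
# stage_assembly sample_data/SIM06 new_stuff sample_data/pre_assembled/abyss-1.3.2/merge
```

The version name passed here is the same one given to -assembly_ver on the barnacle.pl command line.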
Now copy a Trinity output file, Trinity.fasta, into a new Trinity folder under sample_data/SIM06/Assembly, and run:
./src/barnacle.pl -lib SIM06 -lib_dir sample_data/SIM06 -config sample_data/sample.cfg -identify_candidates -debug -assembly_ver Trinity
Result: ERROR: Cannot find contig-to-genome alignments directory: /home/ben/gsc/barnacle-1.0.0/sample_data/SIM06/Assembly/Trinity/merge/cluster/SIM06-contigs/output
Created that directory and reran.
Result: ERROR: Cannot find contig sequences: /home/ben/gsc/barnacle-1.0.0/sample_data/SIM06/Assembly/Trinity/merge/SIM06-contigs.fa
Renamed Trinity.fasta to merge/SIM06-contigs.fa and reran.
Result: ERROR: Could not find any contig-to-genome alignment files: /home/ben/gsc/barnacle-1.0.0/sample_data/SIM06/Assembly/Trinity/merge/cluster/SIM06-contigs/output/seq.*.psl
blat annotations/hg19.2bit sample_data/SIM06/Assembly/Trinity/merge/SIM06-contigs.fa output.psl
mv output.psl sample_data/SIM06/Assembly/Trinity/merge/cluster/SIM06-contigs/output/seq.1.psl
./src/barnacle.pl -lib SIM06 -lib_dir sample_data/SIM06 -config sample_data/sample.cfg -identify_candidates -debug -assembly_ver Trinity
Result: ERROR: Cannot find contig sequence file: /home/ben/gsc/barnacle-1.0.0/sample_data/SIM06/Assembly/Trinity/merge/cluster/SIM06-contigs/input/seq.1.fa
Copied merge/SIM06-contigs.fa to sample_data/SIM06/Assembly/Trinity/merge/cluster/SIM06-contigs/input/seq.1.fa and reran.
Result: WARNING: contig comp7_c0_seq1 not present in alignment file
Error while searching for candidate contigs: Error reading contig sequence file: CGAACTCCGGGAGCCAGGAAGTACACTGCTTGCAAGACGCCTTTGCAGCCTGCTCCCTCC
CANDIDATE IDENTIFICATION FAIL
ERROR: while running /usr/bin/python /home/ben/gsc/barnacle-1.0.0/src/alignment_processing/identify_candidate_contigs.py /home/ben/gsc/barnacle-1.0.0/sample_data/SIM06/Assembly/Trinity/merge/cluster/SIM06-contigs/output/seq.1.psl /home/ben/gsc/barnacle-1.0.0/sample_data/SIM06/Assembly/Trinity/barnacle/ver_1.0.0.0/local_cid/job_1/ split-candidates num-aligns 500 min-identity 40 genes gene-coords /home/ben/gsc/barnacle-1.0.0/annotations/ensembl65_ref.exons.introns.std_chr.bed single-align 0.999 merge-overlap 0.8 smart-chooser maintain-pared-groups ctg-rep 0.85 no-mito log-file /home/ben/gsc/barnacle-1.0.0/sample_data/SIM06/Assembly/Trinity/barnacle/ver_1.0.0.0/SIM06.barnacle.log gap-candidates ctg-file /home/ben/gsc/barnacle-1.0.0/sample_data/SIM06/Assembly/Trinity/merge/cluster/SIM06-contigs/input/seq.1.fa gap-realigner /home/ben/gsc/barnacle-1.0.0/src/alignment_processing/gap_realigner gap-min-size 4 gap-min-identity 0.95 gap-min-fraction 0.3 gap-max-len 50000 gap-config /home/ben/gsc/barnacle-1.0.0/sample_data/SIM06/Assembly/Trinity/barnacle/ver_1.0.0.0/SIM06.barnacle.gap.cfg: 1
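Pieced together from the error messages so far, the files Barnacle looked for under an assembly version can be staged in one go. This is only a sketch of the manual steps in these notes: `stage_contigs` is a made-up helper, the single-chunk names seq.1.fa and seq.1.psl are what was used here rather than a documented convention, and the BLAT alignment is assumed to have been run already.

```shell
# stage_contigs is a hypothetical helper, not part of Barnacle.
# Usage: stage_contigs LIB_DIR LIB VERSION CONTIGS_FA ALIGNMENTS_PSL
stage_contigs() {
    merge_dir="$1/Assembly/$3/merge"
    cluster_dir="$merge_dir/cluster/$2-contigs"
    mkdir -p "$cluster_dir/input" "$cluster_dir/output" || return 1
    # Contig sequences, under the name Barnacle asked for.
    cp "$4" "$merge_dir/$2-contigs.fa" || return 1
    # The same contigs again, as the single input chunk.
    cp "$4" "$cluster_dir/input/seq.1.fa" || return 1
    # Contig-to-genome alignments (BLAT output), as the matching output chunk.
    cp "$5" "$cluster_dir/output/seq.1.psl"
}

# Example matching the steps above:
# stage_contigs sample_data/SIM06 SIM06 Trinity Trinity.fasta output.psl
```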
The candidate identification failure appears to be caused by seq.1.fa holding its sequences on multiple lines. The following awk command puts each sequence on a single line (without a blank line before the first header):
awk '/^>/ { if (NR > 1) printf("\n"); print; next } { printf("%s", $0) } END { printf("\n") }' < seq.1.fa
Then copied the resulting output back to seq.1.fa.
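Redirecting the awk output straight back onto its input file would truncate the file before awk reads it, so the rewrite has to go through a temporary file. A minimal sketch of that follows; `linearize_fasta` is a made-up helper name, and the awk program joins sequence lines the same way as the one-liner above while making sure no blank line is printed before the first header.

```shell
# linearize_fasta is a hypothetical helper, not part of Barnacle.
# Rewrites a FASTA file in place so each record is one header line
# followed by one sequence line.
linearize_fasta() {
    tmp=$(mktemp) || return 1
    awk '
        # Header: close the previous sequence line (if any), then print the header.
        /^>/ { if (seen) printf "\n"; print; seen = 1; next }
        # Sequence: append to the current line without a newline.
        { printf "%s", $0 }
        # Terminate the last sequence line.
        END { if (seen) printf "\n" }
    ' "$1" > "$tmp" && mv "$tmp" "$1"
}

# Example:
# linearize_fasta sample_data/SIM06/Assembly/Trinity/merge/cluster/SIM06-contigs/input/seq.1.fa
```

Note that the rewritten file takes on the permissions of the temporary file created by mktemp.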