The Open Tree of Life is huge, with 2543160 named nodes. There are good reasons why you might want to slim it down by pruning or simplifying branches. For example, I find it useful to have a tree of everything resolved to the species level, i.e. without subspecies, varieties, etc. The Perl code below creates such a tree. It is also easily modifiable to produce a genus- or family-level tree. It takes about 25 seconds on my (low-powered) MacBook Air, most of which is spent reading the 317 MB taxonomy.tsv file. You can call it via:
./subspecies_delete.pl OT_draftversion2.tre taxonomy.tsv > nosubsp.tre
For the current draft tree (version 2), it identifies the following problematic taxa:
Species Ascochyta_fabae_ott1084624 is nested within another species
Species Fusarium_oxysporum_f_sp_lycopersici_ott810228 is nested within another species
Species Hieracium_linahamariense_ott3897051 is nested within another species
Species Centaurea_subjacea_ott3894019 is nested within another species
Species Aegilops_triuncialis_ott608778 is nested within another species
Species Aegilops_crassa_ott267029 is nested within another species
Species Aegilops_tauschii_ott881533 is nested within another species
Species Aegilops_longissima_ott267020 is nested within another species
Species Aegilops_peregrina_ott34479 is nested within another species
Species Aegilops_comosa_ott267017 is nested within another species
Removing subspecies etc. doesn’t slim the tree much, though: the trimmed version still has 2498945 nodes, which is only about 2% smaller.
subspecies_delete.pl (code in the public domain)
#!/usr/bin/perl -sw
use strict;
use File::ReadBackwards;

my $tree = shift @ARGV;            # first arg is location of tree
my $OpenToLTaxonomy = shift @ARGV; # second arg is the corresponding taxonomy.tsv file

open(TAXONOMY, "<", $OpenToLTaxonomy) or die "Cannot open taxonomy file $OpenToLTaxonomy: $!";
my @header = split("\t", <TAXONOMY>);
my( $rank_index )= grep { $header[$_] eq "rank" } 0..$#header;
my( $OTTid_index )= grep { $header[$_] eq "uid" } 0..$#header;
my %species;
while (<TAXONOMY>) {
    if ((split("\t"))[$rank_index] eq "species") {
        $species{(split("\t"))[$OTTid_index]} = 1;
    }
}
close(TAXONOMY);

tie *BACKTREE, "File::ReadBackwards", $tree, ")" or die "can't read newick file '$tree' $!" ;
my $del_depth=0;
my $line = 0;
my @omit;
while (<BACKTREE>) { #read in reverse order, separated by close brackets
    my ($uid) = (/^[^,].*?_ott(\d+)[',;\)]/); #name for the prev brace (comma = no name)
    if ($del_depth) { #if we have started deleting, we keep track of which are deleted
        my $pos = length;
        do {
            $pos = rindex($_,'(',$pos-1);
        } while ($pos != -1 && --$del_depth);
        unshift @omit, [$line, $pos]; #$pos == -1 means omit all this line
        $del_depth++ if ($del_depth); #if still in brace nest, next loop increases depth
    };
    if (defined $uid && exists $species{$uid}) {
        if ($del_depth) {
            my ($name) = (/^(.+?_ott\d+)[',;\)]/);
            warn("Species $name is nested within another species: ignoring this species");
        } else {
            $del_depth = 1;
        }
    }
    $line++;
}
close(BACKTREE);

#recalculate to count close braces from start of file
foreach (@omit) {
    $_->[0] = $line - $_->[0];
}

#now go forwards through the file, printing when @omit allows
$/ = ")";
open(FORETREE, "<", $tree) or die "cannot open $tree: $!";
while(<FORETREE>) {
    if (@omit && ($. == $omit[0][0])) {
        print substr($_, 0, $omit[0][1]) if ($omit[0][1] != -1);
        shift @omit;
    } else {
        print;
    }
}
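The only place the script tests the rank is the taxonomy loop, so that is presumably where a genus- or family-level variant would start. As a minimal sketch, the lookup could be factored into a helper that takes the rank as an argument. The name `ids_with_rank` is hypothetical (it is not part of subspecies_delete.pl), and the sketch assumes the same tab-separated layout with `uid` and `rank` columns that the script already relies on.

```perl
use strict;
use warnings;

# Hypothetical helper (not in the original script): return a hash ref whose
# keys are the OTT ids of every taxon that has the requested rank.
sub ids_with_rank {
    my ($taxonomy_file, $wanted_rank) = @_;
    open(my $fh, "<", $taxonomy_file)
        or die "Cannot open taxonomy file $taxonomy_file: $!";

    # Split on tabs exactly as the original script does, so that the column
    # positions found in the header line also apply to the data lines.
    my $header_line = <$fh>;
    die "Taxonomy file $taxonomy_file is empty" unless defined $header_line;
    chomp $header_line;
    my @header = split("\t", $header_line);
    my ($rank_index) = grep { $header[$_] eq "rank" } 0..$#header;
    my ($id_index)   = grep { $header[$_] eq "uid" }  0..$#header;
    die "No 'rank' or 'uid' column in $taxonomy_file"
        unless defined $rank_index && defined $id_index;

    my %ids;
    while (my $row = <$fh>) {
        chomp $row;
        my @fields = split("\t", $row);
        # Skip short or malformed rows rather than warn on undefined values.
        next unless defined $fields[$rank_index] && defined $fields[$id_index];
        $ids{ $fields[$id_index] } = 1 if $fields[$rank_index] eq $wanted_rank;
    }
    close($fh);
    return \%ids;
}
```

In the script, `my %species = %{ ids_with_rank($OpenToLTaxonomy, "genus") };` could then stand in for the taxonomy loop, so that everything nested inside a genus is removed instead. This variant is untested against the full tree, and the warning text would still say "Species" unless it is changed too.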
I am trying to prune the tree for certain plant and animal families. I am far from an expert in Perl, and I am not entirely sure where in the code I can prune for the families I need. Any help would be welcome.
It sounds like you are trying to extract subtrees from the open tree, rather than prune tips. So you probably want my script at http://yanwong.me/?page_id=1090. You first need to find the ‘Open Tree Taxonomy ID’ for those families, which is a number, and simply pass that to the script. You can find this number by searching on the OpenTree website: e.g. for Brassicaceae you should be directed to https://tree.opentreeoflife.org/opentree/argus/ottol@309271/Brassicaceae from where you can find the ott id 309271. You can then call my subtree extraction script as
./subtree_extract.pl draftversion4.tre 309271
Note that for version 5 of the open tree, you’ll need to use the file
labelled_supertree_simplified_ottnames_with_monotypic.tre
or modify my script somehow.
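To illustrate what extracting a subtree by OTT id involves, here is a minimal sketch. It is not the subtree_extract.pl script linked above: the function name `extract_subtree` is hypothetical, it works on a Newick string already held in memory, and it assumes node labels end in `_ott<ID>` and contain no parentheses, commas, colons or semicolons (so quoted labels containing those characters would break it).

```perl
use strict;
use warnings;

# Hypothetical helper: return the Newick subtree rooted at the node whose
# label ends in _ott<ID>, or undef if there is no such label.
sub extract_subtree {
    my ($newick, $ott_id) = @_;

    # Find the whole label: a run of non-delimiter characters ending in
    # _ott<ID>, not followed by another digit (so 30 does not match 309271).
    return undef unless $newick =~ /([^(),:;]*_ott\Q$ott_id\E)(?!\d)/;
    my ($label, $label_start) = ($1, $-[1]);

    # A tip has no ')' directly before its label: the label is the subtree.
    return "$label;"
        if $label_start == 0 || substr($newick, $label_start - 1, 1) ne ')';

    # Otherwise walk backwards from that ')' to its matching '('.
    my $depth = 0;
    my $pos   = $label_start - 1;
    while ($pos >= 0) {
        my $char = substr($newick, $pos, 1);
        $depth++ if $char eq ')';
        $depth-- if $char eq '(';
        last if $depth == 0;
        $pos--;
    }
    return undef if $pos < 0;    # unbalanced parentheses in the input

    # Everything from the matching '(' up to the end of the label.
    return substr($newick, $pos, $label_start - $pos) . "$label;";
}
```

For a toy tree such as `((A_ott1,B_ott2)C_ott3,D_ott4)root_ott5;`, asking for id 3 should give back the clade `(A_ott1,B_ott2)C_ott3;`.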