What did the bird say? Part 2 - Taxonomy in a relational database

How to pivot hierarchy data to a flat table

Posted by yara_tchk on July 15, 2017
Article list

What did the bird say? Bird voice recognition. Part 1 - the beginning

What did the bird say? Bird voice recognition. Part 2 - taxonomy

What did the bird say? Bird voice recognition. Part 3 - Listen to the birds

What did the bird say? Bird voice recognition. Part 4 - Dataset choice, data download and pre-processing, visualization and analysis

What did the bird say? Bird voice recognition. Part 5 - Data pre-processing for CNNs

What did the bird say? Bird voice recognition. Part 6 - Neural network MVP

What  did the bird say? Bird voice recognition. Part 7 - full dataset preprocessing (169GB)

What  did the bird say? Bird voice recognition. Part 8 - fast Squeeze-net on 800k images - 500s per epoch (10-20x speed up)

 

Once a friend of mine asked me for a favor. He needed a birds taxonomy database, and he supposed the best solution is web scraping. So I was too lazy to do web scraping and I found a full taxonomy database on ITIS website here. They even have PostgreSQL dump, which is perfect.

If you want to download their database, you should use a tool which can continue download process after interruption. Even if your download process completes successfully, check a hash sum and compare to check sums on the ITIS website. I decided to use wget and after a fourth attempt a database archive was downloaded without any errors.

cd /path/to/downloadswget https://www.itis.gov/downloads/itisPostgreSql.zip
 

To check hash sum you can use md5sum:

md5sum /path/to/downloads/itisPostgreSql.zip

 

Now you can extract files. One of two files is an sql dump, the second one contains instructions. I wouldn't use ITIS instructions, because root user must not have access to database. It is better to move dump file to a directory owned by postgres user and make sure postgres user have necessary permissions on the file. Now simply change your user to postgres, create an ITIS database (it is important) and restore the dump into freshly created database:

psql ITIS < /path/to/postgres/owned/dir/itisPostgreSql063017/ITIS.sql 
 

The database does not contain any relations between entities, but you can find its documentation here with data model description. It is 100% helpful, so check it out before trying to use data and write queries.

 

Taxonomy is hierarchy, and in relational database it is stored perfectly as child and parent relationship, but when you want a full chain for every species, it is quite a challenge to query sql database and get a perfect human readable output.

First, I created a new table where I could insert a recursive query output. 

CREATE TABLE "public"."derived_aves" (
"complete_name" varchar(255),
"parent_tsn" int4 NOT NULL,
"rank_name" varchar(255) NOT NULL,
"tsn" int4 NOT NULL,
"rank_id" int4,
"level" int2,
"species" int4 NOT NULL,
"vernacular_names" json,
CONSTRAINT "derived_aves_pkey" PRIMARY KEY ("parent_tsn", "tsn", "species", "rank_name")
);

 

The recursive query starts with all birds species. This is achieved by rank_id filter and hierarchy chain filter (a chain must contain Aves Class unit). The idea is to have all the higher taxonomy units mapped to every their children unit of species level. So Aves Class will be repeated as many times, as there are species in Aves Class, the same is for Animals Kingdom.

with recursive parents(complete_name, parent_tsn, rank_name, tsn, rank_id, "level", species) as 
(
  SELECT distinct
    tun.complete_name,
    tun.parent_tsn,
    tut.rank_name,
    tun.tsn,
    tut.rank_id,
    h."level",
    tun.tsn as species 
  FROM
    taxonomic_units tun 
    join
      taxon_unit_types tut 
      on tun.rank_id = tut.rank_id 
      and tut.rank_id = 220 
      and tun.kingdom_id = 5 
      and tun.name_usage = 'valid' 
    join
      "hierarchy" h 
      on h.tsn = tun.tsn 
      and h.hierarchy_string like '%174371%' 
    UNION ALL
    SELECT
      tl.complete_name,
      tl.parent_tsn,
      tl.rank_name,
      tl.tsn,
      tl.rank_id,
      tl."level",
      p.species 
    FROM
      parents p,
      (
        SELECT distinct
          tun.complete_name,
          tun.parent_tsn,
          tut.rank_name,
          tun.tsn,
          tut.rank_id,
          h."level",
          tun.tsn as species 
        FROM
          taxonomic_units tun 
          join
            taxon_unit_types tut 
            on tun.rank_id = tut.rank_id 
            and tut.rank_id < 220 
            and tut.rank_id > 0 
            and tun.kingdom_id = 5 
            and tun.name_usage = 'valid' 
          join
            "hierarchy" h 
            on h.tsn = tun.tsn
      )
      tl 
    WHERE
      p.parent_tsn = tl.tsn
)
insert into derived_aves(complete_name, parent_tsn, rank_name, tsn, rank_id, "level", species) 
select distinct p.complete_name, p.parent_tsn, p.rank_name, p.tsn, p.rank_id, p."level", p.species from parents p

 

Now we have the recursive query output in the derived_aves table. I didn't include vernacular or usual common names of taxonomy units in the recursive query, because it would require LEFT JOIN expression, which would lead to out of memory exception due to computational expensiveness and, you know, recursive queries are hungry for memory.

So let's update our table with vernacular names.

update derived_aves set vernacular_names = sub1.vernacular_names from
  (
  select 
    tsn, 
    json_agg(vernacular_name) as vernacular_names 
   from 
    (
    select distinct 
      dav.tsn, 
      v.vernacular_name
    from derived_aves dav
      join vernaculars v on dav.tsn = v.tsn 
      and (v."language" = 'English' or v."language" = 'unspecified')
    ) sub
  group by 1
  ) sub1
where derived_aves.tsn = sub1.tsn

 

Now we create kind of chess board out of derived_aves table with CASE WHEN expression. And then collapse it with GROUP BY species.

SELECT
chess.species_tsn,
string_agg(kingdom, '') as kingdom,
string_agg(vkingdom, '') as vkingdom,
string_agg(subkingdom, '') as subkingdom,
string_agg(vsubkingdom, '') as vsubkingdom,
string_agg(infrakingdom, '') as infrakingdom,
string_agg(vinfrakingdom, '') as vinfrakingdom,
string_agg(division, '') as division,
string_agg(vdivision, '') as vdivision,
string_agg(subdivision, '') as subdivision,
string_agg(vsubdivision, '') as vsubdivision,
string_agg(infradivision, '') as infradivision,
string_agg(vinfradivision, '') as vinfradivision,
string_agg(superclass, '') as superclass,
string_agg(vsuperclass, '') as vsuperclass,
string_agg("class", '') as "class",
string_agg(vclass, '') as vclass,
string_agg("order", '') as "order",
string_agg(vorder, '') as vorder,
string_agg("family", '') as "family",
string_agg(vfamily, '') as vfamily,
string_agg(genus, '') as genus,
string_agg(vgenus, '') as vgenus,
string_agg(species, '') as species,
string_agg(vspecies, '') as vspecies
from
(select distinct
dav.species as species_tsn,
case when dav."level" = 0 then complete_name else '' end as kingdom,
case when dav."level" = 0 then vernacular_names::varchar else '' end as vkingdom,
case when dav."level" = 1 then complete_name else '' end as subkingdom,
case when dav."level" = 1 then vernacular_names::varchar else '' end as vsubkingdom,
case when dav."level" = 2 then complete_name else '' end as infrakingdom,
case when dav."level" = 2 then vernacular_names::varchar else '' end as vinfrakingdom,
case when dav."level" = 3 then complete_name else '' end as division,
case when dav."level" = 3 then vernacular_names::varchar else '' end as vdivision,
case when dav."level" = 4 then complete_name else '' end as subdivision,
case when dav."level" = 4 then vernacular_names::varchar else '' end as vsubdivision,
case when dav."level" = 5 then complete_name else '' end as infradivision,
case when dav."level" = 5 then vernacular_names::varchar else '' end as vinfradivision,
case when dav."level" = 6 then complete_name else '' end as superclass,
case when dav."level" = 6 then vernacular_names::varchar else '' end as vsuperclass,
case when dav."level" = 7 then complete_name else '' end as "class",
case when dav."level" = 7 then vernacular_names::varchar else '' end as vclass,
case when dav."level" = 8 then complete_name else '' end as "order",
case when dav."level" = 8 then vernacular_names::varchar else '' end as vorder,
case when dav."level" = 9 then complete_name else '' end as "family",
case when dav."level" = 9 then vernacular_names::varchar else '' end as vfamily,
case when dav."level" = 10 then complete_name else '' end as genus,
case when dav."level" = 10 then vernacular_names::varchar else '' end as vgenus,
case when dav."level" = 11 then complete_name else '' end as species,
case when dav."level" = 11 then vernacular_names::varchar else '' end as vspecies
from derived_aves dav ) as chess
group by 1

 

And here's how the output sample looks like:

 

Looks like it should, awesome!