Relational Database Design and Implementation for Biodiversity Informatics
Paul J. Morris
The Academy of Natural Sciences, 1900 Ben Franklin Parkway, Philadelphia, PA 19103 USA
Received: 28 October 2004 - Accepted: 19 January 2005
The complexity of natural history collection information and similar information within the scope of biodiversity informatics poses significant challenges for effective long-term stewardship of that information in electronic form. This paper discusses the principles of good relational database design and how to apply those principles in the practical implementation of databases, and examines why good database design is essential for long-term stewardship of biodiversity information. Good design and implementation principles are illustrated with examples from the realm of biodiversity information, including an examination of the costs and benefits of different ways of storing hierarchical information in relational databases. This paper also discusses typical problems present in legacy data, how they are characteristic of efforts to handle complex information in simple databases, and methods for handling those data during data migration.
The data associated with natural history collection materials are inherently complex. Management of these data in paper form has produced a variety of documents such as catalogs, specimen labels, accession books, stations books, map files, field note files, and card indices. The simple appearance of the data found in any one of these documents (such as the columns for identification, collection locality, date collected, and donor in a handwritten catalog ledger book) masks the inherent complexity of the information. The appearance of simplicity overlying highly complex information provides significant challenges for the management of natural history collection information (and other systematic and biodiversity information) in electronic form. These challenges include management of legacy data produced during the history of capture of natural history collection information into database management systems of increasing sophistication and complexity.
In this document, I discuss some of the issues involved in handling complex biodiversity information, approaches to the stewardship of such information in electronic form, and some of the tradeoffs between different approaches. I focus on the very well understood concepts of relational database design and implementation. Relational¹ databases have a strong (mathematical) theoretical foundation (Codd, 1970; Chen, 1976), and a wide range of database software products is available for implementing relational databases.
¹ Object theory offers the possibility of handling much of the complexity of biodiversity information in object oriented databases in a much more effective manner than in relational databases, but object oriented and object-relational database software is much less mature and much less standard than relational database software. Data stored in a relational DBMS are currently much less likely to become trapped in a dead end with no possibility of support than data in an object oriented DBMS.
PhyloInformatics 7: 2-66 - 2005
Figure 1. Typical paths followed by biodiversity information. The cylinder represents storage of information in electronic form in a database.
The effective management of biodiversity information involves many competing priorities (Figure 1). The most important priorities include long term data stewardship, efficient data capture (e.g. Beccaloni et al., 2003), creating high quality information, and effective use of limited resources. Biodiversity information storage systems are usually created and maintained in a setting of limited resources. The most appropriate design for a database to support long term stewardship of biodiversity information may not be a complex highly normalized database well fitted to the...