
Manoharan A, Stamberger J, Yu Y, Paepcke A. 2008. Optimizations for the EcoPod field identification tool. BMC Bioinformatics 9.

Year Published: 2008
Abstract: 

Background: We point to the difficulty of gathering data sets that are large enough to support biodiversity studies. The recruitment of well-informed amateurs offers one potential way around this problem. Reliability of the data, however, poses a threat to this solution. We briefly sketch our species identification tool, which runs on a palm-sized computer and is designed to help knowledgeable observers participate in census activities. This tool is driven by an algorithm that turns an identification matrix into a series of questions that guide the operator towards species identification. Historic observation data from the geographic area of the census improves this algorithm, helping it to ask as few questions as possible. The body of the presented work explores how much historic data is required to noticeably boost algorithm performance, and whether the use of history negatively impacts the successful identification of rare species. We also explore how some aspects of the identification key matrix interact with the algorithm. Finally, we investigate how best to predict the probability of observing a previously unseen species in the future for this particular application.

Results: Point counts of birds taken at Stanford University's Jasper Ridge Biological Preserve between 2000 and 2005 were used to examine the algorithm. During every experimental run each of the 104 observed bird species was identified by a computer that correctly answered all the algorithm's questions as a human operator ideally would. We repeated these runs, in turn making different-sized subsets of the bird count data available to the algorithm. Each time we observed the number of questions required to identify each bird. We then added runs to repeatedly identify 50 birds that had not been observed in Jasper Ridge before. In addition to history use we explored how the character density of the key matrix and the theoretical minimum number of questions for each bird in the matrix influenced the algorithm. Our investigation of probability smoothing in this context focused on whether Laplace smoothing of observation probabilities was sufficient, or whether a version of the more complex Good-Turing technique was required for the identification tool to work well.

Conclusions: We found that the use of historic data indeed improved identification speed, but that this effect was limited to the 25% most frequently observed birds. For rare birds the history-based algorithms did not impose a noticeable penalty in the number of questions required for identification. We found that for our dataset neither the age of the historic data nor the number of aggregated years of observation impacted the algorithm. Any one year's worth of data pushed the algorithm to its highest performance. The density of characters for different taxa in the identification matrix did not impact the algorithms. Intrinsic differences in identifying different birds did affect the algorithm, but the differences affected the baseline method of not using historic data to exactly the same degree. Finally, we found that Laplace smoothing performed better for rare species than Simple Good-Turing, and that, contrary to expectation, the technique did not then adversely affect identification performance for frequently observed birds.
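As an illustration of the ideas in the abstract, the following is a minimal sketch of how Laplace-smoothed observation probabilities might feed a question-selection loop over an identification matrix. It is not the EcoPod implementation: the species names, characters, counts, the function names, and the selection rule (ask the character that splits the remaining probability mass most evenly) are all assumptions made for this example.

```python
def laplace_probabilities(counts, species, alpha=1.0):
    """Add-alpha (Laplace) smoothing: every species, seen or not, gets alpha extra counts."""
    total = sum(counts.get(s, 0) for s in species) + alpha * len(species)
    return {s: (counts.get(s, 0) + alpha) / total for s in species}


def best_question(matrix, candidates, prior):
    """Assumed rule: choose the character whose yes/no split of prior mass is most balanced."""
    mass = sum(prior[s] for s in candidates)
    best, best_gap = None, float("inf")
    for ch in sorted(next(iter(matrix.values()))):
        yes = [s for s in candidates if matrix[s][ch]]
        if not yes or len(yes) == len(candidates):
            continue  # this character does not separate the remaining candidates
        gap = abs(sum(prior[s] for s in yes) / mass - 0.5)
        if gap < best_gap:
            best, best_gap = ch, gap
    return best


def identify(matrix, target, prior):
    """Simulate an ideal operator who answers every question correctly for `target`."""
    candidates = set(matrix)
    asked = []
    while len(candidates) > 1:
        ch = best_question(matrix, candidates, prior)
        if ch is None:
            break  # remaining candidates cannot be told apart by any character
        asked.append(ch)
        answer = matrix[target][ch]
        candidates = {s for s in candidates if matrix[s][ch] == answer}
    return candidates, asked


# Hypothetical toy key matrix: four species, three yes/no characters.
matrix = {
    "sp_a": {"c1": True, "c2": True, "c3": False},
    "sp_b": {"c1": True, "c2": False, "c3": False},
    "sp_c": {"c1": False, "c2": True, "c3": True},
    "sp_d": {"c1": False, "c2": False, "c3": True},
}
# Hypothetical historic counts; sp_d has never been observed.
counts = {"sp_a": 90, "sp_b": 8, "sp_c": 2}

prior = laplace_probabilities(counts, matrix)
for name in matrix:
    found, asked = identify(matrix, name, prior)
    print(name, sorted(found), asked)
```

Because smoothing gives the unseen species a small non-zero probability, it stays in the candidate set and can still be identified, which is the property the smoothing comparison in the abstract is concerned with.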

Article Title: 
Optimizations for the EcoPod field identification tool
Article ID: 
1157