May Science Update: Maintaining Quality in “Big Data”
Snapshot Wisconsin relies on different sources to help classify our growing dataset of more than 27 million photos, including our trail camera hosts, Zooniverse volunteers and experts at Wisconsin DNR. With all these different sources, we need ways to assess the quality and accuracy of the data before they are put into the hands of decision makers.
A recent publication in Ecological Applications by Clare et al. (2019) looked at the issue of maintaining quality in “big data” by examining Snapshot Wisconsin images. The information from the study was used to develop a model that will help us predict which photos are most likely to contain classification errors. Because Snapshot-specific data were used in this study, we can now use these findings to decide which data to accept as final and which images should go through expert review.
Perhaps most importantly, this framework allows us to be transparent with data users by providing specific metrics on the accuracy of our dataset. These confidence measures can be considered when using the data as input for models, when choosing research questions, and when interpreting the data for use in management decision making.
The study examined nearly 20,000 images classified on the crowdsourcing platform Zooniverse. Classifications for each species were analyzed to identify the false-negative error probability (the likelihood that a species is indicated as not present when it is) and the false-positive error probability (the likelihood that a species is indicated as present when it is not).
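To make the two error types concrete, here is a minimal Python sketch that tallies them per species from pairs of (expert label, crowd label). It is an illustration only, not the method used in the study: the function name `error_rates` and the six toy records are made up, and it assumes the false-negative probability is measured among photos that truly contain the species while the false-positive probability is measured among photos classified as that species, which is how the elk and blank-photo examples later in this post read. The paper may define the quantities differently.

```python
from collections import Counter


def error_rates(pairs):
    """Estimate per-species error rates from (true_label, classified_label) pairs.

    Assumed definitions (an interpretation, not taken from the paper):
      false_negative = share of photos truly showing the species
                       that were classified as something else
      false_positive = share of photos classified as the species
                       that actually show something else
    """
    truth_counts = Counter()
    classified_counts = Counter()
    correct = Counter()

    for truth, classified in pairs:
        truth_counts[truth] += 1
        classified_counts[classified] += 1
        if truth == classified:
            correct[truth] += 1

    rates = {}
    for species in set(truth_counts) | set(classified_counts):
        n_true = truth_counts[species]
        n_classified = classified_counts[species]
        rates[species] = {
            # None when the rate is undefined (no photos to divide by)
            "false_negative": 1 - correct[species] / n_true if n_true else None,
            "false_positive": 1 - correct[species] / n_classified if n_classified else None,
        }
    return rates


# Made-up toy records: (expert label, crowd label)
pairs = [
    ("deer", "deer"),
    ("deer", "deer"),
    ("elk", "deer"),    # an elk mistaken for a deer
    ("elk", "elk"),
    ("blank", "blank"),
    ("deer", "blank"),  # a deer missed, photo called blank
]

rates = error_rates(pairs)
for species in sorted(rates):
    print(species, rates[species])
```

In this toy data, every photo called an elk really is an elk, but half of the true elk photos were called something else, which mirrors the pattern described for elk further down.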
The authors found that classifications were 93% correct overall, but the rate of accuracy varied widely by species. This has major implications for wildlife management, where data are analyzed and decisions are made on a species-by-species basis. The graphs below show how variable the false-positive and false-negative probabilities were for each species, with the whiskers representing 95% confidence intervals.
Errors by species
We can conclude from these graphs that each species has a different set of considerations regarding these two errors. For example, deer and turkeys both have low false-negative and false-positive error rates, meaning that classifiers are good at correctly identifying these species and few are missed. Elk photos do not exhibit the same trends.
When a classifier identifies an elk in a photo, it is almost always an elk, but there are a fair number of photos of elk that are classified as some other species. For blank photos, the errors go in the opposite direction: if a photo is classified as blank, there is a ~25% probability that there is an animal in the photo, but there are very few blank photos that are incorrectly classified as having an animal in them.
Assessing species classifications with these two types of errors in mind helps us understand what we need to consider when determining final classifications of the data and their use for wildlife decision support.
When tested, the model was successful in identifying 97% of misclassified images. Factors considered in the development of the model included differences in camera placement between sites and the way in which Zooniverse users interacted with the images, among others.
In general, the higher the proportion of users that agreed on the identity of the animal in the image, the greater the likelihood it was correct. Even seasonality was useful in evaluating accuracy for some species – snowshoe hares were found to be easily confused with cottontail rabbits in the summertime, when they both sport brown pelage.
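The agreement idea can be sketched in a few lines of Python. This is a simplified illustration, not the model from the study, which considered several factors beyond agreement: the function names and the 0.8 review threshold are hypothetical, chosen only to show how the proportion of users agreeing on an image could be used to flag it for expert review.

```python
from collections import Counter


def plurality_agreement(votes):
    """Return the most common label and the proportion of users who chose it."""
    if not votes:
        raise ValueError("at least one classification is required")
    label, count = Counter(votes).most_common(1)[0]
    return label, count / len(votes)


def needs_expert_review(votes, threshold=0.8):
    """Flag an image when user agreement falls below a (hypothetical) threshold."""
    _, agreement = plurality_agreement(votes)
    return agreement < threshold


# Made-up votes for two images
hare_votes = ["snowshoe hare", "cottontail", "snowshoe hare", "cottontail", "snowshoe hare"]
turkey_votes = ["turkey"] * 5

print(plurality_agreement(hare_votes), needs_expert_review(hare_votes))
print(plurality_agreement(turkey_votes), needs_expert_review(turkey_votes))
```

With these invented votes, the hare image has only 60% agreement and would be sent for review, while the unanimous turkey image would be accepted. A real cutoff would need to be chosen per species, since accuracy varies so widely between them.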
Not only does the information derived from this study have major implications for Snapshot Wisconsin, but the framework for determining and remediating data quality presented in this article can also benefit a broad range of big-data projects.