What Makes Data “Good”?
The following piece was written by Snapshot Wisconsin’s Data Scientist, Ryan Bemowski.
Have you ever heard the phrase “Data doesn’t lie”? It’s often used when suggesting a conclusion based on the story that scientific data tells. The statement is true: raw data is incapable of lying. However, data collection, data processing, data presentation, and even interpretation can be skewed or biased. Data is made “good” by understanding its collection, processing, and presentation methods while accounting for their pitfalls. Some might be surprised to learn that it is also the responsibility of the consumer or observer of the data to be vigilant when drawing conclusions from what they are seeing.
Thanks to the data collection efforts of more than 3,000 camera host volunteers over 5 years, Snapshot Wisconsin has amassed over 54,000,000 photos. Is all this data used for analysis and presentations? The short answer is: not quite. Snapshot Wisconsin uses a scientific approach, so any photos that do not follow the collection specifications are unusable for analysis or presentation. Under these circumstances, a certain amount of data loss is expected during the collection process. Let’s dive into why some photos are not usable in our data analysis and presentations.
When data is considered unusable for analysis and presentation, corrections are made during the data processing phase. There are numerous steps in processing Snapshot Wisconsin data, and each step may temporarily or permanently mark data as unusable for presentation. For example, a camera that is baited with food, checked too frequently (such as on a weekly basis), checked too infrequently (such as once a year), or set in an improper orientation may produce permanently unusable photos. This is why it is very important that camera hosts follow the setup instructions when deploying a camera. The two photo series below show a proper camera orientation (top) and an improper camera orientation (bottom). The properly oriented camera is pointed along a flooded trail, while the improperly oriented camera is pointed at the ground. Improper orientation usually happens through no fault of the camera host, as a result of weather or animal interaction, but it must be corrected for the photos to be usable for analysis and presentation.
In another case, a group of hard-to-identify photos may be temporarily marked as unusable. Once DNR staff have expertly verified the identity of the species in those photos, the photos are used for analysis and presentation.
Usable data from the data processing phase can be analyzed and presented. The presentation phase often filters the data down to a specific species, timeframe, and region. With every new filter, the data gets smaller. At a certain point the data can become so small that it carries an unacceptably high risk of being misleading or misinterpreted. In the Snapshot Wisconsin Data Dashboard, once the data becomes too small to visualize effectively, it is marked as “Insufficient Data.” That data is still used in other calculations where enough data is present, but it cannot reliably be presented on its own.
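The filtering idea above can be sketched in a few lines of Python. This is a minimal illustration, not the Data Dashboard's actual logic: the cutoff MIN_DETECTIONS, the summarize function, and the sample records are all hypothetical and chosen only to show how a group that is too small gets labeled instead of plotted.

```python
from collections import Counter

# Hypothetical cutoff for illustration; the Dashboard's real rule is not
# described in this article.
MIN_DETECTIONS = 30

def summarize(detections, min_detections=MIN_DETECTIONS):
    """Count detections per (species, county) and flag groups that are too small."""
    counts = Counter((d["species"], d["county"]) for d in detections)
    return {
        key: n if n >= min_detections else "Insufficient Data"
        for key, n in counts.items()
    }

# Made-up records: one well-sampled group and one sparse group.
detections = (
    [{"species": "deer", "county": "Dane"}] * 45
    + [{"species": "bobcat", "county": "Dane"}] * 2
)
summary = summarize(detections)
for key, value in summary.items():
    print(key, value)
```

The sparse group is not discarded. It stays in the underlying counts and can still contribute to a larger total, such as a statewide figure, even though it is not shown on its own.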
Let’s use the Data Dashboard presence map with deer selected as an example. The map on the left contains 5,800,000 detections. A detection is a photo event taken when an animal walks in front of a trail camera. What if we were to narrow down the data by randomly selecting only 72 detections, one per county? After taking that sample of one detection per county, only 12 of the detections had deer in them, as shown by the map on the right. The second map is quite misleading, since it makes it appear that only 12 counties have detected a deer. When data samples are too small, the data can easily be misinterpreted. This is precisely why very small data samples are omitted from data presentations.
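The one-detection-per-county experiment can be imitated with synthetic data. The sketch below assumes made-up numbers (100 detections per county, 20 of them deer) rather than real Snapshot Wisconsin data, and the helper names are hypothetical. Every county has deer in the full data set, yet a sample of one detection per county shows deer in only some of them.

```python
import random

N_COUNTIES = 72        # Wisconsin has 72 counties
PER_COUNTY = 100       # hypothetical number of detections per county
DEER_PER_COUNTY = 20   # hypothetical: 20 of every 100 detections are deer

# Synthetic detections: every county has deer somewhere in its full data.
detections = [
    {"county": county, "species": "deer" if i < DEER_PER_COUNTY else "other"}
    for county in range(N_COUNTIES)
    for i in range(PER_COUNTY)
]

def counties_with(species, rows):
    """Return the set of counties with at least one detection of the species."""
    return {row["county"] for row in rows if row["species"] == species}

def one_per_county(rows, rng):
    """Randomly keep a single detection from each county."""
    by_county = {}
    for row in rows:
        by_county.setdefault(row["county"], []).append(row)
    return [rng.choice(group) for group in by_county.values()]

rng = random.Random(0)  # fixed seed so the run is repeatable
sample = one_per_county(detections, rng)

full_map = counties_with("deer", detections)
sample_map = counties_with("deer", sample)
print(len(full_map), "counties show deer when all detections are used")
print(len(sample_map), "counties show deer with one detection per county")
```

The counties missing from the sampled map still have deer. The sample is simply too small to show them.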
There are a lot of choices to make when presenting data. We make it a priority to display as much information, in as much detail, as possible while still creating reliable and easily interpretable visualizations.
In the end, interpretation is everything. It is the responsibility of the observer of the data presentation to be open and willing to accept the data as truth, yet cautious of various biases and potential misinterpretations. It is important to refrain from making too many assumptions as a consumer of the presentation. For example, in the Snapshot Wisconsin Data Dashboard detection rates plot (shown below), cottontails have only a fraction of the detections that deer have across the state. It is quite easy to think “The deer population in Wisconsin is much larger than the cottontail population,” but that would be a misinterpretation regardless of how true or false the statement may be.
Remember, the Snapshot Wisconsin Data Dashboard presents data about detections from our trail cameras, not overall population. There is no data in the Snapshot Wisconsin Data Dashboard which implies that one species is more populous than any other. Detectability, or how likely an animal is to be detected by a camera, plays a major role in the data used on the Snapshot Wisconsin Data Dashboard. Deer are one of the largest, most detectable species while the smaller, brush dwelling cottontail is one of the more difficult to detect.
So, is the data “good”?
Yes, Snapshot Wisconsin is full of good data. If we continue to practice proper data collection, rigorous data processing, and mindful data presentation, Snapshot Wisconsin data will continue to get even better. Interpretation is also a skill that needs practice. While viewing any data presentation, be willing to accept the presented data as truth, but also be vigilant in your interpretation so that you are not misled and do not misinterpret what is presented.
May Science Update: Maintaining Quality in “Big Data”
Snapshot Wisconsin relies on different sources to help classify our growing dataset of more than 27 million photos, including our trail camera hosts, Zooniverse volunteers and experts at Wisconsin DNR. With all these different sources, we need ways to assess the quality and accuracy of the data before it’s put into the hands of decision makers.
A recent publication in Ecological Applications by Clare et al. (2019) looked at the issue of maintaining quality in “big data” by examining Snapshot Wisconsin images. The information from the study was used to develop a model that will help us predict which photos are most likely to contain classification errors. Because Snapshot-specific data were used in this study, we can now use these findings to decide which data to accept as final and which images should go through expert review.
Perhaps most importantly, this framework allows us to be transparent with data users by providing specific metrics on the accuracy of our dataset. These confidence measures can be considered when using the data as input for models, when choosing research questions, and when interpreting the data for use in management decision making.
The study examined nearly 20,000 images classified on the crowdsourcing platform Zooniverse. Classifications for each species were analyzed to identify the false-negative error probability (the likelihood that a species is indicated as not present when it is) and the false-positive error probability (the likelihood that a species is indicated as present when it is not).
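To make the two error types concrete, here is a minimal sketch that computes them as simple proportions from pairs of (expert label, crowd label). It is not the estimator used in the publication, which is model-based. The function name, the toy data, and the choice of denominators are assumptions: false negatives are taken as a share of photos that truly contain the species, and false positives as a share of photos that were classified as that species.

```python
def error_rates(pairs, species):
    """Return (false_negative, false_positive) proportions for one species.

    pairs: list of (expert_label, crowd_label) tuples, one per photo.
    Assumed denominators (see the note above):
      false negative = truly present photos that were classified as something else
      false positive = photos classified as the species that were something else
    """
    crowd_when_present = [crowd for expert, crowd in pairs if expert == species]
    expert_when_claimed = [expert for expert, crowd in pairs if crowd == species]

    fn = (sum(c != species for c in crowd_when_present) / len(crowd_when_present)
          if crowd_when_present else None)
    fp = (sum(e != species for e in expert_when_claimed) / len(expert_when_claimed)
          if expert_when_claimed else None)
    return fn, fp

# Made-up toy data, not results from the study.
pairs = [
    ("elk", "elk"), ("elk", "deer"),
    ("deer", "deer"), ("deer", "deer"),
    ("blank", "blank"), ("deer", "blank"),
]
print(error_rates(pairs, "elk"))    # (0.5, 0.0)
print(error_rates(pairs, "blank"))  # (0.0, 0.5)
```

In this toy data, elk are sometimes missed but never wrongly claimed, while "blank" is never missed but is sometimes claimed for a photo that contains an animal.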
The authors found that classifications were 93% correct overall, but the rate of accuracy varied widely by species. This has major implications for wildlife management, where data are analyzed and decisions are made on a species-by-species basis. The graphs below show how variable the false-positive and false-negative probabilities were for each species, with the whiskers representing 95% confidence intervals.
Errors by species
We can conclude from these graphs that each species has a different set of considerations regarding these two errors. For example, deer and turkeys both have low false-negative and false-positive error rates, meaning that classifiers are good at correctly identifying these species and few are missed. Elk photos do not exhibit the same trends.
When a classifier identifies an elk in a photo, it is almost always an elk, but there are a fair number of photos of elk that are classified as some other species. For blank photos, the errors go in the opposite direction: if a photo is classified as blank, there is a ~25% probability that there is an animal in the photo, but there are very few blank photos that are incorrectly classified as having an animal in them.
Assessing species classifications with these two types of errors in mind helps us understand what we need to consider when determining final classifications of the data and its use for wildlife decision support.
When tested, the model successfully identified 97% of misclassified images. Factors considered in developing the model included differences in camera placement between sites and the way in which Zooniverse users interacted with the images, among others.
In general, the higher the proportion of users that agreed on the identity of the animal in the image, the greater the likelihood it was correct. Even seasonality was useful in evaluating accuracy for some species – snowshoe hares were found to be easily confused with cottontail rabbits in the summertime, when they both sport brown pelage.
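The agreement idea can be illustrated with a small sketch. It assumes each image has a list of volunteer votes and uses a hypothetical 0.8 agreement cutoff to decide what to send for expert review. The study's model combined several factors, so this single-factor rule is only a simplified stand-in, and the function names are invented for the example.

```python
from collections import Counter

def consensus(votes):
    """Return the most common label and the share of voters who chose it."""
    label, n = Counter(votes).most_common(1)[0]
    return label, n / len(votes)

def needs_expert_review(votes, min_agreement=0.8):
    """Flag an image when volunteer agreement falls below a hypothetical cutoff."""
    _, agreement = consensus(votes)
    return agreement < min_agreement

# Made-up votes for two images.
clear_image = ["deer"] * 9 + ["elk"]
summer_hare = ["cottontail", "snowshoe hare", "cottontail",
               "snowshoe hare", "cottontail"]

print(consensus(clear_image), needs_expert_review(clear_image))
print(consensus(summer_hare), needs_expert_review(summer_hare))
```

High agreement lets a classification be accepted as is, while split votes, like the hare example, point to the images where expert time is best spent.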
Not only does the information derived from this study have major implications for Snapshot Wisconsin, but the framework for determining and remediating data quality presented in the article can also benefit a broad range of big-data projects.