What Makes Data “Good”?
The following piece was written by Snapshot Wisconsin’s Data Scientist, Ryan Bemowski.
Have you ever heard the phrase “Data doesn’t lie”? It’s often used when suggesting a conclusion based on the story that scientific data tells. The statement is true: raw data is incapable of lying. However, data collection, data processing, data presentation, and even interpretation can be skewed or biased. Data is made “good” by understanding its collection, processing, and presentation methods while accounting for their pitfalls. Some might be surprised to learn that it is also the responsibility of the consumer or observer of the data to be vigilant when drawing conclusions from what they see.
Thanks to the data collection efforts of more than 3,000 camera host volunteers over 5 years, Snapshot Wisconsin has amassed over 54,000,000 photos. Is all this data used for analysis and presentations? The short answer is: not quite. Snapshot Wisconsin uses a scientific approach, so any photos that do not meet the collection specifications are unusable for analysis or presentation. Under these circumstances, a certain amount of data loss is expected during the collection process. Let’s look more closely at why some photos are not usable in our data analysis and presentations.
When data is considered unusable for analysis and presentation, corrections are made during the data processing phase. There are numerous steps in processing Snapshot Wisconsin data, and each step may temporarily or permanently mark data as unusable for presentation. For example, a camera that is baited with food, checked too frequently (such as on a weekly basis), checked too infrequently (such as once a year), or set in an improper orientation may lead to permanently unusable photos. This is why it is very important that camera hosts follow the setup instructions when deploying a camera. The two photo series below show a proper camera orientation (top) and an improper camera orientation (bottom). The properly oriented camera is pointed along a flooded trail, while the improperly oriented camera is pointed at the ground. An improper orientation usually happens through no fault of the camera host, because of weather or animal interaction, but it must be corrected for the photos to be usable for analysis and presentation.
In another case, a group of hard-to-identify photos may be temporarily marked as unusable. Once the identity of the species in the photos is expertly verified by DNR staff, the photos are used for analysis and presentation.
Usable data from the data processing phase can be analyzed and presented. The presentation phase often filters the data down to a specific species, timeframe, and region. With every new filter, the data set gets smaller. At a certain point, the data set can become so small that it carries an unacceptably high risk of being misleading or misinterpreted. In the Snapshot Wisconsin Data Dashboard, once the data set becomes too small to visualize effectively, it is marked as “Insufficient Data.” This data cannot reliably be presented on its own; instead, it is used in other calculations where enough data is present.
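The filtering effect described above can be sketched in a few lines of Python. This is only an illustration, not the Data Dashboard’s actual logic: the record fields, the MIN_DETECTIONS threshold of 30, the summarize helper, and the sample records are all assumptions made for this example.

```python
from dataclasses import dataclass

# Hypothetical cutoff; the real Data Dashboard threshold is not stated in this article.
MIN_DETECTIONS = 30


@dataclass(frozen=True)
class Detection:
    species: str
    county: str
    year: int


def summarize(detections, species=None, county=None, year=None):
    """Apply each optional filter, then return a count or 'Insufficient Data'."""
    selected = [
        d for d in detections
        if (species is None or d.species == species)
        and (county is None or d.county == county)
        and (year is None or d.year == year)
    ]
    if len(selected) < MIN_DETECTIONS:
        return "Insufficient Data"
    return len(selected)


# Made-up records, for illustration only.
records = (
    [Detection("deer", "Dane", 2019)] * 40
    + [Detection("deer", "Dane", 2020)] * 10
    + [Detection("cottontail", "Dane", 2020)] * 5
)

print(summarize(records, species="deer"))             # 50
print(summarize(records, species="deer", year=2020))  # Insufficient Data
```

Each added filter can only shrink the selection, so a view that is reliable at the species level may fall below the cutoff once a year or county is added.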
Let’s use the Data Dashboard presence map with deer selected as an example. The map on the left contains 5,800,000 detections. A detection is a photo event recorded when an animal walks in front of a trail camera. What if we were to narrow down the data we are looking at by randomly selecting only 72 detections, one per county? After taking that sample of one detection per county, only 12 of the detections had deer in them, as shown by the map on the right. The second map is quite misleading, since it appears that only 12 counties have detected a deer. When data samples are too small, the data can easily be misinterpreted. This is precisely why very small data samples are omitted from data presentations.
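The one-detection-per-county thought experiment can be tried on synthetic data. The sketch below does not use Snapshot Wisconsin data: the number of detections per county and the share of deer detections are made-up values, and every county is constructed to contain deer so that any gap in the sampled result comes purely from the small sample.

```python
import random

NUM_COUNTIES = 72             # one entry per Wisconsin county
DETECTIONS_PER_COUNTY = 1000  # made-up size, for illustration only
DEER_SHARE = 0.2              # made-up fraction of detections that are deer


def build_detections():
    """Build synthetic detections in which every county has some deer."""
    n_deer = int(DETECTIONS_PER_COUNTY * DEER_SHARE)
    species = ["deer"] * n_deer + ["other"] * (DETECTIONS_PER_COUNTY - n_deer)
    return {county: list(species) for county in range(NUM_COUNTIES)}


def sample_one_per_county(data, rng):
    """Keep a single randomly chosen detection from each county."""
    return {county: [rng.choice(species)] for county, species in data.items()}


def counties_with_deer(data):
    """Count the counties whose detections include at least one deer."""
    return sum(1 for species in data.values() if "deer" in species)


rng = random.Random(0)  # fixed seed so the run is repeatable
full = build_detections()
sample = sample_one_per_county(full, rng)

print(counties_with_deer(full))    # 72, because every county has deer by construction
print(counties_with_deer(sample))  # typically far fewer; depends on the random draw
```

The full data shows deer everywhere, while the tiny sample leaves most counties looking empty, even though nothing about the underlying data changed.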
There are a lot of choices to make when preparing presentations of data. We make it a priority to display as much information, in as much detail, as possible while still creating reliable and easily interpretable visualizations.
In the end, interpretation is everything. It is the responsibility of the observer of the data presentation to be open and willing to accept the data as truth, yet cautious of potential biases and misinterpretations. As a consumer of the presentation, it is important to refrain from making too many assumptions. For example, in the Snapshot Wisconsin Data Dashboard detection rates plot (shown below), cottontails have only a fraction of the detections that deer have across the state. It is quite easy to think “The deer population in Wisconsin is much larger than the cottontail population,” but that would be a misinterpretation regardless of how true or false the statement may be.
Remember, the Snapshot Wisconsin Data Dashboard presents data about detections from our trail cameras, not overall population. There is no data in the Snapshot Wisconsin Data Dashboard which implies that one species is more populous than any other. Detectability, or how likely an animal is to be detected by a camera, plays a major role in the data used on the Snapshot Wisconsin Data Dashboard. Deer are one of the largest, most detectable species, while the smaller, brush-dwelling cottontail is one of the more difficult to detect.
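A toy calculation shows why detection counts alone cannot rank populations. It assumes a deliberately simplified model in which expected detections equal population times detectability; this is not how Snapshot Wisconsin models detections, and the population and detectability values are invented for the example.

```python
def expected_detections(population, detectability):
    """Simplified model: each animal is detected with the given probability."""
    return population * detectability


# Made-up values: both species are given the SAME population on purpose.
population = 1000
deer = expected_detections(population, 0.5)          # large, easy to detect
cottontail = expected_detections(population, 0.125)  # small, brush-dwelling

print(deer, cottontail)   # 500.0 125.0
print(deer / cottontail)  # 4.0
```

Under these assumptions, deer produce four times as many detections as cottontails even though the two populations are identical, so a gap in detections says nothing on its own about which species is more numerous.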
So, is the data “good”?
Yes, Snapshot Wisconsin is full of good data. If we continue to practice proper data collection, rigorous data processing, and mindful data presentation, Snapshot Wisconsin data will continue to get even better. Interpretation is also a skill that needs practice. While viewing any data presentation, be willing to accept the presented data as truth, but also be vigilant in your interpretation so that you are not misled and do not misinterpret what is presented.