
What Makes Data “Good”?

The following piece was written by Snapshot Wisconsin’s Data Scientist, Ryan Bemowski. 

Have you ever heard the phrase “Data doesn’t lie”? It’s often used when suggesting a conclusion based on the way scientific data tells a story. The statement is true: raw data is incapable of lying. However, data collection, data processing, data presentation and even interpretation can be skewed or biased. Data is made “good” by understanding its collection, processing, and presentation methods while accounting for their pitfalls. Some might be surprised to learn that it is also the responsibility of the consumer or observer of the data to be vigilant when drawing conclusions from what they are seeing.

A graphic showing how data moves from collection to processing and presentation.

Data Collection

Thanks to the data collection efforts of more than 3,000 camera host volunteers over 5 years, Snapshot Wisconsin has amassed over 54,000,000 photos. Is all this data used for analysis and presentations? The short answer is: not quite. Snapshot Wisconsin uses a scientific approach, and any photos that do not follow the collection specifications are unusable for analysis or presentation. Under these circumstances, a certain amount of data loss is expected during the collection process. Let’s dive into why some photos are not usable in our data analysis and presentations.

Data Processing

When data is considered unusable for analysis and presentation, corrections are made during the data processing phase. There are numerous steps in processing Snapshot Wisconsin data, and each step may temporarily or permanently mark data as unusable for presentation. For example, a camera which is baited with food, checked too frequently (such as on a weekly basis), checked too infrequently (such as once a year), or in an improper orientation may lead to permanently unusable photos. This is why it is very important that camera hosts follow the setup instructions when deploying a camera. The two photo series below show a proper camera orientation (top) and an improper camera orientation (bottom). The properly oriented camera is pointed along a flooded trail, while the improperly oriented camera is pointed at the ground. Improper orientation usually happens through no fault of the camera host, often due to weather or animal interaction, but it must be corrected for the photos to be usable for analysis and presentation.


A properly oriented camera (top) compared to an improperly oriented camera (bottom).

In another case, a group of hard-to-identify photos may be temporarily marked as unusable. Once the species in these photos are verified by expert DNR staff, the photos are used for analysis and presentation.
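For readers who like to think in code, here is a minimal sketch of the usability rules described above. The field names and numeric thresholds are illustrative assumptions, not Snapshot Wisconsin’s actual processing code.

```python
# A minimal sketch of the usability rules described above. Field names and
# thresholds are illustrative assumptions, not Snapshot Wisconsin's real code.
from dataclasses import dataclass

@dataclass
class CameraCheck:
    baited: bool                # food was placed in front of the camera
    days_between_checks: float  # average interval between camera checks
    orientation_ok: bool        # aimed along the trail, not at the ground
    species_verified: bool      # DNR staff have confirmed the species ID

def usability(check: CameraCheck) -> str:
    """Label a photo series as usable, or temporarily/permanently unusable."""
    # Permanent problems: baiting, weekly checks, yearly checks, bad orientation.
    if (check.baited or check.days_between_checks < 30
            or check.days_between_checks > 365 or not check.orientation_ok):
        return "permanently unusable"
    # Temporary problem: hard-to-identify photos awaiting expert verification.
    if not check.species_verified:
        return "temporarily unusable"
    return "usable"

print(usability(CameraCheck(False, 90, True, True)))  # usable
print(usability(CameraCheck(False, 7, True, True)))   # permanently unusable
```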

Data Presentation

Usable data from the data processing phase can be analyzed and presented. The presentation phase often filters the data down to a specific species, timeframe, and region. With every new filter, the dataset gets smaller. At a certain point it can become too small, introducing an unacceptably high risk of being misleading or misinterpreted. In the Snapshot Wisconsin Data Dashboard, once the data becomes too small to visualize effectively it is marked as “Insufficient Data.” This data still contributes to other calculations where enough data is present, but it cannot reliably be presented on its own.
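As a rough illustration of how such an “Insufficient Data” rule could work, here is a short sketch. The minimum-detection cutoff is invented, and the filter fields simply mirror the dashboard’s species, timeframe, and region selections; this is not the dashboard’s actual logic.

```python
# Illustrative only: the cutoff is invented and the filters mirror the
# dashboard's species/timeframe/region selections.
MIN_DETECTIONS = 100  # hypothetical cutoff, not the dashboard's real value

def presence_value(detections, species, year, county):
    """Return a count for one dashboard cell, or flag it as insufficient."""
    subset = [d for d in detections
              if d["species"] == species
              and d["year"] == year
              and d["county"] == county]
    if len(subset) < MIN_DETECTIONS:
        return "Insufficient Data"  # too small to visualize reliably on its own
    return len(subset)

# Example: two detections in one county-year is far below the cutoff.
sample = [{"species": "deer", "year": 2020, "county": "Dane"}] * 2
print(presence_value(sample, "deer", 2020, "Dane"))  # Insufficient Data
```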


Snapshot Wisconsin Data Dashboard presence plot with over 5,800,000 detections (left) and a similar plot with only 72 detections sampled (right).

Let’s use the Data Dashboard presence map with deer selected as an example. The plot on the left contains over 5,800,000 detections. A detection is a photo event recorded when an animal walks in front of a trail camera. What if we narrowed down the data by randomly selecting only 72 detections, one per county? After taking that sample, only 12 of the detections had deer in them, as shown in the plot on the right. The second plot is quite misleading, since it appears that only 12 counties have detected a deer. When data samples are too small, the data can easily be misinterpreted. This is precisely why very small data samples are omitted from data presentations.
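The sampling experiment above can be imitated in a few lines of code. The species mix below is invented; the point is only to show how much a 72-detection sample, one per county, can swing from one random draw to the next.

```python
# Invented species mix; the point is the run-to-run variability of a
# 72-detection sample (one detection per county).
import random

SPECIES = ["deer", "turkey", "raccoon", "squirrel"]

def deer_in_sample(rng):
    # Randomly pick one detection per county and count how many contain deer.
    picks = [rng.choice(SPECIES) for _ in range(72)]
    return sum(1 for s in picks if s == "deer")

rng = random.Random(42)
print([deer_in_sample(rng) for _ in range(5)])  # counts vary noticeably per run
```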

There are many choices to make when presenting data. We make it a priority to display as much information, in as much detail, as possible while still creating reliable and easily interpretable visualizations.

Interpretation

In the end, interpretation is everything. It is the responsibility of the observer of a data presentation to be open and willing to accept the data as truth, yet cautious of potential biases and misinterpretations. It is important to refrain from making too many assumptions as a consumer of the presentation. For example, in the Snapshot Wisconsin Data Dashboard detection rates plot (shown below), cottontails have only a fraction of the detections that deer have across the state. It is quite easy to think, “The deer population in Wisconsin is much larger than the cottontail population,” but that would be a misinterpretation, regardless of how true or false the statement may be.

A bar graph showing detections per year of the five most common species.

Remember, the Snapshot Wisconsin Data Dashboard presents data about detections from our trail cameras, not overall population. There is no data in the Snapshot Wisconsin Data Dashboard which implies that one species is more populous than any other. Detectability, or how likely an animal is to be detected by a camera, plays a major role in the data used on the Snapshot Wisconsin Data Dashboard. Deer are one of the largest and most detectable species, while the smaller, brush-dwelling cottontail is one of the more difficult to detect.

So, is the data “good”?

Yes, Snapshot Wisconsin is full of good data. If we continue to practice proper data collection, rigorous data processing, and mindful data presentation, Snapshot Wisconsin data will continue to get even better. Interpretation is also a skill that needs practice. While viewing any data presentation, be willing to accept the presented data as truth, but also be vigilant in your interpretation so that you are not misled and do not misinterpret what is presented.

What Happens to Photos Once Uploaded?

The following piece was written by OAS Communications Coordinator Ryan Bower for the Snapshot Wisconsin newsletter. To subscribe to the newsletter, visit this link.

Since Snapshot reached 50 million photos, the Snapshot team felt it was a good time to address one of the most frequently asked questions about photos: what happens to photos once they are uploaded by volunteers? At first, the process seems complicated, but a member of the Snapshot team, Jamie Bugel, is here to walk us through it, one step at a time.

Bugel is a Natural Resources Educator and Research Technician at the DNR, but she works on the volunteer side of Snapshot. Bugel said, “I mainly help volunteers troubleshoot issues with their equipment or with their interactions with the MySnapshot interface. I am one of the people who answer the Snapshot phone, and I help update the user interface by testing functionality. There is also lots of data management coordination on the volunteer side of the program that I help with.”

Bugel listed a few of the more common questions she and the rest of the Snapshot team get asked, including who reviews photos after the initial classification, what happens to the photos that camera hosts can’t identify, and how mistakes get rectified. “We get asked those [questions] on a weekly to daily basis,” said Bugel.

It Starts With a Three-Month Check and an Upload

Every three months, trail camera hosts are supposed to swap out the SD card and batteries in their trail camera. At the same time, volunteers fill out a camera check sheet, including what time of day they checked the camera, how many photos were on the SD card and whether there was any equipment damage.

“You should wait at least three months to check [your] camera so that you don’t disturb the wildlife by checking more often. We want to view the wildlife with as little human interference as possible,” said Bugel. “At the same time, volunteers should check [their camera] at least every three months, because batteries don’t last much longer than three months. Checking this often is important to avoid missing photos.”

After the volunteer completes their three-month check, they bring the SD card home, enter the information from their camera check sheet into their MySnapshot account, and upload their photos.

Bugel said it can take anywhere from 4 to 48 hours for the photos to appear in the volunteer’s MySnapshot account. Fortunately, the server will send an email when the photos are ready, so volunteers don’t have to keep checking. Volunteers can start classifying their photos after receiving the email.

A fisher walking through the snow

Initial Classification By Camera Hosts

The first round of classification is done by the trail camera hosts. The returned photos will sit in the Review Photos section of their MySnapshot account while the host classifies the photos as Human, Blank or Wildlife. The wildlife photos are also further classified by which species are present in the photo, such as beaver, deer or coyote.

This initial classification step is very important for protecting the privacy of our camera hosts, and it also helps on the back end of data processing. Over 90% of all photos are classified at this step by the camera hosts. When hosts are done classifying photos, they click “review complete,” and the set of photos is sent to the Snapshot team for the second round of classification.

Staff Review

The second round of classification is the staff review. Members of the Snapshot team review sets of photos to verify that all human or blank photos have been properly flagged. “For example, a deer photo may include a deer stand in the background. That type of photo will not go to Zooniverse because there is a human object in the photo,” said Bugel. Fortunately, nearly all human photos are taken during the initial camera setup or while swapping batteries and SD card, so they are usually clumped and easy to spot.

The second reason that staff review photos after the initial classification is for quality assurance. Since some animal species are tricky to correctly classify, someone from the Snapshot team reviews sets to verify that the photos were tagged with the correct species. This quality assurance step helps rectify mistakes. “Sometimes there are photos classified as blank or a fawn that are actually of an adult deer,” said Bugel. “We want to catch that mistake before it goes into our final database.”

In cases where the set of photos wasn’t classified by the camera host, the team will also perform the initial classification to remove human and blank photos. The Snapshot team wants to make sure any photos that reveal the volunteer’s identity or the location of the camera are removed before those photos continue down the pipeline.

Branching Paths

At this point in the process, photos branch off and go to different locations, depending on their classification. Blank (43%) and human (2%) photos are removed from the pipeline. Meanwhile, wildlife photos (20%) move on either to Zooniverse for consensus classification or directly to the final dataset. The remaining photos don’t yet fall into one of these categories, such as unclassified photos still awaiting initial review.
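A simple way to picture this branching is as a routing function. The sketch below uses the example species named in this article (wolves and coyotes as difficult, deer and squirrels as easy); the species sets are illustrative, not the team’s full routing table.

```python
# Sketch of the branching described above. The species sets are examples from
# this article, not an exhaustive routing table.
DIFFICULT_SPECIES = {"wolf", "coyote"}       # need consensus on Zooniverse
EASY_SPECIES = {"deer", "fawn", "squirrel"}  # volunteer classification trusted

def route(label):
    if label in ("blank", "human"):
        return "removed"             # dropped from the pipeline
    if label is None or label in DIFFICULT_SPECIES:
        return "zooniverse"          # sent on for consensus classification
    if label in EASY_SPECIES:
        return "final_dataset"       # goes straight into the final data
    return "zooniverse"              # anything uncertain gets more eyes

print(route("deer"))    # final_dataset
print(route("coyote"))  # zooniverse
print(route("human"))   # removed
```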

Photos of difficult-to-classify species, such as wolves and coyotes, are sent to Zooniverse for consensus classification. Bugel explained, “The photos [of challenging species] will always go to Zooniverse, even after volunteer classification and staff member verification, because we’ve learned we need more eyes on those to get the most accurate classification possible.” This adds another layer of quality assurance.

Alternatively, photos with easy-to-classify species, such as deer or squirrel, go directly to the final dataset. Bugel said, “If a photo is classified as a deer or fawn, we trust that the volunteer correctly identified the species.” These photos do not go to Zooniverse.

A deer fawn leaping through

Zooniverse

Photos of difficult-to-classify species or unclassified photos move on to Zooniverse, the crowdsourcing platform, for consensus classification. “Wolf and coyote photos, for example, always go to Zooniverse, because it is so hard to tell the difference, especially in blurry or nighttime photos,” said Bugel.

The Snapshot team has run accuracy analyses for most Wisconsin species to determine which species’ photos need consensus classification. All photos of species with low accuracies go to Zooniverse.

On Zooniverse, volunteers from around the globe classify the wildlife in these photos until a consensus is reached, a process called consensus classification. An individual photo may be classified by up to eleven different volunteers before it is retired, but it could be as few as five if a uniform consensus is reached early. “It all depends on how quickly people agree,” said Bugel.

Team members upload photos to Zooniverse in sets of ten to twenty thousand, and each set is called a season. Bugel explained, “Once all of the photos in that season are retired, we take a few days break to download all of the classifications and add them to our final dataset. Then, a Snapshot team member uploads another set of photos to Zooniverse.” Each set takes roughly two to four weeks to get fully classified on Zooniverse.

To date, over 10,400 people have registered to classify photos on Zooniverse, and around 10% of the total photos have been classified by these volunteers on Zooniverse.

Expert Review

It is also possible for no consensus to be reached, even after eleven classifications. This means that no species received five or more votes out of the eleven possible classifications. These photos are set aside for later expert review.
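Put together, the retirement and consensus rules described here can be sketched as a small function. This is a simplification of what the article describes, not Zooniverse’s actual retirement logic, and the example votes are made up.

```python
from collections import Counter

# Simplified sketch of the rules described above: retire after 5 unanimous
# classifications or after 11 total; a species "wins" with 5 or more votes,
# otherwise the photo goes to expert review. Real Zooniverse retirement rules
# are more nuanced than this.
def resolve(votes):
    """votes: classifications in the order they were made."""
    counts = Counter(votes)
    if len(votes) >= 5 and len(counts) == 1:
        return votes[0]                       # early, uniform consensus
    if len(votes) >= 11:
        species, n = counts.most_common(1)[0]
        return species if n >= 5 else "expert review"
    return "keep collecting classifications"

print(resolve(["wolf"] * 5))                                        # wolf
print(resolve(["wolf", "coyote", "wolf", "dog", "coyote", "wolf",
               "coyote", "dog", "wolf", "coyote", "dog"]))          # expert review
```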

Expert review was recently implemented by the Snapshot team and is the last step before difficult photos go into the final dataset. The team has to make sure all photos have a concrete classification before they can go into the final dataset, yet some photos never reach a consensus. Team members review these photos again, looking at the records of how each photo was classified during initial review and on Zooniverse. While there will always be photos that are unidentifiable, expert review by staff helps ensure that every photo is classified as completely as possible, even the hard ones.

The Final Dataset and Informing Wildlife Management

Our final dataset is the last stop for all photos. This dataset is used by DNR staff to inform wildlife management decisions around the state.

Bugel said, “The biggest management decision support that Snapshot provides right now is fawn-to-doe ratios. Jen [Stenglein] uses Snapshot photo data, along with data from other initiatives, to calculate a ratio of fawns to does each year and that ratio feeds into the deer population model for the state.”
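The ratio itself is simple arithmetic, as the sketch below shows with made-up detection counts; the DNR’s actual calculation combines Snapshot data with data from other initiatives and statistical corrections.

```python
# The counts below are made up. The real calculation combines Snapshot photo
# data with other survey data and statistical corrections.
fawns_detected = 1200  # hypothetical summer fawn detections
does_detected = 1500   # hypothetical summer doe detections

fawn_to_doe_ratio = fawns_detected / does_detected
print(f"fawn-to-doe ratio = {fawn_to_doe_ratio:.2f}")  # 0.80
```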

Snapshot has also spotted rare species, such as a marten in Vilas County and a whooping crane in Jackson County. Snapshot cameras even caught sight of a cougar in Waupaca County, one of only a handful of confirmed sightings in the state.

The final dataset feeds into other Snapshot Wisconsin products, including the Data Dashboard, and helps inform management decisions for certain species, such as elk. Now that the final dataset has reached a sufficient size, the Snapshot team is expanding its impact by feeding the data into other decision-making processes at the DNR and developing new products.

The Snapshot team hopes that this explanation helps clarify some of the questions our volunteers have about what happens to their photos. We know the process can seem complicated at first, and the Snapshot team is happy to answer additional questions. Reach out to the team by email or give them a call at +1 (608) 572-6103.

An infographic showing how photos move from download to final data

May Science Update: Maintaining Quality in “Big Data”

Snapshot Wisconsin relies on different sources to help classify our growing dataset of more than 27 million photos, including our trail camera hosts, Zooniverse volunteers and experts at Wisconsin DNR. With all these different sources, we need ways to assess the quality and accuracy of the data before it’s put into the hands of decision makers.

A recent publication in Ecological Applications by Clare et al. (2019) looked at the issue of maintaining quality in “big data” by examining Snapshot Wisconsin images. The information from the study was used to develop a model that helps us predict which photos are most likely to contain classification errors. Because Snapshot-specific data were used in this study, we can now use these findings to decide which data to accept as final and which images should go through expert review.

Perhaps most importantly, this framework allows us to be transparent with data users by providing specific metrics on the accuracy of our dataset. These confidence measures can be considered when using the data as input for models, when choosing research questions, and when interpreting the data for use in management decision making.

False-positive, false-negative

The study examined nearly 20,000 images classified on the crowdsourcing platform Zooniverse. Classifications for each species were analyzed to identify the false-negative error probability (the likelihood that a species is indicated as not present when it is) and the false-positive error probability (the likelihood that a species is indicated as present when it is not).
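For concreteness, here is a small sketch of how these two error rates could be computed per species from paired crowd and expert labels. The field names and toy records are hypothetical, and Clare et al. (2019) use a more complete modelling framework than this.

```python
# Hypothetical sketch: compute the two error rates per species from paired
# (crowd, expert) labels. Clare et al. (2019) use a fuller modelling framework.
def error_rates(records, species):
    expert_pos = [r for r in records if r["expert"] == species]
    crowd_pos = [r for r in records if r["crowd"] == species]
    # false negative: the species was truly present but the crowd said otherwise
    fn = sum(r["crowd"] != species for r in expert_pos) / max(len(expert_pos), 1)
    # false positive: the crowd reported the species but the expert disagreed
    fp = sum(r["expert"] != species for r in crowd_pos) / max(len(crowd_pos), 1)
    return fn, fp

records = [
    {"crowd": "deer", "expert": "deer"},
    {"crowd": "elk", "expert": "deer"},
    {"crowd": "elk", "expert": "elk"},
    {"crowd": "deer", "expert": "elk"},
]
print(error_rates(records, "elk"))  # (0.5, 0.5) on this toy data
```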


Figure 2 from Clare et al. 2019 – false-negative and false-positive probabilities by species, estimated from expert classification of the dataset. Whiskers represent 95% confidence intervals and the gray shading in the right panel represents the approximate probability required to produce a dataset with less than 5% error.

The authors found that classifications were 93% correct overall, but the rate of accuracy varied widely by species. This has major implications for wildlife management, where data are analyzed and decisions are made on a species-by-species basis. The figure above shows how variable the false-positive and false-negative probabilities were for each species, with the whiskers representing 95% confidence intervals.

Errors by species

We can conclude from these graphs that each species has a different set of considerations regarding these two errors. For example, deer and turkeys both have low false-negative and false-positive error rates, meaning that classifiers are good at correctly identifying these species and few are missed. Elk photos do not exhibit the same trends.

When a classifier identifies an elk in a photo, it is almost always an elk, but there are a fair number of photos of elk that are classified as some other species. For blank photos, the errors go in the opposite direction: if a photo is classified as blank, there is a ~25% probability that there is an animal in the photo, but there are very few blank photos that are incorrectly classified as having an animal in them.

Assessing species classifications with these two types of errors in mind helps us understand what we need to consider when determining final classifications of the data and its use for wildlife decision support.

Model success

When tested, the model was successful in identifying 97% of misclassified images. Factors considered in the development of the model included differences in camera placement between sites, the way in which Zooniverse users interacted with the images, and more.

In general, the higher the proportion of users that agreed on the identity of the animal in the image, the greater the likelihood it was correct. Even seasonality was useful in evaluating accuracy for some species – snowshoe hares were found to be easily confused with cottontail rabbits in the summertime, when they both sport brown pelage.
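As a loose illustration only, and not the authors’ model, the sketch below fits a logistic regression on synthetic data to show how features like agreement proportion and season could be used to predict misclassification risk.

```python
# Not the model from Clare et al. (2019); a toy logistic regression on
# synthetic data, using agreement proportion and season as predictors.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 500
agreement = rng.uniform(0.4, 1.0, n)  # share of Zooniverse users who agreed
summer = rng.integers(0, 2, n)        # 1 = summer photo (hares vs. cottontails)
# Synthetic "truth": low agreement and summer photos are more often wrong.
p_wrong = 1 / (1 + np.exp(-(3.0 - 6.0 * agreement + 0.8 * summer)))
is_wrong = rng.random(n) < p_wrong

X = np.column_stack([agreement, summer])
model = LogisticRegression().fit(X, is_wrong)
# Predicted risk: low-agreement summer photo vs. high-agreement winter photo.
print(model.predict_proba([[0.55, 1], [0.95, 0]])[:, 1])
```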


Not only does the information derived from this study have major implications for Snapshot Wisconsin, but the framework for determining and remediating data quality presented in the article can also benefit a broad range of big-data projects.