HelpingScience Newsletter, March 2010
In this Issue:
- Collections currently providing images
- Where we are with the software
- New this month
- What are our plans for April
- Summer Imaging Tour
Since our public announcement at TDWG 2009 we have had the chance to talk with a number of large and small collections interested in this method and technology for processing herbarium label images. It is our goal to implement an international service that can help all herbaria.
Collections currently providing images:
We would like to thank the following collections for providing sample specimen images for testing. Right now we have over 10,000 different images from around the world, dating from the 1800s to the present. We have seen the worst and the best examples and hope to address each style of label until we have a tool that is just right for everyone.
- Kew Gardens
- CyberFlora Louisiana
- National Herbarium of Victoria
- MorphBank
- McDaniel College Herbarium
Where we are with the software:
A quick outline of what is currently working with our main workflow:
- Specimen images can be retrieved directly from client image servers, our servers, or Amazon S3 storage
- Labels on a specimen sheet can be tagged and saved in as little as 8 seconds in the easiest cases
- Evernote OCR engine can return an analysis of a label within 2 minutes of submission
- Evernote detects bounding boxes for both typed and handwritten labels
- Evernote provides possible values for both typed and handwritten labels
- Our prediction engine can identify Country, State, Dates
- DwC fields can be tagged; this has been averaging around 25-45 seconds on labels from the 1940s onward
- Top 5 most related historical labels are available to autocomplete DwC Field tagging
- Keystroking words averages 3-8 seconds
- User bots can run as virtual users and submit values like any other user, but they base their decisions on OCR and lookup databases. Currently we have bots for the DwC types Family, Genus, SpecificEpithet, Author, Country, StateProvince
- Labels can be reassembled when all the marked fields have been verified
- Initial development of Data Export Wizard to provide data in CSV format. (Other formats like DwC Archive, JSON, XML, ABCD are already on the todo list)
Screenshots available at: http://www.helpingscience.org/docs/screenshots/workcenter.html
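As an illustration of the CSV export mentioned in the list above, here is a minimal sketch in Python. It is not the Data Export Wizard itself: the function name, the record layout, and the choice of columns are assumptions made for the example. Only the column names are taken from Darwin Core terms.

```python
import csv
import io

# Darwin Core term names used as CSV columns (this subset is an assumption).
DWC_COLUMNS = ["catalogNumber", "family", "genus", "specificEpithet",
               "country", "stateProvince"]


def export_csv(records):
    """Return verified label records as CSV text, one row per specimen."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=DWC_COLUMNS, lineterminator="\n")
    writer.writeheader()
    for record in records:
        # Fields that were never captured are written as empty cells;
        # keys outside DWC_COLUMNS are dropped.
        writer.writerow({col: record.get(col, "") for col in DWC_COLUMNS})
    return buffer.getvalue()
```

Calling `export_csv` with a list of dictionaries keyed by the column names above returns the full CSV text, header row first.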
A few points about the client center for managing your specimen sheets:
- Ability to view and filter a listing of all your specimens and their current stage of processing
- Specimen Viewer now lets you view all aspects of the sheet and everything that has been done to it down to the smallest details.
- Reports for viewing average speeds, specimen sheets processed, and the number of items in each type of queue.
- Prediction grid to see which values have been captured from our prediction engine. This will be extended in the coming months to include reports and priorities for queued items.
Screenshots available at: http://www.helpingscience.org/docs/screenshots/clientcenter.html
Over the past month of testing we have learned a few things.
Tagging a modern specimen with no determinations or one determination takes around 5-7 seconds. Requesting and downloading the image was taking 5-8 seconds depending on bandwidth, so the user was waiting longer for the image than it took to do the work. We therefore rewrote our software to buffer one specimen sheet: when the user saves the current label coordinates, the next image is instantly available while an additional image is downloaded in the background. This small change has allowed us to double the speed of tagging, from ~15-20 seconds down to 6-8 seconds. We are happy with this discovery and are looking for other improvements like it.
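The buffering idea can be sketched in a few lines of Python. This is an illustration only, not our production code: the class name and the `fetch` callable (anything that turns a specimen id into image data) are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


class PrefetchingQueue:
    """Serve images in order while the next one downloads in the background."""

    def __init__(self, ids, fetch):
        self._ids = iter(ids)
        self._fetch = fetch  # hypothetical callable: specimen id -> image data
        self._pool = ThreadPoolExecutor(max_workers=1)
        self._pending = self._start_next()

    def _start_next(self):
        # Kick off the download of the next specimen, if there is one.
        next_id = next(self._ids, None)
        if next_id is None:
            return None
        return self._pool.submit(self._fetch, next_id)

    def next_image(self):
        """Return the buffered image (None when finished) and prefetch the next."""
        if self._pending is None:
            self._pool.shutdown()
            return None
        image = self._pending.result()  # usually already downloaded
        self._pending = self._start_next()
        return image
```

The user interface would call `next_image()` each time label coordinates are saved. Because the following download starts immediately, the download time overlaps with the time the user spends tagging.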
New this month:
This month’s focus has been on allowing the majority of the people in the world to easily sign in and start helping. To do that we are making use of OpenID and Facebook's API. As of now we have successfully linked to these providers. The list below shows the number of users each organization has, and therefore the potential users that can help.
- Google Accounts: ~146 Million Users
- Yahoo: ~254 Million Users
- Facebook: ~200 Million Users
This allows any user who already has an account with one of these providers to sign in on that provider's website and be logged into HelpingScience automatically. The only things we collect are the username and email address. If a user has never signed into HelpingScience and chooses to use a 3rd party, we will create an HS account with an encrypted password. If they ever decide to use the HS login, they simply have to reset their password and an email will be sent to their related email account.
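A minimal sketch of this first-sign-in step follows, in Python. The in-memory `accounts` dictionary and the function name are hypothetical, and storing a salted PBKDF2 hash of a random password is an assumption about what "encrypted password" could mean here, not a description of the actual HS implementation.

```python
import hashlib
import secrets

accounts = {}  # email -> account record; stands in for a real user table


def sign_in_with_provider(username, email):
    """Return the account for a provider-verified user, creating it on first sign-in."""
    account = accounts.get(email)
    if account is None:
        # Nobody ever sees this random password, so using the native
        # login later requires a password reset by email.
        random_password = secrets.token_urlsafe(32)
        salt = secrets.token_bytes(16)
        digest = hashlib.pbkdf2_hmac(
            "sha256", random_password.encode(), salt, 100_000
        )
        account = {
            "username": username,
            "email": email,
            "salt": salt,
            "password_hash": digest,
        }
        accounts[email] = account
    return account
```

Repeat sign-ins with the same email return the existing record, so no duplicate accounts are created.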
HelpingScience iPhone App:
With over 40 million iPhone / iPod Touch users, we decided that it would be useful to start developing some interfaces for this source of citizen scientists. So far we have introduced our basic word typing game. You can see some screen shots here. At this time it is not available on the App Store and is used only for internal testing. We hope to have it available in the App Store in coordination with this summer’s public demo.
What are our plans for April?
We have 2 main goals for April. The first is to finalize our sign-in system, which includes starting to implement user- and group-level permissions for individual collections and for projects within those collections. The second goal is to improve our prediction engine to include Family, Genus, Species, Infra ranks, Authorships, and Counties. At this time our prediction engine only finds Country, US States, and Months (English).
The prediction engine is a standalone application that takes a formatted list of ids, box coordinates, and lists of possible values. The engine analyses the possible values, order, and proximity of words to determine whether any words can be associated with Darwin Core fields. If they can, these boxes and values are flagged for stage 2.
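To make the input and output concrete, here is a simplified sketch in Python. It only performs the value lookup and leaves out the order and proximity analysis. The lookup tables are tiny illustrative samples, and the record layout and function name are assumptions rather than the engine's real format.

```python
# Tiny illustrative lookup tables; a real engine would use full gazetteers.
COUNTRIES = {"united states", "australia", "mexico"}
US_STATES = {"louisiana", "maryland", "texas"}
MONTHS = {"january", "february", "march", "april", "may", "june", "july",
          "august", "september", "october", "november", "december"}

# (Darwin Core field name, lookup table) pairs, checked in this order.
LOOKUPS = [("country", COUNTRIES), ("stateProvince", US_STATES), ("month", MONTHS)]


def flag_boxes(boxes):
    """Return (box id, field, value) for every box with a recognized candidate."""
    flagged = []
    for box in boxes:
        for candidate in box["values"]:
            key = candidate.strip().lower()
            field = next((name for name, table in LOOKUPS if key in table), None)
            if field:
                flagged.append((box["id"], field, candidate.strip()))
                break  # the first recognized candidate wins for this box
    return flagged
```

Each input box is a dictionary with an `id`, its `coords`, and the OCR engine's candidate `values`. Boxes that come back in the result are the ones that would move on to stage 2.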
We have also started planning 2 processing clusters. They are based on Dell PowerEdge servers, and we should release more information in the coming months. One cluster will be located in the U.S. and another in Europe. If the demand for processing images is in the millions, it will be essential to have the technology close to the people that need it, making it as fast as possible. We are also starting to look at which pieces of the processing can be moved into the Amazon EC2 cloud computing centers. This will be very useful when certain processing queues require more time and computer cycles. This way we can launch 1-100 extra machines and run them as long as necessary, so that any backlog is never due to computer processing.
Summer Imaging Tour:
Starting July 19th we will be demonstrating our method for rapid imaging and HelpingScience, and showing the results in our SilverCollection web portal software. This will be a case study to show what is possible and how SilverBiology can help provide these imaging and processing services in a compressed amount of time, with the data online and available in a matter of days. We plan to visit herbaria over a 1 month timeframe and image at least 5,000 specimens from 10-15 herbaria. This will be our first real-world test with HelpingScience, and we hope to see wonderful results. All information, data, and statistics will be online along with our findings.
More information about the imaging tour can be found at: http://botany2010.silverbiology.com
Have some ideas?
Hope it can do something more?
Want to see another demo?
We are developing this so that everyone will want to use the service. We will continue to listen and do whatever we can to make this software as user friendly and useful as possible. If you have an idea or just want to see a demo, you can send an email to contact@helpingscience.org or reply to this email and let us know what is on your mind.
If you know someone that would like to receive this newsletter they can go to:
http://www.silverbiology.com/newsletter/
Tags: citizen science, crowd sourcing, helpingscience, herbarium imaging, specimen label processing