Over winter break, I started learning about computer vision algorithms and working on some intro vision projects. This book has been a great resource: it’s really clear and introduces both the theory and practice of computer vision.
Anyways, the project: our TreeHacks hack (we built it over 36 hours) is an image-recognition project wrapped into a tour guide application. You take a picture of a landmark near you, and it tells you what the landmark is and gives you a history lesson about it. We called it Wanderlust; here’s a demo:
I worked on the top-level architecture of everything and then implemented the vision system using OpenCV in Python. We have a few different sources of data and images. First, I wrote a Scrapy script to grab landmarks and their associated locations and descriptions from the UNESCO World Heritage Sites page. The locations were passed to the Panoramio API to get a library of images for each landmark. For other cases, we also went around Stanford taking pictures, grabbed other images manually, and tagged them with our own information.
Then the cool stuff: for each image, we run an OpenCV feature detector (SIFT) and pickle the resulting features into MongoDB along with metadata about the image (location, descriptions). To match an image, we compare its features against those of every image in the library (almost literally a dot product of each feature in each image) and return the metadata of the best match, which we serve up to the app through a Flask REST API. The front-facing Android app POSTs an image to this API and gets back usable information for a map and history blurb.
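Here is a minimal sketch of that matching step, not our actual code. It assumes each library document has a hypothetical `features` field holding a pickled list of descriptor vectors, plus a `name` field. It normalizes the descriptors so the dot product acts as a cosine similarity and counts how many query descriptors find a close counterpart. The `threshold` value is an arbitrary assumption, and the toy 3-dimensional vectors stand in for real 128-dimensional SIFT descriptors.

```python
import math
import pickle


def normalize(vec):
    """Scale a descriptor to unit length so a dot product acts as cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def match_score(query_descs, library_descs, threshold=0.9):
    """Count query descriptors that have a close counterpart in one library image."""
    lib = [normalize(d) for d in library_descs]
    score = 0
    for q in map(normalize, query_descs):
        # Best dot product of this query descriptor against every library descriptor.
        best = max((sum(a * b for a, b in zip(q, d)) for d in lib), default=0.0)
        if best >= threshold:
            score += 1
    return score


def best_match(query_descs, library):
    """Return (document, score) for the highest-scoring entry, or (None, 0) if empty."""
    if not library:
        return None, 0
    scored = [
        (match_score(query_descs, pickle.loads(doc["features"])), doc)
        for doc in library
    ]
    score, doc = max(scored, key=lambda pair: pair[0])
    return doc, score


# Toy library: "features" mimics the pickled descriptors stored with each image.
library = [
    {"name": "Taj Mahal", "features": pickle.dumps([[1, 0, 0], [0, 1, 0]])},
    {"name": "Hoover Tower", "features": pickle.dumps([[0, 0, 1], [0, 1, 1]])},
]
query = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.05]]
doc, score = best_match(query, library)
print(doc["name"], score)
```

Every query descriptor is compared against every descriptor of every library image, which is why matching time grows with the size of the library.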
The Android app remains unpublished, but the REST API is directly accessible through the website (click through to /wonders/ and try uploading a picture of the Taj Mahal!). For demoing our app, the dataset is small (~200 images) so that it runs in reasonable time, but matching time scales linearly with the size of the library, so it extends to whatever size we would want. The quality of the image matching is far from perfect; if real-time response were not a constraint, this could be improved as well.
We got some useful feedback and have some thoughts on what would make this viable going forward:
We could easily filter images by a rough idea of your location. Actually, because we’re using MongoDB (which supports queries by geolocation) and we already have location data, this would be almost trivial to implement. This means we could have millions of images condensed into features and stuffed into MongoDB, but only have to search through on the order of hundreds to actually match an image.
Cheat. Instead of returning only the very best matching image, present a few of the best matches and let the user choose which one they want to learn more about.
Moving to PCA-SIFT or some other feature descriptor algorithm might improve the accuracy of the vision system itself.
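The first two ideas can be sketched together in plain Python. This is only an illustration under stated assumptions: in practice the location filter would be a MongoDB geospatial query rather than a Python loop, and the `lat`/`lon` field names, the 5 km radius, the approximate coordinates, and the precomputed scores are all made up for the example.

```python
import heapq
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in kilometers."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearby(library, lat, lon, radius_km=5.0):
    """Keep only library entries within radius_km of the user's rough location."""
    return [
        doc for doc in library
        if haversine_km(lat, lon, doc["lat"], doc["lon"]) <= radius_km
    ]


def top_matches(candidates, score_fn, k=3):
    """Return the k highest-scoring candidates so the user can pick among them."""
    return heapq.nlargest(k, candidates, key=score_fn)


# Approximate coordinates, for illustration only.
library = [
    {"name": "Hoover Tower", "lat": 37.4277, "lon": -122.1669},
    {"name": "Memorial Church", "lat": 37.4268, "lon": -122.1706},
    {"name": "Taj Mahal", "lat": 27.1751, "lon": 78.0421},
]
# Hypothetical match scores, standing in for the real feature comparison.
scores = {"Hoover Tower": 12, "Memorial Church": 31, "Taj Mahal": 40}

candidates = nearby(library, 37.4275, -122.1697)
suggestions = top_matches(candidates, lambda doc: scores[doc["name"]], k=2)
print([doc["name"] for doc in suggestions])
```

The Taj Mahal has the highest score in this toy data, but it never reaches the matcher because the location filter drops it first. That is the whole point: the expensive feature comparison only runs on the handful of images near the user.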
Fun times! Looking forward to many more image recognition projects in the future.