Deep Learning is all the rage these days. I get it, the idea of letting a computer extract features and find things on its own is amazing. I love it. In fact, while working on my Master's at Georgia Tech I took every Machine Learning class I could get my hands on, plus the Reinforcement Learning course that was also offered. They were AWESOME and they really opened my eyes to how a lot of this stuff works.
A few months back I started using darknet and the tiny YOLO approach. I set up an RTSP server on my Raspberry Pi in Python to pull images and wanted to see how it did. I experimented on a few images, but to my surprise (although, in retrospect, unsurprisingly) it failed... pretty badly. On the first attempt it actually saw the car on the side of my yard, which was utterly amazing. I ran it again mere minutes later and it never saw the car again, even after multiple attempts. Of course, this was all running quite slowly: if I recall, on the order of 60 seconds per image. I decided that as long as ZoneMinder detected some kind of "motion", I could have it process just those images. I was less concerned about realtime than about simply being notified at some point if something odd was going on.
I decided that over the holiday break I would try to work on this detector more. I started downloading all my camera data (which is really not all that old), and it turned out I had about 65GB of image data from motion captures using ZoneMinder. First off, that much data is difficult to share if I wanted anyone else to work on this with me, so I am working on paring it down. It is at least broken down by camera, so I should be able to divvy it up that way.
As a first pass, I decided to try plain image recognition to see whether it could detect anything at all in these images. I got everything set up and used https://www.tensorflow.org/tutorials/images/image_recognition
I modified it to run through an entire folder; it then spits out 3 different files. The first is a mapping of image -> human string, score, and index, where the index is used for looking up the human string in the mapping. The second is a listing of each index with all the images that had that index (regardless of score, as long as it was in the top 5). Finally, I output the index to human string mapping so I could easily look things up. The files are named image_filtering_analysis, image_filter_mapping, and image_filtering_by_class. I opened up image_filtering_by_class and looked up the first thing I saw, which was 160. This translated to "wire-haired fox terrier". This was a view of my driveway and I thought... well, I mean, I suppose that might be possible.
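For anyone curious, here is a minimal sketch of how three files like that could be written. It is not the actual script: the function name write_reports, the tab-separated layout, and the shape of the predictions dictionary are all assumptions made for illustration, and which file name goes with which contents is just my reading of the names. The classifier itself is left out; the sketch starts from top-5 results that have already been computed.

```python
import os
from collections import defaultdict


def write_reports(predictions, out_dir):
    """Write three report files from precomputed top-5 predictions.

    predictions (assumed shape): {image_name: [(index, human_string, score), ...]}
    """
    by_class = defaultdict(list)  # index -> images that had it in their top 5
    mapping = {}                  # index -> human string
    os.makedirs(out_dir, exist_ok=True)

    # File 1: one row per (image, label) pair with its score and index.
    with open(os.path.join(out_dir, "image_filtering_analysis"), "w") as f:
        for image, top in predictions.items():
            for index, human, score in top:
                f.write(f"{image}\t{human}\t{score}\t{index}\n")
                by_class[index].append(image)
                mapping[index] = human

    # File 2: each index with every image that had it, regardless of score.
    with open(os.path.join(out_dir, "image_filtering_by_class"), "w") as f:
        for index in sorted(by_class):
            f.write(f"{index}\t{','.join(by_class[index])}\n")

    # File 3: index -> human string lookup table.
    with open(os.path.join(out_dir, "image_filter_mapping"), "w") as f:
        for index in sorted(mapping):
            f.write(f"{index}\t{mapping[index]}\n")

    return dict(by_class), mapping
```

Tabs are used as the separator here because the human strings themselves contain commas.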
[Image: the driveway view labeled "wire-haired fox terrier"]
OK, so the first one is not very awesome. To be fair, though, I looked up the confidence score and that label only got 0.016631886. Interestingly, these were the top 5 scores for that image:
- submarine, pigboat, sub, U-boat: 0.27390754
- patio, terrace: 0.064846955
- steam locomotive: 0.040173832
- fountain: 0.023076175
- wire-haired fox terrier: 0.016631886
I looked up some of the labels in the dataset they used and didn't see any plain "car" labels, which just seems odd to me. I will likely have to re-train on my data, but for now I need to figure out a way to find common features.
One obvious improvement would be to require the score to reach some threshold. Suppose we set it to 50%: that would have prevented the fox terrier failure above. Searching through my data (the run, by the way, hasn't finished yet), it found another version of the same image but called it a patio, terrace with over 90% confidence. In fact, the image with the highest confidence was also labeled a patio, terrace. Here it is.
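As for the threshold idea, here is a minimal sketch of what it could look like. The function name confident_labels and the (label, score) tuple layout are assumptions for illustration; the scores are the top 5 reported for the driveway image.

```python
def confident_labels(top5, threshold=0.5):
    """Keep only the (label, score) pairs whose score reaches the threshold."""
    return [(label, score) for label, score in top5 if score >= threshold]


# Top 5 scores for the driveway image, as listed in the post.
driveway_top5 = [
    ("submarine, pigboat, sub, U-boat", 0.27390754),
    ("patio, terrace", 0.064846955),
    ("steam locomotive", 0.040173832),
    ("fountain", 0.023076175),
    ("wire-haired fox terrier", 0.016631886),
]

# Nothing reaches 0.5, so this image would end up with no label at all.
print(confident_labels(driveway_top5))
```

Note that a threshold only suppresses low-confidence guesses. A confidently wrong label, like a driveway scored above 90% as a patio, would still get through.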
I wonder if someone could (maybe me ;)) build a kind of feature detector much like the ones used in CV, but generic instead of being trained for a specific task. This might enable a way to group common features across images and make creating a supervised dataset easier.
I'll post more interesting ones as I come across them. I think at this point most of the images being recognized are from nighttime; it will eventually hit the daytime ones, which I think will produce some even more humorous results.
For now, suffice it to say this detection is not doing well. I wanted to at least run this in the hope that it would find cars or other things for building up a supervised set, but I don't think that's going to happen.
Here are some plots of which labels were applied to the images (of course, none of these take confidence into account, but it's interesting).
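The counts behind plots like these could be tallied along the following lines. This is only a sketch: the function name label_counts and the predictions layout ({image_name: [(index, human_string, score), ...]}) are assumptions for illustration, and confidence is deliberately ignored, as in the plots.

```python
from collections import Counter


def label_counts(predictions):
    """Count how many images had each label in their top 5, ignoring scores."""
    counts = Counter()
    for top in predictions.values():
        # Use a set so a label is counted at most once per image.
        counts.update({human for _, human, _ in top})
    return counts
```

The result of counts.most_common() is a list of (label, count) pairs sorted by count, which can be handed to whatever plotting library is convenient.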