What I Wish Everyone Knew About Data Management in Deep Learning Computer Vision

January 30, 2019

In contrast with a traditional software product, it is “data” rather than “code” that is of primary importance in a deep learning system. A neural network for computer vision will only perform well in the field if it is thoroughly trained with a data set that effectively mimics the environment it will see in production. While this requires large amounts of data, it is not the sheer volume that determines success, but the degree to which the training images and videos reflect the specific reality your edge devices will be exposed to. This is a very hard problem to solve.

Data itself is often skewed or biased. Your data may be too clean if it was collected under better conditions than those your sensors will see in reality. Conversely, the data could be too dirty for precisely the opposite reason. It could also lack sufficient noise; for example, a data set used to train a neural network to detect cats might have too few images that have nothing to do with cats at all.

Computer vision data is opaque. Even gaining true visibility into the makeup of a data set is extremely difficult, and this is especially true in the computer vision domain. The sheer volume of images and videos makes it impossible to understand a data set’s distribution across various qualities without tailor-made tools.
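To make "distribution across various qualities" a little more concrete, here is a minimal sketch in Python. It assumes each image already has a small metadata record; the field names (label, lighting) and the sample records are purely hypothetical and not taken from any particular tool.

```python
from collections import Counter

# Hypothetical per-image metadata. In practice this would be loaded
# from your own annotation files rather than written by hand.
records = [
    {"file": "img_001.jpg", "label": "cat", "lighting": "day"},
    {"file": "img_002.jpg", "label": "cat", "lighting": "day"},
    {"file": "img_003.jpg", "label": "cat", "lighting": "night"},
    {"file": "img_004.jpg", "label": "none", "lighting": "day"},
]


def distribution(records, key):
    """Return the share of records for each value of one metadata field."""
    counts = Counter(record[key] for record in records)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


label_share = distribution(records, "label")
lighting_share = distribution(records, "lighting")
print(label_share)
print(lighting_share)
```

Even a simple summary like this can reveal that, say, night-time images or images without any cat are underrepresented. Doing the same across millions of frames and many attributes is where dedicated tooling becomes necessary.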

Working with your data presents further challenges. Training a neural network regularly requires slicing multiple data sets and recombining the pieces. This can get messy, leading to accidental deletions and other errors. You may also have multiple people working with the data, which creates challenges related to access management, security and privacy.
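One common way to reduce the risk of accidental deletions is to treat slices as lists of references to files rather than as copies of the files themselves. The sketch below illustrates that idea under stated assumptions; the helper names (slice_ids, combine) and the sample records are hypothetical, and this is not a description of how any specific product works.

```python
# Hypothetical metadata for two separately collected data sets.
street_set = [
    {"file": "street_001.jpg", "label": "cat", "lighting": "night"},
    {"file": "street_002.jpg", "label": "none", "lighting": "day"},
]
indoor_set = [
    {"file": "indoor_001.jpg", "label": "cat", "lighting": "day"},
    {"file": "indoor_002.jpg", "label": "none", "lighting": "night"},
]


def slice_ids(records, predicate):
    """Select file names matching a predicate; source records are untouched."""
    return frozenset(r["file"] for r in records if predicate(r))


def combine(*slices):
    """Merge slices into one sorted, duplicate-free list of file names."""
    return sorted(frozenset().union(*slices))


# Slice each source set, then recombine the slices into one training manifest.
night_street = slice_ids(street_set, lambda r: r["lighting"] == "night")
cats_indoor = slice_ids(indoor_set, lambda r: r["label"] == "cat")
training_manifest = combine(night_street, cats_indoor)
print(training_manifest)
```

Because only file names move around, the original data sets are never modified, and a manifest can be rebuilt or discarded at any time without touching the underlying images.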

Deep learning represents a very different AI paradigm, especially insofar as it relates to signal-based data rather than structured data. The world’s biggest technology companies have realized this and have responded by assembling large teams of experts and spending tens of millions of dollars and several years to develop their own platforms. But for the vast majority of companies, this is neither a core capability nor an effective use of resources. You need to maximize the talents of your data science and engineering resources by letting them build solutions, not platforms. The allegro data manager is designed to flexibly address this rich set of challenges. To learn about how it’s done – go here.