A detailed breakdown of the differences and similarities between DataStores and FeatureStores, with examples, attributes, uses, and users, asking whether the real difference comes down to how each is approached, and by whom.
DataStore vs FeatureStore
I think it’s safe to say that one of the worst things in Machine Learning is the terminology. The maths and statistics are definitely part of the learning curve, but more than that, it feels like you are learning a new language. In some ways, you are. DataStore and FeatureStore are two of the current buzzwords that people are trying to understand.
To be fair, DataStore and FeatureStore feel like family rather than strangers. Both of these stores share a desire to store data, to keep that data stable, and to have it versioned and “safe”. I think there are some differences of focus, though, that matter when selecting one. With that in mind, here is my understanding of them:
Datastore – Path A
Let us start with the simpler one for me: the DataStore. The word alone conjures up connections with databases of the past. I guess I am dating myself by saying database instead of “data warehouse”. C’est la vie. There are other similar words, such as “data lakes” and “data silos” (if you remember those). It’s not the purview of this blog to go into the minutiae of these.
The DataStore, then, you can think of as a repository service that allows you to store and manage files and objects (such as models). It is largely agnostic about what data goes into the DataStore, or indeed where the data is stored. DataStores allow you to choose your favourite filesystem, potentially even a distributed fs for recovery and stability. DataStores usually also decline to enforce input or output typing. The idea is that this should be done via Python/R/Julia code, using a system like Python’s Great Expectations, for example.
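To make that concrete, here is a minimal sketch of the kind of validation code the DataStore leaves to the consumer. The schema and field names are invented for illustration; a real project would reach for a library like Great Expectations rather than hand-rolling this.

```python
# Sketch: a DataStore hands back raw rows untyped; the consumer validates
# them in code. (Great Expectations does this far more thoroughly.)
def validate_rows(rows, schema):
    """Check each row's fields against expected types; return (index, field) misses."""
    bad = []
    for i, row in enumerate(rows):
        for field, expected_type in schema.items():
            if not isinstance(row.get(field), expected_type):
                bad.append((i, field))
    return bad

rows = [
    {"age": 34, "city": "Oslo"},
    {"age": "thirty", "city": "Bergen"},  # wrong type; the store did not care
]
schema = {"age": int, "city": str}
print(validate_rows(rows, schema))  # -> [(1, 'age')]
```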
Most DataStores also support the idea of versioning the data. Ideally, this means that at a later date, you can ask for the files/objects back from a certain point in your pipeline. This allows any third party to repeat your experiment with a level of certainty that they are, in fact, recreating your experiment. There would not be much value in a DataStore if data could change randomly. Of course, you can update and increment versions of the data. For an analogy, I like to say a DataStore is git for your data (but please don’t actually use git for your data).
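The “git for your data” analogy can be sketched in a few lines: a toy, content-addressed store (entirely hypothetical, not any vendor’s API) where asking for an old version returns exactly the bytes that were stored then.

```python
import hashlib

# Hypothetical toy versioned store, content-addressed like git:
# old versions stay retrievable and immutable.
class TinyVersionedStore:
    def __init__(self):
        self._objects = {}    # content hash -> bytes
        self._versions = {}   # name -> list of hashes, oldest first

    def put(self, name, data):
        digest = hashlib.sha256(data).hexdigest()
        self._objects[digest] = data
        self._versions.setdefault(name, []).append(digest)
        return digest

    def get(self, name, version=-1):
        """version=-1 is the latest; 0 is the first ever stored."""
        return self._objects[self._versions[name][version]]

store = TinyVersionedStore()
store.put("train.csv", b"a,b\n1,2\n")
store.put("train.csv", b"a,b\n1,2\n3,4\n")
print(store.get("train.csv", 0))  # -> b'a,b\n1,2\n' (the original, unchanged)
```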
Even among DataStores, there are different shades of grey. The primitive DataStore will store files and models in a zip-like structure. That’s the bare bones right there. To be fair, most DataStore providers don’t stop there, though. The more advanced DataStore can take certain inbound files and run operations on them. Did you want to upload a CSV? It can take that 2 GB file and store it as data points in a key-value style. Did you want to upload images? It can take them and let you provide annotations even after the fact. If I had to compare it to something, it’s almost like an OODBMS. There I go, dating myself again. Sorry.
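The CSV-to-data-points step can be illustrated in miniature. This is a sketch of the idea, not any particular product’s behaviour; the column names are made up.

```python
import csv
import io

# Sketch of what an "advanced" DataStore might do with an uploaded CSV:
# explode it into key-value data points instead of keeping one opaque blob.
def csv_to_datapoints(raw, key_field):
    reader = csv.DictReader(io.StringIO(raw))
    return {row[key_field]: row for row in reader}

raw = "id,species,petal_len\n1,setosa,1.4\n2,virginica,5.1\n"
points = csv_to_datapoints(raw, key_field="id")
print(points["2"]["species"])  # -> virginica
```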
Much like an OODBMS, advanced DataStores also provide you with a query language. Some of them go for an SQL variant, which makes sense. Large companies such as Azure or Google Cloud are keen to push DataStores, and having a language that is “common” makes the learning curve easier. There are others that use GraphQL, and even some that allow things like ancestor queries or kindless queries.
Given that the DataStore probably sounds very familiar to database users, and I assume the vast majority of readers, what is a FeatureStore?
FeatureStore – Path B
I think it’s fair to say that the FeatureStore is the ‘new kid on the block’ (I have to stop with these dated references). It initially grew out of the desire of ML practitioners to not be “munging data” all the time. Let’s be honest, messy data is really the number one timesink. Word vectorisation, standardising enumerated types, working out which ISO date format we got today. All these things take time to deal with and process.
All of this, in the FeatureStore world, is done ahead of time, usually via a DSL, a domain-specific language. This allows people to learn a relatively small amount of code, which they can then use to agree on the data types they are expecting as input. That date? Format it into YYYY-MM-DD, please. It also allows the export to be strongly typed.
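The date example can be shown in plain Python. This is a hand-rolled sketch of the normalisation a FeatureStore rule would hide from you; the list of accepted input formats is an assumption for illustration.

```python
from datetime import datetime

# Sketch: whatever date format arrives, the feature comes out
# strongly typed as ISO "YYYY-MM-DD". Accepted formats are assumed.
ACCEPTED = ("%Y-%m-%d", "%d/%m/%Y", "%b %d %Y")

def to_iso_date(value):
    for fmt in ACCEPTED:
        try:
            return datetime.strptime(value, fmt).strftime("%Y-%m-%d")
        except ValueError:
            pass
    raise ValueError(f"unrecognised date: {value!r}")

print(to_iso_date("25/12/2023"))   # -> 2023-12-25
print(to_iso_date("Dec 25 2023"))  # -> 2023-12-25
```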
Another potential win with the FeatureStore ideology is that, on loading, it can usually be told to track and compute feature statistics. This means calculations such as the median or quartiles can be updated automatically along with the data. This comes in handy when you are trying to keep track of things like data drift/skew or increases in loss over time.
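Here is a minimal sketch of that statistics tracking, using only the standard library. The drift rule (median moving by more than the baseline IQR) is a crude heuristic I have invented for illustration, not anything a real FeatureStore prescribes.

```python
import statistics

# Sketch: statistics a FeatureStore might recompute on every load,
# so that drift shows up as a shift in the tracked median/quartiles.
def feature_stats(values):
    q = statistics.quantiles(values, n=4)  # quartiles (exclusive method)
    return {"median": statistics.median(values), "q1": q[0], "q3": q[2]}

baseline = feature_stats([10, 12, 11, 13, 12, 11, 10, 12])
today = feature_stats([18, 20, 19, 21, 20, 19, 18, 20])

# Crude drift alarm: has the median moved by more than the baseline IQR?
iqr = baseline["q3"] - baseline["q1"]
drifted = abs(today["median"] - baseline["median"]) > iqr
print(drifted)  # -> True
```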
The export process really shines when there is more than one “downstream” consumer, because everyone knows what to expect. Of course, if the data you are receiving is not what you wanted, most FeatureStores allow you to change the columnar data you receive. One of the really nice things that is also possible is to have an offline and an online setup.
The offline setup is, to all intents and purposes, a hybrid of the above and a database. You can query, select, and have your data returned to you. When time is of the essence, or if you want to do, say, time windows or checkpointing, then you can use the FeatureStore online. Think of online as similar to Kafka or some other message-streaming system. This can be very handy in the case of NLP (large textual datasets in real time) or OpenCV (watermarks). It is a somewhat “niche” case but, if you need it, then you will know it.
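The offline/online split can be sketched as two access paths over the same feature history. Every name here is hypothetical; real systems back the two paths with entirely different storage engines, but the shape of the API is the point.

```python
from collections import defaultdict

# Toy sketch of the offline/online split (all names hypothetical):
# offline = batch query over history; online = freshest value, low latency.
class ToyFeatureStore:
    def __init__(self):
        self._history = defaultdict(list)  # entity_id -> [(ts, value), ...]

    def ingest(self, entity_id, ts, value):
        self._history[entity_id].append((ts, value))

    def offline_range(self, entity_id, start, end):
        """Offline path: everything in a time window, e.g. for training."""
        return [v for t, v in self._history[entity_id] if start <= t <= end]

    def online_latest(self, entity_id):
        """Online path: just the latest value, e.g. for serving."""
        return max(self._history[entity_id])[1]

fs = ToyFeatureStore()
for ts, v in [(1, 0.2), (2, 0.4), (3, 0.9)]:
    fs.ingest("user_42", ts, v)
print(fs.offline_range("user_42", 1, 2))  # -> [0.2, 0.4]
print(fs.online_latest("user_42"))        # -> 0.9
```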
Most FeatureStores pride themselves on giving back data in a format that requires no further massaging, though. Pandas DataFrames, NumPy arrays, or similar can be returned as desired. This is in stark contrast to the DataStore above, which leaves this mostly up to the consumer. Normally, this isn’t a problem. The code or process is not too difficult. However, it is an extra step that must be taken, and it may cost precious processing time.
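That “extra step” looks something like the following: raw records come back from a DataStore as a list of dicts, and you reshape them into columnar form yourself (with pandas you would just call `pd.DataFrame(records)`; this sketch avoids the dependency).

```python
# The reshaping a DataStore leaves to the consumer: rows of dicts
# pivoted into a columnar layout, ready for array-style work.
def to_columns(records):
    return {key: [r[key] for r in records] for key in records[0]}

records = [
    {"x": 1.0, "y": 2.0},
    {"x": 3.0, "y": 4.0},
]
print(to_columns(records))  # -> {'x': [1.0, 3.0], 'y': [2.0, 4.0]}
```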
To be fair, both FeatureStore and DataStore, to my mind, still need that often-maligned role in companies: the “data tsar”. That one person who either does the ETLs, or checks why a data file was not imported. That person who is responsible, at a business level, for saying “we only provide dates in ISO format”, even if we could use others.
So, What is the Conclusion? What is the Biggest Difference?
One of the most commonly misinterpreted lines in all of western culture comes from a poem by American writer Robert Frost.
“I took the one less traveled by,
And that has made all the difference.” – Robert Frost
The reason it is misinterpreted is that it is taken out of context. When read in the proper context, it shows that, really, there is no distinction between the paths at all. One was winding, another a little greener, but they still led to the same place.
Thus, when we ask what the big differentiator between the two storage solutions is, there is an apparent answer: the DataStore works at a file or columnar level, and the FeatureStore works at a feature, or individual numbers/words, level.
I think that this is repeating the same misinterpretation as with the Robert Frost poem. To me, the answer lies with the users themselves and their context. If they have found their way into ML/DL by way of computer science, then everything will look like it needs a few lines of code. Outbound data validation? A few lines of code. Data import? A few lines of code.
Conversely, if someone has found their way into the ML/DL world by way of statistics or maths, then I strongly doubt they will reach for a coding solution. Rather, I suspect, they would prefer to have all of this abstracted away from them.
One is not better than the other; it’s simply a difference of world views. A difference that has generated two different approaches to the same problem. Over time, I have very little doubt that both will grow to take features (no pun intended) from each other. I think we are better as a community for this, rather than poorer, and I for one celebrate that we have the choice.