This repository has been archived by the owner on Nov 17, 2022. It is now read-only.

Language Development for Leveling-up Decentralized Data Marketplaces

Decentralized data marketplaces like the one developed by the Ocean Protocol represent an emergent economic and technical field. Most importantly, though, they pose a significant communication challenge for users. The concept of data as a new asset class is so recent that the language used to describe it hasn't yet caught up with the field's technological advances.

It is, hence, our conviction that the language around data as a novel asset first has to be explicitly developed and discussed before it can find application in practice.

Here at rugpullindex.com, we want to help drive forward the concept of decentralized data marketplaces by creating and maintaining this document.

The criterion we use to evaluate this document's quality is that it allows a technically sophisticated reader to get a precise overview of the terms and language used in decentralized data marketplaces. Ideally, separate sections of this document are directly linkable on the web and the document itself is default-alive, meaning that anyone can edit and expand it.

Glossary

Data

There's no consensus on the meaning of the word "data". Most commonly, it is used as a noun. Beyond that, usage varies widely, and we want to enumerate some of it here:

"Data" can be seen as the plural form of "datum", where "datum" commonly represents a recorded value of some sort. In German, the word "Datum" literally means "date", e.g. "2021-11-18", so indeed multiple "datum" are "data", e.g. "2021-11-18,2021-11-19,...".

In the emerging tech world, the term "data" is often used when talking about the information stored about or extracted from individuals who use e.g. social media sites. The saying "If you're not paying for the product, you're the product" implies that free usage of e.g. facebook.com is subsidized by extracting users' "personal data" for commercial use.

The framing of the last paragraph already shows that the term "data" is used in particular ways: "extracting data", "personal data", "data is information", "data can be stored", "data is valuable?"

The Form of Data

In all of the above sections, it has not become apparent "what data looks like." Is it useful to represent data as a tangible structure and think of it as e.g. a material box? After all, a box full of e.g. patient folders at a doctor's office could be called "data".

Certainly, that framing is a valid one. But e.g. the number of daily or weekly COVID-19 infections could be declared "data" too. In that case, however, thinking of a tangible, material structure is far more difficult.

In newspaper articles, this type of data, also called the incidence, is often represented as a function plot where time is laid out uniformly on the x-axis and the number of cases is shown on the y-axis.

We call this form or type of data a "time series", as the data are produced in relation to time. Practically speaking, in the case of COVID-19 incidence data, the responsible data maintainer would count all cases of infection for one day before adding a new point to the chart that represents that day's cumulative new cases.
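The counting step described above can be sketched in a few lines of Python. This is a minimal illustration, not how any health authority actually processes its data: the `reported_cases` list is made up, and the helper name `daily_time_series` is our own.

```python
from collections import Counter
from datetime import date

# Hypothetical reported infections: one entry per case, dated by report day.
reported_cases = [
    date(2021, 11, 18),
    date(2021, 11, 18),
    date(2021, 11, 19),
    date(2021, 11, 18),
    date(2021, 11, 19),
    date(2021, 11, 20),
]


def daily_time_series(case_dates):
    """Count cases per day and return (day, count) points ordered by time."""
    counts = Counter(case_dates)
    return sorted(counts.items())


series = daily_time_series(reported_cases)
for day, count in series:
    # Each printed line is one point of the chart: x = day, y = cases.
    print(day.isoformat(), count)
```

Each `(day, count)` pair corresponds to one point on the plot, with the day on the x-axis and the count on the y-axis.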

Similarly, when "sampling" a thermometer to learn more about a room's temperature throughout a year, we'd define the sampling rate of our temperature signal as the number of measurements per unit of time. The unit hertz (Hz) is used to express this measurement frequency: 1 Hz is one measurement per second, meaning that we could sample the room's temperature at 1 Hz, once a second.
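To make the relation between sampling rate and number of measurements concrete, here is a small sketch. The function names `sample_count` and `sample_times` are hypothetical helpers introduced for illustration, and the code assumes measurements start at time zero and are evenly spaced.

```python
def sample_count(rate_hz, duration_seconds):
    """Number of measurements taken at rate_hz over duration_seconds."""
    return int(rate_hz * duration_seconds)


def sample_times(rate_hz, duration_seconds):
    """Timestamps (in seconds) at which the measurements are taken."""
    period = 1.0 / rate_hz  # time between two consecutive measurements
    return [i * period for i in range(sample_count(rate_hz, duration_seconds))]


SECONDS_PER_DAY = 24 * 60 * 60

# Sampling a thermometer at 1 Hz for one day.
print(sample_count(1.0, SECONDS_PER_DAY))  # 86400

# Sampling at 2 Hz for two seconds.
print(sample_times(2.0, 2.0))  # [0.0, 0.5, 1.0, 1.5]
```

At 1 Hz, a single day already produces 86,400 temperature readings, which hints at why the sampling rate is usually chosen to match how fast the measured quantity actually changes.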

But it's not necessary to record data strictly as a function of time.

During a parliamentary election in a democratic society, the people delegate their voice to representatives through voting. The subsequent count is usually represented as the proportion of votes each representative managed to get. If e.g. two candidates (A and B) are running, then a potential result is "A: 51% and B: 49%".

Here, neither datum "A: 51%" nor "B: 49%" is recorded as a function of time. However, the count's result can still be considered data.
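As a contrast to the time series case, the vote count can be sketched as a plain mapping from candidate to share, with no time dimension at all. The ballot numbers below are invented to reproduce the 51/49 example, and `vote_shares` is a hypothetical helper.

```python
def vote_shares(votes):
    """Convert absolute vote counts into percentage shares per candidate."""
    total = sum(votes.values())
    return {candidate: 100 * count / total for candidate, count in votes.items()}


# Hypothetical ballot counts for two candidates.
ballots = {"A": 5100, "B": 4900}

shares = vote_shares(ballots)
for candidate, share in shares.items():
    print(f"{candidate}: {share:.0f}%")
```

The keys of the resulting dictionary are candidates rather than points in time, yet each entry is still a datum in the sense used above.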

Furthermore, as the popular use case of self-driving cars shows, "data" may be represented not only as text but also as higher-level media. It's commonly portrayed that "to solve self-driving cars", lots of diverse video footage of all kinds of unique driving conditions has to be captured to make self-driving cars safer.

In that case, the recorded data may have many relevant dimensions of varying degree; elaborating on them would go beyond the scope of this document.

And going back to our original question about whether data should be represented as a tangible, material object - in the case of self-driving car data, is it helpful?

Relevancy of Data

Data, like other information, has a quality we herein call "relevancy." It defines the usefulness and applicability of the data within a temporal or situational context.

Data about US voters in the 2016 election has commonly been made out to be "highly relevant" in influencing the vote's outcome and potentially manipulating voters upfront. Known as the "Cambridge Analytica Data Set", it was described as predictive of the persuadability of certain voters in the United States.

It was later argued that the same data set was also used to a degree in the UK's Brexit referendum.

In both cases, we'd say that Cambridge Analytica's data set was highly relevant for persuading voters in the UK and US. This quality, we believe, is important to highlight as it conveys specificity.

Is Data Heterogeneous?

Looking at the data offerings of the Ocean Protocol Marketplace today, we can easily see their variability. Behind QUICRA-0 seems to be the promise of an ever-growing data set through unionizing data annotation workers, whereas LUMSTA-42 can potentially represent a clearly tangible .zip or .csv file containing a list of all products from amazon.com in 2018.

Other popular online marketplaces like Uber, AirBnB or Lyft don't share this type of offering. Their markets are neatly divided into suppliers and consumers, with - in the case of Uber and Lyft - all offerings being homogeneous, hence equal or fungible.

In theory, which one of the many Uber taxis you get should not make a difference in service quality. Or at least that's Uber Inc's stated goal.

AirBnB's marketplace is somewhat similar in that AirBnB Inc's goal is to guarantee a threshold of quality to any consumer. They make sure that "what you book is what you get."

Famously, the story goes that demand for AirBnB increased when Paul Graham suggested that AirBnB's founders improve the pictures the site shows to potential customers.

As the quality of an AirBnB can vary greatly given e.g. an offer's pricing, Paul Graham rightly suggested creating more transparency within the market to ease consumers' decision making and thereby improve the market's quality. The idea is that a consumer who ideally knows "everything" about an apartment, including its price, is more capable of making a decision than one who knows only the price and nothing else.

Similarly, we believe, a distinction has to be made about a data set's quality and the associated risk when buying it. The data on today's Ocean Marketplace is extremely heterogeneous: no two data sets are alike.

Compared to some of the quoted prices, the level of due diligence data consumers are capable of doing is low and hence the risk of "buying a cat in a bag" is high.

Forms of Manipulation

Mining

Extracting

Storing

Accessing

Writing

Reading

Deriving

Publishing