- you should be logged into the internal nexus repository
Since the registry uses https but without a valid certificate, you need to add the registry to the insecure registries. Then you can run the following command.
$ docker login https://10.98.74.120:5000
sbt clean compile
sbt docker:publish
Now we can deploy the application in Kubernetes, with the following steps:
- ssh in the edge1 or edge2 and check the folder
/glusterfs/volume1/storage-manager/conf
exists - place a file
production.conf
with all the production configuration overrides. cd kubernetes
- execute
$ deploy.sh
. it creates a file output.yml that replaces the variable STORAGE_MANAGER_VERSION with 1.0.0-SNAPSTHOT, callskubectl delete -f output.yml
andkubectl create -f output.yml
This component exposes APIs to query and download data from HDFS.
In particular it allows to:
- download a full dataset by a
logical_uri
- get the schema of a dataset by a
logical_uri
- get the results of a query operated on a dataset, referred to by its
logical_uri
The query API allows filtering of a dataset by application of a subset of functionality as derived from traditional SQL. The exposed functions come in the form of nested JSON objects, as explained below, with an obvious tradeoff between abstraction, verbosity and functionality.
Below you can find an example of body that can be used to make queries.
http://[host]:[port]/dataset-manager/v1/dataset/[logical-url]/search
where logical-url
would be a URL-encoded URI as defined by the catalog-manager APIs, e.g. daf%3A%2F%2Fdataset%2Ford%2Falessandro%2Fdefault_org%2FAGRI%2Forganizzazioni%2Fagency_infer_ale
A query is made up of clauses, of which the following are currently supported:
- select
- where
- groupBy
- having
- limit
- join
- union
In the example below, we are selecting a column named col1
and a column named col2
with an alias of alias_col2
.
{
"select": [
{ "name": "col1" },
{ "name": "col2", "alias": "alias_col2" }
]
}
This is equivalent to an SQL of
SELECT col1, col2 AS alias_col2
FROM table
Note that the select
clause is optional, where the wildcard *
is assumed in case it is omitted.
In the example below, we are applying a filter where col1
must be greater than col2
.
{
"where": {
"gt": { "left": "col1", "right": "col2" }
}
}
This is equivalent to an SQL of
SELECT *
FROM table
WHERE col1 > col2
More complex filters are also possible by nesting logical operators such as and
, or
and not
{
"where": {
"not": {
"and": [{
"or": [{
"gt": { "left": "col1", "right": "col2" }
}, {
"eq": { "left": "col3", "right": false }
}]
}, {
"neq": { "left": "col4", "right": "'string'" }
}]
}
}
}
This is equivalent to an SQL of
SELECT *
FROM table
WHERE NOT(
col1 > col2
OR col3 == false
AND col4 <> 'string'
)
Note how string
is enclosed in single quotes to represent the constant string
in the JSON example.
In this example, we are grouping by col1
and col2
, aggregating col3
using max
and counting all in-group occurrences, applying no filters.
{
"select": [
{ "name": "col1" },
{ "name": "col2" },
{
"max": { "name": "col3", "alias": "max_col3" }
}, {
"count": { "name": "*" }
}
],
"groupBy": [
{ "name": "col1" },
{ "name": "col2" }
]
}
This is equivalent to an SQL of
SELECT col1, col2, MAX(col3) AS max_col3, COUNT(*)
FROM table
GROUP BY col1, col2
Currently, the validation for groupBy
queries is pushed down to the query-engine, which validates the query before applying it to the dataset. This relieves he web-layer from complex validations, but in turn gives up control over the error messages that can be returned in case of invalid group-by syntax.
In this example, we extend the one we've seen above in 3., adding a having
clause, which is used to filter out group-by result by aggregations.
{
"select": [
{ "name": "col1" },
{ "name": "col2" },
{
"max": { "name": "col3" },
"alias": "max_col3"
}, {
"count": { "name": "*" }
}
],
"groupBy": [
{ "name": "col1" },
{ "name": "col2" }
],
"having": [
{
"gt": { "left": "max_col3", "right": 50 }
}
]
}
This is equivalent to an SQL of
SELECT col1, col2, MAX(col3) AS max_col3, COUNT(*)
FROM table
GROUP BY col1, col2
HAVING max_col3 > 50
Note that the application will not allow a query to have a having
clause without a groupBy
.
The limit
clause will simply return at most the requested number of rows in the result.
{
"limit": 5
}
This is equivalent to an SQL of
SELECT *
FROM table
LIMIT 5
The join
clause will accepts any number of join clauses whose type is one of left
, right
, inner
or outer
. All join
types, share the same representation, with the joined table being represented as a URI that would be used in other APIs to fetch catalog data. This URI is resolved by the engine, its permissions validated and aliases applied as shown below.
{
"select": [
{ "name": "T1.col1", "alias": "t1_col1" },
{ "name": "JT1.col2", "alias": "t2_col2" }
],
"join": [
{
"inner": {
"uri": "daf://dataset/test/otherTable",
"on": {
"eq": { "left": "JT1.col3", "right": "T1.col3" }
}
}
}
],
"where": {
"eq": { "left": "JT1.col4", "right": 100 }
}
}
This is equivalent to an SQL of
SELECT T1.col1 AS t1_col1, T2.col2 AS t2_col2
FROM table T1
JOIN otherTable JT1
ON JT1.col3 = T1.col3
WHERE JT1.col4 = 100
Note how the aliases are applied to the tables in the SQL statement; the engine will internally apply an alias of T1
to the main table in the relation, i.e. the table in the FROM
part of the query. The other tables in any of the subsequent join clauses will have aliases starting from JT1
going up. So, for instance, a table which is joined two ways would produce an alias T1
for the main table, and then another two JT1
and JT2
for the joined tables.
The union
clause accepts what is essentially a sub-query which supports only select
and where
clauses that are operated on a uri
. Note that both where
and select
are optional, where no select
clause produces an SQL equivalent of SELECT *
.
{
"union": [
{
"uri": "daf://dataset/test",
"where": {
"eq": { "left": "col2", "right": 150 }
}
}
],
"where": {
"eq": { "left": "col3", "right": false }
}
}
This is equivalent to an SQL of
SELECT *
FROM table
WHERE col3 = false
UNION ALL
SELECT *
FROM otherTable
WHERE col2 = 150
Note that for performance reasons, only UNION ALL
is supported. The hight cost of applying DISTINCT
over a large union might be prohibitive for most practical cases.
This section will outline the different types of representations exposed by the Query API, attempting to make parallels with SQL where possible for simplicity.
A Column can take one various shapes, where each is possible to use in different contexts.
Represents a column in a table, referred to by name
and possibily given an alias.
{ "name": "col1" }
{ "name": "col2", "alias": "col2_alias" }
If used in a select
, in SQL this would be equivalent to
SELECT col1, col2 AS col2_alias
Represents a column with a constant and possibly an alias.
{ "value": "string" }
{ "value": 1, "alias": "one_alias" }
If used in a select
, in SQL this would be equivalent to
SELECT 'string', 1 AS one_alias
Represents a column to which an aggregation function is applies, possibly given an alias
{
"max": { "name": "col1", "alias": "max_col1" }
}
In a select
in SQL, this would be equivalent to
SELECT MAX(col1) AS max_col1
The API currently supports only min
, max
, count
, sum
and avg
, which may be extended to support more aggregation in the near future.
Filter operators can be either comparison
operators, which compare columns to values or other column, or logical
, which act to affect comparison
operators.
An example of a comparison
operator would be gt
, which can be represented as
{
"gt": { "left": "col3", "right": 50 }
}
{
"eq": { "left": "col1", "right": "col2" }
}
Here we see first a gt
comparison between a column col3
and a constant value 50
, whereas in the second we see an eq
comparison between two columns col1
and col2
.
The API current supports only gt
, gte
, lt
, lte
, eq
and neq
, including support for in
and like
(with limitations) in the future.
logical
operators can be used to decorate or group a number of comparison
s or other filters. This means that logical
operators can be nested to create complex filters.
{
"not": {
"gt": { "left": "col3", "right": 50 }
}
}
{
"and": [{
"gt": { "left": "col3", "right": 50 }
}, {
"eq": { "left": "col1", "right": "col2" }
}]
}
{
"or": [{
"gt": { "left": "col3", "right": 50 }
}, {
"eq": { "left": "col1", "right": "col2" }
}]
}