Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add experimental dataTypeCase and functionCase options #673

Merged
merged 18 commits into from
Dec 7, 2023
Merged

Add experimental dataTypeCase and functionCase options #673

merged 18 commits into from
Dec 7, 2023

Conversation

karlhorky
Copy link
Contributor

@karlhorky karlhorky commented Nov 27, 2023

Closes #237
Closes #653


dataTypeCase (experimental)

Converts data types to upper- or lowercase.

Note: Casing of function names like VARCHAR(30) are not modified - instead rely on the functionCase option for this.

Options

  • "preserve" (default) preserves the original case.
  • "upper" converts to uppercase.
  • "lower" converts to lowercase.

preserve

CREATE TABLE
  users (
    id InTeGeR PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
    first_name VARCHAR(30) NOT NULL,
    bio teXT,
    is_email_verified BooL NOT NULL DEFAULT FALSE,
    created_timestamp timestamPtz NOT NULL DEFAULT NOW()
  )

upper

CREATE TABLE
  users (
    id INTEGER PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
    first_name VARCHAR(30) NOT NULL,
    bio TEXT,
    is_email_verified BOOL NOT NULL DEFAULT FALSE,
    created_timestamp TIMESTAMPTZ NOT NULL DEFAULT NOW()
  )

lower

CREATE TABLE
  users (
    id integer PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
    first_name VARCHAR(30) NOT NULL,
    bio text,
    is_email_verified bool NOT NULL DEFAULT FALSE,
    created_timestamp timestamptz NOT NULL DEFAULT NOW()
  )

functionCase (experimental)

Converts functions to upper- or lowercase.

Options

  • "preserve" (default) preserves the original case.
  • "upper" converts to uppercase.
  • "lower" converts to lowercase.

preserve

CREATE tabLE
  users (
    id iNtegeR PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
    first_name VarChaR(30) NOT NULL,
    bio TEXT,
    is_email_verified BOOL NOT NULL DEFAULT FALSE,
    created_timestamp TIMESTAMPTZ NOT NULL DEFAULT NoW()
  )

upper

CREATE tabLE
  users (
    id iNtegeR PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
    first_name VARCHAR(30) NOT NULL,
    bio TEXT,
    is_email_verified BOOL NOT NULL DEFAULT FALSE,
    created_timestamp TIMESTAMPTZ NOT NULL DEFAULT NOW()
  )

lower

CREATE tabLE
  users (
    id iNtegeR PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
    first_name varchar(30) NOT NULL,
    bio TEXT,
    is_email_verified BOOL NOT NULL DEFAULT FALSE,
    created_timestamp TIMESTAMPTZ NOT NULL DEFAULT now()
  )

@karlhorky karlhorky changed the title Add experimental datatypeFormat option for changing data type case Add experimental datatypeCase option for changing data type case Nov 27, 2023
@nene
Copy link
Collaborator

nene commented Nov 27, 2023

Thanks for starting work on this.

FYI, I will be away from computer for about a week. So likely won't be reviewing this during that time.

@karlhorky karlhorky changed the title Add experimental datatypeCase option for changing data type case Add experimental dataTypeCase option for changing data type case Nov 28, 2023
@karlhorky karlhorky changed the title Add experimental dataTypeCase option for changing data type case Add experimental dataTypeCase and functionCase options Nov 28, 2023
@karlhorky
Copy link
Contributor Author

karlhorky commented Nov 28, 2023

@nene just so you know, digging into the problem a bit, some notes and some ideas for approaches for this PR:

  1. I see there is no distinction in the keyword tokens between all and dataType / datatype / dataTypes:

    // https://www.postgresql.org/docs/14/sql-keywords-appendix.html
    all: [
    'ABORT',
    'ABSOLUTE',
    'ACCESS',

    dataType: [
    'DATALENGTH',
    'IDENT_CURRENT',
    'IDENT_INCR',
    'IDENT_SEED',
    'IDENTITY',

    datatypes: [
    'NUMBER',
    'DECIMAL',
    'NUMERIC',
    'INT',

    // Parameterized types
    // https://docs.microsoft.com/en-us/sql/t-sql/data-types/data-types-transact-sql?view=sql-server-ver15
    dataTypes: [
    'DECIMAL',
    'NUMERIC',
    'FLOAT',
    'REAL',

    This is because of the flatKeywordList() function, which effectively discards the key information:

    // Used for flattening keyword lists
    export const flatKeywordList = (obj: Record<string, string[]>): string[] =>
    dedupe(Object.values(obj).flat());

  2. I found out that there are also function data types (eg. PostgreSQL parameterized data types)

  3. Because of this, I would like to propose multiple potentially controversial changes (I would like to do 2 separate PRs for 3i and 3ii below before I start on #673, to keep the diffs small and PR reviews simple):

    1. Simplify the src/languages/*/*.keywords.ts files to export 1 flat array (with empty lines + comments as separators instead of object keys) and delete the flatKeywordList() function - this will prevent unusual, inconsistent naming conventions and also clarify that the object keys do nothing
    2. I would want to then do a separate PR to separate out all of the data type information in src/languages/*/*.keywords.ts in these flat arrays in into a separate (optional) export in each file called export const dataTypes, which would temporarily merged together with keywords to keep compatibility before reservedDataTypes in the next step is implemented
    3. I would like to introduce a new token type called reservedDataTypes to TokenizerOptions also in #673, and pass in the dataTypes export from each file to this - I would do this in a way similarly to how you implemented multi-word keywords in #363
    4. I would like to introduce a second option at the same time in #673 called functionCase, which will have the same options as dataTypeCase and keywordCase, but will case functions such as varchar() and decimal() (and also non-data type functions such as now())

Thoughts on these implementation approaches?

@nene
Copy link
Collaborator

nene commented Nov 28, 2023

Seems like a good approach.

I agree that the current structured objects used for keywords can cause unnecessary confusion. Probably best to da away with these. 👍

@karlhorky
Copy link
Contributor Author

karlhorky commented Nov 29, 2023

PRs open here:

@nene Since this may need to wait for your return, I will continue on with #673 by rebasing on #677's branch

@karlhorky
Copy link
Contributor Author

Further progress:

  • 3iii (TokenType.RESERVED_DATA_TYPE) and dataTypeCase option has been implemented here: a555575

This is my first time working on code for a parser, so it's very possible that I got something wrong. I hope it's not too far off!

@karlhorky
Copy link
Contributor Author

karlhorky commented Nov 30, 2023

Finally, last changes:

  • 3iv (functionCase option) has been implemented here 4ae6c48
  • to avoid this being a breaking change, I also changed the default of dataTypeCase and functionCase to the options.keywordCase value passed in by the user
    • if it's better to avoid this connection and make it a breaking change (default 'preserve') I can revert to this

So I think after merging #677 and #676, this PR ready for a first round of feedback @nene. Hopefully my parser / tokenizer code is not too crazy.

@karlhorky karlhorky marked this pull request as ready for review November 30, 2023 17:22
nene added a commit that referenced this pull request Dec 1, 2023
Part of preparations for PR #673

Copied from [my original
comment](#673 (comment)):

I see there is no distinction in the keyword tokens between `all` and
`dataType` / `datatype` / `dataTypes`:


https://github.com/sql-formatter-org/sql-formatter/blob/290f3b767492d5e3204866447f9f4e4c098f825f/src/languages/postgresql/postgresql.keywords.ts#L4-L8


https://github.com/sql-formatter-org/sql-formatter/blob/290f3b767492d5e3204866447f9f4e4c098f825f/src/languages/transactsql/transactsql.functions.ts#L78-L83


https://github.com/sql-formatter-org/sql-formatter/blob/290f3b767492d5e3204866447f9f4e4c098f825f/src/languages/snowflake/snowflake.keywords.ts#L102-L106


https://github.com/sql-formatter-org/sql-formatter/blob/290f3b767492d5e3204866447f9f4e4c098f825f/src/languages/transactsql/transactsql.functions.ts#L303-L309

This is because of the `flatKeywordList()` function, which effectively
discards the key information:


https://github.com/sql-formatter-org/sql-formatter/blob/290f3b767492d5e3204866447f9f4e4c098f825f/src/utils.ts#L18-L20

Because of this, I would like to simplify the
`src/languages/*/*.keywords.ts` files to export 1 flat array (with empty
lines + comments as separators instead of object keys) and delete the
`flatKeywordList()` function - this will prevent unusual, inconsistent
naming conventions and also clarify that the object keys do nothing.

## Reviewer notes

1. Best to review this with whitespace changes turned off:
https://github.com/sql-formatter-org/sql-formatter/pull/676/files?w=1
2. This PR also enables GitHub Actions workflows CI for pull requests on
branches other than `master` and `develop`, so that PRs from community
members can be tested on GitHub Actions
3. while removing the object keys, I added comments as such:
a. if there was a commented URL above the key, I removed the key and
left the URL as the sole comment (eg. in
`src/languages/db2i/db2i.functions.ts`)
b. if there was no other comment (or if the commented URL was generic,
for multiple sections), I left the key name as the comment (eg. in
`src/languages/hive/hive.functions.ts`)
@nene
Copy link
Collaborator

nene commented Dec 1, 2023

In general this looks fine.

to avoid this being a breaking change, I also changed the default of dataTypeCase and functionCase to the options.keywordCase value passed in by the user

if it's better to avoid this connection and make it a breaking change (default 'preserve') I can revert to this

Yes. I think it would keep the API simpler if there aren't any dependencies between these options. This whole thing will definitely be a major version bump, so we can break backwards compatibility a bit. For anybody upgrading, the change is pretty trivial to do, plus they can make a meaningful choice of whether they actually want function names and data types uppercased. I do very much appreciate your concern for backwards compatibility though.

I would also like to see some tests for these new options. Like something similar that exists in test/options/keywordCase.ts. Also some tests to cover how the keywordCase and dataTypeCase effect the ARRAY keyword/data-type. Probably in test/features/arrayLiterals.ts.

I'd like to explore this implementation a bit more when I get back to the computer. But for now at least nothing obviously wrong jumps out to me.

nene added a commit that referenced this pull request Dec 1, 2023
Part of preparations for PR #673 

---

Copied from [my
comment](#673 (comment)):

I would want to then do a separate PR to separate out all of the data
type information in `src/languages/*/*.keywords.ts` in these flat arrays
in into a separate (optional) export in each file called `export const
dataTypes`, which would temporarily merged together with `keywords` to
keep compatibility before `reservedDataTypes` in the next step is
implemented
Copy link

codesandbox-ci bot commented Dec 1, 2023

This pull request is automatically built and testable in CodeSandbox.

To see build info of the built libraries, click here or the icon next to each commit SHA.

@karlhorky
Copy link
Contributor Author

karlhorky commented Dec 2, 2023

Yes. I think it would keep the API simpler if there aren't any dependencies between these options.

Ok, I've reverted that change in 4f46845

I would also like to see some tests for these new options.

Like something similar that exists in test/options/keywordCase.ts.

Done in 8d81244

One note about this approach below

Also some tests to cover how the keywordCase and dataTypeCase effect the ARRAY keyword/data-type. Probably in test/features/arrayLiterals.ts.

Done in 66d596e

@karlhorky
Copy link
Contributor Author

karlhorky commented Dec 2, 2023

I tested the version published by CodeSandbox CI in one of our projects as noted below, and both options dataTypeCase and functionCase seem to be working 🙌

package.json

{
  "dependencies": {
    "sql-formatter": "https://pkg.csb.dev/sql-formatter-org/sql-formatter/commit/a4e40704/sql-formatter"
  }
}

Comment on lines 37 to +71
if (!nextToken || !isOpenParen(nextToken)) {
return { ...token, type: TokenType.RESERVED_KEYWORD };
return {
...token,
type:
// Function names which are also data types
[
'BIGINT',
'BINARY',
'BIT',
'BLOB',
'BOOLEAN',
'CHAR',
'DATE',
'DECIMAL',
'DOUBLE',
'FLOAT',
'INT',
'INTEGER',
'JSON',
'NCHAR',
'NUMBER',
'NUMERIC',
'NVARCHAR',
'REAL',
'SMALLINT',
'TEXT',
'TIME',
'TIMESTAMP',
'TINYINT',
'VARCHAR',
'XML',
].includes(token.text)
? TokenType.RESERVED_DATA_TYPE
: TokenType.RESERVED_KEYWORD,
};
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nene I'm not sure about this approach - I would rather get the list of data types from the dialect.

But since disambiguateTokens() is used in the parser, it seems like passing in dialect information would not be desirable.

So instead I went for a list of commonly-overlapping function + data types - these will only be classified as TokenType.RESERVED_DATA_TYPE if they are *also* a function in the dialect.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this is definitely problematic.

The root problem seems to be that these data types are also defined as functions, which previously was only done because we formatted those data-types like functions e.g. DECIMAL(5, 2). Now that we have data types defined separately, there's really no reason to include them among the functions any more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

another alternative approach that I thought of would be:

  1. map each dialect to an array
    1. import + deduplicate all function names
    2. import + deduplicate all data type names
    3. find the overlapping entries
  2. merge all of the arrays and deduplicate

this would make the items more accurate

but I think you're suggesting to go in a different direction with the grammar... I don't think I fully understand what you mean yet

Copy link
Contributor Author

@karlhorky karlhorky Dec 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

trying to understand what you mean here:

The root problem seems to be that these data types are also defined as functions, which previously was only done because we formatted those data-types like functions e.g. DECIMAL(5, 2). Now that we have data types defined separately, there's really no reason to include them among the functions any more.

my assumptions are:

A) sql-formatter would like to keep DECIMAL(5, 2) recognized as a function, to allow for formatting of the function arguments and parentheses
B) the conversion of the DECIMAL part of this to a data type is still necessary

so I'm unsure of what you mean to suggest here

2 hypotheses of what you mean:

X) it sounds like maybe you're saying that we could remove the funcNameToKeyword function altogether from disambiguateTokens. just as a note, if this is the case, I did try going down this path, but it caused problems (although I'm probably not well-versed enough with the code in order to know if these problems are easy to resolve)

Y) another way of interpreting your comment leads me to believe you're suggesting removing DECIMAL (and anything else that is a data type) from *.functions.ts - but I think that also would be problematic...?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I would indeed suggest this last option:

Y) another way of interpreting your comment leads me to believe you're suggesting removing DECIMAL (and anything else that is a data type) from *.functions.ts - but I think that also would be problematic...?

Simply removing them will of course result in DECIMAL(1, 2) being formatted as DECIMAL (1, 2) as the formatter no more recognizes as formats it as a function. What should be done, is also changing the parser so it recognizes both <func-name> <parenthesis> and <data-type> <parenthesis> as function calls. Or alternatively we could also introduce a separate node like ParameterizedDataType for the latter.

I'm happy to look into this by myself as I'm now finally back at my computer, and I have a hunch that there are some quirks of the parser that will also be getting in the way of this.

Copy link
Contributor Author

@karlhorky karlhorky Dec 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok thanks for picking this one up!

One thing to keep in mind and decide on is what the behavior of conflicting options would be in that DECIMAL(1, 2) case, eg:

{
  functionCase: 'lower',
  dataTypeCase: 'upper',
}

Example query

SELECT DeCimal(1, 2) AS dec, noW() AS n

I'm guessing it would be this:

SELECT DECIMAL(1, 2) AS dec, now() AS n

If this is true, then I think the docs for dataTypeCase (and maybe also functionCase) need to be updated too

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep. Definitely like so. Treating DECIMAL() as a function is an implementation detail of the formatter. It would very much confuse users when dataTypeCase doesn't affect parameterized data types.

Comment on lines +21 to +32
if (!cfg.dialectDoesntHaveVarchar) {
it('formats short CREATE TABLE with lowercase data types', () => {
expect(
format('CREATE TABLE tbl (a INT PRIMARY KEY, b VARCHAR);', {
dataTypeCase: 'lower',
})
).toBe(dedent`
CREATE TABLE
tbl (a int PRIMARY KEY, b varchar);
`);
});
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test looks completely unnecessary. The lowercasing of data types is already tested by the supportsDataTypeCase() test.

It's also problematic to execute this test only for dialects that support VARCHAR - it's not like lowercasing of data types shouldn't be supported when dialect has no VARCHAR support.

@nene
Copy link
Collaborator

nene commented Dec 6, 2023

FYI @karlhorky, created a PR with some initial progress: #689

@nene nene merged commit 4ee4965 into sql-formatter-org:master Dec 7, 2023
2 checks passed
@karlhorky
Copy link
Contributor Author

@nene Thanks for reviewing, taking over in #689 and merging! 🙌

@karlhorky karlhorky deleted the add-datatypeformat-option branch December 7, 2023 12:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants