-
Notifications
You must be signed in to change notification settings - Fork 411
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add experimental dataTypeCase
and functionCase
options
#673
Add experimental dataTypeCase
and functionCase
options
#673
Conversation
datatypeFormat
option for changing data type casedatatypeCase
option for changing data type case
Thanks for starting work on this. FYI, I will be away from computer for about a week. So likely won't be reviewing this during that time. |
datatypeCase
option for changing data type casedataTypeCase
option for changing data type case
dataTypeCase
option for changing data type casedataTypeCase
and functionCase
options
@nene just so you know, digging into the problem a bit, some notes and some ideas for approaches for this PR:
Thoughts on these implementation approaches? |
Seems like a good approach. I agree that the current structured objects used for keywords can cause unnecessary confusion. Probably best to da away with these. 👍 |
PRs open here:
@nene Since this may need to wait for your return, I will continue on with #673 by rebasing on #677's branch |
…ter into add-datatypeformat-option
Further progress:
This is my first time working on code for a parser, so it's very possible that I got something wrong. I hope it's not too far off! |
Finally, last changes:
So I think after merging #677 and #676, this PR ready for a first round of feedback @nene. Hopefully my parser / tokenizer code is not too crazy. |
Part of preparations for PR #673 Copied from [my original comment](#673 (comment)): I see there is no distinction in the keyword tokens between `all` and `dataType` / `datatype` / `dataTypes`: https://github.com/sql-formatter-org/sql-formatter/blob/290f3b767492d5e3204866447f9f4e4c098f825f/src/languages/postgresql/postgresql.keywords.ts#L4-L8 https://github.com/sql-formatter-org/sql-formatter/blob/290f3b767492d5e3204866447f9f4e4c098f825f/src/languages/transactsql/transactsql.functions.ts#L78-L83 https://github.com/sql-formatter-org/sql-formatter/blob/290f3b767492d5e3204866447f9f4e4c098f825f/src/languages/snowflake/snowflake.keywords.ts#L102-L106 https://github.com/sql-formatter-org/sql-formatter/blob/290f3b767492d5e3204866447f9f4e4c098f825f/src/languages/transactsql/transactsql.functions.ts#L303-L309 This is because of the `flatKeywordList()` function, which effectively discards the key information: https://github.com/sql-formatter-org/sql-formatter/blob/290f3b767492d5e3204866447f9f4e4c098f825f/src/utils.ts#L18-L20 Because of this, I would like to simplify the `src/languages/*/*.keywords.ts` files to export 1 flat array (with empty lines + comments as separators instead of object keys) and delete the `flatKeywordList()` function - this will prevent unusual, inconsistent naming conventions and also clarify that the object keys do nothing. ## Reviewer notes 1. Best to review this with whitespace changes turned off: https://github.com/sql-formatter-org/sql-formatter/pull/676/files?w=1 2. This PR also enables GitHub Actions workflows CI for pull requests on branches other than `master` and `develop`, so that PRs from community members can be tested on GitHub Actions 3. while removing the object keys, I added comments as such: a. if there was a commented URL above the key, I removed the key and left the URL as the sole comment (eg. in `src/languages/db2i/db2i.functions.ts`) b. if there was no other comment (or if the commented URL was generic, for multiple sections), I left the key name as the comment (eg. in `src/languages/hive/hive.functions.ts`)
In general this looks fine.
Yes. I think it would keep the API simpler if there aren't any dependencies between these options. This whole thing will definitely be a major version bump, so we can break backwards compatibility a bit. For anybody upgrading, the change is pretty trivial to do, plus they can make a meaningful choice of whether they actually want function names and data types uppercased. I do very much appreciate your concern for backwards compatibility though. I would also like to see some tests for these new options. Like something similar that exists in I'd like to explore this implementation a bit more when I get back to the computer. But for now at least nothing obviously wrong jumps out to me. |
…ter into add-datatypeformat-option
Part of preparations for PR #673 --- Copied from [my comment](#673 (comment)): I would want to then do a separate PR to separate out all of the data type information in `src/languages/*/*.keywords.ts` in these flat arrays in into a separate (optional) export in each file called `export const dataTypes`, which would temporarily merged together with `keywords` to keep compatibility before `reservedDataTypes` in the next step is implemented
This pull request is automatically built and testable in CodeSandbox. To see build info of the built libraries, click here or the icon next to each commit SHA. |
Ok, I've reverted that change in 4f46845
Done in 8d81244 One note about this approach below
Done in 66d596e |
I tested the version published by CodeSandbox CI in one of our projects as noted below, and both options
{
"dependencies": {
"sql-formatter": "https://pkg.csb.dev/sql-formatter-org/sql-formatter/commit/a4e40704/sql-formatter"
}
} |
if (!nextToken || !isOpenParen(nextToken)) { | ||
return { ...token, type: TokenType.RESERVED_KEYWORD }; | ||
return { | ||
...token, | ||
type: | ||
// Function names which are also data types | ||
[ | ||
'BIGINT', | ||
'BINARY', | ||
'BIT', | ||
'BLOB', | ||
'BOOLEAN', | ||
'CHAR', | ||
'DATE', | ||
'DECIMAL', | ||
'DOUBLE', | ||
'FLOAT', | ||
'INT', | ||
'INTEGER', | ||
'JSON', | ||
'NCHAR', | ||
'NUMBER', | ||
'NUMERIC', | ||
'NVARCHAR', | ||
'REAL', | ||
'SMALLINT', | ||
'TEXT', | ||
'TIME', | ||
'TIMESTAMP', | ||
'TINYINT', | ||
'VARCHAR', | ||
'XML', | ||
].includes(token.text) | ||
? TokenType.RESERVED_DATA_TYPE | ||
: TokenType.RESERVED_KEYWORD, | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nene I'm not sure about this approach - I would rather get the list of data types from the dialect.
But since disambiguateTokens()
is used in the parser, it seems like passing in dialect information would not be desirable.
So instead I went for a list of commonly-overlapping function + data types - these will only be classified as TokenType.RESERVED_DATA_TYPE
if they are *also* a function in the dialect.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this is definitely problematic.
The root problem seems to be that these data types are also defined as functions, which previously was only done because we formatted those data-types like functions e.g. DECIMAL(5, 2)
. Now that we have data types defined separately, there's really no reason to include them among the functions any more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
another alternative approach that I thought of would be:
- map each dialect to an array
- import + deduplicate all function names
- import + deduplicate all data type names
- find the overlapping entries
- merge all of the arrays and deduplicate
this would make the items more accurate
but I think you're suggesting to go in a different direction with the grammar... I don't think I fully understand what you mean yet
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
trying to understand what you mean here:
The root problem seems to be that these data types are also defined as functions, which previously was only done because we formatted those data-types like functions e.g.
DECIMAL(5, 2)
. Now that we have data types defined separately, there's really no reason to include them among the functions any more.
my assumptions are:
A) sql-formatter
would like to keep DECIMAL(5, 2)
recognized as a function, to allow for formatting of the function arguments and parentheses
B) the conversion of the DECIMAL
part of this to a data type is still necessary
so I'm unsure of what you mean to suggest here
2 hypotheses of what you mean:
X) it sounds like maybe you're saying that we could remove the funcNameToKeyword
function altogether from disambiguateTokens
. just as a note, if this is the case, I did try going down this path, but it caused problems (although I'm probably not well-versed enough with the code in order to know if these problems are easy to resolve)
Y) another way of interpreting your comment leads me to believe you're suggesting removing DECIMAL
(and anything else that is a data type) from *.functions.ts
- but I think that also would be problematic...?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I would indeed suggest this last option:
Y) another way of interpreting your comment leads me to believe you're suggesting removing
DECIMAL
(and anything else that is a data type) from*.functions.ts
- but I think that also would be problematic...?
Simply removing them will of course result in DECIMAL(1, 2)
being formatted as DECIMAL (1, 2)
as the formatter no more recognizes as formats it as a function. What should be done, is also changing the parser so it recognizes both <func-name> <parenthesis>
and <data-type> <parenthesis>
as function calls. Or alternatively we could also introduce a separate node like ParameterizedDataType
for the latter.
I'm happy to look into this by myself as I'm now finally back at my computer, and I have a hunch that there are some quirks of the parser that will also be getting in the way of this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok thanks for picking this one up!
One thing to keep in mind and decide on is what the behavior of conflicting options would be in that DECIMAL(1, 2)
case, eg:
{
functionCase: 'lower',
dataTypeCase: 'upper',
}
Example query
SELECT DeCimal(1, 2) AS dec, noW() AS n
I'm guessing it would be this:
SELECT DECIMAL(1, 2) AS dec, now() AS n
If this is true, then I think the docs for dataTypeCase
(and maybe also functionCase
) need to be updated too
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep. Definitely like so. Treating DECIMAL()
as a function is an implementation detail of the formatter. It would very much confuse users when dataTypeCase
doesn't affect parameterized data types.
if (!cfg.dialectDoesntHaveVarchar) { | ||
it('formats short CREATE TABLE with lowercase data types', () => { | ||
expect( | ||
format('CREATE TABLE tbl (a INT PRIMARY KEY, b VARCHAR);', { | ||
dataTypeCase: 'lower', | ||
}) | ||
).toBe(dedent` | ||
CREATE TABLE | ||
tbl (a int PRIMARY KEY, b varchar); | ||
`); | ||
}); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test looks completely unnecessary. The lowercasing of data types is already tested by the supportsDataTypeCase()
test.
It's also problematic to execute this test only for dialects that support VARCHAR
- it's not like lowercasing of data types shouldn't be supported when dialect has no VARCHAR
support.
FYI @karlhorky, created a PR with some initial progress: #689 |
Closes #237
Closes #653
dataTypeCase (experimental)
Converts data types to upper- or lowercase.
Note: Casing of function names like
VARCHAR(30)
are not modified - instead rely on thefunctionCase
option for this.Options
"preserve"
(default) preserves the original case."upper"
converts to uppercase."lower"
converts to lowercase.preserve
upper
lower
functionCase (experimental)
Converts functions to upper- or lowercase.
Options
"preserve"
(default) preserves the original case."upper"
converts to uppercase."lower"
converts to lowercase.preserve
upper
lower