Name mangling #189

Open
edsko opened this issue Sep 21, 2024 · 14 comments
@edsko
Collaborator

edsko commented Sep 21, 2024

We need to be careful translating C names to Haskell names.

@edsko edsko added this to the 1: `Storable` instances milestone Sep 21, 2024
@edsko edsko mentioned this issue Sep 21, 2024
@edsko
Collaborator Author

edsko commented Sep 21, 2024

As of #190, the basic infrastructure for this is in place (see HsBindgen.Hs.AST.Name), but the actual mangling still needs to be implemented.

@TravisCardwell
Collaborator

I thought that C identifiers were quite restricted, but that knowledge is outdated: C now supports Unicode in identifiers. From the C23 working draft (N3096):

  • identifier:
    • identifier-start
    • identifier identifier-continue
  • identifier-start:
    • nondigit
    • XID_Start character
    • universal-character-name of class XID_Start
  • identifier-continue:
    • digit
    • nondigit
    • XID_Continue character
    • universal-character-name of class XID_Continue
  • nondigit: one of /_a-zA-Z/
  • digit: one of /0-9/

The text-icu package could be used to check these Unicode character properties correctly. (One would call property with the XidStart and XidContinue constructors of type Bool_.) Since we are creating Haskell identifiers, however, that may not be necessary.

A Haskell identifier "consists of a letter followed by zero or more letters, digits, underscores, and single quotes" (Haskell 2010 Language Report). Some details:

  • "Small" letters include the underscore and Unicode lowercase letters.
  • "Large" letters include Unicode uppercase and titlecase letters.
  • Digits include Unicode digits.

Perhaps we can remove any characters that are not valid in Haskell names, as well as change the case of the first character according to the type of name as appropriate. There are some edge cases:

  • The name may start with an underscore, which is only valid for variable-like names.
    • For other names, perhaps we can drop leading underscores.
  • The filtered C name may be empty.
  • Multiple C names could result in the same Haskell name.
  • A C name could conflict with a Haskell keyword.

Name conversion is pure. Perhaps it is acceptable to let edge case compiler errors happen, notifying users that they need to work around such name issues? That might be preferred over workarounds such as encoding invalid characters, etc.

It is possible to convert to Haskell-style camelCase and PascalCase names, dropping underscores and adjusting the following letters as appropriate. Is this desired, however, or might it be more convenient to have the Haskell names match the C names as much as possible?
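A minimal sketch of the "filter, then fix the first character" idea, for illustration only. The names Target, isHsIdentChar, and mangle are hypothetical and not part of hs-bindgen, and the character test from Data.Char only approximates what GHC accepts. Keywords and collisions between results are deliberately not handled.

```haskell
import Data.Char (isAlphaNum, isLower, isUpper, toLower, toUpper)

-- | Which kind of Haskell name is being produced (hypothetical type).
data Target = VarLike | ConLike
  deriving (Eq, Show)

-- | Characters kept by this sketch: Unicode letters and numbers,
-- underscore, and single quote.  This approximates the Haskell rules.
isHsIdentChar :: Char -> Bool
isHsIdentChar c = isAlphaNum c || c == '_' || c == '\''

-- | Drop characters that are not valid in Haskell names, then adjust the
-- case of the first character.  Returns 'Nothing' when no usable name
-- remains, which covers the "filtered C name may be empty" edge case.
mangle :: Target -> String -> Maybe String
mangle target cName =
  case target of
    VarLike -> case kept of
      "_" -> Nothing                         -- a lone underscore is a wildcard
      c : cs
        | c == '_'            -> Just kept   -- leading underscore is allowed
        | isLower (toLower c) -> Just (toLower c : cs)
      _ -> Nothing
    ConLike -> case dropWhile (== '_') kept of  -- drop leading underscores
      c : cs | isUpper (toUpper c) -> Just (toUpper c : cs)
      _ -> Nothing
  where
    kept = filter isHsIdentChar cName
```

For example, mangle ConLike "_foo" gives Just "Foo", while mangle ConLike "__" gives Nothing.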

@phadej
Collaborator

phadej commented Oct 4, 2024

> Perhaps we can remove any characters that are not valid in Haskell names, as well as change the case of the first character according to the type of name as appropriate.

To avoid clashes, the mangling should be injective.

struct foo { ... };
typedef struct foo Foo;

We cannot mangle both foo and Foo to Foo.
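The injectivity concern can be checked mechanically on a given set of names. The following is a sketch with hypothetical helper names (capitalise, collisions); it reports every group of C names that a candidate mangling function maps to the same Haskell name.

```haskell
import Data.Char (toUpper)
import Data.Function (on)
import Data.List (groupBy, sortOn)

-- | Naive mangling for type names: capitalise the first character.
capitalise :: String -> String
capitalise []       = []
capitalise (c : cs) = toUpper c : cs

-- | Group the inputs by their mangled form and keep only groups with at
-- least two members.  Each result witnesses that @f@ is not injective on
-- this particular set of names.
collisions :: (String -> String) -> [String] -> [(String, [String])]
collisions f names =
  [ (mangled, map snd grp)
  | grp@((mangled, _) : _ : _) <-
      groupBy ((==) `on` fst) (sortOn fst [ (f n, n) | n <- names ])
  ]
```

With the two names from the example, collisions capitalise ["foo", "Foo"] evaluates to [("Foo", ["foo", "Foo"])].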

@TravisCardwell
Collaborator

Thank you very much for the example!

C identifiers that differ only in the case of the first letter do indeed pose a challenge. It is also common for a struct's typedef name to be identical to its tag.

typedef struct bar {
    char c;
    struct bar *next;
} bar;

Currently, conversion is done via a ToHsName class, such that conversion is a pure function with the C name and the target Haskell namespace as input.

class ToHsName (ns :: Namespace) where
  toHsName :: CName -> HsName ns

Perhaps we need more context. I am trying to get a feeling for what the desired result might be. For a typedef struct with an identical identifier or an identifier that only differs in the case of the first letter, perhaps we do not need separate definitions at all. In the example above, Haskell type Bar could correspond to both C bar and C struct bar. We could generate a type alias only when a typedef has a different name.

C identifiers may differ only in the case of the first letter in general, though, so that special case is not sufficient.

enum Bar {
    HOGE,
    PIYO
};

What should we do in the general case? With more context, perhaps we can use a prefix or suffix. For example, BarStruct and BarEnum could be used. If we maintain a set of identifiers in various current scopes, we could use a prefix or suffix only when it is necessary due to collision.
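A rough sketch of the "suffix only when necessary" idea, assuming that the full list of declarations is known up front. CKind, suffix, and typeNames are hypothetical names introduced for illustration.

```haskell
import Data.Char (toUpper)

-- | The C tag kinds considered in this sketch (hypothetical type).
data CKind = CStruct | CUnion | CEnum
  deriving (Eq, Show)

suffix :: CKind -> String
suffix CStruct = "Struct"
suffix CUnion  = "Union"
suffix CEnum   = "Enum"

capitalise :: String -> String
capitalise []       = []
capitalise (c : cs) = toUpper c : cs

-- | Capitalise every name, and append the kind suffix only to names whose
-- capitalised form is shared with another declaration.
typeNames :: [(CKind, String)] -> [String]
typeNames decls = map pick decls
  where
    plain     = map (capitalise . snd) decls
    clashes n = length (filter (== n) plain) > 1
    pick (kind, cName)
      | clashes base = base ++ suffix kind
      | otherwise    = base
      where
        base = capitalise cName
```

For struct bar, enum Bar, and struct baz this yields BarStruct, BarEnum, and Baz. The sketch does not handle two declarations of the same kind whose names differ only in the case of the first letter, nor a suffixed name that happens to match another C name.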

@TravisCardwell
Collaborator

C struct member names are in a namespace specific to that struct. Different struct definitions in the same scope can have members with the same name. Haskell accessor functions, on the other hand, are in the variable namespace for the module, and the same name cannot be used multiple times within a module.

With more context, perhaps we could prefix member names with the name of the struct. For example, barC and barNext could be used for the example in the above comment. These could of course conflict with other definitions, however. If we maintain a set of identifiers in various current scopes, we can check for and avoid collisions. The only way I can think of to avoid collisions without doing this is to encode enough information in Haskell names that it is guaranteed to be unique (assuming that the C is valid), making use of a character that is valid in Haskell identifiers but not C identifiers as a control character: '. The result would likely be pretty ugly, though.
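To make the trade-off concrete, here is a sketch of both variants. The function names fieldName and fieldNameQuoted are hypothetical, and the quoted form is only one possible way to use ' as a separator.

```haskell
import Data.Char (toLower, toUpper)

lowerFirst, upperFirst :: String -> String
lowerFirst []       = []
lowerFirst (c : cs) = toLower c : cs
upperFirst []       = []
upperFirst (c : cs) = toUpper c : cs

-- | camelCase prefixing: struct name followed by the capitalised member.
fieldName :: String -> String -> String
fieldName structName member = lowerFirst structName ++ upperFirst member

-- | Prefixing with @'@ as a separator.  Since @'@ cannot occur in a C
-- identifier, the boundary between struct name and member stays visible.
fieldNameQuoted :: String -> String -> String
fieldNameQuoted structName member = lowerFirst structName ++ "'" ++ member
```

The camelCase form is ambiguous: member barBaz of struct foo and member baz of struct fooBar both become fooBarBaz, whereas the quoted form gives foo'barBaz and fooBar'baz. The quoted form still lowercases the first letter of the struct name, so structs foo and Foo would need further encoding to stay distinct.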

TODO:

  • Add a collisions.h example

@TravisCardwell
Collaborator

> A Haskell identifier ...

The Haskell 2010 Language Report specs are likely insufficient. It is probably better to reference the GHC source.

@TravisCardwell
Collaborator

An example of a character that is valid in C but causes problems in Haskell:

0x30fb KATAKANA MIDDLE DOT ・ xid_continue

This character can be used in C identifiers as long as it is not the first character. When used in Haskell, however, it is interpreted as a symbol character that creates an infix operator.

This is an edge case, but it is plausible that somebody could write C using Japanese identifiers. Other problematic characters that I have found seem at least as unlikely. If we want to handle all cases, however, we need to filter or escape such characters.
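The classification can be reproduced with Data.Char from base, without ICU. The sketch below also shows one hypothetical escaping scheme (not something hs-bindgen does): a character that is not a letter, number, or underscore is replaced by its hexadecimal code point between single quotes. The names isPlainIdentChar and escapeName are made up for this example, and handling of the first character is left out.

```haskell
import Data.Char (GeneralCategory (..), generalCategory, isAlphaNum, ord)
import Numeric (showHex)

-- | U+30FB KATAKANA MIDDLE DOT.
middleDot :: Char
middleDot = '\x30FB'

-- | Its general category is punctuation, not letter or number.
middleDotCategory :: GeneralCategory
middleDotCategory = generalCategory middleDot

-- | Approximation of the characters that are safe inside a Haskell name.
isPlainIdentChar :: Char -> Bool
isPlainIdentChar c = isAlphaNum c || c == '_'

-- | Escape every other character as @'<hex code point>'@.  The single
-- quote never occurs in a C identifier, so the escapes are unambiguous.
escapeName :: String -> String
escapeName = concatMap escape
  where
    escape c
      | isPlainIdentChar c = [c]
      | otherwise          = "'" ++ showHex (ord c) "'"
```

Here middleDotCategory is OtherPunctuation, and a name consisting of a, the middle dot, and b is escaped to a'30fb'b.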

I am attaching a simple utility script that shows various information about characters passed as command-line arguments. The extension is renamed to .txt in order to be attached here.

icu-char-info.txt

@edsko
Collaborator Author

edsko commented Oct 8, 2024

I think it's probably important to have a definition that deals with the most urgent problems first (capitalization issues, name clashes, etc.), before we deal with the various unicode intricacies.

I think keeping this context free is not only useful for code clarity, but also for predictability of the names for users. Perhaps we can just use a system of standard prefixes?

Actually, on the topic of prefixes, perhaps we should have some context: it might be useful to prefix the names of the fields of a struct with the name of that struct; this avoids name clashes between fields of structs, and is a common idiom in Haskell anyway. (Perhaps at a later stage we could add an option for enabling the use of OverloadedRecordFields?)

@edsko
Collaborator Author

edsko commented Oct 8, 2024

Perhaps we should introduce a type family mapping ns :: Namespace to a context required to resolve names in that namespace?

@edsko
Collaborator Author

edsko commented Oct 8, 2024

(Still a local context though; I think for predictability it is important that it doesn't depend on which other types happen to be in scope also.)

@TravisCardwell
Collaborator

Thank you for the feedback!

I agree that we should prioritize common issues over edge cases such as Unicode.

Predictability of names for users is indeed a significant concern. Generating Haddock documentation (#26) would likely be really appreciated. This is especially true in the Template Haskell case, IMHO, so that users do not have to read dumped splices.

By "system of standard prefixes," perhaps you are thinking of prepending strings such as Struct? Example: struct bar {...} could translate to data StructBar = StructBar {...}. A drawback is that some users would likely be unhappy with the longer names.

One benefit of using standard prefixes is that the case of the first letter would be part of the prefix. If we do not try to use Haskell naming conventions, we could use prefixes to work around issues caused by identifiers that only differ in case of the first letter.

Contrived Example
struct bar {
    int a;
    int b;
};

struct Bar {
    int b;
    int c;
};

data Struct_bar = Struct_bar {
    struct_bar_a :: Int
  , struct_bar_b :: Int
}

data Struct_Bar = Struct_Bar {
    struct_Bar_b :: Int
  , struct_Bar_c :: Int
}
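The prefix scheme in this contrived example can be written as two small pure functions. This is a sketch only; structTypeName, structFieldName, and renderStruct are hypothetical names, and the field types are passed in as plain strings.

```haskell
import Data.List (intercalate)

-- | Type and constructor name: a fixed prefix plus the unchanged C tag.
structTypeName :: String -> String
structTypeName tag = "Struct_" ++ tag

-- | Field name: lowercase prefix, the unchanged C tag, and the member.
structFieldName :: String -> String -> String
structFieldName tag member = "struct_" ++ tag ++ "_" ++ member

-- | Render a one-line record declaration from (member, Haskell type) pairs.
renderStruct :: String -> [(String, String)] -> String
renderStruct tag fields =
  "data " ++ ty ++ " = " ++ ty ++ " { "
    ++ intercalate ", "
         [ structFieldName tag m ++ " :: " ++ t | (m, t) <- fields ]
    ++ " }"
  where
    ty = structTypeName tag
```

Because the C names are left untouched, bar and Bar stay distinct. Underscores inside C names can still cause ambiguity: member c of struct a_b and member b_c of struct a both map to struct_a_b_c.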

Prefixing the names of fields indeed requires (local) context. We need to know that the name is a struct member as well as the name of the struct. We could indeed use a type family mapping to constrain the type of the (C) context that may be used for the target (Haskell) namespace. Something like this?

class ToHsName (ns :: Namespace) where
  type ToHsNameContext ns :: Type

  toHsName :: (ToHsNameContext ns) -> CName -> HsName ns

I understand the desire for, and benefits of, using only a local context rather than state that tracks which identifiers have already been used within each namespace. That is how I was thinking about it at first, but I then recalled that the low-level API should work without user customization. In cases where there is a collision, what recourse does the user have? Perhaps they have to develop a C wrapper header with some names changed, duplicating the whole API since we operate at the file level. I wonder if this is acceptable. If we go this route, we should probably write documentation to help users who run into collisions.

FWIW, I think that using a local context sounds good, assuming that it is acceptable for users to have to work around issues using C wrappers.
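A self-contained sketch of how the class with an associated context type could look in use. Namespace, CName, and HsName are simplified stand-ins for the real hs-bindgen types, and VarContext together with both instances is hypothetical.

```haskell
{-# LANGUAGE DataKinds      #-}
{-# LANGUAGE KindSignatures #-}
{-# LANGUAGE TypeFamilies   #-}

import Data.Char (toLower, toUpper)
import Data.Kind (Type)

-- Simplified stand-ins for the real types.
data Namespace = NsVar | NsConstr

newtype CName = CName String

newtype HsName (ns :: Namespace) = HsName String
  deriving (Eq, Show)

class ToHsName (ns :: Namespace) where
  -- | Local context needed to mangle a name in this namespace.
  type ToHsNameContext ns :: Type

  toHsName :: ToHsNameContext ns -> CName -> HsName ns

-- | Hypothetical context for variable-like names: either a top-level
-- name or a member of the named struct.
data VarContext = TopLevel | FieldOf CName

mapFirst :: (Char -> Char) -> String -> String
mapFirst _ []       = []
mapFirst f (c : cs) = f c : cs

-- Constructor names need no context in this sketch.
instance ToHsName 'NsConstr where
  type ToHsNameContext 'NsConstr = ()
  toHsName () (CName n) = HsName (mapFirst toUpper n)

-- Field names are prefixed with the name of the enclosing struct.
instance ToHsName 'NsVar where
  type ToHsNameContext 'NsVar = VarContext
  toHsName TopLevel            (CName n) = HsName (mapFirst toLower n)
  toHsName (FieldOf (CName s)) (CName n) =
    HsName (mapFirst toLower s ++ "_" ++ n)
```

The context stays local: the result depends only on the name and its enclosing struct, not on which other declarations are in scope.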

Here is an overview of the types of collisions (thought of so far):

  • The C code has a typedef for a struct with an identical name.
    • This is considered a best practice by some folks, so it is very likely that we will run into this.
    • Perhaps we can skip generating code for the typedef in this case, using Foo for both foo and struct foo.
  • The C code has different struct members with the same name.
    • Prefixing with the name of the struct requires (local) context.
    • Perhaps OverloadedRecordFields can be used in the future.
  • The C code may have (unrelated) definitions with names that differ only in the case of the first letter.
    • Such names are distinct in C, but capitalization rules in Haskell can result in collision.
  • A C name could collide with something else in scope within the Haskell module.
    • When run as a preprocessor, perhaps we could use NoImplicitPrelude and import Prelude qualified to minimize the number of things in scope? If we make all imports qualified with ' suffixes, then we can avoid such collisions since ' cannot be used in C identifiers. Example: import Prelude qualified as Prelude'
    • When run via Template Haskell, we do not control what is in scope. It is possible to query the compiler using lookupTypeName and lookupValueName if we want to mitigate this (which does not seem likely at this time).
  • A C name could collide with a Haskell keyword.
    • We could check for this and use a different name.
  • C names may include characters that are not allowed in Haskell names, and there may be collisions after accounting for this.
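For the keyword case, a minimal sketch might look as follows. The list contains only the Haskell 2010 reserved identifiers; GHC extensions reserve more words in some contexts, so a real implementation would need a longer list. The helper name avoidKeyword and the choice of a ' suffix are assumptions made for illustration.

```haskell
-- | Reserved identifiers from the Haskell 2010 Language Report.
haskellKeywords :: [String]
haskellKeywords =
  [ "case", "class", "data", "default", "deriving", "do", "else"
  , "foreign", "if", "import", "in", "infix", "infixl", "infixr"
  , "instance", "let", "module", "newtype", "of", "then", "type"
  , "where", "_"
  ]

-- | Append a single quote to names that would otherwise be keywords.
-- Since @'@ cannot occur in a C identifier, the adjusted name cannot
-- equal any other unmodified C name.
avoidKeyword :: String -> String
avoidKeyword name
  | name `elem` haskellKeywords = name ++ "'"
  | otherwise                   = name
```

A C variable named type would become type', while count is left unchanged.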

@edsko
Collaborator Author

edsko commented Oct 9, 2024

> By "system of standard prefixes," perhaps you are thinking of prepending strings such as Struct? Example: struct bar {...} could translate to data StructBar = StructBar {...}. A drawback is that some users would likely be unhappy with the longer names.

Yes indeed, though instead of Struct we could consider C (alongside types in Foreign.C such as CInt). But yes.

> Something like this?
>
> class ToHsName (ns :: Namespace) where
>   type ToHsNameContext ns :: Type
>
>   toHsName :: (ToHsNameContext ns) -> CName -> HsName ns

Yes indeed (no need for the brackets though 😄 ). While we're at it, we might as well also add a NameManglingOptions struct, even if it's empty at the moment, so that we know to thread that through.
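As a sketch of how an options record could be threaded through, assume a single hypothetical field that holds the type prefix. The record is proposed as empty for now, so typePrefix, defaultOptions, and typeName are illustrative names only.

```haskell
import Data.Char (toUpper)

-- | Hypothetical options record with one illustrative field.
newtype NameManglingOptions = NameManglingOptions
  { typePrefix :: String  -- ^ prepended to every generated type name
  }

-- | Default: the short prefix "C", in the style of Foreign.C type names.
defaultOptions :: NameManglingOptions
defaultOptions = NameManglingOptions { typePrefix = "C" }

upperFirst :: String -> String
upperFirst []       = []
upperFirst (c : cs) = toUpper c : cs

-- | Mangle a C tag into a Haskell type name under the given options.
typeName :: NameManglingOptions -> String -> String
typeName opts cName = typePrefix opts ++ upperFirst cName
```

With the default options, struct bar would map to CBar. One thing to watch with a C prefix is that a C name such as int would map to CInt, which clashes with Foreign.C.Types.CInt if that is imported unqualified.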

@TravisCardwell
Collaborator

I just discovered #160, which covers the same topic. This issue is the duplicate, but I am closing #160 in favor of this one because this one has the discussion.

In #160, @phadej wrote:

> instance Storable primitive where ...
>
> is invalid. Similarly data Foo = Foo { Bar :: Int } would be invalid.
>
> We need a name mangling scheme for at least data type names and field names
>
> See example outputs in master or #159

I see that the case of the first letter is already causing problems. I will fix this ASAP, before adding the context and options or considering edge cases.

@TravisCardwell
Collaborator

TravisCardwell commented Oct 11, 2024

The following example should address both of the basic case issues that @phadej indicated.

// lowercase first character
struct foo {
    // uppercase first character
    int A;
};

I prefer to see a test fail before I address the issue, so that I have confidence that I am going in the right direction when the test starts to pass. I tried using preprocess, but that fails. Investigating, I see that data declarations are not implemented yet. No problem.

Details
$ cabal run hs-bindgen -- preprocess -i basics.h -o basics.hs
hs-bindgen: TODO
CallStack (from HasCallStack):
  error, called at src/HsBindgen/Backend/HsSrcExts.hs:90:20 in hs-bindgen-0.1.0-inplace:HsBindgen.Backend.HsSrcExts

EDIT: It turns out that this was already in the tests, but the tests were not failing because they were testing against invalid Haskell. I updated the tests.

I implemented very simple name mangling that just changes the case of the first letter of the name as required, according to the target namespace.

Some GHCi tests
λ: import HsBindgen.C.AST
λ: import HsBindgen.Hs.AST.Name
λ: toHsName @NsVar "foo"
"foo"
it :: HsName NsVar
λ: toHsName @NsConstr "foo"
"Foo"
it :: HsName NsConstr
λ: toHsName @NsVar "Foo"
"foo"
it :: HsName NsVar
λ: toHsName @NsConstr "foo"
"Foo"
it :: HsName NsConstr

I did this minimal change first in case this is blocking anybody, and I will add the context and options next.
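For readers without the repository at hand, the behaviour shown in the GHCi session can be approximated by the following standalone sketch. It is not the hs-bindgen implementation: Namespace, CName, and HsName are simplified stand-ins, and mapFirst is a hypothetical helper.

```haskell
{-# LANGUAGE DataKinds         #-}
{-# LANGUAGE KindSignatures    #-}
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE TypeApplications  #-}

import Data.Char (toLower, toUpper)
import Data.String (IsString (..))

-- Simplified stand-ins for the real types.
data Namespace = NsVar | NsConstr

newtype CName = CName String

-- Allows string literals to be used as C names.
instance IsString CName where
  fromString = CName

newtype HsName (ns :: Namespace) = HsName String
  deriving (Eq, Show)

-- | Context-free conversion, selected by the target namespace.
class ToHsName (ns :: Namespace) where
  toHsName :: CName -> HsName ns

-- | Apply a function to the first character only.
mapFirst :: (Char -> Char) -> String -> String
mapFirst _ []       = []
mapFirst f (c : cs) = f c : cs

-- Variable names start with a lowercase letter.
instance ToHsName 'NsVar where
  toHsName (CName n) = HsName (mapFirst toLower n)

-- Constructor names start with an uppercase letter.
instance ToHsName 'NsConstr where
  toHsName (CName n) = HsName (mapFirst toUpper n)
```

The namespace is chosen with a type application, for example toHsName @'NsConstr "foo". Names that begin with an underscore or a caseless character pass through unchanged in this sketch, so they are not handled correctly for constructors.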
