-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
TokenSet, First(), Follow() #3
base: master
Are you sure you want to change the base?
Conversation
NEW: A) The First and Follow methods in NonTerminals. B) TokenSet.cs C) Various changes implemented to help with A) and B) (marked with //SAM CHANGE for easy location) Notes: The newFromUnion() method in TokenSet.cs has two implementations: One, based on the C compiler, is the one currently in use. This one I believe to be more efficient. The other has been commented out. This one is more readable in my opinion. These changes should be ready to be committed into master. For now I will leave them in a separate branch for any desired review before that happens.
NonTerminals.cs
Outdated
|
||
uint Count() { | ||
uint total = 0; | ||
foreach (String s in Enum.GetNames(typeof(Production))) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In C# we can use the Length()
method to get the number of items in an enumeration:
Enum.GetNames(typeof(Production)).Length;
NonTerminals.cs
Outdated
p += alternateSetOffset; | ||
} /* end if */ | ||
|
||
return tokenset; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This always returns null.
We need initialisers for all the FIRST sets, ideally in an array where the index matches the value of p
.
see
https://bitbucket.org/trijezdci/m2c-rework/src/tip/m2-production.c#m2-production.c-48
https://bitbucket.org/trijezdci/m2c-rework/src/tip/m2-production.c#m2-production.c-231
once we have the initialisers, we can get the set from the array
index = Convert.ToUInt32(p) + alternateSetOffset;
...
tokenset = firstSet[index];
...
NonTerminals.cs
Outdated
} /* end if */ | ||
|
||
return tokenset; | ||
} /* end FOLLOW */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This too, always returns null.
We need initialisers for all the FOLLOW sets, in the same manner as for the FIRST sets.
see
https://bitbucket.org/trijezdci/m2c-rework/src/tip/m2-production.c#m2-production.c-139
https://bitbucket.org/trijezdci/m2c-rework/src/tip/m2-production.c#m2-production.c-325
and accordingly ...
index = Convert.ToUInt32(p) + alternateSetOffset;
...
tokenset = followSet[index];
...
NonTerminals.cs
Outdated
|
||
TokenSet FIRST(Production p) { | ||
TokenSet tokenset = null; | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should add uint index = 0;
NonTerminals.cs
Outdated
|
||
if (IsConstParamDependent(p)) | ||
{ | ||
p += alternateSetOffset; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should be index = Convert.ToUInt32(p) + alternateSetOffset;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a particular reason you like Convert.ToUInt32(p) over (uint)p?
Should run a find & replace and change all of the code to one or the other?
NonTerminals.cs
Outdated
} /* end if */ | ||
if (IsVariantRecordDependent(p) && CompilerOptions.VariantRecords()) | ||
{ | ||
p += alternateSetOffset; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as for line 119
NonTerminals.cs
Outdated
|
||
TokenSet FOLLOW(Production p) { | ||
TokenSet tokenset = null; | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see line 116
NonTerminals.cs
Outdated
TokenSet tokenset = null; | ||
|
||
if (IsConstParamDependent(p)) { | ||
p += alternateSetOffset; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see line 119
NonTerminals.cs
Outdated
p += alternateSetOffset; | ||
} /* end if */ | ||
if (IsVariantRecordDependent(p) && !CompilerOptions.VariantRecords()) { | ||
p += alternateSetOffset; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see line 119
NonTerminals.cs
Outdated
lastOptionDependent = lastNoVariantRecDependent; | ||
public static int alternateSetOffset = (int)lastOptionDependent - (int)firstOptionDependent + 1; | ||
|
||
/* -------------------------------------------------------------------------- |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should have a private array with the initialisers for all FIRST and FOLLOW sets here. Ideally all initialisation will take place at compile time. The FIRST() and FOLLOW() functions should then compute the index of the appropriate set in the private array, fetch it from the array and return it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure where to pull the values for the first and follow sets from. It's possible I'm just completely missing something obvious. Can you point me where I need to be?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You need to write standalone programs that generate the code. The C versions are
https://bitbucket.org/trijezdci/m2c-rework/src/tip/gen_first_sets.c
https://bitbucket.org/trijezdci/m2c-rework/src/tip/gen_follow_sets.c
The C versions of the generated code are
https://bitbucket.org/trijezdci/m2c-rework/src/tip/m2-first-set-inits.h
https://bitbucket.org/trijezdci/m2c-rework/src/tip/m2-follow-set-inits.h
In C we pull them in via #include directives, but in C# there is no such thing. We could use templates but for now let's simply copy-paste the generated code in.
return null; | ||
} /* end if */ | ||
return token.ToString(); | ||
} /* end LexemeForResword */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although this matters less for reserved words, for we use interned strings for all lexemes within the compiler. For consistency we should also apply this to reserved words.
The reason for interning strings is that at any point within the compiler we can compare two strings with a single identity test instead of having to string-compare them.
For example, the identifiers of modules and functions in Modula-2 are repeated at the end of their declarations:
PROCEDURE Foobar ( baz : Bam );
...
END Foobar;
When checking that the END part matches the procedure header, we can simply test for identity:
ident1 = Lexer.CurrentLexeme();
...
ident2 = Lexer.CurrentLexeme();
...
if (ident1 == ident2) {
/* identifiers match */
}
If we use the To.String()
method, the strings won't be interned.
In the C version, I defined private arrays that contain the strings, then calculate the index for the array, fetch the string and return it. Even though C# has methods for interning strings, it is preferable to use the same approach as in the C version. Ideally, the string arrays should be initialised at compile time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I created the array lexemeTable in TokenSet.cs to do the same job as your string arrays. I simply referenced lexemeTable in my fix, but because it's static I could easily move lexemeTable into Terminals.cs and then reference it as Terminals.lexemeTable in TokenSet if you would rather.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good work thus far, but we need initialisers for the sets. For details see my comments within the code.
TokenSet.cs
Outdated
|
||
/* out-of-range guard */ | ||
|
||
"\0" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this guard is only needed in C, not in C#.
TokenSet.cs
Outdated
public static int segmentCount = (Enum.GetNames(typeof(Token)).Length / 32) + 1; | ||
|
||
public struct opaque | ||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the style we use throughout the project is ...
foobar {
...
} /* end foobar */
TokenSet.cs
Outdated
|
||
public static int segmentCount = (Enum.GetNames(typeof(Token)).Length / 32) + 1; | ||
|
||
public struct opaque |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should use a more descriptive name here, possibly also change the struct to something else.
An opaque type is a type whose name is defined in a public interface and whose implementation is hidden. Instances of the type are opaque pointers to hidden data, that is, the pointer is exposed to clients but the structure that it points to is unknown to them.
The C syntax to declare opaque types is not self explanatory. It isn't immediately obvious what the type declaration means. For this reason we use a naming convention in C that makes it explicit in the name that the type is an opaque type. Thus a type we might otherwise have called tokenset_t
is then called tokenset_opaque_t
instead.
Both the structure and the naming needs to be appropriately adjusted in the C# version. In C# the primary means to hide data is the class. The struct is simply a completely hidden type, it is not exposed to the client. The instance of the hidden type is an instance variable.
The advantage of keeping the struct is that it can be mapped to the initialisation data which we want to be struct based so we can initialise it using a structured literal like { 0x0, 0x0, 0x3, 2 }. Otherwise we could simply declare separate instance vars, one for each segment, one for the count.
private struct TokenSetBits {
uint bits64to95;
uint bits32to63;
uint bits0to31;
uint count;
} /* end TokenSetBits */
alternatively
private struct TokenSetBits {
uint[SegmentCount] segments;
uint count;
} /* end TokenSetBits */
which allows us to address the segments by index (desirable), but then SegmentCount has to be a compile time constant and the value needs to be calculated at compile time (absolute must), not at runtime.
Terminals.cs
Outdated
} /* end switch */ | ||
break; | ||
|
||
case /* length 6 */ 6 : |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had added reserved word OPAQUE here. You need to copy the section for all the reswords of length 6.
imp/NonTerminals.cs
Outdated
|
||
TokenSet[] followSet = new TokenSet[Count()]; | ||
TokenSet[] firstSet = new TokenSet[Count()]; | ||
public static uint alternateSetOffset = (uint)lastOptionDependent - (uint)firstOptionDependent + 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The constants firstConstParamDependent, lastConstParamDependent, firstNoVariantRecDependent, lastNoVariantRecDependent, firstOptionDependent, lastOptionDependent and alternateSetOffset are only used by the methods of this class, we don't need to expose them to clients. They ought to be private.
imp/NonTerminals.cs
Outdated
TokenSet[] followSet = new TokenSet[Count()]; | ||
TokenSet[] firstSet = new TokenSet[Count()]; | ||
public static uint alternateSetOffset = (uint)lastOptionDependent - (uint)firstOptionDependent + 1; | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please set your editor to expand tab into two spaces [ASCII(32)], do not let it insert any TABs [ASCII(9)].
Lines should be broken as follows
public Foobar foobar =
foo = Foobar.Foo,
bar = Foobar.Bar,
baz = Foobar.Baz;
etc
imp/NonTerminals.cs
Outdated
public static uint alternateSetOffset = (uint)lastOptionDependent - (uint)firstOptionDependent + 1; | ||
|
||
#region followSets | ||
TokenSet[] followSets = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FIRST sets should come first, FOLLOW sets last.
imp/NonTerminals.cs
Outdated
|
||
new TokenSet( /* bits: */ 0x58000004, 0x00000242, 0x00000050, /* counter: */ 9 ), /* typeDeclarationTail */ | ||
}; | ||
#endregion | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please remove all the //SAM CHANGE
comments. They have no documentary value for users and maintainers. For any temporary comments during review, we can use Github annotation features.
imp/TokenSet.cs
Outdated
@@ -175,6 +175,20 @@ private TokenSet() | |||
// no operation | |||
} /* end TokenSet */ | |||
|
|||
/* --------------------------------------------------------------------------- | |||
* private constructor TokenSet (uint[]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We already have constructor methods newFromList()
and newFromUnion()
. As you can see in line 170, I have made the default constructor TokenSet()
private in order to prevent clients of the class from ever being able to use any other constructors than newFromList()
and newFromUnion()
.
Your comment says "private" but your code declares it public. But even as a private method it makes no sense because the two constructors newFromList()
and newFromUnion()
use the private default constructor in lines 201 and 248.
imp/TokenSet.cs
Outdated
dataStored.segments[2] = TSB[2]; | ||
dataStored.elemCount = TSB[3]; | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
line 205 says /* allocate new set */
but that is not true because newSet
is already allocated in line 201.
imp/TokenSet.cs
Outdated
dataStored.segments[2] = TSB[2]; | ||
dataStored.elemCount = TSB[3]; | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
line 206 allocates a new array of segments, but the array is already allocated in line 161.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not like the name dataStored
. It is entirely non-descriptive.
If you had no idea what this code is about and all you saw was a single line of code that references the struct like
newSet.dataStored.segments[segmentIndex] = 0;
does the "dataStored" part in there add any information or does it add confusion?
Either use a descriptive name, like bits
...
newSet.bits.segments[segmentIndex] = 0;
newSet.bits.elemCount = 0;
... or get rid of the struct and use just segments
and elemCount
...
newSet.segments[segmentIndex] = 0;
newSet.elemCount = 0;
The latter is generally preferable UNLESS we need to assign a compound literal directly to the struct, like so
newSet.bits = { 0x0, 0x0, 0x11, 2 };
I have not seen your code representing gen_first_sets.c and gen_follow_sets.c. It depends on how you implement those whether you want segments
and elemCount
to be within a struct or separate instance variables.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is not clear to me how you generated the FIRST and FOLLOW set literals. Did you copy-paste them from the C code? If so, that is not what we want. The project needs to be self contained, thus it needs to have its own standalone programs to generate the FIRST and FOLLOW set literals.
imp/TokenSet.cs
Outdated
@@ -154,7 +154,7 @@ public class TokenSet : ITokenSet { | |||
|
|||
}; /* end m2c_token_name_table */ | |||
|
|||
public static const int segmentCount = (Enum.GetNames(typeof(Token)).Length / 32) + 1; | |||
public const int segmentCount = (Enum.GetNames(typeof(Token)).Length / 32) + 1; | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Constants and variables within an implementation of a class should always be private. If a client of the class needs to read the constant or variable, then there should be a public function method to return its value. If a client of the class needs to write to a variable, then there should be a public mutator method to overwrite its value. Such public methods will then need to be added to the public interface of the class.
imp/TokenSet.cs
Outdated
* --------------------------------------------------------------------------- | ||
* Used to instantiate the prefabricated first and follow lists. | ||
* Expects TSB[segmentCount] to be the number of tokens, and returns null if | ||
* the count does not match. | ||
* ------------------------------------------------------------------------ */ | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/* --------------------------------------------------------------------------
* constructor newFromRawData( bits95to64, bits63to32, bits31to0, counter )
* --------------------------------------------------------------------------
* Returns a newly allocated tokenset object from raw data passed in as
* three data segments of 32 bits from highest to lowest, followed by a bit
* counter. Returns null if the input data is inconsistent.
* ----------------------------------------------------------------------- */
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
from highest to lowest
do you want me to change my code to run through the segments in reverse, or do you just want the bit values to be printed out in a different order?
imp/TokenSet.cs
Outdated
} | ||
TokenSet newSet = new TokenSet(); | ||
uint counter = 0; | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First we need to test if the number of arguments is correct, return null if it is not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then, we check the last argument against the largest number of bits that can be represented by segmentCount*32 bits. If the counter is greater, then we can already tell it is wrong, forego any further expensive checks and return null.
imp/TokenSet.cs
Outdated
uint counter = 0; | ||
|
||
for (int i = 0; i < segmentCount; i++) { | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Next, we iterate only over the first count-1 arguments, because the last argument is the counter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a check;
if(args.Length != segmentCount + 1) {
return null;
} /* end if */
so segmentCount should always be the same as args.Length - 1.
Regardless, changed the code to:
for (int segmentIndex = 0; segmentIndex < args.Length - 1; segmentIndex++) {
imp/TokenSet.cs
Outdated
uint counter = 0; | ||
|
||
for (int i = 0; i < segmentCount; i++) { | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Next, we only iterate over the first count-1 arguments because the last argument is the counter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Further, we should use self explanatory names. The loop variant is an index to address segments. A self explanatory name would be segmentIndex
.
imp/TokenSet.cs
Outdated
|
||
for (int i = 0; i < segmentCount; i++) { | ||
|
||
for (int j = 0; j < 32; j++) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This loop variant is an index to address individual bits, a self explanatory name would be bitIndex
.
imp/TokenSet.cs
Outdated
* ------------------------------------------------------------------------ */ | ||
|
||
public TokenSet(params uint[] TSB) | ||
public static TokenSet newFromRawData(params uint[] TSB) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TSB is a non-descriptive name, also it goes against C# naming conventions to have a parameter in all-uppercase. Since the variadic argument list includes both the bit patterns and a bit counter, there isn't really any single name that will be specific, so we have to settle for a generic name. I generally use args
in such cases.
imp/NonTerminals.cs
Outdated
|
||
new TokenSet( /* bits: */ 0x58000004, 0x00000242, 0x00000050, /* counter: */ 9 ), /* typeDeclarationTail */ | ||
TokenSet.newFromRawData( /* bits: */ 0xb0000008, 0x00000484, 0x00000140, /* counter: */ 9 ), /* typeDeclarationTail */ | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All the bit pattern values have doubled relative to the C version, indicating that they were left-shifted by one. A probably cause might be an off-by-one index to address the bits. That is to say, any bit stored at position n should have been stored at position n-1.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unless I'm mistaken, this is because the ARGLIST
token does not exist in the C compiler. The addition of said token early on in the enumeration shifts every following token's index in the enum by one.
Am I correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It appears that the values also shift left again after the index of Backslash
, another token which isn't included in the C compiler.
Moving utils into master
MOD, MODULE, NOT, OF, OPAQUE, OR, POINTER, PROCEDURE, QUALIFIED, RECORD, | ||
REPEAT, RETURN, SET, THEN, TO, TYPE, UNTIL, VAR, WHILE, WITH, | ||
MOD, MODULE, NOT, OF, OPAQUE, OR, POINTER, PROCEDURE, QUALIFIED, RECORD, REPEAT, | ||
RETURN, SET, THEN, TO, TYPE, UNTIL, VAR, WHILE, WITH, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please keep the original formatting -- the sources are formatted to 79 columns to ensure that there will be no vertical scrollbars even when viewed on a small screen laptop.
public static uint alternateSetOffset = (uint)lastOptionDependent - (uint)firstOptionDependent + 1; | ||
|
||
#region followSets | ||
TokenSet[] followSets = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FIRST sets should come first, FOLLOW sets last.
|
||
namespace org.m2sf.m2sharp { | ||
|
||
using System; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please follow the original formatting:
- strictly NO TABs, only spaces.
- two spaces indentation.
- maximum 79 characters per line
NEW:
A) The First and Follow methods in NonTerminals.
B) TokenSet.cs
C) Various changes implemented to help with A) and B) (marked with //SAM CHANGE for easy location)
Notes:
The newFromUnion() method in TokenSet.cs has two implementations:
One, based on the C compiler, is the one currently in use. This one I believe to be more efficient.
The other has been commented out. This one is more readable in my opinion.
These changes should be ready to be committed into master. For now I will leave them in a separate branch for any desired review before that happens.