-
Notifications
You must be signed in to change notification settings - Fork 11
/
intro.texi
279 lines (231 loc) · 12.2 KB
/
intro.texi
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
@node Introduction to Emacs Lisp Byte Code and LAP
@chapter Introduction to Emacs Lisp Byte Code and LAP
@menu
* Why is Emacs Lisp Bytecode Important and How is Emacs as a Program Different?::
* Emacs Lisp Bytecode and LAP::
* Emacs Lisp Virtual Machine::
* Wither Bytecode - Its Future::
@end menu
@SECTION{Why is Emacs Lisp Bytecode Important and How is Emacs as a Program Different?}
If we were to compare two similar complex programs in around 2018,
Firefox 53.0.3 and Emacs 25.3, we would see that the Firefox
tarball is 5 times bigger than the Emacs tarball. How are these made up,
and what languages are they comprised of?
@noindent
For Firefox whose core is written in C++ we have:
@verbatim
$ cloc --match-f='\.(js|c|cpp|html|py|css)$' firefox-53.0.3
89156 text files.
86240 unique files.
1512 files ignored.
cloc v 1.60 T=244.20 s (353.2 files/s, 56012.8 lines/s)
-------------------------------------------------------------
Language files comment code
-------------------------------------------------------------
C++ 7267 418019 3057110
Javascript 25855 532629 2859451
HTML 45311 120520 2209067
C 3482 400594 1664666
@end verbatim
@noindent
And for Emacs whose core is written in C we have:
@verbatim
$ cloc emacs-25.3.tar.xz
3346 text files.
3251 unique files.
1130 files ignored.
cloc 1.60 T=13.85 s (160.1 files/s, 154670.7 lines/s)
-------------------------------------------------------------------
Language files comment code
--------------------------------------------------------------
Lisp 1616 200820 1270511
C 255 66169 256314
C/C++ Header 176 11505 34891
@end verbatim
If we look at the relative ratio of C++ to Javascript code in
Firefox, and the ratio of C versus Lisp code in Emacs, we see that
much more of Emacs is written in Lisp than Firefox is written
in Javascript. (And a lot of C code for Emacs looks like Lisp written using C syntax).
My take is that Emacs a
lot more orthogonal in its basic concepts and construction. Just as
Leibniz was amazed that such diversity could come out of such simple
rules of mathematics and physics, so it is remarkable that something
as complex as Emacs can come out of the relatively simple language,
Lisp.
@SECTION{Emacs Lisp Bytecode and LAP}
However pervasive Emacs Lisp is in the Emacs
ecosystem, Emacs Lisp is not and never has been a speedy language compared
to say, C, C++, Go, Rust or Java. And that's where LAP and bytecode
come in.
As stated in a commment in @code{byte-opt.el} added from Lucid Emacs
circa 1992:@footnote{This likely an adaptation of dialogue from
@url{https://www.goodreads.com/quotes/7745830-you-can-t-make-a-race-horse-of-a-pig-no,
``East of Eden by John Steinbeck''}. Jamie Zawinski (jwz) is
responsible for its addition (and the bit about turbocharged VW bugs
below it) but makes no claims to its origin.}
@quotation
No matter how hard you try, you can't make a racehorse out of a pig.
You can, however, make a faster pig.
@author Jamie Zawinski
@end quotation
Emacs Lisp bytecode is the custom lower-level language used by the
Emacs bytecode interpreter. As with all bytecode, its instructions are
compact. For display purposes, there is a @code{disassemble} command
that unpacks the fields of the instruction. With this and the
constants vector, bytecode can be printed in an assembly language-like
format.
@cindex bytecode
I'll often use an Emacs Lisp bytecode instruction to refer to an assembly
representation of the instruction.
@cindex LAP
LAP stands for Lisp Assembly Program. It is an internal representation
of the bytecode instructions in a more symbolic form. It is used
behind the scenes to make bytecode more amenable to optimization,
since the instructions are in a structure which is easier to operate
on.
If we want to write the instruction sequence in this symbolic form
rather than give a byte-encoded form, we can do that using the
function @code{byte-compile-lapcode}.
@unnumberedsubsec Example showing use of @code{byte-compile-lapcode}
@findex byte-compile-lapcode
@findex make-byte-code
@verbatim
(defalias 'get-foo
(make-byte code
#x000 ;; lexical parameter counts
(byte-compile-lapcode
'((byte-varref . 0)
(byte-return))) ;; instruction sequence
[foo] ;; constants vector
1)) ;; max stack usage
@end verbatim
@uref{https://www.gnu.org/software/emacs/manual/html_node/elisp/Speed-of-Byte_002dCode.html,
Silly Loop Example} in the Emacs Lisp Manual gives a program to time
running in some code in the bytecode interpreter versus running the code in
the Lisp interpreter. When I ran this program, bytecode ran 2.5 times
faster@footnote{Code was compiled to use dynamic binding for variable
access, as was probably the case in the Emacs Lisp manual. We should
note that byte-compiling with lexical binding for variable access
gives code that runs a bit faster than when dynamic binding is
used.}. The Emacs Lisp manual gets a speed improvement of about 3
times.
@SECTION{Emacs Lisp Virtual Machine}
@cindex Reverse Polish Notation
The Emacs Lisp bytecode interpreter, like many bytecode interpreters
such as Smalltalk, CPython, Forth, or PostScript, has an evaluation
stack and a code stack. Emacs Lisp bytecode instructions use reverse Polish notation: operands appear prior to the
operator. This is how many other bytecode interpreters work. It is
the opposite of the way Lisp works. To add the values of two
variables we might write @code{(+ a b)}. However in bytecode it is the
other way around: the operator or function comes last. So the
corresponding bytecode is:
@verbatim
0 varref a
1 varref b
2 plus
@end verbatim
As in most language-specific virtual machines, but in contrast to a
typical general-purpose virtual machine, the things that
are on the evaluation stack are the same objects that are found in the
system that they model. Here, these objects can include Emacs buffers,
or font faces, Lisp objects like hashes or vectors, or simply (30-bit)
Lisp integers. Compare this with, say, LLVM IR, or JVM instructions
where the underlying objects on the stack are registers which can act
as pointers, and the internal memory layout of objects is exposed.
Control flow in Lisp bytecode is similar to a conventional assembly
language: there are unconditional and conditional jumps. More complex
control structures are simply built out of these.
Although it may be obvious, one last thing to point out is that the
Emacs Lisp bytecode instruction set is custom to Emacs. In addition
to primitives that we would expect for Lisp such @code{car} and
@code{cdr}, there are primitive bytecodes for more-complex Emacs
editor-specific concepts such as ``save-excursion''@footnote{The
semantic level difference between Emacs Lisp and its bytecode is not
great, so writing a decompiler for it more feasible than if the
bytecode language were of a general nature such as, say, LLVM IR.}.
The interpreter is largely backward compatible, but not forward
compatible (although eventually old Emacs Lisp bytecode instructions
do die). So old versions of Emacs cannot necessarily run new
bytecode. Each instruction is between 1 and 3 bytes. The first byte
is the opcode and the second and third bytes are either a single
operand or a single immediate value. Some operands are packed into
the opcode byte.
@SECTION{Wither Bytecode - Its Future}
Emacs's bytecode is pretty old---about as old as Emacs itself. And
although there have been some changes to it, there has always been lurking
in the background the question of whether it might be totally ditched,
either as a by-product of switching out the underlying Lisp
implementation for something else, or as a result of using JIT
technology.
Let's take these two situations where Emacs Lisp Bytecode might become
obsolete separately. Both ideas have been floating around for a long
time.@footnote{Tom Tromey says about the same thing in his FOSDEM 2020
talk @url{https://fosdem.org/2020/schedule/event/emacsthoughts/} or at
least he seems to come up with similar conclusions.}
With respect to alternate programming-language implementations, there
have been many languages that been proposed and experimented with. The
big obstacle in totally replacing Emacs Lisp is in rewriting the huge
current Emacs Lisp code base. (The counts given in the last section
for Emacs 25.3 are 1.5K files and 100K lines of code.)
I think that if such an approach were to work, the language would have
to be available as an additional language until the current code base
was replaced. At present (circa 2018), alternate programming languages
haven't gained much of a foothold; they are not in the current Emacs
distribution or in any of its branches.
An obvious alternative language proposed is Common Lisp. Over time, an
Emacs Lisp package implementing Common Lisp has been providing more
and more Common Lisp functionality; names, however, are prefaced with
@code{cl-}.
The addition of features in Common Lisp has been somewhat
reflected in changes in the run-time systems, such as the
addition of lexical scoping. And this approach partially solves the
large code-base migration problem. But it also reduces the need to
jump cold turkey from Emacs Lisp Bytecode to something else.
And what about the other possibility where Emacs incorporates JIT
technology? The motivation for this is to speed up Emacs. There is
widespread belief among the development community that there could be
big performance wins if this were done right. After all, it is not
uncommon for some people to live inside a single GNU Emacs session.
This idea of using a JIT to speed performance goes back over a decade,
at least back to 2006. Of the JITs that have been proposed, at least
four of them use Emacs Lisp Bytecode as the basis from which to JIT
from. I think that is because Emacs Lisp Bytecode is a reasonable
target to JIT: it is sufficiently low level, while also easy to hook a
JIT into.
Two alternatives to Emacs Lisp Bytecode which have sophisticated JIT
technology are LLVM IR and JVM IR. For each, the surrounding run-time
environment would have to be replicated. Another IR possibility might
be JavaScript IRs: specifically, the ones for V8 and Spidermonkey.
Pipcet's work that allows SpiderMonkey's garbage collector to be used in
Emacs, allows for a real possibility of using SpiderMonkey's JIT with
either JavaScript, Emacs Lisp bytecode, or Emacs Lisp bypassing Emacs
Lisp bytecode. That last route I think might be harder. JIT'ing from
Emacs Lisp bytecode to via SpiderMonkey (if it is possible) would
allow for dual Emacs Lisp and JavaScript scripting while the other
options don't.
@cindex LIMPLE
@cindex gccemacs
In late 2019, Andrea Corallo proposed a prototype for adding native
compilation to Emacs Lisp. The work is described in
@url{https://zenodo.org/record/3736363#.X_LeOFNKjMU}. The idea behind
this is to use LAP as main input for the compilation, this is lowered
into LIMPLE, an SSA based intermediate representation, where several
compiler passes are used to perform a number of Lisp specific
optimizations. Eventually LIMPLE is translated into libgccjit IR and
libgccjit is used to plug into the GCC infrastructure and achieve
native code generation.
The approach of a dedicated SSA based Intermediate Representation allow
for advanced value/type inference within the code enabling effective
optimizations as type check removal. Also Elisp is extended for
allowing the programmer to provide compiler hints.
The so called Emacs Lisp native compiler (AKA gccemacs @footnote{Andrea
Corallo gccemacs development blog.
@url{http://akrl.sdf.org/gccemacs.html}}) is capable of AOT as well as
parallel asynchronous JIT compilation. In the later case byte compiled
functions are used till the native compiled version is available and
function hot swap is performed.
As of January 2021 this work is being integrated in Emacs and is
planned to released as part of Emacs 28.
Needless to say, such a lot of work is involved in adding any sort of
JIT technology; it will not totally replace Emacs Lisp Bytecode in the
near future.