Easy Performance Fixes Guided By Flame Graphs
When I was in grad school and struggling with my Master’s thesis, I started a small side project in order to distract myself and work on something that felt more rewarding. This project was basically a clone of vim – a terminal-based text editor I called badavi. That was around seven years ago. I haven’t worked on it continuously since then, but I would occasionally go back and hack on it in short little bursts. In particular I’ve been doing this the last couple of months, and I thought it might be interesting to write about some of the things I’ve been up to.
For this post, we’ll look at some small performance fixes I made recently, guided by perf and flame graphs.
Just for context, badavi is a small C project, something like 5000 lines currently. It looks like this (click to expand):
Some features showcased there are split windows, insert mode completion (based on ctags), searches, and syntax highlighting.
The other day I was playing around with perf and flame graphs and decided to
see if there were any low hanging fruit performance improvements that could be
made, since I had never profiled the editor before. I used perf record -g
to
profile a short session where I just opened a source file and jumped around the
file using various motions, and generated the corresponding flame graph:
$ perf record -g badavi editor.c
$ perf script | FlameGraph/stackcollapse-perf.pl > out.perf-folded
$ FlameGraph/flamegraph.pl out.perf-folded > perf.svg
This was the result:
From this we can see that the majority of the time was spent inside
editor_draw
. As background, badavi uses a library called termbox to
interact with the terminal. termbox uses an array of cells as an abstraction
over the terminal grid; each cell is composed of a character, a foreground
color, and a background color. In the code, “drawing” refers to the process of
populating this array with the right characters and colors for each cell on the
screen. The array is repopulated from scratch every time we redraw the screen.
Within editor_draw
, we see a few different code paths, but the two dominant
ones are the syntax highlighting machinery (which begins at syntax_token_at
)
and a call to gb_linecol_to_pos
.
Focusing in on the latter, what’s happening there is that for a given window,
we need to populate the array with the portion of the corresponding file text
that is visible on the screen. The window knows which line-and-column positions
are visible, but in order to read the corresponding characters from memory,
these positions need to be converted to offsets from the start of the file (the
file text is represented in memory as a gap buffer, which is similar to a
big array of characters that can accessed by index). This is what
gb_linecol_to_pos
does, and it was being called once per cell.
The way this works right now is that alongside the text, the code also maintains an array of integers storing the length of each line. Converting a line number to an offset involves traversing this array of line lengths and adding them up, so the operation is linear in the number of lines.
I’ve considered changing this array from storing line lengths to storing the
offsets of each newline (or viewed another way, partial sums of the line
lengths). Then gb_linecol_to_pos
could be implemented in constant time, at
the cost of making edits more expensive. I might do that in the future, but in
this case, there was an easy win if just avoid calling it repeatedly like this;
instead, we can get the offset of the first visible character, and increment it
as needed for subsequent characters instead of recomputing it from scratch.
That change can be viewed here:
https://github.com/isbadawi/badavi/commit/2eceda3db447832fd84b8bc6cc0b9e969e769659.
With this change, the flame graph looked like this:
The time was now dominated by syntax highlighting. One of the reasons syntax highlighting is expensive is that we end up scanning the whole file, not just the visible part, just in case we find the start of a block comment or multiline string (depending on the language), which would affect how the visible characters should be colored. I had some bigger improvements in mind for this, so I decided to just set it aside for now, and just disabled syntax highlighting to see if anything else interesting came up. That looked like this:
We see two other code paths that lead to gb_linecol_to_pos
and
gb_pos_to_linecol
(the reverse operation, also linear in the number of
lines). The code path on the right (motion_apply
-> down
) is the
implementation of the G
motion. As in vim, hitting G
in normal mode jumps
to the bottom of the file. I hadn’t noticed that it was slow, since I mostly
work with small source files, but I opened a large 25000-line source file just
to see, and found that there was indeed a very noticeable delay after hitting
G
.
The root cause of the slowness turned out to be a hack that I had put in seven years prior, when I first implemented most of the basic motions.
As in vim, motions in badavi can be prefixed with an integer count, which can
usually be understood as how many times to repeat the motion. For example,
34j
means to repeat j
34 times, where j
means to move down one line.
Given this, I had written some generic code where motions were represented as
functions that returned a new position, and specifying a count caused the
function to be invoked repeatedly in a loop.
But G
works slightly differently; 34G
means to jump to line 34, not to
repeat G
34 times. For whatever reason, I decided back then to awkwardly fit
G
into this “repeated motion” model by implementing it as a repeated j
from
the top of the file, so that for example in a file with 25000 lines, hitting
G
would start from the top of the file and essentially do 25000j
. And as
the flame graph shows, the implementation of j
calls gb_linecol_to_pos
and
gb_pos_to_linecol
. This means G
was quadratic in the number of lines.
I don’t really understand why I did it this way; since G
ends up being
special cased anyway, we might as well special case it to directly compute the
right offset. (Actually it seemed like many of the motions, such as j
in this
example, could be optimized if they were aware of the count instead of being
blindly called repeatedly, but I decided to leave that for later). The change
can be viewed here:
https://github.com/isbadawi/badavi/commit/3f64704a9dd4c768880f71f7693e7240c44c7529.
Here is the corresponding flame graph:
The other code path leading to gb_pos_to_linecol
was the code that displays
the line number for each line. badavi supports relative line numbers, similar
to vim, where rather than displaying the actual line numbers, we display the
line numbers relative to the line number of the cursor (this can be seen in the
screenshot above). In the code, windows keep track of their cursor position as
an offset, so it needs to be converted to a line number, and
window_draw_line_number
just happened to repeatedly recompute it instead of
doing it once and reusing it. That fix can be viewed here:
https://github.com/isbadawi/badavi/commit/13bb413011c9d3cb1732961a3dd069d45aa87105.
And the corresponding flame graph:
There’s not as much an obvious bottleneck here, and we see a broad mix of code paths being represented, including the dynamic linking that happens at process spawn time, the loading of the source file, and the ctags file, so this seemed like a good place to stop for now.
I found this process to be pretty rewarding. In general, profiling an application and looking at flame graphs can often quickly lead to a lot of “obvious” insights that may not have been so obvious just from reading the code (especially if this is the first time profiling this application). Not all issues are as simple to track down and fix as the ones described in this post, but I think it is often the case that many problems become easy if you can just see what a program is actually doing.