Growth of the Linux Code Base

Remzi H. Arpaci-Dusseau
Introduction
We study the growth of the Linux source code tree over time. This study is not yet complete; the results presented here are preliminary.

Methodology
To gather data for this study, we obtained source code from this site. Fortunately, Linux source code archives are commonplace, enabling this type of analysis.

To analyze the data, files are first separated into two categories: code and non-code. Code files are identified by file suffix, in particular .c, .h, and .S. All other files are classified as non-code files.

The size of the Linux source tree, and of each of its components, is calculated simply by summing the sizes (in bytes) of all code files. Thus, whitespace and comments are included in the final counts.
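
The counting rule described above can be sketched in a few lines of Python. This is only an illustration of the method, not the script used for this study; the function names `is_code_file` and `code_bytes` are hypothetical, and skipping symbolic links is an assumption the text does not address.

```python
import os

# Suffixes that mark a file as "code", as listed in the methodology.
CODE_SUFFIXES = (".c", ".h", ".S")


def is_code_file(name):
    """Return True if the file name ends in one of the code suffixes.

    The match is case-sensitive, so "entry.S" counts but "entry.s" does not.
    """
    return name.endswith(CODE_SUFFIXES)


def code_bytes(root):
    """Sum the sizes, in bytes, of all code files under the directory root.

    Whitespace and comments are counted, because only file sizes are used.
    Assumption: symbolic links are skipped so no file is counted twice.
    """
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if is_code_file(name) and not os.path.islink(path):
                total += os.path.getsize(path)
    return total
```

Calling `code_bytes` on the top-level directory of an unpacked source tree would give the total for one version; repeating this for each archived version yields one data point per release.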

Analysis
Click here for a graph of the growth of Linux since its inception. From mid-1994 to mid-1999, the total amount of code in the source tree grew by almost a factor of 10 (from 5.23 MB of code in version 1.1.33, July 21, 1994, to 51.61 MB in version 2.3.9, June 30, 1999).
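
The "almost a factor of 10" figure follows directly from the two sizes quoted above. The short calculation below uses only those two data points; the average yearly growth it also prints is a derived illustration, not a number reported in this study, and the variable names are ours.

```python
from datetime import date

# The two data points quoted in the text.
start_date, start_mb = date(1994, 7, 21), 5.23   # version 1.1.33
end_date, end_mb = date(1999, 6, 30), 51.61      # version 2.3.9

growth_factor = end_mb / start_mb
elapsed_days = (end_date - start_date).days
elapsed_years = elapsed_days / 365.25
avg_mb_per_year = (end_mb - start_mb) / elapsed_years

print(f"growth factor: {growth_factor:.2f}x")  # prints "growth factor: 9.87x"
print(f"average growth: {avg_mb_per_year:.2f} MB/year over {elapsed_days} days")
```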

Click here for the same graph, but with the addition of two linear models. The first model represents the growth of Linux from inception through mid-1995 (version 1.2). The second model begins in mid-1995 (version 1.3) and goes through late 1999 (version 2.3). As you can see, the code base has grown much faster (5x) during the past five years than during the first few years.
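
For readers who want to reproduce the comparison, one way to build two such models is to split the data at a breakpoint and fit a straight line to each part. The sketch below assumes ordinary least squares, which this study does not specify; `fit_line` and `fit_two_models` are hypothetical names, and each point is assumed to be a pair (time in fractional years, size in MB).

```python
def fit_line(points):
    """Ordinary least-squares fit of y = slope * x + intercept.

    points is a list of (x, y) pairs, for example (year, size_in_mb).
    """
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points to fit a line")
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def fit_two_models(points, breakpoint):
    """Fit one line to points before the breakpoint and one to the rest.

    Returns ((early_slope, early_intercept), (late_slope, late_intercept)).
    """
    early = [(x, y) for x, y in points if x < breakpoint]
    late = [(x, y) for x, y in points if x >= breakpoint]
    return fit_line(early), fit_line(late)
```

With the slopes expressed in MB per year, dividing the later slope by the earlier one gives the ratio of growth rates between the two periods.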

Finally, we break down the code into various sub-components, based on directory organization. Here is a set of graphs that show how much code is found in each major component of Linux. In each graph, two lines are plotted. The first (red) line plots the percentage of code in this sub-component, relative to the total code base; use the left y-axis for this line. The second (green) line plots the total number of bytes in this sub-component, and should be read against the right y-axis.
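
A breakdown by directory can be computed with the same byte-counting rule, grouping files by the top-level directory they live in. The following is a minimal sketch under that assumption; `bytes_by_component` and `percent_shares` are hypothetical names, and treating each top-level directory as one component is our reading of "directory organization".

```python
import os
from collections import defaultdict

CODE_SUFFIXES = (".c", ".h", ".S")


def bytes_by_component(root):
    """Map each top-level directory under root to its total code bytes.

    Code files that sit directly in root are grouped under ".".
    """
    sizes = defaultdict(int)
    for dirpath, _dirnames, filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        component = "." if rel == "." else rel.split(os.sep)[0]
        for name in filenames:
            path = os.path.join(dirpath, name)
            if name.endswith(CODE_SUFFIXES) and not os.path.islink(path):
                sizes[component] += os.path.getsize(path)
    return dict(sizes)


def percent_shares(sizes):
    """Convert per-component byte counts into percentages of the total."""
    total = sum(sizes.values())
    if total == 0:
        return {name: 0.0 for name in sizes}
    return {name: 100.0 * size / total for name, size in sizes.items()}
```

The two return values correspond to the two lines in each graph: `percent_shares` gives the percentage plotted against the left y-axis, and `bytes_by_component` gives the byte count plotted against the right y-axis.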

Not surprisingly, drivers dominate the total amount of source code, accounting for over 50% of all code. The architecture-dependent code (arch) is another substantial contributor, accounting for almost 20%. The file system and networking stacks each account for about 7%, though the file system used to play a much more prominent role. Memory management code is insubstantial.

More detailed analyses are forthcoming.