To analyze the data, files are first separated into two categories: code and non-code. Code files are identified by file suffix, in particular .c, .h, and .S. All other files are classified as non-code files.
The size of the Linux source tree and of its various components is calculated simply by summing the sizes (in bytes) of all code files. Thus, whitespace and comments are included in the final counts.
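The classification and size calculation described above can be sketched roughly as follows. This is a minimal illustration, not the script used for the analysis: the function names are hypothetical, suffix matching is assumed to be case-sensitive (so .s would not count as code), and symbolic links are assumed to be skipped.

```python
import os

# Suffixes treated as code, as listed in the text. Matching is assumed to be
# case-sensitive, so "foo.s" is NOT counted here.
CODE_SUFFIXES = (".c", ".h", ".S")


def is_code_file(name):
    """Return True if the file name ends with one of the code suffixes."""
    return name.endswith(CODE_SUFFIXES)


def code_size_bytes(root):
    """Sum the size in bytes of every code file under root.

    Whitespace and comments are counted, since only raw file sizes are used.
    Symbolic links are skipped (an assumption, to avoid double counting).
    """
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if is_code_file(name) and not os.path.islink(path):
                total += os.path.getsize(path)
    return total
```

Calling code_size_bytes on the root of an unpacked kernel tree would give one data point per version.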
Click here for the same graph, but with the addition of two linear models. The first model represents the growth of Linux from its inception through mid-1995 (version 1.2). The second model begins in mid-1995 (version 1.3) and runs through late 1999 (version 2.3). As you can see, the code base has grown much faster (5x) during the past five years than it did during the first few years.
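One way to produce two such linear models is an ordinary least-squares fit on each side of a chosen breakpoint. The sketch below assumes that approach; the text does not say how the models were actually fitted, and the function names and the (time, size) point format are hypothetical.

```python
def fit_line(points):
    """Ordinary least-squares fit of y = slope * x + intercept.

    points is a list of (x, y) pairs with at least two distinct x values.
    """
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def fit_two_segments(points, breakpoint):
    """Fit one line to points before the breakpoint and one to the rest.

    Returns ((slope, intercept), (slope, intercept)) for the early and
    late periods. The breakpoint itself is assigned to the late period.
    """
    early = [p for p in points if p[0] < breakpoint]
    late = [p for p in points if p[0] >= breakpoint]
    return fit_line(early), fit_line(late)
```

With x as the release date (for example, in fractional years) and y as the code size in bytes, the ratio of the two slopes gives the relative growth rate of the two periods.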
Finally, we break down the code into various sub-components, based on directory organization. Here is a set of graphs that show how much code is found in each major component of Linux. In each graph, two lines are plotted. The first (red) line plots the percentage of code in this sub-component relative to the total code base; use the left y-axis for this line. The second (green) line plots the total number of bytes in this sub-component and should be read against the right y-axis.
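The two quantities plotted in each graph, the byte count and the percentage of the total, could be computed along these lines. This sketch assumes that a sub-component corresponds to a top-level directory of the source tree (drivers, arch, fs, and so on); the function names and the "(top level)" label for files in the root directory are hypothetical.

```python
import os
from collections import defaultdict

# Same code suffixes as in the classification step (case-sensitive).
CODE_SUFFIXES = (".c", ".h", ".S")


def component_sizes(root):
    """Map each top-level directory under root to its code size in bytes."""
    sizes = defaultdict(int)
    for dirpath, _dirnames, filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        # Files directly in root get a placeholder component name.
        component = "(top level)" if rel == os.curdir else rel.split(os.sep)[0]
        for name in filenames:
            path = os.path.join(dirpath, name)
            if name.endswith(CODE_SUFFIXES) and not os.path.islink(path):
                sizes[component] += os.path.getsize(path)
    return dict(sizes)


def component_percentages(sizes):
    """Convert per-component byte counts into percentages of the total."""
    total = sum(sizes.values())
    if total == 0:
        return {}
    return {name: 100.0 * size / total for name, size in sizes.items()}
```

The byte counts correspond to the green line (right y-axis) and the percentages to the red line (left y-axis), one point per kernel version.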
Not surprisingly, drivers dominate the total amount of source code, accounting for over 50% of all code. The architecture-dependent code (arch) is another substantial contributor, accounting for almost 20%. The file system and networking stacks account for about 7% each, though the file system used to play a much more prominent role. Memory management code makes up an insubstantial share of the total.
More detailed analyses are forthcoming.