
Angry Unix Programmer

Segmentation fault
Debugger process has died
Sanity breaking.
June 21, 2009 2:19 AM CDT by psilord in category Useless Ideas

Assumption Inference

Today I was thinking about the old adage that the average number of arguments passed to a function is four.

Why four? Why not two or six or seventeen? What is special about that number that programs written by human beings exhibit such a behavior? How is the selection of that average number related to how humans process information, or is it related at all?

Which brings me to an analogous topic... Human short term memory can hold 7 symbols on average....

Oops. It turns out that is an urban legend, which masterfully destroys the original blog post I had made using that reasoning. Well, that's what you get when you poorly skim the Internet looking for something. Sure, the Internet is filled with piles of knowledge which can elevate you to PhD status, help the sick and the poor, oh, and also porn. Lots of porn. And poor quality videos about idiots doing stupid things and getting seriously hurt. Anyhow, the indisputable and eternal paragon of information retrieval that is Wikipedia has an article about working memory, which instead I will skim and then about which I shall make bold and unsubstantiated claims.

Programming is a dance between long term and working memory. Long term memory acts as the library from which one retrieves the API calls, algorithms, and data structures needed to perform the manipulation in question. Working memory holds the current data flow graph, only a few hops before and after the line being authored at any given moment. If one subscribes to certain working memory models, there may also be some kind of episodic memory that hackers use to replay (or, on the other side of the coin, plan) how they are going to manipulate some variables or other programming concepts; once they complete authoring that, they create another episodic set and then implement it.

I hypothesize that the reason there are so few parameters to functions is that there are often significant out-of-band pieces of information surrounding the implementation of a function, and working memory overload makes the hacker forget about them and the constraints they force upon the solution. I'd say there are explicit and implicit pieces of information about functions (and individual code lines in general, but for now we'll just talk about functions, and Unix system functions at that!). A constraint is something which must be kept true or false, and it is either a first class mental symbol, as in "the passed-in pointer must be non-NULL", or something inferred from other constraints, such as: if I fill up /tmp while write()ing a file, then malloc() will fail on an older Solaris machine, because a hidden constraint of the virtual memory system is bound to the free space in /tmp.

The difference between explicit and implicit hinges upon inherent documentation and/or commonality versus hidden side effects and/or uniqueness. Inherent documentation would be something like seeing the function prototype on a man page and being presented with the direct arguments to the function, or perfect recall from long term memory. A hidden side effect would be something like the function returning a pointer to a static buffer that a later call to the function will alter. A unique fact would be something like read() of very small chunks not updating the atime on a file over NFS if client side attribute caching is enabled (which it often is).

As the number and temporal sequencing of mental symbols around a call to a function increases, so does the chance of erroneous code being written. This is an important thing to understand, and it may explain why certain functions have an error prone nature about them in certain contexts. Regardless of how human working memory actually works, one can hypothesize that only a certain number of symbolic pieces of information (function prototypes, variables, their spatial relationship on screen, the explanation of someone describing how to do it to you, how the use of an idiom makes you feel, etc.) can be kept in attentive memory at any given time. One can also make a reasonable assumption that the set of symbols and their temporal relations kept in the attention centers has an average size across different human beings, and that average is probably a small number or a small length of time (seconds at best). It seems, though, that if episodic memory is utilized, the information can be held in stasis much longer.

"I don't believe the bilious feces that you are projectile vomiting into the Internet and would like examples", you say? [Actually, someone didn't believe them and convinced me of a rewrite. I almost got lucky because I don't have a comment system, but unfortunately Alan knew my email address.]

I've analyzed some common functions out of libc to show the sets of mental symbols needed to use them correctly, which I then grouped into explicit and implicit sets by, check this out, subjective observation. How's that for scientific? My observation is that the mental symbols common between functions rank as explicit and come first, while the implicit ones come last: the unique mental symbols associated with a function that are remembered only from the time the man page is read to the act of authoring the code. Implicit mental symbols would be the ones held most strongly in the working memory system.

I predict that functions whose explicit and implicit data flow symbols pass a magic number of "too damn much" suffer from erroneous application. I'm assuming that, in general, the explicit stuff is recalled from long term memory and the implicit stuff is picked from a man page about 0-2 seconds before writing the piece of code that it affects.