User:MikeDunlavey/Linguistic Method Article

Linguistic Method refers to a method of developing software based on awareness of formal computer science and information theory.

Interrelated Topics[edit]

This method requires a number of issues to be simultaneously understood: Speed, Redundancy, Language, and Data.

Speed An algorithm can be thought of as an information channel with input, output, and bandwidth. The individual operations of the algorithm process quantifiable amounts of information, and high performance can be achieved by maximizing the information yield of individual operations. In addition, simple diagnostics in large software can easily identify those parts of programs that waste cycles. The simplest and most effective tool for diagnosing performance problems in single-thread software is a debugger having the ability to instantly halt the program when the programmer strikes a key. This is not like a breakpoint, because it happens at a truly random time in the execution of the program. When the program is halted, the state of the program is examined in depth, including the call stack, and the question is asked "Why was the program doing this particular thing?". This is repeated several times, up to 20. If there is some program construct responsible for a significant part of the execution time, it will be seen, and can be dealt with. Such things are not usually "hotspots" (tight inner loops where the program counter is often found). More commonly, subroutines are being called excessive numbers of times, doing repeated unnecessary calculation. As the typical calling depth of subroutines increases, the time spent tends to increase exponentially, because the opportunity for extra calls exists at every level.

Redundancy Information Theory deals with the subject of redundancy in error-correcting codes. Errors in information transmission can be detected and corrected if sufficient extra bits are added to the code, and if there is a mechanism that uses them for detecting and correcting errors. To change a valid code word A into a valid code word B requires changing some number N of bits. If fewer than N bits are changed, the result is a code word that is not correct. In a program, information is stored in data structure. Generally, if data structures are being built or modified, multiple instructions or statements must be executed. Changing valid data structure A into valid data structure B may require N modifications to the data structure. If fewer than N modifications are done, the data structure is invalid. So data structure has redundancy, and the number N is a measure of it. In the same way, program source code has redundancy. If valid program A requires N source code editing operations (say, insertion or deletion of statements) to be changed to valid program B, if fewer than N editing operations are done, the result is an incorrect program. Now, the essential difference between redundancy in the programming realm and in error-correcting code is that, in error-correcting code, the redundancy is carefully added as part of a well-thought-out scheme for detecting/correcting errors. Without the well-thought-out scheme, the redundancy does not help and in fact, if the redundancy is large, it hurts because it increases the exposure for errors to occur. If there is no error-correction mechanism, redundancy is actually harmful. So, in the programming realm, it is important to minimize redundancy (i.e. the number of steps N needed to make changes correctly) in both data structure and in program source code. As a seldom-achievable ideal, if N can be reduced to 1, errors due to incomplete changes can be eliminated, and it could be suspected that most errors are of this type. When redundancy cannot be eliminated, techniques that help manage it should be considered, such as Differential Execution.

Language If a language is understood to be not only the syntax, but the vocabulary available to the programmer, then computer languages like C, Fortran, C++, etc. are only base languages. The language that the programmer works in consists of this, plus all the classes, variables, routines, methods, macros, etc. that have been defined as part of the project. If these things are defined in a very different way, it makes a huge difference in the actual surface language in which the programmer works. So, programmers are language designers, like it or not, and the languages they are designing are targeted at the problems they are solving. It is beneficial for programmers to understand how to design their languages so that, when requirements change, those changes can be implemented with the smallest possible number of editing operations (N). (Certainly, one can hope for the best that all requirements will be specified ahead of time, but one should also plan for the worst, that requirements will change up until the day of completion.) When languages are designed in such a way as to minimize N for the class of problems at hand, they become "problem-oriented languages". One characteristic of such languages is that they may have dramatically less explicit data structure. There may or may not be plenty of data structure underneath, but once the language is near completion the programmer does not need to refer to it. Another characteristic is that they become "declarative" rather than "imperative". The source code tends to state the problem, rather than the solution. The solution is inherent in the work that goes into building the language because the language solves all problems capable of being stated in it.

Data Data exists to be read by programs that do things with it. The programs can be simple, like reading the data and copying it or adding up the numbers in it. They can be complex, like an interpreter reading the data and using it as instructions to carry out complex activities. They can be intermediate between these. Nevertheless, data controls, to lesser or greater extent, the behavior of the programs that read it. The behavior of the programs can be classified according to what kind of temporary storage they need, just as automata are classified according to the temporary storage they need, whether it be a fixed set of registers, a pushdown stack, or arbitrary storage like a Turing machine tape. When looked at this way, the distinction between data and program becomes only a matter of perspective, just as in Einstein's physics the distinction between matter and energy is only a matter of perspective. Therefore, when data structure is being designed, it is equally correct to say a language is being designed, and vice versa. This may seem a vacuous observation until it is seen that it opens up a spectrum of ways to represent information and process it.

Method Since all of these ideas may be confusing to put into practice, there is a rough outline of a methodology for applying it, consisting of asking a series of questions:

- What information needs to be worked with, and what temporary storage is needed to work with it? This identifies where the information falls in the spectrum from simple to complex processing.

- When is the information acquired? Is it acquired at development time, installation time, or run time? This determines the options for representing the information as source code, installation script, or run-time data structure.

- Who provides the information? This may put constraints on how the information is represented.

Answers to these questions determine the space of possible implementation approaches, from which a good one can be chosen.

Characteristics[edit]

Minimum data structure Explicit run-time data structure (i.e. data structure that programmers need to code to) is minimized. This minimizes redundancy of the source code, because having to manipulate explicit data structure is a major contributor to source code.

Maximum performance Since one option for representing information is to represent it as compiled source code, it can be processed at high speed in that case. Even if the information must be represented in run-time data structure, programs that operate on it may dispense with overly general techniques that sap cycles. For example, programs have made use of hash tables classes, complete with storage allocation and freeing, for lists where the expected number of members is fewer than 10.

Minimize defects When the surface language minimizes redundancy for the problem domain, the exposure to programmer errors in implementing requirements is minimized. This not only minimizes defects, it speeds implementation.

Comparison to Object Oriented Programming[edit]

This approach is not in opposition to Object Oriented Programming (OOP), rather it puts OOP in place as one of multiple implementation techniques. The tension is with the idea that OOP is such a fundamental good idea that all further growth in software engineering must be based on it. In practice, when OOP is used as the thinking method for software design, it tends to result in massive numbers of classes, lots of data structure, and correspondingly massive code (large N, both for data and for source code). Such code, due to many layers of excessively general abstractions implemented by method (subroutines) tends to result in performance problems. By contrast, if OOP is used as a useful technique instead of the end-all concept, it can be used well by being used sparingly. The essential difference is that, rather than thinking in terms of the objects that exist at run-time, a run-time perspective, in the Linguistic Method the thinking is explicitly about the surface form of the source language, i.e. a development-time perspective. One does not ask "What kinds of objects will there be?" but "What do we need to say?".

References[edit]

Dunlavey, "Building Better Applications: a Theory of Efficient Software Development" International Thomson Publishing ISBN 0-442-01740-5, 1994.

Category:Software engineering Category:Programming paradigms