Talk:Programming language/Archive 5

This page is an archive of past discussions. Do not edit the contents of this page. If you wish to start a new discussion or revive an old one, please do so on the current talk page.

Lead section again

Again, I'm not a big fan of italics, but it's good enough so I won't quibble. Ideogram 01:46, 22 June 2006 (UTC)

FAC

I feel this article is in good shape now and all outstanding issues have been resolved. The only major obstacle to FAC status now is the paucity of citations. Ideogram 02:09, 22 June 2006 (UTC)

Actually the lead could use some work. Needs to adequately summarize the article. Ideogram 02:12, 22 June 2006 (UTC)

I think it needs considerable work. Flaws/needs in the article:

Citations (agreeing with Ideogram)
Illustrations (I will start a talk section below for illustration suggestions)
Section length: syntax and type systems seem out of proportion to the rest of the article
Some intro text at the beginning of the "Elements" section, before "Syntax" subsection, giving an overview of that section.
"Instruction and control flow" section feels underdeveloped and perfunctory. I would much rather see this become part of some more general notion of execution model ("dynamic semantics"), though I'm not sure how to fold this into the article yet.

I agree this section is too short. Unfortunately I don't know what "dynamic semantics" means. I don't know what else you want to say here. Ideogram 19:50, 22 June 2006 (UTC)

It's not clear to me what that section is even trying to say. It seems to conflate several ideas: the notion of "keywords" and program structure (which is basically syntax), and operations that might be performed. I think it might be useful to have something that discusses the basic elements of a language (sequence, selection, iteration/recursion, procedural/functional abstraction, etc.). But I don't think the current version of the section does that. --Allan McInnes (talk) 20:15, 22 June 2006 (UTC)

Prose needs general improvement: better "flow" from section to section. Reading through the article from beginning to end feels a bit choppy to me.

My personal recommendation would be for most of the regular editors to take a week or so break from the article and come back to it with fresh eyes. The flaws will be more apparent. I don't see any hurry to get this article to featured status. k.lee 19:43, 22 June 2006 (UTC)

After the FAC nom process is over we can list it for peer review. Ideogram 19:46, 22 June 2006 (UTC)

Bullets?

I saw K.lee just re-bulletized the languages discussed in the History/Refinements section. I took out bullets because... well, I find them yucky. Maybe that's not a precise word, but it sort of suggests "Powerpoint" to me rather than "Professional article"; or more specifically, it's not quite the sort of logical enumeration that I might actually want in bullets (there is something enumerated about it, I confess, but it still somehow feels like an arbitrary enumeration).

However, K.lee is an excellent contributor, and I'm certainly not just going to change back. Do other editors have a preference between bullets and paragraph text? LotLE×talk 19:03, 22 June 2006 (UTC)

I have a slight preference for the paragraph-text version. However, I think that version would need a little work, since it did have an enumerated feel to it even without the bullets. That said, I'm happy enough with the bulleted version. --Allan McInnes (talk) 19:26, 22 June 2006 (UTC)

I don't care one way or another about bullets vs. paragraph text, but nested bullets -- yuck. Ideogram 19:27, 22 June 2006 (UTC)

Bullets are easier to read than paragraphs for list-structured data. They are easier to parse and to skim. The Wikipedia style guidelines used to say explicitly to use bullet lists and other structured markup in preference to long blocks of text, but I can't find it (the style guidelines have grown a lot bigger since I last read them). I like a good paragraph of prose as much as anybody, but in general people do not read long blocks of text on the web and (for various reasons) it is really hard to produce a really good and enduring prose paragraph on Wikipedia anyway. Hence, bullets. k.lee 19:32, 22 June 2006 (UTC)

Oh well, I still disprefer the bullets... but there seems to be a slight tilt in their favor. Actually, reading through the whole article again with this in mind, I realized how much I'd rather remove most of the bulleted lists. I only actually did it that one place because I condensed the History section a bunch (since there's a more expansive child). Then again, I wrote a long book on programming that consists of basically headings, paragraphs, and code samples. 150K words of eschewing bullets, cutesy graphics, decorations, font changes, etc... of course, Knuth is up to something like a million such words worth of dry paragraphs in TaoCP :-). LotLE×talk 19:48, 22 June 2006 (UTC)

I certainly understand your distaste. Nevertheless, web readers are not TAoCP readers, the web is (currently) a medium typographically inferior to print, and we are not Knuth. Best to be realistic. k.lee 19:58, 22 June 2006 (UTC)

In Wikipedia:Peer review/Scheme programming language/archive1 someone commented that most lists should be in prose form. Ideogram 19:54, 22 June 2006 (UTC)

The exact comment there was "This article may be a bit list-weighty; in other words, some of the lists should be converted to prose (paragraph form)." If the article begins to feel like a collection of lists rather than a coherent article, then OK, we can revisit which lists should be converted into prose. The commenter did not say that most lists should be in prose form, nor did (s)he reference any Wikipedia policy saying that most lists should be in prose form. Judicious use of bullets to reflect list structure is OK. k.lee 20:01, 22 June 2006 (UTC)

203.135.12.118 16:24, 17 December 2006 (UTC)== Writing C in C++ ==

I took out my comment in a footnote about many Sourceforge C++ projects really being minimal C supersets. Nothing really hangs on it... but it is sort of obviously true. I mean, the below isn't technically a C program, but C++ (at least prior to C99):

#include <stdio.h>
int main(void) {  
  printf("hello, world\n");  // The output
  return 0; }

A lot of compilers just gloss over the little superset issues, so programmers need not think about niggles. LotLE×talk 19:39, 22 June 2006 (UTC)

Illustrations ideas

OK, here's the talk section to discuss illustrations. In my experience PL people prefer writing code to drawing pictures, but some pictures could give this article a lot more visual appeal, so... ideas? k.lee 19:50, 22 June 2006 (UTC) —Preceding unsigned comment added by K.lee (talk • contribs)

That would be helpful. What about a state-diagram? That's pretty much only for imperative programming, but it's easy to understand. Or photos of important people in PL history? LotLE×talk 19:58, 22 June 2006 (UTC)

A few possibilities:

An image of syntax-highlighted source code
A graphical parse tree
A (simple) language evolution graph depicting the different languages mentioned in the history section, and their relationships

--Allan McInnes (talk) 20:08, 22 June 2006 (UTC)

Something like this? <http://www.oreilly.com/news/graphics/prog_lang_poster.pdf> :-) LotLE×talk

There was a reason I inserted the parenthetical "simple", and mentioned restricting the graph to "languages mentioned in the history section" :-) --Allan McInnes (talk) 20:30, 22 June 2006 (UTC)

Re: the parse tree idea, I think it would be nice, under the syntax section, to show a 3-part figure with (1) an array of characters, (2) the corresponding token stream, and (3) the AST. LISP would probably be best here, as it has the simplest grammar. And it would be especially nice if the parse that we illustrate were a substring of a larger syntax-highlighted code image we show elsewhere. k.lee 20:17, 22 June 2006 (UTC)

The drawback of LISP would be that there's not a lot of syntax to highlight ;-) --Allan McInnes (talk) 20:30, 22 June 2006 (UTC)

What about something easy for nonprogrammers like a BASIC language? Jaxad0127 20:33, 22 June 2006 (UTC)

For reader simplicity, I'd vote Python over Basic... but then, I'm vested :-). LotLE×talk 20:37, 22 June 2006 (UTC)

True, but BASIC is more verbose than Python. Jaxad0127 20:41, 22 June 2006 (UTC)

What about XSLT? It's regular enough that it doesn't need a super-complex parser (we can ignore all the edges of XML for the example, e.g. whitespace normalization rules). But it's special enough to have meaningful syntax to highlight. Of course, as PLs go, it's a heck of a lot less readable than are Algol-family languages. LotLE×talk 20:35, 22 June 2006 (UTC)

I think I'd prefer Python. It's likely to be much less confusing to lay readers than something like XSLT. --Allan McInnes (talk) 21:38, 22 June 2006 (UTC)

Here's a quick example that I just whipped up, using Python's parser module to produce an AST for a simple function (hooray for REPLs). This could easily be turned into a diagram of the sort that k.lee suggested:

 import parser
 ast = parser.suite("def add5(x): return x+5")
 listast = parser.ast2list(ast)
 ... elided function def to map integer AST codes to symbol names ...
 ['file_input', 
   ['stmt', 
     ['compound_stmt', 
       ['funcdef', 
         ['NAME', 'def'], 
         ['NAME', 'add5'], 
         ['parameters', 
           ['LPAR', '('], 
           ['varargslist', 
             ['fpdef', 
             ['NAME', 'x']]], 
           ['RPAR', ')']], 
         ['COLON', ':'], 
         ['suite', 
           ['simple_stmt', 
             ['small_stmt', 
               ['flow_stmt', 
                 ['return_stmt', 
                   ['NAME', 'return'], 
                   ['testlist', ['test', ['and_test', ['not_test', ['comparison', ['expr', ['xor_expr', ['and_expr', ['shift_expr', 
                     ['arith_expr', 
                        ['term', ['factor', ['power', ['atom', ['NAME', 'x']]]]], 
                        ['PLUS', '+'], 
                        ['term', ['factor', ['power', ['atom', ['NUMBER', '5']]]]]]]]]]]]]]]]]], 
     ['NEWLINE', '']]]]]], 
   ['ENDMARKER', '']]

Obviously, the parse tree might need to be simplified some for a graphical depiction (for example, the whole nested "testlist" could probably be reduced to just the relevant arithmetic expression tokens). --Allan McInnes (talk) 22:16, 22 June 2006 (UTC)

As for tokenization:

>>> open('add5.py','w').write('def add5(x): return x+5')
>>> d = open('add5.py')
>>> tokenize.tokenize(d.readline)
1,0-1,3:        NAME    'def'
1,4-1,8:        NAME    'add5'
1,8-1,9:        OP      '('
1,9-1,10:       NAME    'x'
1,10-1,11:      OP      ')'
1,11-1,12:      OP      ':'
1,13-1,19:      NAME    'return'
1,20-1,21:      NAME    'x'
1,21-1,22:      OP      '+'
1,22-1,23:      NUMBER  '5'
2,0-2,0:        ENDMARKER

If other editors are happy with the example and language, I can work up graphs and syntax highlights later tonight. LotLE×talk 22:28, 22 June 2006 (UTC)

Parse tree image

I worked up a parse tree image of Allan's code. Actually, I wrote a utility to generate a DOT language representation of an arbitrary Python program. But the image I put above is actually a slight manual simplificaton of the actual parse tree. Does the image seem usable? I can tweak options like fonts and layout style. For example, there's a wild looking "energy minimized" layout; but the top-down hierarchical seems more usual.

Later: Since there was so much whitespace in the parse tree image, I stuck syntax highlighted code in the middle of it, including highlighting the part that's actually parsed. As a little self-referential joke, the code I put around it was the actual utility used to generate the parse tree.

Anyway, the code to generate the graphics is below. Btw. Allan: where did you get the symbol names for your parse tree. I found that symbol.sym_name had some but not all the names, so I had to improvise by copying from your output.

# mk_dot.py
import parser, symbol, sys
symbol.sym_name[0] = 'ENDMARKER'
symbol.sym_name[1] = 'NAME'
symbol.sym_name[2] = 'NUMBER'
symbol.sym_name[4] = 'NEWLINE'
symbol.sym_name[5] = 'INDENT'
symbol.sym_name[6] = 'END'
symbol.sym_name[7] = 'LPAR'
symbol.sym_name[8] = 'RPAR'
symbol.sym_name[11] = 'COLON'
symbol.sym_name[14] = 'PLUS'
N = 0
def getNodename():
    global N
    N += 1
    return 'node%s' % N
def dotwrite(ast):
    nodename = getNodename()    
    label= symbol.sym_name.get(int(ast[0]), ast[0])
    print '    %s [label="%s' % (nodename, label),
    if isinstance(ast[1], str):
        if ast[1].strip():
            print '= %s"];' % ast[1]
        else:
            print '"]'
    else:
        print '"];'
        children = []
        for n, child in enumerate(ast[1:]):
            children.append(dotwrite(child))
        print '    %s -> {' % nodename,
        for name in children:
            print '%s' % name,
        print '};'
    return nodename 
if __name__=='__main__':
    ast = parser.suite(sys.stdin.read())
    l_ast = parser.ast2list(ast)
    print 'digraph L0 {'
    print '    size = "8,8";'
    print '    ordering=out;'
    print '    node [shape = box];'
    dotwrite(l_ast)
    print '}'

The tokenization is simple enough I'd just do it by had. Well, it wouldn't be tough to write a similar DOT output utility. LotLE×talk 01:21, 24 June 2006 (UTC)

The other symbols come from the tok_name dictionary in the token module. Sorry should have made that clear before. --Allan McInnes (talk) 19:18, 25 June 2006 (UTC)

Practice section good

The new practice section and placing the specification and implementation subsections there is very good. Ideogram 02:58, 23 June 2006 (UTC)

Natural language processing

I'm skeptical about this addition. At the least, it needs to be sourced. There's a big gap between understanding natural language, and writing programs in it. My own feeling is that no matter how good NLP parsers might get in the future, ordinary speech simply does not have the built-in disambiguation or precise concepts to express most algorithms. Even if a machine could pass the Turing test, that doesn't mean you could ask it to carry out an exact program without resorting to describing it in a special-purpose artificial language.

Much is the same with mathematics, which is thousands of years older than programming languages are. Math has developed specialized vocubularies and symbologies to discuss the areas of concern to mathematicians. Even when mathematics is discussed vocally, mathematicians use special words outside of "normal" natural language, and often combine those words in ways that do not precisely follow the grammatical conventions of natural language. LotLE×talk 20:20, 23 June 2006 (UTC)

Most every technical discipline has terms and grammatical forms that specialize the meaning of the corresponding natural language constructs in order to improve precision of expression. Outside of math, "legalese" is an obvious example of this kind of thing (and yet it's still subject to interpretation). --Allan McInnes (talk) 20:31, 23 June 2006 (UTC)

True. Legal language, however, is mostly about special words or special senses of words, and not so much about changes in grammar. Of course, legal documents might seem like somewhat contorted grammar compared to many documents: but basically you can parse the noun clauses, predicates, paratactic modifiers, etc. in the same way you would ordinary speech. Math is "more special" in this regard. LotLE×talk 20:36, 23 June 2006 (UTC)

Why is math so special? You can just as easily parse a mathematical equation. And describe math in natural language, just like legalesse or any other specialized grammer. Jaxad0127 20:51, 23 June 2006 (UTC)

Really?! Try saying this in English:

\Sigma ={\begin{bmatrix}{\begin{matrix}0&\lambda _{1}\\-\lambda _{1}&0\end{matrix}}&0&\cdots &0\\0&{\begin{matrix}0&\lambda _{2}\\-\lambda _{2}&0\end{matrix}}&&0\\\vdots &&\ddots &\vdots \\0&0&\cdots &{\begin{matrix}0&\lambda _{r}\\-\lambda _{r}&0\end{matrix}}\end{bmatrix}}

To me it sounds a lot less like natural language than does anything I ever read in a contract or court opinion. And that's not really the most "obscure" or technical thing I could come up with, just something I semi-randomly grabbed from another article. YMMV. LotLE×talk 21:00, 23 June 2006 (UTC)

The r-fold direct sum of the matrix representation of the complex numbers λ_ki? Dysprosia 07:55, 26 June 2006 (UTC)

Oh, I wasn't disputing your point. In fact I completely agree with you. The fact that things like legalese exist is evidence that pure natural language is insufficiently precise for saying anything that needs to avoid ambiguity. I have no doubt that its possible to develop programming languages that "feel" more like natural language, but I believe that those languages will inevitably still have specialized terms and grammatical forms. --Allan McInnes (talk) 21:01, 23 June 2006 (UTC)

My intent in adding the sentence was very limited, and I expressed it badly, giving the impression of a stronger claim than I intended, which is why I didn't expect any objection. I apologize. The limitation that computers cannot execute instructions in natural language is not known to be essential. In the past, some researchers naively expected to be able to program computers to understand natural langauage, thereby removing specialized computer languages to a very specialized and behind-the-scenes roll. COBOL was actually touted as a first step in this regard. I don't expect to ever see a computer that can carry out nearly-arbitrary instructions in standard English, but I've been wrong before. I'll see if I can find an appropriately-cited remark to insert. In the meantime, I prefer the article without the ugly {{fact}} template, and so approve of deleting the sentence.

But, as to Lulu's point, consider a machine that can pass the Turing test. By hypothesis, if we pass a response to both it and to a human, we must not be able to reliably distingush the two. We pass a set of instructions for an action within the terms of the test: if the natural response of a qualified human would be to carry them out, and the machine fails to do so, we can distinguish, in violation of hypothesis. If the machine carries them out, we have eliminated the need for a specialized language, except insofar as it is needed to construct the machine.

But I don't believe that I could communicate most algorithms to Robert A West without using a specialized artificial language. Not with specificity. And that despite the fact that I tend to believe that Robert is made of cells and bones rather than of transisters and chassis. LotLE×talk 21:46, 23 June 2006 (UTC)

Now we get to the part that I didn't mention, because I felt a citation would be needed, there is a problem (considered in science fiction, see Asimov and Niven for examples) when the instructions are ambiguous, as they often are in the real world. By the fact that it can pass the Turing Test, the machine should make the same types of mistakes, with the same probability, as a human possessing similar information about context. If we want the machine to make no mistakes, we must use a specialized language and make no mistakes ourselves. If we are content to permit mistakes, an AI will do.

All this must have been said by someone qualified to say it (or do you consider Asimov and Niven qualfied?) -- and a nod, as Lulu says below, would seem worthwhile. I won't have time this weekend -- real life intrudes -- but I'll see what I can do. Robert A.West (Talk) 21:14, 23 June 2006 (UTC)

As for mathematics, it was expressed in natural language for centuries. Remember back to reading Apollonius of Perga or Nicomachus of Gerasa. It makes my head hurt, but it can be done. Robert A.West (Talk) 21:20, 23 June 2006 (UTC)

Oh, damn! The whole thing was in five paragraphs down! Now I feel like an idiot. Robert A.West (Talk) 21:28, 23 June 2006 (UTC)

I like Gazpacho's rephrasing of the concept better. Not just because it's presented mainly in the negative, but the expression seems clearer too. And s/he added some nice footnotes even, which enhances the article. LotLE×talk 21:42, 23 June 2006 (UTC)

No argument. Robert A.West (Talk) 22:11, 23 June 2006 (UTC)

Literate programming

Somewhat along the line of the NLP thing, I wonder if we should have a bit of a nod to literate programming here. I think the idea that programming languages are, in fact, as much about communicating concepts among people as actual instructions to machines, is worth noting. It's not entirely excluded from the existing text, but I think it's underplayed. I'm not sure the literate programming itself is the main point: many languages emphasize readability, and good programs that aren't "LP" as such, still have clear and extensive comments. Something like Python's doctest is interesting as what I find a better fulfillment of the original goals of LP. But Java's Javadoc has some of this idea as well... or even Perl's POD. Actually, there's "Literate Haskell" too, which is a funny creature. LotLE×talk 20:28, 23 June 2006 (UTC)

picture

It looks good and is informative, but it's so ... big. Is this the simplest example you could come up with? I think the size of it might intimidate a layman.

Maybe a step-by-step explanation, since it does pack a lot of information in. Ideogram 03:18, 24 June 2006 (UTC)

It's a pretty darn simple chunk of code (look at the highlighted bit in the inset code). The BNF just has a lot of rules, so winds up "over parsing" (the quite small utility mk_dot.py winds up with something like a thousand nodes in the tree!... that's not my doing, it's what the Python parser does). We could perhaps prune it a bit more—I actually simplified it already: I can't see any harm to the concept if we pretend the parse tree is simpler than it actually is, as long as it is a possible BNF grammar (of a simpler language). Try it out yourself, see if something seems better. You'll need a Python interpreter; the utility I provide in full; and a DOT renderer like Graphviz.

Oh, small note: I just looked at the graphic on a different computer. On my MacBook Pro, the color highlight around the 'add5()' function is almost too much contrast. But on my desktop machine, I can't even see the highlight at all (I can if I manually increase gamma on the image). Weird. In any case, the parse tree is of the one-line add5() function. LotLE×talk 03:44, 24 June 2006 (UTC)

Uhm, I've never used Python or Graphviz, and I'm not really motivated to start right now. I just want to throw peanut shells at you :-). Ideogram 03:56, 24 June 2006 (UTC)

Both are really cool though. Not that you'd have to do anything much special though: the little utility I give above just take Python code on STDIN, and outputs DOT graphics on STDOUT. It's already programmed. You do need a DOT renderer; but at least the OSX Graphviz version I use is a regular GUI app: just open a .dot file, and it renders. You can select other options using a property sheet (such as what logic to use in arranging nodes).

The really cool thing about DOT is that it's an ultra-high level description of a graph (that's also simple text that's easy to generate programmatically). Rather than needing to give some exact coordinates and sizes of everything, you just say what's connected to what, and the engine figures out how to render it (with various options to tell it to use different techniques. E.g. http://www.graphviz.org/Gallery/undirected/process.html:

graph G {
   run -- intr;
   intr -- runbl;
   runbl -- run;
   run -- kernel;
   kernel -- zombie;
   kernel -- sleep;
   kernel -- runmem;
   sleep -- swap;
   swap -- runswap;
   runswap -- new;
   runswap -- runmem;
   new -- runmem;
   sleep -- runmem;
}

It definitely appeals to the programmer in me. LotLE×talk 04:17, 24 June 2006 (UTC)

This could be very useful for a visual programming tool. Ideogram 04:21, 24 June 2006 (UTC)

Actually, if anyone gets the crazy idea of actually trying to map the interrelationships of the PLs mentioned in the article (as Allan suggested as a possible picture), the example here could be a good starting point: http://www.graphviz.org/Gallery/directed/crazy.html (or more plainly: http://www.graphviz.org/Gallery/directed/unix.html). That's not of PLs, but Unix versions, but the idea and graph type are similar. Moreover, being a textual description of the relations, it's almost laughably easy to add or remove languages, or change relationships. The source text of the graph can be posted for discussion and joint editing (someone needs to render it, but that's simple). LotLE×talk 04:26, 24 June 2006 (UTC)

Thanks for your hard work. My high-level comments on the picture are as follows:

It depicts the Concrete Syntax Tree (CST), not the Abstract Syntax Tree (AST) with nodes for intermediate productions removed. Personally I think the AST is more likely to convey the point to a lay reader.
It depicts not only the syntax tree of the add5 function, but also the whole parser.

Either or both of these might be appropriate in the parser article, but IMO it's too much detail for this article. My recommendations would be to make the figure depict just the add5 function and the resulting AST. (Also, while I'm wishing, the token stream would be nice too.) k.lee 02:44, 25 June 2006 (UTC)

I'm afraid I actually don't understand what you mean. FWIW, what I provide actually is simplified from the actual parser output. Are you saying you'd like all the intermediate productions removed, but the rest kept (with the same names, etc.). I imagine that's doable, but I'd like to be clear on the goal. I agree it's sort of shockingly much detail for such a simple function (I was surprised myself, having not really played with the Python parser before): when I write parsers, I tend to get fancy with the regexen, and have fewer productions as a result (which isn't necessarily free of its own drawbacks).

Does anyone know if there's a way to generate an AST for a Python program? The tokenization is much simpler. I give it above, and we could inset that in the same graphic, I imagine. Of course, all of this is niggly work. DOT is great, but even given that, you have to make some decisions on export formats, and choose sizes, figure out how to preserve anti-aliasing, massage the graphic a bit in a raster editor, etc. I'm willing to go through some steps to make something different, but I'd rather avoid doing it a lot of times, given the need for manual effort (my script automates some stuff, but it's a first brush). LotLE×talk 03:04, 25 June 2006 (UTC)

Er... the tree in question is for just the add5 function. And it is the AST (at least the Python docs refer to the output of the parser module as an AST). I agree that it could stand some simplification (although LotLE has apparently done some). But that will require producing a diagram that isn't "real". Perhaps a language other than Python would be a better choice? --Allan McInnes (talk) 19:28, 25 June 2006 (UTC)

I don't see any harm in using a "not real" parse tree for the illustration. Python doesn't guarantee the exact same BNF productions between versions (I don't know, for example, if IronPython or Jython are compatible at this level, even though they should run the same example program fine). But the distinction between AST and CST seems a bit fuzzy to me; and indeed, the parser module describes the complicated version as an AST.

I did start working on a utility to remove all the single chains of productions from the DOT files. That is, if a production node points to exactly one other node, remove it and "promote" the target. However, the thing I did last night produces a weirdly random graph: I have some logic bug to resolve in my pruning algorithm. But I should be able to produce a "minimal production" version of any given parse pretty soon. LotLE×talk 19:43, 25 June 2006 (UTC)

Sounds good to me (although it might be worth noting somewhere that the AST is for illustration purposes only, rather than a "real" Python AST). Thanks for all of your work on this! --Allan McInnes (talk) 20:06, 25 June 2006 (UTC)

Simplified

Tokenization of add5() function

Parse tree with pruned intermediate productions of add5()

File:Add5.png

Raw CST of add5() function

Sorry, I wasn't clear. First, the inset depicts the code to invoke the Python parser. This isn't necessary. The point is to illustrate a piece of code and its syntax tree, not the Python parser API. The inset should just have add5. Second, the Python docs may call the result an AST, but it's much more like a CST in the Dragon Book sense (although admittedly the distinction between a CST and an AST does not have a precise technical definition). To make the tree more AST-like I would:

I added the extra code inset to illustrate syntax highlighting, as was suggested above. The fact the code hightlighted happens to be the same as used to generate the parse tree is just one of those recursive jokes. However, the one-line suite of 'add5()' seemed too short to illustrate syntax highlighting. If there's some other bit of "famous" or "notable" source code, we could highlight that, I just like self-reference.

Anyway, here's two more things for discussion. The one is the actual complete parse tree as generated by Python's parser module (more CST-like, I recognize). The second is the minimized version that I generate by automatically removing single-linked intermediate productions. It still includes some terminals like parens that you may consider superfluous, but I like the idea of having a moderately robust looking tree. LotLE×talk 23:44, 25 June 2006 (UTC)

I think this minimized version is good enough. Let's put it in. Ideogram 23:59, 25 June 2006 (UTC)

Agree. However, I think that we should demonstrate syntax highlighting in a separate figure somewhere higher up in the article. It's fine (actually, sort of cool) if the syntax highlighted code example is also the code that generates the parse tree shown in the later figure. I just don't want this one figure to be too crowded. k.lee 22:22, 26 June 2006 (UTC)

remove at least the following intermediate productions: compound_stmt, stmt, simple_stmt, small_stmt, flow_stmt, varargslist, fpdef, testlist, term, factor, power, and atom.
remove the following terminals: NAME = def, NEWLINE, INDENT, ENDMARKER, COLON = :, LPAR = (, RPAR = ), END, and NAME = return.

Basically, if I were building the figure, I'd make the parse tree look something like this:

file -- function --+-- name=add5
                   +-- params -- formal_name=x
                   +-- body -- return_stmt -- binary_op --+-- op_name=PLUS
                                                          +-- variable=x
                                                          +-- numeric_literal=5

This communicates roughly the same point, but with less complexity. If you were explaining ASTs in an introductory PL course, this is roughly what you'd show students.

Lastly, I agree that it doesn't matter whether the parse tree is the one that some particular Python parser implementation generates. As long as the result corresponds to the tree that some sensible parser would generate, it seems be OK. There is no such thing as the "real" Python AST, since the language definition does not specify how implementations may represent the program internally (nor should it).

Anyway, these are just my suggestions. I don't have time to build an alternate figure, so do as you will. k.lee 20:11, 25 June 2006 (UTC)

I meant "real" in the sense that the AST (or CST) is actually something generated by an existing Python interpreter or compiler. I'm fine with using a a notional AST of some kind. I just think it needs to be clear that the AST in question is notional, and created specifically for the article: the reader will not be able to recreate the AST in a Python interpreter. --Allan McInnes (talk) 21:49, 25 June 2006 (UTC)

It seems I spoke too soon. Although I haven't played with it, it seems likely that this module might provide a way to get a Python AST that is more abstract than the so-called ASTs produced by the parser module. --Allan McInnes (talk) 22:00, 25 June 2006 (UTC)