Talk:Scannerless parsing

This article is within the scope of WikiProject Computer science, a collaborative effort to improve the coverage of Computer science related articles on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.Computer scienceWikipedia:WikiProject Computer scienceTemplate:WikiProject Computer scienceComputer science articles

???

This article has not yet received a rating on the project's importance scale.

Things you can help WikiProject Computer science with:

Here are some tasks awaiting attention:

Article requests :
- Requested articles/Applied arts and sciences/Computer science, computing, and Internet
Cleanup :
- Computer science articles needing attention
- Computer science articles needing expert attention
Copyedit :
- Computing
Expand :
- Computer science
Infobox :
- Computer science articles without infoboxes
Maintain :
- Timeline of computing 2020–present
Photo :
- Find pictures for the biographies of computer scientists (see List of computer scientists)
- Computing articles needing images
Stubs :
- Computer science stubs
Unreferenced :
- WikiProject Computer science/Unreferenced BLPs
Project-related :
- Tag all relevant articles in Category:Computer science and sub-categories with {{WikiProject Computer science}}

Added a disadvantages section and edited the advantages. This article is in need of being balanced, scannerless parsing is a technique which makes sense in limited circumstances, usually when the language being parsed is very simple. Remember, there is a reason the lexer/parser distinction was made in the first place.

In particular:

lexer/parser distinction not neccesary: actually, yes it is, depending on what your needs are. as I said above, there was a very good reason it was developed, combining the 2 functions (scanning/parsing) with more complex languages becomes messy, harder to understand, develop, and maintain. Moved this into the introduction, changed semantics to explain when this technique is appropriate (as opposed to implying it's a universal truth)

no keywordification: keywords are often included as a feature, and having a seperate lexer and parser doesn't mean you have to have keywords; scannerless parsing can do without them simply because it has less of the design constraints that make keywords attractive to implement in the first place. Also, many people would rightly consider keywords a feature, and not a requirement; go look up the early fortran days to get an inkling why. As such, moved this info to the token classification not required advantage.

12.165.250.13 (talk) 20:54, 28 May 2008 (UTC)[reply]

I've been observing this article for a while, and I've been dismayed at how poor the article still is. It contains a number of factual mistakes and does not really explain anything. I'm reluctant to improve the article myself though, because I'm one of the researchers publishing on the merits of scannerless parsing. I have a few problems with the article:

There is no decent explanation of the scanner/parsing process. This article should explain how in a traditional scanner/parser division a scanner splits up a character stream into tokens, and how the parser consumes the tokens.

The article does not give any actual examples of cases where scannerless parsing is useful. The current list of applications is not correct. In fact, scannerless parsing is mostly useful for languages with a complex, context-sensitive lexical syntax. Typically, these are languages that involve a mixture of different sublanguages. We've published a series of papers on this: "Concrete Syntax for Objects" (OOPSLA'04) and "Declarative, Formal, and Extensible Syntax Definition for AspectJ - A Case for Scannerless Generalized-LR Parsing". The second paper in particular illustrates how the traditional scanner/parser separation breaks down on languages with a complex context-sensitive lexical syntax.

The 'required extensions' section is largely focussed on language extensions in SDF/SGLR. Some of these extensions are not related to scannerless parsing at all. In particular: preference attributes (more an aspect of GLR) and per-production transitions (related to the priorities mechanism, which is unrelated to scannerless parsing).

Martin Bravenboer (talk) 15:11, 30 May 2008 (UTC)[reply]

I also think this article contains many errors. For example the advantage 'Grammars are compositional' is not related to Scannerless parsers: it is theoretically possible to have an LL(1)-parser that is scannerless, and LL(1) is not closed under composition. The third page of this paper describes it: http://www.springerlink.com/content/xugat38tyrxvtm9w/. So compositional grammars seem to be a feature of generalized parsers, instead of scannerless. It is related, because most scannerless parsers are generalized.

Mbvlist (talk) 10:37, 10 July 2009 (UTC)[reply]

Being scannerless is a necessary but not sufficient condition for compositionality. AshtonBenson (talk) 22:44, 11 July 2009 (UTC)[reply]

Seems to make sense. Unfortunately I don't understand it good enough to edit the footnote to make real sense, can you do that? Mbvlist (talk) 10:10, 14 July 2009 (UTC)[reply]

This page needs to back up its assertions with citations[edit]

I was quite gratified to discover this page, which broaches a topic I'd wondered about for ages, but never saw discussed. However, I regret that I was chagrined to discover that while it is replete with admirably many links to examples of the concept's use, it contains many unsupported general assertions yet cites at best only one external discussion of the theory. (And Visser's report does not pretend to cover the whole domain - just one solution).

I personally consider some of the unsupported assertions to be plausible. However even the bald assertions which I haven't singled out deserve to be backed up with cites.

For all I know, someone who well understands the topic might find many justifiable cites in Visser's own references. AHMartin (talk) 19:23, 28 December 2016 (UTC)[reply]

"Required extensions" section, Eelco Visser, "Disambiguation Filters"[edit]

Removing the section because it seems quite nonsensical. Problems with the section and topic:

The section seems to rely on an unreferenced paper.^[1]
Visser is only named as the fourth author of the paper, so it seems inappropriate to wholly attribute the section to him.
It is very much unclear if the topic holds enough weight to be mentioned in the article.
It is unclear how relevant the topic is to this article.
The section title seems nonsensical and unrelated to the topic. Notrium (talk) 15:56, 18 September 2020 (UTC)[reply]

References

^ van den Brand, Mark G. J.; Scheerder, Jeroen; Vinju, Jurgen J.; Visser, Eelco (2002). "Disambiguation Filters for Scannerless Generalized LR Parsers". Compiler Construction. Springer: 143–158. doi:10.1007/3-540-45937-5_12.

scannerless parsing[edit]

In the late 1960s, early 70s compiler development systems were developed whose technology is not covered here.

CWIC (Compiler for Writing and Implementing Compilers) developed at Systems Development Corporation included a top-down parser programming language.

I developed SLIC System of Languages for Implementing Compilers based partly on CWIC.

I am not sure what exactly parser type they are other then recursive decent. Does the ability to recognizing a sequance with a loop or recursion make a difference as to their type. It does make a difference as to the tree that can produced.

expr = term $(('+':ADD|'-':SUB) term!2);

<node string> creates a node and pushes it on the node stack.

!<number> pops the top node object and the <number> of parse stack entries into a list representing a <number> branch tree.

 a+b-c    => [SUB,[ADD,a,b],c]

                   SUB
                 /        \
         ADD          c 
       /        \
    a            b

Looping generates a tree bottom up left to right.

 term = factor $(('*':MPY|'/':DIV)factor!2);

 factor = (numbe | id | '(' expr ')')
                    ('^':POW factor!2|--);

A^B^C    =>

        POW
     /         \
  A            POW
               /         \
             B            C

POW,[A POW[B,C]]

We control tree construction using tail recursion or looping.

OR we could just gather parsed entries into a lidt:

 arguments = +[arg $(',' arg) | .EMPTY]+

I believe these are far easier to use than a parser generator.

How about making a list of terns:

 expr = +[term $('+' term|'-':SUB term!1):SUM!1;

 a+b-c+5 => SUM[a,b,SUB[c],5]

Token '..' formula are called to recognize tokens. Recognized tokens by default are cataloged into the symbol table.

In CWIC tokens are defined using

that reference character class ':'  formula:

bin: '0'|'1';  
oct: bin|'2'|'3'|'4'||'5'|'6'|'7';
dec: oct|"8'|'9';

CWIC developed at Systems Development Corporation.  included three languages, SYNTAX(parser programming language), GENERATOR(Tree crawling code production language), and MOL360(Machine Oriented Language for IBM360 processors). In the SYNTAX language you program a top-down  LR parser directing the parse tree's construction. Recognizing tokens with token ".." formula that reference character class formula that test for characters of the class. Examples 216.146.244.139 (talk) 05:57, 30 April 2023 (UTC)[reply]

[1] van den Brand, Mark G. J.; Scheerder, Jeroen; Vinju, Jurgen J.; Visser, Eelco (2002). "Disambiguation Filters for Scannerless Generalized LR Parsers". Compiler Construction. Springer: 143–158. doi:10.1007/3-540-45937-5_12.

[1]