Talk:Moby Project

From Wikipedia, the free encyclopedia

Untitled

There are some other symbols in the pronunciations, for instance: "person 'p/[@]/rs/@/n". What do the additional brackets [] mean? It should be mentioned. -- Preceding unsigned comment added by 83.26.122.91 (talk) 16:04, 27 May 2012 (UTC)

Hyphenator

It should be mentioned that while the Hyphenator appears to separate syllables, there are counterexamples to that theory: Aeron, Alias, Ascetic, etc. Lordcheeto (talk) 18:00, 9 July 2015 (UTC)

Inclusion of "st" row in pronunciation table

An anonymous edit from 213.78.70.53 (which has no DNS entry) added a row to the pronunciation symbol to English IPA table that ASCII "st" maps to English IPA "st".

I'm removing this row on the following grounds:

  • The mapping is consistent with the component mappings s→s and t→t.
  • The Help:IPA for English page doesn't have a separate entry for ‘st’ (even though it does give some double-symbol entries, such as for affricates /tʃ/ and /dʒ/).
  • More generally, I can't think of a reason to consider st a sound distinct from its components (or at least no more so than /sk/, /sp/, /sm/, ...).

Was this by chance intended to be a row for /ts/? I can see how that could benefit from its own row, much as for the affricates mentioned above (even if I can only think of one English lemma containing this sound, namely ‘tsar’). I wouldn't object to adding such a row, even though I choose not to add it myself.

Otherwise, what is the reason for adding an st row? Pjrm (talk) 08:33, 1 December 2015 (UTC)

Article rewritten to remove creator's hype

The Wikipedia notes say that the article was written largely by one person. It's obvious that the person is the creator of the project, Grady Ward, for the purpose of promoting the project, and the article includes many dubious claims about the extent and quality of the project:

"it contains the largest free phonetic database, with 177,267 words and corresponding pronunciations." -- Aside from there being no substantiation for this claim, this doesn't meet the usual understanding of what a database is. This project is just a collection of lists of words and related symbols in plain ASCII text format.

The number of words in the lists is greatly inflated by the inclusion of non-words (&c, 'd, 1080, 2, aa, aargh, aarrgh, aarrghh, ab, crevice's), names, phrases (a phrase is not a "word" but a combination of other words already counted), foreign words not in common use in English, and words made up by adding the same prefix to dozens and sometimes hundreds of other words, such as "self-whole". This means that the chart showing the number of "words" in each file is incorrect by any normal meaning of "word".

There are no References provided, and of the 4 "External links", the 1st is his own web page, the 2nd is just a page for downloading his files, the 3rd is an article about writing a program to extract data from one of the files, and the 4th is a dead link. -- Preceding unsigned comment added by 47.214.177.17 (talk) 18:28, 2 September 2016 (UTC)

I've undone this change. The criticism introduced by this edit was unsourced and possibly original research. I've added a {{cn}} tag to the claim of being the largest free phonetic database, which is really the only "hype" to be found in the article. Any claims regarding artificial inflation of the database with non-words should be backed up by reliable sources. clpo13(talk) 18:41, 12 September 2016 (UTC)
I also added a tag to the claim of Words II being the largest wordlist in the world. It's sourced to a list of resources, but the text at the source sounds almost like it was written by Grady and not an independent reviewer. clpo13(talk) 18:44, 12 September 2016 (UTC)

Observations on pronunciation database

I've just spent the last few days making a revised version of the pronunciation database, including many corrections, which is available on GitHub at [1]. In the process I learned a few things which might be added to this page:

1. In response to the question from 83.26.122.91, the sequence "/[@]/" followed by "r" denotes IPA /ɛə/ or /ɜː/ as in "air", "square", etc. The legend says that this sound is encoded by "/@r/", but in fact that sequence is never used in the database.

2. The slash character "/" is not used to separate phonemes. There is no separator; each phoneme is encoded by a unique sequence of characters, some of which start and end with "/", for example "/A/". If two such phonemes follow each other, then a double slash results, e.g. "ding" is pronounced "d/I//N/". In some cases, the same character with or without slashes is used to denote two different phonemes, e.g. "/A/" denotes "a" in "far" (IPA /ɑː/), while "A" without slashes denotes IPA /a/ as used in French and other languages. Because an unslashed "A" can sit between two slashed phonemes, you cannot find "/A/" phonemes just by doing a text search for "/A/": you have to parse the whole sequence correctly. [I have written such a parser in Prolog, which I will release at some point. It enabled me to find many errors in the encodings.]

3. Generally, "/O/" is also used to signify /ɔ/ as found in both "dog" (lot-cloth split) and "caught", "north" (no cot-caught merger). However, in a number of cases "/oU/r" is used to encode the same sound when followed by an "r", e.g. in "score", "port", even though the main and documented use of "/oU/" is to encode /oʊ/ ("boat", "goat"). This looks like a clear error: all instances of "/oU/r" should be replaced with "/O/r".

4. For some reason, the sequence for /ɔɪ/ is "//Oi//" (and sometimes erroneously "O/i/") instead of the expected "/Oi/". [This is corrected to "/Oi/" in my revised version.]

5. The set of supplementary phonemes for non-English words is not described on the Wikipedia page, but in any case, the description of these in the 'readme' file included with the database is incomplete: many other sequences not mentioned in the documentation are also used, e.g. "e", "o", "i", "V", "/z/", "c". Some of these (definitely the "c") are errors and should be encoded differently, but others are, in my judgement, legitimate uses of non-English phonemes, such as /o/ or /i/ as used in French, Spanish, and other languages. There is also some inconsistency in how these sequences are used; they sometimes denote different phonemes in different words, for example "V" seems to cover both Spanish /β/ and Dutch /ʋ/. The "/z/" seems a bit dubious: it seems to represent the /ts/ sound used for "z" in German or Italian. In any case, the phonetic encoding of the non-English words is quite inconsistent and error-prone.
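
For illustration, the parsing described in point 2 and the replacements proposed in points 3 and 4 could be sketched in Python roughly as follows. This is not the Prolog parser mentioned above, only a minimal sketch under some assumptions: every slashed phoneme runs from one "/" to the next "/", every other character (including the stress mark "'") is a token on its own, and "//Oi//" is the only sequence with doubled slashes. The function names `tokenize` and `normalize` and the sample strings "/I/A/N/" and "sk/oU/r" are made up for the example and are not taken from the database.

```python
def tokenize(encoded):
    """Split an encoded pronunciation into phoneme tokens (sketch, see assumptions)."""
    tokens = []
    i = 0
    while i < len(encoded):
        if encoded.startswith("//Oi//", i):
            # Assumed special case: the doubled-slash form from point 4.
            tokens.append("//Oi//")
            i += 6
        elif encoded[i] == "/":
            # A slashed phoneme runs up to and including the next slash.
            end = encoded.find("/", i + 1)
            if end == -1:
                raise ValueError("unterminated slashed phoneme in %r" % encoded)
            tokens.append(encoded[i:end + 1])
            i = end + 1
        else:
            # Anything else is treated as a one-character token.
            tokens.append(encoded[i])
            i += 1
    return tokens


def normalize(tokens):
    """Apply the replacements proposed in points 3 and 4 at token level."""
    fixed = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        if tok == "//Oi//":
            fixed.append("/Oi/")
        elif tok == "O" and nxt == "/i/":
            # Assumes every "O/i/" pair is the erroneous form of /Oi/.
            fixed.append("/Oi/")
            i += 1
        elif tok == "/oU/" and nxt == "r":
            fixed.append("/O/")
        else:
            fixed.append(tok)
        i += 1
    return fixed


# "ding" as quoted in point 2.
print(tokenize("d/I//N/"))
# Made-up string: a plain text search finds "/A/", the token list does not.
print("/A/" in "/I/A/N/", "/A/" in tokenize("/I/A/N/"))
# Made-up string showing the "/oU/r" replacement.
print("".join(normalize(tokenize("sk/oU/r"))))
```

Working on tokens rather than raw text is what avoids the false match on "/A/": the unslashed "A" and the slashes of its neighbours never get merged into one phoneme.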

In conclusion, while the pronunciation database is a great resource, it contains quite a few errors and inconsistencies and could do with a clean-up. -- Comment added by Bistronaut on 26 October 2017

archaic

The Moby Words II table entry for SINGLE.TXT says "including archaic words", while the online documentation says "This list does not exclude archaic words". I'd make the table match the list if only the word pigweabbits wasn't in SINGLE.TXT, contradicting the online documentation, as it is an archaic word. -- Jamplevia (talk) 00:22, 10 May 2022 (UTC)