Talk:Bencode

Page contents not supported in other languages.
From Wikipedia, the free encyclopedia

How does bencoding actually work?[edit]

The examples of bencoding given are much, much less helpful than they should be as they are not accompanied by an explanation of how the process operates, which is far from self-evident from the examples. It's similar to having a page entitled "foocoding" and having an example saying "the foocoded form of bar is baz, and the foocoded form of quux is xyzzy". Not very helpful without an explanation of how the clear text and cyphertext relate to each other.Thisisnotme 12:37, 12 July 2006 (UTC)[reply]

I expanded the article by describing how it works (as well as adding some explanations, a few of its properties, and making a few corrections). I retained most of the examples. 130.89.167.52 21:25, 14 July 2006 (UTC)[reply]

Efficiency[edit]

"While not particularly efficient, bencoding is simple"

Not efficient compared to what? Certainly not XML, at least. I guess only byte-encoding the string lengths would gain you anything (notwithstanding some sort of compression)? — Christopher (talk) @ 23:13, 20 July 2006 (UTC)[reply]

You're right. Wording changed, comparing bencode to a pure binary encoding. --Kwi | Talk 23:26, 23 July 2006 (UTC)[reply]

It does have the added complexity of maintaining order, so whilst decode isn't affected, encoding will be. XML on the other hand does not impose any ordering during encoding, but does require the production of extra bytes. However as Christopher implies, XML is way more inefficient to decode. --Angryjames (talk) 00:43, 8 October 2008 (UTC)[reply]

Yes: the main inefficiency in bencoding is the use of decimal instead of base-128 for numbers, like DWARF's LEB128 encoding. Aside from that, I think you'd have to switch to tokens that are shorter than a byte to get any more efficient. Using fixed-size binary numbers would, in fact, be less efficient most of the time, since as it stands most numbers only take 1 or 2 bytes, compared to 4 or 8 bytes for typical fixed lengths. (I'm sort of assuming that if bencoding had gone with base-128, the 's' would go before the length; it would also be quite likely that the tokens would just be bytes of arbitrary value, rather than being chosen from ASCII.)

So, you waste some bytes when you have a multi-digit or negative number; otherwise bencoding is about as efficient as it is possible for a byte-based format for this sort of data structure to be. —SamB (talk) 02:56, 25 January 2015 (UTC)[reply]

The discussion about "efficiency" gains/losses in encoding small integers as ASCII is, IMHO, functionally irrelevant for the purposes of a format such as BEncode. Furthermore it is the only part in which it is distinguished from a "pure binary" encoding, since all other cases use simple 1-byte sentinel markers to signal types and delimit fields.

I'm going to go ahead and simply remove the related fragment about efficiency. —Darmengo (talk) 5 April 2022 — Preceding undated comment added 14:14, 5 April 2022 (UTC)[reply]

Recursion[edit]

I've removed a fragment in the the Encoding algorithm section stating that Bencoding "is defined recursively." In actuality there is nothing recursive about Bencode's definition. And while bencoded dictionaries allow composition (a dictionary may contain another dictionary), recursion is not possible (a dictionary cannot contain itself.) — Foorider (talk) 01:00, 14 February 2008 (UTC)[reply]

I think you're splitting hairs. A bencode implementation is highly likely to implement encoding and decoding recursively. Schwatoo (talk) 01:41, 6 March 2008 (UTC)[reply]
That doesn't mean it's recursive. It's trivial to write a loop decoder using a data stack instead. 91.16.24.48 (talk) 17:52, 19 February 2011 (UTC)[reply]
Well, it is defined recursively, but neither BEP 3 nor the wiki.theory.org spec bothers to mention this, so we really don't need to either. (Though I also don't see any particular reason not to mention it.) And there's certainly no implication that it should necessarily be implemented that way; it's presumably written that way for brevity/clarity. —SamB (talk) 00:40, 25 January 2015 (UTC)[reply]

Dictionary Sorting[edit]

Page fails to mention that dictionary elements should be sorted by key. In the trivia section it mentions "there is only a single valid bencoding", without ordered dictionaries this wouldn't be possible. Schwatoo (talk) 01:37, 6 March 2008 (UTC)[reply]

Not just ordered dictionaries, surely lists must also be ordered. Since the bencoding must be unique (bijection), and comparable without decode, does this not mean that the list element must be ordered in some way? In the example l4:spami42e, is each element compared as an ASCII string, e.g. '4:spam' compared to 'i42e'? --Angryjames (talk) 00:36, 8 October 2008 (UTC)[reply]

Lists are not sorted, only dictionaries. That is the case, because dictionary objects have normally no order (certainly not in python, the programming language which was used for the first Bittorrent client and thus the first Bencoding), while list elements have a certain order. So while one dictionary could have an arbitrary number of encoded forms, if it weren't sorted for encoding, one list has always only one encoded form.85.178.146.223 (talk) 06:24, 20 January 2009 (UTC)[reply]

With "list elements have a certain order" you mean that programming languages usually treat two differently ordered lists as semantically unequal objects, but for a dictionary a programmer usually cannot get a position of an element in a dictionary. With "list elements have a certain order" you do not mean that there is a fixed ordering algorithm for list elements, right? Still Angryjames's question remains, whether bencode mandates such an algortihm to ensure bijection. --Tddt (talk) 00:13, 21 May 2012 (UTC)[reply]
Ah, wait, I already found an example why such an algorithm cannot exist for lists. For instance the "path" element in the info part in multi file mode: "4:pathl 4:path 2:to 8:file.ext e" relies on an unsorted ordering. So it'd be up to the BEPs etc. of certain list keys to ensure bijection by mandating a certain sorting algoritm if appropriate, right? But again, looking at for instance BEP-19, this does not seem to be the case: It does not mandate a lexicographical ordering of the elements in the 'url-list', which results in multiple valid encodings for the same set of webseeds - without imposing a semantical difference upon differing orderings. Ergo, contrary to what the "Features" section claims bencoding is not really bijective? --Tddt (talk) 00:59, 21 May 2012 (UTC)[reply]

Extending Bencoding[edit]

Bencoding as presented doesn't really handle some basic types. UTF-8 strings are mentioned in the artlcle. But bencode doesn't provide encodings for some very basic types: floats, bools, sets, etc. There has been some work to extend bencoding to support more types. Should this be included or linked to in the wikipedia article? See https://lists.ubuntu.com/archives/bazaar/2007q3/029292.html for more. Schwatoo (talk) 01:44, 6 March 2008 (UTC)[reply]

I don't see why not. It goes without saying that you should appropriately cite the sources you have for the extensions you know of. This page does need clarification for usage cases, as well as client specific implementation details.
If any of these extensions are used in mainstream bit-torrent clients, I would suggest integrating the details of these extensions as they are used should be a higher priority than it is now. In practice, people who want to know more about bencoding are interested in the encoding mechanism as it is applied in bit-torrent software.
NaruFGT (talk) 14:04, 9 May 2013 (UTC)[reply]

Minimum number of elements in lists and dictionaries?[edit]

What's the minimum number of elements for lists and dictionaries? Zero? One? Two? From "there is a bijection between values and their encodings" I'd guess two, but I'm not really sure. --Tddt (talk) 23:32, 20 May 2012 (UTC)[reply]

I'd say zero - an empty bencoded list can be represented as "le", and an empty bencoded dictionary as "de" (and an empty string as "0:") without violating the specification. Why would bijection suggest two? Korexio (talk) 09:55, 11 June 2012 (UTC)[reply]
This is important. I would suggest someone looks to the sources of the mainstream bit-torrent clients that have sources available, and/or create tests using bencoded (is that even a verb?) data and try out various structures using the functions found in the sources. As for µTorrent, and BitTorrent, and other binary distributed clients, I wouldn't mind emailing the developers asking about the behavior. I'll be sure to update this talk page when I do so. I believe that updating the article itself would be inappropriate as that would be original research.
NaruFGT (talk) 14:55, 9 May 2013 (UTC)[reply]

Strings and UTF-8[edit]

"The specification does not deal with encoding of characters outside the ASCII set; to mitigate this, some BitTorrent applications explicitly communicate the encoding (most commonly UTF-8) in various non-standard ways."

Hmm, sure? BEP-3 says: "All strings in a .torrent file that contains text must be UTF-8 encoded." --Tddt (talk) 23:38, 20 May 2012 (UTC)[reply]

Bencoding is not only used for the torrent file but may also be used somewhere else (e.g. tracker response). Even in the .torrent file there is a string which is not UTF-8 encoded: The pieces entry in the info dictionary. It contains the SHA1 hashes in binary form (each hash is represented by 20 bytes - not characters) BEP-3. Korexio (talk) 09:35, 11 June 2012 (UTC)[reply]

The statement "The specification does not deal with encoding of characters outside the ASCII set;" is wrong - UTF-8 encoding is specified for some strings in BEP-3. Korexio (talk) 09:51, 11 June 2012 (UTC)[reply]

Features & defects[edit]

"When editing torrent files, any changes to the info directory index will change the torrent file infohash."

This is a deficiency specific to torrent files, not Bencode. Therefore the sentence is irrelevant and should be removed.

And it's not even a defect; the whole point of infohashes is that any difference inside the info dictionary will lead to a different infohash, hence the name. —SamB (talk) 00:44, 25 January 2015 (UTC)[reply]