Talk:File format/Archive 1

This page is an archive of past discussions. Do not edit the contents of this page. If you wish to start a new discussion or revive an old one, please do so on the current talk page.

BIEW

Why did you remove the link to BIEW? BIEW is not only a hex editor; it is first of all the Binary vIEW project! --The preceding unsigned comment was added by Nickols k (talk • contribs) 07:30, 4 May 2007 (UTC).

So how is it relevant to an article about file formats in general (as opposed to object file formats)? Guy Harris 07:58, 4 May 2007 (UTC)

Data format

This article is about data formats in general, not only file formats: it should be renamed. For instance JPEG is a data format but not a file format (JFIF is a file format using JPEG, TIFF is another one). A data format is required to exchange data; a file is only one means of exchange, and it is not always used. It is quite common for a server to generate data in some format (say MPEG) and send it to a client for immediate use, with no file storage at any point in the process. Marc Mongenet 18:37, 2004 Aug 30 (UTC)

Incorrect HTML magic number, HTML files should start with <!DOCTYPE

An HTML file, for instance, should begin with the ASCII characters <html>

This is not correct. HTML and XHTML documents should start with a DOCTYPE declaration.

See http://www.w3.org/TR/html4/struct/global.html#h-7.1

--Skjæve 09:21, 14 Sep 2004 (UTC)

You are of course correct, and I've changed the article to try and reflect this. I've left the mention of <html> in there, though, because if you were testing magic numbers it would be foolish not to look for this: I should think there are an awful lot of HTML files out there that don't include a DOCTYPE, especially hand-crafted ones. - IMSoP 15:46, 15 Sep 2004 (UTC)

And there are perfectly valid HTML documents that have no html tag (and no head and no body). They are all optional elements. HTML is the worst example of a magic number I can think of. Marc Mongenet 04:35, 2004 Sep 16 (UTC)

Or perhaps the best example of the imperfection of the approach! It is, after all, an attempt to apply rules that started off for internal use by readers of binary files, to the much more complex problem of identifying the large range of files on a modern computer. Perhaps a note should be added that this isn't always easy. Anyway, I only used HTML as an arbitrary example of "a file type most readers will have seen and understand", and included it in all three sections for consistency and to aid comparisons. (Although, I'm not sure they are strictly "optional" in terms of Standards-based validity, only in terms of renderability by most "tag soup" parsers, but I see your point) - IMSoP 13:07, 16 Sep 2004 (UTC)

Both the start and end tags are optional in HTML 2.0, HTML 3.2 and HTML 4.0(1), strict and loose DTD. Worst possible example... Marc Mongenet 20:02, 2004 Sep 30 (UTC)

Fair enough, I didn't know that. Still, I think you miss my point about what it's a good example of: it's a good example of magic numbers not working. It illustrates their Achilles heel, so to speak. I've edited the article now, to explicitly make this point. [The downside being my nice parallel examples are now in different orders in different paragraphs. :( Maybe I'll fix that later.] - IMSoP 22:36, 30 Sep 2004 (UTC)
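To make the weakness discussed in this thread concrete, here is a minimal sketch of a sniffer that looks only for a DOCTYPE or an html tag near the start of the data. The function name `sniff_html`, the 256-byte window and the sample documents are assumptions made for illustration; they are not taken from the article or from any real detection tool.

```python
import re

# Hypothetical helper for illustration; not taken from any real library.
_HTML_HINTS = (
    re.compile(rb"<!doctype\s+html", re.IGNORECASE),
    re.compile(rb"<html[\s>]", re.IGNORECASE),
)


def sniff_html(data: bytes, window: int = 256) -> bool:
    """Return True if the first `window` bytes contain a DOCTYPE or an html tag."""
    head = data[:window]
    return any(pattern.search(head) for pattern in _HTML_HINTS)


with_doctype = b'<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN">\n<title>t</title><p>x'
tag_only = b"<html><head><title>t</title></head><body></body></html>"
# Hand-written page with no DOCTYPE and none of the optional html/head/body tags.
no_markers = b"<title>t</title>\n<p>Hello"

for sample in (with_doctype, tag_only, no_markers):
    print(sniff_html(sample))
```

The third sample is the case raised above: it has neither marker, so a test built on those two markers alone misses it.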

"magic number" vs. other ways to specify format

The "magic number" ... approach ... is only useful, however, if the interface used to access the files allows the user to easily manipulate any file in a variety of ways — as opposed to double clicking automatically doing the "right" thing...

I don't understand. When I double click on a file, what difference does it make whether the OS looks at the "magic number" at the beginning of the file or the "extension" of the file name? It seems to me that it doesn't make any difference, so it is just as "useful" either way. --DavidCary 20:26, 5 Jan 2005 (UTC)

Hmm... I wonder if the author of that statement (possibly me) was referring to the "can often determine more precise information" part of the previous sentence (this information being irrelevant in a big-icon just-double-click type environment - although I grant that sorting and searching can be greatly enhanced by it, even on Windows); or perhaps, it was intended to mean the usefulness of arbitrarily changing a file's "type" (e.g. renaming a file under Windows changes the double click behaviour). I certainly agree that the current statement is unclear, but am not 100% sure how to reword it. - IMSoP 23:21, 5 Jan 2005 (UTC)

A further disadvantage [of "magic numbers"] is that it requires scanning of both the file in question and a "magic file" listing known identifiers, making it less efficient, especially for displaying large lists of files.

I don't understand. Less efficient than what? If we used external metadata or file extensions, we'd still have to (a) read that data or file extension and (b) look it up in a list of known file types.

Yes, this is definitely badly written. A more correct statement is that since reliable 'magic number' tests are often quite complex, and each file must be tested against every test known (the tests are not necessarily mutually exclusive, and there will often be fairly generic tests that match as well as more specific ones, so you can't even stop at the first match), it is less efficient. (Unlike an extension or standardised metadata test, where the data is checked once, and looked up in a potentially very efficient index of one-to-one relationships.) I'll reword that part now. - IMSoP 23:21, 5 Jan 2005 (UTC)
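A rough sketch of the contrast described above, assuming a toy extension index and a handful of toy content tests. All names and table entries here are made up for the example; a real "magic file" is far larger and its tests are more involved.

```python
import os

# Illustrative one-to-one index; the entries are assumptions for the example.
EXTENSION_INDEX = {".png": "image/png", ".gif": "image/gif", ".html": "text/html"}

# Illustrative content tests. They are not mutually exclusive: the generic
# text test also matches HTML, so stopping at the first match is not enough.
CONTENT_TESTS = [
    ("image/png", lambda d: d.startswith(b"\x89PNG\r\n\x1a\n")),
    ("image/gif", lambda d: d.startswith((b"GIF87a", b"GIF89a"))),
    ("text/html", lambda d: b"<html" in d[:64].lower()),
    ("text/plain", lambda d: all(b in b"\t\n\r" or 32 <= b < 127 for b in d[:64])),
]


def type_by_extension(name):
    """One dictionary lookup on the name; the file contents are never read."""
    return EXTENSION_INDEX.get(os.path.splitext(name)[1].lower())


def types_by_content(data):
    """Run every test against the data and return all types that match."""
    return [mime for mime, test in CONTENT_TESTS if test(data)]


print(type_by_extension("index.HTML"))
print(types_by_content(b"<html><body>hi</body></html>"))
```

The extension path does one lookup per file, while the content path runs every test and can return both a specific and a generic match for the same file.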

And again

This is only useful, however, if the interface used to access the files allows the user to easily manipulate any file in a variety of ways—as opposed to double clicking automatically doing the "right" thing; it is therefore more often associated with command line interfaces than graphical ones.

Still doesn't make any sense. Removed.

each file must be tested against every possibility in the "magic file"

Not strictly speaking true. Proper algorithms and data structures for the magic database (trie-like) can make magic detection remarkably efficient.
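As a sketch of what a trie-like magic database could look like, the code below indexes a few well-known fixed signatures by their leading bytes, so a file is identified in a single pass over its first few bytes instead of one comparison per known type. The helper names are hypothetical, and the sketch covers only fixed signatures at offset 0, which is an assumption; real magic tests can be more complex.

```python
def build_trie(signatures):
    """Build a byte trie from a {magic_bytes: type_name} mapping."""
    root = {}
    for magic, name in signatures.items():
        node = root
        for byte in magic:
            node = node.setdefault(byte, {})
        node[None] = name  # the None key marks the end of a complete signature
    return root


def identify(trie, data):
    """Walk the trie along the leading bytes; return the longest match or None."""
    node, best = trie, None
    for byte in data:
        node = node.get(byte)
        if node is None:
            break
        best = node.get(None, best)
    return best


SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "PNG",
    b"GIF87a": "GIF",
    b"GIF89a": "GIF",
    b"PK\x03\x04": "ZIP",
    b"%PDF-": "PDF",
}
TRIE = build_trie(SIGNATURES)

print(identify(TRIE, b"GIF89a\x01\x00\x01\x00"))
print(identify(TRIE, b"hello world"))
```

The work per file is bounded by the length of the longest signature, not by the number of signatures in the database.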

And, as with the example of HTML, some filetypes just don't lend themselves to recognition in this way.

HTML is recognisable. fdo (freedesktop.org) uses this magic:

   <magic priority="50">
     <match value="&lt;head" type="string" offset="0:64"/>
     <match value="&lt;TITLE" type="string" offset="0:64"/>
     <match value="&lt;title" type="string" offset="0:64"/>
     <match value="&lt;html" type="string" offset="0:64"/>
     <match value="&lt;HTML" type="string" offset="0:64"/>
     <match value="&lt;BODY" type="string" offset="0"/>
     <match value="&lt;body" type="string" offset="0"/>
     <match value="&lt;TITLE" type="string" offset="0"/>
     <match value="&lt;title" type="string" offset="0"/>
     <match value="&lt;!--" type="string" offset="0"/>
     <match value="&lt;h1" type="string" offset="0"/>
     <match value="&lt;H1" type="string" offset="0"/>
     <match value="&lt;!doctype HTML" type="string" offset="0"/>
     <match value="&lt;!DOCTYPE html" type="string" offset="0"/>
   </magic>

That's enough to match any real-world HTML document. EdC 17:28, 3 August 2006 (UTC)
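For anyone who wants to try the rules quoted above against sample data, here is a minimal sketch that transcribes them into a table and applies them. It assumes that an offset of "0:64" means the string may start at any offset from 0 to 64 inclusive; that reading, and the function names, are assumptions rather than a description of how any particular implementation works.

```python
# (value, first_offset, last_offset), transcribed from the rules quoted above.
RULES = [
    (b"<head", 0, 64), (b"<TITLE", 0, 64), (b"<title", 0, 64),
    (b"<html", 0, 64), (b"<HTML", 0, 64),
    (b"<BODY", 0, 0), (b"<body", 0, 0), (b"<TITLE", 0, 0), (b"<title", 0, 0),
    (b"<!--", 0, 0), (b"<h1", 0, 0), (b"<H1", 0, 0),
    (b"<!doctype HTML", 0, 0), (b"<!DOCTYPE html", 0, 0),
]


def matches_rule(data, value, first, last):
    """True if `value` starts at some offset in [first, last]."""
    # Assumption: the range is inclusive at both ends.
    return any(data.startswith(value, off) for off in range(first, last + 1))


def looks_like_html(data):
    """True if any one of the transcribed rules matches."""
    return any(matches_rule(data, *rule) for rule in RULES)


print(looks_like_html(b"<!DOCTYPE html>\n<p>x"))
print(looks_like_html(b"\n\n  <html><body>"))
print(looks_like_html(b"<p>Hello</p>"))
```

With the rules exactly as transcribed, the last sample (a fragment that starts with a paragraph tag) is not matched, since none of the listed strings occurs in it.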

Uniform Type Identifiers

The page should be updated with information about Uniform Type Identifiers (UTIs), from Apple Computer.

Section on UTIs added. --Malpertuis 22:42, 4 August 2006 (UTC)

Odds and ends...

--Ccodere 13:46, 17 August 2006 (UTC) I will add some small clarifications on the MIME types, as a lot of people have added their own MIME types without actually registering them, which makes the MIME standard very awkward indeed. Furthermore, even though it is not widely in use, I have created my own file format identification scheme... I will explain it; please tell me if it is not appropriate here.

File structure addition...

I will be adding a new section giving information on the different possible file structures for file formats. If you think I have missed something, please feel free to discuss it with me. Hopefully this will help clarify the different file formats...

Furthermore, I explicitly changed the copyright on my magicdb.org glossary terms page so it can be included in Wikipedia.

Ccodere 04:34, 11 January 2007 (UTC)Carl

File format resources in External Links

File format resource sites should definitely be in the External Links section, including the Game File Format Central that is the last word in descriptions of game archive and other game file formats. I have included it back in, because some ignorant user deleted it as spam. It is not. A site which offers thousands of detailed descriptions of file formats is highly relevant to this page. In addition, you can find tutorials there that explain how one can go about understanding file formats. Now, please be sensible and do not remove that resource again. It looks stupid. —The preceding unsigned comment was added by 192.87.23.66 (talk) 08:59, 27 June 2007

File type conversions

Can anyone make a suggestion on how to include information about converting between file types, possibly on a separate page, along with details of available conversion programs? For example, the software you use to turn a WAV, MP3 or other music file into another file type (properly): sox on Linux converts simple WAV-like files, and I used to use a program on Linux that converts MP3s into WAV files.

Weecol 12:12, 20 July 2007 (UTC)

Extensions

This section seems to be written in a rather odd way. Extensions are generally perceived as having more disadvantages than anything, such as the mentioned luser losing the file, although the main reason (not mentioned) is the potential for exploits. --18.33.1.42 22:21, 22 July 2007 (UTC)

Yes, it is very odd. I'd even say it is "original research" because it seems to be based solely on the opinion of the author. In fact, nobody (probably not even Microsoft) knows why filename extensions have been hidden by default in Microsoft's software since 1995. At least I am not aware of any statement regarding their reasoning. Exploits are indeed based on the assumption that this information is hidden from the user. If it were not hidden, there would be nothing to exploit. Sure, after 12 years of dumbing users down we cannot expect Joe and his average to understand what ".exe" means or implies, but that's only because extensions are now hidden, not because they exist. There was no point in hiding this. If they thought ".exe" or ".jpg" was too obscure, they could have mapped this to non-forgeable human-readable hints like "application", "picture" etc. for presentation. You have to show some information to the user anyway. Mac OS and MS Windows support embedded icons, which adds another exploitable flaw because an executable can now be camouflaged as a picture, audio file or whatever. Most people don't customize icons, just in case someone wants to object. Thus, claiming extensions are hidden to prevent users from mislabeling their files by accident does not make any sense and looks pretty far-fetched. Windows lets you do much worse things with files, and MS Windows Explorer could simply reject such renaming attempts, ask for confirmation or provide an undo function. Also, this isn't the same world anymore. In 1995 users knew what a "JPG" or "EXE" was. So the pseudo-explanation is at best an attempt to retcon reality. If there's confusion nowadays, see above. --217.87.80.102 22:36, 21 October 2007 (UTC)

Article is technically incorrect and badly written

Where to start? First sentence has no particular meaning. Then, second para: It isn't true that a disk drive can store only bits. It isn't true that "0s and 1s" aren't "information". It's completely misleading to say that "different" kinds of information require different file formats. It doesn't make any sense to say that "within" an application "there will be" several formats.

This article is a wasteland of half-understood concepts and cliches. Honestly. One of the worst written articles I have read in Wiki. 24.6.66.149 (talk) 12:01, 19 March 2008 (UTC)

To clarify slightly, it would be better to say "Disks usually store information in a binary format which needs conversion to be easily understood by people." That is, avoid saying "disks can only store bits", because there are a number of senses in which this isn't true. 76.102.157.205 (talk) 03:19, 20 March 2008 (UTC)

Chunked file formats

The article is written as though IFF was the first. Clearly it wasn't; I've added information that DER (which was already in the list of formats that 'took the idea' from IFF) predates its publication by a year. I'm pretty certain that wasn't the first, either. It's a very obvious concept that I suspect dates back to the 70s or even the 60s... I'm just not aware of any sources about this.

Anyone know of anything useful here? JulesH (talk) 11:10, 25 October 2008 (UTC)

You are probably correct on this assumption, but as long as we do not have a clear historical reference, we cannot simply replace "pioneered" by "popularized" in the IFF paragraph.

Ccodere (talk) 17:41, 5 January 2009 (UTC)
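For readers unfamiliar with what "chunked" means in this thread, here is a minimal sketch of walking an IFF-style chunk sequence: each chunk is a 4-byte ID, a 4-byte length and a payload padded to an even number of bytes. The helper names and the sample chunk IDs are made up for the example, and the sketch reads only a flat list of chunks; it does not handle nested container chunks.

```python
import struct


def iter_chunks(data, byteorder=">"):
    """Yield (chunk_id, payload) pairs from a flat sequence of chunks.

    IFF stores the 4-byte length big-endian (">"); RIFF stores it
    little-endian ("<").
    """
    pos = 0
    while pos + 8 <= len(data):
        chunk_id = data[pos:pos + 4]
        (size,) = struct.unpack(byteorder + "I", data[pos + 4:pos + 8])
        payload = data[pos + 8:pos + 8 + size]
        if len(payload) < size:
            raise ValueError("truncated chunk %r" % chunk_id)
        yield chunk_id, payload
        pos += 8 + size + (size & 1)  # skip the pad byte after an odd-sized payload


def make_chunk(chunk_id, payload, byteorder=">"):
    """Build one chunk; a helper for this example only."""
    pad = b"\x00" if len(payload) % 2 else b""
    return chunk_id + struct.pack(byteorder + "I", len(payload)) + payload + pad


sample = make_chunk(b"NAME", b"abc") + make_chunk(b"BODY", b"\x01\x02")
for chunk_id, payload in iter_chunks(sample):
    print(chunk_id, payload)
```

A reader that does not recognise a chunk ID can skip it using the length field alone, which is the property the chunked formats discussed here have in common.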

External Links

Now, somehow, someone decided to remove some external links. Even though there were too many external links, some of these links pointed to standards that are not even on dmoz.org. Therefore, I have at least reinstated some of the sites accordingly.

Ccodere (talk) 17:21, 5 January 2009 (UTC)

New Concept

Object Format is a new concept being developed that relates to the concept of storing functions in with the data (what an object is), thus offloading data transformations onto those who created the data format. This reduces the [http://www.freesoftwaremagazine.com/free_issues/issue_01/focus_format_history/ complexity needed in a file format and the complexity that isn't needed] (effectively open sourcing the implementation of the transformation code with the data). This technology is currently in development in Open Source projects I'm working on, so I felt it was worth mentioning.

The methods in the object can be protected using a public key encryption of the data CRC and any necessary error correction, a process I haven't dealt with but have faith in as a way to protect the objects from malicious adjustments. Beware file formats, your days are numbered. --Rofthorax 08:51, 24 August 2005 (UTC)

Sounds a lot like a Smalltalk image... Wouter Lievens

Yes it does!

The Ogg format can potentially store video and/or audio, but actual implementations are currently rare as of December 2001, and only an audio codec called Ogg Vorbis exists for the format, though developers continue to work on video codecs such as the Tarkin video codec and to integrate other formats such as MNG (lossless and motion-JPEG compression), FLAC (lossless audio compression), and XML (text-based data such as captions and subtitles) into the Ogg framework. IFF is a now defunct format which, like AVI, is a shell, but IFF had no limitations, being able to store sound, image, movie, animation, data or archive. If a programmer wanted to store data in the IFF format, he just had to define the subformat following the general rules. For example, WAV files follow a variation of IFF called RIFF.

Removed, at least temporarily. The main point of including this seemed to be to make the point that some file formats allow storage of more than one kind of information. I have left one or two examples in the article, but I think such an in-depth treatment is excessive, at least considering the current length of the rest of the article. -- Ryguasu

This text:

The most useful part of intellectual property law for protecting ownership of a file format seems to be patent law. Although you cannot patent a file format, some file formats require encoding data with patented algorithms.

is presumably talking about the laws of a particular country, the USA maybe? Maybe there are countries where it's possible to patent a file format, but patenting algorithms as such is not permitted in many places.

Yes, I was thinking about US law, which is not an altogether bad starting point, given the importance of US software developers. But a more international perspective is certainly in order. --Ryguasu

Why remove links on file format page

I have an issue with you removing all the links on the file format definitions page. I tried to clean up and remove some of the invalid spam links.

Some of the web sites in the external links ARE much more useful than the ones in DMOZ, can you please explain your reasoning:

I don't see how any of these links could be considered SPAM in any way:

* Magic signature database - Standard file format information and FFID registry
* File Signatures Database resource for forensic practitioners
* PRONOM technical registry
* Library of Congress file format information
* Introduction to Uniform Type Identifiers

Ccodere (talk) —Preceding undated comment was added at 01:45, 31 January 2009 (UTC).

Wikipedia is an encyclopedia; as such, many links do not belong here. Equally, Wikipedia is not a place for commercial links, links to registries, databases, etc. Wikipedia is not a repository for links. Additionally, these are links normally to be avoided, and they fail the specific inclusion requirements of Wikipedia's External Links policy. It's best to add cited, verifiable content, not links. --Hu12 (talk) 21:08, 31 January 2009 (UTC)

I currently cannot find ANY specific rule that forbids any of the above links. Actually, contrary to what you are saying, all of these links fall under the section

What should be linked

" 3. Sites that contain neutral and accurate material that cannot be integrated into the Wikipedia article due to copyright issues, amount of detail (such as professional athlete statistics, movie or television credits, interview transcripts, or online textbooks) or other reasons."

Please can you prove to me that one of those links is not useful when trying to understand file format information? There is no registration required, and no publicity. Any user interested in file formats (such as archivists and programmers) would be interested in these sites; you can ask any specialist. I really do not understand your reasoning...

Ccodere (talk)