Talk:Shard (database architecture)

From Wikipedia, the free encyclopedia

Numerous authorities have spoken out against this term, including Theo Schlossnagle.[1] Sharding is a faddish term for an old practice, and as such, the entry for sharding should redirect to horizontal partitioning. This article gives an example of horizontal partitioning and says that sharding is much more difficult, but doesn't explain how sharding differs from horizontal partitioning or give an example of sharding. Of course it can't, because there is no difference.

By the way, since you disagree with my solution to the problem and undid the nomination, the onus is on you, not me, to fix the problem by correcting the article since you believe it should be kept:

"If you disagree: Any editor who disagrees with a proposed deletion can simply remove the tag. Even after the page is deleted, any editor can have the page restored by any administrator simply by asking. In both cases the editor is encouraged to fix the perceived problem with the page."

cswpride (talk) 06:31, 24 May 2009 (UTC)

I find your argument confused. Are you suggesting deletion of this article and topic as non-notable? Your suggestion to redir to horizontal partitioning instead suggests that. However you then cite supporting evidence that ridicules sharding as a database concept, whilst recognising that it's a currently high-profile and indeed fashionable term. Now in Wikipedia terms, that means it's notable. Doesn't mean that it's right, but it does mean an encyclopedia ought to discuss it, if only to point out that it's misleading (we list sasquatch sightings too).
Secondly you claim the article (a referenced stub) is too trivial and should be replaced by a redirect - except that the redirect target has even less content, and is unreferenced.
Finally, the reference you cite is just some random geek's personal bloggage (that SIA book is so 1990s!), it's two years old and he totally fails to get the point about sharding in that post.
Finally, finally, please don't confuse prod and AfD. It just makes mopwork for people who'd rather be writing content. If there is an AfD in circulation for this, please at least put the right templates in place so that it can be found, or else it'll only have to be dragged through DRV again afterwards because this sort of "hidden AfD" as a way to avoid other editors discussing it isn't an acceptable way to go about deletion. 8-(
Overall of course I agree with you. This article doesn't make the shard / horizontal partitioning distinction clear. So if that's the problem, why not either fix it, or ask other people to fix the article / explain it to you so that you can fix the article? Calling for deletion instead is ridiculous, and quite honestly I've got better things to spend my time worrying about. 8-(
The very brief handwaving version: Horizontal partitioning splits one or more tables by row, all within a single instance of a schema and a database server. Sharding goes beyond this: it partitions the problematic table in just the same way, but it does this across potentially multiple instances of the schema, with the other tables being replicated (sic) into those schemas en masse. This makes replication across multiple servers easy (simple horizontal partitioning doesn't). This is also why sharding is related to a shared nothing architecture - once sharded, each shard can live in a totally separate schema instance / physical database server / continent. Unlike simple horizontal partitioning of a single table, there's no ongoing need to retain shared access (from between shards) to the other unpartitioned tables.
Andy Dingley (talk) 11:48, 26 May 2009 (UTC)
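A minimal sketch of the distinction described in the comment above, offered only as an illustration and not as any particular product's implementation. It assumes a hypothetical large "customers" table that is split by key across independent shards, and a small hypothetical "countries" reference table that is copied whole into every shard. The class names, the table names, and the choice of a CRC32 hash for routing are all assumptions made for this example.

```python
import zlib


class Shard:
    """One independent schema instance; it shares nothing with the others."""

    def __init__(self, reference_rows):
        # Small, rarely changing tables are copied whole into every shard.
        self.countries = dict(reference_rows)
        # The large table holds only the rows routed to this shard.
        self.customers = {}


class ShardedStore:
    """Application-level router that sits above the individual shards."""

    def __init__(self, shard_count, reference_rows):
        self.shards = [Shard(reference_rows) for _ in range(shard_count)]

    def shard_for(self, customer_id):
        # A stable hash, so the same key always routes to the same shard.
        key = str(customer_id).encode("utf-8")
        return self.shards[zlib.crc32(key) % len(self.shards)]

    def insert_customer(self, customer_id, name, country_code):
        self.shard_for(customer_id).customers[customer_id] = (name, country_code)

    def customer_with_country(self, customer_id):
        # The lookup against the reference table is answered inside one
        # shard; no other shard or shared server is consulted.
        shard = self.shard_for(customer_id)
        name, country_code = shard.customers[customer_id]
        return name, shard.countries[country_code]


store = ShardedStore(3, {"DE": "Germany", "US": "United States"})
store.insert_customer(1, "Alice", "DE")
store.insert_customer(2, "Bob", "US")
print(store.customer_with_country(1))
```

Because every shard carries its own copy of the reference table, each one could in principle be moved to a separate server without the others noticing, which is the shared-nothing property mentioned above. Plain horizontal partitioning, by contrast, would keep all partitions and the single "countries" table inside one schema instance.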
FWIW: We don't use "sharding" in the MySQL documentation, except to mention it as a synonym for our preferred "[horizontal] partitioning". I agree with deletion/redirect to Partition_(database).
Jonstephens (talk) 12:35, 25 January 2012 (UTC)
When did MySQL become the sole arbiter of what other databases support, or more particularly in this case, what an application builds on top of a database platform? Sharding can be implemented with any database, including MySQL. There's no need to have a native platform implementation of it. Andy Dingley (talk) 16:42, 25 January 2012 (UTC)
I'm sorry, I don't recall making any claim to being "a sole arbiter". I was merely offering the fact that we use the term "partitioning" (and don't see any need for "sharding") in the MySQL partitioning and MySQL Cluster documentation as an example of (non-)usage. Folks who write official documentation for other RDBMS are absolutely welcome to chime in with further examples. ... As for the merits of application-level partitioning vs. those of having it baked in: that has nothing to do with what I was talking about, and I've no interest whatsoever in debating it here/now. It's true that MySQL supports the latter, but this is completely orthogonal to the matter of terminology, about which my view is that "partitioning" is what's traditionally used; "sharding" is a relative newcomer; the latter has nothing to distinguish it from the former; there's no justification for multiplying key terms needlessly; I'm sticking with "partitioning". Jonstephens (talk) 09:02, 26 January 2012 (UTC)
If I understand the article well, sharding is introduced & maintained higher in the architecture than partitioning, since it is designed to operate across logical servers. Then it is a different concept. Of course an RDBMS can state it supports sharding by using partitioning, but that does not cover the whole concept of sharding. (An analogy coming to mind: the fact that an RDBMS supports a data warehouse star schema, i.e. allows it to be used, does not mean that our page on star schema as a design concept is superfluous).
So, to me the page can stay, provided that the difference with or addition to horizontal partitioning is noted. As it is today. -DePiep (talk) 09:50, 26 January 2012 (UTC)

Relational centric approach on Shard_(database_architecture)

The article refers to sharding as the separation of "rows". This is a relational-centric approach which should be avoided, since many databases that have no concept of rows support sharding of data. — Preceding unsigned comment added by Germanviscuso (talk • contribs) 01:22, 9 March 2011 (UTC)

  • What would you use instead of "rows"? Perhaps "records"? We could also replace "columns" with "fields", although the two-dimensional imagery can aid understanding through visualisation. An alternative might be to address your issue at some point in the article, e.g. "... rows (or records for databases without rows) ...". — Preceding unsigned comment added by 82.32.24.201 (talk) 22:22, 8 June 2011 (UTC)
  • Please provide a reference for a non-relational database model that uses sharding. You seem to assert that the concept is still relevant when the partition can be neither horizontal nor vertical ... which I'm not certain is true, nor what criteria one would use to implement such a partition. yoyo (talk) 17:45, 8 April 2012 (UTC)
I can't reference this off hand (I'm on holiday), but sharding is an important aspect within triplestores and the world of RDF (or similar) data models.
It's fundamental to RDF that the addressing model is based on URIs, and so anything in "the universe" (i.e. anything on the web, or described on the web, for which at least one identifying URI can be given) can potentially exist within a data model. That's a big data model! It's also usual for RDF models to include a "model" identifier (think of it like a namespace) which can thus identify the "application data". The trouble is that RDF models tend to spread beyond this: a FOAF model for a company's employee data will also reference schemas and vocabularies from outside the application's own model, where it's useful to reference pre-existing external vocabularies such as address regions, tax codes etc. that have some independent existence from outside the new application model. Yet no business wants to have to import a government's entire tax code, just to obtain the vocabulary list for full-time/part-time/temporary employment status. So any RDF project that is both large, based on outside vocabularies, and also achievable, must have some ability to partition data (if not shard it).
Once the tools are in place for this, then it becomes very easy to start sharding the application data model in just the same way as is described here. Some parts of the application's fact data find themselves sharded and partitioned, others for the dimension table aspects are simply replicated around. It's not usual to think of "shards" in the world of triples (they're certainly not slices) but the idea of clustering data is a pretty common one. In the great web of triples, some form clusters (by looking at just their schema data types) and it's common to group such things together. Extend this to further stretch and fission a cluster along a meaningful semantic axis, and you have a shard.
I can't think of refs off-hand, but Damian Steer did some work on this at years ago, with a short project called "Brown Sauce" that automatically grouped and analysed clustering. Andy Dingley (talk) 18:04, 8 April 2012 (UTC)

Old Wine in New Bottles

If "sharding" is done on a single server, it's a faddish term for partitioning. If it's done on multiple servers, it's a faddish name for distributed database management, which has been around for well over 20 years. — Preceding unsigned comment added by Mbwallace (talk • contribs) 20:21, 30 March 2011 (UTC)

The term sharding has been around since the 90's so I don't see how it is a faddish term. I have added an etymology section for clarity. 64.9.146.252 (talk) 18:34, 29 June 2011 (UTC)
re IP: Yes, but only then. If I ping 127.0.0.1, who needs internet. -DePiep (talk) 22:26, 10 February 2012

Etymology

I've added an Etymology section, referencing the term's probable origin in Ultima Online. A similar section had been added (not by me) in June 2011, but it had no references and was promptly deleted as WP:MADEUP. I've included a reference to Raph Koster, a recognized expert in online game design, which I believe meets the criteria for citing blogs. Feel free to correct me if you have a reliable citation for "shard" being used in a database context before Ultima Online's 1997 release, but please don't delete the section out of hand. Phasma Felis (talk) 05:16, 18 January 2015 (UTC)

The term is definitely much older than that. I've added a reference to a replicated database system that was even called SHARD. (That reference doesn't appear to be available online, though you can find it listed in the bibliography of at least one book visible through Google Books. Perhaps there's a better reference?) Gareth McCaughan (talk) 19:09, 3 July 2015 (UTC)

... But it's been pointed out to me that while the SHARD system was called "SHARD" and involved databases and multiple sets of hardware, it's not clear that it was talking about the same thing as what's called sharding now. So while that may be the origin (or *an* origin) of the term "shard" in the context of databases, the relationship between that and the present-day use may not be so simple. Gareth McCaughan (talk) 19:14, 3 July 2015 (UTC)

Sharding ain't just this

There is a problem with Sharding pointing to this article, as the term is being used for more than just DB items. Heck, the Shared nothing architecture article says "Google calls this sharding" when referring to adding compute nodes to a system. Hence even within WP there is confusion. I came here because of a web article using 'sharding' more like the partitioning of systems into compute units. Shenme (talk) 17:52, 15 October 2016 (UTC)

Inconsistencies

There are some problems with this article.

  1. The lead and the section header contradicted each other
    • Lead: "A database shard is a horizontal partition of data in a database or search engine."
    • section header: "Shards compared to horizontal partitioning"
  2. "if the database shard is based on some real-world segmentation of the data (e.g., European customers v. American customers)" -- From a relational database view this example is more likely to be a vertical one rather than a horizontal one (unless the data is far from normalized).
  3. "This reduces index size, which generally improves search performance." But it does not, as one has to know where to look in the database for the data, so if anything it is likely to slow down binary searches. The whole point of efficiencies in database structures is to abstract this sort of problem from application designers.

--PBS (talk) 10:38, 23 November 2015 (UTC)

A subset relationship (all shards are horizontal partitions, not all partitions are shards) is not a contradiction. Andy Dingley (talk) 11:05, 23 November 2015 (UTC)

Overview of SHARD

The article cites:

Sarin, DeWitt & Rosenberg, Overview of SHARD: A System for Highly Available Replicated Data, Technical Report CCA-88-01, Computer Corporation of America, May 1988

Has anyone ever seen this? Efforts to find a copy, using WP:REX, on Twitter and elsewhere, or even to find someone who has seen it, have proved fruitless. Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 17:05, 1 May 2022 (UTC)

FWIW, this came up in a recent Hacker News discussion. There are a number of related links there, and even some verbiage from a response received from an inquiry sent to one of the authors of the SHARD paper. So far it's still the case that nobody has managed to lay hands on a copy, but there are a number of other papers linked in that sub-thread which give a lot of information about SHARD (some of them being by some of the same authors). At the very least, it's clear that SHARD really did exist and was a real "thing". It appears that the paper was an internal memo at CCA and was probably never shared widely with the external world, which is probably why it can't be found online (so far). Sprhodes (talk) 21:03, 24 July 2023 (UTC)
I unfortunately don't have time right now to go through all of these and see which (if any) have information that could/should be incorporated here, but FWIW here's a list of papers that seem to relate to SHARD (none of which is *the* SHARD paper however).
https://apps.dtic.mil/sti/pdfs/ADA171427.pdf
https://apps.dtic.mil/sti/tr/pdf/ADA214478.pdf
https://apps.dtic.mil/sti/tr/pdf/ADA216523.pdf
https://apps.dtic.mil/sti/tr/pdf/ADA209437.pdf
https://apps.dtic.mil/sti/pdfs/ADA209126.pdf Sprhodes (talk) 21:14, 24 July 2023 (UTC)