Wikipedia:Reference desk/Archives/Computing/2014 December 4

From Wikipedia, the free encyclopedia
Computing desk
< December 3 << Nov | December | Jan >> December 5 >
Welcome to the Wikipedia Computing Reference Desk Archives
The page you are currently viewing is an archive page. While you can leave answers for any questions shown below, please ask new questions on one of the current reference desk pages.


December 4[edit]

Internal data structure for an editor-like application as a binary tree[edit]

Since a well-balanced binary tree has O(log n) time complexity for most operations (look up/insertion/deletion), which, as far as I know, would be a suitable data structure for application like text editors handling large text files over 10MB. (I don't know if there are other better data structures for this) A text editor may keep lines read from a text file in a binary tree (AVL tree, Splay tree, Red-black tree etc.), a tree node for a line (data overhead too high?), as its internal data structure. Since line numbers of lines after where lines get deleted or added change, keeping line numbers in the tree could require extra updating costs and hence not ideal. An order statistic tree is essentially a binary tree like the others, but an extra field is added to each node keeping track of the number of nodes below it, such that for any given node in the tree it can quickly answer what the order of the node is in the tree in O(log n) time. By order statistic tree, we should get out of updating line numbers when there are some preceding lines being added/deleted, because the order of the node is actually the line number.

Neither C++ STL nor Boost C++ libraries support order statistic tree AFAIK. But GNU C++ Library has an extension called policy-based data structures (PBDS) for that:

#include <iostream>
#include <ext/pb_ds/assoc_container.hpp>
#include <ext/pb_ds/tree_policy.hpp>
 
using namespace std;
using namespace __gnu_pbds;

typedef int key_t;
typedef string line_text_t;

typedef tree<
  key_t, // Key
  line_text_t, // Mapped
  less< int >, // Cmp_Fn (functor)
  rb_tree_tag, // Tag (choices are: ov_tree_tag, rb_tree_tag, splay_tree_tag)
  tree_order_statistics_node_update // Node_Update
> map_t;

void printNth(size_t n, map_t & map)
{
  const char * name;
  switch (n) {
  case 1: name = "st"; break;
  case 2: name = "nd"; break;
  case 3: name = "rd"; break;
  default: name = "th";
  }
  cout << "the " << n << name << ": " << map.find_by_order(n - 1)->second << endl;
}

int main()
{
  map_t s;
  s.insert(make_pair(986, "qux"));
  s.insert(make_pair(-477, "bar"));
  s.insert(make_pair(26, "foo"));
  s.insert(make_pair(-39, "barz"));
  for (map_t::iterator it = s.begin(); it != s.end(); it++)
    cout << "* key: " << it->first << ", mapped: " << it->second << endl;

  cout << "--------" << endl;

  printNth(3, s);
  printNth(1, s);
  printNth(4, s);

  return 0;
}

the output is:

* key: -477, mapped: bar
* key: -39, mapped: barz
* key: 26, mapped: foo
* key: 986, mapped: qux
--------
the 3rd: foo
the 1st: bar
the 4th: qux

the key of the entry inserted into the tree is not very important since I have mentioned above what we want for an editor-like app is line number which is equivalent to the order of the node which should be acquired by map_t::find_by_order(). To insert a line, for example, into the order statistic tree we may do as follows:

1. Find out the line around which we want to insert a line. An iterator to the found line is acquired at this step. For example, to insert a line around line 3 "foo" we may do:
map_t::iterator itLine3 = s.find_by_order(3 - 1); // find_by_order() returns a zero-based order number so we need to minus the line number by 1.
2. Insert the line text against the iterator we've got from step 1, for the example:
s.insert_after(itLine3, make_pair(45, "this is the line text 1 being inserted"));
//  Or
s.insert_before(itLine3, make_pair(45, "this is the line text 2 being inserted"));

the problem is the reference of the order statistic tree generated from doxygen looks quite obscure to me. I can only find the reference for map_t::insert() but there seems no such functions called insert_after(), insert_before(), insert_by_order(), insert_by_iterator(), delete_after(), delete_by_iterator() etc. Moreover, there are many member function declarations listed in the doxygen-generated reference but have no description for them at all. So does any other function help I'm not aware of regarding to GNU's PBDS? Or should I try other C++ libraries or even other data structure? - Justin545 (talk) 05:19, 4 December 2014 (UTC)[reply]

I don't have an answer to your particular question for order statistic trees, but will note that Rope (data structure) is a common tree data structure used for text editors. --Mark viking (talk) 05:39, 4 December 2014 (UTC)[reply]
Yes, a rope is a bigger string that's an interesting name. According to the article you gave, it looks like a promising data structure for editor apps. But I've nerver head it before, it seems to be even rare than order statistic tree. So it makes me wonder if there is some existing C/C++ library implementation of rope for use? Or I need to write my own one from scratch? - Justin545 (talk) 08:12, 4 December 2014 (UTC)[reply]
The SGI STL has a rope container, with source code at [1]. I've not used it myself. The LLVM has a Twine class, a type of rope. In other languages, IBM has a Java rope implementation and Hackage has a data-rope package in Haskell. --Mark viking (talk) 09:40, 4 December 2014 (UTC)[reply]
It mentions stl_rope.h is an internal header file, but thanks anyway for the links! Just trying to understand the rope algorithms, in particular rebalancing of a rope, which should give me confidence using the library. Though rebalancing is not easy to understand, as it involves Fibonacci numbers. - Justin545 (talk) 02:26, 5 December 2014 (UTC)[reply]
You say the keys don't matter, but they do: they determine the order of the elements (like any other binary search tree), so to impose an order on all lines (past and present) you'd end up doing something like BASIC line numbers (obviously not robust) or variable-length keys (probably inefficient). That's why there isn't insert_after() (don't be confused by the insert-with-hint function in std::map). It sounds like you would want a self-balancing unsorted binary tree (with order-statistic added), for which I found only that trivial reference. --Tardis (talk) 14:57, 4 December 2014 (UTC)[reply]
What I though was that node position in a binary search tree implies its relative weight in the tree. As we can see, for example, there is a given node X, its left child node Y, and its right child node Z, we know that there is an implication that Y < X < Z even if the keys of X, Y and Z are unknown. So for a given node position in a tree (such as an iterator to a tree node), it should be sufficient to tell what the previous/next node is (see this algorithm finding out the next largest node which is irrelevant to node keys). Unfortunately, I can not find out such a function in the tree class provided by GNU PBDS, since its reference is obscure. - Justin545 (talk) 01:56, 5 December 2014 (UTC)[reply]
The tree does imply the order without reference to the keys, but no one implementing a binary search tree would (think to) provide such operations that could so easily violate the tree's invariants (since some kind of key is going to end up there, and as I said it's hard to construct keys that only imply relative order). That's why I suggested the "unsorted" tree; it would probably not be very hard to implement such a beast even from scratch (after identifying exactly which operations it should support: maybe it's just begin(), end(), next(I), previous(I), erase(I), push-back(T), and insert-before(I,T)). Then you can add the order-statistic support in the obvious fashion and add the operations seek(integer) and index(I). --Tardis (talk) 06:53, 5 December 2014 (UTC)[reply]
"probably not [...] very hard" ended up being a 312-line header file (with quasi-random-access iterators but no reverse iterators). It's not very well tested (I guess I could write an editor with it?), but I can post it if anyone's interested. --Tardis (talk) 03:36, 7 December 2014 (UTC)[reply]