Wikipedia:Reference desk/Archives/Computing/2018 May 12

Computing desk
Welcome to the Wikipedia Computing Reference Desk Archives
The page you are currently viewing is an archive page. While you can leave answers for any questions shown below, please ask new questions on one of the current reference desk pages.


May 12

Accessing a list of words through any letter

If we want to search through a list of words, alphabetically ordering them should suffice (say, you want to find the words beginning with 'car'; then you go to the 'c', then the 'a', then the 'r'). But what if, hypothetically, you wanted to access the same list through any letter? (Say, you want to find all words that have 'ff' in the 3rd and 4th positions, like 'different' and 'differently'.)

What would a data structure look like in this case? Would you need to index the list for each position? That is, would the number of indexes needed equal the length of the longest word? Is there a more efficient way of doing this? --Doroletho (talk) 13:23, 12 May 2018 (UTC)

grep "^..ff" list_of_words.txt > list_of_ff-words
grep ought to be available for (or already installed on) your computer, whatever computer it is. As for the "^.." mumbo-jumbo, see the article regular expression. -- Hoary (talk) 13:46, 12 May 2018 (UTC)
  • (edit conflict) I'm going to assume the question is not how to do it, but how to efficiently do it in terms of time and space complexity.
Relevant link: database index. It is probable that real-life implementations of Select (SQL) (WHERE clauses) and similar features contain a lot of insight about your question, but I could not find any proper documentation of the implementation specifics regarding access by arbitrary properties, time complexity, and index usage.
In database parlance, your idea of alphabetically ordering the list is called clustered indexing. That is, when you know that some search keys will be asked before others (in our case, the search keys are "value of the first letter", "value of the second letter", etc.), it makes sense to write the database in order of decreasing search-key frequency. However, as you mention, this loses value when the order of searches is arbitrary, so what you want is non-clustered indexing. If any query of the form "find all words whose i-th letter is <X>" is allowed and must run in sub-linear time (i.e. faster than a grep that looks through all of them), you cannot avoid indexing all positions; otherwise there will be a query that forces you to search through all entries (words).
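To make that concrete, here is a minimal Python sketch (the word list and function names are invented for illustration): a single alphabetical ordering answers prefix queries in logarithmic time via binary search, but a query on an interior position degenerates into a scan of every word.

import bisect

words = sorted(["car", "care", "cart", "different", "differently", "offer"])

def prefix_query(prefix):
    # Clustered ordering: binary search locates the contiguous block of matches.
    lo = bisect.bisect_left(words, prefix)
    hi = bisect.bisect_left(words, prefix + "\uffff")
    return words[lo:hi]

def position_query(i, letter):
    # The alphabetical ordering is useless here: every word must be examined.
    return [w for w in words if len(w) > i and w[i] == letter]

print(prefix_query("car"))      # ['car', 'care', 'cart']
print(position_query(2, "f"))   # ['different', 'differently', 'offer']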
Depending on the particular constraints you face, it is not clear what would be most "efficient", but here are two obvious ideas.
  1. If there are few letters per word and all that matters is that lookups be as quick as possible, you can generate one alphabetical ordering (index tree) for every combination of letter priority. This means you must create a lot of indices that replicate the data (2^N indices for words of length up to N), and every time you insert or remove a word from the dataset you need to update each of those indices (at great cost, since there are many of them). However, each search query takes time proportional to the output, as the lookup part is essentially already precomputed: it takes a negligible O(L log V) time, where L is the number of characters in the query and V is the number of possible values for each letter, in an ordered-tree structure; in a hash-index structure you should be able to achieve O(L), but in any case that is not where the critical performance is. (A rough sketch of the hash-index variant appears after this list.)
  2. If there are few words in the dataset compared to how many words of the same maximum length there could be, I would say the following hash-index solution is decent both in terms of updating and lookup.[citation needed] Precompute the sets S(i, X) of words whose i-th letter is X (in practice, every time you insert or remove a word from the dataset, you must update accordingly N such sets, where N is the maximal length of words). Any search query then amounts to computing an intersection of sets of the form S(i, X). A smart implementation (start with the smallest set) can reduce the intersection computation to a time complexity of O(L · min_i |S(i, X_i)|), which can be quite terrible (close to linear) in the worst case, but the average case should be much better (though I am not sure how much exactly). (See the second sketch below.) Tigraan (Click here to contact me) 14:44, 12 May 2018 (UTC)
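For illustration, a rough Python sketch of idea 1 in its hash-index variant (all names are invented; it truncates indexing at a fixed maximum length and is deliberately naive about memory):

from itertools import combinations
from collections import defaultdict

def build_all_indices(words, max_len):
    # One hash index per non-empty subset of positions: about 2^max_len in total.
    # Maps (positions, letters at those positions) -> matching words.
    index = defaultdict(list)
    for w in words:
        positions = range(min(len(w), max_len))
        for r in range(1, max_len + 1):
            for subset in combinations(positions, r):
                key = (subset, tuple(w[i] for i in subset))
                index[key].append(w)
    return index

idx = build_all_indices(["different", "differently", "carrot"], 4)
# All words with 'ff' in the 3rd and 4th positions (0-based positions 2 and 3):
print(idx[((2, 3), ("f", "f"))])   # ['different', 'differently']

Each query is then a single hash lookup, but the index blows up exponentially with the word length, which is the trade-off described above.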
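And a sketch of idea 2, the per-position sets with smallest-first intersection (again, names invented):

from collections import defaultdict

def build_position_sets(words):
    # sets[(i, X)] = set of words whose i-th letter is X.
    sets = defaultdict(set)
    for w in words:
        for i, ch in enumerate(w):
            sets[(i, ch)].add(w)
    return sets

def query(sets, constraints):
    # constraints: list of (position, letter) pairs.
    # Start with the smallest set, then filter by membership in the others:
    # roughly O(L * min_i |S_i|) membership tests for L constraints.
    groups = sorted((sets.get(c, set()) for c in constraints), key=len)
    result = groups[0]
    for g in groups[1:]:
        result = {w for w in result if w in g}
    return result

sets = build_position_sets(["different", "differently", "offer", "car"])
print(query(sets, [(2, "f"), (3, "f")]))   # {'different', 'differently'} (set order may vary)

Updating on an insert or delete touches only the N sets for that word's positions, which is why this variant stays cheap to maintain compared to idea 1.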
Yes, sorry, I misread the question. -- Hoary (talk) 03:18, 13 May 2018 (UTC)