Draft:String sorting algorithms

Submitted 2 months ago by Annekathrinhermann. Last edited 16 days ago by IntGrah.

In computer science, string sorting algorithms are a special case of sorting algorithms, where the input is an array $S=\{s_{0},\dots ,s_{n-1}\}$ of $n$ strings with characters chosen from an alphabet $Σ$.

Unlike traditional sorting algorithms that deal with atomic keys, string sorting faces specific challenges. Sorting strings with conventional atomic sorting algorithms, which treat keys as indivisible objects, is inefficient because comparing entire strings can be costly and must be performed many times. Efficient string sorting algorithms, in contrast, inspect most characters of the input only once during the entire sorting process and examine only those characters that are necessary to establish the correct ordering. Another challenge is that an array of strings is usually represented as an array of pointers to the strings. This representation results in indirect access to the characters, causing cache faults even when merely scanning the array of strings. This is in contrast to sorting atomic keys, where scanning is notably cache-efficient.

The efficiency of string sorting algorithms depends on multiple factors, including the number of strings ( $n$ ), the distinguishing prefix size of $S$ ( $D$ ), which is the minimal number of characters that need to be examined to sort the strings, the number of subproblems ( $σ$ ) into which the algorithm splits the problem at each step, which for radix-style algorithms is the alphabet size $|Σ|$, and the underlying hardware. As a result, no single algorithm is universally optimal.
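The distinguishing prefix size $D$ can be made concrete with a small Python sketch. It assumes one common convention, which the text above does not fix: each string contributes its longest common prefix with any other string plus one further character (possibly the end-of-string marker). The function names `lcp` and `distinguishing_prefix_size` are illustrative only.

```python
def lcp(a, b):
    """Length of the longest common prefix of strings a and b."""
    n = min(len(a), len(b))
    i = 0
    while i < n and a[i] == b[i]:
        i += 1
    return i


def distinguishing_prefix_size(strings):
    """Sum, over all strings, of the characters needed to tell each string
    apart from every other one (assumed convention: max LCP plus one)."""
    s = sorted(strings)
    if len(s) < 2:
        return 0  # nothing has to be inspected to "sort" fewer than two strings
    # In sorted order, the largest LCP of a string is with one of its neighbours.
    lcps = [lcp(s[i], s[i + 1]) for i in range(len(s) - 1)]
    total = 0
    for i in range(len(s)):
        left = lcps[i - 1] if i > 0 else 0
        right = lcps[i] if i < len(s) - 1 else 0
        total += max(left, right) + 1
    return total


print(distinguishing_prefix_size(["banana", "band", "bandit", "apple"]))  # prints 15
```

In this example the four strings contain 21 characters in total, but under the assumed convention only 15 of them have to be inspected.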

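Sequential methods

Multikey quicksort

Developed by Bentley and Sedgewick in 1997, this algorithm is an adaptation of traditional quicksort, tailored for string sorting.^[1] Given a set of strings that share a common prefix of length $h$, it picks a pivot string $s$ and uses the character $x=s[h]$ as a splitter, partitioning the strings into three subarrays $S_{<}$, $S_{=}$ and $S_{>}$ according to whether their $(h+1)$-th character is less than, equal to, or greater than $x$. $S_{<}$ and $S_{>}$ are sorted recursively with the same prefix length $h$, while $S_{=}$ is sorted recursively with prefix length $h+1$. The recursion on $S_{=}$ terminates when $x=0$, that is, when the splitter is the end-of-string terminator and all strings in $S_{=}$ are therefore equal. With insertion sort as a base case sorter for constant input sizes, multikey quicksort has an expected time complexity of $O(D+n\log n)$ .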
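The three-way partitioning can be sketched in Python as follows. This is a minimal illustration rather than the implementation from the cited paper: it omits the insertion sort base case, picks the middle element as pivot, and uses a hypothetical sentinel `END = -1` in place of the terminator $0$ so that strings containing NUL characters are still handled.

```python
END = -1  # assumed sentinel playing the role of the end-of-string character 0


def char_at(s, h):
    """Character code of s at position h, or END past the end of the string."""
    return ord(s[h]) if h < len(s) else END


def multikey_quicksort(a, lo=0, hi=None, h=0):
    """Sort a[lo:hi] in place; all strings in that range share a prefix of length h."""
    if hi is None:
        hi = len(a)
    if hi - lo <= 1:
        return
    x = char_at(a[(lo + hi) // 2], h)  # splitter character
    lt, i, gt = lo, lo, hi - 1
    while i <= gt:  # three-way partition on the (h+1)-th character
        c = char_at(a[i], h)
        if c < x:
            a[lt], a[i] = a[i], a[lt]
            lt += 1
            i += 1
        elif c > x:
            a[i], a[gt] = a[gt], a[i]
            gt -= 1
        else:
            i += 1
    multikey_quicksort(a, lo, lt, h)               # S_<, same prefix length
    if x != END:                                   # S_= holds equal strings if x is END
        multikey_quicksort(a, lt, gt + 1, h + 1)   # S_=, prefix grows by one
    multikey_quicksort(a, gt + 1, hi, h)           # S_>, same prefix length


words = ["banana", "band", "bandit", "apple", "", "ban", "band"]
multikey_quicksort(words)
print(words)
```

Because the sketch is recursive, very long common prefixes can exceed Python's default recursion limit; a production implementation would use an explicit stack.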

Most significant digit (MSD) radix sort

Most significant digit (MSD) radix sort is especially efficient for sorting large datasets, particularly when the alphabet size is small.^[2] Given a set of strings with a common prefix of length $h$, the algorithm examines the $(h+1)$-th character of each string and distributes the strings into $σ$ distinct subproblems, one per character. Each subproblem is then sorted recursively with common prefix length $h+1$ . This natural approach to string sorting has been refined and improved in numerous studies.^[3]^[4]^[5] The time complexity is $O(D)$ plus the time required for sorting the base cases. For example, with multikey quicksort as the base case sorter, MSD radix sort has a complexity of $O(D+n\log σ)$ .
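A minimal Python sketch of this recursion is shown below. It makes several simplifying assumptions: a dictionary stands in for the array of $σ$ buckets, Python's built-in `sorted` stands in for the base case sorter, and the `threshold` parameter is an arbitrary illustrative value. Engineered implementations distribute pointers in place or with counting passes instead of building new lists.

```python
def msd_radix_sort(strings, h=0, threshold=16):
    """Return the strings in sorted order; all inputs share a prefix of length h."""
    if len(strings) <= threshold:
        return sorted(strings)  # stand-in for a base case sorter
    finished = []   # strings that end exactly after the common prefix
    buckets = {}    # character -> strings whose (h+1)-th character it is
    for s in strings:
        if len(s) == h:
            finished.append(s)
        else:
            buckets.setdefault(s[h], []).append(s)
    out = finished  # shorter strings precede all of their extensions
    for c in sorted(buckets):  # visit the subproblems in alphabet order
        out.extend(msd_radix_sort(buckets[c], h + 1, threshold))
    return out


words = ["banana", "band", "bandit", "apple", "ban", "band"]
print(msd_radix_sort(words, threshold=2))
```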

Burstsort

Burstsort uses a trie-based structure with containers at the leaves for sorting the strings.^[6]^[7] When the number of strings in a container exceeds a predefined threshold, the container "bursts": it is replaced by a new trie node, and its strings are redistributed into new containers based on their next character. These new containers are attached as children of the new node. After all strings have been inserted, the sorted output is produced by traversing the trie in order and individually sorting the small containers. Key factors influencing the runtime efficiency of burstsort include the trie implementation, the design of the containers, the burst threshold, and the chosen base algorithm for sorting the containers. Sinha and Zobel used an array for each trie node and unordered dynamic arrays of string pointers for the leaf containers, with a bursting threshold set at $8192$ .^[8] With this configuration and multikey quicksort for sorting the leaves, burstsort has a complexity of $O(D+n\log σ)$ .
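The insert, burst and traverse steps can be sketched in Python as follows. The sketch is illustrative only: `TrieNode`, `_insert`, `_burst` and `_collect` are hypothetical names, a dictionary replaces the per-node array, and the built-in `sorted` stands in for multikey quicksort when sorting the containers.

```python
class TrieNode:
    def __init__(self):
        self.children = {}  # character -> TrieNode, or list acting as a container
        self.ended = []     # strings that end exactly at this node


def _burst(container, h, threshold):
    """Replace a full container by a trie node over the character at position h."""
    node = TrieNode()
    for s in container:
        _insert(node, s, h, threshold)
    return node


def _insert(node, s, h, threshold):
    while True:
        if h == len(s):
            node.ended.append(s)
            return
        child = node.children.get(s[h])
        if child is None:
            node.children[s[h]] = [s]  # open a new container
            return
        if isinstance(child, list):
            child.append(s)
            if len(child) > threshold:  # container too large: burst it
                node.children[s[h]] = _burst(child, h + 1, threshold)
            return
        node, h = child, h + 1  # descend one trie level


def _collect(node, out):
    out.extend(node.ended)
    for c in sorted(node.children):
        child = node.children[c]
        if isinstance(child, list):
            out.extend(sorted(child))  # stand-in for the container sorter
        else:
            _collect(child, out)


def burstsort(strings, threshold=8192):
    root = TrieNode()
    for s in strings:
        _insert(root, s, 0, threshold)
    out = []
    _collect(root, out)
    return out


print(burstsort(["banana", "band", "bandit", "apple", "ban", "band"], threshold=2))
```

A small threshold is used in the example only to force bursts on a tiny input.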

LCP-mergesort

LCP-mergesort is an adaptation of the traditional merge sort algorithm that stores and reuses the longest common prefixes (LCPs) of consecutive strings in the sorted subproblems.^[9] This strategy makes string comparisons more efficient. In the conventional method, two strings $s_{a}$ and $s_{b}$ must be compared character by character from the beginning. However, if the LCPs of $s_{a}$ and $s_{b}$ with a third string $p$ that is lexicographically no larger than either of them are known, the comparison can often be decided from the LCP values alone. If the LCP between $p$ and $s_{a}$ is longer than that between $p$ and $s_{b}$ , then $s_{a}$ precedes $s_{b}$ in lexicographical order: at the first position where $s_{b}$ differs from $p$ , $s_{b}$ has a larger character than $p$ , while $s_{a}$ still agrees with $p$ . The symmetric case is analogous. Only if both LCPs are equal do characters have to be compared, and the comparison can start after the common prefix. LCP-mergesort has a worst-case time complexity of $O(D+n\log n)$ .
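The LCP-based comparison and the resulting merge step can be sketched in Python as follows. This is a simplified illustration, not the algorithm as published: each sorted run is assumed to be a list of `(lcp, string)` pairs, where `lcp` is the LCP with the preceding string of the run (0 for the first), and the names `lcp_compare`, `lcp_merge` and `lcp_mergesort` are hypothetical.

```python
def lcp_compare(a, la, b, lb):
    """Compare a and b, given la = LCP(p, a) and lb = LCP(p, b) for some p <= a, b.
    Returns (sign, LCP(a, b)) with sign < 0 if a < b, 0 if equal, > 0 if a > b."""
    if la > lb:
        return -1, lb  # a agrees with p for longer, so a < b
    if la < lb:
        return 1, la
    h = la  # equal LCPs: characters must be compared, starting at position la
    while h < len(a) and h < len(b) and a[h] == b[h]:
        h += 1
    if h == len(a):
        return (0 if h == len(b) else -1), h
    if h == len(b):
        return 1, h
    return (-1 if a[h] < b[h] else 1), h


def lcp_merge(run1, run2):
    """Merge two sorted runs of (lcp, string) pairs into one such run."""
    out = []
    i = j = 0
    la = run1[0][0] if run1 else 0  # LCP of current head of run1 with last output
    lb = run2[0][0] if run2 else 0  # LCP of current head of run2 with last output
    while i < len(run1) and j < len(run2):
        a, b = run1[i][1], run2[j][1]
        sign, h = lcp_compare(a, la, b, lb)
        if sign <= 0:
            out.append((la, a))
            i += 1
            la = run1[i][0] if i < len(run1) else 0
            lb = h  # LCP of b with the string just output
        else:
            out.append((lb, b))
            j += 1
            lb = run2[j][0] if j < len(run2) else 0
            la = h
    if i < len(run1):
        out.append((la, run1[i][1]))
        out.extend(run1[i + 1:])
    elif j < len(run2):
        out.append((lb, run2[j][1]))
        out.extend(run2[j + 1:])
    return out


def lcp_mergesort(strings):
    if len(strings) <= 1:
        return [(0, s) for s in strings]
    mid = len(strings) // 2
    return lcp_merge(lcp_mergesort(strings[:mid]), lcp_mergesort(strings[mid:]))


print(lcp_mergesort(["banana", "band", "bandit", "apple"]))
# prints [(0, 'apple'), (0, 'banana'), (3, 'band'), (4, 'bandit')]
```

The merged run again carries the LCPs of consecutive strings, which is what allows the next merge level to reuse them.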

Insertion sort

Insertion sort is frequently used as the base case sorter for small sets of strings.^[10] The algorithm maintains a sorted array and inserts the remaining items one by one into their appropriate positions by linear scanning. This method treats strings as atomic units, requiring full string comparisons during the linear scan to ensure the correct order. It has a worst-case time complexity of $O(nD)$ . It is therefore best suited to small $n$ and $D$ , where it benefits from the cache-efficient manner in which the strings are scanned.
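A minimal Python sketch of insertion sort on strings is given below. It relies on Python's built-in string comparison, which compares whole strings character by character, so it matches the "atomic" treatment described above; the function name `string_insertion_sort` is illustrative.

```python
def string_insertion_sort(a):
    """Sort the list of strings a in place and return it."""
    for i in range(1, len(a)):
        s = a[i]
        j = i
        # Linear scan to the left; each test is a full string comparison.
        while j > 0 and a[j - 1] > s:
            a[j] = a[j - 1]
            j -= 1
        a[j] = s
    return a


print(string_insertion_sort(["band", "apple", "bandit", "ban"]))
# prints ['apple', 'ban', 'band', 'bandit']
```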

Parallel methods

Parallel string sorting algorithms have received comparatively little study, yet exploiting parallelism is the only way to benefit from the growing transistor counts described by Moore's law.^[11] The scalability of an algorithm in a parallel computing environment depends on various factors, similar to those affecting sequential methods. Many of the sequential algorithms described above can be adapted for parallel execution.

References

[1] Bentley, Jon L.; Sedgewick, Robert. Fast algorithms for sorting and searching strings. In: Proceedings of the eighth annual ACM-SIAM symposium on Discrete algorithms. 1997. pp. 360-369.
[2] Kärkkäinen, Juha; Rantala, Tommi. Engineering radix sort for strings. In: International Symposium on String Processing and Information Retrieval. Berlin, Heidelberg: Springer Berlin Heidelberg, 2008. pp. 3-14.
[3] Kärkkäinen, Juha; Rantala, Tommi. Engineering radix sort for strings. In: International Symposium on String Processing and Information Retrieval. Berlin, Heidelberg: Springer Berlin Heidelberg, 2008. pp. 3-14.
[4] Ng, Waihong; Kakehi, Katsuhiko. Cache efficient radix sort for string sorting. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, 2007, vol. 90, no. 2, pp. 457-466.
[5] Andersson, Arne; Nilsson, Stefan. Implementing radixsort. Journal of Experimental Algorithmics (JEA), 1998, vol. 3, pp. 7-es.
[6] Sinha, Ranjan; Zobel, Justin. Efficient trie-based sorting of large sets of strings. In: ACSC. 2003. pp. 11-18.
[7] Heinz, Steffen; Zobel, Justin; Williams, Hugh E. Burst tries: a fast, efficient data structure for string keys. ACM Transactions on Information Systems (TOIS), 2002, vol. 20, no. 2, pp. 192-223.
[8] Sinha, Ranjan; Zobel, Justin. Cache-conscious sorting of large sets of strings with dynamic tries. Journal of Experimental Algorithmics (JEA), 2004, vol. 9, pp. 1.5-es.
[9] Ng, Waihong; Kakehi, Katsuhiko. Merging string sequences by longest common prefixes. IPSJ Digital Courier, 2008, vol. 4, pp. 69-78.
[10] McClellan, Michael T.; Minker, Jack. The art of computer programming, vol. 3: sorting and searching. 1974.
[11] Moore, Gordon E. Cramming more components onto integrated circuits. Proceedings of the IEEE, 1998, vol. 86, no. 1, pp. 82-85.