This thesis deals with space-efficient algorithms to compress and index texts. The aim of compression is to reduce the size of a text by exploiting regularities such as repetitions or skewed character distributions. Indexing, on the other hand, aims at accelerating pattern matching queries (such as locating all occurrences of a pattern) with the help of data structures (indexes) built on the text. Despite being apparently distant concepts, compression and indexing are deeply interconnected: both exploit the inner structure of the text to, respectively, reduce its size and speed up pattern matching queries. It should not be surprising, therefore, that compression and indexing can be "forced" (actually, quite naturally) to coincide: compressed full-text indexes support fast pattern matching queries while taking almost the same space as the compressed text.

In the last two decades, several compressed text indexes based on the most efficient text compressors have been designed. These indexes can be roughly classified into two main categories: those based on suffix sorting (Burrows-Wheeler transform indexes, compressed suffix arrays/trees) and those based on the replacement of repetitions by pointers (Lempel-Ziv indexes, grammar indexes, block trees). Indexes based on a combination of the two techniques have also been proposed. In general, suffix sorting-based methods support very fast queries at the expense of space usage. This is due to several factors, ranging from weak compression methods (e.g. entropy compression, used in early FM-indexes, cannot capture long repetitions) to heavy structures (e.g. suffix array sampling) attached to the compressed text representation to speed up queries. The second class of indexes, on the other hand, offers strong compression rates, achieving up to exponential compression on very repetitive texts, at the expense of query times: often quadratic in the pattern length or, in the worst case, linear in the text length.
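The clustering effect behind run-length compression of the Burrows-Wheeler transform can be seen in a few lines of Python. The sketch below is ours, not the thesis's construction: it builds the BWT naively by sorting rotations and then run-length encodes it; on a repetitive input the transform collapses into a handful of runs, while the plain text does not.

```python
def bwt(text):
    """Naive BWT: sort all rotations of text + '$' and take the last column."""
    s = text + "$"                                  # unique sentinel, smallest character
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)

def run_length_encode(s):
    """Collapse maximal runs of equal characters into (char, length) pairs."""
    runs = []
    for c in s:
        if runs and runs[-1][0] == c:
            runs[-1] = (c, runs[-1][1] + 1)
        else:
            runs.append((c, 1))
    return runs

repetitive = "banana" * 50                          # highly repetitive 300-character text
transformed = bwt(repetitive)
# The plain text has one run per character (all neighbours differ),
# while its BWT clusters equal letters and compresses to a few runs.
print(len(run_length_encode(repetitive)), len(run_length_encode(transformed)))
```

Sorting rotations takes quadratic time and is only for illustration; the point is the output: the number of BWT runs, not the text length, governs the size of the run-length compressed representation.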
Among the most widely used compression techniques, run-length compression of the Burrows-Wheeler transform and Lempel-Ziv parsing (LZ77) have proved superior in the compression of very repetitive datasets. In this thesis we show that these two tools can be combined in a single index gathering the best features of the above indexes: fast queries (linear in the pattern length and logarithmic in the text length) and strong compression rates (up to exponential compression can be achieved). We describe an efficient implementation of our index and compare it with state-of-the-art alternatives. Our solution turns out to be as space-efficient as the lightest index described in the literature, while supporting queries up to three orders of magnitude faster.

Apart from index definition and design, a third concern regards index construction complexity. Often, the input text is too big to be fully loaded into main memory. Even when this is feasible, classic compression/indexing algorithms use heavy data structures such as suffix trees/arrays, which can easily take several times the space of the text. This is unsatisfactory, especially in cases where (i) the text is streamed and not stored anywhere (e.g. because of its size) and (ii) the compressed text is small enough to fit into main memory. A further contribution of this thesis consists of five algorithms compressing text within compressed working space and two recompression techniques (i.e. algorithms to convert between different compression formats without full decompression). The complete picture we offer consists of a set of algorithms to space-efficiently convert among:

- the plain text,
- two compressed self-indexes, and
- three compressed-file formats (entropy, LZ77, and run-length BWT).

The general idea behind all our compression algorithms is to read text characters from left to right and build a compressed dynamic Burrows-Wheeler transform of the reversed text.
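The phrase-based side can be illustrated the same way. The following toy LZ77 parser (a deliberately naive quadratic sketch, not the thesis's parser) greedily emits, at each position, the longest prefix of the remaining text that already occurs earlier, with overlapping (self-referential) sources allowed; fresh characters become single-character phrases.

```python
def lz77_parse(text):
    """Greedy LZ77: longest previously-occurring prefix, overlaps allowed."""
    phrases, i = [], 0
    while i < len(text):
        best_len, best_src = 0, -1
        for j in range(i):                          # candidate source positions
            l = 0
            while i + l < len(text) and text[j + l] == text[i + l]:
                l += 1                              # j + l may run past i: self-reference
            if l > best_len:
                best_len, best_src = l, j
        if best_len == 0:
            phrases.append((text[i], None))         # character seen for the first time
            i += 1
        else:
            phrases.append((best_src, best_len))    # copy phrase (source, length)
            i += best_len
    return phrases

# 1024 characters reduce to three phrases: 'a', 'b', and one long
# self-referential copy covering the remaining 1022 characters.
phrases = lz77_parse("ab" * 512)
print(phrases)
```

A periodic string of any length parses into a constant number of phrases here, which is the mechanism behind the exponential compression on very repetitive texts mentioned above.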
This structure is augmented with a dynamic suffix array sampling to support fast locate queries on text substrings. We employ three types of suffix array sampling: (i) evenly spaced, (ii) based on Burrows-Wheeler transform equal-letter runs, and (iii) based on Lempel-Ziv factors. Strategy (i) allows us to achieve entropy-compressed working space. Strategies (ii) and (iii) are novel and achieve a space usage proportional to the output size (i.e. the compressed file/index).

As a last contribution of this thesis, we turn our attention to a practical and usable implementation of our suite of algorithmic tools. We introduce DYNAMIC, an open-source C++11 library implementing dynamic compressed data structures. We prove almost-optimal theoretical bounds for the resources used by our structures, and show that these theoretical predictions are tightly verified in practice. The implementation of the compression algorithms described in this thesis on top of DYNAMIC meets our expectations: on repetitive datasets, our solutions turn out to be up to three orders of magnitude more space-efficient than state-of-the-art algorithms performing the same tasks.
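Strategy (i), evenly-spaced sampling, can be sketched in a self-contained toy (naive suffix-array and rank computations stand in for the thesis's dynamic structures; all names here are ours): store SA[i] only when SA[i] is a multiple of k, and recover any entry by stepping backwards with the LF mapping until a sampled row is reached, since each LF step moves one position to the left in the text.

```python
def build_fm(text, k=4):
    """Naive FM-index pieces: BWT, C array, and evenly-spaced SA samples."""
    s = text + "$"
    sa = sorted(range(len(s)), key=lambda i: s[i:])   # naive suffix array
    bwt = "".join(s[i - 1] for i in sa)               # s[-1] = '$' handles i == 0
    order, C, total = sorted(set(s)), {}, 0
    for c in order:                                   # C[c] = # characters < c in s
        C[c] = total
        total += s.count(c)
    samples = {r: sa[r] for r in range(len(sa)) if sa[r] % k == 0}
    return bwt, C, samples

def lf(bwt, C, i):
    """LF mapping: row i of the sorted-rotations matrix maps to row LF(i)."""
    c = bwt[i]
    return C[c] + bwt[:i + 1].count(c) - 1            # C[c] + rank of c at row i

def locate(bwt, C, samples, i):
    """Recover SA[i] from the samples: each LF step moves one position left."""
    steps = 0
    while i not in samples:
        i = lf(bwt, C, i)
        steps += 1
    return samples[i] + steps

text = "mississippi"
bwt_str, C, samples = build_fm(text)
full_sa = sorted(range(len(text) + 1), key=lambda i: (text + "$")[i:])
print(all(locate(bwt_str, C, samples, r) == full_sa[r] for r in range(len(full_sa))))
```

Every locate terminates within k LF steps because text position 0 is always sampled; the space/time trade-off is k sampled entries fewer versus up to k extra LF steps per query.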

Title: | Compressed Computation for Text Indexing |

Publication date: | 3 Apr 2017 |

Citation: | Compressed Computation for Text Indexing - Milano: Università degli studi di Milano, 2017 Apr 03 (29th cycle) |


Handle: | http://hdl.handle.net/11390/1132945 |

Appears in collections: | 8.2 Doctoral Thesis (OpenUniud) |

###### Files in this item:

File | Description | Type | License
---|---|---|---
10990_816_Compressed Computation for Text Indexing.pdf | Doctoral thesis | Not specified | Open Access