Project information
- Category: Algorithms Course Project
- Affiliation: Louisiana State University (LSU)
- Date: 2023
- Project URL: https://github.com/binaya07/fmindex
Backward Pattern Matching using FM-Index
Project Description
This project investigates the effectiveness of the FM-Index, a specialized data structure designed for efficient pattern matching within large text collections. The FM-Index leverages a combination of three key components: suffix arrays, the Burrows-Wheeler Transform (BWT), and wavelet trees. Each component plays a crucial role in achieving fast and accurate pattern identification.
We delve into the functionalities of each component:
- Suffix arrays: We explore how suffix arrays enable rapid retrieval of suffixes within a given string.
- Burrows-Wheeler Transform (BWT): We examine how BWT rearranges the string to group similar characters together, facilitating efficient pattern matching.
- Wavelet trees: We analyze how wavelet trees efficiently store and answer queries about the characters in the transformed string generated by BWT.
Our analysis of the FM-Index reveals its potential for efficient pattern matching in large text datasets. The combined functionality of suffix arrays, BWT, and wavelet trees allows for fast and accurate searches. Additionally, the project highlights the importance of block size selection in wavelet trees. While larger block sizes improve memory efficiency, they can also lead to slower pattern matching times. This finding emphasizes the need to carefully consider the application's priorities - faster search or lower memory usage - when implementing the FM-Index.