Documentation
Index
- Variables
- type ContentDefinedChunker
- func NewFastContentDefinedChunker(r Peeker, gearTable *GearTable) ContentDefinedChunker
- func NewMaxContentDefinedChunker(r Peeker, gearTable *GearTable, minSizeBytes, maxSizeBytes int) ContentDefinedChunker
- func NewRepMaxContentDefinedChunker(r Peeker, gearTable *GearTable, minSizeBytes, horizonSizeBytes int) ContentDefinedChunker
- func NewSimpleMaxContentDefinedChunker(r Peeker, gearTable *GearTable, minSizeBytes, maxSizeBytes int) ContentDefinedChunker
- func NewSimpleRepMaxContentDefinedChunker(r Peeker, gearTable *GearTable, minSizeBytes, horizonSizeBytes int) ContentDefinedChunker
- type GearTable
- type Peeker
Constants
This section is empty.
Variables
var FastContentDefinedChunkerGearTable = GearTable{ // contains filtered or unexported fields }
FastContentDefinedChunkerGearTable contains constants that match those used by various implementations of the FastCDC algorithm:
- https://github.com/nlfiedler/fastcdc-rs (Rust)
- https://github.com/buildbuddy-io/fastcdc2020 (Go)
- https://github.com/HIT-HSSL/destor/blob/master/src/chunking/fascdc_chunking.c (C)
Functions
This section is empty.
Types
type ContentDefinedChunker
ContentDefinedChunker can be used to decompose a large file into smaller chunks. Cutting points are determined by inspecting the binary content of the file.
func NewFastContentDefinedChunker
func NewFastContentDefinedChunker(r Peeker, gearTable *GearTable) ContentDefinedChunker
NewFastContentDefinedChunker returns a content defined chunker that uses the FastCDC8KB algorithm as described in the paper "The Design of Fast Content-Defined Chunking for Data Deduplication Based Storage Systems".
func NewMaxContentDefinedChunker
func NewMaxContentDefinedChunker(r Peeker, gearTable *GearTable, minSizeBytes, maxSizeBytes int) ContentDefinedChunker
NewMaxContentDefinedChunker returns a content defined chunker that uses an algorithm that is inspired by FastCDC. Instead of placing cutting points at the first position at which a rolling hash has a given number of zero bits, it uses the position at which the rolling hash is maximized.
This approach requires the algorithm to compute the rolling hash up to maxSizeBytes-minSizeBytes past the eventually chosen cutting point. To prevent this from being wasteful, this implementation stores cutting points on a stack that is preserved across calls.
Throughput of this implementation is expected to be nearly identical to that of plain FastCDC. Because chunk sizes are uniformly distributed instead of normal-like, the spread in chunk size is smaller. Furthermore, it is expected that this distribution also causes the sequence of chunks to converge more quickly after parts that differ between files have been processed.
func NewRepMaxContentDefinedChunker
func NewRepMaxContentDefinedChunker(r Peeker, gearTable *GearTable, minSizeBytes, horizonSizeBytes int) ContentDefinedChunker
NewRepMaxContentDefinedChunker returns a content defined chunker that expands upon MaxCDC, in that it repeatedly applies the chunking process until chunks are [minSizeBytes, 2*minSizeBytes) in size.
Like MaxCDC, this algorithm takes a parameter that controls the amount of data that is read ahead. Whereas MaxCDC uses it to bound the maximum chunk size, here it only determines the quality of the chunking that is performed (i.e., the horizon size). Setting it to zero leads to uniform chunks of minSizeBytes, while setting it to a positive value n guarantees that an optimal cutting point within offsets [minSizeBytes, minSizeBytes+n] is always respected.
Whereas MaxCDC performs poorly if the ratio between the maximum and minimum chunk size becomes too large, the horizon size can be increased freely without reducing quality, albeit with diminishing returns.
It has been observed that this algorithm provides an almost identical rate of deduplication to MaxCDC. The advantage of this algorithm over MaxCDC is that for a given input it is trivial to check whether it is already chunked, purely by looking at its size.
func NewSimpleMaxContentDefinedChunker
func NewSimpleMaxContentDefinedChunker(r Peeker, gearTable *GearTable, minSizeBytes, maxSizeBytes int) ContentDefinedChunker
NewSimpleMaxContentDefinedChunker returns a content defined chunker that provides the same behavior as the one returned by NewMaxContentDefinedChunker. However, this implementation is simpler and less efficient. It is merely provided for testing purposes.
func NewSimpleRepMaxContentDefinedChunker
func NewSimpleRepMaxContentDefinedChunker(r Peeker, gearTable *GearTable, minSizeBytes, horizonSizeBytes int) ContentDefinedChunker
NewSimpleRepMaxContentDefinedChunker returns a content defined chunker that provides the same behavior as the one returned by NewRepMaxContentDefinedChunker. However, this implementation is simpler and less efficient. It is merely provided for testing purposes.
type GearTable (added in v0.0.8)
type GearTable struct {
// contains filtered or unexported fields
}
GearTable is a table of 256 seemingly random 64-bit integers. These values are used by various implementations of ContentDefinedChunker to compute Gear hashes.
func NewSeededGearTable (added in v0.0.8)
NewSeededGearTable creates a GearTable that is initialized with values that are based on a seed.
The seed is hashed using cSHAKE128, and the constants are set to the first 2 KiB of data generated by the XOF, interpreted as a sequence of little-endian integers.


