Finding a length n needle in a haystack

April 3, 2018 mikefc

Problem: Find index of a short vector (needle) inside a larger vector (haystack)

What’s a good way to find the location of a needle in a haystack?

haystack <- sample(0:12, size = 2000, replace = TRUE)
needle   <- c(2L, 10L, 8L)

The haystack may be large.
The needle is definitely a vector
Only the index for the first occurrence of needle within haystack is needed
Assuming that there is always at least 1 occurrence

Solution 1: `for` loop

Naively test every position in a for loop
Advantage of this method over a vectorised approach, is that it can terminate early i.e. as soon as a value is found

forloop_find <- function(needle, haystack) {
  n <- length(needle) - 1L
  for (i in seq(haystack)) {
    if (identical(haystack[i:(i+n)], needle)) {
      return(i)
    }
  }
}

forloop_find(needle, haystack)

## [1] 34

Solution 2: `vapply`

vapply_find <- function(needle, haystack) {
  n <- length(needle) - 1L
  which(vapply(seq(haystack), function(i) {identical(haystack[i:(i+n)], needle)}, FUN.VALUE = logical(1L)))[1L]
}

vapply_find(needle, haystack)

## [1] 34

Solution 3: `zoo::rollapply`

rollapply_find <- function(needle, haystack) {
  which(zoo::rollapply(haystack, length(needle), identical, needle))[1L]
}

rollapply_find(needle, haystack)

## [1] 34

Solution 4: stacked `lead()` finding

for each item in the needle
- shift the haystack and check for the needle element
- stack the shifted haystacks and sum them
- check where sum == length(needle)

lead_find <- function(needle, haystack) {
  v  <- haystack == needle[1]
  for (i in seq(2, length(needle))) {
    v <- v + (lead(haystack, i-1L) == needle[i])
  }
  
  which(v == length(needle))[1L]
}

lead_find(needle, haystack)

## [1] 34

Solution 5: `shift`

data.table::shift is a vectorised version of dplyr::lead()
Adapted from user “Jaap” on stackoverflow

shift_find <- function(needle, haystack) {
  shifted_haystack <- data.table::shift(haystack, type='lead', 0:(length(needle)-1))
  
  v <- Map('==', shifted_haystack, needle)
  v <- Reduce('+', v)
  
  which(v == length(needle))[1]
}

shift_find(needle, haystack)

## [1] 34

Solution 5: Rcpp `for` loop

update 2018-04-04
- Shaved a few % points off the runtime by using the value of j instead of an explicit match count.

Rcpp::cppFunction('int rcpp_find(NumericVector needle, NumericVector haystack) {
    int needle_length   = needle.size();
    int haystack_length = haystack.size();

    for (int i = 0; i < (haystack_length - needle_length); i++ ) {
        int j;
        for (j = 0; j < needle_length; j++ ) {
                if (needle[j] != haystack[i + j]) {
                    break;
                }
        }

        if (j == needle_length) {
            return(i+1);
        }
    }

    return(0);

    }')


rcpp_find(needle, haystack)

## [1] 34

Solution 6: `pile` find (by dmytro)

dmytro on twitter offered this one.

pile_find <- function(needle, haystack) { 
  index <- seq_along(haystack) 
  for (i in seq_along(needle)) { 
    pile <- haystack[index] 
    index <- index[pile==needle[i]] + 1L
  } 
  (index-length(needle))[1]
} 

pile_find(needle, haystack)

## [1] 34

Solution 7: Sieved find (by carroll_jono)

carroll_jono on twitter offered this one.
This is the fastest of the pure-R methods.

sieved_find <- function(needle, haystack) { 
  sieved <- which(haystack == needle[1L]) 
  for (i in seq.int(1L, length(needle)-1L)) {
    sieved <- sieved[haystack[sieved + i] == needle[i + 1L]]
  }
  sieved[1L] 
}

sieved_find(needle, haystack)

## [1] 34

Solution 8: Coerce to string and use grep

Convert each integer into a string e.g. c(1, 2, 3) becomes “001.002.003”
Current tests are with integers less than 1000, so I’m safe to use %03i as a format string.

grep_find <- function(needle, haystack) {
  expanded_width <- 4  # 3 digits plus a "."
  res <- stringr::str_locate_all(paste(sprintf("%03i", haystack), collapse="."), paste(sprintf("%03i", needle), collapse="."))[[1]][,1][[1]]
  
  (res-1)/expanded_width + 1
}

grep_find(needle, haystack)

## [1] 34

Solution 9: Idomatic C++ solution hrbrmstr

Rcpp::cppFunction('int idiomaticrcpp_find(NumericVector needle, NumericVector haystack) {
  NumericVector::iterator it;
  it = std::search(haystack.begin(), haystack.end(), needle.begin(), needle.end());
  int pos = it - haystack.begin() + 1;
  if (pos > haystack.size()) pos = -1;
  return(pos);
}')

idiomaticrcpp_find(needle, haystack)

## [1] 34

Solution 10: Rcpp boyer-moore jimhester

Boyer-Moore implementation in Rcpp by jimhester

# https://gist.github.com/jimhester/2d02d63d0459d9fdadcf15e08c3afc0a
Rcpp::sourceCpp("cache/2018-04-03-boyer-moore.cpp")
boyermoorecpp_find(needle, haystack)

## [1] 34

`needle` is in the middle of the `haystack`

Table 1: `needle` in middle of `haystack`. microbenchmark() timings for 1000 runs - times are in microseconds
expr	min	lq	mean	median	uq	max	neval
rollapply_find(needle, haystack)	64622.974	67946.554	78132.17233	80610.9880	83993.7715	261446.362	1000
vapply_find(needle, haystack)	15304.346	16611.288	20492.53295	17070.3485	18881.6200	47565.512	1000
lead_find(needle, haystack)	6057.255	7833.419	14920.53701	8950.0515	18850.4280	199884.346	1000
shift_find(needle, haystack)	2809.894	3108.799	6250.15134	3670.6565	4361.3965	183883.048	1000
forloop_find(needle, haystack)	4768.181	5269.440	6790.98958	5457.1060	5717.7130	29957.481	1000
grep_find(needle, haystack)	3844.324	4002.460	4402.40560	4065.0250	4173.2350	184611.679	1000
pile_find(needle, haystack)	86.056	97.890	289.28966	102.9325	109.6500	21766.544	1000
sieved_find(needle, haystack)	41.337	53.283	75.26908	57.4095	63.0490	7256.632	1000
rcpp_find(needle, haystack)	16.063	22.517	38.00685	27.7005	31.9475	5173.280	1000
idiomaticrcpp_find(needle, haystack)	12.401	20.561	98.40539	25.3500	29.7495	20635.513	1000
boyermoorecpp_find(needle, haystack)	13.654	23.306	86.19733	28.1015	32.6330	23749.599	1000

`needle` is very close to beginning of `haystack`

The for loop and rcpp are the only methods to benefit from a needle that is thought to be at the beginning of the haystack i.e. they can both return early as soon as they find match.

When the needle is close to the beginning of the haystack, then the searching time is very short, and the for loop is quite competitive with custom Rcpp code.

Table 2: `needle` near start of `haystack`. microbenchmark() timings for 1000 runs - times are in microseconds
expr	min	lq	mean	median	uq	max	neval
pile_find(needle, haystack)	81.544	89.0805	139.33497	93.263	102.4065	2533.354	1000
sieved_find(needle, haystack)	38.744	42.1410	65.25783	43.756	47.5185	2463.339	1000
forloop_find(needle, haystack)	24.728	28.9800	45.13525	30.928	35.7875	2144.941	1000
rcpp_find(needle, haystack)	8.513	10.8565	99.77119	12.733	14.5555	57445.972	1000
boyermoorecpp_find(needle, haystack)	11.058	13.7330	39.87593	15.483	17.3915	2075.381	1000

Conclusion

Summary

Rcpp is the fastest way. No real difference between my C-style of C++ and idiomatic C++
Boyer-Moore in Rcpp no faster than a vanilla solution for current benchmark sizes
For a pure R solution, the sieve_find() method is fastest and very competitive with Rcpp

coolbutuseless

About

Github

License

Twitter

Recent Posts

r64 package - a c64/6502 assembler

Generating Executable ASCII art

Changing the default arguments to a function

Solving the 8 queens problem

Finding a length n needle in a haystack

Finding a length n needle in a haystack

Problem: Find index of a short vector (needle) inside a larger vector (haystack)

Solution 1: `for` loop

Solution 2: `vapply`

Solution 3: `zoo::rollapply`

Solution 4: stacked `lead()` finding

Solution 5: `shift`

Solution 5: Rcpp `for` loop

Solution 6: `pile` find (by dmytro)

Solution 7: Sieved find (by carroll_jono)

Solution 8: Coerce to string and use grep

Solution 9: Idomatic C++ solution hrbrmstr

Solution 10: Rcpp boyer-moore jimhester

`needle` is in the middle of the `haystack`

`needle` is very close to beginning of `haystack`

Conclusion

cool but useless

Recent Posts

r64 package - a c64/6502 assembler

Generating Executable ASCII art

Changing the default arguments to a function

Solving the 8 queens problem

Finding a length n needle in a haystack

Categories

About

About

Github

License

Twitter

Problem: Find index of a short vector (needle) inside a larger vector (haystack)

Solution 1: for loop

Solution 2: vapply

Solution 3: zoo::rollapply

Solution 4: stacked lead() finding

Solution 5: shift

Solution 5: Rcpp for loop

Solution 6: pile find (by dmytro)

Solution 7: Sieved find (by carroll_jono)

Solution 8: Coerce to string and use grep

Solution 9: Idomatic C++ solution hrbrmstr

Solution 10: Rcpp boyer-moore jimhester

needle is in the middle of the haystack

needle is very close to beginning of haystack

Conclusion

About

Solution 1: `for` loop

Solution 2: `vapply`

Solution 3: `zoo::rollapply`

Solution 4: stacked `lead()` finding

Solution 5: `shift`

Solution 5: Rcpp `for` loop

Solution 6: `pile` find (by dmytro)

`needle` is in the middle of the `haystack`

`needle` is very close to beginning of `haystack`