Binary search in a sorted text file

I have a big sorted file with billions of lines of variable lengths. Given a new line I would like to know which byte number it would get if it had been included in the sorted file.

Example

a
c
d
f
g

Given the input ‘foo’ I would get the output 9 (each line is two bytes including its newline, and bytes are counted from 1).

This is easy to do by simply reading through the whole file, but with billions of variable-length lines a binary search would be much faster.
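For reference, the linear scan mentioned above could look roughly like the following Ruby sketch. The method name `linear_byte_number` is made up for illustration. It assumes the file is sorted in plain byte order (for example with `LC_ALL=C sort`) and that byte numbers are counted from 1.

```ruby
require "tempfile"

# Linear-scan baseline (hypothetical helper, not an existing tool):
# returns the 1-based byte number at which `key` would start if it
# were inserted into the sorted file at `path`.
def linear_byte_number(path, key)
  offset = 0
  File.foreach(path) do |line|
    # Compare without the trailing newline; stop at the first line >= key.
    return offset + 1 if line.chomp >= key
    offset += line.bytesize
  end
  offset + 1 # key sorts after every line, so it would go at the end
end

# Small demo on a five-line sample file with one letter per line.
Tempfile.create("sorted") do |tmp|
  tmp.write("a\nc\nd\nf\ng\n")
  tmp.flush
  puts linear_byte_number(tmp.path, "foo") # prints 9
end
```

This reads every byte before the insertion point, which is exactly the cost a binary search avoids.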

Does such a text processing tool already exist?

Edit:

It does now: https://gitlab.com/ole.tange/tangetools/blob/master/2search

Answers:

Method 1

(This is not a correct answer to your question, just a starting point.)

I used sgrep (sorted grep) in a similar situation.

Unfortunately (we need the current state) it does not have a byte-offset output, but I think that could easily be added.

Method 2

I’m not aware of any standard tool that does this.
However, you can write your own. For example, the following Ruby script should do the job.

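file, key = ARGV.shift, ARGV.shift
# min starts at -1 so that offset 0 can be probed as well
min, max = -1, File.size(file)

File.open(file) do |f|
  # The probes below only look at lines that follow a newline, so the
  # first line is checked up front (this also covers an empty file).
  if f.eof? or f.readline.chomp>=key
    p 1
    exit
  end
  while max-min>1 do
    middle = (max+min)/2
    f.seek middle
    f.readline                  # skip the rest of the line we landed in
    # chomp, so the trailing newline does not take part in the comparison
    if f.eof? or f.readline.chomp>=key
      max = middle
    else
      min = middle
    end
  end
  # max now points into the last line that sorts before key
  f.seek max
  f.readline
  p f.pos+1                     # 1-based byte number of the insertion point
end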

It’s a bit tricky because after the seek you are usually in the middle of some line and therefore need to do one readline to get to the beginning of the following line, which you can read and compare to your key.
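The seek-then-readline step can be looked at in isolation. The sketch below wraps it in a helper (the name `next_line_after` is invented here) and runs it on a tiny sample file. Note that the line is skipped even when the seek lands exactly on a line start, so the very first line of the file is never returned by this step.

```ruby
require "tempfile"

# Hypothetical helper: seeks to byte `pos`, discards the rest of the line
# that contains `pos`, and returns [offset, line] for the next complete
# line. `line` is nil when the end of the file has been reached.
def next_line_after(f, pos)
  f.seek(pos)
  f.gets            # throw away the (usually partial) current line
  [f.pos, f.gets]   # offset of the following line, then the line itself
end

Tempfile.create("sorted") do |tmp|
  tmp.write("a\nc\nd\nf\ng\n")
  tmp.flush
  File.open(tmp.path, "rb") do |f|
    p next_line_after(f, 3) # lands on the newline after "c": [4, "d\n"]
    p next_line_after(f, 4) # lands on the start of "d", which is skipped: [6, "f\n"]
    p next_line_after(f, 9) # lands in the last line: [10, nil]
  end
end
```

Using `gets` instead of `readline` here is a choice made for the sketch: `gets` returns nil at end of file, while `readline` raises `EOFError`.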

Method 3

Based on Michas’ solution, here is a more complete program:

https://gitlab.com/ole.tange/tangetools/-/tree/master/2search

All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
