I have a big sorted file with billions of lines of variable lengths. Given a new line I would like to know which byte number it would get if it had been included in the sorted file.
Example
a\n
c\n
d\n
f\n
g\n
Given the input ‘foo’ I would get the output 9.
This is easy to do by simply reading through the whole file, but with billions of variable-length lines a binary search would be much faster.
Does such a text processing tool already exist?
Edit:
It does now: https://gitlab.com/ole.tange/tangetools/blob/master/2search
Answers:
Method 1
(This is not a correct answer to your question,
just a starting point.)
I used sgrep (sorted grep) in a similar situation.
Unfortunately, in its current state it does not have a byte-offset output, but I think that could easily be added.
Method 2
I’m not aware of any standard tool that does this.
However, you can write your own. For example, the following Ruby script should do the job.
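# Usage: ruby script.rb FILE KEY
# Assumes FILE is sorted in byte order (e.g. with LC_ALL=C sort).
file, key = ARGV.shift, ARGV.shift
File.open(file) do |f|
  # Move to the start of the first line that begins at or after byte offset pos.
  line_start = lambda do |pos|
    f.seek [pos - 1, 0].max
    f.gets if pos > 0           # finish the line containing byte pos-1
  end
  # min: largest offset known to lead to a line sorting before key (-1 = none yet)
  # max: smallest offset known to lead to a line >= key, or to end of file
  min, max = -1, f.size
  while max - min > 1
    middle = (max + min) / 2
    line_start.call(middle)
    if f.eof? || f.readline.chomp >= key
      max = middle
    else
      min = middle
    end
  end
  line_start.call(max)
  p f.pos + 1                   # 1-based byte number
end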
It’s a bit tricky because after the seek you are usually in the middle of some line and therefore need to do one readline to get to the beginning of the following line, which you can read and compare to your key.
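To make the seek-and-realign step concrete, here is a minimal sketch that checks a binary search against a plain linear scan on the example file from the question. The helper names (line_start_at_or_after, linear_byte_number, binary_byte_number) are made up for this illustration, and the sketch assumes the file is sorted in byte order; it is not the code used by any existing tool.

```ruby
require "tempfile"

# Hypothetical helper: byte offset of the first line starting at or after pos.
def line_start_at_or_after(io, pos)
  return 0 if pos <= 0
  io.seek(pos - 1)
  io.gets                 # read through the next newline (nil at end of file)
  io.pos
end

# Hypothetical baseline: read every line until one sorts at or after key.
def linear_byte_number(path, key)
  offset = 0
  File.foreach(path) do |line|
    return offset + 1 if line.chomp >= key
    offset += line.bytesize
  end
  offset + 1              # key sorts after every line
end

# Hypothetical binary search: smallest offset whose realigned line is >= key.
def binary_byte_number(path, key)
  File.open(path) do |f|
    lo, hi = 0, f.size
    while lo < hi
      mid = (lo + hi) / 2
      f.seek(line_start_at_or_after(f, mid))
      line = f.gets
      if line.nil? || line.chomp >= key
        hi = mid
      else
        lo = mid + 1
      end
    end
    line_start_at_or_after(f, lo) + 1   # 1-based byte number
  end
end

Tempfile.create("sorted") do |tmp|
  tmp.write("a\nc\nd\nf\ng\n")
  tmp.flush
  %w[A a b foo g z].each do |key|
    linear = linear_byte_number(tmp.path, key)
    binary = binary_byte_number(tmp.path, key)
    puts "#{key}: linear=#{linear} binary=#{binary}"
  end
end
```

For the key 'foo' both functions should report 9, matching the example in the question. Seeking to pos - 1 rather than pos means an offset that already sits on a line boundary is not skipped, which is what lets the search return the very first line when the key sorts before everything in the file.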
Method 3
Based on Michas' solution, here is a more complete program:
https://gitlab.com/ole.tange/tangetools/-/tree/master/2search
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.