Binary search in a sorted text file
I have a big sorted file with billions of lines of variable lengths. Given a new line I would like to know which byte number it would get if it had been included in the sorted file.
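One way to approach this is a binary search over byte offsets rather than line numbers, which suits variable-length lines. The sketch below is only an illustration under stated assumptions: the file is sorted in plain byte order (as `LC_ALL=C sort` would produce), lines end in `\n`, and the function name `insertion_offset` is hypothetical. It returns the byte offset at which the new line would begin.

```python
import os


def insertion_offset(path, new_line):
    """Byte offset where new_line (bytes) would start in a byte-sorted file.

    Assumes the file is sorted by raw byte comparison and uses "\n" line ends.
    """
    key = new_line.rstrip(b"\n")

    with open(path, "rb") as f:
        size = os.fstat(f.fileno()).st_size

        def next_line_start(pos):
            # Start of the first line that begins at or after byte `pos`.
            if pos == 0:
                return 0
            f.seek(pos - 1)
            f.readline()  # consume through the newline that ends this line
            return f.tell()

        lo, hi = 0, size
        while lo < hi:
            mid = (lo + hi) // 2
            f.seek(next_line_start(mid))
            line = f.readline()
            if not line or line.rstrip(b"\n") >= key:
                hi = mid        # the answer starts at or before this line
            else:
                lo = mid + 1    # this line sorts before the key
        return next_line_start(lo)
```

Each step reads at most two lines, so the number of seeks grows with the logarithm of the file size in bytes, not with the number of lines.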
I have two sentence-aligned parallel corpora (text files) with about 50 million words, taken from the Europarl corpus (parallel translations of legal documents).
I’d now like to shuffle the lines of the two files, both in the same order. I wanted to do that with gshuf (I’m on a Mac), using a single shared random source.
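As an alternative to gshuf, the same idea can be sketched in Python: draw one permutation and apply it to both files, so aligned lines stay paired. This is a minimal sketch that assumes both files fit in memory and have the same number of lines; the function name `shuffle_parallel` and the `seed` parameter are hypothetical.

```python
import random


def shuffle_parallel(lines_a, lines_b, seed=0):
    """Shuffle two aligned lists of lines with one shared permutation."""
    if len(lines_a) != len(lines_b):
        raise ValueError("the two corpora must have the same number of lines")
    order = list(range(len(lines_a)))
    random.Random(seed).shuffle(order)  # one random source for both files
    return [lines_a[i] for i in order], [lines_b[i] for i in order]
```

Reading each file with `readlines()`, passing the two lists through this function, and writing the results back with `writelines()` keeps line *i* of one output aligned with line *i* of the other.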
This question comes from
How can I delete all text between curly brackets in a multiline text file? (just the same, but without the requirements for nesting).
% select all lines
A related question is here.
I want to create a large test file with lines containing dates listed by the second, but my method is taking inordinately long (or at least, that’s how it feels 🙂): 43 minutes to create only 1051201 lines, a 20.1 MB file.
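Spawning an external command for every line is a common cause of this kind of slowness. A hedged sketch of generating the timestamps in a single process follows. The output format, the start date, and the name `timestamp_lines` are assumptions, since the excerpt does not show the original method or format.

```python
from datetime import datetime, timedelta


def timestamp_lines(start, count, step_seconds=1):
    """Yield `count` timestamps as text, `step_seconds` apart."""
    step = timedelta(seconds=step_seconds)
    current = start
    for _ in range(count):
        yield current.strftime("%Y-%m-%d %H:%M:%S")
        current += step


def write_timestamps(path, start, count, step_seconds=1):
    """Write one timestamp per line to `path`."""
    with open(path, "w") as out:
        for stamp in timestamp_lines(start, count, step_seconds):
            out.write(stamp + "\n")
```

For example, `write_timestamps("dates.txt", datetime(2020, 1, 1), 1051201)` would write that many lines starting from an assumed date.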
Given a file with two columns:
I have a program that exits automatically upon reading an EOF in a given stream (in the following case, stdin).
Now I want to make a shell script that creates a named pipe and connects the program’s stdin to it. The script then writes to the pipe several times using echo and cat (and other tools that automatically generate an EOF when they exit). The problem I’m facing is that when the first echo finishes, it sends an EOF to the pipe and makes the program exit. If I use something like tail -f, then I can’t send an EOF when I intend to quit the program. I’ve been looking for a solution that balances the two, but to no avail.
I’ve already found both how to prevent EOFs and how to manually send an EOF but I can’t combine them. Is there any hint?
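One hint that may help: a reader of a named pipe sees EOF only when every writer has closed it. Keeping one extra write descriptor open for the whole session stops the EOF from individual writers, and closing that descriptor sends the EOF on purpose. The sketch below illustrates this in Python on a POSIX system; the reader thread stands in for the program, and the names `run_session` and `keeper` are hypothetical.

```python
import os
import tempfile
import threading


def run_session(chunks):
    """Send several separately opened writes through a FIFO, then one EOF."""
    fifo = os.path.join(tempfile.mkdtemp(), "in.fifo")
    os.mkfifo(fifo)
    received = []

    def reader():
        # Stand-in for the program: read() returns only at EOF.
        with open(fifo, "rb") as f:
            received.append(f.read())

    t = threading.Thread(target=reader)
    t.start()

    keeper = os.open(fifo, os.O_WRONLY)  # held open so no early EOF occurs
    for chunk in chunks:
        # Each writer opens and closes the pipe, like separate echo calls.
        with open(fifo, "wb") as w:
            w.write(chunk)
    os.close(keeper)                     # last writer closed: reader gets EOF

    t.join()
    os.remove(fifo)
    return received[0]
```

In a shell script the same pattern is usually expressed by opening the pipe on a spare file descriptor before the first write and closing that descriptor when the program should quit.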
I know how to combine the results of different commands
I encountered this use case today. It seems simple at first glance, but fiddling around with sort, uniq, sed and awk revealed that it’s nontrivial.