Shuffle two parallel text files

I have two sentence-aligned parallel corpora (text files) of about 50 million words, taken from the Europarl corpus (parallel translations of legal documents).
I’d now like to shuffle the lines of both files in the same way. I tried to do that with gshuf (I’m on a Mac), using one shared random source.

gshuf --random-source /path/to/some/random/data file1
gshuf --random-source /path/to/some/random/data file2

But I got the error message "end of file", apparently because the random seed needs to contain all the words that the file to be shuffled contains. Is that true? If yes, how should I create a random seed that is good for my needs?
If no, in what other way could I randomize the files in parallel?
I thought about pasting them together, randomizing and then splitting again. However, this seems ugly since I would need to first find a delimiter that doesn’t occur in the files.
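
For reference, the paste-and-split idea could be sketched as below. This is only an illustration, assuming that a tab character never occurs in either file (the function checks this first) and that both files have the same number of lines. The name shuffle_pair and the .shuf output suffix are hypothetical; on a Mac, shuf would be gshuf.

```shell
# Hypothetical helper: shuffle two line-aligned files identically by joining
# them on a tab, shuffling once, and splitting the result again.
# Assumes both files have the same number of lines.
shuffle_pair() {
  tab=$(printf '\t')
  # Refuse to run if the delimiter already occurs in either input.
  if grep -q "$tab" "$1" "$2"; then
    echo "tab found in input; pick another delimiter" >&2
    return 1
  fi
  # Join line by line, shuffle the joined lines, then split by column.
  paste "$1" "$2" | shuf > "$1.$$.joined" || return 1
  cut -f1 "$1.$$.joined" > "$1.shuf"
  cut -f2 "$1.$$.joined" > "$2.shuf"
  rm -f "$1.$$.joined"
}
```

Calling shuffle_pair file1 file2 would then write file1.shuf and file2.shuf with their lines in the same shuffled order.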

Answers:


Method 1

I do not know if there is a more elegant method but this works for me:

mkfifo onerandom tworandom threerandom
tee onerandom tworandom threerandom < /dev/urandom > /dev/null &
shuf --random-source=onerandom onefile > onefile.shuf &
shuf --random-source=tworandom twofile > twofile.shuf &
shuf --random-source=threerandom threefile > threefile.shuf &
wait

Result:

$ head -n 3 *.shuf
==> onefile.shuf <==
24532 one
47259 one
58678 one

==> threefile.shuf <==
24532 three
47259 three
58678 three

==> twofile.shuf <==
24532 two
47259 two
58678 two

Note that this only works if the files have exactly the same number of lines.
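
A quick way to verify the line counts before shuffling might look like this. It is a minimal sketch; same_line_count is a hypothetical helper name, not part of coreutils.

```shell
# Hypothetical helper: succeeds only if both files have the same line count.
# The command substitutions are left unquoted on purpose so that any
# padding some wc implementations print around the number is dropped.
same_line_count() {
  [ $(wc -l < "$1") -eq $(wc -l < "$2") ]
}

# Usage sketch:
# same_line_count onefile twofile || echo "line counts differ" >&2
```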


The GNU Coreutils documentation also provides a nice solution for repeatable randomness, using openssl as a seeded random generator:

https://www.gnu.org/software/coreutils/manual/html_node/Random-sources.html#Random-sources

get_seeded_random()
{
  seed="$1"
  openssl enc -aes-256-ctr -pass pass:"$seed" -nosalt \
    </dev/zero 2>/dev/null
}

shuf -i1-100 --random-source=<(get_seeded_random 42)

However, consider using a better seed than “42”, unless you want anyone else to be able to reproduce “your” random result, too.
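
Applied to the original question, the same seeded generator could be fed to shuf once per file, so that files with equal line counts end up in the same order. The following is a sketch rather than a tested recipe for large corpora: shuffle_with_seed and the .shuf suffix are hypothetical, bash is assumed for the <( ... ) syntax, and on a Mac shuf would be gshuf.

```shell
# Requires bash for the <( ... ) process substitution.
# Seeded byte stream, as in the Coreutils manual.
get_seeded_random()
{
  seed="$1"
  openssl enc -aes-256-ctr -pass pass:"$seed" -nosalt \
    </dev/zero 2>/dev/null
}

# Hypothetical helper: shuffle every given file with the same seed,
# writing the result to FILE.shuf. Each shuf call gets a fresh stream
# that starts from the same seed.
shuffle_with_seed() {
  seed="$1"; shift
  for f in "$@"; do
    shuf --random-source=<(get_seeded_random "$seed") "$f" > "$f.shuf"
  done
}

# Usage sketch:
# shuffle_with_seed 42 file1 file2
```

Because each file is shuffled in its own shuf call, no FIFOs or background jobs are needed, and the result can be reproduced later from the seed alone.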


All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
