Grep huge number of patterns from huge file

I have a file that’s growing about 200,000 lines a day, and it is all formed with blocks of three lines as such:

1358726575123       # key
    Joseph Muller   # name
    carpenter       # job
9973834728345
    Andres Smith
    student
7836472098652
    Mariah Anthony
    dentist

Now, I have another file from which I extract about 10,000 key patterns, such as 1358726575123. Then I run a for loop with these patterns and have to check them against the first file. If the file doesn’t contain such pattern, I save the pattern in a third file for further processing:

for number in $(grep -o '[0-9]{12}' file2); do  # finds about 10.000 keys
     if ! grep -q ^$number$ file1; then           # file1 is a huge file
         printf "$numbern" >>file3               # we'll process file3 later
     fi
done

The example code greps a huge file 10,000 times, and I run this loop about once a minute, during the whole day.

Since the huge file keeps growing, what can I do to make all this faster and save some CPU? I wonder whether sorting the file somehow by its key (if so, how?) or using a db instead of plain text would help…

Answers:

Thank you for visiting the Q&A section on Magenaut. Please note that all the answers may not help you solve the issue immediately. So please treat them as advisements. If you found the post helpful (or not), leave a comment & I’ll get back to you as soon as possible.

Method 1

The problem, of course, is that you run grep on the big file 10,000 times. You should read both files only once. If you want to stay outside scripting languages, you can do it this way:

  1. Extract all numbers from file 1 and sort them
  2. Extract all numbers from file 2 and sort them
  3. Run comm on the sorted lists to get what’s only on the second list

Something like this:

$ grep -o '^[0-9]{12}$' file1 | sort -u -o file1.sorted
$ grep -o  '[0-9]{12}'  file2 | sort -u -o file2.sorted
$ comm -13 file1.sorted file2.sorted > file3

See man comm.

If you could truncate the big file every day (like a log file) you could keep a cache of sorted numbers and wouldn’t need to parse it whole every time.

Method 2

This answer is based on the awk answer posted by potong..
It is twice as fast as the comm method (on my system), for the same 6 million lines in main-file and 10 thousand keys… (now updated to use FNR,NR)

Although awk is faster than your current system, and will give you and your computer(s) some breathing space, be aware that when data processing is as intense as you’ve described, you will get best overall results by switching to a dedicated database; eg. SQlite, MySQL…


awk '{ if (/^[^0-9]/) { next }              # Skip lines which do not hold key values
       if (FNR==NR) { main[$0]=1 }          # Process keys from file "mainfile"
       else if (main[$0]==0) { keys[$0]=1 } # Process keys from file "keys"
     } END { for(key in keys) print key }' 
       "mainfile" "keys" >"keys.not-in-main"

# For 6 million lines in "mainfile" and 10 thousand keys in "keys"

# The awk  method
# time:
#   real    0m14.495s
#   user    0m14.457s
#   sys     0m0.044s

# The comm  method
# time:
#   real    0m27.976s
#   user    0m28.046s
#   sys     0m0.104s

Method 3

Yes, definitely do use a database. They’re made exactly for tasks like this.

Method 4

This might work for you:

 awk '/^[0-9]/{a[$0]++}END{for(x in a)if(a[x]==1)print x}' file{1,2} >file3

EDIT:

Amended script to allow for duplicates and unknown keys in both files, still produces keys from the first file not present in the second:

 awk '/^[0-9]/{if(FNR==NR){a[$0]=1;next};if($0 in a){a[$0]=2}}END{for(x in a)if(a[x]==1)print x}' file{1,2} >file3

Method 5

With that much data, you should really switch to a database. In the meantime, one thing you must do to get anywhere near decent performance is not to search file1 separately for each key. Run a single grep to extract all the non-excluded keys at once. Since that grep also returns lines that don’t contain a key, filter those away.

grep -o '[0-9]{12}' file2 |
grep -Fxv -f - file1 |
grep -vx '[0-9]{12}' >file3

(-Fx means to search whole lines, literally. -f - means to read a list of patterns from standard input.)

Method 6

Permit me to reinforce what others have said, “Get thee to a database!”

There are MySQL binaries freely available for most platforms.

Why not SQLite? It’s memory-based, loading a flat-file when you start it, then closing it when you’re done. This means that if your computer crashes or the SQLite process goes away, so does all the data.

Your problem looks like just a couple lines of SQL, and will run in milliseconds!

After installing MySQL (which I recommend over other choices), I’d shell out $40 for O’Reilly’s SQL Cookbook by Anthony Molinaro, which has lots of problem patterns, starting with simple SELECT * FROM table queries, and going through aggregates and multiple joins.

Method 7

I’m not sure if this is the exact output that you are looking for, but probably the easiest way is:

grep -o '[0-9]{12}' file2 | sed 's/.*/^&$/' > /tmp/numpatterns.grep
grep -vf /tmp/numpatterns.grep file1 > file3
rm -f /tmp/numpatterns.grep

You could also use:

sed -ne '/.*([0-9]{12}.*/^1$/p' file2 > /tmp/numpatterns.grep
grep -vf /tmp/numpatterns.grep file1 > file3
rm -f /tmp/numpatterns.grep

Each of these creates a temporary pattern file which is used to glean out the numbers from the large file (file1).

Method 8

I fully agree with you getting a database (MySQL is fairly easy to use). Until you get that running, I like Angus’s comm solution, but so many people are trying with grep and getting it wrong that I thought I’d show the (or at least one) correct way to do it with grep.

grep -o '[0-9]{12}' keyfile | grep -v -f <(grep -o '^[0-9]{12}' bigfile)

The first grep gets the keys. The third grep (in the <(...)) takes all the keys used in the big file, and the <(...) passes it like a file as an argument to -f in the second grep. That causes the second grep to use it as a list of lines to match. It then uses this to match its input (the list of keys) from the pipe (first grep), and prints any keys extracted from the key file and not (-v) the big file.

Of course you can do this with temporary files you have to keep track of and remember to delete:

grep -o '[0-9]{12}'  keyfile >allkeys
grep -o '^[0-9]{12}' bigfile >usedkeys
grep -v -f usedkeys allkeys

This prints all lines in allkeys that don’t appear in usedkeys.

Method 9

The keyfile does not change? Then you should avoid searching the old entries again and again.

With tail -f you can get the output of a growing file.

tail -f growingfile | grep -f keyfile

grep -f reads the patterns from a file, one line as a pattern.

Method 10

Wasn’t going to post my answer because I thought that such amount of data shouldn’t be processed with a shell script, and the right answer to use a database was already given. But since now there are 7 other approaches…

Reads the first file in memory, then greps the second file for numbers and checks if values are stored in memory. It should be faster than multiple greps, if you have enough memory to load the whole file, that is.

declare -a record
while read key
do
    read name
    read job
    record[$key]="$name:$job"
done < file1

for number in $(grep -o '[0-9]{12}' file2)
do
    [[ -n ${mylist[$number]} ]] || echo $number >> file3
done

Method 11

I agree with @jan-steinman that you should use a database for this kind of task. There are lots of ways to hack together a solution with a shell script as the other answers show, but doing it that way will lead to a lot of misery if you’re going to use and maintain the code for any length of time more than just a one-day throw-away project.

Assuming you’re on a Linux box then you most likely have Python installed by default which includes the sqlite3 library as of Python v2.5. You can check your Python version with:

% python -V
Python 2.7.2+

I recommend using sqlite3 library because it is a simple file-based solution that exists for all platforms (including inside your web browser!) and it does not require a server to be installed. Essentially zero-configuration and zero-maintenance.

Below is a simple python script that will parse the file format that you gave as an example and then does a simple “select all” query and outputs everything it stored in the db.

#!/usr/bin/env python

import sqlite3
import sys

dbname = '/tmp/simple.db'
filename = '/tmp/input.txt'
with sqlite3.connect(dbname) as conn:
    conn.execute('''create table if not exists people (key integer primary key, name text, job text)''')
    with open(filename) as f:
        for key in f:
            key = key.strip()
            name = f.next().strip()
            job = f.next().strip()
            try:
                conn.execute('''insert into people values (?,?,?)''', (key, name, job))
            except sqlite3.IntegrityError:
                sys.stderr.write('record already exists: %s, %s, %sn' % (key, name, job))
    cur = conn.cursor()

    # get all people
    cur.execute('''select * from people''')
    for row in cur:
        print row

    # get just two specific people
    person_list = [1358726575123, 9973834728345]
    cur.execute('''select * from people where key in (?,?)''', person_list)
    for row in cur:
        print row

    # a more general way to get however many people are in the list
    person_list = [1358726575123, 9973834728345]
    template = ','.join(['?'] * len(person_list))
    cur.execute('''select * from people where key in (%s)''' % (template), person_list)
    for row in cur:
        print row

Yes, this means that you’ll need to learn some SQL, but it will be well worth it in the long run. Also, instead of parsing your log files, maybe you could write data directly to your sqlite database.


All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0

0 0 votes
Article Rating
Subscribe
Notify of
guest

0 Comments
Oldest
Newest Most Voted
Inline Feedbacks
View all comments
0
Would love your thoughts, please comment.x
()
x