Reading a text file and splitting it into single words in python

I have this text file made up of numbers and words, for example like this – 09807754 18 n 03 aristocrat 0 blue_blood 0 patrician and I want to split it so that each word or number will come up as a new line.

A whitespace separator would be ideal as I would like the words with the dashes to stay connected.

This is what I have so far:

f = open('words.txt', 'r')
for word in f:
    print(word)

not really sure how to go from here, I would like this to be the output:

09807754
18
n
3
aristocrat
...

Answers:

Thank you for visiting the Q&A section on Magenaut. Please note that all the answers may not help you solve the issue immediately. So please treat them as advisements. If you found the post helpful (or not), leave a comment & I’ll get back to you as soon as possible.

Method 1

Given this file:

$ cat words.txt
line1 word1 word2
line2 word3 word4
line3 word5 word6

If you just want one word at a time (ignoring the meaning of spaces vs line breaks in the file):

with open('words.txt','r') as f:
    for line in f:
        for word in line.split():
           print(word)

Prints:

line1
word1
word2
line2
...
word6

Similarly, if you want to flatten the file into a single flat list of words in the file, you might do something like this:

with open('words.txt') as f:
    flat_list=[word for line in f for word in line.split()]

>>> flat_list
['line1', 'word1', 'word2', 'line2', 'word3', 'word4', 'line3', 'word5', 'word6']

Which can create the same output as the first example with print 'n'.join(flat_list)

Or, if you want a nested list of the words in each line of the file (for example, to create a matrix of rows and columns from a file):

with open('words.txt') as f:
    matrix=[line.split() for line in f]

>>> matrix
[['line1', 'word1', 'word2'], ['line2', 'word3', 'word4'], ['line3', 'word5', 'word6']]

If you want a regex solution, which would allow you to filter wordN vs lineN type words in the example file:

import re
with open("words.txt") as f:
    for line in f:
        for word in re.findall(r'bwordd+', line):
            # wordN by wordN with no lineN

Or, if you want that to be a line by line generator with a regex:

 with open("words.txt") as f:
     (word for line in f for word in re.findall(r'w+', line))

Method 2

f = open('words.txt')
for word in f.read().split():
    print(word)

Method 3

As supplementary,
if you are reading a vvvvery large file, and you don’t want read all of the content into memory at once, you might consider using a buffer, then return each word by yield:

def read_words(inputfile):
    with open(inputfile, 'r') as f:
        while True:
            buf = f.read(10240)
            if not buf:
                break

            # make sure we end on a space (word boundary)
            while not str.isspace(buf[-1]):
                ch = f.read(1)
                if not ch:
                    break
                buf += ch

            words = buf.split()
            for word in words:
                yield word
        yield '' #handle the scene that the file is empty

if __name__ == "__main__":
    for word in read_words('./very_large_file.txt'):
        process(word)

Method 4

What you can do is use nltk to tokenize words and then store all of the words in a list, here’s what I did.
If you don’t know nltk; it stands for natural language toolkit and is used to process natural language. Here’s some resource if you wanna get started
[http://www.nltk.org/book/]

import nltk 
from nltk.tokenize import word_tokenize 
file = open("abc.txt",newline='')
result = file.read()
words = word_tokenize(result)
for i in words:
       print(i)

The output will be this:

09807754
18
n
03
aristocrat
0
blue_blood
0
patrician

Method 5

with open(filename) as file:
    words = file.read().split()

Its a List of all words in your file.

import re
with open(filename) as file:
    words = re.findall(r"([a-zA-Z-]+)", file.read())

Method 6

Here is my totally functional approach which avoids having to read and split lines. It makes use of the itertools module:

Note for python 3, replace itertools.imap with map

import itertools

def readwords(mfile):
    byte_stream = itertools.groupby(
        itertools.takewhile(lambda c: bool(c),
            itertools.imap(mfile.read,
                itertools.repeat(1))), str.isspace)

    return ("".join(group) for pred, group in byte_stream if not pred)

Sample usage:

>>> import sys
>>> for w in readwords(sys.stdin):
...     print (w)
... 
I really love this new method of reading words in python
I
really
love
this
new
method
of
reading
words
in
python
           
It's soo very Functional!
It's
soo
very
Functional!
>>>

I guess in your case, this would be the way to use the function:

with open('words.txt', 'r') as f:
    for word in readwords(f):
        print(word)


All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0

0 0 votes
Article Rating
Subscribe
Notify of
guest

0 Comments
Oldest
Newest Most Voted
Inline Feedbacks
View all comments
0
Would love your thoughts, please comment.x
()
x