How to read contents of an Table in MS-Word file Using Python?

How can I read and process contents of every cell of a table in a DOCX file?

I am using Python 3.2 on Windows 7 and PyWin32 to access the MS-Word Document.

I am a beginner so I don’t know proper way to reach to table cells. So far I have just done this:

import win32com.client as win32
word = win32.gencache.EnsureDispatch('Word.Application')
word.Visible = False 
doc = word.Documents.Open("MyDocument")

Answers:

Thank you for visiting the Q&A section on Magenaut. Please note that all the answers may not help you solve the issue immediately. So please treat them as advisements. If you found the post helpful (or not), leave a comment & I’ll get back to you as soon as possible.

Method 1

Jumping in rather late in life, but thought I’d put this out anyway:
Now (2015), you can use the pretty neat doc python library:
https://python-docx.readthedocs.org/en/latest/. And then:

from docx import Document

wordDoc = Document('<path to docx file>')

for table in wordDoc.tables:
    for row in table.rows:
        for cell in row.cells:
            print cell.text

Method 2

Here is what works for me in Python 2.7:

import win32com.client as win32
word = win32.Dispatch("Word.Application")
word.Visible = 0
word.Documents.Open("MyDocument")
doc = word.ActiveDocument

To see how many tables your document has:

doc.Tables.Count

Then, you can select the table you want by its index. Note that, unlike python, COM indexing starts at 1:

table = doc.Tables(1)

To select a cell:

table.Cell(Row = 1, Column= 1)

To get its content:

table.Cell(Row =1, Column =1).Range.Text

Hope that this helps.

EDIT:

An example of a function that returns Column index based on its heading:

def Column_index(header_text):
for i in range(1 , table.Columns.Count+1):
    if table.Cell(Row = 1,Column = i).Range.Text == header_text:
        return i

then you can access the cell you want this way for example:

table.Cell(Row =1, Column = Column_index("The Column Header") ).Range.Text

Method 3

I found a simple code snippet on a blog Reading Table Contents Using Python by etienne

The great thing about this is that you don’t need any non-standard python libraries installed.

The format of a docx file is described at Open Office XML.

import zipfile
import xml.etree.ElementTree

WORD_NAMESPACE = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'
PARA = WORD_NAMESPACE + 'p'
TEXT = WORD_NAMESPACE + 't'
TABLE = WORD_NAMESPACE + 'tbl'
ROW = WORD_NAMESPACE + 'tr'
CELL = WORD_NAMESPACE + 'tc'

with zipfile.ZipFile('<path to docx file>') as docx:
    tree = xml.etree.ElementTree.XML(docx.read('word/document.xml'))

for table in tree.iter(TABLE):
    for row in table.iter(ROW):
        for cell in row.iter(CELL):
            print ''.join(node.text for node in cell.iter(TEXT))


All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0

0 0 votes
Article Rating
Subscribe
Notify of
guest

0 Comments
Oldest
Newest Most Voted
Inline Feedbacks
View all comments
0
Would love your thoughts, please comment.x
()
x