How can I read and process contents of every cell of a table in a DOCX file?
I am using Python 3.2 on Windows 7 and PyWin32 to access the MS-Word Document.
I am a beginner, so I don’t know the proper way to reach the table cells. So far I have just done this:
import win32com.client as win32
word = win32.gencache.EnsureDispatch('Word.Application')
word.Visible = False
doc = word.Documents.Open("MyDocument")
Answers:
Method 1
Jumping in rather late in life, but thought I’d put this out anyway:
Now (2015), you can use the pretty neat python-docx library:
https://python-docx.readthedocs.org/en/latest/. And then:
from docx import Document
wordDoc = Document('<path to docx file>')
# Walk every table, row, and cell in document order
for table in wordDoc.tables:
    for row in table.rows:
        for cell in row.cells:
            print(cell.text)
Method 2
Here is what works for me in Python 2.7:
import win32com.client as win32
word = win32.Dispatch("Word.Application")
word.Visible = 0
word.Documents.Open("MyDocument")
doc = word.ActiveDocument
To see how many tables your document has:
doc.Tables.Count
Then you can select the table you want by its index. Note that, unlike Python, COM indexing starts at 1:
table = doc.Tables(1)
To select a cell:
table.Cell(Row = 1, Column= 1)
To get its content:
table.Cell(Row =1, Column =1).Range.Text
Hope that this helps.
EDIT:
An example of a function that returns Column index based on its heading:
def Column_index(header_text):
    for i in range(1, table.Columns.Count + 1):
        # Word ends each cell's text with '\r\x07'; strip it before comparing
        if table.Cell(Row=1, Column=i).Range.Text.rstrip('\r\x07') == header_text:
            return i
Then you can access the cell you want this way, for example:
table.Cell(Row =1, Column = Column_index("The Column Header") ).Range.Text
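To visit every cell, as the question asks, the calls above can be combined into a loop. This is only a minimal sketch and not part of the original answer: `iter_cell_texts` is a hypothetical helper name, and it assumes a uniform table with no merged cells, since Word raises an error when asked for a cell that does not exist. It also strips the end-of-cell marker that Word appends to each cell's text.

```python
def iter_cell_texts(table):
    """Yield (row, column, text) for every cell of a Word COM table.

    Assumes a uniform grid with no merged cells. `table` is expected to be
    an object such as doc.Tables(1) from the snippet above.
    """
    for row in range(1, table.Rows.Count + 1):  # COM indexes start at 1
        for col in range(1, table.Columns.Count + 1):
            raw = table.Cell(Row=row, Column=col).Range.Text
            # Word appends '\r\x07' (end-of-cell marker) to each cell's text.
            yield row, col, raw.rstrip('\r\x07')


# Usage, assuming `doc` was opened as shown earlier:
# for row, col, text in iter_cell_texts(doc.Tables(1)):
#     print(row, col, text)
```

The helper only relies on the `Rows.Count`, `Columns.Count`, and `Cell(...).Range.Text` members already used in this answer, so it does not import anything itself.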
Method 3
I found a simple code snippet in the blog post "Reading Table Contents Using Python" by etienne.
The great thing about this is that you don’t need any non-standard Python libraries installed.
The format of a docx file is described at Office Open XML.
import zipfile
import xml.etree.ElementTree
WORD_NAMESPACE = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'
PARA = WORD_NAMESPACE + 'p'
TEXT = WORD_NAMESPACE + 't'
TABLE = WORD_NAMESPACE + 'tbl'
ROW = WORD_NAMESPACE + 'tr'
CELL = WORD_NAMESPACE + 'tc'
# A .docx file is a zip archive; the main body lives in word/document.xml
with zipfile.ZipFile('<path to docx file>') as docx:
    tree = xml.etree.ElementTree.XML(docx.read('word/document.xml'))
for table in tree.iter(TABLE):
    for row in table.iter(ROW):
        for cell in row.iter(CELL):
            # node.text is None for an empty text run, so fall back to ''
            print(''.join(node.text or '' for node in cell.iter(TEXT)))
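If the cell contents need to be processed rather than printed, the same approach can collect them into nested lists. The following is a minimal sketch under stated assumptions, not the blog author's code: `read_tables` is a hypothetical helper name, and nested tables are not handled specially (`iter()` descends into them, so their rows would also show up under the outer table).

```python
import xml.etree.ElementTree as ET
import zipfile

# WordprocessingML namespace, same value as WORD_NAMESPACE above
W = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'


def read_tables(docx_path):
    """Return every table in a .docx file as a list of rows of cell strings."""
    with zipfile.ZipFile(docx_path) as docx:
        tree = ET.fromstring(docx.read('word/document.xml'))
    tables = []
    for table in tree.iter(W + 'tbl'):
        rows = []
        for row in table.iter(W + 'tr'):
            # Join the text runs (w:t) of each cell; empty runs have text None.
            rows.append([
                ''.join(node.text or '' for node in cell.iter(W + 't'))
                for cell in row.iter(W + 'tc')
            ])
        tables.append(rows)
    return tables
```

With this, `read_tables('<path to docx file>')[0][1][2]` would be the text of the third cell in the second row of the first table, using Python's zero-based indexing.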
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.