This post will talk about how to read Word Documents with Python. We’re going to cover three different packages – docx2txt, docx, and my personal favorite: docx2python.
Let’s talk about docx2txt first. This is a Python package that allows you to scrape text and images from Word Documents. The example below reads in a Word Document containing the Zen of Python. As you can see, once we’ve imported docx2txt, all we need is one line of code to read in the text from the Word Document. We read in the document using a function in the package called process, which takes the name of the file as input. Regular text, listed items, hyperlink text, and table text will all be returned in a single string.
import docx2txt
# read in word file
result = docx2txt.process("zen_of_python.docx")
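Since process hands back one long string, a common next step is to split it into lines. Here is a minimal sketch of that idea. The sample_text value is a made-up stand-in for the string docx2txt would return, and non_empty_lines is a hypothetical helper of my own, not part of docx2txt.

```python
def non_empty_lines(text):
    """Split a block of text into stripped, non-empty lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]


# Stand-in for the string returned by docx2txt.process (assumed sample data)
sample_text = (
    "The Zen of Python\n\n"
    "Beautiful is better than ugly.\n\n"
    "Explicit is better than implicit.\n"
)

lines = non_empty_lines(sample_text)
print(lines[0])
print(len(lines))
```

With a real document you would pass result to the helper instead of sample_text.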
What if the file has images? In that case we just need a minor tweak to our code. When we call the process function, we can pass an extra parameter that specifies the name of an output directory. Running docx2txt.process will extract any images in the Word Document and save them into this specified folder. The text from the file is still extracted and stored in the result variable.
import docx2txt
result = docx2txt.process("zen_of_python_with_image.docx", "C:/path/to/store/files")
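If the output folder might not exist yet, I’d make sure it is there before pointing process at it. Below is a small sketch using only the standard library. ensure_dir is a hypothetical helper and the folder name is a placeholder. The docx2txt call is left as a comment because it needs the package and a real file.

```python
import os
import tempfile


def ensure_dir(path):
    """Create the folder (and any parents) if it is missing, then return the path."""
    os.makedirs(path, exist_ok=True)
    return path


# Demo in a temporary location so nothing is left behind (placeholder folder name)
with tempfile.TemporaryDirectory() as tmp:
    img_dir = ensure_dir(os.path.join(tmp, "extracted_images"))
    created = os.path.isdir(img_dir)
    # With docx2txt installed, the call from above would then be:
    # result = docx2txt.process("zen_of_python_with_image.docx", img_dir)

print(created)
```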
docx2txt will also scrape any text from tables. Again, this will be returned in a single string with any other text found in the document, which means this text can be more difficult to parse. Later in this post we’ll talk about docx2python, which allows you to scrape tables in a more structured format.
The source code behind docx2txt is derived from code in the docx package (installed from PyPI under the name python-docx), which can also be used to scrape Word Documents. docx is a powerful library for manipulating and creating Word Documents, but can also (with some restrictions) read in text from Word files.
In the example below, we open our sample Word file using the docx.Document function. Here we just input the name of the file we want to open. Then, we can scrape the text from each paragraph in the file using a list comprehension in conjunction with doc.paragraphs. This will include scraping separate lines defined in the Word Document for listed items. Unlike docx2txt, docx cannot scrape images from Word Documents. Also, reading doc.paragraphs this way will not scrape out hyperlinks or text in tables defined in the Word Document.
import docx
# open connection to Word Document
doc = docx.Document("zen_of_python.docx")
# read in each paragraph in file
result = [p.text for p in doc.paragraphs]
docx2python is another package we can use to scrape Word Documents. It has some additional features beyond docx2txt and docx. For example, it is able to return the text scraped from a document in a more structured format. Let’s test out our Word Document with docx2python. We’re going to add a simple table in the document so that we can extract that as well (see below).
docx2python contains a function with the same name. If we call this function with the document’s name as input, we get back an object with several attributes.
from docx2python import docx2python
# extract docx content
doc_result = docx2python('zen_of_python.docx')
Each attribute provides either text or information from the file. For example, consider that our file has three main components: the text containing the Zen of Python, a table, and an image. If we look at doc_result.body, each of these components will be returned as separate items in a list.
# get separate components of the document
doc_result.body
# get the text from Zen of Python
doc_result.body[0]
# get the table text
doc_result.body[1]
# get the image
doc_result.body[2]
The table text result is returned as a nested list, as you can see below. Each row (including the header) gets returned as a separate sub-list. The 0th element of the list refers to the header – or 0th row of the table. The next element refers to the next row in the table and so on. In turn, each value in a row is returned as an individual sub-list within that row’s corresponding list.
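To make the nesting concrete, here is a small sketch with a made-up table shaped the way the paragraph above describes. The column names and values are hypothetical, not taken from the sample document.

```python
# Hypothetical table in the nested-list shape described above:
# table -> rows -> cells -> list holding the cell's text
sample_table = [
    [["\tLanguage"], ["\tYear"]],   # row 0: the header
    [["\tPython"], ["\t1991"]],     # row 1
    [["\tR"], ["\t1993"]],          # row 2
]

header_row = sample_table[0]        # list of cells
first_data_row = sample_table[1]    # list of cells
first_cell = first_data_row[0]      # a one-item list
first_value = first_cell[0]         # the raw string, tab included

print(header_row)
print(first_value.strip("\t"))
```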
We can convert this result into a tabular format using pandas. The data frame is still a little messy: each cell in the data frame is a list containing a single value. This value also has quite a few "\t" characters (which represent tab spaces).
import pandas as pd
pd.DataFrame(doc_result.body[1][1:])
Here, we use the applymap method to apply the lambda function below to every cell in the data frame. This function gets the individual value within the list in each cell and strips the leading and trailing "\t" characters from it.
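import pandas as pd
# take the single value out of each cell's list and strip the surrounding tabs
# (pandas 2.1+ offers DataFrame.map as the newer name for applymap)
df = pd.DataFrame(doc_result.body[1][1:]).applymap(lambda val: val[0].strip("\t"))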
Next, let’s change the column headers to what we see in the Word file (which was also returned to us in doc_result.body).
df.columns = [val[0].strip("\t") for val in doc_result.body[1][0]]
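If pandas isn’t available, the same cleanup can be sketched in plain Python. The table below is made-up data in the same nested shape, and table_to_records is a hypothetical helper name of my own.

```python
def table_to_records(table):
    """Turn a nested table (rows -> cells -> [text]) into a list of dicts.

    The first row is treated as the header. Empty cells become "".
    """
    cleaned = [
        [cell[0].strip("\t") if cell else "" for cell in row]
        for row in table
    ]
    if not cleaned:
        return []
    header, rows = cleaned[0], cleaned[1:]
    return [dict(zip(header, row)) for row in rows]


# Made-up stand-in for doc_result.body[1]
sample_table = [
    [["\tLanguage"], ["\tYear"]],
    [["\tPython"], ["\t1991"]],
    [["\tR"], ["\t1993"]],
]

records = table_to_records(sample_table)
print(records)
```

Each record maps a column header to that row’s value, which is roughly what the data frame above holds.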
We can extract the Word file’s images using the images attribute of our doc_result object. doc_result.images consists of a dictionary where the keys are the names of the image files (not automatically written to disk) and the corresponding values are the image files in binary format.
type(doc_result.images) # dict
doc_result.images.keys() # dict_keys(['image1.png'])
We can write the binary-formatted image out to a physical file like this:
for key, val in doc_result.images.items():
    # the with block closes the file even if the write fails
    with open(key, "wb") as f:
        f.write(val)
Above we’re just looping through the keys (image file names) and values (binary images) in the dictionary and writing each out to file. In this case, we only have one image in the document, so we just get one written out.
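If you’d rather send the images to a specific folder, the loop can be wrapped in a small function. This is only a sketch: save_images is a hypothetical helper, and sample_images is a made-up stand-in for doc_result.images (the bytes are not a real PNG).

```python
import tempfile
from pathlib import Path


def save_images(images, out_dir):
    """Write each {file name: bytes} entry into out_dir and return the paths."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    written = []
    for name, data in images.items():
        # Keep only the base name so an entry cannot land outside out_dir
        target = out_path / Path(name).name
        target.write_bytes(data)
        written.append(target)
    return written


# Made-up stand-in for doc_result.images
sample_images = {"image1.png": b"placeholder bytes"}

with tempfile.TemporaryDirectory() as tmp:
    paths = save_images(sample_images, tmp)
    names = [p.name for p in paths]
    sizes = [p.stat().st_size for p in paths]

print(names, sizes)
```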
The docx2python result has several other attributes we can use to extract text or information from the file. For example, if we want to just get all of the file’s text in a single string (similar to docx2txt) we can run doc_result.text.
# get all text in a single string
doc_result.text
In addition to text, we can also get metadata about the file using the properties attribute. This returns information such as the creator of the document, the created / last modified dates, and number of revisions.
doc_result.properties
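Here is a rough sketch of how you might work with that metadata once you have it. The dictionary below is made-up sample data, and both the key names and the timestamp format are my assumptions, so check what your own file returns before relying on them.

```python
from datetime import datetime

# Made-up stand-in for doc_result.properties; key names and date format are assumptions
sample_properties = {
    "creator": "Jane Doe",
    "created": "2020-07-02T14:22:00Z",
    "modified": "2020-07-05T09:10:00Z",
    "revision": "4",
}


def parse_timestamp(value):
    """Parse a timestamp like 2020-07-02T14:22:00Z; return None if it does not match."""
    try:
        return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    except (TypeError, ValueError):
        return None


created = parse_timestamp(sample_properties.get("created"))
modified = parse_timestamp(sample_properties.get("modified"))
revisions = int(sample_properties.get("revision", "0"))

print(created, modified, revisions)
```

Using .get means a missing key gives None rather than an error.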
If the document you’re scraping has headers and footers, you can also scrape those out like this (note the singular version of “header” and “footer”):
# get the headers
doc_result.header
# get the footers
doc_result.footer
Footnotes can also be extracted like this:
doc_result.footnotes
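Assuming the headers, footers, and footnotes come back as nested lists like the body does (worth checking on your own file), a small recursive helper can collect just the text. flatten_text is a hypothetical name and the sample data is made up.

```python
def flatten_text(nested):
    """Recursively collect every non-empty string from arbitrarily nested lists."""
    if isinstance(nested, str):
        return [nested] if nested.strip() else []
    found = []
    for item in nested:
        found.extend(flatten_text(item))
    return found


# Made-up stand-in for doc_result.footnotes
sample_footnotes = [[[["First note", ""]], [["Second note"]]]]

notes = flatten_text(sample_footnotes)
print(notes)
```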
We can also ask the docx2python function to return HTML-style output that supports a few types of tags, including font (size and color), italics, bold, and underlined text. We just need to pass the parameter html=True. In the example below we see The Zen of Python in bold and underlined print, and the second snapshot below shows the HTML version of this. The HTML feature does not currently support table-related tags, so I would recommend using the method we went through above if you’re looking to scrape tables from Word documents.
doc_html_result = docx2python('zen_of_python.docx', html=True)
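If you later want the plain text back out of tagged output, the standard library’s HTML parser can drop the tags. This is a minimal sketch: the sample string is made up, the exact tags docx2python emits are an assumption on my part, and strip_tags is a hypothetical helper.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect only the text between tags, ignoring the tags themselves."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(html_text):
    """Return html_text with every tag removed."""
    collector = TextCollector()
    collector.feed(html_text)
    collector.close()
    return "".join(collector.parts)


# Made-up stand-in for one tagged paragraph (the exact tags are an assumption)
sample = "<b><u>The Zen of Python</u></b>, by Tim Peters"

plain = strip_tags(sample)
print(plain)
```

Because the parser ignores tag names, this works no matter which formatting tags are present.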
Hope you enjoyed this post!