pandoc—a friend

Pandoc—a friend

Problem: Suppose you have a table in Word file (*.docx) and you want to extract the data from it and put inside a Pandas DataFrame.

Pandoc will convert various types of file formats. We will use Pandas to convert *.docx to html.

$ brew install pandoc
$ pandoc -s settimo2014comparison.docx -t html -o settimo2014comparison.html

Now, load in the html file using pandas. Note: parses <table> tags, so text and figures is not an issue.

import pandas as pd
pd.read_html("settimo2014comparison.html")