Pandoc—a friend
Problem: Suppose you have a table in a Word file (*.docx) and you want to extract its data into a pandas DataFrame.
Pandoc converts between many file formats. We will use Pandoc to convert the *.docx file to HTML.
$ brew install pandoc
$ pandoc -s settimo2014comparison.docx -t html -o settimo2014comparison.html
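If you would rather stay in Python for the whole workflow, the same conversion can be driven with subprocess. This is only a sketch: the helper names pandoc_command and docx_to_html are made up for illustration, and it assumes the pandoc executable is already installed and on your PATH.

```python
import shutil
import subprocess
from pathlib import Path


def pandoc_command(docx_path, html_path):
    """Build the same argument list as the pandoc command line above (hypothetical helper)."""
    return ["pandoc", "-s", str(docx_path), "-t", "html", "-o", str(html_path)]


def docx_to_html(docx_path):
    """Convert a .docx file to an .html file next to it and return the new path."""
    docx_path = Path(docx_path)
    html_path = docx_path.with_suffix(".html")
    # Fail early with a clear message if pandoc is not installed.
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc not found on PATH")
    # check=True raises CalledProcessError if pandoc exits with an error.
    subprocess.run(pandoc_command(docx_path, html_path), check=True)
    return html_path
```

Calling docx_to_html("settimo2014comparison.docx") should then write settimo2014comparison.html to the same directory.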
Now, load the HTML file using pandas. Note: read_html parses only <table> tags, so the surrounding text and figures are not an issue.
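import pandas as pd
# read_html returns a list of DataFrames, one per <table> found in the file
tables = pd.read_html("settimo2014comparison.html")
df = tables[0]  # the first table in the document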
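To see what read_html does with a page that mixes prose and several tables, here is a small self-contained sketch. The HTML string and its values are invented for illustration and merely stand in for pandoc's output. It assumes an HTML parser that pandas can use (such as lxml) is installed.

```python
from io import StringIO

import pandas as pd

# Made-up stand-in for a converted document: prose plus two tables.
html = """
<html><body>
<p>Some surrounding text that read_html ignores.</p>
<table>
  <thead><tr><th>sample</th><th>count</th></tr></thead>
  <tbody>
    <tr><td>A</td><td>12</td></tr>
    <tr><td>B</td><td>7</td></tr>
  </tbody>
</table>
<p>More text between the tables.</p>
<table>
  <thead><tr><th>group</th><th>dose</th></tr></thead>
  <tbody>
    <tr><td>control</td><td>0</td></tr>
    <tr><td>treated</td><td>5</td></tr>
  </tbody>
</table>
</body></html>
"""

# One DataFrame per <table> element, in document order.
tables = pd.read_html(StringIO(html))
first = tables[0]

# match keeps only the tables whose text matches the given regex.
dose_tables = pd.read_html(StringIO(html), match="dose")
second = dose_tables[0]

print(len(tables))
print(first)
print(second)
```

When a document contains many tables, selecting by position (tables[0]) is the simplest option, while match is handy if you know a word that appears only in the table you want.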