Scientific Computing for Chemists: An Undergraduate Course in Simulations, Data Processing, and Visualization¶

Abstract:¶

The Scientific Computing for Chemists course taught at Wabash College teaches chemistry students to use the Python programming language, Jupyter notebooks, and a number of common Python scientific libraries to process, analyze, and visualize data. Assuming no prior programming experience, the course introduces students to basic programming and applies these skills to solve a variety of chemical problems. The course is structured around Jupyter notebooks as easily shareable documents for lectures, homework sets, and projects; the software used in this course is free and open source, making it easily accessible to any school or research lab.

Weiss, Charles J. "Scientific Computing for Chemists: An Undergraduate Course in Simulations, Data Processing, and Visualization." Journal of Chemical Education 94.5 (2017): 592-597.

What is a Jupyter Notebook?¶

What does it do, how does it work, and why should you use it?¶

"The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text. Uses include: data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and much more."

Notebooks can be shared on GitHub. Check out my GitHub repository.

Hidden Features:¶

List the %Magic Commands¶

%lsmagic

Available line magics:
%alias  %alias_magic  %autocall  %automagic  %autosave  %bookmark  %cat  %cd  %clear  %colors  %config  %connect_info  %cp  %debug  %dhist  %dirs  %doctest_mode  %ed  %edit  %env  %gui  %hist  %history  %killbgscripts  %ldir  %less  %lf  %lk  %ll  %load  %load_ext  %loadpy  %logoff  %logon  %logstart  %logstate  %logstop  %ls  %lsmagic  %lx  %macro  %magic  %man  %matplotlib  %mkdir  %more  %mv  %notebook  %page  %pastebin  %pdb  %pdef  %pdoc  %pfile  %pinfo  %pinfo2  %popd  %pprint  %precision  %profile  %prun  %psearch  %psource  %pushd  %pwd  %pycat  %pylab  %qtconsole  %quickref  %recall  %rehashx  %reload_ext  %rep  %rerun  %reset  %reset_selective  %rm  %rmdir  %run  %save  %sc  %set_env  %store  %sx  %system  %tb  %time  %timeit  %unalias  %unload_ext  %who  %who_ls  %whos  %xdel  %xmode

Available cell magics:
%%!  %%HTML  %%SVG  %%bash  %%capture  %%debug  %%file  %%html  %%javascript  %%js  %%latex  %%perl  %%prun  %%pypy  %%python  %%python2  %%python3  %%ruby  %%script  %%sh  %%svg  %%sx  %%system  %%time  %%timeit  %%writefile

Automagic is ON, % prefix IS NOT needed for line magics.

Example:¶

%ls

67-66-3-IR.jdx*               dance_of_the_goblins.mp3*
HPLC.xlsx*                    dance_of_the_goblins.wav*
JupyterNotebookExample.ipynb* integration_plot.png*
README.md                     ipython_log.py*
ThinkDSP/                     jcamp.txt*
Traj_1071_THR18_87.B_pub.mp4* plots.py*
cleopatra.flac*               plots.pyc
custom.css*                   testing.txt*

Use various languages:¶

Languages: Python, HTML, Bash, R, LaTex, etc.¶

MathJax (LaTeX-like)¶

Application of Hookes Law to vibrational frequency: $$\displaystyle \widetilde{\nu}={\frac{1}{2\pi c}}{\sqrt {\frac {k_{force}}{\mu}}} \tag{1}$$

Lennard-Jones potential: $$ LJ(r) = 4\epsilon[ {(\frac{\sigma}{r})}^{12} - {(\frac{\sigma}{r})}^{6} ] \tag{2} $$

Redox Reactions:¶

$$ [Fe(CN)_{6}]^{3-} + e^{-} \rightleftharpoons [Fe(CN)_{6}]^{4-} \hspace{0.5cm} E_{cell} = 0.356\hspace{0.1cm} V \tag{3} $$

Bash shell¶

%%bash
# Very basic bash script:

job="count"  # the object that you want to toggle
num=10 # the max value of our loop

# Creating a list of elements
enu=(zero one two three four five six \
     seven eight nine ten)

if [ $job == "count" ];                   # if the job = "count", then...
then
    for i in $(seq 0 $num);
    do
        if [ $i -eq 3 ] || [ $i -eq 7 ] ; # if i = 3 OR i = 7
        then echo ${enu[i]};              # then print the ith element
        
        else
            echo $i;                      
        fi
    done
else
    echo ${enu[$num]};                    # If job ≠ count, then...
fi

0
1
2
three
4
5
6
seven
8
9
10

Here is a list of all the Jupyter Kernels available.¶

They can be installed and used inside the notebook.¶

Sympy¶

import sympy #symbolic mathematics library in python
x,y,z = sympy.symbols('x,y,z')        # creating our variables
f = (2 + x) * (3 + x)                    # creating our function
F = '(2 + x)(3 + x)'                     # string of the function
print(sympy.expand(f),sympy.solve(f))
print(sympy.solve(f))
print(sympy.diff(f,x))                   # differentiate 
print(sympy.integrate(f,x))              # integrate

(x**2 + 5*x + 6, [-3, -2])
[-3, -2]
2*x + 5
x**3/3 + 5*x**2/2 + 6*x

sympy.init_printing()                  # pretty print
sympy.integrate(f,x)

%matplotlib inline 
#matplotlib inline =  have the plot render within the cell
import numpy as np # linear algebra/scientfic computing library

# from -50 to 50 with 22 evenly spaced points
x = np.linspace(-50,50,22)
print(len(x)) # print the length of the vector
print(x)      # print the vector

y_ = [] # appending elements to this list
for i in range(0,len(x)):
    y_.append(x[i]**3. /3. + 5.*x[i]**2. /2. + 6.*x[i])
    
y = np.array(y_) # convert the list into an array    

# Lets plot the function:
from matplotlib import pyplot as plt 

fig = plt.figure(figsize=(9,9))   # creating a figure object
ax = fig.add_subplot(211)         # setting the size of the
# plot of x and y and label it:
ax.plot(x,y,label='1st integration of %s'%F)          
ax.set_xlabel('x')              
ax.set_ylabel('y')    
ax.legend(loc='best')
fig.show()

22
[-50.         -45.23809524 -40.47619048 -35.71428571 -30.95238095
 -26.19047619 -21.42857143 -16.66666667 -11.9047619   -7.14285714
  -2.38095238   2.38095238   7.14285714  11.9047619   16.66666667
  21.42857143  26.19047619  30.95238095  35.71428571  40.47619048
  45.23809524  50.        ]

/Users/tuc41004/anaconda2/lib/python2.7/site-packages/matplotlib/figure.py:457: UserWarning: matplotlib is currently using a non-GUI backend, so cannot show the figure
  "matplotlib is currently using a non-GUI backend, "

# Import "plot" script of my own creation
from plots import simple_plot as sp 

# call on the function simple_plot:
sp(Type='scatter',size=211,fig_size=(9,9),x=x,y=y,
   xlabel='x',ylabel='y',color='k',invert_x_axis=False,
   fit=True,order=3,name='integration_plot.png')

LINEST data:
y=0.333333*x**3.+(2.500000)*x**2.+(6.000000)*x+(-0.000000)
<scipy.stats._distn_infrastructure.rv_frozen object at 0x10efd6850>
Ttest_indResult(statistic=-0.57273919167183229, pvalue=0.56987493264451738)

But, what if we wanted to look at the source-code?¶

%%bash
# Lets print the first 5 lines of this script:
head -n 5 ./plots.py

#!/usr/bin/env python

# Load Libraries:{{{
import numpy as np ############################# Linear Algebra Library
from scipy.optimize import fsolve

Data acquisition¶

Importing Data:¶

Using text files - ".dat";".txt";".csv";".log";".db";etc.¶

Using excel files¶

import openpyxl # from excel to python
wb = openpyxl.load_workbook('HPLC.xlsx')
Sheet1 = wb.get_sheet_by_name('Sheet1')

peak = [Sheet1.cell(row=i, column=k).value for i in range(2, 12) for k in range(1,2)]
area = [Sheet1.cell(row=i, column=k).value for i in range(2, 12) for k in range(4,5)]
height = [Sheet1.cell(row=i, column=k).value for i in range(2, 12) for k in range(5,6)]
tR = [Sheet1.cell(row=i, column=k).value for i in range(2, 12) for k in range(2,3)]
vol = [Sheet1.cell(row=i, column=k).value for i in range(2, 12) for k in range(3,4)]

print peak
print area
print vol
print tR
print height

[u'caff std', u'Caff + h2o', u'Caff 3 mL', u'Caff 6 mL', u'Caff 9 mL', u'Caff 12 mL', u'decaf +h2o', u'decaf 3mL', u'decaf 6mL', u'decaf 9mL']
[500000L, 215000L, 254000L, 285000L, 336000L, 370000L, 15400L, 42100L, 73700L, 122000L]
[None, 0L, 3L, 6L, 9L, 12L, 0L, 3L, 6L, 9L]
[2.125, 2.123, 2.125, 2.125, 2.127, 2.128, 2.118, 2.13, 2.13, 2.132]
[110000L, 43600L, 50800L, 56100L, 65700L, 69700L, 1960L, 6890L, 12700L, 21000L]

Playing With Sounds:¶

Plotting the Waveform for the first 4.5 seconds of "Dance of the Goblins" by Antonio Bazzini¶

Waveform $\rightarrow$ mp4¶

%%bash
cd /volumes/rdrive/Fall2017/Instrumental_Design/Project/
ffmpeg -i dance_of_the_goblins.mp3 dance_of_the_goblins.wav

bash: line 1: cd: /volumes/rdrive/Fall2017/Instrumental_Design/Project/: No such file or directory
ffmpeg version 3.4.2 Copyright (c) 2000-2018 the FFmpeg developers
  built with Apple LLVM version 9.0.0 (clang-900.0.39.2)
  configuration: --prefix=/usr/local/Cellar/ffmpeg/3.4.2 --enable-shared --enable-pthreads --enable-version3 --enable-hardcoded-tables --enable-avresample --cc=clang --host-cflags= --host-ldflags= --disable-jack --enable-gpl --enable-libass --enable-libfdk-aac --enable-libfontconfig --enable-libfreetype --enable-libmp3lame --enable-libopus --enable-libvorbis --enable-libvpx --enable-libx264 --enable-libx265 --enable-libxvid --enable-opencl --enable-videotoolbox --disable-lzma --enable-nonfree
  libavutil      55. 78.100 / 55. 78.100
  libavcodec     57.107.100 / 57.107.100
  libavformat    57. 83.100 / 57. 83.100
  libavdevice    57. 10.100 / 57. 10.100
  libavfilter     6.107.100 /  6.107.100
  libavresample   3.  7.  0 /  3.  7.  0
  libswscale      4.  8.100 /  4.  8.100
  libswresample   2.  9.100 /  2.  9.100
  libpostproc    54.  7.100 / 54.  7.100
Input #0, mp3, from 'dance_of_the_goblins.mp3':
  Metadata:
    major_brand     : dash
    minor_version   : 0
    compatible_brands: iso6mp41
    encoder         : Lavf56.25.101
  Duration: 00:04:58.03, start: 0.025057, bitrate: 192 kb/s
    Stream #0:0: Audio: mp3, 44100 Hz, stereo, s16p, 192 kb/s
    Metadata:
      encoder         : Lavc56.13
File 'dance_of_the_goblins.wav' already exists. Overwrite ? [y/N] Not overwriting - exiting

import os
wd = '/Volumes/RMR_4TB/Undergrad_courses/Instrumental_Design/Project/'
os.chdir(wd+'ThinkDSP/code')
from __future__ import print_function, division
import thinkdsp
import thinkplot
import scipy.fftpack
from scipy.fftpack import fft as fft
from scipy.fftpack import ifft as ifft
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
os.chdir(wd)

Create widget for playing inside notebook¶

wave = wd+'dance_of_the_goblins.wav'
response = thinkdsp.read_wave(wave)
start = 2.0
duration = response.duration/65.
response = response.segment(start=start, duration=duration)
response.shift(-start)
response.normalize()
print(duration,'seconds')

4.58470713414 seconds

response.make_audio()

segment = response.segment(start=start,duration=duration)
segment.normalize()
data = segment.quantize(bound=True,dtype=float)
print('Number of data points = ',len(data))

Number of data points =  113986

Setting x and y¶

# Waveform:
total_time = segment.duration+start
x = np.array([total_time/len(data)*i for i in range(0,len(data),1)])
y = [data[i] for i in range(0,len(data),1)]

# Fourier Transform
# Number of sample points
n = len(y) # len = length of a vector
d = 1/response.framerate
print('timestep = ',d)
FT_y = fft(y)
freq = np.fft.fftfreq(n,d)
xf,yf = [],[]
for i in range(0,len(FT_y)):
    if freq[i] > 0 and FT_y[i] >0:
        yf.append(FT_y[i])
        xf.append(freq[i])

timestep =  2.26757369615e-05

Plotting and Signal Filtering¶

%pylab inline
from plots import onebytwo
x1,x2 = x,xf
y1,y2 = y,yf
x1label,x2label = 'Time,t/(seconds)','Frequency, $\\nu$ /(Hz)'
y1label,y2label = 'Signal Intensity','Signal Intensity'
Type1,Type2 = 'line','line'
fit1,fit2 = False,False
color = 'k'
order1,order2 = 1,1
name = 'FT.png'
onebytwo(Type1=Type1,Type2=Type2,x1=x1,y1=y1,x2=x2,y2=y2,
         x1label=x1label,y1label=y1label,x2label=x2label,
         y2label=y2label,name=name,color=color,fit1=fit1,fit2=fit2,
        order1=order1,order2=order2,
         invert_x1_axis=False,invert_x2_axis=False)

Populating the interactive namespace from numpy and matplotlib

/Users/RR/anaconda2/lib/python2.7/site-packages/IPython/core/magics/pylab.py:161: UserWarning: pylab import has clobbered these variables: ['f', 'fft', 'ifft']
`%matplotlib` prevents importing * from pylab and numpy
  "\n`%matplotlib` prevents importing * from pylab and numpy"
/Users/RR/anaconda2/lib/python2.7/site-packages/numpy/core/numeric.py:531: ComplexWarning: Casting complex values to real discards the imaginary part
  return array(a, dtype, copy=False, order=order)

Weighted Moving Average Smoothing¶

The above method treats each point in the average the same and only takes the average with the immediately adjacent data points. The triangular smooth approach weights the different data points. For example, if we take the average with five data points as described below.

$$ S_{j} = \frac{D_{j−2} +2D_{j−1} +3D_{j} +2D_{j+1} +D_{j+1}}{ 9}$$

def tri_smooth(array):
    '''
    (ndarray) -> ndarray
    For an ndarray x, returns a new array xs as the weighted average of each point
    The new array xs is shorter than x, len(xs) = len(x)-4.
    '''
    sum = array[:-4] + 2*array[1:-3] + 3*array[2:-2] + 2*array[3:-1] + array[4:] 
    array_smooth = sum/9
    return(array_smooth)

Compare Moving Average and Savitzky–Golay Filter:¶

from scipy.fftpack import *
from scipy.signal import *

# Weighted Moving average:
xs = tri_smooth(array=np.array(xf))
ys = tri_smooth(array=np.array(yf))

# 
sfx = scipy.signal.savgol_filter(xf, window_length=55, polyorder=3)
sfy = scipy.signal.savgol_filter(yf, window_length=55, polyorder=3)
#x,y = scipy.ifft(X),scipy.ifft(Y)

%pylab inline

x1,x2 = xs,sfx
y1,y2 = ys,sfy
x1label,x2label = '','Frequency, $\\nu$ /(Hz)'
y1label,y2label = 'Signal Intensity','Signal Intensity'
Type1,Type2 = 'line','line'
fit1,fit2 = False,False
color = 'k'
order1,order2 = 1,1
name = 'FT.png'
onebytwo(Type1=Type1,Type2=Type2,x1=x1,y1=y1,x2=x2,y2=y2,
         x1label=x1label,y1label=y1label,x2label=x2label,
         y2label=y2label,name=name,color=color,fit1=fit1,fit2=fit2,
        order1=order1,order2=order2,
         invert_x1_axis=False,invert_x2_axis=False)

Populating the interactive namespace from numpy and matplotlib

/Users/RR/anaconda2/lib/python2.7/site-packages/IPython/core/magics/pylab.py:161: UserWarning: pylab import has clobbered these variables: ['fft2', 'rfftfreq', 'step', 'blackman', 'csd', 'ifftn', 'correlate', 'hanning', 'fftn', 'irfft', 'square', 'rfft', 'spectral', 'fft', 'convolve', 'detrend', 'diff', 'exponential', 'ifft2', 'bartlett', 'ifft', 'kaiser', 'hamming']
`%matplotlib` prevents importing * from pylab and numpy
  "\n`%matplotlib` prevents importing * from pylab and numpy"

If we wanted, we could compare these frequencies to notes played on a piano or violin¶

from IPython.core.display import display,HTML
# uncomment to render webpage in notebook
#display(HTML("https://en.wikipedia.org/wiki/Piano_key_frequencies"))

# Print the first 5000 frequencies, but only find the meaningful ones:
for i in range(0,len(yf)):
    if xf[i] <= 5000:
        #print(xf[i],yf[i])
        if yf[i] >= 1500 and yf[i] <= np.max(yf):
            print('High Intensity:',xf[i],yf[i])
        elif yf[i] <= 750 and yf[i] >= 500:
            print('Mid Intensity:',xf[i],yf[i])

Mid Intensity: 124.578457003 (599.246490355+93.5075799439j)
Mid Intensity: 244.514238591 (661.958345537-1041.80975188j)
Mid Intensity: 369.86647483 (674.382456481+409.785261663j)
High Intensity: 370.640254066 (1676.74743638-64.1712179436j)
Mid Intensity: 371.414033302 (725.545215007-1791.96148067j)
Mid Intensity: 374.509150247 (518.009259438-88.2589895154j)
Mid Intensity: 377.991156809 (683.853566481-424.141803253j)
Mid Intensity: 388.824066113 (557.481733683-13.2501579961j)
Mid Intensity: 433.316372186 (648.480327644+335.481788001j)
Mid Intensity: 494.444931834 (550.480488249-669.992062283j)
Mid Intensity: 498.700717632 (664.274249436+121.336504039j)
Mid Intensity: 501.02205534 (570.497927931+547.527433763j)
Mid Intensity: 502.569613812 (688.067674151-125.349911749j)
Mid Intensity: 526.55677013 (549.160479997+189.187274991j)
Mid Intensity: 616.702051129 (505.703582791+311.682746081j)
High Intensity: 619.797168073 (1538.80471177+642.965678797j)
High Intensity: 620.184057691 (1643.01242333-785.611098359j)
Mid Intensity: 745.923183549 (520.156615555+525.941166264j)
Mid Intensity: 990.824311758 (666.28585511-532.515190507j)
Mid Intensity: 1003.20477953 (621.363188902-10.3078738489j)
Mid Intensity: 1241.52878424 (535.379174535+302.805023687j)
Mid Intensity: 1241.91567385 (611.523613107-236.93916157j)
Mid Intensity: 1246.94523889 (609.664113881-149.173723693j)
Mid Intensity: 1747.58040461 (617.889090108+33.9607469876j)

Thank you for listening!¶

Special thanks to the following people: Dr. Vincent Voelz, Matt Hurley, Hongbin Wan & Yunhui Ge¶

Sources & Dependencies:¶

Anaconda - "Anaconda is the world’s most popular Python data science platform". "Package Management: Manage packages, dependencies and environments with conda"

Numpy, Jupyter/IPython, Matplotlib, Scipy, SymPy, Scikit-image, Pandas

These are not required:¶

nbextensions - Jupyter Notebook extensions. This consists of line numbering,cell folding, various layouts, spell checking, etc.