Scrapy-ing the NIST X-ray Attenuation Databases

Together with two colleagues, I am currently preparing a review paper on X-ray tomography of cementitious materials, for which I need to retrieve tabulated values of the X-ray mass attenuation coefficients of all elements. NIST provides such data (see X-Ray Mass Attenuation Coefficients by J. H. Hubbell and S. M. Seltzer). However, the data does not come in the form of simple files that can be downloaded. Rather, the values are presented in nicely formatted HTML tables (see here for an example).

In this post, I set out to extract these data into an HDF5 file. Of course, this looks like a job for Scrapy!

First steps with Scrapy

I am brand new to Scrapy, but found the excellent tutorial extremely useful. If you use Miniconda from Continuum Analytics, then installation of Scrapy is a very smooth procedure

conda create --name scrapy python=3 scrapy
activate scrapy

which installs Scrapy in a new environment (called scrapy).
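
To check that the installation went well, you can ask Scrapy for its version number (the exact output of course depends on the version that was installed)

scrapy version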

Downloading the table for one single element

In this section, we retrieve the data for the mass attenuation coefficient of one single element. We start with the data for hydrogen, which can be found here. Following the Scrapy tutorial, we first use the Scrapy shell.

scrapy shell "http://physics.nist.gov/PhysRefData/XrayMassCoef/ElemTab/z01.html"

Looking at the page source, we observe that the table, in ASCII format, is located within a pre tag. This suggests the following CSS selector

out = response.css('pre').extract_first()
print(out)

which produces the following abbreviated output

<pre>__________________________________

   <b>Energy</b><sub> </sub>       <i>µ</i>/<i>ρ</i>        <i>µ</i><sub>en</sub>/<i>ρ</i>
   <sup> </sup>(MeV)      (cm<sup>2</sup>/g)     (cm<sup>2</sup>/g)
__________________________________

 1.00000E-03  7.217E+00  6.820E+00
 1.50000E-03  2.148E+00  1.752E+00
 .................................
 1.50000E+01  2.539E-02  1.837E-02
 2.00000E+01  2.153E-02  1.606E-02
</pre>

This is already quite good. We now break the above output into lines, discard the first six lines (the opening tag and the header) as well as the last line (the closing tag), and convert each row to floats.

lines = out.splitlines()
table = [[float(x) for x in line.split()] for line in lines[6:-1]]
print(table)
[[0.001, 7.217, 6.82], [0.0015, 2.148, 1.752],
..................................................,
[15.0, 0.02539, 0.01837], [20.0, 0.02153, 0.01606]]

… and we're done! Or are we? Let's see what happens with aluminum.

activate scrapy
scrapy shell "http://physics.nist.gov/PhysRefData/XrayMassCoef/ElemTab/z13.html"
out = response.css('pre').extract_first()
lines = out.splitlines()
table = [[float(x) for x in line.split()] for line in lines[6:-1]]

This produces the following error

Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "<console>", line 1, in <listcomp>
  File "<console>", line 1, in <listcomp>
ValueError: could not convert string to float: 'K'

which has to do with the absorption edges of the element. Each edge is marked by a letter in the first column of the table

lines[9]
'K  1.55960E-03  3.957E+03  3.829E+03 '

So, we need to check whether each element of the current row is indeed a float (see the function is_float below).
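
As a quick check in the Scrapy shell, filtering the tokens of the offending row with such a function drops the edge letter and keeps only the numerical values (is_float is defined in the next section)

def is_float(x):
    try:
        float(x)
        return True
    except ValueError:
        return False

[x for x in lines[9].split() if is_float(x)]
['1.55960E-03', '3.957E+03', '3.829E+03']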

Downloading tables for all elements

All links follow the same pattern http://physics.nist.gov/PhysRefData/XrayMassCoef/ElemTab/zZZ.html, where ZZ is the atomic number. So we only need to generate a list of links in the Spider.

We start by creating a Scrapy project.

scrapy startproject scrapymu

We then edit the scrapymu/scrapymu/spiders/table3.py file

import scrapy

BASE_URL = 'http://physics.nist.gov/PhysRefData/XrayMassCoef'
TABLE3_URL = '/'.join([BASE_URL, 'ElemTab/z{:02d}.html'])


def is_float(x):
    try:
        float(x)
        return True
    except ValueError:
        return False


class Table3Spider(scrapy.Spider):
    name = 'table3'

    start_urls = [TABLE3_URL.format(z) for z in range(1, 93)]

    def parse(self, response):
        # Retrieving the atomic number from the URL
        z = int(response.url[-7:-5])

        pre = response.css('pre').extract_first()
        lines = pre.splitlines()
        # Keep only the tokens that parse as floats: this drops the
        # letters that mark absorption edges (the values themselves
        # remain strings at this stage)
        all_rows = ([x for x in line.split() if is_float(x)]
                    for line in lines[6:-1])
        rows = [r for r in all_rows if r != []]
        yield {z: rows}

For each element, the spider yields a single-item dict that maps the atomic number Z to the corresponding table (as a list of rows, each row being itself a list of strings). It can be run as follows

scrapy crawl table3 -o table3.json
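
If everything goes well, table3.json contains a list of single-item dicts, one per element. For instance, the item for hydrogen should look as follows (abbreviated; note that the values are still strings at this stage)

{"1": [["1.00000E-03", "7.217E+00", "6.820E+00"],
       ["1.50000E-03", "2.148E+00", "1.752E+00"],
       ...]}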

Retrieving the material constants of all elements

We will also need the data from Table 1. The spider is quite simple.

import scrapy

BASE_URL = 'http://physics.nist.gov/PhysRefData/XrayMassCoef'
TABLE1_URL = '/'.join([BASE_URL, 'tab1.html'])


def parse_table1_row(row):
    all_cols = (c.strip() for c in row.css('td::text').extract())
    cols = [c for c in all_cols if c != '']
    if cols != []:
        try:
            z = int(cols[0])
            # cols[4] (the mean excitation energy) is not needed here
            data = {'symbol': cols[1],
                    'name': cols[2],
                    'Z/A': cols[3],
                    'density [g/cm^3]': cols[5]}
            return z, data
        except ValueError:
            # Header rows do not start with an atomic number
            return None
    return None


class Table1Spider(scrapy.Spider):
    name = 'table1'

    start_urls = [TABLE1_URL]

    def parse(self, response):
        all_rows = (parse_table1_row(r) for r in response.css('tr'))
        rows = (r for r in all_rows if r is not None)
        yield {k: v for k, v in rows}

It returns a dict which maps the element's atomic number Z to a dict containing the element's symbol, name, Z/A ratio and density.
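
For illustration, the first entry of the yielded item should read approximately as follows (the numerical values are quoted here from the NIST page; note that they, too, are kept as strings)

{1: {'symbol': 'H', 'name': 'Hydrogen',
     'Z/A': '0.99212', 'density [g/cm^3]': '8.375E-05'},
 ...}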

Putting it all together in an HDF5 file

We are now ready to run our two spiders

scrapy crawl table1 -o table1.json
scrapy crawl table3 -o table3.json

You can download the json files here: Table 1 and Table 3. We use the json module from the standard library to read these files, and merge the single-item dicts that make up Table 3 into one single dict.

import json

with open('table1.json', 'r') as f:
    # The spider yielded one single item: a dict indexed by Z
    table1 = json.load(f)[0]
with open('table3.json', 'r') as f:
    # One single-item dict per element: merge them into one dict
    # (note that keys become strings through the json round-trip)
    table3 = dict(next(iter(i.items())) for i in json.load(f))

print(table1['1'])
print(table3['1'][0:5])

We are now ready to create our HDF5 file. Its structure is very simple: we create one dataset per element, labelled by the element's symbol. The table attached to the dataset holds the mass attenuation and mass energy-absorption coefficients. A few attributes (name, atomic number, Z/A ratio, density) are further attached to the dataset

import h5py
import numpy as np

name = 'X-ray_mass_absorption_coefficients.hdf5'

with h5py.File(name, 'w') as f:
    for z, attrs in table1.items():
        # The rows were stored as strings in the json file
        mu = [[np.float64(x) for x in row] for row in table3[z]]
        dset = f.create_dataset(attrs['symbol'],
                                data=np.asarray(mu))
        dset.attrs['name'] = attrs['name']
        dset.attrs['Z'] = np.int8(int(z))  # json keys are strings
        for key in ['Z/A', 'density [g/cm^3]']:
            dset.attrs[key] = np.float64(attrs[key])
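
As a quick sanity check, we can read the file back; the printed rows should match the hydrogen table that was retrieved above

with h5py.File(name, 'r') as f:
    dset = f['H']
    print(dset.attrs['name'], dset.attrs['Z'])
    print(dset[:2])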

X-ray mass attenuation coefficients of concrete

In order to check that everything went fine, we will use the above table to compute the X-ray mass attenuation coefficients of a typical concrete and compare the results to those of Table 4. The composition of the concrete under consideration is given in Table 2; it is reproduced below

Symbol   Z   Mass fraction
H        1        0.022100
C        6        0.002484
O        8        0.574930
Na      11        0.015208
Mg      12        0.001266
Al      13        0.019953
Si      14        0.304627
K       19        0.010045
Ca      20        0.042951
Fe      26        0.006435

So we need to combine the mass attenuation coefficients of all these elements linearly, according to the mixture rule (μ/ρ)_mix = Σ_i w_i (μ/ρ)_i, where w_i denotes the mass fraction of element i. We first start a new Python session, define the chemical composition and import the table. Then, we interpolate the tables linearly (in log-log space) and compute the linear combination.

Note that, in the reference data below, I shifted the sampling points slightly to the left and right of each absorption edge in order to facilitate interpolation.

import h5py
import matplotlib.pyplot as plt
import numpy as np

from scipy.interpolate import interp1d

input_name = 'X-ray_mass_absorption_coefficients.hdf5'
output_name = 'cement.png'

composition = {'H': 0.022100,
               'C': 0.002484,
               'O': 0.574930,
               'Na': 0.015208,
               'Mg': 0.001266,
               'Al': 0.019953,
               'Si': 0.304627,
               'K' : 0.010045,
               'Ca': 0.042951,
               'Fe': 0.006435}

ref_data = np.array([[1.00000E-03, 3.466E+03],
                     [1.03542E-03, 3.164E+03],
                     [1.07209999E-03, 2.889E+03],
                     [1.07210001E-03, 2.978E+03],
                     [1.18283E-03, 2.302E+03],
                     [1.30499999E-03, 1.775E+03],
                     [1.30500001E-03, 1.781E+03],
                     [1.50000E-03, 1.227E+03],
                     [1.55959999E-03, 1.104E+03],
                     [1.55960001E-03, 1.176E+03],
                     [1.69350E-03, 9.419E+02],
                     [1.83889999E-03, 7.525E+02],
                     [1.83890001E-03, 1.631E+03],
                     [2.00000E-03, 1.368E+03],
                     [3.00000E-03, 4.646E+02],
                     [3.60739999E-03, 2.804E+02],
                     [3.60740001E-03, 2.911E+02],
                     [4.00000E-03, 2.188E+02],
                     [4.03809999E-03, 2.131E+02],
                     [4.03810001E-03, 2.520E+02],
                     [5.00000E-03, 1.401E+02],
                     [6.00000E-03, 8.401E+01],
                     [7.11199999E-03, 5.187E+01],
                     [7.11200001E-03, 5.415E+01],
                     [8.00000E-03, 3.878E+01],
                     [1.00000E-02, 2.045E+01],
                     [1.50000E-02, 6.351E+00],
                     [2.00000E-02, 2.806E+00],
                     [3.00000E-02, 9.601E-01],
                     [4.00000E-02, 5.058E-01],
                     [5.00000E-02, 3.412E-01],
                     [6.00000E-02, 2.660E-01],
                     [8.00000E-02, 2.014E-01],
                     [1.00000E-01, 1.738E-01],
                     [1.50000E-01, 1.436E-01],
                     [2.00000E-01, 1.282E-01],
                     [3.00000E-01, 1.097E-01],
                     [4.00000E-01, 9.783E-02],
                     [5.00000E-01, 8.915E-02],
                     [6.00000E-01, 8.236E-02],
                     [8.00000E-01, 7.227E-02],
                     [1.00000E+00, 6.495E-02],
                     [1.25000E+00, 5.807E-02],
                     [1.50000E+00, 5.288E-02],
                     [2.00000E+00, 4.557E-02],
                     [3.00000E+00, 3.701E-02],
                     [4.00000E+00, 3.217E-02],
                     [5.00000E+00, 2.908E-02],
                     [6.00000E+00, 2.697E-02],
                     [8.00000E+00, 2.432E-02],
                     [1.00000E+01, 2.278E-02],
                     [1.50000E+01, 2.096E-02],
                     [2.00000E+01, 2.030E-02]], dtype=np.float64)


def interp_loglog(x, y):
    f = interp1d(np.log(x), np.log(y),
                 kind='linear', assume_sorted=True)
    return lambda t: np.exp(f(np.log(t)))


with h5py.File(input_name, 'r') as f:
    mu = {i: interp_loglog(f[i][:, 0], f[i][:, 1])
          for i in composition.keys()}

x = ref_data[:, 0]
y_expected = ref_data[:, 1]
y_actual = sum(composition[i]*mu[i](x) for i in composition.keys())

plt.style.use('../include/zenburn-light.mplstyle')
fig, ax = plt.subplots()
ax.loglog(x, y_expected, '-', label='NIST')
ax.loglog(x, y_actual, 'x', label='This post')
ax.legend()
ax.set_xlabel('Photon energy [MeV]')
ax.set_ylabel(r'$\mu/\rho [\mathrm{cm}^2/\mathrm{g}]$')

fig.tight_layout(pad=0.1)
fig.savefig(output_name, transparent=True)
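
Beyond the visual comparison, the agreement can be quantified by appending the following lines to the script (no specific value is claimed here, but the deviation should remain small)

# Maximum relative deviation from the NIST reference values
err = np.abs(y_actual / y_expected - 1.0)
print('Maximum relative deviation: {:.3%}'.format(err.max()))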

This produces the following figure, which shows excellent agreement with the NIST reference data (whose interpolation between tabulated values is more sophisticated than ours). This validates our Scrapy procedure!

[Figure: X-ray mass attenuation coefficient of a typical concrete vs. photon energy. The values computed in this post (crosses) agree with the NIST reference data (solid line).]

Closing words

In this post, we learned how to use Scrapy to retrieve some tabulated data that is not available as downloadable files. This is probably the most basic use of this tool, which is very simple, and a joy to use!

The whole scrapymu project is available on GitHub.
