Using the crystract Package for Crystallographic Analysis
Don Ngo, Julia Maria Hubner, Anirudh Prabhu
2026-03-18
crystract.RmdIntroduction
This vignette demonstrates the core workflow for using the
crystract package. The package is designed for the
batch processing of Crystallographic Information Files
(.cif) to extract structural and chemical information from the repeating
atomic arrangement that defines a crystal. It calculates key geometric
properties like bond lengths and angles and compiles the results into a
structured format suitable for large-scale data analysis.
We will first demonstrate the main batch-processing function,
analyze_cif_files. Then, to illustrate how it works, we
will walk through the analysis of a single file step-by-step, explaining
the underlying functions, the crystallographic principles, and the
mathematical formulas they employ.
1. The Core crystract Workflow
The primary goal of crystract is to automate the
analysis of many CIF files at once. While the package provides granular
functions for each step of the crystallographic analysis, the most
common use case is to leverage the main wrapper function for an
end-to-end pipeline.
1.1 The Full Pipeline for Batch Processing
The analyze_cif_files() function (and its single-file
counterpart analyze_single_cif()) is the cornerstone of the
package. It performs the entire sequence of operations—from reading
files to calculating bond angles with error propagation. You can easily
specify which bonding algorithms to run simultaneously.
# IMPORTANT: Update this path to point to your own downloaded CIF file.
# 1. Try to find the file in the installed package
cif_path <- system.file("extdata", "1590946.cif", package = "crystract")
# 2. If that fails, look in the source directory
if (cif_path == "") {
cif_path <- "../inst/extdata/1590946.cif"
}
# 3. Final check to provide a clear error if both fail
if (!file.exists(cif_path)) {
stop("Could not find 1590946.cif in installed package or inst/extdata/")
}
# Run the pipeline on our single example file, extracting multiple bonding types at once
analysis_results <- analyze_single_cif(
cif_path,
bonding_algorithms = c("minimum_distance", "brunner", "econ", "voronoi", "crystal_nn")
)
# Let's inspect the structure of the output table.
# It's a single row containing all our results in nested data.tables.
str(analysis_results, max.level = 2)
#> Classes 'data.table' and 'data.frame': 1 obs. of 27 variables:
#> $ file_name : chr "1590946.cif"
#> $ database_code : chr "depnum_ccdc_archive CCDC 1590946"
#> $ chemical_formula : chr "Si1 Sr2"
#> $ structure_type : logi NA
#> $ space_group_name : chr "P n m a"
#> $ space_group_number : chr "62"
#> $ unit_cell_metrics :List of 1
#> ..$ :Classes 'data.table' and 'data.frame': 1 obs. of 12 variables:
#> .. ..- attr(*, ".internal.selfref")=<externalptr>
#> $ atomic_coordinates :List of 1
#> ..$ :Classes 'data.table' and 'data.frame': 3 obs. of 14 variables:
#> .. ..- attr(*, ".internal.selfref")=<externalptr>
#> $ symmetry_operations :List of 1
#> ..$ :Classes 'data.table' and 'data.frame': 8 obs. of 3 variables:
#> .. ..- attr(*, ".internal.selfref")=<externalptr>
#> $ transformed_coords :List of 1
#> ..$ :Classes 'data.table' and 'data.frame': 12 obs. of 5 variables:
#> .. ..- attr(*, ".internal.selfref")=<externalptr>
#> $ expanded_coords :List of 1
#> ..$ :Classes 'data.table' and 'data.frame': 324 obs. of 8 variables:
#> .. ..- attr(*, ".internal.selfref")=<externalptr>
#> $ distances :List of 1
#> ..$ :Classes 'data.table' and 'data.frame': 969 obs. of 6 variables:
#> .. ..- attr(*, ".internal.selfref")=<externalptr>
#> $ bonds_minimum_distance :List of 1
#> ..$ :Classes 'data.table' and 'data.frame': 14 obs. of 8 variables:
#> .. ..- attr(*, ".internal.selfref")=<externalptr>
#> .. ..- attr(*, "sorted")= chr [1:3] "Atom1" "Atom2" "Distance"
#> $ cn_minimum_distance :List of 1
#> ..$ :Classes 'data.table' and 'data.frame': 3 obs. of 3 variables:
#> .. ..- attr(*, "sorted")= chr "Atom"
#> .. ..- attr(*, ".internal.selfref")=<externalptr>
#> $ angles_minimum_distance:List of 1
#> ..$ :Classes 'data.table' and 'data.frame': 30 obs. of 5 variables:
#> .. ..- attr(*, "sorted")= chr [1:3] "CentralAtom" "Neighbor1" "Neighbor2"
#> .. ..- attr(*, ".internal.selfref")=<externalptr>
#> $ bonds_brunner :List of 1
#> ..$ :Classes 'data.table' and 'data.frame': 24 obs. of 8 variables:
#> .. ..- attr(*, ".internal.selfref")=<externalptr>
#> .. ..- attr(*, "sorted")= chr [1:3] "Atom1" "Atom2" "Distance"
#> $ cn_brunner :List of 1
#> ..$ :Classes 'data.table' and 'data.frame': 3 obs. of 3 variables:
#> .. ..- attr(*, "sorted")= chr "Atom"
#> .. ..- attr(*, ".internal.selfref")=<externalptr>
#> $ angles_brunner :List of 1
#> ..$ :Classes 'data.table' and 'data.frame': 105 obs. of 5 variables:
#> .. ..- attr(*, "sorted")= chr [1:3] "CentralAtom" "Neighbor1" "Neighbor2"
#> .. ..- attr(*, ".internal.selfref")=<externalptr>
#> $ bonds_econ :List of 1
#> ..$ :Classes 'data.table' and 'data.frame': 36 obs. of 8 variables:
#> .. ..- attr(*, ".internal.selfref")=<externalptr>
#> .. ..- attr(*, "sorted")= chr [1:3] "Atom1" "Atom2" "Distance"
#> $ cn_econ :List of 1
#> ..$ :Classes 'data.table' and 'data.frame': 3 obs. of 3 variables:
#> .. ..- attr(*, "sorted")= chr "Atom"
#> .. ..- attr(*, ".internal.selfref")=<externalptr>
#> $ angles_econ :List of 1
#> ..$ :Classes 'data.table' and 'data.frame': 207 obs. of 5 variables:
#> .. ..- attr(*, "sorted")= chr [1:3] "CentralAtom" "Neighbor1" "Neighbor2"
#> .. ..- attr(*, ".internal.selfref")=<externalptr>
#> $ bonds_voronoi :List of 1
#> ..$ :Classes 'data.table' and 'data.frame': 42 obs. of 11 variables:
#> .. ..- attr(*, ".internal.selfref")=<externalptr>
#> .. ..- attr(*, "sorted")= chr [1:3] "Atom1" "Atom2" "Distance"
#> $ cn_voronoi :List of 1
#> ..$ :Classes 'data.table' and 'data.frame': 3 obs. of 3 variables:
#> .. ..- attr(*, "sorted")= chr "Atom"
#> .. ..- attr(*, ".internal.selfref")=<externalptr>
#> $ angles_voronoi :List of 1
#> ..$ :Classes 'data.table' and 'data.frame': 0 obs. of 4 variables:
#> .. ..- attr(*, ".internal.selfref")=<externalptr>
#> .. ..- attr(*, "sorted")= chr "CentralAtom"
#> $ bonds_crystal_nn :List of 1
#> ..$ :Classes 'data.table' and 'data.frame': 36 obs. of 8 variables:
#> .. ..- attr(*, ".internal.selfref")=<externalptr>
#> .. ..- attr(*, "sorted")= chr [1:3] "Atom1" "Atom2" "Distance"
#> $ cn_crystal_nn :List of 1
#> ..$ :Classes 'data.table' and 'data.frame': 3 obs. of 3 variables:
#> .. ..- attr(*, "sorted")= chr "Atom"
#> .. ..- attr(*, ".internal.selfref")=<externalptr>
#> $ angles_crystal_nn :List of 1
#> ..$ :Classes 'data.table' and 'data.frame': 0 obs. of 4 variables:
#> .. ..- attr(*, ".internal.selfref")=<externalptr>
#> .. ..- attr(*, "sorted")= chr "CentralAtom"
#> - attr(*, ".internal.selfref")=<externalptr>As shown in the structure output, the result is a tidy
data.table containing all extracted and calculated
information. We can easily access the nested data frames for further
analysis. Note that the output includes separate results for each
bonding algorithm requested, such as
bonds_minimum_distance, bonds_brunner,
bonds_voronoi, etc.
To get the final bonded pairs table from the Minimum Distance method (which includes propagated errors):
1.2 A Step-by-Step Walkthrough
To understand what happens under the hood, this section breaks down the process using individual functions. We will use the same CIF file to demonstrate each step.
1.2.1 Loading CIF Data
We use the package’s read_cif_files() function to load
data into memory.
# The path was defined in the previous section:
cif_data_list <- read_cif_files(cif_path)
# We'll work with the content of the first file
cif_content <- cif_data_list[[1]]
# Let's look at the first few lines of the raw data
knitr::kable(
head(cif_content),
caption = "First 6 lines of the raw CIF data."
)| V1 |
|---|
| ####################################################################### |
| # |
| # This file contains crystal structure data downloaded from the |
| # Cambridge Structural Database (CSD) hosted by the Cambridge |
| # Crystallographic Data Centre (CCDC) in cooperation with FIZ Karlsruhe. |
| # |
1.2.2 Extracting Metadata and Unit Cell Parameters
A crystal structure is defined by its unit cell and the arrangement of atoms within it. We first extract this high-level information.
database_code <- extract_database_code(cif_content)
chemical_formula <- extract_chemical_formula(cif_content)
space_group_name <- extract_space_group_name(cif_content)
space_group_number <- extract_space_group_number(cif_content)
cat("Database Code:", database_code, "\n")
#> Database Code: depnum_ccdc_archive CCDC 1590946
cat("Chemical Formula:", chemical_formula, "\n")
#> Chemical Formula: Si1 Sr2
cat("Space Group:", space_group_name, "(No.", space_group_number, ")\n")
#> Space Group: P n m a (No. 62 )Next, extract_unit_cell_metrics() extracts the six
parameters that define the shape and size of the unit cell: the lengths
of its three axes
(,
,
)
and the angles between them
(,
,
).
Their experimental uncertainties are also extracted.
unit_cell_metrics <- extract_unit_cell_metrics(cif_content)
print(unit_cell_metrics)
#> _cell_length_a _cell_length_b _cell_length_c _cell_angle_alpha
#> <num> <num> <num> <num>
#> 1: 8.11 5.15 9.54 90
#> _cell_angle_beta _cell_angle_gamma _cell_length_a_error
#> <num> <num> <num>
#> 1: 90 90 NA
#> _cell_length_b_error _cell_length_c_error _cell_angle_alpha_error
#> <num> <num> <num>
#> 1: NA NA NA
#> _cell_angle_beta_error _cell_angle_gamma_error
#> <num> <num>
#> 1: NA NA1.2.3 Extracting Atomic and Symmetry Data
Instead of listing every atom in the unit cell, a CIF file efficiently describes the structure using only the asymmetric unit: the smallest set of unique atoms. All other atoms in the unit cell can be generated by applying the crystal’s symmetry operations to this unique set.
# Extract the coordinates of the unique atoms in the asymmetric unit
atomic_coordinates <- extract_atomic_coordinates(cif_content)
print("Asymmetric Atomic Coordinates:")
#> [1] "Asymmetric Atomic Coordinates:"
print(atomic_coordinates)
#> Label TypeSymbol WyckoffSymbol WyckoffMultiplicity OxidationState
#> <char> <char> <char> <num> <num>
#> 1: Sr1 Sr0+ <NA> NA NA
#> 2: Sr2 Sr0+ <NA> NA NA
#> 3: Si1 Si0+ <NA> NA NA
#> ThermalParam Occupancy OccupancyError x_a y_b z_c x_error
#> <num> <num> <num> <num> <num> <num> <num>
#> 1: NA 1 NA 0.6529 0.25 0.0769 0
#> 2: NA 1 NA 0.5192 0.25 0.6748 0
#> 3: NA 1 NA 0.2539 0.25 0.1028 0
#> y_error z_error
#> <num> <num>
#> 1: 0 0
#> 2: 0 0
#> 3: 0 0
# Extract the symmetry operations
symmetry_operations <- extract_symmetry_operations(cif_content)
print("Symmetry Operations (first 6 of 8):")
#> [1] "Symmetry Operations (first 6 of 8):"
print(head(symmetry_operations))
#> x y z
#> <char> <char> <char>
#> 1: x+1/2 y -z+1/2
#> 2: x -y+1/2 z
#> 3: -x+1/2 y+1/2 z+1/2
#> 4: -x -y -z
#> 5: -x+1/2 -y z+1/2
#> 6: -x y+1/2 -z1.2.4 Generating the Full Crystal Structure
Now we use the asymmetric atoms and symmetry operations to computationally build the full crystal structure. This is a two-step process.
Formula Context: Symmetry Operations
A symmetry operation is an affine transformation that maps an initial fractional coordinate to a new coordinate . It consists of a rotation/reflection matrix and a translation vector .
The apply_symmetry_operations function applies all of
the crystal’s symmetry operations to each atom in the asymmetric unit,
generating the complete set of atoms within the primary unit cell. Note
that it requires the unit cell metrics to resolve distance tolerances
across boundaries.
Formula Context: Supercell Expansion
To find all nearest neighbors of an atom, we must also consider atoms
in adjacent unit cells. The expand_transformed_coords
function generates a supercell (in this case, 3x3x3) by
translating the primary unit cell atoms by integer vectors
,
where
each range from -1 to 1. An atom at fractional coordinate
generates new coordinates:
# Apply symmetry to generate all atoms in the primary unit cell
transformed_coords <- apply_symmetry_operations(atomic_coordinates, symmetry_operations, unit_cell_metrics)
print("Unique atoms in full unit cell (first 6):")
#> [1] "Unique atoms in full unit cell (first 6):"
print(head(transformed_coords))
#> Label SymOp x_a y_b z_c
#> <char> <int> <num> <num> <num>
#> 1: Sr1_1 1 0.1529 0.25 0.4231
#> 2: Sr1_2 2 0.6529 0.25 0.0769
#> 3: Sr1_3 3 0.8471 0.75 0.5769
#> 4: Sr1_4 4 0.3471 0.75 0.9231
#> 5: Sr2_1 1 0.0192 0.25 0.8252
#> 6: Sr2_2 2 0.5192 0.25 0.6748
# Expand into a supercell for neighbor calculations
expanded_coords <- expand_transformed_coords(transformed_coords)
print("Atoms in supercell (first 6):")
#> [1] "Atoms in supercell (first 6):"
print(head(expanded_coords))
#> Label SymOp x_a y_b z_c Tx Ty Tz
#> <char> <int> <num> <num> <num> <int> <int> <int>
#> 1: Sr1_1_-1_-1_-1 1 -0.8471 -0.75 -0.5769 -1 -1 -1
#> 2: Sr1_2_-1_-1_-1 2 -0.3471 -0.75 -0.9231 -1 -1 -1
#> 3: Sr1_3_-1_-1_-1 3 -0.1529 -0.25 -0.4231 -1 -1 -1
#> 4: Sr1_4_-1_-1_-1 4 -0.6529 -0.25 -0.0769 -1 -1 -1
#> 5: Sr2_1_-1_-1_-1 1 -0.9808 -0.75 -0.1748 -1 -1 -1
#> 6: Sr2_2_-1_-1_-1 2 -0.4808 -0.75 -0.3252 -1 -1 -11.2.5 Calculating Interatomic Distances
Formula Context: Interatomic Distance in a Triclinic System
Because unit cell axes are not always orthogonal, the simple Pythagorean theorem is insufficient. The distance between two atoms at fractional coordinates and requires the full crystallographic distance formula:
The calculate_distances function computes the distances
from each central atom (from the asymmetric unit) to all other atoms in
the generated supercell.
distances <- calculate_distances(atomic_coordinates, expanded_coords, unit_cell_metrics)
print("Calculated Distances (shortest 6):")
#> [1] "Calculated Distances (shortest 6):"
print(head(distances[order(Distance)]))
#> Atom1 Atom2 Distance DeltaX DeltaY DeltaZ
#> <char> <char> <num> <num> <num> <num>
#> 1: Si1 Sr1_1_0_0_0 3.163544 0.1010 0.0 -0.3203
#> 2: Sr1 Si1_1_0_0_0 3.163544 -0.1010 0.0 -0.3203
#> 3: Si1 Sr1_4_0_-1_-1 3.184477 -0.0932 0.5 0.1797
#> 4: Sr1 Si1_4_0_-1_-1 3.184477 -0.0932 0.5 0.1797
#> 5: Si1 Sr1_4_0_0_-1 3.184477 -0.0932 -0.5 0.1797
#> 6: Sr1 Si1_4_0_0_-1 3.184477 -0.0932 -0.5 0.17971.2.6 Identifying Bonds and Neighbors
Formula Context: Bonding and Coordination Number
Identifying which atoms are “bonded” is key to determining the
coordination number (CN). crystract
implements several robust, geometric, and mathematical approaches:
-
minimum_distance: Defines a custom distance cutoff for each central atom : Here, is the shortest distance from atom to any other atom, and is a tolerance parameter (default is 0.1). -
brunner: Identifies bonds by finding the largest gap in the reciprocal distances. Matchespymatgen.analysis.local_env.BrunnerNNReciprocal. -
econ: Uses Hoppe’s iterative method to calculate Effective Coordination Numbers, weighting neighbor contributions exponentially by distance. -
voronoi: Performs a 3D Voronoi tessellation to find neighbors sharing a face. Note:crystractcalculates these faces using the Delaunay dual. Results may slightly differ from exact Voronoi tessellation implementations (like Voro++) due to geometric handling of degenerate, co-planar vertices. -
crystal_nn: A sophisticated approach combining Voronoi solid angles, ionic radii distance cutoffs, and electronegativity weights.
Let’s run each algorithm to demonstrate:
# 1. Minimum Distance
bonds_min <- minimum_distance(distances, delta = 0.1)
# 2. Brunner's Method
bonds_brunner <- brunner_nn_reciprocal(distances)
# 3. Hoppe's ECoN
bonds_econ <- econ_nn(distances, atomic_coordinates)
# 4. Voronoi Tessellation
bonds_voronoi <- voronoi_nn(atomic_coordinates, expanded_coords, unit_cell_metrics)
# 5. CrystalNN
bonds_crystal_nn <- crystal_nn(distances, atomic_coordinates, expanded_coords, unit_cell_metrics)
cat("Minimum Distance found", nrow(bonds_min), "bonds.\n")
#> Minimum Distance found 14 bonds.
cat("CrystalNN found", nrow(bonds_crystal_nn), "bonds.\n")
#> CrystalNN found 36 bonds.
# Calculate integer neighbor counts based on the bonded pairs (e.g., using CrystalNN)
neighbor_counts <- calculate_neighbor_counts(bonds_crystal_nn)
print("CrystalNN Neighbor Counts:")
#> [1] "CrystalNN Neighbor Counts:"
print(neighbor_counts)
#> Atom CoordinationNumber
#> <char> <int>
#> 1: Sr1_2 12
#> 2: Sr2_2 15
#> 3: Si1_2 91.2.7 Calculating Bond Angles
Formula Context: Bond Angle Calculation
To calculate a bond angle A-B-C (with B at the vertex), the fractional coordinates must first be converted to an orthogonal Cartesian system.
- Fractional to Cartesian Conversion: A fractional coordinate is converted to a Cartesian coordinate via a transformation matrix :
where is the unit cell volume.
- Angle via Dot Product: With atoms in Cartesian space, the angle between vectors (from B to A) and (from B to C) is:
The calculate_angles function implements this for all
possible bond angles around each central atom. We’ll pass it our
bonds_min results:
bond_angles <- calculate_angles(
bonds_min,
atomic_coordinates,
expanded_coords,
unit_cell_metrics
)
print("Calculated Bond Angles (first 6):")
#> [1] "Calculated Bond Angles (first 6):"
print(head(bond_angles))
#> CentralAtom Neighbor1 Neighbor2 Angle
#> <char> <char> <char> <num>
#> 1: Si1 Sr1_1_0_0_0 Sr1_2_0_0_0 109.37260
#> 2: Si1 Sr1_1_0_0_0 Sr1_4_0_-1_-1 125.55190
#> 3: Si1 Sr1_1_0_0_0 Sr1_4_0_0_-1 125.55190
#> 4: Si1 Sr1_1_0_0_0 Sr2_1_0_0_-1 129.28796
#> 5: Si1 Sr1_1_0_0_0 Sr2_3_-1_-1_0 69.08689
#> 6: Si1 Sr1_1_0_0_0 Sr2_3_-1_0_0 69.086891.2.8 Error Propagation
Finally, crystract propagates the experimental
uncertainties from cell parameters and atomic coordinates to the
calculated distances and angles.
Formula Context: Error Propagation
The uncertainty, , in a calculated value that depends on several uncorrelated variables (each with uncertainty ) is found using the sum of squares of the partial derivatives:
Uncertainty in Interatomic Distance ()
The uncertainty in the distance, , is found by applying the general formula to the distance . The input parameters are the 12 variables that define the distance: . Letting , etc., the partial derivatives are, for example:
Uncertainty in Bond Angle ()
Propagating error to the bond angle is a more complex, multi-step process involving propagating initial errors first to Cartesian coordinates, and then from those Cartesian uncertainties to the final angle via the chain rule.
# Propagate errors for interatomic distances
bonded_pairs_with_error <- propagate_distance_error(
bonds_min,
atomic_coordinates,
unit_cell_metrics
)
print("Bonded Pairs with Distance Error (first 6):")
#> [1] "Bonded Pairs with Distance Error (first 6):"
print(head(bonded_pairs_with_error))
#> Key: <Atom1, Atom2, Distance>
#> Atom1 Atom2 Distance DeltaX DeltaY DeltaZ Weight
#> <char> <char> <num> <num> <num> <num> <num>
#> 1: Si1 Sr1_1_0_0_0 3.163544 0.1010 0.0 -0.3203 1.0000000
#> 2: Si1 Sr1_2_0_0_0 3.245310 -0.3990 0.0 0.0259 0.9748050
#> 3: Si1 Sr1_4_0_-1_-1 3.184477 -0.0932 0.5 0.1797 0.9934267
#> 4: Si1 Sr1_4_0_0_-1 3.184477 -0.0932 -0.5 0.1797 0.9934267
#> 5: Si1 Sr2_1_0_0_-1 3.261366 0.2347 0.0 0.2776 0.9700058
#> 6: Si1 Sr2_3_-1_-1_0 3.465249 0.2731 0.5 -0.0720 0.9129342
#> DistanceError
#> <num>
#> 1: 0
#> 2: 0
#> 3: 0
#> 4: 0
#> 5: 0
#> 6: 0
# Propagate errors for bond angles
bond_angles_with_error <- propagate_angle_error(
bond_angles,
atomic_coordinates,
expanded_coords,
unit_cell_metrics
)
print("Bond Angles with Angle Error (first 6):")
#> [1] "Bond Angles with Angle Error (first 6):"
print(head(bond_angles_with_error))
#> Key: <CentralAtom, Neighbor1, Neighbor2>
#> CentralAtom Neighbor1 Neighbor2 Angle AngleError
#> <char> <char> <char> <num> <num>
#> 1: Si1 Sr1_1_0_0_0 Sr1_2_0_0_0 109.37260 0
#> 2: Si1 Sr1_1_0_0_0 Sr1_4_0_-1_-1 125.55190 0
#> 3: Si1 Sr1_1_0_0_0 Sr1_4_0_0_-1 125.55190 0
#> 4: Si1 Sr1_1_0_0_0 Sr2_1_0_0_-1 129.28796 0
#> 5: Si1 Sr1_1_0_0_0 Sr2_3_-1_-1_0 69.08689 0
#> 6: Si1 Sr1_1_0_0_0 Sr2_3_-1_0_0 69.08689 02. Tools for Post-Processing and Analysis
After running the main analysis pipeline, crystract
provides several helper functions to filter and refine the results,
allowing you to isolate data based on chemical identity or key
crystallographic properties.
2.1 Filtering by Chemical Identity
The filter_atoms_by_symbol() function provides an
interactive way to filter results (like bond or angle tables) to focus
on the coordination environment around a specific type of element.
# In an interactive R session, you would run this:
filtered_bonds <- filter_atoms_by_symbol(
data_table = bonded_pairs_with_error,
atom_col = "Atom1" # Filter based on the central atom
)If you were to type Si and press Enter, the function
would return a new data.table containing only the rows
where the central atom (Atom1) is Silicon.
2.2 Filtering by Wyckoff Position
For many crystallographic studies, it is crucial to analyze atoms
based on their specific site symmetry, described by their
Wyckoff position (e.g., “2a”, “6c”). The
filter_by_wyckoff_symbol() function is designed for this
purpose.
# 1. In our example, the asymmetric atoms do not have Wyckoff information from the CIF.
# We will mock them as "4c" for this demonstration to show how the function works.
atomic_coordinates[, WyckoffSymbol := c("c", "c", "c")]
atomic_coordinates[, WyckoffMultiplicity := c(4, 4, 4)]
print("Atomic coordinates showing Wyckoff information:")
#> [1] "Atomic coordinates showing Wyckoff information:"
print(atomic_coordinates[, .(Label, WyckoffSymbol, WyckoffMultiplicity)])
#> Label WyckoffSymbol WyckoffMultiplicity
#> <char> <char> <num>
#> 1: Sr1 c 4
#> 2: Sr2 c 4
#> 3: Si1 c 4
cat("\n")
# 2. Filter bonds where the central atom is on the "4c" Wyckoff site.
bonds_from_4c_site <- filter_by_wyckoff_symbol(
data_table = bonded_pairs_with_error,
atomic_coordinates = atomic_coordinates,
atom_col = "Atom1",
wyckoff_symbols = "4c"
)
print(paste("Number of rows in original bond table:", nrow(bonded_pairs_with_error)))
#> [1] "Number of rows in original bond table: 14"
print(paste("Number of rows after filtering for site '4c':", nrow(bonds_from_4c_site)))
#> [1] "Number of rows after filtering for site '4c': 14"2.3 Filtering Ghost Distances Using Atomic Radii
A common issue in crystallography is site-occupancy disorder, where a single crystallographic site is statistically co-occupied by different atoms. Standard calculations can produce physically meaningless, short “ghost” distances.
The filter_ghost_distances() function cleans the
distance table by removing these artifacts. It uses a built-in table of
covalent radii to establish a plausible bond length range. Any
calculated distance falling outside this physical range (defined by a
margin) is filtered out.
Note: Since this function evaluates the entire supercell distance matrix, pairs of atoms that are simply far away will be logged as “TOO LONG”.
# A distance d is kept if: (r1+r2)*(1-margin) <= d <= (r1+r2)*(1+margin)
filtered_result <- filter_ghost_distances(
distances = distances,
atomic_coordinates = atomic_coordinates,
margin = 0.1 # Default margin is 10%
)
kept_distances <- filtered_result$kept
removed_distances <- filtered_result$removed
cat("Total distances calculated:", nrow(distances), "\n")
#> Total distances calculated: 969
cat("Distances kept after filtering:", nrow(kept_distances), "\n")
#> Distances kept after filtering: 24
cat("Unlikely / non-bonded distances removed:", nrow(removed_distances), "\n\n")
#> Unlikely / non-bonded distances removed: 945
print("Subset of removed distances (showing physically impossible / too long connections):")
#> [1] "Subset of removed distances (showing physically impossible / too long connections):"
print(head(removed_distances))
#> Atom1 Atom2 Distance expected_dist lower_bound upper_bound
#> <char> <char> <num> <num> <num> <num>
#> 1: Si1 Si1_1_-1_-1_-1 9.395616 2.34 2.106 2.574
#> 2: Si1 Si1_2_-1_-1_-1 13.539062 2.34 2.106 2.574
#> 3: Si1 Si1_3_-1_-1_-1 9.807429 2.34 2.106 2.574
#> 4: Si1 Si1_4_-1_-1_-1 5.238116 2.34 2.106 2.574
#> 5: Si1 Si1_1_0_-1_-1 9.395616 2.34 2.106 2.574
#> 6: Si1 Si1_2_0_-1_-1 10.841314 2.34 2.106 2.574
#> Reason
#> <char>
#> 1: Distance is TOO LONG
#> 2: Distance is TOO LONG
#> 3: Distance is TOO LONG
#> 4: Distance is TOO LONG
#> 5: Distance is TOO LONG
#> 6: Distance is TOO LONG2.4 Filtering by Element Exclusion
To focus an analysis on a specific part of a structure (e.g., a host
framework without guest atoms), filter_by_elements() allows
for the complete removal of bonds involving certain elements.
# Let's filter our bond table to exclude any bonds involving Strontium ("Sr").
# Since all bonds in this structure are Si-Sr, the result should be an empty table.
bonds_without_sr <- filter_by_elements(
distances = bonded_pairs_with_error,
atomic_coordinates = atomic_coordinates,
elements_to_exclude = "Sr"
)
cat("Number of bonds in original table:", nrow(bonded_pairs_with_error), "\n")
#> Number of bonds in original table: 14
cat("Number of bonds after excluding 'Sr':", nrow(bonds_without_sr), "\n")
#> Number of bonds after excluding 'Sr': 02.5 Calculating Weighted Average Network Bond Distance
A common goal is to compute a single summary statistic that represents a structure.
The calculate_weighted_average_network_distance()
function calculates a representative bond length for a
specified atomic network (defined by a set of Wyckoff sites).
This calculation should always be performed on a table of identified bonds to ensure the result is physically meaningful.
Formula Context: Weighted Average Network Distance
To obtain a single, representative bond length for an atomic network, the function calculates a true weighted average over all individual bonds in the unit cell. This is equivalent to calculating the total sum of all bond lengths within the network and dividing by the total number of bonds. This method correctly accounts for site multiplicity, site occupancy, and varying coordination numbers.
The formula is defined as:
$$ \large \bar{d}_\text{network} = \frac{\sum_j (m_j \cdot \text{occ}_j \cdot S_j)}{\sum_j (m_j \cdot \text{occ}_j \cdot n_j)} $$
Where, for each unique atomic site j in the asymmetric unit: - is the Wyckoff multiplicity (how many equivalent atoms are generated by symmetry). - is the site occupancy (accounts for disorder; 1.0 for a fully occupied site). - is the coordination number (the number of bonds for an atom at site j). - is the sum of all bond distances for a single atom at site j. It is calculated as:
$$ \large S_j = \sum_{i=1}^{n_j} d_{ij} $$
The numerator, , represents the grand total sum of the lengths of every bond in the unit cell. The denominator, , represents the total number of bonds in the unit cell.
Best Practices for Accurate Calculation
For the most physically meaningful and accurate results, it is crucial to perform this calculation as the final step in a filtering pipeline, especially when dealing with complex or disordered structures. The recommended workflow is:
- Calculate all potential interatomic distances using
calculate_distances(). - Clean this raw data by removing non-physical “ghost” distances with
filter_ghost_distances(). - Identify the true bonded pairs from the cleaned data using an
algorithm like
minimum_distance(). -
Only then, pass the resulting table of confirmed
bonds to
calculate_weighted_average_network_distance().
This “Ghosts -> Bonds -> Average” sequence ensures the final metric is calculated only over actual chemical bonds, preventing artifacts from skewing the result.
# Calculate the weighted average BOND distance for the entire Sr2Si network.
# First, identify the bonds in the structure. We use the `bonds_min` table
# created in section 1.2.6.
# Then, define the Wyckoff sites belonging to the network. Here, it's just "4c".
# (We assigned dummy "4c" and "4a" Wyckoff positions to our coordinates in section 2.2).
network_wyckoff_sites <- "4c"
# Apply the function to the table of identified bonds.
# For this simple, ordered structure, all occupancies are 1.0, but the function
# correctly applies the full formula.
weighted_avg_bond_dist <- calculate_weighted_average_network_distance(
distances = bonds_min, # Use the bond table as input
atomic_coordinates = atomic_coordinates,
wyckoff_symbols = network_wyckoff_sites
)
cat("Weighted average network bond distance for the '4c' sites:", weighted_avg_bond_dist, "Å\n")
#> Weighted average network bond distance for the '4c' sites: 3.281382 Å3. End-to-End Example: High-Throughput Batch Processing & CSV Extraction
While the previous examples demonstrated crystract on a
single file to explain the underlying mechanics, the package is heavily
optimized for processing thousands of structures simultaneously using
parallel workers.
Below is a complete, real-world example demonstrating how you might override the internal atomic radii tables with a custom dictionary, run an exhaustive parallel analysis using all five bonding algorithms, and then unnest the resulting tables to save the extracted Coordination Numbers (CNs) out to CSV files.
# --- 1. Define and set the custom radii dictionary ---
radii_dict <- c(
Ac=2.15, Ag=1.45, Al=1.21, Am=1.8, Ar=1.06, As=1.19, At=1.5, Au=1.36, B=0.84,
Ba=2.15, Be=0.96, Bi=1.48, Br=1.2, C=0.73, Ca=1.76, Cd=1.44, Ce=2.04, Cl=1.02,
Cm=1.69, Co=1.38, Cr=1.39, Cs=2.44, Cu=1.32, Dy=1.92, Er=1.89, Eu=1.98, F=0.57,
Fe=1.42, Fr=2.6, Ga=1.22, Gd=1.96, Ge=1.2, H=0.31, He=0.28, Hf=1.75, Hg=1.32,
Ho=1.92, I=1.39, In=1.42, Ir=1.41, K=2.03, Kr=1.16, La=2.07, Li=1.28, Lu=1.87,
Mg=1.41, Mn=1.5, Mo=1.54, N=0.71, Na=1.66, Nb=1.64, Nd=2.01, Ne=0.58, Ni=1.24,
Np=1.9, O=0.66, Os=1.44, P=1.07, Pa=2, Pb=1.46, Pd=1.39, Pm=1.99, Po=1.4, Pr=2.03,
Pt=1.36, Pu=1.87, Ra=2.21, Rb=2.2, Re=1.51, Rh=1.42, Rn=1.5, Ru=1.46, S=1.05,
Sb=1.39, Sc=1.7, Se=1.2, Si=1.11, Sm=1.98, Sn=1.39, Sr=1.95, Ta=1.7, Tb=1.94,
Tc=1.47, Te=1.38, Th=2.06, Ti=1.6, Tl=1.45, Tm=1.9, U=1.96, V=1.53, W=1.62,
Xe=1.4, Y=1.9, Yb=1.87, Zn=1.22, Zr=1.75
)
custom_radii_table <- data.table(
Symbol = names(radii_dict),
Radius = as.numeric(radii_dict),
Type = "covalent"
)
# Inject the custom table into the current crystract session
set_radii_data(custom_radii_table)
# --- 2. Run the batch analysis ---
cat("Starting batch analysis on CIF files...\n")
results <- analyze_cif_files(
file_paths = "path/to/cif_directory",
tolerance = 1e-4,
bonding_algorithms = c("minimum_distance", "brunner", "econ", "voronoi", "crystal_nn"),
calculate_bond_angles = FALSE, # Skip angles to speed up extraction
perform_error_propagation = FALSE, # Skip uncertainties
workers = 4 # Adjust based on your available CPU cores
)
# --- 3. Extract and Save Coordination Numbers to CSV ---
# Create a dedicated folder for the outputs to avoid clutter
output_dir <- "path/to/output_directory"
if (!dir.exists(output_dir)) dir.create(output_dir)
# Identify all coordination number columns generated by the pipeline
cn_columns <- grep("^cn_", names(results), value = TRUE)
cat("\nExtracting coordination numbers and saving to CSV...\n")
for (col in cn_columns) {
# Extract the list of data.tables for this specific algorithm
list_of_dts <- results[[col]]
# Associate each table with its CIF file name
names(list_of_dts) <- results$file_name
# Unnest and bind all tables together into one giant table
combined_dt <- rbindlist(list_of_dts, idcol = "file_name", fill = TRUE)
if (nrow(combined_dt) > 0) {
# Format the file path (e.g., "cn_crystal_nn_crystract.csv")
file_path <- file.path(output_dir, paste0(col, "_crystract.csv"))
# Save to CSV
fwrite(combined_dt, file_path)
cat(sprintf("Saved %d rows across %d files to: %s\n",
nrow(combined_dt),
uniqueN(combined_dt$file_name),
basename(file_path)))
} else {
cat(sprintf("No valid coordination data found for %s, skipping.\n", col))
}
}
cat("\nAll operations finished successfully! Files are in:", output_dir, "\n")This workflow ensures that you can adapt crystract’s
core metrics to your own chemical definitions while harnessing its
parallel speed to crunch massive datasets efficiently.
Conclusion
This vignette has demonstrated the core workflow of the
crystract package, from the high-level, automated
processing of multiple CIF files with analyze_cif_files to
a detailed, step-by-step breakdown of the underlying calculations. We
have seen how the package extracts fundamental data, builds a complete
crystal model, calculates key geometric properties across five different
mathematical definitions of bonding, and rigorously propagates
experimental uncertainties into these final results.
Furthermore, with a powerful suite of post-processing tools such as
filter_atoms_by_symbol,
filter_by_wyckoff_symbol,
filter_ghost_distances,
calculate_weighted_average_network_distance, and
user-defined radii tables, you can easily refine, clean, analyze, and
share the generated datasets to focus on the specific chemical or
crystallographic features of interest.
The goal of crystract is to provide a robust and
accessible platform for crystallographic data mining in R.