Wrapping a Python module and exposing functionality to run within QCrBox#

This guide walks you through the process of encapsulating a Python module within a QCrBox container. Specifically, we'll focus on a module that queries the Crystallography Open Database (COD) for structures with similar elements and unit cell parameters. Our goal is to make this module's functionality accessible within a QCrBox container. The resulting container is already present in QCrBox as cod_check. However, we will go through the necessary steps to recreate the functionality.

Prerequisites#

Before starting, ensure your development environment is set up following the guide located here. During this tutorial you will work with Docker, Python, and YAML configuration files. If you're new to these concepts, you can simply type in the commands as listed in the tutorial. Alternatively, you can consult additional resources on Docker, Python modules, and YAML for foundational knowledge.

Initial Setup#

To begin, initialize a new QCrBox container for our module:

  1. Open your terminal.
  2. Type qcb init cod_check_tutorial and press Enter.
  3. You will be prompted to provide basic information about your application through a guided dialogue. Follow the prompts to complete the setup.
Please provide some basic information about your application.
The following dialog will guide you through the relevant settings.

  [1/7] Select application_type
    1 - CLI
    Choose from [1] (1): 1
  [2/7] application_slug (cod_check_tutorial):
  [3/7] application_name (Cod Check): COD Check
  [4/7] application_version (x.y.z): 0.0.1
  [5/7] description (Brief description of the application.): Can be used to check whether there is a similar structure in the crystallographic open database and output similar structures.
  [6/7] url (): https://my.official.module.url
  [7/7] email (): module_contact@university.somewhere

Created scaffolding for new application in 'T:\QCrBox_location\services\applications\cod_check_tutorial'.

Understanding the Generated Scaffolding#

Navigate to the application's folder to see the files generated by the boilerplate CLI. You'll find:

  • docker-compose.cod_check_tutorial.*.yml: Docker Compose files, typically unchanged for non-GUI applications.
  • sample_cmd.sh: An example bash file for CLI applications. This can be deleted.
  • Dockerfile: Contains instructions to build the container.
  • config_cod_check_tutorial.yaml: Future versions will use this to define exposed functions. Currently, it requests CIF keywords.
  • configure_cod_check_tutorial.py: Here, we'll implement our module's functionality and register it with QCrBox.

Next, download the Python file here and copy simple_cod_module.py into the cod_check_tutorial folder.

Adding the First Command to config_cod_check_tutorial.yaml#

In the first step of this tutorial, we aim to introduce a command that outputs the number of structures with matching unit cell parameters and elements, as specified in a CIF file, to a JSON file within the work folder. Refer to the simple_cod_module.py script to understand the functionalities we're integrating. We will use the cif_to_search_pars function to generate search parameters and then employ get_number_fitting_cod_entries to find the count of matching structures.
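To get a feel for what such search parameters look like, here is a minimal sketch of how element symbols could be extracted from a _chemical_formula_sum value. The helper elements_from_formula_sum is hypothetical and only illustrates the idea; it is not the implementation used by cif_to_search_pars in simple_cod_module.py.

```python
import re


def elements_from_formula_sum(formula_sum: str) -> list[str]:
    """Return the sorted, unique element symbols found in a CIF formula sum string.

    Hypothetical helper for illustration; simple_cod_module may work differently.
    """
    # An element symbol is one uppercase letter, optionally followed by one lowercase letter
    symbols = re.findall(r"[A-Z][a-z]?", formula_sum)
    return sorted(set(symbols))


print(elements_from_formula_sum("C6 H12 O6"))
```

The counts after each symbol are ignored here because the tutorial's search only needs to know which elements are present.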

The initial setup, generated by qcb init, has already populated the top section of the YAML file. Your task now is to customize the commands section with our specific command details. Start by renaming name from the placeholder to "get_number_fitting_cod_entries" and change implemented_as from "cli_command" to "python_callable". Also add the module we will import from by adding import_path: "configure_cod_check_tutorial" and the function we want to expose by adding callable_name: "qcb_get_number_fitting_cod_entries". Note that if the function name is the same as the exposed command name (for example, if the function were also called get_number_fitting_cod_entries), callable_name can be omitted and the value of the name entry is used.

The command will require three parameters:

  1. input_cif_path (QCrBox.input_cif): Specifies which CIF file to use for checking similar structures. The type of this argument is special in that it requires the specification of CIF entries. We will add a placeholder to be filled in the next section.
  2. cellpar_deviation_perc (float): Defines the maximum allowable deviation, in percentage, for unit cell parameters between COD structures and our target structure. The default value is set at 2.0%.
  3. listed_elements_only (boolean): When set to True, the search will only include entries containing the exact elements listed in the input_cif file. If False, the search will accept entries with additional elements beyond those listed. By default, this is set to False.
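As an illustration of what the percentage deviation means in practice, the following minimal sketch turns a cell parameter and a deviation in percent into a search window. The function cell_parameter_range is hypothetical, and the symmetric relative window is an assumption; the actual query logic lives in simple_cod_module.py and may differ.

```python
def cell_parameter_range(value: float, deviation_perc: float = 2.0) -> tuple[float, float]:
    """Return the (lower, upper) bounds allowed for one unit cell parameter.

    Assumes a symmetric relative window around the target value.
    """
    # Convert the percentage into a fraction, as parse_input does later in this tutorial
    deviation = deviation_perc / 100.0
    return value * (1.0 - deviation), value * (1.0 + deviation)


lower, upper = cell_parameter_range(10.0, deviation_perc=2.0)
print(f"a may lie between {lower:.2f} and {upper:.2f} Angstrom")
```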

Here is how you should structure the command in the YAML file:

commands:
  - name: "get_number_fitting_cod_entries"
    implemented_as: "python_callable"
    import_path: "configure_cod_check_tutorial"
    callable_name: "qcb_get_number_fitting_cod_entries"
    parameters:
      - name: "input_cif_path"
        dtype: "QCrBox.input_cif"
        description: "Path to CIF file to refine"
        required_entries: [...]
        required: True
      - name: "cellpar_deviation_perc"
        dtype: "float"
        description: "Maximum deviation in cell parameters (lengths and angles) in %"
        default_value: 2.0
        required: false
      - name: "listed_elements_only"
        dtype: "bool"
        description: "If true: only elements present in CIF file can be present in databank entry, if false: additional elements can be present"
        default_value: false
        required: false

Specifying Required CIF Entries For Input#

Next, we must identify which CIF entries have to be present in the input CIF file for our command to function. Inspect the cif_to_search_pars function to determine these entries. If you're adding only one command, list these required entries directly in the input_cif_path parameter definition. Ensure that required_entries: is indented to the same level as the name: and dtype: keys of that parameter so that the YAML structure is valid.

        required_entries: [
          "_cell_length_a", "_cell_length_b", "_cell_length_c",
          "_cell_angle_alpha", "_cell_angle_beta", "_cell_angle_gamma",
          "_chemical_formula_sum"
        ]

For scenarios where certain CIF entries are beneficial but not mandatory, you could list them under optional_entries. However, in this context, all listed entries are necessary, which completes our current modifications to the YAML file. For further information about CIF entry handling, consult the yaml section from the CIF HowTo.
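If you want to check quickly whether a CIF file contains the required entries before running the command, a naive sketch such as the following can help. The function missing_cif_entries is hypothetical and deliberately simple: it only looks at lines that start with an underscore and ignores subtleties such as multi-line text fields, so it is not a substitute for the checks done by QCrBox itself.

```python
def missing_cif_entries(cif_text: str, required_entries: list[str]) -> list[str]:
    """Return the required entries that do not appear as data names in the CIF text.

    Naive line-based check for illustration only; not a full CIF parser.
    """
    present = set()
    for line in cif_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("_"):
            # The data name is the first whitespace-separated token on the line
            present.add(stripped.split()[0].lower())
    return [entry for entry in required_entries if entry.lower() not in present]


example_cif = """data_example
_cell_length_a 10.0
_cell_length_b 11.0
_chemical_formula_sum 'C6 H12 O6'
"""
print(missing_cif_entries(example_cif, ["_cell_length_a", "_cell_length_c"]))
```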

Implementing the Python Glue Code#

Important Note: Currently, some functionality that will eventually be automated requires manual implementation, specifically the registration of our application and commands in Python as well as CIF file handling and conversion. This step is temporary and is planned to be automated in future updates, following the developer alpha release. We're releasing this functionality now to provide a foundation for exploration and development.

Next, we need to implement the module and function we have referenced in the YAML file. To begin, open the configure_cod_check_tutorial.py file. Start by importing necessary functions from the Python base libraries as well as two different modules.

from pathlib import Path
import json

from qcrboxtools.cif.cif2cif import cif_file_to_specific_by_yml
from simple_cod_module import cif_to_search_pars, get_number_fitting_cod_entries

The function cif_file_to_specific_by_yml is designed to manage the CIF files' input and output, converting the CIF keywords used by QCrBox into those required by simple_cod_module. Additionally, we'll utilize two specific functions from simple_cod_module to execute our desired logic.

Let's proceed to define the necessary Python functions within configure_cod_check_tutorial.py:

YAML_PATH = "./config_cod_check_tutorial.yaml"

def parse_input(input_cif_path, cellpar_deviation_perc, listed_elements_only):
    # Convert string paths to Path objects for easier file handling
    input_cif_path = Path(input_cif_path)

    # Convert cellpar_deviation to the correct type and convert the given percentage to a decimal
    cellpar_deviation = float(cellpar_deviation_perc) / 100.0

    # Validate 'listed_elements_only' as a boolean value
    if listed_elements_only.lower() not in ("true", "false"):
        raise ValueError("'listed_elements_only' must be a boolean (true or false).")

    # Convert 'listed_elements_only' to a boolean
    listed_elements_only = listed_elements_only.lower() == "true"

    # Use the parent directory of the input CIF file as the working directory
    work_folder = input_cif_path.parent

    # Specify the path for the modified CIF file
    work_cif_path = work_folder / "qcrbox_work.cif"

    # Adjust the CIF file according to the requirements of 'simple_cod_module'
    cif_file_to_specific_by_yml(
        input_cif_path=input_cif_path,
        output_cif_path=work_cif_path,
        yml_path=YAML_PATH,  # Referencing the edited YAML configuration
        command="get_number_fitting_cod_entries",  # Command name as specified in the YAML
        parameter="input_cif_path",  # Parameter name as specified in the YAML
    )
    return work_cif_path, cellpar_deviation, listed_elements_only

def qcb_get_number_fitting_cod_entries(input_cif_path, cellpar_deviation_perc, listed_elements_only):
    # Transform input parameters from string to appropriate Python objects
    work_cif_path, cellpar_deviation, listed_elements_only = parse_input(input_cif_path, cellpar_deviation_perc, listed_elements_only)

    # Retrieve the number of matching entries
    elements, cell_dict = cif_to_search_pars(work_cif_path)
    n_entries = get_number_fitting_cod_entries(elements, cell_dict, cellpar_deviation, listed_elements_only)

    # Save the output as a JSON file
    with open(work_cif_path.parent / "nentries.json", "w", encoding="UTF-8") as fobj:
        json.dump({"n_entries": n_entries}, fobj)

The parse_input function will eventually be phased out, as QCrBox will take over input parameter handling and CIF file conversion, making the explicit implementation obsolete.
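Note that parse_input calls .lower() on listed_elements_only, which assumes the value arrives as a string. If you also want to call the function directly from Python with a real boolean, a small helper along the following lines could be used instead. The name parse_bool is hypothetical and not part of QCrBox.

```python
def parse_bool(value) -> bool:
    """Convert 'true'/'false' strings (any case) or real booleans to a bool."""
    if isinstance(value, bool):
        # Already a boolean, nothing to convert
        return value
    text = str(value).strip().lower()
    if text in ("true", "false"):
        return text == "true"
    raise ValueError(f"Expected a boolean (true or false), got {value!r}.")


print(parse_bool("True"), parse_bool(False))
```

Inside parse_input, the validation and conversion of listed_elements_only could then be reduced to a single call to this helper.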

Registering the Python Function as a QCrBox Command#

To integrate our command with QCrBox, it's necessary to register it within the system. Update the script's concluding section as follows:

client = QCrBoxRegistryClient()
application = client.register_application("COD Check", version="0.0.1")

# Register the command with QCrBox, linking it to our Python function
application.register_python_callable("get_number_fitting_cod_entries", qcb_get_number_fitting_cod_entries)

client.run()

QCrBox recognizes the parameter names of our Python function and uses them directly as the parameter names of the commands exposed by the QCrBox container.

Configuring the Dockerfile#

The Dockerfile is preconfigured with some entries that we won't need. Here's a simplified explanation and modifications required:

  1. Base Image Setup: The file begins with specifying the base image for the application. We use the qcrbox/base-application as our starting point, utilizing the latest version available.

    ARG QCRBOX_DOCKER_TAG
    FROM qcrbox/base-application:${QCRBOX_DOCKER_TAG}
    

  2. Environment Setup: Specifies using /bin/bash for running future commands.

    SHELL ["/bin/bash", "-c"]
    

  3. Inclusion of QCrBox settings files: The following lines copy the configure_*.py and config_*.yaml files, which we have modified to integrate our program with QCrBox, into our container.

    COPY configure_cod_check_tutorial.py ./
    COPY config_cod_check_tutorial.yaml ./
    
  4. Module Inclusion: Ensure our module and its dependencies are included and properly set up in the container. For instance, add the Python module with:

    COPY ./simple_cod_module.py ./
    

  5. Dependency Management: Install necessary dependencies, like the requests module. Choose between using micromamba for Conda environments or pip for Python environments.

    • For micromamba:
      RUN micromamba install -n qcrbox requests --yes
      
      If you have a large number of dependencies, working with a conda .yml file is more sensible.
    • For pip:
      RUN pip install requests
      
  6. Delete unnecessary lines: The following two lines are not necessary for a Python application and should be deleted:

    COPY sample_cmd.sh /opt/cod_check_tutorial/bin/
    ENV PATH="$PATH:/opt/cod_check_tutorial/bin/"

Building the container with the first command exposed#

To create a QCrBox image for our application, we'll execute a specific build command using the application slug defined earlier. Open your terminal and input the following command to start the build process:

qcb build cod_check_tutorial

Important Note: By default, qcb build without additional arguments performs a full rebuild of all dependencies to ensure everything is up-to-date. If you have recently completed a build and wish to save time, you can opt for the --no-deps argument. This option focuses solely on building the QCrBox image without updating the dependencies.

After completing the build process, you can launch your newly created QCrBox image with the following command:

qcb up cod_check_tutorial --no-rebuild-deps

This command starts the container without recompiling the image or its dependencies, assuming they were recently built. If you aim to update both dependencies and the image before launching, simply omit the --no-rebuild-deps flag. This ensures that your QCrBox image and all related components are fully up-to-date.

Build a function to load in a structure from the best matching unit cell#

Our goal is to incorporate atomic parameters from the most compatible structure within the COD into our CIF file. This allows us to bypass the structure solution phase if matching information is readily available. To achieve this, we introduce a new command into the config_cod_check_tutorial.yaml file. Append the following new command definition at the end of the file:

  - name: "merge_closest_cod_entry"
    implemented_as: "python_callable"
    import_path: "configure_cod_check_tutorial"
    callable_name: "merge_closest_cod_entry"  # can be omitted in this case because it is identical to `name` above
    parameters:
      - name: "input_cif_path"
        dtype: "QCrBox.input_cif"
        required_entries: [
          "_cell_length_a", "_cell_length_b", "_cell_length_c",
          "_cell_angle_alpha", "_cell_angle_beta", "_cell_angle_gamma",
          "_chemical_formula_sum"
        ]
        default_value: None
      - name: "output_cif_path"
        dtype: "QCrBox.output_cif"
        required_entries: [...]
      - name: "cellpar_deviation_perc"
        dtype: "float"
        default_value: 2.0
        required: false
      - name: "listed_elements_only"
        dtype: "bool"
        default_value: false
        required: false

You might notice two things. Firstly, we now have an output_cif_path parameter, which we will tackle in the next section. Secondly, our required CIF entries for input_cif_path are exactly the same as before, as they are used by the same function within the simple_cod_module.py file. Repeating the required CIF entries might be fine in this case, as the number of entries is rather low. However, we would like to define the set of entries only once. In QCrBox we can do that using cif_entry_sets. At the end of the file we create a new entry set for our commands:

cif_entry_sets:
  - name: "cell_elements"
    required: [
      "_cell_length_a", "_cell_length_b", "_cell_length_c", "_cell_angle_alpha",
      "_cell_angle_beta", "_cell_angle_gamma", "_chemical_formula_sum"
    ]

Instead of writing the entries into the individual commands, we now replace the required_entries sections so that our command definition looks like this:

  - name: "merge_closest_cod_entry"
    implemented_as: "python_callable"
    parameters:
      - name: "input_cif_path"
        dtype: "QCrBox.input_cif"
        required_entry_sets: ["cell_elements"]
        default_value: None
      - name: "output_cif_path"
        dtype: "QCrBox.output_cif"
        required_entries: [...]
      - name: "cellpar_deviation_perc"
        dtype: "float"
        default_value: 2.0
        required: false
      - name: "listed_elements_only"
        dtype: "bool"
        default_value: false
        required: false

Try to update the definition of get_number_fitting_cod_entries on your own. Note that you can have multiple CIF entry sets. You can also combine entry sets with individual keywords to mix and match whatever your commands need.
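Conceptually, combining entry sets with individual keywords amounts to flattening both into one list of required entries. The following minimal sketch shows this idea on plain dictionaries that mirror the YAML structure. The function resolve_required_entries is hypothetical and is not how QCrBox resolves entry sets internally.

```python
def resolve_required_entries(parameter: dict, cif_entry_sets: list[dict]) -> list[str]:
    """Flatten required_entry_sets and required_entries into one ordered list."""
    sets_by_name = {entry_set["name"]: entry_set for entry_set in cif_entry_sets}
    resolved = []
    for set_name in parameter.get("required_entry_sets", []):
        # A missing set name raises a KeyError, which points to a typo in the YAML
        resolved.extend(sets_by_name[set_name]["required"])
    resolved.extend(parameter.get("required_entries", []))
    # Remove duplicates while keeping the original order
    return list(dict.fromkeys(resolved))


entry_sets = [{"name": "cell_elements", "required": ["_cell_length_a", "_chemical_formula_sum"]}]
parameter = {
    "name": "input_cif_path",
    "required_entry_sets": ["cell_elements"],
    "required_entries": ["_cell_length_a", "_space_group_name_H-M_alt"],
}
print(resolve_required_entries(parameter, entry_sets))
```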

Defining Cif Output Entries Within the YAML file#

Now we want to define the output CIF entries for our function. Again, for the finer details, consult the yaml section from the CIF HowTo. The general idea is that we keep all values from the original CIF file that are still valid and only add or substitute new entries. Finally, we also delete all values from the original CIF file that have been invalidated.

The first two entries are again required_entries and optional_entries and their set counterparts. For CIF output, required entries are values that have to be present in a CIF file that comes from a successful calculation, whereas optional entries may be present but do not have to be.

Finally, invalidated_entries are entries that are no longer valid after a transformation. In the given example we have changed atom positions and displacement parameters. Accordingly, anything depending on these parameters should not be kept from the original input CIF file. We should delete all derived _geom parameters and all quality indicators (as our diffraction data might differ from the data behind the COD entry). Invalidated entries are given as regular expressions as used by the Python re module.

The cif output path parameter should be modified to:

      - name: "output_cif_path"
        dtype: "QCrBox.output_cif"
        required_entries: [
          "_atom_site_label", "_atom_site_fract_x", "_atom_site_fract_y", "_atom_site_fract_z",
          "_atom_site_occupancy", "_atom_site_U_iso_or_equiv", "_symmetry_equiv_pos_as_xyz",
        ]
        optional_entries: [
          "_atom_site_aniso_label", "_atom_site_aniso_U_11", "_atom_site_aniso_U_22",
          "_atom_site_aniso_U_33", "_atom_site_aniso_U_12", "_atom_site_aniso_U_13",
          "_atom_site_aniso_U_23"
        ]
        invalidated_entries: [
          "_atom_site.*", "_geom.*", ".*refine.*", "_iucr.*", "_shelx.*"
        ]
        required: True

Again, while the required and optional entries determine what is copied from our evaluation (here the database lookup), the invalidated entries exclude entries from the input CIF. The remaining entries from both files are then merged and written to the location given by output_cif_path.
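To see which entries the patterns listed under invalidated_entries would remove, the following minimal sketch filters a dictionary of CIF entries with the Python re module. The function drop_invalidated is hypothetical, and the use of fullmatch is an assumption; how QCrBoxTools anchors the patterns is not described in this tutorial.

```python
import re


def drop_invalidated(entries: dict, invalidated_patterns: list[str]) -> dict:
    """Return a copy of entries without names matched by any invalidated pattern."""
    compiled = [re.compile(pattern) for pattern in invalidated_patterns]
    return {
        name: value
        for name, value in entries.items()
        # Assumption: a pattern has to match the complete entry name
        if not any(regex.fullmatch(name) for regex in compiled)
    }


input_entries = {
    "_cell_length_a": "10.0",
    "_atom_site_label": "C1",
    "_geom_bond_distance": "1.54",
    "_refine_ls_R_factor_gt": "0.03",
}
patterns = ["_atom_site.*", "_geom.*", ".*refine.*", "_iucr.*", "_shelx.*"]
print(sorted(drop_invalidated(input_entries, patterns)))
```

In this example only the unit cell entry survives, because the atom site, geometry, and refinement entries each match one of the patterns.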

Developing the Python Glue Code for Our Merge Command.#

We will now modify the configure_cod_check_tutorial.py file to add the new functionality. We will use more functionality from both QCrBoxTools and our COD module. Our import section should now look like this:

from pathlib import Path
import json

from qcrbox.registry.client import QCrBoxRegistryClient
from qcrboxtools.cif.cif2cif import (
    cif_file_merge_to_unified_by_yml,
    cif_file_to_specific_by_yml,
)
from qcrboxtools.cif.merge import replace_structure_from_cif

from simple_cod_module import (
    cif_to_search_pars,
    get_number_fitting_cod_entries,
    get_fitting_cod_entries,
    download_cod_cif,
)

The cif_file_merge_to_unified_by_yml function is the counterpart of the first function from cif2cif. We can use it to cut a non-unified CIF down to the entries we expect to be changed, convert it to the unified set of entries, and then merge it into the input file we used for our search, which should contain the X-ray data as well as the unit cell parameters. The replace_structure_from_cif function will do the actual replacement. The function get_fitting_cod_entries returns a list of dictionaries of COD entries, sorted by the sum of squared differences in the unit cell parameters. Finally, download_cod_cif can be used to download an entry from the COD.
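The sorting criterion mentioned above can be sketched as follows. The dictionary keys (a, b, c, alpha, beta, gamma) and the function names are assumptions chosen for illustration; the entries returned by get_fitting_cod_entries may use a different layout.

```python
CELL_KEYS = ("a", "b", "c", "alpha", "beta", "gamma")


def squared_cell_difference(target: dict, candidate: dict) -> float:
    """Sum of squared differences over the six unit cell parameters."""
    return sum((float(candidate[key]) - float(target[key])) ** 2 for key in CELL_KEYS)


def sort_by_cell_similarity(target: dict, candidates: list[dict]) -> list[dict]:
    """Return the candidates ordered from best to worst match to the target cell."""
    return sorted(candidates, key=lambda entry: squared_cell_difference(target, entry))


target_cell = {"a": 10.0, "b": 11.0, "c": 12.0, "alpha": 90.0, "beta": 95.0, "gamma": 90.0}
candidates = [
    {"file": "far", "a": 10.2, "b": 11.2, "c": 12.1, "alpha": 90.0, "beta": 96.0, "gamma": 90.0},
    {"file": "near", "a": 10.01, "b": 11.0, "c": 12.0, "alpha": 90.0, "beta": 95.1, "gamma": 90.0},
]
print([entry["file"] for entry in sort_by_cell_similarity(target_cell, candidates)])
```

With a list sorted this way, the first element is the closest match, which is why the merge command only needs to look at the first entry.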

We can now implement our function:

def merge_closest_cod_entry(
    input_cif_path,
    output_cif_path,
    cellpar_deviation_perc,
    listed_elements_only
):
    # cast the input parameters from strings to python objects
    work_cif_path, cellpar_deviation, listed_elements_only = parse_input(
        input_cif_path, cellpar_deviation_perc, listed_elements_only
    )

    output_cif_path = Path(output_cif_path)

    # get the list of fitting entries
    elements, cell_dict = cif_to_search_pars(work_cif_path)
    entry_lst = get_fitting_cod_entries(elements, cell_dict, cellpar_deviation, listed_elements_only)

    # if no fitting entries found, raise an error
    if len(entry_lst) == 0:
        raise ValueError("No fitting entries found")

    # download the cif file of the most fitting entry
    cod_cif_path = work_cif_path.parent / "cod.cif"
    download_cod_cif(entry_lst[0]["file"], cod_cif_path)

    # merge the input cif file with the downloaded cif file
    cif_file_merge_to_unified_by_yml(
        input_cif_path=cod_cif_path,
        output_cif_path=output_cif_path,
        merge_cif_path=input_cif_path,
        yml_path=YAML_PATH,
        command="merge_closest_cod_entry",
        parameter="output_cif_path",
    )

And finally we register that function with QCrBox by adding

application.register_python_callable("merge_closest_cod_entry", merge_closest_cod_entry)

right before client.run()

Rebuilding and Restarting the Container#

If you have not done so already, shut down QCrBox by typing qcb down. You can now rebuild and restart the container with the following command. If you have just rebuilt everything, adding --no-rebuild-deps skips rebuilding the dependencies and might be faster.

qcb up cod_check_tutorial

Conclusion and final remarks#

We have now exposed two commands in QCrBox from a Python module: one that only analyses a CIF file to produce some output, and another that works from an input CIF file to an output CIF file. If you want to interact with what you have built, an IPython notebook that you can put into the examples folder can be found here.

For more examples, you might consider looking into the already implemented programs in services/applications. If this tutorial is unclear at any point, please raise an issue on GitHub describing the specific problem that you ran into.