STAT 39000: Project 6 — Spring 2022

Motivation: In this project we will continue to get familiar with SLURM, the job scheduler installed on our clusters at Purdue, including Brown.

Context: This is the second in a series of (now) 4 projects focused on parallel computing using SLURM and Python.

Scope: SLURM, unix, Python

Learning Objectives
  • Use basic SLURM commands to submit jobs, check job status, kill jobs, and more.

  • Understand the differences between srun and sbatch commands.

  • Predict the resources (cpus and memory) an srun job will use based on the arguments and context.

  • Write and use a job script to solve a problem faster than you would be able to without a high performance computing (HPC) system.

Make sure to read about and use the template found here, and review the important information about project submissions here.

Dataset(s)

The following questions will use the dataset(s) below:

  • /depot/datamine/data/coco/attempt02/*.jpg

Questions

Question 1

This project, and the next, will have a variety of deliverables. Ultimately, each question will result in an entry in a Jupyter notebook and/or one or more additional Python or Bash scripts. In addition, to properly save screenshots in your Jupyter notebook, please follow the guidelines here. Images that don’t appear in your notebook in Gradescope will not get credit.

In project 5, question 2, we asked you to test out a variety of srun commands with variations in the options. As you are probably now well aware, it can be difficult to understand which combination of parameters is needed. With that being said, in this course we will focus on embarrassingly (perfectly) parallel jobs made up of single-core, single-threaded job steps. So, the following job script is a safe and effective way to break your jobs up.

#!/bin/bash
#SBATCH --account=datamine
#SBATCH --job-name=serial_job_test    # Job name
#SBATCH --mail-type=END,FAIL          # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=me@purdue.edu     # Where to send mail
#SBATCH --ntasks=3                    # Number of tasks (total)
#SBATCH --cpus-per-task=1             # Number of cores per task
#SBATCH -o /dev/null                  # Output to dev null
#SBATCH -e /dev/null                  # Error to dev null

srun --exclusive -n 1 --mem-per-cpu=1000 -t 00:00:00 some_command &
srun --exclusive -n 1 --mem-per-cpu=1000 -t 00:00:00 some_command &
srun --exclusive -n 1 --mem-per-cpu=1000 -t 00:00:00 some_command &

wait

Just be sure to modify --ntasks in your job script, along with the amount of time and memory you need for each job step.

Remember, you use sbatch to submit a job. Your job will have steps (each srun line). Your steps will have tasks (in our case, each srun will run a single task).

To add to the difficulty you may have had understanding the various options available to you, if you used a terminal from within Jupyter Lab, you were technically already inside a SLURM job with -c 4! In this project, we will learn to side-step this complication from within the Jupyter Lab environment so it is equivalent to SSH’ing into a frontend node in a fresh session.

When inside a SLURM job, a variety of environment variables are set that alter how srun behaves. If you open a terminal from within Jupyter Lab and run the following, you will see them.

env | grep -i slurm

These variables alter the behavior of srun. We can, however, unset them, and srun will revert to its default behavior. In your terminal, run the following.

for i in $(env | awk -F= '/SLURM/ {print $1}'); do unset $i; done;

Confirm that the environment variables are unset by running the following.

env | grep -i slurm

Great! Now, we can work in our nice Jupyter Lab environment without any concern that SLURM environment variables are changing any behaviors. Let’s test it out with something actually predictable.

get_info.py
#!/usr/bin/python3

import time
import socket
from pathlib import Path
import datetime

def main():

    print(f"started: {datetime.datetime.now()}")
    print(socket.gethostname())

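    # /proc/self/cgroup lists the cgroups this process belongs to; grab the
    # cpuset and memory cgroup paths so we can look up the limits applied to this process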
    with open("/proc/self/cgroup") as file:
        for line in file:
            if 'cpuset' in line:
                cpu_loc = "cpuset" + line.split(":")[2].strip()

            if 'memory' in line:
                mem_loc = "memory" + line.split(":")[2].strip()

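    # under /sys/fs/cgroup, cpuset.cpus lists the cores this process may use,
    # and memory.limit_in_bytes is the memory cap (in bytes) applied to it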
    base_loc = Path("/sys/fs/cgroup/")
    with open(base_loc / cpu_loc / "cpuset.cpus") as file:
        num_cpus = len(file.read().strip().split(","))
        print(f"CPUS: {num_cpus}")

    with open(base_loc / mem_loc / "memory.limit_in_bytes") as file:
        mem_in_bytes = int(file.read().strip())
        print(f"Memory: {mem_in_bytes/1024**2} Mbs")

    time.sleep(3)
    print(f"ended: {datetime.datetime.now()}")

if __name__ == "__main__":
    main()

my_job.sh
#!/bin/bash
#SBATCH --account=datamine
#SBATCH --job-name=serial_job_test    # Job name
#SBATCH --mail-type=END,FAIL          # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=me@purdue.edu     # Where to send mail
#SBATCH --ntasks=3                    # Number of tasks (total)
#SBATCH --cpus-per-task=1             # Number of cores per task
#SBATCH -o /dev/null                  # Output to dev null
#SBATCH -e /dev/null                  # Error to dev null

srun --exclusive -n 1 --mem-per-cpu=1000 -t 00:00:00 $HOME/get_info.py > 1.txt &
srun --exclusive -n 1 --mem-per-cpu=1000 -t 00:00:00 $HOME/get_info.py > 2.txt &
srun --exclusive -n 1 --mem-per-cpu=1000 -t 00:00:00 $HOME/get_info.py > 3.txt &
srun --exclusive -n 1 --mem-per-cpu=1000 -t 00:00:00 $HOME/get_info.py > 4.txt &

wait

Place get_info.py in your $HOME directory and launch the job with the following command.

sbatch my_job.sh

Make sure to give your get_info.py script execute permissions.

chmod +x get_info.py

Note that there is no -c option needed for srun commands anymore! In the previous project, you needed to specify -c 1 (for example) to override the behavior inherited from the "surrounding" job where the setting is -c 4. This is no longer needed because we’ve unset the environment variables that tell srun to inherit those settings.

Check out the contents of 1.txt, 2.txt, 3.txt, and 4.txt. Explain in as much detail as possible what resources (cpus) were allocated for the job, what resources (cpus and memory) were allocated for each step, and how the job’s resources (cpus) affected the results of each step.

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 2

I hope that the previous question was helpful, and gave you at least 1 reliable way to write job scripts for embarrassingly parallel jobs, where you can predict what will happen.

If at this point in time you are wondering "why would we do this when we can just use joblib, grab 24 cores, and power through some job?", the answer is that joblib is limited to the number of cpus on the single node your Python script is running on. SLURM allows us to allocate well over 24 cpus, and has much higher computing potential! In addition, it is (arguably) easier to write a single-threaded Python job to run on SLURM than to parallelize your code using joblib.

In the previous project, you were able to use the sha256 hash to efficiently find the extra image that the trickster Dr. Ward added to our dataset. Dr. Ward, knowing all about hashing algorithms, thinks he has a simple way to circumvent your work. In the "new" dataset, /depot/datamine/data/coco/attempt02, he has modified the value of a single pixel of his duplicate image.

Re-run your SLURM job from the previous project on the new dataset, and process the results to try to find the duplicate image. Was Dr. Ward’s modification successful? Do your best to explain why or why not.

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 3

Unfortunately, Dr. Ward was right, and our methodology didn’t work. Luckily, there is a cool technique called perceptual hashing that is almost meant just for this! Perceptual hashing is a technique that can be used to tell whether any two images appear the same, without actually viewing the images. The general idea is this: given two images that are essentially the same (maybe they have a few different pixels, have been cropped, gone through a filter, etc.), a perceptual hash can give you a very good idea of whether the images are the "same" (or close enough). Of course, it is not a perfect tool, but it is most likely good enough for our purposes.

To be a little more specific, two images are very likely the same if their perceptual hashes are the same. If two perceptual hashes are the same, their Hamming distance is 0. For example, if your hashes were 8f373714acfcf4d0 and 8f373714acfcf4d0, the Hamming distance would be 0, because if you convert the hexadecimal values to binary, the values are identical at each position in the string of 0s and 1s. If one of the 0s and 1s didn’t match after converting to binary, the Hamming distance would be 1.
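
If you want to play with this idea, the imagehash library (which you will use below) can compute the Hamming distance between two hashes directly by subtracting them. A minimal sketch; the second hex value is made up purely to illustrate a 1-bit difference:

import imagehash

# two hex-encoded perceptual hashes; the second differs from the first only in its last bit
a = imagehash.hex_to_hash("8f373714acfcf4d0")
b = imagehash.hex_to_hash("8f373714acfcf4d1")

print(a - a) # 0, identical hashes (Hamming distance 0)
print(a - b) # 1, exactly one bit differs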

Use the imagehash library, and modify your job script from the previous project to use perceptual hashing instead of the sha256 algorithm, producing 1 file for each image where the filename remains the same as the original image and the contents of the file are the hash.

Make sure to clear out your SLURM environment variables before submitting your job to run with sbatch. If you are submitting the job from a terminal, run the following.

for i in $(env | awk -F= '/SLURM/ {print $1}'); do unset $i; done;
sbatch my_job.sh

If you are in a bash cell in Jupyter Lab, do the same.

%%bash

for i in $(env | awk -F= '/SLURM/ {print $1}'); do unset $i; done;
sbatch my_job.sh

In order for the imagehash library to work, we need to make sure the libffi dependency is loaded. Before the srun commands in your job script, load the module by adding: source /etc/profile.d/modules.sh; module use /scratch/brown/kamstut/tdm/opt/modulefiles; module load libffi/3.4.2. So, your job script should look something like:

#!/bin/bash
#SBATCH --account=datamine
...other SBATCH options...

source /etc/profile.d/modules.sh
module use /scratch/brown/kamstut/tdm/opt/modulefiles
module load libffi/3.4.2

srun ... &

wait

In order for your hash script to find the imagehash library, we need to use our course Python environment. To do that, change your shebang line to this monster: #!/scratch/brown/kamstut/tdm/apps/jupyter/kernels/f2021-s2022/.venv/bin/python. Then, just run the script via $HOME/my_script.py hash …

To help get you going, let me demonstrate using the package.

import imagehash
from PIL import Image

my_hash = imagehash.phash(Image.open("/depot/datamine/data/coco/attempt02/000000000008.jpg"))
print(my_hash) # d16c8e9fe1600a9f
my_hash # ImageHash object (wraps a numpy array of True (1) and False (0) values)
my_hex = "d16c8e9fe1600a9f"
imagehash.hex_to_hash(my_hex) # convert the hex string back into an ImageHash object

Make sure that you pass the hash as a string to the output_file.write method. So something like: output_file.write(str(file_hash)).
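
If it helps, the core of the hashing logic might look roughly like the sketch below. This is only a sketch, not the required implementation; the hash_image function and its arguments are placeholders, so adapt it to however your hash.py from the previous project is structured.

#!/scratch/brown/kamstut/tdm/apps/jupyter/kernels/f2021-s2022/.venv/bin/python

from pathlib import Path

import imagehash
from PIL import Image

def hash_image(image_path, output_dir):
    """Hash one image and write the hash to output_dir/<image filename>."""
    # compute the perceptual hash of a single image
    image_path = Path(image_path)
    with Image.open(image_path) as img:
        file_hash = imagehash.phash(img)

    # write the hash to a file named after the image (e.g. 000000000008.jpg),
    # passing it as a string rather than an ImageHash object
    with open(Path(output_dir) / image_path.name, "w") as output_file:
        output_file.write(str(file_hash))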

Make sure that once you’ve written your script, my_script.sh, you submit it to SLURM using sbatch my_script.sh, not ./my_script.sh.

It would be a good idea to make sure you’ve modified your hash script to work properly with the imagehash library. Test out the script by running the following (assuming your Python code is called hash.py, and it is in your $HOME directory).

$HOME/hash.py hash --output $HOME /depot/datamine/data/coco/attempt02/000000000008.jpg

This should produce a file, $HOME/000000000008.jpg, containing the hash of the image.

Make sure your hash.py script has execute permissions!

chmod +x $HOME/hash.py

We’ve now posted the solutions to project 5 question 4. See here.

Process the results (like in the previous project). Did you find the duplicate image? Explain what you think could have happened.

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 4

What!?! That is pretty cool! You found the "wrong" duplicate image? Well, I guess it is totally fine to find multiple duplicates. Modify the code you used to find the duplicates so it finds all of the duplicates and originals. In total there should be 50. Display 2-5 of the pairs (or triplets or more). Can you see any of the subtle differences? Hopefully you find the results to be pretty cool! If you look, you will find Dr. Ward’s hidden picture, but you do not have to exhaustively display all 50 images.
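
One possible approach (just a sketch, and it assumes your job wrote one hash file per image into a single results directory; the results_dir path below is hypothetical) is to group the image filenames by their hash and keep the groups with more than one member.

from collections import defaultdict
from pathlib import Path

# hypothetical directory holding one hash file per image (adjust to your setup)
results_dir = Path("/path/to/your/results")

# map each perceptual hash to the list of image filenames that produced it
groups = defaultdict(list)
for hash_file in results_dir.glob("*.jpg"):
    groups[hash_file.read_text().strip()].append(hash_file.name)

# hashes shared by 2 or more images are the duplicates (and their originals)
duplicates = {h: names for h, names in groups.items() if len(names) > 1}
print(len(duplicates))
for h, names in list(duplicates.items())[:5]:
    print(h, names)

To display a pair side by side in your notebook, open the corresponding original images from /depot/datamine/data/coco/attempt02/ with PIL’s Image.open.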

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Please make sure to double check that your submission is complete, and contains all of your code and output, before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure that what you think you submitted was what you actually submitted.

In addition, please review our submission guidelines before submitting your project.