Generating Images for SuAVE Bibliographic Network
The following code will generate images based on search queries that are constructed from the variables “Name”, “Affiliation”, “City”, and “Country” in the dataset.
Requirements
- Use a VPN to run this code. I recommend ProtonVPN as it is free
- In addition, this code only works for Python versions 3.7 and 3.8 due to the package face_recognition.
- Dataset with the following columns (at least): “Name”, “Affiliation”, “City”, and “Country” in the dataset.
To get a notebook to run within these requirements, follow these steps (you need Anaconda Navigator for this to work):
- Navigate to your terminal
- Create a virtual environment using the following command (replace “myenv” with the environment name of your choosing):
conda create -n myenv python=3.7
- Activate the environment with the following command (replace “myenv” with the environment name of your choosing):
- conda activate myenv
- Create a kernel that works in Python 3.7. For this step, make sure to have ipykernel package (if not, enter the command
conda install ipykernel
)To do this, use the following command (replace “myenv_kernel” with the environment name of your choosing):python -m ipykernel install --user --name=myenv_kernel
- Next, in your terminal, enter the command
jupyter notebook
. This will open jupyter for you. If this does not work, type inconda install jupyter
and it should work then. - Now, open a Jupyter Notebook. Click on “Kernel” and then “Change kernel”. Select the one you just created.
Checkpoint
-
Dataset (to run the code on) with the columns “Name”, “Affiliation”, “City”, and “Country”
-
Utilizing VPN
-
Jupyter Notebook open with a kernel running in Python 3.7 or 3.8
Note for the Following Code: The only liberty you should take in this code (besides changing the variables names designated to custom to the specific laptop and dataset) is adjusting the limit of the number of images generated per query. If you are not getting as many valid results as desired (no valid images generated for some people), increase the image generation per query limit. Keep in mind this greatly impacts runtime for large datasets. Start at a limit of 2 and increase as needed.
Imports
import face_recognition
import urllib.request
import urllib
import re
import ssl
import itertools
import gender_guesser.detector as gender
import csv
import html
import os
import requests
from bs4 import BeautifulSoup
from PIL import Image
import json
import time
import io
from PIL import Image
import cv2
import numpy as np
import lxml
Dataset Information
In this code, change csv_file
and download_dir
to the specific location of the dataset and where you want pictures to be downloaded to. Also, change the column names to match the ones in the dataset.
#change `csv_file` and `download_dir` to the specific location of the dataset and where you want pictures to be downloaded to
csv_file = "C:/Users/jdk23/Downloads/tester1234.csv"
#format: "/directory/"
download_dir = "/Users/jdk23/Downloads/"
#format: "/directory/"
#change the values of these variables to the names of the columns in the data frame
name_col = 'Name'
affiliation_col = 'Affiliation'
city_col = 'City'
country_col = 'Country'
General Functions to Get the Queries per Entry and URLs
def generate_queries(name, affiliation, country, city):
args = [name, affiliation, country, city]
combinations = []
for i in range(2, len(args) + 1):
comb = list(itertools.combinations(args, i))
for c in comb:
if c[0] == name and "" not in c and "Unknown" not in c:
if city in c and country in c:
continue
combinations.append(list(c))
final_combinations = [name]
for combo in combinations:
new_combo = ' '.join(combo)
final_combinations.append(new_combo)
return final_combinations
def get_image_urls(limit, name = '', sex = '', affiliation = '', country = '', city = ''):
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.60 Safari/537.36",
}
if sex == '':
d = gender.Detector(case_sensitive=False)
first = name.split(" ")[0]
guessed_gender = d.get_gender(first)
if "male" == guessed_gender or "female" == guessed_gender:
sex = guessed_gender
queries = generate_queries(name, affiliation, country, city)
all_links = {}
for query in queries:
html = requests.get("https://www.google.com/search?q=" + query + "+person&tbm=isch&hl=en&gl=us", headers = headers)
soup = BeautifulSoup(html.text, "lxml")
all_script_tags = soup.select("script")
matched_images_data = "".join(re.findall(r"AF_initDataCallback\(([^<]+)\);", str(all_script_tags)))
matched_images_data_fix = json.dumps(matched_images_data)
matched_images_data_json = json.loads(matched_images_data_fix)
matched_google_image_data = re.findall(r'\"b-GRID_STATE0\"(.*)sideChannel:\s?{}}', matched_images_data_json)
matched_google_images_thumbnails = ", ".join(
re.findall(r'\[\"(https\:\/\/encry#sortpted-tbn0\.gstatic\.com\/images\?.*?)\",\d+,\d+\]',
str(matched_google_image_data))).split(", ")
removed_matched_google_images_thumbnails = re.sub(
r'\[\"(https\:\/\/encrypted-tbn0\.gstatic\.com\/images\?.*?)\",\d+,\d+\]', "", str(matched_google_image_data))
matched_google_full_resolution_images = re.findall(r"(?:'|,),\[\"(https:|http.*?)\",\d+,\d+\]", removed_matched_google_images_thumbnails)[:limit]
full_res_images = [
bytes(bytes(img, "ascii").decode("unicode-escape"), "ascii").decode("unicode-escape") for img in
matched_google_full_resolution_images
]
all_links[query] = full_res_images
return all_links
Face Prediction Code
Download deploy_gender.prototxt, gender_net.caffemodel, deploy.prototxt.txt, and res10_300x300_ssd_iter_140000_fp16.caffemodel and change GENDER_MODEL, GENDER_PROTO, FACE_PROTO, and FACE_MODEL to the respective file destinations
deploy_gender.prototxt: https://drive.google.com/file/d/1AW3WduLk1haTVAxHOkVS_BEzel1WXQHP/view
gender_net.caffemodel: https://drive.google.com/file/d/1W_moLzMlGiELyPxWiYQJ9KFaXroQ_NFQ/view
deploy.prototxt.txt: https://raw.githubusercontent.com/opencv/opencv/master/samples/dnn/face_detector/deploy.prototxt
res10_300x300_ssd_iter_140000_fp16.caffemodel: https://raw.githubusercontent.com/opencv/opencv_3rdparty/dnn_samples_face_detector_20180205_fp16/res10_300x300_ssd_iter_140000_fp16.caffemodel
#format: /directory/correct_file
faceProto = ""
faceModel = ""
genderProto = ''
genderModel = ''
faceNet = cv2.dnn.readNet(faceModel, faceProto)
genderNet = cv2.dnn.readNet(genderModel, genderProto)
MODEL_MEAN_VALUES = (78.4263377603, 87.7689143744, 114.895847746)
genderList = ['Male', 'Female']
padding = 20
def faceBox(faceNet, frames):
frameHeight = frames.shape[0]
frameWidth = frames.shape[1]
blob = cv2.dnn.blobFromImage(frames, 1.0, (300, 300), [104, 117, 123], swapRB=False)
faceNet.setInput(blob)
detection = faceNet.forward()
bboxs = []
for i in range(detection.shape[2]):
confidence = detection[0, 0, i, 2]
if confidence > 0.7:
x1 = int(detection[0, 0, i, 3] * frameWidth)
y1 = int(detection[0, 0, i, 4] * frameHeight)
x2 = int(detection[0, 0, i, 5] * frameWidth)
y2 = int(detection[0, 0, i, 6] * frameHeight)
bboxs.append([x1, y1, x2, y2])
cv2.rectangle(frames, (x1, y1), (x2, y2), (0, 255, 0), 1)
return frames, bboxs
def predict_gender(image_url):
opener = urllib.request.build_opener()
opener.addheaders = [
('User-agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.9999.999 Safari/537.36')
]
urllib.request.install_opener(opener)
try:
req = urllib.request.urlopen(image_url)
arr = np.asarray(bytearray(req.read()), dtype=np.uint8)
image = cv2.imdecode(arr, -1)
except Exception as e:
#print("Error occurred while reading the image from URL:", str(e))
return "delete"
# Detect faces in the image
try:
face_locations = face_recognition.face_locations(image)
face_encodings = face_recognition.face_encodings(image, face_locations)
except Exception as e:
print("Error occurred while reading the image from URL:", str(e))
return "delete"
genders = []
for face_location, face_encoding in zip(face_locations, face_encodings):
# Extract face region from the image
top, right, bottom, left = face_location
face = image[top:bottom, left:right]
# Perform gender prediction
blob = cv2.dnn.blobFromImage(face, 1.0, (227, 227), MODEL_MEAN_VALUES, swapRB=False)
genderNet.setInput(blob)
genderPreds = genderNet.forward()
gender = genderList[genderPreds[0].argmax()]
confidence = genderPreds[0][genderPreds[0].argmax()]
genders.append((gender, confidence))
label = f"{gender}, {confidence:.2f}"
cv2.rectangle(image, (left, top), (right, bottom), (0, 255, 0), 2)
cv2.putText(image, label, (left, top - 10), cv2.FONT_HERSHEY_SIMPLEX, 0.8, (0, 255, 0), 2, cv2.LINE_AA)
if len(genders) == 1:
return genders[0][0] + ": " + str(genders[0][1])
else:
return "delete"
General Functions
def extract_numbers_and_periods(string):
numbers_and_periods = re.findall(r"[\d.]+", string)
return "".join(numbers_and_periods)
def max_value(dictionary):
if dictionary == {}:
return {"No pics found": ""}
max_key = max(dictionary, key=dictionary.get)
max_value = dictionary[max_key]
return max_key, max_value
def add_column_to_csv(filename, column_name, data):
# Read the existing CSV file into memory
rows = []
with open(filename, 'r') as file:
reader = csv.reader(file)
headers = next(reader)
headers.append(column_name)
rows.append(headers)
for row in reader:
row.append(data.pop(0))
rows.append(row)
# Write the updated data back to the CSV file
with open(filename, 'w', newline='') as file:
writer = csv.writer(file)
for row in rows:
writer.writerow(row)
Code that Generates the Data, URLs to add to Dataset
Remember that you can mess with limit
in line 1 to achieve better/more results. As a warning, increasing limit
by as much as 1 can greatly increase the already long run time. Setting limit
as 2 or 3 is recommended.
limit = 2
predicted_genders = []
final_results = []
final_img_predictions = []
to_redo = []
with open(csv_file, 'r') as file:
reader = csv.reader(file)
num_rows = len(list(reader))
names = []
def process_data(csv_file, start_index, end_index, counter):
internal = []
print(F"{start_index}-{end_index}")
errors = []
inner_count = 0
dict_collection = []
d = gender.Detector(case_sensitive=False)
with open(csv_file, 'r') as file:
query_dict = {}
reader = csv.DictReader(file)
for i, row in enumerate(reader):
if i < start_index:
continue
if i >= end_index:
break
###
#operates that these are the names of columns
name = row[name_col]
name = html.unescape(name)
names.append(name)
#print(F"{counter}: {name}")
affiliation = row[affiliation_col]
affiliation = html.unescape(affiliation)
country = row[country_col]
country = html.unescape(country)
city = row[city_col]
city = html.unescape(city)
###
#if row['Gender Match'] != 'Do not match':
# continue
predicted_genders.append(d.get_gender(name.split(" ")[0]))
if i % 25 == 0 and i != 0:
print(i)
try:
imgs = get_image_urls(limit, name=name, affiliation=affiliation, country=country, city=city)
dict_collection.append(imgs)
##
total_length = 0
for lst in imgs.values():
total_length += len(lst)
if total_length == 0 and inner_count not in errors:
errors.append({inner_count:[name, affiliation, country, city, counter]})
##
except requests.exceptions.ReadTimeout as e:
print(f"Request timed out for row {i}. Skipping.")
dict_collection.append({name: []})
errors.append({inner_count:[name, affiliation, country, city, counter]})
if counter % 250 == 0 and counter != 0:
print(f"Processed {counter} rows. Pausing for 3 minutes...")
#time.sleep(180)
counter += 1
inner_count += 1
#print(dict_collection)
#print(F"# redos: {len(errors)}")
for error in errors:
try:
name = list(error.values())[0][0]
affiliation = list(error.values())[0][1]
country = list(error.values())[0][2]
city = list(error.values())[0][3]
ovr_index = list(error.values())[0][4]
print(F"Redo for: {name}")
imgs = get_image_urls(limit, name=name, affiliation=affiliation, country=country, city=city)
dict_collection[list(error.keys())[0]] = imgs
total_length = 0
for lst in imgs.values():
total_length += len(lst)
if total_length == 0:
to_redo.append({ovr_index:[name, affiliation, country, city]})
except requests.exceptions.ReadTimeout as e:
print(f"Request timed out for {name}. Skipping.")
print(F"Gender Guesses for {start_index}-{end_index}")
ct_2 = 0
abcd = 0
img_results = []
for image_dict in dict_collection:
if ct_2 % 25 == 0:
print(ct_2)
ct_2 += 1
to_add = {}
for search, images in image_dict.items():
#print(F"Search: {search}\n")
search_result = []
to_delete = []
#number pictures
count = 1
for index in range(len(images)):
##remove images with multiple/no faces
#print(F"Image {count}")
gender_guess = predict_gender(images[index])
#print(gender_guess)
count += 1
#removes unwanted photos from list - ones with too many faces, no faces, or raises an error in the code
if gender_guess == "delete" or gender_guess == None:
to_delete.append(index)
else:
search_result.append(gender_guess)
for index in reversed(to_delete):
#use dict_collection so that you can add images
image_dict[search].remove(images[index])
to_add[search] = search_result
img_results.append(to_add)
#print(img_results)
index_count = 0
sex = ""
for image_dictionary in img_results:
for key in image_dictionary.keys():
if 'female' in key:
sex = 'female'
elif 'male' in key and 'female' not in key:
sex = 'male'
filtered_dict = {}
for k, v in image_dictionary.items():
to_remove = []
filtered_list = []
for i in v:
if sex != "" and sex == 'male' and "Female" not in i:
filtered_list.append(i)
elif sex != "" and sex == 'female' and "Male" not in i:
filtered_list.append(i)
elif sex != "":
to_remove.append(v.index(i))
reversed_remove = reversed(to_remove)
for index in reversed_remove:
#removes misclassified images from dict_collection
dict_collection[index_count][k].pop(index)
image_dictionary[k].pop(index)
filtered_dict[k] = filtered_list
image_dictionary = filtered_dict
index_count += 1
idx_ct = 0
curr_results = []
for image_dict in img_results:
#print(image_dict)
if image_dict == {}:
final_results.append({"none":[]})
final_img_predictions.append({"none":[]})
idx_ct += 1
continue
fin_results = {}
for key, results in image_dict.items():
if len(results) != 0:
img_pctg = len(results) / limit
avg_rating = 0
for result in results:
avg_rating += float(extract_numbers_and_periods(result))
avg_rating = (avg_rating / len(results)) / 100
fin_results[key] = (avg_rating * 0.5) + (0.5 * img_pctg)
else:
fin_results[key] = 0
best = image_dict[max_value(fin_results)[0]]
best_search = max_value(fin_results)[0]
#print(best_search)
to_add = {}
to_add_2 = {}
to_add[best_search] = dict_collection[idx_ct][best_search]
to_add_2[best_search] = best
final_results.append(to_add)
curr_results.append(to_add)
final_img_predictions.append(to_add_2)
idx_ct += 1
print(curr_results)
return counter
####
start_index = 0
end_index = 100
counter = 0
while start_index <= num_rows:
counter = process_data(csv_file, start_index, end_index, counter)
start_index += 100
end_index += 100
time.sleep(20)
####
Add Predicted Genders from Name to Dataset
for i in range(len(predicted_genders)):
if predicted_genders[i] != 'male' and predicted_genders[i] != 'female':
predicted_genders[i] = 'unknown'
add_column_to_csv(csv_file, "Gender Prediction - Name", predicted_genders)
Add Image URL(s) and Gender Prediction from the First Image to the Dataset
first_urls = []
first_img_result = []
rest_urls = []
#Iterate through the list of dictionaries
count = 0
for image_dictionary in final_results:
for key, values in image_dictionary.items():
# Check if there are URLs for this key
if len(values) > 0:
# Add the first URL to the first_urls list
first_urls.append(values[0])
###
first_img_result.append(final_img_predictions[count][key][0])
###
# Add the rest of the URLs to the rest_urls list
else:
first_urls.append("https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcR08w3j9UMPcMqpu55Q-sQAJGXRwJ-20KEJfwiAyrcMjuO1NW5MhJ2otwqIejs&s%22")
first_img_result.append("")
if len(values) > 1:
rest_urls.append(", ".join(values[1:]))
else:
# Add "" to both lists if there are no URLs
rest_urls.append("")
count += 1
def download_image(url, directory):
response = requests.get(url)
if response.status_code == 200:
filename = os.path.basename(url)
filepath = os.path.join(directory, filename)
with open(filepath, 'wb') as f:
f.write(response.content)
print(f"Downloaded {filename} successfully!")
else:
print("Failed to download image")
add_column_to_csv(csv_file, "Image", first_urls)
add_column_to_csv(csv_file, "Image Result", first_img_result)
add_column_to_csv(csv_file, "Images", rest_urls)
Code that Adds Column of Whether Gender Prediction from Name and Image Match or Not
names = []
with open(csv_file, 'r') as file:
d = gender.Detector(case_sensitive=False)
reader = csv.DictReader(file)
for i, row in enumerate(reader):
#operates that these are the names of columns
name = row["Name"]
name = html.unescape(name)
names.append(name)
first_img_result = []
count = 0
for image_dictionary in final_results:
for key, values in image_dictionary.items():
# Check if there are URLs for this key
if len(values) > 0:
# Add the first URL to the first_urls list
first_img_result.append(final_img_predictions[count][key][0])
###
# Add the rest of the URLs to the rest_urls list
else:
first_img_result.append("")
count += 1
d = gender.Detector(case_sensitive=False)
predict_genders = []
for name in names:
predict_genders.append(d.get_gender(name.split(" ")[0]))
matches = []
for i in range(counter):
if "Male" in first_img_result[i]:
if "female" in predict_genders[i].lower():
matches.append("Do not match")
elif "male" in predict_genders[i].lower():
matches.append("Match")
else:
matches.append("Not Applicable")
elif "Female" in first_img_result[i]:
if "female" in predict_genders[i].lower():
matches.append("Match")
elif "male" in predict_genders[i].lower():
matches.append("Do not match")
else:
matches.append("Not Applicable")
else:
matches.append("Not Applicable")
add_column_to_csv(csv_file, "Gender Match", matches)
Code that Removes Bad Image URLs
def download_image(url, directory):
response = requests.get(url)
if response.status_code == 200:
filename = os.path.basename(url)
filepath = os.path.join(directory, filename)
with open(filepath, 'wb') as f:
f.write(response.content)
print(f"Downloaded {filename} successfully!")
else:
print("Failed to download image")
folder_path = '/Users/jdk23/Downloads/aquifer_photos/' # Replace with the path to your folder
removed_images2 = []
# Iterate over the files in the folder
for filename in os.listdir(folder_path):
file_path = os.path.join(folder_path, filename)
# Check if the file is an image (you can adjust the extensions as needed)
if filename.lower().endswith(('.jpg', '.jpeg', '.png', '.gif')):
# Validate the image file
try:
img = Image.open(file_path)
img.verify() # Verifies the image file's integrity
img.close()
except (IOError, SyntaxError):
# Handle the case where the file is not a valid image or cannot be opened
removed_images2.append(filename)
# Remove the file if desired
os.remove(file_path)
else:
removed_images2.append(filename)
os.remove(file_path)
# Print the names of removed images
for image_name in removed_images2:
print(f"Image removed: {image_name}")
Code that Downloads First Image and Adds #img Column with Image Names
def keep_allowed_chars(input_string):
pattern = r'[^a-zA-Z0-9-.]' # pattern to match non-allowed chars
return re.sub(pattern, '', input_string)
image_info = []
with open(csv_file, 'r') as file:
# create a CSV reader object
csv_reader = csv.reader(file)
# read in the values in the "Image" column
header = next(csv_reader)
image_col_index = header.index('Image')
name_col_index = header.index('Name')
for row in csv_reader:
to_add = {}
if row[image_col_index] == "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcR08w3j9UMPcMqpu55Q-sQAJGXRwJ-20KEJfwiAyrcMjuO1NW5MhJ2otwqIejs&s":
to_add['url'] = ""
else:
to_add['url'] = row[image_col_index]
name = row[name_col_index].replace(" ", "")
name = name.replace(".", "")
name = keep_allowed_chars(name)
to_add['filename'] = name
image_info.append(to_add)
column_name = '#img'
if not os.path.exists(download_dir):
os.makedirs(download_dir)
#Download images and save to directory
file_paths = []
nofilename = os.path.join(download_dir, "NoImageAvailable.jpg")
noresponse = requests.get("https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcR08w3j9UMPcMqpu55Q-sQAJGXRwJ-20KEJfwiAyrcMjuO1NW5MhJ2otwqIejs&s%22")
with open(nofilename, 'wb') as f:
f.write(noresponse.content)
file_paths.append(os.path.abspath(nofilename))
for info in image_info:
url = info['url']
if url == "":
continue
filename = os.path.join(download_dir, f"{info['filename']}.jpg")
response = requests.get(url)
with open(filename, 'wb') as f:
f.write(response.content)
file_paths.append(os.path.abspath(filename))
#Append image information to CSV file
add_to_csv = []
for i in image_info:
if i['url'] == "":
add_to_csv.append("NoImageAvailable")
else:
add_to_csv.append(i['filename'])
add_column_to_csv(csv_file, "#img", add_to_csv)
In the end, the following columns should be added to the csv: Gender Prediction - Name (based off of first name), Image (the first image url of the best search query for each observation), Image Result (the gender prediction based off of the first image url of the best search query for each observation), Images (the rest of the images in the best search query), Gender Match (if Guessed_Gender and Image Result match), and #img (to display the best image in the SuAVE visualization)
Creating SuAVE Visualization
To produce the SuAVE visualization, do these steps:
- Choose upload CSV and upload the updated CSV
- Title your SuAVE visualization
- Instead of creating an empty repository, upload the images you downloaded (the order your upload the images does not matter)
- Alternatively, upload your images to dzgen.sdsc.edu/upload; enter your email, select the images to upload, and select store in file server. Then, click upload. Wait until it sends a link to your email, and copy that link under the upload your images via link option.
Now you will have a SuAVE visualization with images. Be sure to manually check some images to assure validity and scan over images to make sure there are no wacky images.