Mini Experiment: Broken Multimodal Telephone
At a birthday party recently we played a game where a pad of paper is passed around a circle, with each person writing a sentence to describe a picture. The next person then draws a picture to match the sentence, and so on. The results are often hilarious, with the final picture bearing little resemblance to the original sentence. Of course, the next day I had to replicate this with an image generation model and a multimodal model ping-ponging back and forth.
To generate the images, I went with Dalle-3 via the OpenAI API:
from openai import OpenAI

openai_client = OpenAI(api_key="your_key")

response = openai_client.images.generate(
    model="dall-e-3",
    prompt="The dolphins have taken over the world. The dolphin king celebrates.",
    size="1024x1024",
    quality="standard",
    n=1,
)
image_url = response.data[0].url
This image URL can then be passed to Anthropic's Haiku model, which is fantastically cheap and capable of taking both images and text as inputs:
import base64

import httpx
from anthropic import Anthropic

anthropic_client = Anthropic(api_key="your_key")

message = anthropic_client.messages.create(
    model="claude-3-haiku-20240307",
    max_tokens=100,
    temperature=0.5,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/png",
                        "data": base64.b64encode(httpx.get(image_url).content).decode("utf-8"),
                    },
                },
                {
                    "type": "text",
                    "text": "Provide a short description of the image.",
                },
            ],
        }
    ],
)
prompt = message.content[0].text
Then prompt can be passed back to Dalle-3 to generate a new image, and so on.
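Putting the two calls together, the whole experiment is just a loop. Here's a rough sketch, assuming the openai_client and anthropic_client set up above; the number of rounds and the list of (prompt, image) pairs are my own bookkeeping so the results can be fed straight into the GIF helper below. (I download each image once and reuse the bytes for both the GIF frame and the Haiku call.)

import io

from PIL import Image as PILImage

results = []
prompt = "The dolphins have taken over the world. The dolphin king celebrates."

for _ in range(10):  # number of rounds is arbitrary
    # Text -> image with Dalle-3
    response = openai_client.images.generate(
        model="dall-e-3",
        prompt=prompt,
        size="1024x1024",
        quality="standard",
        n=1,
    )
    image_bytes = httpx.get(response.data[0].url).content
    results.append((prompt, PILImage.open(io.BytesIO(image_bytes))))

    # Image -> text with Haiku
    message = anthropic_client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=100,
        temperature=0.5,
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image",
                        "source": {
                            "type": "base64",
                            "media_type": "image/png",
                            "data": base64.b64encode(image_bytes).decode("utf-8"),
                        },
                    },
                    {"type": "text", "text": "Provide a short description of the image."},
                ],
            }
        ],
    )
    prompt = message.content[0].text

Here are a few GIFs with some results: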
It’s interesting to see how long these stay coherent. Previous times I’ve tried this, things have gone abstract fairly quickly; here the theme diverges but gets stuck in attractors that still often make sense. I look forward to repeating this as models improve :) If you try this and make anything fun, let me know! Here’s how I make the GIFs:
import imageio
from PIL import Image as PILImage, ImageDraw, ImageFont


def save_results_as_gif(results, filename, time_per_frame=1):
    images = []
    for prompt, image in results:
        # Create a black image with the same size as the original image
        black_image = PILImage.new("RGB", image.size, (0, 0, 0))
        draw = ImageDraw.Draw(black_image)
        font = ImageFont.truetype("arial_narrow_7.ttf", 20)
        text = "Prompt: " + prompt

        # Add newlines to the text to roughly keep it within the image
        text_lines = []
        max_width = 80
        line = ''
        for word in text.split():
            if len(line + word) <= max_width:
                line += word + ' '
            else:
                text_lines.append(line)
                line = word + ' '
        text_lines.append(line)
        text = '\n'.join(text_lines)

        text_width, text_height = 800, 20
        text_position = ((image.width - text_width) // 2, (image.height - text_height) // 2)
        draw.text(text_position, text, font=font, fill=(255, 255, 255))

        # Append the black prompt frame and the original image to the list of frames
        images.append(black_image)
        images.append(image)

    # Save the frames as a GIF
    imageio.mimsave(filename, images, duration=time_per_frame)


# Example usage (results is a list of tuples of prompts and images)
save_results_as_gif(results, "broken_telephone1.gif", time_per_frame=1500)