Art Museum lets you find art with your voice. It's powered by the Art Institute of Chicago's public API and Alexa Conversations, Amazon's next-generation AI-driven dialog manager. It was recently awarded the Grand Prize in the Alexa Conversations Skills Challenge.
Art Museum Trailer (February 2021)
The skill is our latest exploration at the intersection of short-form media and voice. We created it during lockdown, while most museums, including the Art Institute, remained closed; a primary motivation for us was making their collection more accessible to new audiences in new ways. Read on to learn about the evolution of the project, how we made it, and how we're paying it forward to the arts, media, and technology communities in Chicago and beyond.
<aside> 👩‍🔬 Amazon Science Blog: Making an art collection browsable by voice
</aside>
<aside> 👩‍💻 Alexa Developer Blog: Announcing General Availability for Alexa Conversations
</aside>
The original concept dates back to AWS re:Invent 2018. The Art Institute had recently announced the release of 40,000+ Creative Commons images from their collection, which inspired this Alexa Hack Day demo. It was well received, but the idea didn't make it past a kitschy prototype. The release of Alexa Conversations gave us an opportunity to revisit the initial kernel.
The Art Museum – voted "Best Overall Voice Experience" @ AWS re:Invent Alexa Hack Day (November 2018)
Building for voice is hard! While you envision an experience based on how you hope users will speak ("the happy path"), the reality is that people can and will say almost anything at every turn. So the mission-critical challenge for building robust voice products always comes down to how you anticipate and mitigate this. Think of it like bumper bowling. No matter how someone rolls the ball, you want them to knock down some pins.
Previewed at Amazon re:MARS in 2019 and released in public beta last summer, Alexa Conversations is a new creative and technical paradigm for voice designers and developers. The contest was an opportunity for the voice community to start exploring this new technology with guidance and support from the Alexa product and engineering teams that built it. To learn more, check out the Amazon Science and Alexa Developer blogs. But here's a key bit:
Alexa developers can now leverage a state-of-the-art dialogue manager powered by deep learning to create complex, nonlinear experiences — conversations that go well beyond today's typical one-shot interactions, such as "Alexa, what's the weather forecast for today?" or "Alexa, set a ten-minute pasta timer".
Source: Amazon Science blog
Alexa Conversations gives you the tools to craft sample dialogs – each representing modular pieces of your skill's happy path – which in turn enable flexible, non-linear input from the user. Each of these sample interactions is grounded in a successful outcome (an API Definition). For example, an `OrderMovieTickets` API could be defined with required `theatre`, `showtime`, and `title` slots, and Alexa Conversations will extrapolate a model of all the ways a user might order tickets based on the build artifacts you provide (annotated dialogs, API definitions, slot types, and response templates). This allows a user to buy tickets in any sequence they wish, and to easily make changes along the way ("actually, make that the 8pm showing!"). Given this flexibility, Alexa Conversations is especially well suited for goal-oriented, transactional use cases. But with Art Museum, we explored a different use case: discovery and navigation across a catalog of media.
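To make that concrete, here is a minimal sketch of how a skill backend might fulfill an Alexa Conversations API call like the hypothetical `OrderMovieTickets` above. Once the dialog manager has collected the required slots, it sends the skill a `Dialog.API.Invoked` request, and the skill returns its result in the response's `apiResponse` field. The handler and the `book_tickets()` helper below are illustrative, not production code:

```python
# Illustrative only: a bare-bones AWS Lambda handler for an Alexa Conversations
# API call. OrderMovieTickets, theatre, showtime, and title mirror the
# hypothetical example above; book_tickets() is a stand-in for real fulfillment.

def lambda_handler(event, context):
    request = event["request"]

    # Alexa Conversations delivers the filled API arguments in a
    # Dialog.API.Invoked request once the dialog manager has what it needs.
    if (request["type"] == "Dialog.API.Invoked"
            and request["apiRequest"]["name"] == "OrderMovieTickets"):
        args = request["apiRequest"]["arguments"]
        confirmation = book_tickets(
            theatre=args["theatre"],
            showtime=args["showtime"],
            title=args["title"],
        )
        return {
            "version": "1.0",
            "response": {
                # The API result; your response templates turn this into
                # speech and visuals.
                "apiResponse": {"confirmationNumber": confirmation},
                "shouldEndSession": False,
            },
        }

    # Fall through for other request types (LaunchRequest, intents, etc.)
    return {"version": "1.0", "response": {"shouldEndSession": False}}


def book_tickets(theatre, showtime, title):
    # Placeholder for a real ticketing integration.
    return "ABC123"
```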
From THE FUTURE IS MODULAR, John G's lightning talk at VOICE Summit 2019:
Today, we click and tap screens to find stuff. And we scroll. And we process lots of visual information very, very quickly. Like thumbnails. And captions. And titles. And once we decide on something, we’re still not satisfied. We scrub through it! We navigate through Network > Show > Season > Episode. Genre > Artist > Album > Song. We’ve all been trained to think in hierarchy. And the entire media industry is organized around it, from creation to distribution. And while aggregators and playlists and feed based networks have challenged this structure, consumers still choose with their eyes. But discovery with your ears is a different beast. Because speech is inherently linear. We can’t speak two things at once, and neither can our assistants, so perusing the endless abyss of choice that we all wade through every day is far less efficient without our eyes. Especially when we don’t quite know what we’re looking for. Or when we do know what we want, but we don’t quite know what it’s called.
Art Museum lets you traverse a vast art collection with simple language. As a starting place, you can go broad. 🗣: "I want to see a painting". 🗣: “Show me another one like that”. And as you explore the collection, you can drill down. 🗣: “Show me paintings from France”. 🗣: “Show ones with horses in them”. 🗣: “Bring me to sculptures from India”. 🗣: “Actually, show one from Germany”.
Of course, none of this would be possible without the Art Institute of Chicago, a world-class museum with a world-class API. In addition to a full REST endpoint, they offer one-click data dumps, allowing easy access to their entire collection in JSON. As a starting place, we filtered out every record that didn't meet the following criteria:

- `"is_public_domain": true`
- `"has_multimedia_resources": true`

This left us with ~325 Creative Commons Zero licensed images with accompanying audio content. From there, we began to design and compile a conversational metadata layer based around three pillars: `category`, `origin`, and `detail` (and after some early user feedback, `title` and `artist`). These pillars serve as the connective tissue between our Alexa Conversations interaction model and the Art Institute's dataset.
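As a rough sketch of that first filtering pass (the file name and field choices below are illustrative, not our exact pipeline), paring the data dump down and projecting a slim record to hang the pillars on looks something like this:

```python
import json

# Illustrative only: assumes the collection dump has been saved locally as
# artworks.json, a JSON array of artwork records from the Art Institute's
# data dumps (field names like artist_title follow their API documentation).
with open("artworks.json") as f:
    records = json.load(f)

# Keep only CC0 works that ship with multimedia (audio) resources.
eligible = [
    r for r in records
    if r.get("is_public_domain") and r.get("has_multimedia_resources")
]

# Project a slimmer record to hang the conversational metadata on.
# The pillar fields (category, origin, detail) are curated by hand,
# so they start out empty here.
catalog = [
    {
        "id": r["id"],
        "title": r.get("title"),
        "artist": r.get("artist_title"),
        "category": None,
        "origin": None,
        "detail": [],
    }
    for r in eligible
]

print(f"{len(catalog)} artworks made the cut")
```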
We also ran the images through AWS Rekognition's object and scene detection algorithm to bring some additional descriptive tags into the mix. Our dataset is a blend of existing metadata from the API, supplemental descriptive tags from Rekognition, and a lot of elbow grease and UX testing to smooth it all out.
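That tagging step boils down to a call to Rekognition's `detect_labels` API. Here's a minimal sketch with boto3, assuming the images live in an S3 bucket (the bucket name, key, and thresholds are placeholders, not our production values):

```python
import boto3

# Illustrative only: bucket name, keys, and thresholds are placeholders.
rekognition = boto3.client("rekognition")

def descriptive_tags(bucket, key, max_labels=10, min_confidence=80.0):
    """Return lowercase Rekognition label names for an image stored in S3."""
    response = rekognition.detect_labels(
        Image={"S3Object": {"Bucket": bucket, "Name": key}},
        MaxLabels=max_labels,
        MinConfidence=min_confidence,
    )
    return [label["Name"].lower() for label in response["Labels"]]

# e.g. tags for the Seurat might come back as "person", "park", "water", "dog"
tags = descriptive_tags("art-museum-images", "seurat-grande-jatte.jpg")
```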
Sample interactions on Echo Show (January 2021)
A Sunday on La Grande Jatte — 1884 by Georges Seurat (with museum tour audio)
Alexa Conversations Skills Challenge Submission Video (September 2020)
This is one of our annotated sample dialogs, representing a user's intention to invoke our `getArt` API with `detail` and `origin` arguments. Every line of sample user dialogue is supported by an utterance set to further train the model:
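For readers new to Alexa Conversations: an utterance set is simply a collection of alternative phrasings for the same annotated line, with slots marked in braces. Purely as an illustration (not our actual training data), a set backing a request like "show me paintings from France" might look something like this:

```python
# Purely illustrative (not the skill's actual training data): alternative
# phrasings for a getArt request carrying detail and origin slots.
GET_ART_UTTERANCES = [
    "show me {detail} from {origin}",
    "I want to see {detail} from {origin}",
    "can you bring up {detail} from {origin}",
    "do you have any {detail} from {origin}",
    "let's look at some {detail} from {origin}",
]
```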