The state of creative AI: will video producers/editors get superpowers?

Disruptive innovations begin at the bottom of a market with simple applications, then move up until they displace established ways of working. Today, we are witnessing the entry of Artificial Intelligence (AI) into basic video production. As technology becomes more powerful, the impact of generative AI will increase.

In this article I will show examples that are representative of the current state of AI and have the potential to impact the jobs of video producers and editors.

Next-level color grading — a.k.a. style transfer

Color grading is an art form in itself. Sure, anyone can slap a color filter on a video, but getting a consistent tone across scenes requires skill and experience. Currently, there are AI-powered tools for Adobe Premiere, DaVinci Resolve, and other editors that let you pick a look and apply it to a scene.
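To make the idea of "pick a look and apply it to a scene" concrete, here is a minimal sketch using histogram matching from scikit-image. This is a simple statistical trick, not the AI inside those plugins, and the file names are placeholders.

```python
import numpy as np
from skimage import io
from skimage.exposure import match_histograms

# Placeholder file names: one frame from the scene we want to grade and a
# reference frame whose look we want to copy.
target_frame = io.imread("scene_b_frame.png")        # frame to grade
reference_frame = io.imread("scene_a_reference.png") # frame with the desired look

# Match the color distribution of the target to the reference, per channel.
graded = match_histograms(target_frame, reference_frame, channel_axis=-1)

# match_histograms returns floats; convert back to 8-bit before saving.
io.imsave("scene_b_graded.png", np.clip(graded, 0, 255).astype(np.uint8))
```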

However, style transfer, a hot topic in generative AI, extends beyond that: objects in the scene change not only color but also texture and shape. Researchers from Intel presented an approach to make synthetic images look more realistic (May ‘21): Grand Theft Auto (GTA) scenes were adapted to look like realistic dashcam footage.

GTA is notorious for violence and purposeful accidents. The photorealism style makes this even scarier.

On the project’s website, you can see how dry patches of dirt turn into green grass, the texture of the road changes, cars get new reflections, and the sky turns greyer to match the original dashcam footage shot in Germany. Though this technology is still in its infancy, you can imagine how it might be used to make it rain, literally, or to turn Route 66 into the German Autobahn.
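Intel’s method is a learned image-to-image network trained against real dashcam footage, which goes well beyond a blog-post snippet. But the underlying idea of transferring a “look” onto an existing frame can be sketched with classic neural style transfer in PyTorch. This is a minimal sketch, assuming a recent torchvision; the file names, layer choices, and loss weights are placeholders, and it works on a single frame rather than a whole video.

```python
import torch
import torch.nn.functional as F
from PIL import Image
from torchvision import models, transforms
from torchvision.utils import save_image

# Pretrained VGG19 as a fixed feature extractor (classic Gatys-style transfer,
# not Intel's photorealism-enhancement network described above).
vgg = models.vgg19(weights="IMAGENET1K_V1").features.eval()
for p in vgg.parameters():
    p.requires_grad_(False)

def load(path, size=256):
    tf = transforms.Compose([transforms.Resize((size, size)), transforms.ToTensor()])
    return tf(Image.open(path).convert("RGB")).unsqueeze(0)

def features(x, layers=(0, 5, 10, 19, 28)):
    feats, h = [], x
    for i, layer in enumerate(vgg):
        h = layer(h)
        if i in layers:
            feats.append(h)
    return feats

def gram(f):
    # Gram matrix of the feature maps captures the "look" (textures, colors).
    b, c, height, w = f.shape
    f = f.view(c, height * w)
    return f @ f.t() / (c * height * w)

content = load("game_frame.png")     # placeholder: a synthetic/rendered frame
style = load("dashcam_frame.png")    # placeholder: a real-world reference frame
target = content.clone().requires_grad_(True)
optimizer = torch.optim.Adam([target], lr=0.02)

style_grams = [gram(f) for f in features(style)]
content_feats = features(content)

for step in range(300):
    optimizer.zero_grad()
    t_feats = features(target)
    content_loss = F.mse_loss(t_feats[-1], content_feats[-1])
    style_loss = sum(F.mse_loss(gram(tf_), g) for tf_, g in zip(t_feats, style_grams))
    (content_loss + 1e4 * style_loss).backward()
    optimizer.step()

save_image(target.detach().clamp(0, 1), "styled_frame.png")
```

Classic style transfer only repaints textures and colors; keeping shapes and edits consistent across an entire video is what makes the research above hard.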

Shoot the foreground, swap the background

Researchers from the Weizmann Institute of Science and Adobe Research (Sep ’21) were able to edit the background of a video while keeping the foreground intact, or the other way around: edit foreground objects while keeping the background intact. With AI, a video can be decomposed into a foreground and a background map (atlas). Editing one of the atlases and recomposing the video creates amazing, natural-looking results.
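The clever part of the paper is learning the decomposition into atlases; the sketch below only illustrates the final step, recomposing a frame from an already separated foreground and an edited background, using plain alpha compositing in OpenCV. File names are placeholders and a precomputed alpha matte is assumed.

```python
import cv2
import numpy as np

# Placeholder inputs: the original frame, a foreground alpha matte, and an
# edited background (e.g. a new sky or location).
frame = cv2.imread("frame.png").astype(np.float32) / 255.0
alpha = cv2.imread("foreground_alpha.png", cv2.IMREAD_GRAYSCALE).astype(np.float32) / 255.0
new_background = cv2.imread("edited_background.png").astype(np.float32) / 255.0

# Broadcast the single-channel matte over the three color channels and blend:
# foreground pixels come from the original frame, the rest from the new background.
alpha = alpha[..., None]
composite = alpha * frame + (1.0 - alpha) * new_background

cv2.imwrite("recomposed_frame.png", (composite * 255).astype(np.uint8))
```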

Video of the swapping process. Image from the project page.

Video showing the original and the swapped background. Image from the project page.

If George Lucas decided to move an epic battle from the ice planet Hoth to the deserts of Tatooine, it could be done. It would also allow producers to shoot in suboptimal circumstances and fix it “in post”.

The best of both worlds: Combining CGI with AI

CGI characters are key to the visual effects (VFX) industry. They are created by rendering 3D models of actors or fantasy creatures. Despite great advances in the field, faces still look somewhat artificial; the problem areas are the eyes, the hair, and the inside of the mouth.

Researchers from Disney published a method (Nov ’21) that combines both techniques. In short, they rendered a CGI face with the correct expression, viewpoint, and lighting, but without the problem areas, and fed this to an AI to produce the rest. The result is faces that match the appearance of the 3D character, with natural-looking hair, eyes, and inner mouth.
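Disney’s approach uses a learned neural renderer to fill in those regions; as a toy stand-in, the sketch below masks out the eye and inner-mouth areas of a rendered face and fills them with OpenCV’s classical inpainting. It only illustrates the “render everything except the hard parts, then fill them in” idea, not the actual method, and the file name and coordinates are made up.

```python
import cv2
import numpy as np

# Placeholder input: a rendered CGI face where eyes and inner mouth look artificial.
render = cv2.imread("cgi_face_render.png")

# Mask marking the "problem areas" to be re-synthesized (made-up coordinates).
mask = np.zeros(render.shape[:2], dtype=np.uint8)
cv2.rectangle(mask, (210, 180), (300, 220), 255, -1)  # left eye
cv2.rectangle(mask, (340, 180), (430, 220), 255, -1)  # right eye
cv2.rectangle(mask, (260, 330), (380, 400), 255, -1)  # inner mouth

# Classical inpainting as a stand-in for the learned generator:
# fill the masked regions from the surrounding pixels.
filled = cv2.inpaint(render, mask, 5, cv2.INPAINT_TELEA)

cv2.imwrite("cgi_face_filled.png", filled)
```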

Mocap for the masses

Motion capture (mocap) is used to map the movements of actors onto 3D characters to make their movements more realistic. It requires an expensive set-up, with actors wearing markers on their bodies or multi-angle camera rigs. Filmmakers are trying to simplify this process, and recent AI developments are bringing us closer to “markerless” mocap based on just a single video.

Researchers from Nvidia published a new mocap method that uses AI pose estimation on ordinary video footage, combined with a physics model that corrects glitches in the estimated poses. This technique could bring mocap to video producers without Hollywood budgets.
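The physics-based correction is Nvidia’s secret sauce and not something you can pip-install, but the pose-estimation half of the story is already accessible. The sketch below runs markerless pose estimation on a single video with Google’s MediaPipe as an off-the-shelf stand-in, not Nvidia’s method; the video file name is a placeholder.

```python
import cv2
import mediapipe as mp

mp_pose = mp.solutions.pose
mp_draw = mp.solutions.drawing_utils

cap = cv2.VideoCapture("football_clip.mp4")  # placeholder file name

# Track a person's skeleton across the video, no markers or special rig needed.
with mp_pose.Pose(static_image_mode=False, model_complexity=1) as pose:
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        # MediaPipe expects RGB input, OpenCV delivers BGR.
        result = pose.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if result.pose_landmarks:
            mp_draw.draw_landmarks(frame, result.pose_landmarks,
                                   mp_pose.POSE_CONNECTIONS)
        cv2.imshow("markerless pose", frame)
        if cv2.waitKey(1) & 0xFF == ord("q"):
            break

cap.release()
cv2.destroyAllWindows()
```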

Video showing a football player with a 3D mocap figure overlaid on top. Image from the Nvidia blog.

Synthetic dance floor

Capturing motion and applying it to 3D models is one step. Having AI create realistic motion is the next. Researchers at Google analysed 10 million frames of dance videos to create an AI choreographer (Jan ‘21). The AI model is shown a two-second dance clip to demonstrate the desired style, along with a piece of music. It then generates dance motion that follows the rhythm.

Apply photorealistic style transfer to these 3D dancers and we could easily add synthetic dancers to music videos. I’m not saying this is the world we should live in, just that we could.

Still a long way off

These techniques would let producers shoot in less-than-optimal conditions and fix it later, create realistic-looking synthetic video, or swap foreground and background objects, all faster and cheaper than current methods. Making high-end VFX more accessible would expand the creative freedom of artists.

We will see more AI-driven VFX developments in the next few years, but there are still many obstacles to overcome: first, these techniques need to reach production-ready quality, and then they need to be integrated seamlessly into existing production pipelines.

These proofs-of-concept can make us dream of new creative possibilities, but let’s see what we can do now.

Deep fear

In 2017, one of the first deepfakes appeared, featuring former President Obama. Since then, deepfakes have largely lived in the internet’s darker corners for adult entertainment purposes. Despite this, deepfakes continue to be feared for their potential socio-political harm. Deepfakes of politicians could spread fake news, or live deepfakes could be used to listen in on confidential video calls.

The sheer number of academic articles on deepfake detection reflects this deep fear. In 2019, ZAO, a popular Chinese deepfake app, caused similar concern.

However, there was no fake news explosion. Regular people were mostly using the app to slap their faces on popular movie scenes.

And when a string of European politicians were pranked by an impersonator of a Russian dissident in April ’21, numerous reports accused Russia of a sophisticated deepfake plot.

Dancing queen: deepfakes in entertainment

In 1976, ABBA sang about the Dancing Queen, but it was not until 2020 that we actually got to see the Queen dance: Channel 4’s alternative Christmas speech that year featured a deepfake of Queen Elizabeth II.

Then there is the infamous @DeepTomCruise, who fooled Justin Bieber into picking a fight with a deepfake version of Tom Cruise.

Despite the alarm, deepfake quality isn’t top-notch yet. When swapping a face, the source and destination heads must have the same shape and hairstyle; otherwise it just looks weird.

DeepTomCruise looks good because the role is played by a Tom Cruise look-alike. Moreover, DeepTomCruise’s creator, Chris Ume, told the 2021 TNW Conference that a 10–20 second video still needs about 24 hours of post-production work to enhance details and fix glitches.

While high-resolution deepfake technology for live video calls might be lying around in the labs of spy agencies, this level of quality cannot be achieved with free software you can download.

Synthetic video for production

In more controlled environments, synthetic video is possible. Synthesia lets you create synthetic talking heads for training or corporate communication purposes. While the application domain is still limited, the ease of use and time savings are impressive.

As a user, you select an actor, type in the voice-over, and the platform generates a video of the synthetic actor narrating your text. You can also create a synthetic version of yourself to act in the video.
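In code, the workflow boils down to sending a script and an actor choice to the platform and getting a rendered video back. The sketch below is purely illustrative: the endpoint, field names, and avatar id are assumptions on my part, so check Synthesia’s API documentation for the actual contract.

```python
import requests

# Illustrative only: endpoint, payload fields, and avatar id are assumed, not
# verified against Synthesia's real API. Replace with values from their docs.
API_KEY = "YOUR_SYNTHESIA_API_KEY"

payload = {
    "test": True,  # assumed flag for a draft/watermarked render while experimenting
    "input": [
        {
            "scriptText": "Welcome to our onboarding. Let's get you set up.",
            "avatar": "anna_costume1_cameraA",  # placeholder actor id
            "background": "off_white",          # placeholder background id
        }
    ],
}

response = requests.post(
    "https://api.synthesia.io/v2/videos",  # assumed endpoint
    json=payload,
    headers={"Authorization": API_KEY},
    timeout=30,
)
print(response.status_code, response.json())
```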

Screenshot of the Synthesia platform showing a slide with a synthetic actor. Image from the Synthesia press kit.

Using traditional techniques, producing a 10-minute video would require an actor, a producer, a studio, lighting, a camera, a microphone, shooting, editing, and a couple of thousand dollars. Ten minutes of synthetic video costs about 30 USD and a couple of minutes of your time.

Besides, it would cut the carbon footprint of video production dramatically, provided the platform is powered by renewable energy.

Voice clones — a.k.a. deep fake voices

A voice clone is a text-to-speech AI model that synthesizes someone’s voice. Creating a decent voice clone requires you to read about 30 minutes of text and feed the recordings to the AI model. However, researchers have been looking into creating voice clones from just a few seconds of audio.
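If you want to experiment yourself, the open-source Coqui TTS package ships a zero-shot voice-cloning model (YourTTS) that works from a short reference recording. A minimal sketch, assuming the package is installed; file names are placeholders.

```python
from TTS.api import TTS

# Load a multilingual zero-shot voice-cloning model (YourTTS) shipped with Coqui TTS.
tts = TTS(model_name="tts_models/multilingual/multi-dataset/your_tts")

# Clone the voice from a short reference recording and speak new text with it.
tts.tts_to_file(
    text="This sentence was never actually spoken by the person you hear.",
    speaker_wav="reference_voice.wav",  # placeholder: a few seconds of the target voice
    language="en",
    file_path="cloned_voice.wav",
)
```

Results from a few seconds of audio tend to exhibit exactly the metallic artifacts described below.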

Authentic-sounding voice clones are a lot more difficult to produce than fake faces. Most voice clones suffer from a metallic sound, caused by background noise and echo that gets baked into the model.

That is why deepfakes like DeepTomCruise, Fake Obama, or Channel 4’s Queen’s speech still rely on voice actors. We are still far from the doom scenario in which a clone of your voice, built from just a few seconds of audio, can be used for criminal purposes.

After you type the text into the Descript software, the voice clone reads it out loud. Video by author.

Daniel Verten, Head of Creative at Synthesia, explained to me that another reason why voice clones are difficult is that most voice synthesis models are designed for narration or interactive voice response (IVR) applications. This makes them less suitable for more emotional uses such as radio commercials or movies. He also told me that in the next 12 to 18 months, the quality of synthetic English voices and the expression of emotions will improve significantly.

Use cases for synthetic video and voices

Once voice quality and emotional expression improve, and many languages are supported, lots of use cases come into view.

For example: automatic dubbing of movies into another language, including emotions, and while we’re at it, letting AI lip-sync the actors to match the new language. Audiobooks narrated by their original author (even after they have died). Text-to-video that uses someone’s synthetic voice and face without relying on impersonators (not sure if this will make the world safer). Synthetic generation of radio commercials based on text input. And so on.

While we can all be creators with just a phone, high-quality video production and VFX remain out of reach for many. They require resources, time, and a lot of skill.

Imagine what would happen if we combined style transfer, object swapping, AI-powered mocap, CGI-plus-AI rendering, voice clones, and more in a frictionless production pipeline. It would dramatically change the nature of video production: it would reduce the cost of huge sets involving dozens of people and the time spent on tedious post-production, and it would bring advanced techniques to more artists, expanding their creative freedom. But the shift will also mean fewer operational and manual jobs in the industry.

Small studios should ask themselves: what if we had Hollywood-level production capabilities? And large studios should ask themselves: what if everyone had our production capabilities? How will we still make a difference?

With each passing year, these questions will become more pressing for the creative industry.
