April 27, 2009
"Fun with YouTube's Audio Content ID System"
It's very evident why they choose to mute the entire audio track of a positively ID'd video instead of just the part with the problem audio: The fingerprinter can only reliably say "yes, [one particular song] is in here, somewhere," but it doesn't know exactly where in the video the infringing content starts or for how long it plays. It's far easier to just nuke the entire audio track than try to figure out precisely how to cut into it.

Remember #silentyoutube, when YouTube started switching out "unauthorised audio tracks", rendering videos silent? Well, just how does YouTube's audio content ID system work? Slashdot links to an interesting analysis and discussion of how YouTube's audio fingerprinting technology functions, written by someone from Rochester Institute of Technology's Computer Science House (sorry, I couldn't work out who the author is).

Looking for a way to understand how the fingerprinting service identifies music, the author tweaked a range of factors including pitch, speed, sampling length, and the "stereo imagery" of 1982's "I Know What Boys Like" by The Waitresses, uploading the song repeatedly to YouTube to see whether it would get caught in the filter. The reasons for using this track are sound, and worth quoting:


  • It was the first song I ever saw that was identified and removed by YouTube's fingerprinting system.

  • It has a very distinctive sound that I thought would be easily identifiable. It's also really repetitive, which probably makes it an easy target for an automated system to detect.

  • It's one of the few songs I actually have readily available in an uncompressed format. The majority of my music collection is stored with lossy data compression, which might have impacted the results.

  • In general, it's just a terrible song. I wanted to highlight the fact that somewhere out there, somebody thinks this 27-year-old heap is still valuable enough to be barred from YouTube.


The results are interesting, if a little inconclusive (see the table of results for each test in the original article). The system seems to be progressively scanning all of YouTube's content, it hears in mono, and it is fairly persistent and resilient, identifying some tracks where the audio had been fairly heavily altered. However, it seems to rely on the first 30 seconds of the track:

When I muted the beginning of the song up until 0:30 (leaving the rest to play) the fingerprinter missed it. When I kept the beginning up until 0:30 and muted everything from 0:30 to the end, the fingerprinter caught it. That indicates that the content database only knows about something in the first 30 seconds of the song. As long as you cut that part off, you can theoretically use the remainder of the song without being detected. I don't know if all samples in the content database suffer from similar weaknesses, but it's something that merits further research.
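The two muting tests quoted above are easy to picture in code. What follows is only a minimal sketch, not the original author's procedure (the article does not say which tools were used): it assumes the audio is already decoded into a plain list of mono samples, and the names `mute_range` and `make_test_variants` are hypothetical.

```python
def mute_range(samples, sample_rate, start_s, end_s=None):
    """Return a copy of `samples` with [start_s, end_s) replaced by silence.

    `end_s=None` means "until the end of the track".
    Hypothetical helper for illustration only.
    """
    start = max(0, int(start_s * sample_rate))
    if end_s is None:
        end = len(samples)
    else:
        end = min(len(samples), int(end_s * sample_rate))
    muted = list(samples)  # never modify the caller's data
    if start < end:
        muted[start:end] = [0.0] * (end - start)
    return muted


def make_test_variants(samples, sample_rate, split_s=30.0):
    """Build the two variants described in the quoted test."""
    return {
        # Intro silenced, remainder intact (reported as missed).
        "intro_muted": mute_range(samples, sample_rate, 0.0, split_s),
        # Intro intact, remainder silenced (reported as caught).
        "rest_muted": mute_range(samples, sample_rate, split_s),
    }
```

Each variant would then be encoded and uploaded separately; whether it gets flagged is something only YouTube's side can tell you.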

It seems you can thwart it by raising the pitch or speed by 5%, though. I find work like this incredibly interesting, because it highlights both the opacity of how YouTube operates and the inherent difficulty of trying to resolve cultural matters with technical solutions. Having content remixed, repurposed, appropriated and made use of by people signals the success of a creative endeavour: it means your content has become culturally meaningful, even if it doesn't remain so for very long. Approaches that attempt to patrol, circumscribe or exert strict control over the circulation of content treat it as a commodity alone, and the current markets (as I've argued elsewhere) no longer operate solely according to commodity logics (if indeed they ever did).
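For readers curious what a 5% speed change amounts to in practice, here is a rough sketch under stated assumptions. It resamples a list of mono samples with linear interpolation, so speed and pitch rise together, as with a tape running fast. The function name `change_speed` is hypothetical, and this is not necessarily how the original author altered the track; changing pitch without changing tempo needs a more involved time-stretching technique that is not shown here.

```python
def change_speed(samples, factor):
    """Resample so that playback at the original rate is `factor` times faster.

    factor=1.05 gives a track about 5% shorter and correspondingly
    higher in pitch. Hypothetical helper for illustration only.
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    if not samples:
        return []
    last = len(samples) - 1
    out_len = max(1, int(len(samples) / factor))
    out = []
    for i in range(out_len):
        pos = i * factor             # fractional position in the input
        lo = min(int(pos), last)     # sample just before that position
        hi = min(lo + 1, last)       # sample just after, clamped at the end
        frac = pos - lo
        # Linear interpolation between the two neighbouring samples.
        out.append(samples[lo] * (1.0 - frac) + samples[hi] * frac)
    return out
```

Calling `change_speed(track, 1.05)` on a decoded track would produce the kind of sped-up variant the experiment describes; linear interpolation is crude by audio standards, but it keeps the sketch short.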

This post was previously published here.