Python multiprocessing

One of my biggest complaints with Python is the GIL. The Global Interpreter Lock prevents multiple threads from executing CPU-bound tasks in parallel. However, using the multiprocessing module, a Python script can sidestep the GIL by running separate instances of the Python interpreter. Each instance, having its own GIL, can run the same function in parallel with different input. The operating system can then, at the process level, distribute the work among multiple processors and processor cores. Perhaps the most convenient way of doing this is the Pool.map() function.
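You can see the difference the GIL makes with a quick experiment. This is just a sketch (the burn() function and the worker counts are mine, purely illustrative): the same CPU-bound function is mapped over a thread pool and a process pool, and on a multi-core machine only the process pool gets a real speedup.

import time
from multiprocessing import Pool
from multiprocessing.pool import ThreadPool

def burn(n):
    # Purely CPU-bound busy loop; no I/O for threads to overlap
    while n > 0:
        n -= 1

if __name__ == "__main__":
    items = [20_000_000] * 4

    start = time.time()
    with ThreadPool(4) as tp:
        tp.map(burn, items)  # all four threads share one GIL
    print("threads:  ", round(time.time() - start, 2), "s")

    start = time.time()
    with Pool(4) as pp:
        pp.map(burn, items)  # one GIL per process
    print("processes:", round(time.time() - start, 2), "s")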

Multiprocessing module

Let’s take a look at an example program:

import multiprocessing
import time

def process_item(item):
    print("[START]", item)
    time.sleep(3)
    print("       ", item, "[STOP]")

def main():
    MAX_PROCESSES = 2
    items = ["1", "2", "3", "4", "5"]

    pool = multiprocessing.Pool(MAX_PROCESSES)
    pool.map(process_item, items)

if __name__ == "__main__":
    main()

The important bits are the two lines involving the pool variable. pool = multiprocessing.Pool(MAX_PROCESSES) tells the multiprocessing module how many worker processes to spawn; the separate Python interpreter processes are created here, when the pool is constructed. The call to pool.map(process_item, items) is where the magic happens. map() takes each element from the items list and passes it as a parameter to the process_item() function. Here we have a MAX_PROCESSES of 2, so map() hands the first two elements of the items list to the two workers, which run process_item() in parallel, each with one of the elements. Whenever a worker finishes an item, map() hands it the next unprocessed element from the items list, and this repeats until all elements have been processed.
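Running the program, the output looks roughly like this (the exact interleaving varies from run to run, but the items are clearly processed two at a time, three seconds per batch):

[START] 1
[START] 2
        1 [STOP]
[START] 3
        2 [STOP]
[START] 4
        3 [STOP]
[START] 5
        4 [STOP]
        5 [STOP]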

The task

I have a TON of digital audio files. I also like to store these in multiple formats: one lossless format for archiving, and one lossy format for playback on smaller devices. So I routinely do batch re-encodings of my audio files into my preferred formats. However, transcoding the files one at a time is slow, and my encoder of choice does not take advantage of multiple processor cores. Python and multiprocessing to the rescue!

The solution

import glob
import os
import multiprocessing
import subprocess

MAX_PROC = 8

def process_files(f):
    print("[i] Converting", f)
    # Convert to OGG. subprocess.run() takes an argument list and bypasses
    # the shell, so file names need no quoting or escaping.
    cmd = ["ffmpeg", "-loglevel", "8", "-y", "-i", f, "-c:a", "libvorbis",
           "-qscale:a", "6", "-vn", "audio/ogg/" + f + ".ogg"]
    print("[DEBUG]", " ".join(cmd))
    subprocess.run(cmd)
    # Convert to MP3
    cmd = ["ffmpeg", "-loglevel", "8", "-y", "-i", f, "-c:a", "libmp3lame",
           "-qscale:a", "3", "-vn", "audio/mp3/" + f + ".mp3"]
    print("[DEBUG]", " ".join(cmd))
    subprocess.run(cmd)

def main():
    WORK = []
    EXTENSIONS = ["mkv", "mp4", "webm"]
    for ext in EXTENSIONS:
        for x in glob.glob("*." + ext):
            WORK.append(x)
    print("Converting", len(WORK), "files...")

    # Make sure the output directories exist before ffmpeg tries to write there
    os.makedirs("audio/ogg", exist_ok=True)
    os.makedirs("audio/mp3", exist_ok=True)

    p = multiprocessing.Pool(MAX_PROC)
    p.map(process_files, WORK)

if __name__ == "__main__":
    main()

This program creates a list of files (WORK) with the extensions specified in EXTENSIONS. It then calls the process_files() function for each element in the WORK list, with up to 8 Python interpreter processes running in parallel. The process_files() function uses subprocess.run() to launch ffmpeg once to convert the input file to OGG, and then again to convert it to MP3. Because the command is passed as an argument list rather than a shell string, file names containing spaces or parentheses need no quoting or escaping.
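One improvement worth making for a big batch: subprocess.run() hands back the exit status, so failed conversions can be reported instead of scrolling by unnoticed. A minimal sketch (the run_ffmpeg() helper is my own naming):

def run_ffmpeg(cmd):
    # Run ffmpeg and report a non-zero exit status
    result = subprocess.run(cmd)
    if result.returncode != 0:
        print("[!] ffmpeg failed:", " ".join(cmd))
    return result.returncode

Each subprocess.run(cmd) call in process_files() would then become run_ffmpeg(cmd).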

Game Over.