I’ve been dutifully putting song ratings into iTunes for years now, rating each song individually according to its merit. iTunes actually died a while ago and forced me to start the entire rating process over again, but I still hope that one day I will have a fully rated music library.
While I can set up smart playlists within iTunes to get a good mix of music, it’s more interesting to have data that I can visualise. Naturally, I wrote a program to gather and interpret that data for me. Here are the (somewhat voluminous) results:
[edit 20080820: updated the results now that a more significant portion of the library has been rated]
Parsing XML... 7282 track items parsed
Building model of track/album/artist relationships... Done!
  7282 total tracks
  163 genres
  1597 artists
  2242 albums
  74 orphan tracks

Pruning library with a threshold of 5...
  Unrated tracks eliminated...
  albums with too few ratings eliminated...
  artists with too few ratings eliminated...
  genres with too few ratings eliminated...

Final cleanup of pruned library... Done!
  1372 pruned tracks
  46 genres
  93 artists
  75 albums

Average tracks per artist: 13.3669380088
Artists with the most tracks:
Red Hot Chili Peppers: 186
KMFDM: 115
Star Ocean The Second Story OST: 86
The Doobie Brothers: 80
Spoon: 74
311: 74
Insane Clown Posse: 64
Fatboy Slim: 62
Powerman 5000: 61
Nine Inch Nails: 59
Trans-Siberian Orchestra: 59
The Kleptones: 56
Pitchshifter: 54
Rage Against the Machine: 51
Cake: 50

77% of tracks have genres noted
Average tracks per genre: 9.55828220859
Genres with the most tracks:
rock: 869
Other: 783
Pop: 344
Alternative: 323
Soundtrack: 263
Electronic: 259
Metal: 147
Sound Clip: 132
Blues: 114
Game: 109
Classic Rock: 96
Techno: 92
Punk: 89
Industrial: 83
Mix CD: 81

Average albums per artist: 1.73324984346
Artists with the most albums:
The Beatles: 27
NOFX: 22
Queen: 19
Red Hot Chili Peppers: 16
Marilyn Manson: 16
311: 14
U2: 14
Dream Theater: 13
KMFDM: 13
Aerosmith: 13
Eminem: 12
Dave Matthews Band: 12
Sublime: 11
Cake: 11
Jars of Clay: 10

22% of tracks have ratings noted
Artists with the best average rating
Imogen Heap: 100.0
Buckshot LeFonque: 100.0
Splashdown: 100.0
analoq: 100.0
JET: 100.0
Stretch & Vern Present "Maddog": 100.0
The Evolution Control Committe: 100.0
Dispatch: 100.0
Ben Folds - Ben Folds: 100.0
Elastica: 100.0
Metric: 90.0
Remy Zero: 80.0
Mylo Feat. Freeform Five: 80.0
Ленинград: 80.0
川井憲次: 80.0

Genres with the best average rating
Ambient Alternative: 80.0
Revival: 80.0
Salsa: 60.0
Art Rock: 60.0
BritPop: 60.0
General Alternative: 60.0
hip stuff: 60.0
Vocal: 60.0
Folklore: 60.0
Retro: 60.0
Humor: 60.0
Broadway: 60.0
Noise: 60.0
Film Soundtrack: 60.0
Folk/Rock: 60.0

Considering only categories with at least five samples to compare between:
Artists with the best average rating
Moby: 84.0
Gorillaz: 82.0
Roisin Murphy: 80.0
Justice: 80.0
As Fast As: 80.0
Mylo: 77.1428571429
Blockhead: 76.6666666667
Daft Punk: 76.0
Heart: 76.0
Elektel: 76.0
Poe: 74.2857142857
Sara Bareilles: 74.2857142857
Vitalic: 73.3333333333
Prodigy: 73.3333333333
The Knife: 73.3333333333

Albums with the best average rating
Cross: 80.0
Discovery: 80.0
Demon Days: 80.0
Ruby Blue: 80.0
Open Letter to the Damned: 80.0
Palookaville: 80.0
Destroy Rock & Roll: 77.7777777778
Uncle Tony's Coloring Book: 76.6666666667
Space Travel with Teddybear: 76.0
Haunted: 74.2857142857
Little Voice: 74.2857142857
V Live: 73.3333333333
Hello Mom! (iTunes Version): 73.3333333333
Silent Shout: 73.3333333333
OK Cowboy: 73.3333333333

Genres with the best average rating
Electronica/Dance: 77.7777777778
Anime: 74.2857142857
Dance: 74.0
Electronic: 72.4324324324
Folk: 72.0
Unclassifiable: 70.0
RnB: 70.0
Neo-Electro: 70.0
Mix CD: 69.4736842105
Electronica: 69.3333333333
Alternative Rock: 67.7777777778
Alternative & Punk: 67.2727272727
Techno: 66.4516129032
Acapella: 66.3636363636
General Rock: 65.7142857143
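The two sets of "best average rating" lists differ so much because of sample size: an artist with a single five-star track averages 100.0, which says very little. Here is a minimal sketch of that minimum-sample filter, written for Python 3 rather than the Python 2 of the script itself. The function name `best_average` and the sample ratings are made up for illustration.

```python
from collections import defaultdict

def best_average(ratings, min_samples=1, top=3):
    """Rank groups by mean rating, skipping groups with too few ratings.

    ratings: iterable of (group, rating) pairs; rating is 0-100 or None.
    """
    by_group = defaultdict(list)
    for group, rating in ratings:
        if rating:  # ignore unrated tracks (None or 0)
            by_group[group].append(rating)
    averages = [
        (sum(values) / len(values), group)
        for group, values in by_group.items()
        if len(values) >= min_samples
    ]
    # Highest average first; ties broken alphabetically for stable output.
    averages.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(group, avg) for avg, group in averages[:top]]

# Hypothetical sample data: one lucky track versus a well-rated catalogue.
sample = [("One Hit", 100)] + [("Steady", r) for r in (80, 80, 100, 60, 80)]

print(best_average(sample))                 # no threshold
print(best_average(sample, min_samples=5))  # threshold of five
```

With no threshold the single-track artist tops the list; with a threshold of five it drops out entirely.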
Because I am a good person and like you, here is the source:
iTunesStats.py
#!/usr/bin/env python
"""
A set of utilities for working with iTunes XML files and generating interesting statistics therefrom.

Dependencies:
    PListReader (http://www.shearersoftware.com/software/developers/plist/)
    XMLFilter (http://www.shearersoftware.com/software/developers/xmlfilter/)
    path (http://www.jorendorff.com/articles/python/path)
"""
from __future__ import division
import sys
from PListReader import PListReader
from XMLFilter import XMLFilter
from path import path
from copy import copy

alphabet = set(list('abcdefghijklmnopqrstuvwxyz'))

def load(iml=None):
    if iml is None:
        #yup, i'm assuming Windows here
        iml = path('~/My Documents/My Music/iTunes/iTunes Music Library.xml').expand().abspath()
        if not iml.exists():
            #i do take into account the possibility of mac/unix users
            iml = path('~/Music/iTunes/iTunes Music Library.xml').expand().abspath()
            if not iml.exists():
                raise IOError('Could not automatically find "iTunes Music Library.xml"')
    else:
        iml = path(iml).expand().abspath()
    reader = PListReader()
    XMLFilter.parseFilePath(iml, reader, features = reader.getRecommendedFeatures())
    return reader.getResult()

class Track(object):
    class Lib(object):
        def __init__(self, track):
            self.track = track
            self.artist = None
            self.album = None
            self.genre = None

        def __str__(self):
            return u''.join([u'Track: ', unicode(self.track), u'\nArtist: ', unicode(self.artist), u'\n',
                             u'Album: ', unicode(self.album), u'\nGenre: ', unicode(self.genre)])

    def __init__(self, tdict, library = None):
        for key, val in tdict.iteritems():
            self.toAttr(key, val)
        keys = set(tdict.keys())
        if u'Name' not in keys:
            #fall back to the file name, minus extension and URL debris
            self.name = path(self.location).name
            if '.' in self.name:
                self.name = self.name.rpartition('.')[0]
            self.name = self.name.replace('%20', ' ')
            self.name = self.name.replace('_', ' ')
        if u'Artist' not in keys or self.artist == 'Various':
            self.artist = None
        if u'Album' not in keys:
            self.album = None
        if u'Genre' not in keys or self.genre == 'Unknown':
            self.genre = None
        if u'Rating' not in keys:
            self.rating = None
        #these will be initialized from the outside to point to
        #the object representations
        self.lib = Track.Lib(self)
        self.library = library
        if self.library is not None:
            self.setLibrary(self.library)

    def __cmp__(self, other):
        return cmp(self.trackID, other.trackID)

    def __str__(self):
        return unicode(self.name)

    def __repr__(self):
        return u'<Track: %s - %s>' % (unicode(self.artist), unicode(self.name))

    def toAttr(self, keyname, val):
        #turn a plist key such as "Track ID" into an attribute name such as "trackID"
        kn = []
        first = True
        for i in xrange(len(keyname)):
            ki = keyname[i]
            if ki.lower() in alphabet:
                if first:
                    kn.append(ki.lower())
                    first = False
                else:
                    kn.append(ki)
        setattr(self, ''.join(kn), val)

    def setLibrary(self, library):
        self.library = library
        self.library.tracks.add(self)
        if self.album is not None:
            self.library.albums.setdefault(self.album.lower(), TrackCollection(self.album)).add(self)
            self.lib.album = self.library.albums[self.album.lower()]
        if self.artist is not None:
            self.library.artists.setdefault(self.artist.lower(), TrackCollection(self.artist)).add(self)
            self.lib.artist = self.library.artists[self.artist.lower()]
        else:
            self.library.orphans.add(self)
        if self.genre is not None:
            self.library.genres.setdefault(self.genre.lower(), TrackCollection(self.genre)).add(self)
            self.lib.genre = self.library.genres[self.genre.lower()]

class TrackCollection(set):
    def __init__(self, name):
        self.name = name

    def __cmp__(self, other):
        return cmp(self.name.lower(), other.name.lower())

    def __repr__(self):
        return '<%s: %i Tracks>' % (self.name, len(self))

    def __str__(self):
        return self.name

    def average(self, key=lambda track: track):
        return self.sum((key(track) for track in self)) / len(self)

    def sum(self, iterable, key=lambda track: track):
        t = 0
        for i in iterable:
            try:
                t += i
            except TypeError:
                pass
        return t

class Library(object):
    def __init__(self, iml=None, messages=sys.stdout, suppressAutoIML=False):
        self.tracks = set()
        self.albums = {}
        self.artists = {}
        self.genres = {}
        self.orphans = set()
        self.messages = messages
        if not suppressAutoIML:
            self.initFromIML(iml)

    def initFromIML(self, iml):
        """
        Initialize the library from an iTunes Media Library
        """
        self.pr("Parsing XML... ", newline=False)
        lib = load(iml)
        self.pr("%i track items parsed" % len(lib[u'Tracks']))
        self.pr("Building model of track/album/artist relationships... ", newline=False)
        for tid, track in lib[u'Tracks'].iteritems():
            Track(track, self)
        self.pr("Done!")
        self.pr(" %i total tracks" % len(self.tracks))
        self.pr(" %i genres" % len(self.genres))
        self.pr(" %i artists" % len(self.artists))
        self.pr(" %i albums" % len(self.albums))
        self.pr(" %i orphan tracks" % len(self.orphans))

    def pr(self, msg='', newline=True):
        self.messages.write(unicode(msg).encode("utf-8"))
        if newline:
            self.messages.write('\n')

    def most(self, collection, collectionOperation=lambda col: len(col), viewTop=15, show=False):
        """
        See the most populous members of a collection.

        collection is one of "albums", "artists", "genres"
        collectionOperation is a function which is performed on each collection. Defaults to lambda col: len(col),
            which causes this function to return the most populous members of the collection. Other examples:
            lambda col: col.average(lambda track: track.rating) causes this to return the collections with the
            best average rating.
        viewTop restricts the number displayed. If 0, displays all.
        """
        if show:
            for col, size in self.most(collection, collectionOperation, viewTop, False):
                self.pr(unicode(col) + u': ' + unicode(size))
        else:
            col = [(collectionOperation(c), c) for c in getattr(self, collection).values()]
            col.sort()
            col.reverse()
            return [(c, cl) for cl, c in col] if viewTop == 0 else [(c, cl) for cl, c in col][:viewTop]

    def prune(self, threshold=5):
        """
        Generates a copy of the library with weak members pruned out.

        All unrated tracks are pruned. Then, for each collection type, each member with
        fewer than threshold tracks is pruned.
        """
        self.pr("Pruning library with a threshold of %i..." % threshold)
        l2 = Library(messages=self.messages, suppressAutoIML=True)
        for track in self.tracks:
            if track.rating is not None and track.rating > 0:
                t2 = copy(track)
                t2.lib = Track.Lib(t2)
                t2.setLibrary(l2)
        self.pr(" Unrated tracks eliminated...")
        for collection in ['albums', 'artists', 'genres']:
            toremove = set()
            for key, member in getattr(l2, collection).iteritems():
                if len(member) < threshold:
                    toremove.add(key)
                elif len([i for i in member if i.rating is not None and i.rating > 0]) < threshold:
                    toremove.add(key)
            coll = getattr(l2, collection)
            for key in toremove:
                del coll[key]
            setattr(l2, collection, coll)
            self.pr(" %s with too few ratings eliminated..." % collection)
        self.pr()
        self.pr("Final cleanup of pruned library... ", False)
        #keep only the tracks that still belong to a surviving collection
        newtracks = set()
        for collection in ['albums', 'artists', 'genres']:
            for member in getattr(l2, collection).values():
                for track in member:
                    newtracks.add(track)
        l2.tracks = newtracks
        self.pr("Done!")
        self.pr(" %i pruned tracks" % len(l2.tracks))
        self.pr(" %i genres" % len(l2.genres))
        self.pr(" %i artists" % len(l2.artists))
        self.pr(" %i albums" % len(l2.albums))
        return l2

def main(argv=None):
    if argv is None:
        argv = sys.argv
    iTunesLib = None
    if len(argv) > 1:
        iTunesLib = argv[1]
    lib = Library(iTunesLib)
    lib.pr()
    l2 = lib.prune()
    lib.pr()
    #now we just run through some standard stats
    lib.pr("Average tracks per artist: ", False)
    #iterate over the collections, not the dict keys: the keys are name strings,
    #so len() on them would measure the length of the name
    spa = [len(a) for a in lib.artists.values()]
    lib.pr(sum(spa) / len(spa))
    lib.pr("Artists with the most tracks:")
    lib.most('artists', show=True)
    lib.pr()
    lib.pr("%i%% of tracks have genres noted" % int(100*(len([i for i in lib.tracks if i.genre is not None])/len(lib.tracks))))
    lib.pr("Average tracks per genre: ", False)
    spg = [len(g) for g in lib.genres.values()]
    lib.pr(sum(spg) / len(spg))
    lib.pr("Genres with the most tracks:")
    lib.most('genres', show=True)
    lib.pr()
    lib.pr("Average albums per artist: ", False)
    lib.pr(sum((len(set((track.album for track in artist))) for artist in lib.artists.values())) / len(lib.artists))
    lib.pr("Artists with the most albums:")
    lib.most('artists', lambda artist: len(set((track.album for track in artist))), show=True)
    lib.pr()
    #despite the name, this counts the tracks that DO have a rating
    noratings = len([i for i in lib.tracks if i.rating is not None])
    lib.pr("%i%% of tracks have ratings noted" % int(100*(noratings/len(lib.tracks))))
    if noratings > 0:
        lib.pr("Artists with the best average rating")
        lib.most('artists', lambda col: col.average(lambda track: track.rating), show=True)
        lib.pr()
        lib.pr("Genres with the best average rating")
        lib.most('genres', lambda col: col.average(lambda track: track.rating), show=True)
        if len(l2.tracks) > 0:
            lib.pr()
            lib.pr("Considering only categories with at least five samples to compare between:")
            lib.pr("Artists with the best average rating")
            l2.most('artists', lambda col: col.average(lambda track: track.rating), show=True)
            lib.pr()
            lib.pr("Albums with the best average rating")
            l2.most('albums', lambda col: col.average(lambda track: track.rating), show=True)
            lib.pr()
            lib.pr("Genres with the best average rating")
            l2.most('genres', lambda col: col.average(lambda track: track.rating), show=True)

if __name__ == '__main__':
    sys.exit(main())

I encourage you to post your own results.
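For anyone who would rather not install the three dependencies, here is a rough sketch of the same idea using only the standard library. This is not how the script above works. It assumes Python 3, where `plistlib.loads` can read the XML property-list format, and the sample library is a tiny hand-written stand-in. The helper name `average_rating_by_artist` is made up for illustration.

```python
import plistlib
from collections import defaultdict

# A tiny hand-written stand-in for "iTunes Music Library.xml" (hypothetical data).
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<plist version="1.0">
<dict>
  <key>Tracks</key>
  <dict>
    <key>1</key>
    <dict>
      <key>Track ID</key><integer>1</integer>
      <key>Name</key><string>Song A</string>
      <key>Artist</key><string>Moby</string>
      <key>Rating</key><integer>80</integer>
    </dict>
    <key>2</key>
    <dict>
      <key>Track ID</key><integer>2</integer>
      <key>Name</key><string>Song B</string>
      <key>Artist</key><string>Moby</string>
      <key>Rating</key><integer>100</integer>
    </dict>
    <key>3</key>
    <dict>
      <key>Track ID</key><integer>3</integer>
      <key>Name</key><string>Song C</string>
      <key>Artist</key><string>Cake</string>
    </dict>
  </dict>
</dict>
</plist>
"""

def average_rating_by_artist(library):
    """Return {artist: mean rating}, counting rated tracks only."""
    ratings = defaultdict(list)
    for track in library["Tracks"].values():
        artist = track.get("Artist")
        rating = track.get("Rating")
        if artist and rating:  # skip orphans and unrated tracks
            ratings[artist].append(rating)
    return {artist: sum(r) / len(r) for artist, r in ratings.items()}

library = plistlib.loads(SAMPLE)
averages = average_rating_by_artist(library)
print(averages)
```

For a real library, opening the XML file in binary mode and passing it to `plistlib.load` would replace the `loads` call.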
This is my first time with Python, but I got the following error:
Parsing XML... Traceback (most recent call last):
  File "iTunesStats.py", line 316, in <module>
    sys.exit(main())
  File "iTunesStats.py", line 253, in main
    lib = Library(iTunesLib)
  File "iTunesStats.py", line 154, in __init__
    self.initFromIML(iml)
  File "iTunesStats.py", line 161, in initFromIML
    lib = load(iml)
  File "iTunesStats.py", line 35, in load
    XMLFilter.parseFilePath(iml, reader, features = reader.getRecommendedFeatures())
AttributeError: class XMLFilter has no attribute 'parseFilePath'
I see the parseFilePath function in the library, so I’m not sure what’s up.
It seems I have an obsolete version of the XMLFilter library, and he’s gone and changed the interface on me. Change line 15 from
to
and it should work.
Sorry about that!
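The traceback says `parseFilePath` is missing from the `XMLFilter` class, while the commenter can see a function by that name in the library. That hints that the newer release may expose it at module level instead. Here is a minimal Python 3 sketch, assuming that reading is right, of looking a function up in either place. `find_callable` is a hypothetical helper, and the stand-in module below is fabricated for illustration; it is not the real XMLFilter library.

```python
import types

def find_callable(name, *holders):
    """Return the first callable attribute called `name` found on any holder."""
    for holder in holders:
        candidate = getattr(holder, name, None)
        if callable(candidate):
            return candidate
    raise AttributeError("%r not found on any of %r" % (name, holders))

# Stand-in for the assumed newer layout (made up for illustration only):
# a module that exposes parseFilePath next to a class that does not.
fake_module = types.ModuleType("XMLFilter")

class FakeXMLFilter(object):
    pass

def parseFilePath(file_path, handler, features=None):
    return "parsed %s" % file_path

fake_module.XMLFilter = FakeXMLFilter
fake_module.parseFilePath = parseFilePath

# Try the class first (old layout), then the module (assumed new layout).
parse = find_callable("parseFilePath", fake_module.XMLFilter, fake_module)
print(parse("library.xml", handler=None))
```

A lookup like this keeps one script working across both layouts, at the cost of hiding which version is actually installed.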