How to Create Group By Queries With Django ORM

By Vitor Freitas

I'm a passionate software developer and researcher. I write about Python, Django and Web Development on a weekly basis. Read more.

tutorial

How to Create Group By Queries With Django ORM

Dec 6, 2016
9 minutes read
comments
views

(Picture: https://unsplash.com/search/computer?photo=VHpDp_GkGgc)

This tutorial is about how to implement SQL-like group by queries using the Django ORM. It’s a fairly common operation, specially for those who are familiar with SQL. The Django ORM is actually an abstraction layer, that let us play with the database as it was object-oriented but in the end it’s just a relational database and all the operations are translated into SQL statements.

Most of the work can be done retrieving the raw data from the database, and playing with it in the Python side, grouping the data in dictionaries, iterating through it, making sums, averages and what not. But the database is a very powerful tool and do much more than simply storing the data, and often you can do the work much faster directly in the database.

Generally speaking, when you start doing group by queries, you are no longer interested in each model instances (or in a table row) details, but you want extract new information from your dataset, based on some common aspects shared between the model instances.

Let’s have a look in an example:

class Country(models.Model):
    name = models.CharField(max_length=30)

class City(models.Model):
    name = models.CharField(max_length=30)
    country = models.ForeignKey(Country)
    population = models.PositiveIntegerField()

And the raw data stored in the database:

cities
id	name	country_id	population
1	Tokyo	28	36,923,000
2	Shanghai	13	34,000,000
3	Jakarta	19	30,000,000
4	Seoul	21	25,514,000
5	Guangzhou	13	25,000,000
6	Beijing	13	24,900,000
7	Karachi	22	24,300,000
8	Shenzhen	13	23,300,000
9	Delhi	25	21,753,486
10	Mexico City	24	21,339,781
11	Lagos	9	21,000,000
12	São Paulo	1	20,935,204
13	Mumbai	25	20,748,395
14	New York City	20	20,092,883
15	Osaka	28	19,342,000
16	Wuhan	13	19,000,000
17	Chengdu	13	18,100,000
18	Dhaka	4	17,151,925
19	Chongqing	13	17,000,000
20	Tianjin	13	15,400,000
21	Kolkata	25	14,617,882
22	Tehran	11	14,595,904
23	Istanbul	2	14,377,018
24	London	26	14,031,830
25	Hangzhou	13	13,400,000
26	Los Angeles	20	13,262,220
27	Buenos Aires	8	13,074,000
28	Xi'an	13	12,900,000
29	Paris	6	12,405,426
30	Changzhou	13	12,400,000
31	Shantou	13	12,000,000
32	Rio de Janeiro	1	11,973,505
33	Manila	18	11,855,975
34	Nanjing	13	11,700,000
35	Rhine-Ruhr	16	11,470,000
36	Jinan	13	11,000,000
37	Bangalore	25	10,576,167
38	Harbin	13	10,500,000
39	Lima	7	9,886,647
40	Zhengzhou	13	9,700,000
41	Qingdao	13	9,600,000
42	Chicago	20	9,554,598
43	Nagoya	28	9,107,000
44	Chennai	25	8,917,749
45	Bangkok	15	8,305,218
46	Bogotá	27	7,878,783
47	Hyderabad	25	7,749,334
48	Shenyang	13	7,700,000
49	Wenzhou	13	7,600,000
50	Nanchang	13	7,400,000
51	Hong Kong	13	7,298,600
52	Taipei	29	7,045,488
53	Dallas–Fort Worth	20	6,954,330
54	Santiago	14	6,683,852
55	Luanda	23	6,542,944
56	Houston	20	6,490,180
57	Madrid	17	6,378,297
58	Ahmedabad	25	6,352,254
59	Toronto	5	6,055,724
60	Philadelphia	20	6,051,170
61	Washington, D.C.	20	6,033,737
62	Miami	20	5,929,819
63	Belo Horizonte	1	5,767,414
64	Atlanta	20	5,614,323
65	Singapore	12	5,535,000
66	Barcelona	17	5,445,616
67	Munich	16	5,203,738
68	Stuttgart	16	5,200,000
69	Ankara	2	5,150,072
70	Hamburg	16	5,100,000
71	Pune	25	5,049,968
72	Berlin	16	5,005,216
73	Guadalajara	24	4,796,050
74	Boston	20	4,732,161
75	Sydney	10	5,000,500
76	San Francisco	20	4,594,060
77	Surat	25	4,585,367
78	Phoenix	20	4,489,109
79	Monterrey	24	4,477,614
80	Inland Empire	20	4,441,890
81	Rome	3	4,321,244
82	Detroit	20	4,296,611
83	Milan	3	4,267,946
84	Melbourne	10	4,650,000

countries
id	name
1	Brazil
2	Turkey
3	Italy
4	Bangladesh
5	Canada
6	France
7	Peru
8	Argentina
9	Nigeria
10	Australia
11	Iran
12	Singapore
13	China
14	Chile
15	Thailand
16	Germany
17	Spain
18	Philippines
19	Indonesia
20	United States
21	South Korea
22	Pakistan
23	Angola
24	Mexico
25	India
26	United Kingdom
27	Colombia
28	Japan
29	Taiwan

This data is from Wikipedia, and I don’t know to what extent it is correct, but for our example it doesn’t really matter.

Considering the whole dataset, if we wanted to know the total of habitants in all the 84 cities, we could perhaps use an aggregate query:

from django.db.models import Sum

City.objects.aggregate(Sum('population'))
{'population__sum': 970880224}  # 970,880,224

Or the average population in the top 84 cities:

from django.db.models import Avg

City.objects.aggregate(Avg('population'))
{'population__avg': 11558097.904761905}  # 11,558,097.90

What if we now wanted to see the total population, but aggregated by the country instead? Not the whole dataset. In this case we no longer can use aggregate, instead we will be using annotate.

The aggregate clause is terminal, it returns a Python dictionary, meaning you can’t append any queryset methods. Also, it will always return a single result. So if you wanted to get the population sum by country, using aggregate, you would need to do something like this:

Don't

from django.db.models import Sum

for country in Country.objects.all():
    result = City.objects.filter(country=country).aggregate(Sum('population'))
    print '{}: {}'.format(country.name, result['population__sum'])

# Output:
# -------
# Brazil: 38676123
# Turkey: 19527090
# Italy: 8589190
# Bangladesh: 17151925
# Canada: 6055724
# France: 12405426
# Peru: 9886647
# Argentina: 13074000
# Nigeria: 21000000
# Australia: 9650500
# Iran: 14595904
# ...

While the result is correct, we needed to execute 30 different queries in the database. And we’ve lost some of the capabilities of the ORM, such as ordering this result set. Perhaps the data would be more interesting if we could order by the country with the most population for example.

Now a better way to do it is using annotate, which will be translated as a group by query in the database:

Do

City.objects.values('country__name').annotate(Sum('population'))

[
  {'country__name': u'Angola', 'population__sum': 6542944},
  {'country__name': u'Argentina', 'population__sum': 13074000},
  {'country__name': u'Australia', 'population__sum': 9650500},
  {'country__name': u'Bangladesh', 'population__sum': 17151925},
  {'country__name': u'Brazil', 'population__sum': 38676123},
  '...(remaining elements truncated)...'
]

Much better, right?

Now if we wanted to order by the country population, we can use an alias to make it look cleaner and to use in the order_by() clause:

City.objects.values('country__name') \
  .annotate(country_population=Sum('population')) \
  .order_by('-country_population')

[
  {'country__name': u'China', 'country_population': 309898600},
  {'country__name': u'United States', 'country_population': 102537091},
  {'country__name': u'India', 'country_population': 100350602},
  {'country__name': u'Japan', 'country_population': 65372000},
  {'country__name': u'Brazil', 'country_population': 38676123},
  '...(remaining elements truncated)...'
]

Here is how the last SQL query looks like:

  SELECT "core_country"."name", SUM("core_city"."population") AS "country_population"
    FROM "core_city" INNER JOIN "core_country" ON ("core_city"."country_id" = "core_country"."id")
GROUP BY "core_country"."name"
ORDER BY "country_population" DESC

Now an important thing to note here: it only makes sense adding in the values() clause, the data that can be grouped. Every field you add to the values() clause, will be used to create the group by query.

Look at this queryset:

City.objects.values('name', 'country__name').annotate(Sum('population'))

The resulting SQL query would be:

  SELECT "core_city"."name", "core_country"."name", SUM("core_city"."population") AS "population__sum"
    FROM "core_city" INNER JOIN "core_country" ON ("core_city"."country_id" = "core_country"."id")
GROUP BY "core_city"."name", "core_country"."name"

This would have no effect, because all the city names are unique, and they can’t be grouped (the database will try to group it, but each “group” will have only 1 row/instance). We can see it simply by performing a count on each queryset:

City.objects.values('name', 'country__name').annotate(Sum('population')).count()
84

City.objects.values('country__name').annotate(Sum('population')).count()
29

That’s what I meant when I said in the beginning of the post that, you are no longer interested in the details of each row. When we group by country to get the sum of the population, we lost the details of the cities (at least in the query result).

Sometimes it makes sense to have more than one value in the values() clause. For example if our database was composed by City / State / Country. Then we could group by using .values('state__name', 'country__name'). This way you would have the population by country. And you would avoid States from different countries (with the same name) to be grouped together.

The values you generate on the database, using the annotate clause, can also be used to filter data. Usually in the database we use the HAVING function, which makes it very idiomatic. You can read the query like it was plain English. Now, in the Django side, it’s a simple filter.

For example, let’s say we want to see the total population by country, but only those countries where the total population is greater than 50,000,000:

City.objects.values('country__name') \
  .annotate(country_population=Sum('population')) \
  .filter(country_population__gt=50000000) \
  .order_by('-country_population')

[
  {'country__name': u'China', 'country_population': 309898600},
  {'country__name': u'United States', 'country_population': 102537091},
  {'country__name': u'India', 'country_population': 100350602},
  {'country__name': u'Japan', 'country_population': 65372000}
]

And finally the SQL query:

  SELECT "core_country"."name", SUM("core_city"."population") AS "country_population"
    FROM "core_city" INNER JOIN "core_country" ON ("core_city"."country_id" = "core_country"."id")
GROUP BY "core_country"."name" HAVING SUM("core_city"."population") > 50000000
ORDER BY "country_population" DESC