PageView

How to Create Group By Queries With Django ORM

How to Create Group By Queries With Django ORM (Picture: https://unsplash.com/search/computer?photo=VHpDp_GkGgc)

This tutorial is about how to implement SQL-like group by queries using the Django ORM. It’s a fairly common operation, specially for those who are familiar with SQL. The Django ORM is actually an abstraction layer, that let us play with the database as it was object-oriented but in the end it’s just a relational database and all the operations are translated into SQL statements.

Most of the work can be done retrieving the raw data from the database, and playing with it in the Python side, grouping the data in dictionaries, iterating through it, making sums, averages and what not. But the database is a very powerful tool and do much more than simply storing the data, and often you can do the work much faster directly in the database.

Generally speaking, when you start doing group by queries, you are no longer interested in each model instances (or in a table row) details, but you want extract new information from your dataset, based on some common aspects shared between the model instances.

Let’s have a look in an example:

class Country(models.Model):
    name = models.CharField(max_length=30)

class City(models.Model):
    name = models.CharField(max_length=30)
    country = models.ForeignKey(Country)
    population = models.PositiveIntegerField()

And the raw data stored in the database:

cities
id name country_id population
1Tokyo2836,923,000
2Shanghai1334,000,000
3Jakarta1930,000,000
4Seoul2125,514,000
5Guangzhou1325,000,000
6Beijing1324,900,000
7Karachi2224,300,000
8Shenzhen1323,300,000
9Delhi2521,753,486
10Mexico City2421,339,781
11Lagos921,000,000
12São Paulo120,935,204
13Mumbai2520,748,395
14New York City2020,092,883
15Osaka2819,342,000
16Wuhan1319,000,000
17Chengdu1318,100,000
18Dhaka417,151,925
19Chongqing1317,000,000
20Tianjin1315,400,000
21Kolkata2514,617,882
22Tehran1114,595,904
23Istanbul214,377,018
24London2614,031,830
25Hangzhou1313,400,000
26Los Angeles2013,262,220
27Buenos Aires813,074,000
28Xi'an1312,900,000
29Paris612,405,426
30Changzhou1312,400,000
31Shantou1312,000,000
32Rio de Janeiro111,973,505
33Manila1811,855,975
34Nanjing1311,700,000
35Rhine-Ruhr1611,470,000
36Jinan1311,000,000
37Bangalore2510,576,167
38Harbin1310,500,000
39Lima79,886,647
40Zhengzhou139,700,000
41Qingdao139,600,000
42Chicago209,554,598
43Nagoya289,107,000
44Chennai258,917,749
45Bangkok158,305,218
46Bogotá277,878,783
47Hyderabad257,749,334
48Shenyang137,700,000
49Wenzhou137,600,000
50Nanchang137,400,000
51Hong Kong137,298,600
52Taipei297,045,488
53Dallas–Fort Worth206,954,330
54Santiago146,683,852
55Luanda236,542,944
56Houston206,490,180
57Madrid176,378,297
58Ahmedabad256,352,254
59Toronto56,055,724
60Philadelphia206,051,170
61Washington, D.C.206,033,737
62Miami205,929,819
63Belo Horizonte15,767,414
64Atlanta205,614,323
65Singapore125,535,000
66Barcelona175,445,616
67Munich165,203,738
68Stuttgart165,200,000
69Ankara25,150,072
70Hamburg165,100,000
71Pune255,049,968
72Berlin165,005,216
73Guadalajara244,796,050
74Boston204,732,161
75Sydney105,000,500
76San Francisco204,594,060
77Surat254,585,367
78Phoenix204,489,109
79Monterrey244,477,614
80Inland Empire204,441,890
81Rome34,321,244
82Detroit204,296,611
83Milan34,267,946
84Melbourne104,650,000
countries
id name
1Brazil
2Turkey
3Italy
4Bangladesh
5Canada
6France
7Peru
8Argentina
9Nigeria
10Australia
11Iran
12Singapore
13China
14Chile
15Thailand
16Germany
17Spain
18Philippines
19Indonesia
20United States
21South Korea
22Pakistan
23Angola
24Mexico
25India
26United Kingdom
27Colombia
28Japan
29Taiwan

This data is from Wikipedia, and I don’t know to what extent it is correct, but for our example it doesn’t really matter.

Considering the whole dataset, if we wanted to know the total of habitants in all the 84 cities, we could perhaps use an aggregate query:

from django.db.models import Sum

City.objects.aggregate(Sum('population'))
{'population__sum': 970880224}  # 970,880,224

Or the average population in the top 84 cities:

from django.db.models import Avg

City.objects.aggregate(Avg('population'))
{'population__avg': 11558097.904761905}  # 11,558,097.90

What if we now wanted to see the total population, but aggregated by the country instead? Not the whole dataset. In this case we no longer can use aggregate, instead we will be using annotate.

The aggregate clause is terminal, it returns a Python dictionary, meaning you can’t append any queryset methods. Also, it will always return a single result. So if you wanted to get the population sum by country, using aggregate, you would need to do something like this:

Don't
from django.db.models import Sum

for country in Country.objects.all():
    result = City.objects.filter(country=country).aggregate(Sum('population'))
    print '{}: {}'.format(country.name, result['population__sum'])

# Output:
# -------
# Brazil: 38676123
# Turkey: 19527090
# Italy: 8589190
# Bangladesh: 17151925
# Canada: 6055724
# France: 12405426
# Peru: 9886647
# Argentina: 13074000
# Nigeria: 21000000
# Australia: 9650500
# Iran: 14595904
# ...

While the result is correct, we needed to execute 30 different queries in the database. And we’ve lost some of the capabilities of the ORM, such as ordering this result set. Perhaps the data would be more interesting if we could order by the country with the most population for example.

Now a better way to do it is using annotate, which will be translated as a group by query in the database:

Do
City.objects.values('country__name').annotate(Sum('population'))

[
  {'country__name': u'Angola', 'population__sum': 6542944},
  {'country__name': u'Argentina', 'population__sum': 13074000},
  {'country__name': u'Australia', 'population__sum': 9650500},
  {'country__name': u'Bangladesh', 'population__sum': 17151925},
  {'country__name': u'Brazil', 'population__sum': 38676123},
  '...(remaining elements truncated)...'
]

Much better, right?

Now if we wanted to order by the country population, we can use an alias to make it look cleaner and to use in the order_by() clause:

City.objects.values('country__name') \
  .annotate(country_population=Sum('population')) \
  .order_by('-country_population')

[
  {'country__name': u'China', 'country_population': 309898600},
  {'country__name': u'United States', 'country_population': 102537091},
  {'country__name': u'India', 'country_population': 100350602},
  {'country__name': u'Japan', 'country_population': 65372000},
  {'country__name': u'Brazil', 'country_population': 38676123},
  '...(remaining elements truncated)...'
]

Here is how the last SQL query looks like:

  SELECT "core_country"."name", SUM("core_city"."population") AS "country_population"
    FROM "core_city" INNER JOIN "core_country" ON ("core_city"."country_id" = "core_country"."id")
GROUP BY "core_country"."name"
ORDER BY "country_population" DESC

Now an important thing to note here: it only makes sense adding in the values() clause, the data that can be grouped. Every field you add to the values() clause, will be used to create the group by query.

Look at this queryset:

City.objects.values('name', 'country__name').annotate(Sum('population'))

The resulting SQL query would be:

  SELECT "core_city"."name", "core_country"."name", SUM("core_city"."population") AS "population__sum"
    FROM "core_city" INNER JOIN "core_country" ON ("core_city"."country_id" = "core_country"."id")
GROUP BY "core_city"."name", "core_country"."name"

This would have no effect, because all the city names are unique, and they can’t be grouped (the database will try to group it, but each “group” will have only 1 row/instance). We can see it simply by performing a count on each queryset:

City.objects.values('name', 'country__name').annotate(Sum('population')).count()
84

City.objects.values('country__name').annotate(Sum('population')).count()
29

That’s what I meant when I said in the beginning of the post that, you are no longer interested in the details of each row. When we group by country to get the sum of the population, we lost the details of the cities (at least in the query result).

Sometimes it makes sense to have more than one value in the values() clause. For example if our database was composed by City / State / Country. Then we could group by using .values('state__name', 'country__name'). This way you would have the population by country. And you would avoid States from different countries (with the same name) to be grouped together.

The values you generate on the database, using the annotate clause, can also be used to filter data. Usually in the database we use the HAVING function, which makes it very idiomatic. You can read the query like it was plain English. Now, in the Django side, it’s a simple filter.

For example, let’s say we want to see the total population by country, but only those countries where the total population is greater than 50,000,000:

City.objects.values('country__name') \
  .annotate(country_population=Sum('population')) \
  .filter(country_population__gt=50000000) \
  .order_by('-country_population')

[
  {'country__name': u'China', 'country_population': 309898600},
  {'country__name': u'United States', 'country_population': 102537091},
  {'country__name': u'India', 'country_population': 100350602},
  {'country__name': u'Japan', 'country_population': 65372000}
]

And finally the SQL query:

  SELECT "core_country"."name", SUM("core_city"."population") AS "country_population"
    FROM "core_city" INNER JOIN "core_country" ON ("core_city"."country_id" = "core_country"."id")
GROUP BY "core_country"."name" HAVING SUM("core_city"."population") > 50000000
ORDER BY "country_population" DESC

I hope you found this small tutorial helpful! If you have any questions, please leave a comment below!



Sponsored Links