This tutorial is about how to implement SQL-like group by queries using the Django ORM. It’s a fairly common operation, specially for those who are familiar with SQL. The Django ORM is actually an abstraction layer, that let us play with the database as it was object-oriented but in the end it’s just a relational database and all the operations are translated into SQL statements.
Most of the work can be done retrieving the raw data from the database, and playing with it in the Python side, grouping the data in dictionaries, iterating through it, making sums, averages and what not. But the database is a very powerful tool and do much more than simply storing the data, and often you can do the work much faster directly in the database.
Generally speaking, when you start doing group by queries, you are no longer interested in each model instances (or in a table row) details, but you want extract new information from your dataset, based on some common aspects shared between the model instances.
Let’s have a look in an example:
And the raw data stored in the database:
cities | |||
---|---|---|---|
id | name | country_id | population |
1 | Tokyo | 28 | 36,923,000 |
2 | Shanghai | 13 | 34,000,000 |
3 | Jakarta | 19 | 30,000,000 |
4 | Seoul | 21 | 25,514,000 |
5 | Guangzhou | 13 | 25,000,000 |
6 | Beijing | 13 | 24,900,000 |
7 | Karachi | 22 | 24,300,000 |
8 | Shenzhen | 13 | 23,300,000 |
9 | Delhi | 25 | 21,753,486 |
10 | Mexico City | 24 | 21,339,781 |
11 | Lagos | 9 | 21,000,000 |
12 | São Paulo | 1 | 20,935,204 |
13 | Mumbai | 25 | 20,748,395 |
14 | New York City | 20 | 20,092,883 |
15 | Osaka | 28 | 19,342,000 |
16 | Wuhan | 13 | 19,000,000 |
17 | Chengdu | 13 | 18,100,000 |
18 | Dhaka | 4 | 17,151,925 |
19 | Chongqing | 13 | 17,000,000 |
20 | Tianjin | 13 | 15,400,000 |
21 | Kolkata | 25 | 14,617,882 |
22 | Tehran | 11 | 14,595,904 |
23 | Istanbul | 2 | 14,377,018 |
24 | London | 26 | 14,031,830 |
25 | Hangzhou | 13 | 13,400,000 |
26 | Los Angeles | 20 | 13,262,220 |
27 | Buenos Aires | 8 | 13,074,000 |
28 | Xi'an | 13 | 12,900,000 |
29 | Paris | 6 | 12,405,426 |
30 | Changzhou | 13 | 12,400,000 |
31 | Shantou | 13 | 12,000,000 |
32 | Rio de Janeiro | 1 | 11,973,505 |
33 | Manila | 18 | 11,855,975 |
34 | Nanjing | 13 | 11,700,000 |
35 | Rhine-Ruhr | 16 | 11,470,000 |
36 | Jinan | 13 | 11,000,000 |
37 | Bangalore | 25 | 10,576,167 |
38 | Harbin | 13 | 10,500,000 |
39 | Lima | 7 | 9,886,647 |
40 | Zhengzhou | 13 | 9,700,000 |
41 | Qingdao | 13 | 9,600,000 |
42 | Chicago | 20 | 9,554,598 |
43 | Nagoya | 28 | 9,107,000 |
44 | Chennai | 25 | 8,917,749 |
45 | Bangkok | 15 | 8,305,218 |
46 | Bogotá | 27 | 7,878,783 |
47 | Hyderabad | 25 | 7,749,334 |
48 | Shenyang | 13 | 7,700,000 |
49 | Wenzhou | 13 | 7,600,000 |
50 | Nanchang | 13 | 7,400,000 |
51 | Hong Kong | 13 | 7,298,600 |
52 | Taipei | 29 | 7,045,488 |
53 | Dallas–Fort Worth | 20 | 6,954,330 |
54 | Santiago | 14 | 6,683,852 |
55 | Luanda | 23 | 6,542,944 |
56 | Houston | 20 | 6,490,180 |
57 | Madrid | 17 | 6,378,297 |
58 | Ahmedabad | 25 | 6,352,254 |
59 | Toronto | 5 | 6,055,724 |
60 | Philadelphia | 20 | 6,051,170 |
61 | Washington, D.C. | 20 | 6,033,737 |
62 | Miami | 20 | 5,929,819 |
63 | Belo Horizonte | 1 | 5,767,414 |
64 | Atlanta | 20 | 5,614,323 |
65 | Singapore | 12 | 5,535,000 |
66 | Barcelona | 17 | 5,445,616 |
67 | Munich | 16 | 5,203,738 |
68 | Stuttgart | 16 | 5,200,000 |
69 | Ankara | 2 | 5,150,072 |
70 | Hamburg | 16 | 5,100,000 |
71 | Pune | 25 | 5,049,968 |
72 | Berlin | 16 | 5,005,216 |
73 | Guadalajara | 24 | 4,796,050 |
74 | Boston | 20 | 4,732,161 |
75 | Sydney | 10 | 5,000,500 |
76 | San Francisco | 20 | 4,594,060 |
77 | Surat | 25 | 4,585,367 |
78 | Phoenix | 20 | 4,489,109 |
79 | Monterrey | 24 | 4,477,614 |
80 | Inland Empire | 20 | 4,441,890 |
81 | Rome | 3 | 4,321,244 |
82 | Detroit | 20 | 4,296,611 |
83 | Milan | 3 | 4,267,946 |
84 | Melbourne | 10 | 4,650,000 |
countries | |
---|---|
id | name |
1 | Brazil |
2 | Turkey |
3 | Italy |
4 | Bangladesh |
5 | Canada |
6 | France |
7 | Peru |
8 | Argentina |
9 | Nigeria |
10 | Australia |
11 | Iran |
12 | Singapore |
13 | China |
14 | Chile |
15 | Thailand |
16 | Germany |
17 | Spain |
18 | Philippines |
19 | Indonesia |
20 | United States |
21 | South Korea |
22 | Pakistan |
23 | Angola |
24 | Mexico |
25 | India |
26 | United Kingdom |
27 | Colombia |
28 | Japan |
29 | Taiwan |
This data is from Wikipedia, and I don’t know to what extent it is correct, but for our example it doesn’t really matter.
Considering the whole dataset, if we wanted to know the total of habitants in all the 84 cities, we could perhaps use
an aggregate
query:
Or the average population in the top 84 cities:
What if we now wanted to see the total population, but aggregated by the country instead? Not the whole dataset. In
this case we no longer can use aggregate
, instead we will be using annotate
.
The aggregate
clause is terminal, it returns a Python dictionary, meaning you can’t append any queryset methods.
Also, it will always return a single result. So if you wanted to get the population sum by country, using aggregate
,
you would need to do something like this:
While the result is correct, we needed to execute 30 different queries in the database. And we’ve lost some of the capabilities of the ORM, such as ordering this result set. Perhaps the data would be more interesting if we could order by the country with the most population for example.
Now a better way to do it is using annotate
, which will be translated as a group by query in the database:
Much better, right?
Now if we wanted to order by the country population, we can use an alias to make it look cleaner and to use in the
order_by()
clause:
Here is how the last SQL query looks like:
Now an important thing to note here: it only makes sense adding in the values()
clause, the data that can be grouped.
Every field you add to the values()
clause, will be used to create the group by query.
Look at this queryset:
The resulting SQL query would be:
This would have no effect, because all the city names are unique, and they can’t be grouped (the database will try to group it, but each “group” will have only 1 row/instance). We can see it simply by performing a count on each queryset:
That’s what I meant when I said in the beginning of the post that, you are no longer interested in the details of each row. When we group by country to get the sum of the population, we lost the details of the cities (at least in the query result).
Sometimes it makes sense to have more than one value in the values()
clause. For example if our database was composed
by City / State / Country. Then we could group by using .values('state__name', 'country__name')
. This way you would
have the population by country. And you would avoid States from different countries (with the same name) to be grouped
together.
The values you generate on the database, using the annotate
clause, can also be used to filter data. Usually in the
database we use the HAVING
function, which makes it very idiomatic. You can read the query like it was plain English.
Now, in the Django side, it’s a simple filter
.
For example, let’s say we want to see the total population by country, but only those countries where the total population is greater than 50,000,000:
And finally the SQL query:
I hope you found this small tutorial helpful! If you have any questions, please leave a comment below!