Change database structure to be faster, better support multiple repos.
The current sqlite database used by F-Droid to store metadata about apps has a few pitfalls:
It is slower than it could be
One of the main joins we want to do is joining the fdroid_app
table to the fdroid_apk
table. The foreign key on the fdroid_apk
table is a TEXT
field which is the package name of the app in question. These are particularly poor for indexing reasons, because SQL databases usually index TEXT
fields by only taking the first n
characters of a column to build an index (where n
is customizable, but usually on the order of 3-5 characters).
Consider how many packages start with, e.g:
com.google.code.*
com.github.*
net.sourceforge.*
- etc...
These all index poorly, because the unique portion of the string is always on the right hand side (i.e. the least likely to be indexed).
This join happens in almost every single query that we are interested in, and particularly the queries which return large lists of apps.
Solution to slowness
The solution to this particular problem is to change the foreign key on the fdroid_apk
table to be an INTEGER
id which points to the fdroid_app.rowid
. The primary key of fdroid_app
should no longer be fdroid_app.id
(i.e. the Android apps package name), but rather fdroid_app.rowid
. In the process, to avoid confusion, fdroid_app.id
should be renamed fdroid_app.packageName
(see issue #37 (closed)).
Note that this can be done in the absence of fixing the next problem described below. However they are both mentioned in this one issue because solving the second issue about multiple repos may result in database changes which would undo changes done to solve this problem.
Doesn't support multiple repos providing the same app very well
Right now, the name/summary/description of an app is dictated by the last repository that happened to update which included that app. As a result, the data from one repo is forever being stomped on by other repos metadata, and vice-verca. We probably need some way to keep track of all the metadata provided by each repo, and when fetching apps from the fdroid_app
, select the name/summary/description from the one coming from the repo with the highest priority.
Potential (but incomplete) solution 1)
I've had a bit of a think of this, and I'm struggling with how exactly to implement it. The fully-denormalized-theoretically-best-option would be to have the following:
-
fdroid_package
id: int
packageName: string
-
fdroid_repo
id: int
name: string
cert: string
url: string
priority: int
- ...
-
fdroid_app
id: int
packageId: int
repoId: int
name: string
summary: string
description: string
- ...
-
fdroid_apk
id: int
appId: int
hash: string
url: string
- ...
Each repo is free to provide its own metadata about a specific fdroid_app
without overriding the metadata provided by another app. Then, switching the priorities of two repositories will mean the fdroid_app
table does not need to be modified, and the repo indexes don't need to be redownloaded in order to get the appropriate summary/description/etc to show in the client.
However, I am unsure how to query this to get the list of apps such that only the repo with the highest priority is joined onto. It would be something along the lines of:
SELECT ...
FROM
fdroid_package AS package
-- This join will pull back multiple apps, but we only want one
JOIN fdroid_app AS app ON (app.packageId = package.id)
WHERE
-- Restrict to only one app, as defined by the highest priority repo
-- that happens to provide an app with this package name.
app.repoId = (
SELECT innerRepo.id
FROM fdroid_repo AS innerRepo
WHERE innerRepo.priority = (
SELECT MAX(innerInnerRepo.priority)
FROM fdroid_repo as innerInnerRepo
WHERE app.repoId = innerInnerRepo.id
)
)
GROUP BY
package.id
That query is wrong for so many reasons. Apart from the performance implications of sub-sub queries, I'm not even sure if sub-sub queries are possible. Also, I don't think the join condition in the inner-inner query is correct.
If somebody else has a good solution to storing multiple copies of an app with different metadata, but only returning the one belonging to the repo with the highest priority, I'd be appreciative.
Potential (and more realistic) solution 2)
Alternatively, we can keep the fdroid_app
table as the main thing and force an index update when repo priorities change. This means that repos will still stomp on the metadata in the global fdroid_app
table, but we only do so if the repo being updated is a higher priority than the repo which provided the details currently in the table. This fix requires augmenting the fdroid_app
table to provide a metadataFromRepoId
or something. This will tell F-Droid which repo provided the metadata so we know whether or not it should be overriden during the update of a particular repo. Repos with a higher priority than metadataFromRepoId
will overwrite the metadata and then update metadataFromRepoId
to their own id, lower priority repos will leave the metadata.
For completness, there is also the situation whereby a higher priority repo may only provide a limited subset of all metadata, and a lower priority repo provides further details. In this case, I don't know what to do, because there is only really room for one metadataFromRepoId
, and thus storing metadata from two different apps will cause sadness here. Might have to stick with only metadata from the higher priority repo, even if that means showing less metadata to the user. In fact, this would be equally as bad with potential solution 1) above. How would you perform multiple joins onto different copies of an apk from different priority repos, such that the description
is taken from: The highest priority repo if that has a description
, or else the repo with the next highest priority, etc, etc.
"Full Text Search" table
This doesn't materially affect the database design in this issue, but it is worth mentioning that #336 will require an additional table in order to implement full text search. However that should only be an addition table, which joins onto fdroid_app
and nothing more.