In June of 2018, BerlinOnline, the operators of Berlin's open data portal, published the download statistics for the portal's datasets. Given that we are supporting the further development of Berlin's data offerings through our ODIS initiative, we analyzed these statistics to better understand usage of the portal. In this post, we take a look at how the data portal has developed since 2013 and highlight some of the trends in its usage.
Downloads
The line chart on the left shows the downloads for datasets in Berlin's open data portal since April 2013.
The most important insights
1. The overall development of the portal is positive
As the volume of datasets offered in the portal has grown, so too have the visitor numbers. Thus, to little surprise, we can observe that a more expansive offering leads to increased use of the portal. This positive relationship shows that there is indeed a demand for open government data in Berlin and that citizens are acting on this demand.
2. Most of the downloads cover just a handful of datasets
Although a few datasets had several thousand hits, the majority of datasets in the data portal are rarely accessed by site visitors. It's important to remember, however, that the number of hits a given dataset has is not a perfect indicator of success – a single download of a dataset can be what leads to a useful and/or interesting application of the data. Regardless, the distribution of downloads should call attention to questions of data quality and relevance, since it could also be the case that many of these little-accessed datasets are simply unusable in their current forms.
3. The level of interest in a given dataset is highly dependent on outside factors
Unusually high download numbers are almost always tied to outside factors. These factors could be that some online source has linked to the dataset, that a data journalism project has made use of the data, or that the data are seasonally relevant (e.g., results from a recent election, or increased interest in swimming locations at the start of summer). Data with direct relevance for citizens' everyday lives will generally see higher visitor counts. This also means that if data publishers want to see higher numbers of downloads for their datasets, they should focus on publishing the datasets that are most relevant to citizens.
Before we dive into the analysis, there is one important consideration to keep in mind: these figures only represent downloads occurring via Berlin's Open Data Portal. If users download datasets directly via other portals offered by the state of Berlin (such as the FIS Broker for geospatial data, the GSI portal for health data, or the Berlin-Brandenburg Office for Statistics), these downloads will not appear in our statistics. Nevertheless, the following figures offer an interesting and valuable look into usage of the data portal. If you're interested in learning more about the current status of Open Data in Berlin, we have released a report on Open Data in the Berlin City Administration (available only in German).
Looking at the distribution of downloads across individual datasets makes it clear that a small number of datasets is responsible for the majority of downloads.
Total number of downloads for a single dataset
Average monthly downloads (Mean)
Median monthly downloads
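To make the gap between mean and median concrete, here is a minimal Python sketch of how such summary figures could be computed. The download counts are made-up illustrative values, not the portal's actual statistics, and the helper name `top_share` is hypothetical.

```python
from statistics import mean, median


def top_share(totals, k):
    """Return the fraction of all downloads that go to the k most-downloaded datasets."""
    ranked = sorted(totals, reverse=True)
    total = sum(ranked)
    if total == 0:
        return 0.0
    return sum(ranked[:k]) / total


# Hypothetical total downloads per dataset (illustrative values only).
totals = [5000, 3000, 40, 25, 10, 10, 5, 5, 3, 2]

share = top_share(totals, 2)   # share of downloads held by the two largest datasets
avg = mean(totals)             # pulled upwards by the few very popular datasets
med = median(totals)           # closer to what a typical dataset sees

print(f"top 2 share: {share:.1%}, mean: {avg}, median: {med}")
```

When a few datasets dominate, the mean sits far above the median, which is why it helps to look at both figures side by side.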
The majority of newly added datasets also see very little activity – only a few datasets have consistently high download numbers.
The first 24 months after the publishing of a dataset
Average monthly download numbers
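One way to compare datasets that were published at different times is to align them by age, that is, by the number of months since publication. The following Python sketch shows the idea under stated assumptions: the input structures, the sample values, and the function names are hypothetical, and every dataset is assumed to have been observable for the full 24 months.

```python
from collections import defaultdict


def months_between(start, end):
    """Whole months from start to end, both given as (year, month) tuples."""
    return (end[0] - start[0]) * 12 + (end[1] - start[1])


def mean_downloads_by_age(published, monthly_downloads, max_age=24):
    """Average downloads per dataset for each month of age since publication.

    Months in which a dataset has no recorded downloads count as zero.
    """
    totals = defaultdict(int)
    for (dataset_id, month), count in monthly_downloads.items():
        age = months_between(published[dataset_id], month)
        if 0 <= age < max_age:
            totals[age] += count
    n = len(published)
    return [totals[age] / n for age in range(max_age)]


# Hypothetical publication months and monthly download counts.
published = {"a": (2017, 1), "b": (2017, 3)}
monthly_downloads = {
    ("a", (2017, 1)): 10,
    ("a", (2017, 2)): 4,
    ("b", (2017, 3)): 6,
    ("b", (2018, 3)): 2,
}

by_age = mean_downloads_by_age(published, monthly_downloads)
print(by_age[:3])
```

Datasets younger than 24 months would need to be excluded or weighted differently for the later age buckets; the sketch leaves that out for brevity.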
The number of available datasets has grown continuously over the last five years. One trend noticeable in the first graph below is the series of periodic, concentrated spikes in the number of new datasets being added. These spikes represent activities like batch uploads and harvesting efforts from other data portals.
Demand for data has also risen steadily. The following figure shows what percentage of the total number of datasets in the portal were accessed in a given month. At the beginning, 15% of all datasets saw some form of activity; today, the number is more than 30% (this trend is illustrated by the black line). Datasets that had been added within the last four months of a given timeframe receive a bit more attention (illustrated by the grey line). An interesting pattern can be observed in this chart: every four months there is a spike in datasets being accessed, suggesting some sort of automated activity. We haven't yet been able to find a clear explanation for this trend. The affected datasets are largely those originating from the GSI portal.
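The percentage described here can be sketched in a few lines of Python: for each month, count the distinct datasets with at least one download and divide by the number of datasets available in that month. The log format, the catalog sizes, and the function name below are assumptions made for illustration, not the format of the published statistics.

```python
from collections import defaultdict


def monthly_access_share(download_log, catalog_size_by_month):
    """Map each month to the fraction of available datasets with at least one download."""
    accessed = defaultdict(set)
    for month, dataset_id in download_log:
        accessed[month].add(dataset_id)
    return {
        month: len(accessed.get(month, set())) / size
        for month, size in catalog_size_by_month.items()
        if size > 0
    }


# Hypothetical download log: one (month, dataset id) entry per download.
download_log = [
    ("2013-04", "a"),
    ("2013-04", "a"),  # repeated downloads of the same dataset count once
    ("2013-04", "b"),
    ("2013-05", "c"),
]
# Hypothetical number of datasets available in the portal in each month.
catalog_size_by_month = {"2013-04": 10, "2013-05": 20}

shares = monthly_access_share(download_log, catalog_size_by_month)
print(shares)
```

Restricting the log and the catalog sizes to datasets added within the previous four months would give the corresponding figure for recently added datasets.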
A clear pattern of reduced portal activity in winter can be observed; the trends in summer are less well-defined, however.
In this section, we highlight some examples of datasets that have seen unusually high activity in comparison with the rest of the portal's offerings.
Downloads according to category, data source, and license
The most popular type of data accessed via the portal is spatial data, followed by demographic data. The Berlin-Brandenburg Office for Statistics remains the most important publisher of open data in Berlin. As for licenses, the internationally respected Creative Commons License is by far the most used.
Download numbers by category
Download numbers by data publisher
Download numbers by license type
To close, here is a list of the most popular datasets. Oh, and in case you were wondering: the most popular names for babies born in Berlin last year were "Emilia" and "Ben". If you'd like to see the raw data yourself, click here.
Sebastian Meier is a data scientist at the Technologiestiftung Berlin. He graduated in Communication, Interface Design and completed his PhD in Geoinformatics at Potsdam University. His research focus lies on spatial data analytics and visualisation as well as human-centred perspectives on software interfaces.