Naked Science Forum

Non Life Sciences => Geek Speak => Topic started by: chris on 19/01/2020 16:23:19

Title: What's the best database structure for storing results over time?
Post by: chris on 19/01/2020 16:23:19
Dear all

I have written a function to analyse the log data for our podcast and webpage downloads.

I want to store this in a database so that we can look up the total amount of traffic for a programme episode, or a webpage, over time but also log a day-by-day record of consumption.

What is the best database structure for this? There are thousands of webpages and podcasts. One easy way would be to enter a row in a table for each episode each day and record the total traffic for that day; but then extracting data for that episode would be very processor-hungry, wouldn't it? I'd need to pull all the entries for that URL and then sum the activity.

Is there a better way to do this?
Title: Re: What's the best database structure for storing results over time?
Post by: PKorda on 29/01/2020 07:12:16
Why not use Google Analytics? Or a free self-hosted solution.
Title: Re: What's the best database structure for storing results over time?
Post by: Hayseed on 31/01/2020 05:47:14
I have never run a website, but wouldn't the server logs have all of the time stamped data that you would need?

You just want numbers per time don't you?

One could plot all stamps to each page of the whole site.  Stamps per page per time.

Edit: one could have another routine that looks up the IP addresses, for geographic data... assuming not too many are using VPNs.
Title: Re: What's the best database structure for storing results over time?
Post by: nicephotog on 15/03/2020 02:37:32
You may have forgotten two things: tagging and binary encoding.
With tagging you can abbreviate the data. Rather than storing "UTC time" or "UTC date-time" as free text, it can be expressed as a single number of a fixed length (the size of a primitive data type), and the field in the database table can be sized to hold exactly that, leaving each piece extractable. (A UTC date-time can be written as a fixed-length number or "human readable" string in the UTC format, so it would fit a small text DB data type.) (Note: it is possible to obtain the exact current UTC time over the net, but you should acquire it only once; the usual approach is to set the server's clock from that UTC call, then have each CGI program instance read the clock and convert it to UTC.)

In short, just as a processor indexes memory locations rather than copying the data around, you do not need the human-readable or machine-readable info in every row; you only need to map the information in each column to a small number, say between 0 and 255 inclusive.
e.g. in the "media_type" column, you can set its fields to a "binary int" with a "not null" rule, where 1 = page, 2 = audio, 3 = pic, 4 = video
AND e.g.
protocol: 1 = http, 2 = https, 3 = ssl
AND, more complex,
a "not null" "int" in a "url_name" column simply lists a unique number for each specific URL on the site.
The numbers returned from the database in a query are simply translated back by the CGI program written for them.
That technique stores much less repetitive data.
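The integer-encoding idea above can be sketched like this, assuming SQLite via Python's sqlite3 module; the table name, column names, and numeric codes here are illustrative choices, not a prescribed schema.

```python
import sqlite3

# Hypothetical lookup tables mapping human-readable values to small integers.
MEDIA_TYPES = {"page": 1, "audio": 2, "pic": 3, "video": 4}
PROTOCOLS = {"http": 1, "https": 2, "ssl": 3}

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE hits (
        utc_time   INTEGER NOT NULL,  -- seconds since the Unix epoch (UTC)
        media_type INTEGER NOT NULL,  -- see MEDIA_TYPES
        protocol   INTEGER NOT NULL,  -- see PROTOCOLS
        url_id     INTEGER NOT NULL   -- unique number per URL on the site
    )
""")

# Encoding a log entry: store only the small integers, never the strings.
conn.execute("INSERT INTO hits VALUES (?, ?, ?, ?)",
             (1579450999, MEDIA_TYPES["audio"], PROTOCOLS["https"], 42))

# Decoding on the way out: the program translates the numbers back.
decode_media = {v: k for k, v in MEDIA_TYPES.items()}
row = conn.execute("SELECT media_type, url_id FROM hits").fetchone()
print(decode_media[row[0]], row[1])  # audio 42
```

Each row then carries a handful of integers instead of repeating the same URL and type strings thousands of times a day.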
Title: Re: What's the best database structure for storing results over time?
Post by: chris on 18/02/2021 00:20:45
I'm circling back to describe how I completed this project, just to close the loop. It took a while because Covid-19 hijacked my life for 12 months, but, craving distraction, I returned to the challenge a few weeks back and finished it.

The problem was that I am logging consumption of audio files from the site on a daily basis. This means that I need to log thousands of different filenames every day, and the challenge was how to do this in a sane way that was easy to interrogate and efficient to store.

The solution was to write a filter to extract the relevant entries from the logs each day, de-duplicate them (to avoid double-counting), and file the result in a database.
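As a rough illustration of that filter step (a minimal sketch: the log format is assumed to be Apache combined format, and de-duplicating on the (IP, filename) pair is my assumption about what counts as a duplicate):

```python
from collections import Counter

# Hypothetical daily filter: count each (client IP, filename) pair once.
def daily_counts(log_lines):
    seen = set()          # (IP, filename) pairs already counted today
    counts = Counter()    # downloads per filename
    for line in log_lines:
        fields = line.split()
        ip, filename = fields[0], fields[6]   # request path at position 6
        if not filename.endswith(".mp3"):
            continue                          # keep only audio downloads
        if (ip, filename) in seen:
            continue                          # same client, same file: skip
        seen.add((ip, filename))
        counts[filename] += 1
    return counts

logs = [
    '1.2.3.4 - - [18/Feb/2021:00:00:01 +0000] "GET /ep1.mp3 HTTP/1.1" 200 100',
    '1.2.3.4 - - [18/Feb/2021:00:05:01 +0000] "GET /ep1.mp3 HTTP/1.1" 200 100',
    '5.6.7.8 - - [18/Feb/2021:00:06:01 +0000] "GET /ep1.mp3 HTTP/1.1" 200 100',
]
print(daily_counts(logs))  # Counter({'/ep1.mp3': 2})
```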

I created a pair of tables, one for filenames of the downloaded files and the second for dates and downloads.

Basically, I insert each unique filename into the filenames table, where it is assigned a unique ID number. Then, for each day, I insert a row with the date, the ID number of the file downloaded on that date, and how many copies of that file were downloaded that day.

It means I can easily sum total downloads for a file, or daily totals for files etc.
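For anyone following along, that two-table layout and the summing queries can be sketched in SQLite (via Python's sqlite3); the table and column names here are my own guesses, not the actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE files (
        id       INTEGER PRIMARY KEY,   -- unique ID assigned per filename
        filename TEXT UNIQUE NOT NULL
    );
    CREATE TABLE daily_downloads (
        day     TEXT NOT NULL,          -- e.g. '2021-02-18'
        file_id INTEGER NOT NULL REFERENCES files(id),
        count   INTEGER NOT NULL,       -- downloads of that file on that day
        PRIMARY KEY (day, file_id)
    );
""")

def record(day, filename, count):
    # Insert the filename once; look up its ID on every later day.
    conn.execute("INSERT OR IGNORE INTO files (filename) VALUES (?)",
                 (filename,))
    file_id = conn.execute("SELECT id FROM files WHERE filename = ?",
                           (filename,)).fetchone()[0]
    conn.execute("INSERT INTO daily_downloads VALUES (?, ?, ?)",
                 (day, file_id, count))

record("2021-02-17", "ep1.mp3", 120)
record("2021-02-18", "ep1.mp3", 95)
record("2021-02-18", "ep2.mp3", 40)

# Total downloads for one file, ever:
total = conn.execute("""
    SELECT SUM(d.count) FROM daily_downloads d
    JOIN files f ON f.id = d.file_id
    WHERE f.filename = 'ep1.mp3'
""").fetchone()[0]
print(total)  # 215

# Daily totals across all files:
for day, n in conn.execute(
        "SELECT day, SUM(count) FROM daily_downloads GROUP BY day"):
    print(day, n)
```

Because each day's count is pre-aggregated into one row per file, the "sum over time" query touches one row per day rather than one row per raw log entry, which addresses the processor-hungry concern from the original question.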

Thanks to those here who made suggestions for how to do this.