Scoring Method: linear combination

Victor Tapissier, Data Scientist, Atayen

1. Introduction 

Today all our actions are recorded, measured, and evaluated. The data from our social networks is analyzed by algorithms to offer us targeted content. In China, individuals are graded according to a social credit system. The evaluation of the quality of an individual or an object can be done via scoring methods, using different metrics.

In this article, we will use data from the YouTube social network to present the different stages of a score calculation. What steps are needed to prepare the data? Are there any special transformations? How does one calculate a score and make it relevant?

2. Data Presentation 

The data used here is similar to YouTube data. It is possible to retrieve it via an API to have exact and complete information. Here I have made up some observations. They nevertheless remain linked to what could actually be used for a scoring method on YouTube.

             Subscribers   Total views   Total likes   Videos   Reach
Channel 1        120 000       350 000        35 500       27     120
Channel 2          8 500        12 000           700       44      27
Channel 3             50           170            10       11      15
Channel 4        880 000     7 000 000        67 000       71      80
Channel 5          2 500        12 000         1 200        5      75
Figure 1 – Raw data table 

We therefore have 5 KPIs (Key Performance Indicators) that will be used to calculate the score. These are fairly basic metrics, but they are still indicative of a creator’s position on YouTube.

3. Normalization

It is often necessary to pre-process (imputation, management of anomalies, duplicate variables, etc.) the data before it can be analyzed or modeled. We won’t do that here for the sake of simplicity. In addition to this pre-processing, transforming the raw data can be relevant, in particular for calculating new metrics. For example, here we could calculate the average number of views per video or the level of engagement (reactions to a video / number of views).
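As an illustration of such derived metrics, here is a minimal Python sketch that computes the average number of views per video and an engagement rate for two of the channels in Figure 1. It assumes that likes are the only "reactions" available, which is a simplification; the function and key names are hypothetical.

```python
# Raw figures for two channels, copied from Figure 1.
channels = {
    "Channel 1": {"views": 350_000, "likes": 35_500, "videos": 27},
    "Channel 2": {"views": 12_000, "likes": 700, "videos": 44},
}


def derived_metrics(row):
    """Return average views per video and a likes-based engagement rate."""
    return {
        "views_per_video": row["views"] / row["videos"],
        # Assumption: likes stand in for all reactions to a video.
        "engagement": row["likes"] / row["views"],
    }


derived = {name: derived_metrics(row) for name, row in channels.items()}
for name, metrics in derived.items():
    print(name, round(metrics["views_per_video"], 1), round(metrics["engagement"], 3))
```

Metrics of this kind could be added as extra columns and then normalized like any other variable.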

Normalization is also a form of pre-processing. Here, the variables are not all on the same scale, and if we want to be able to control the impact of each on the score, we have to normalize them. The choice of normalization is essential and must be adapted to the variables used. The same normalization must be applied to all the metrics used. Here I have chosen to use the following formula:

$$x'_i = \frac{x_i}{\max_j x_j}$$

where $x_i$ is the value of a variable for channel $i$ and the maximum is taken over all channels.

Our variables only take positive values and have a positive impact on the score (i.e., the higher the value, the better the score should be). The transformation sets the maximum of each variable to 1, and all other values fall between 0 and 1. The values are simply rescaled by a constant factor, so the ratios between them are preserved, and all the variables end up on the same scale.
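The transformation can be sketched in a few lines of Python. This sketch assumes the formula divides each value by the maximum of its column, which is what the description above and the normalized values suggest; the function name `max_normalize` is hypothetical.

```python
def max_normalize(values):
    """Divide every value by the column maximum so the largest becomes 1."""
    top = max(values)
    if top <= 0:
        # Assumption from the text: all variables are strictly positive.
        raise ValueError("max normalization assumes positive values")
    return [v / top for v in values]


# Subscriber counts of the five channels, taken from Figure 1.
subscribers = [120_000, 8_500, 50, 880_000, 2_500]
normalized = max_normalize(subscribers)
print([round(v, 4) for v in normalized])
```

Because every value is divided by the same constant, the ordering of the channels and the ratios between them are unchanged.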

Here is the data after normalization:

             Subscribers   Total views   Total likes   Videos   Reach
Channel 1          0.136          0.05          0.53     0.38       1
Channel 2         0.0096        0.0017          0.01     0.62   0.225
Channel 3        0.00006       0.00002       0.00015     0.15   0.125
Channel 4              1             1             1        1    0.66
Channel 5         0.0028        0.0017         0.018     0.07   0.625
Figure 2 – Table of normalized data 

As you can see, every variable now takes values between 0 and 1. However, because the raw data mixes very large and very small values, most normalized values are crowded near 0 while a few sit near 1. This distribution problem is discussed in the conclusion.

4. Scoring Calculation 

Once the data has been prepared and transformed, we can finally calculate the score. There are many scoring methods; the one used here is linear combination. It has the advantage of being simple and easily interpretable. We must choose a set of weights (ω1, …, ωn) and calculate the weighted sum of our variables (x1, …, xn):

$$\text{Score} = 100 \times \sum_{i=1}^{n} \omega_i x_i$$

The choice of weights is essential with this method, because it determines the impact of each variable: an increase of 1 in the variable xi raises the weighted sum by ωi. The sum is multiplied by 100 so that, with normalized variables and weights that add up to 1 (as the weights chosen below do), the scores fall between 0 and 100.
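To make the role of the weights concrete, here is a minimal sketch of a linear-combination score in Python. The function name `linear_score`, the check that the weights sum to 1, and the two-variable example are assumptions made for illustration; they are not taken from the article's data.

```python
def linear_score(values, weights):
    """Weighted sum of normalized values, scaled to the 0-100 range."""
    if abs(sum(weights) - 1.0) > 1e-9:
        # Assumption: weights sum to 1 so that the score cannot exceed 100.
        raise ValueError("weights are expected to sum to 1")
    return 100 * sum(w * x for w, x in zip(weights, values))


# Hypothetical two-variable example.
weights = [0.7, 0.3]
base = linear_score([0.5, 0.5], weights)
bumped = linear_score([0.6, 0.5], weights)  # first variable raised by 0.1

# The score moves by 100 * weight * change in the variable.
print(round(bumped - base, 6))
```

Raising the first variable by 0.1 moves the score by 100 × 0.7 × 0.1 = 7 points, which is why a heavily weighted variable dominates the final ranking.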

Let’s see what happens with our example. Let’s say that total views and likes are the most important variables, followed by the number of subscribers and the reach, and finally the number of videos. This gives the following set of weights:

ω = (subscribers: 0.15, views: 0.3, likes: 0.3, reach: 0.15, videos: 0.1)

And here is the table with the calculated scores: 

             Subscribers         Views         Likes   Videos   Reach   Score
Channel 1          0.136          0.05          0.53     0.38       1    38.2
Channel 2         0.0096        0.0017          0.01     0.62   0.225      10
Channel 3        0.00006       0.00002       0.00015     0.15   0.125     3.4
Channel 4              1             1             1        1    0.66    94.9
Channel 5         0.0028        0.0017         0.018     0.07   0.625    10.7
Figure 3 – Data table with scores  
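The whole calculation, from raw data to scores, can be sketched in Python as follows. This sketch assumes normalization by the column maximum and a score equal to 100 times the weighted sum. The names `RAW`, `WEIGHTS`, and `score_channels` are hypothetical, and the video count of 5 for Channel 5 is an inferred value (0.07 × 71 ≈ 5), since it is missing from the raw data as published.

```python
# Raw data from Figure 1, in the order: subscribers, views, likes, videos, reach.
# Assumption: Channel 5's video count (5) is inferred from its normalized value.
RAW = {
    "Channel 1": (120_000, 350_000, 35_500, 27, 120),
    "Channel 2": (8_500, 12_000, 700, 44, 27),
    "Channel 3": (50, 170, 10, 11, 15),
    "Channel 4": (880_000, 7_000_000, 67_000, 71, 80),
    "Channel 5": (2_500, 12_000, 1_200, 5, 75),
}
# Weights in the same column order as RAW.
WEIGHTS = (0.15, 0.30, 0.30, 0.10, 0.15)


def score_channels(raw, weights):
    """Max-normalize each column, then return 100 * weighted sum per channel."""
    maxima = [max(column) for column in zip(*raw.values())]
    scores = {}
    for name, row in raw.items():
        normalized = [value / top for value, top in zip(row, maxima)]
        scores[name] = 100 * sum(w * x for w, x in zip(weights, normalized))
    return scores


scores = score_channels(RAW, WEIGHTS)
for name, value in scores.items():
    print(f"{name}: {value:.1f}")
```

The results agree with the table to within rounding. Channel 4, for instance, comes out at 95.0 rather than 94.9, because the table uses 0.66 for its reach instead of 0.667.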

We therefore have five scores between 0 and 100. They are, however, very unevenly spread: one channel scores close to 95, while three score around 10 or less. This point is discussed in the conclusion.

5. Conclusion 

We have seen the different steps to assign a score to creators based on part of their data which can be retrieved via APIs. All you have to do is make a request to the platform by describing how the data will be used. 

This data is assumed to be reliable, but an exploration step is always necessary when retrieving data. This makes it possible to better understand the data and to determine how the data should be treated. 

The variables are often on different scales. We can have ratios, averages of other variables, aggregates… They must therefore be put on the same scale. This is normalization. The chosen formula must depend on the type of variables (positive / negative values, positive / negative impact…).

Finally, we can calculate the score. Here it was done by linear combination. It is a simple method, but it requires a good knowledge of the data to assign weights correctly. The weights can then be adjusted according to the relevance of the score. Here the scores were quite scattered, because the values of the metrics were too. One way to distribute the scores better would have been to segment the data according to a variable (for example, the number of subscribers) before normalization. We would then have had several groups, each normalized separately. The scores could thus have been calculated within each group and would have been more homogeneous.
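The segmentation idea can be sketched in Python as follows. The 10 000-subscriber threshold, the group labels, and the function names are hypothetical choices made for illustration. The sketch again assumes max normalization, and the video count of 5 for Channel 5 is an inferred value.

```python
from collections import defaultdict

# Raw data in the order: subscribers, views, likes, videos, reach.
# Assumption: Channel 5's video count (5) is inferred, not given in the raw data.
RAW = {
    "Channel 1": (120_000, 350_000, 35_500, 27, 120),
    "Channel 2": (8_500, 12_000, 700, 44, 27),
    "Channel 3": (50, 170, 10, 11, 15),
    "Channel 4": (880_000, 7_000_000, 67_000, 71, 80),
    "Channel 5": (2_500, 12_000, 1_200, 5, 75),
}
WEIGHTS = (0.15, 0.30, 0.30, 0.10, 0.15)
THRESHOLD = 10_000  # hypothetical subscriber cut-off between the two segments


def segment(row):
    """Assign a channel to a segment based on its subscriber count."""
    return "large" if row[0] >= THRESHOLD else "small"


def segmented_scores(raw, weights):
    """Normalize and score each segment separately."""
    groups = defaultdict(dict)
    for name, row in raw.items():
        groups[segment(row)][name] = row
    scores = {}
    for members in groups.values():
        # Column maxima are computed inside the segment only.
        maxima = [max(column) for column in zip(*members.values())]
        for name, row in members.items():
            scores[name] = 100 * sum(
                w * value / top for w, value, top in zip(weights, row, maxima)
            )
    return scores


seg_scores = segmented_scores(RAW, WEIGHTS)
for name, value in seg_scores.items():
    print(f"{name}: {value:.1f}")
```

With this split, the smaller channels are compared only with each other, so their scores are no longer compressed toward 0. The trade-off is that a score is then meaningful only within its own segment.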
