TikTok Project - Machine Learning

Classifying videos using machine learning

Overview:

The TikTok data team aims to create a machine-learning model to distinguish between videos labeled as claims and opinions. Through prior analysis of the available data, they discovered that the engagement levels of videos strongly correlate with their claim status. As a result, the team is optimistic that the developed model will meet all performance criteria successfully.

ISSUE / PROBLEM:


TikTok receives a significant volume of user-reported videos for various reasons, and due to the sheer volume, not all reported videos can be manually reviewed by human moderators. To address this, TikTok aims to identify videos that make claims (as opposed to opinions), since such videos are more likely to violate the platform's terms of service. By prioritizing the detection of claim videos, TikTok can ensure a more focused and efficient review process and maintain a safer, more compliant environment on the platform.

Pace: Plan Stage

Stakeholders: Project Management Officer, Data Science Lead, President.

Purpose and goal: to identify videos that make claims so that they can be prioritized for review.

Initial observations:

  • There are very few missing values relative to the number of samples in the dataset. Therefore, observations with missing values can be dropped.

  • There are no duplicate observations in the data.

  • Approximately 50.3% of the dataset represents claims and 49.7% represents opinions, so the outcome variable is balanced (these checks are sketched below).
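
A minimal sketch of these checks, assuming the data is loaded into a pandas DataFrame and that the label column is named claim_status (the file and column names are assumptions):

```python
import pandas as pd

# Load the dataset (file name is an assumption)
data = pd.read_csv("tiktok_dataset.csv")

# Missing values are few relative to the number of rows, so drop them
print(data.isna().sum())
data = data.dropna(axis=0).reset_index(drop=True)

# Confirm there are no duplicate observations
print(data.duplicated().sum())

# Check the class balance of the outcome variable
print(data["claim_status"].value_counts(normalize=True))
```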

pAce: Analyze Stage

Letter count distributions for both claims and opinions are approximately normal with a slight right skew. Claim videos tend to contain more characters, about 13 more on average, as the earlier analysis showed.
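
A minimal sketch of that comparison, continuing from the sketch above and assuming the transcription text lives in a column named video_transcription_text (an assumed name):

```python
# Character count of each video's transcription text
data["text_length"] = data["video_transcription_text"].str.len()

# Mean character count per class; claims run roughly 13 characters longer
print(data.groupby("claim_status")["text_length"].mean())
```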

paCe: Construct Stage

This random forest model performs exceptionally well, with an average recall score of 0.995 across the five cross-validation folds. A check of the precision score confirms that the model is not simply classifying every sample as a claim; it is making almost perfect classifications.
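
A minimal sketch of how such a model could be tuned with recall as the selection metric, continuing from the sketches above; the feature set, split, and hyperparameter grid shown here are assumptions rather than the project's exact configuration:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Assumed engagement features and a binary target (1 = claim, 0 = opinion)
X = data[["video_view_count", "video_like_count",
          "video_share_count", "video_download_count"]]
y = (data["claim_status"] == "claim").astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

rf = RandomForestClassifier(random_state=0)
cv_params = {"max_depth": [5, None], "n_estimators": [75, 100, 200]}

# Five-fold cross-validation, refit on recall: missing an actual claim
# (a false negative) is the costliest error for this use case
rf_cv = GridSearchCV(rf, cv_params, scoring=["recall", "precision", "f1"],
                     cv=5, refit="recall")
rf_cv.fit(X_train, y_train)
print(rf_cv.best_score_)  # mean recall across the five folds
```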

This XGBoost model also performs exceptionally well. Although its recall score is very slightly lower than the random forest model's, its precision score is perfect.
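
A comparable sketch for the XGBoost model, with the same caveat that the hyperparameter grid is an assumption:

```python
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV

xgb = XGBClassifier(objective="binary:logistic", random_state=0)
cv_params = {"max_depth": [4, 8], "learning_rate": [0.1, 0.3],
             "n_estimators": [100, 300]}

xgb_cv = GridSearchCV(xgb, cv_params, scoring=["recall", "precision", "f1"],
                      cv=5, refit="recall")
xgb_cv.fit(X_train, y_train)
print(xgb_cv.best_score_)  # mean recall across the five folds
```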

The upper-left quadrant displays the number of true negatives: the number of opinions that the model correctly classified as opinions.

The upper-right quadrant displays the number of false positives: the number of opinions that the model misclassified as claims.

The lower-left quadrant displays the number of false negatives: the number of claims that the model misclassified as opinions.

The lower-right quadrant displays the number of true positives: the number of claims that the model correctly classified as claims.

A perfect model would yield all true negatives and true positives, and no false negatives or false positives.

As the above confusion matrix shows, this model does not produce any false negatives.
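
A minimal sketch of how such a confusion matrix could be produced for the tuned random forest, reusing the assumed split and estimators from the sketches above:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# Predict on the held-out data with the tuned random forest
y_pred = rf_cv.best_estimator_.predict(X_test)

# Rows are actual classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
cm = confusion_matrix(y_test, y_pred)
ConfusionMatrixDisplay(confusion_matrix=cm,
                       display_labels=["opinion", "claim"]).plot()
plt.show()
```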

The classification report above shows that the random forest model's scores were nearly perfect. The confusion matrix indicates that there were 10 misclassifications: five false positives and five false negatives.

The results of the XGBoost model were also nearly perfect. However, its errors tended to be false negatives. Identifying claims was the priority, so it's important that the model be good at capturing all actual claim videos. The random forest model has a better recall score, and is therefore the champion model.
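
One way to produce the classification reports used for this comparison, again reusing the assumed estimators and held-out data from the sketches above:

```python
from sklearn.metrics import classification_report

# Compare the two tuned models on the same held-out data
for name, model in [("random forest", rf_cv.best_estimator_),
                    ("XGBoost", xgb_cv.best_estimator_)]:
    preds = model.predict(X_test)
    print(name)
    print(classification_report(y_test, preds,
                                target_names=["opinion", "claim"]))
```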

The most predictive features all were related to engagement levels generated by the video. This is not unexpected, as analysis from prior EDA pointed to this conclusion.
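
A sketch of how those feature importances could be inspected for the champion random forest, using the assumed feature names from the earlier sketches:

```python
import pandas as pd

# Engagement-related features dominate the champion model's importances
importances = pd.Series(rf_cv.best_estimator_.feature_importances_,
                        index=X_train.columns)
print(importances.sort_values(ascending=False))
```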

Note:

  • Both model architectures—random forest (RF) and XGBoost—performed exceptionally well. The RF model had a better recall score (0.995) and was selected as the champion.

  • Performance on the test holdout data yielded near-perfect scores, with only five misclassified samples out of 3,817.

  • Subsequent analysis indicated that, as expected, the primary predictors were all related to video engagement levels, with video view count, like count, share count, and download count accounting for nearly all predictive signal in the data. With these results, we can conclude that videos with higher user engagement levels were much more likely to be claims. In fact, no opinion video had more than 10,000 views (a quick check of this is sketched below).
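
A minimal check of that engagement gap, using the assumed column names from the earlier sketches:

```python
# Maximum and median view counts per class; in the project's data, the
# opinion-class maximum stayed below 10,000 views
print(data.groupby("claim_status")["video_view_count"].agg(["max", "median"]))
```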



pacE: Execute Stage

Conclusion:

  1. One can recommend this model because it performed well on both the validation and test holdout data. Furthermore, both precision and F1 scores were consistently high. The model very successfully classified claims and opinions.

  2. The model's most predictive features were all related to the user engagement levels associated with each video. It was classifying videos based on how many views, likes, shares, and downloads they received.

  3. Because the model currently performs nearly perfectly, there is no need to engineer any new features. However, it would be helpful to have the number of times each video was reported, as well as the total number of user reports for all videos posted by each author.

Next Steps:

As noted, the model performed exceptionally well on the test holdout data. Before deploying the model, the data team recommends further evaluation using additional subsets of user data. Furthermore, the data team recommends monitoring the distribution of video engagement levels to ensure that the model remains robust to fluctuations in its most predictive features.
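
One possible way to implement that monitoring, offered as a sketch rather than the team's stated method: compare the view-count distribution of incoming videos against the training data with a two-sample Kolmogorov-Smirnov test (here the held-out split stands in for a batch of newly logged videos):

```python
from scipy.stats import ks_2samp

# Hypothetical drift check on the model's most predictive feature
train_views = X_train["video_view_count"]
new_views = X_test["video_view_count"]  # stand-in for newly logged videos

stat, p_value = ks_2samp(train_views, new_views)
if p_value < 0.01:
    print("View-count distribution has shifted; re-evaluate the model.")
```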

