TikTok Project - Regression Model
TikTok claims classification project
Overview:
The TikTok data team's objective is to build a machine-learning model that helps distinguish claims from opinions in user submissions. The team noticed that verified users are more inclined to post opinions. As an intermediate step toward the claims classifier, they therefore built a logistic regression model that predicts the verified status of an account from video features. Understanding what is associated with verified status gives the team context on user behavior that can inform claim and opinion classification and support a safer, more informative TikTok platform.
Pace: Plan Stage
Stakeholders: Project Management Officer, Data Science Lead, President.
Purpose and goal: to build and evaluate logistic regression models.
Initial Observations:
There are no duplicates in the data.
There are very few missing values relative to the number of samples in the dataset. Therefore, observations with missing values can be dropped.
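The two checks above can be sketched in pandas as follows. This is a minimal illustration on a toy frame, not the team's actual notebook; the column names verified_status, video_duration_sec and video_view_count are assumed for the example, and the toy data has a much larger share of missing rows than the real dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real dataset; column names and values are assumptions.
df = pd.DataFrame({
    "verified_status": ["verified", "not verified", "not verified",
                        "not verified", "not verified"],
    "video_duration_sec": [32, 15, 47, 20, 51],
    "video_view_count": [1200.0, 350.0, np.nan, 90.0, 4100.0],
})

n_duplicates = int(df.duplicated().sum())             # fully identical rows
missing_per_column = df.isna().sum()                  # missing values per column
rows_with_missing = int(df.isna().any(axis=1).sum())  # rows with at least one gap
share_missing = rows_with_missing / len(df)

# Dropping rows is reasonable only when share_missing is small.
df_clean = df.dropna(axis=0).reset_index(drop=True)

print(n_duplicates, rows_with_missing, len(df_clean))
```

Checking share_missing before calling dropna makes the "very few missing values" judgment explicit rather than implicit.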
pAce: Analyze Stage
Approximately 94.2% of the dataset represents videos posted by unverified accounts and 5.8% represents videos posted by verified accounts, so the outcome variable is heavily imbalanced.
Letter-count distributions for videos from both verified and unverified accounts are approximately normal with a slight right skew.
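The class split noted in this stage can be measured, and if desired rebalanced, along the following lines. The source does not say whether the team resampled, so the upsampling step is shown only as one common option; the frame below is synthetic and simply mirrors the 94.2% / 5.8% split.

```python
import pandas as pd
from sklearn.utils import resample

# Synthetic frame mirroring the reported split; not the real data.
df = pd.DataFrame({
    "verified_status": ["not verified"] * 942 + ["verified"] * 58,
    "video_duration_sec": list(range(1000)),
})

# Share of each class in the outcome variable.
class_share = df["verified_status"].value_counts(normalize=True)

majority = df[df["verified_status"] == "not verified"]
minority = df[df["verified_status"] == "verified"]

# Upsample the minority class with replacement to match the majority size.
minority_upsampled = resample(
    minority, replace=True, n_samples=len(majority), random_state=0
)
balanced = pd.concat([majority, minority_upsampled]).reset_index(drop=True)

print(class_share)
print(balanced["verified_status"].value_counts())
```

If resampling is used, it is generally applied to the training split only, so that evaluation still reflects the original class distribution.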
paCe: Construct Stage
The correlation heatmap shows that the following pair of variables is strongly correlated: video_view_count and video_like_count (correlation coefficient of 0.85).
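The pairwise check behind such a heatmap can be sketched as below. The data is randomly generated so that likes track views, and the 0.8 cutoff for "strongly correlated" is an assumption chosen for illustration, not a threshold stated by the team.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
views = rng.integers(100, 100_000, size=200).astype(float)

# Synthetic features: likes are built from views plus noise; duration is independent.
df = pd.DataFrame({
    "video_view_count": views,
    "video_like_count": views * 0.3 + rng.normal(0, 3000, size=200),
    "video_duration_sec": rng.integers(5, 60, size=200).astype(float),
})

corr = df.corr()  # Pearson correlation matrix

threshold = 0.8  # assumed cutoff for "strongly correlated"
strong_pairs = [
    (a, b, round(float(corr.loc[a, b]), 2))
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if abs(corr.loc[a, b]) >= threshold
]

print(strong_pairs)
```

Passing corr to a plotting call such as seaborn's heatmap with annot=True would give the visual version of the same matrix.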
In the confusion matrix, the upper-left quadrant displays the number of true negatives: the number of videos posted by unverified accounts that the model correctly classified as such.
The upper-right quadrant displays the number of false positives: the number of videos posted by unverified accounts that the model misclassified as posted by verified accounts.
The lower-left quadrant displays the number of false negatives: the number of videos posted by verified accounts that the model misclassified as posted by unverified accounts.
The lower-right quadrant displays the number of true positives: the number of videos posted by verified accounts that the model correctly classified as such.
A perfect model would yield all true negatives and true positives, and no false negatives or false positives.
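The quadrant layout described above matches what scikit-learn's confusion_matrix returns when the labels are ordered as [negative, positive]. A small sketch with made-up labels, assuming 0 encodes "not verified" and 1 encodes "verified":

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration; 0 = not verified, 1 = verified (assumed encoding).
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 0, 0, 1, 0, 1, 0, 1, 1]

# Rows are true classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
tn, fp, fn, tp = (int(v) for v in cm.ravel())

print(cm)
print(tn, fp, fn, tp)  # 4 2 1 3
```

Passing labels explicitly fixes the row and column order, so the quadrants cannot silently swap if the label encoding changes.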
The classification report above shows that the logistic regression model achieved a precision of 69% and a recall of 66% (weighted averages), and it achieved an accuracy of 66%.
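To show how the weighted averages in a classification report are formed, here is a small sketch on made-up labels (not the project's predictions). Each class's precision and recall is weighted by that class's share of the true labels. One consequence is that weighted-average recall always equals accuracy, which is why the two figures above coincide.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Made-up labels for illustration; 0 = not verified, 1 = verified (assumed encoding).
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 0, 0, 1, 0, 1, 0, 1, 1]

# Support-weighted averages across both classes.
precision_w, recall_w, f1_w, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted"
)
accuracy = accuracy_score(y_true, y_pred)

# Class 0: precision 4/5, recall 4/6, support 6
# Class 1: precision 3/5, recall 3/4, support 4
# Weighted precision = 0.8 * 0.6 + 0.6 * 0.4 = 0.72
# Weighted recall    = (4/6) * 0.6 + 0.75 * 0.4 = 0.70 = accuracy
print(round(precision_w, 2), round(recall_w, 2), round(accuracy, 2))
```

For an imbalanced outcome, the per-class rows of the report are usually more informative about the minority class than the weighted averages.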
Note:
According to the logistic regression's estimated model coefficients, longer videos are associated with higher odds of the user being verified. In other words, verified accounts on TikTok are more likely to post longer videos compared to non-verified accounts.
However, the model's estimated coefficients for the other video features are relatively small, indicating that their association with verified status is minor. Consequently, the remaining features in the model, such as engagement counts, do not appear to be strongly associated with an account's likelihood of being verified.
In summary, the analysis suggests that video length is the feature most clearly associated with verified status, while the other video features show little association with whether an account is verified.
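As a rough illustration of how such coefficients are read, the sketch below fits scikit-learn's LogisticRegression on synthetic data in which only video duration affects the outcome. The feature name video_share_count, the data, and the true effect size of 0.05 are all assumptions made for the example and say nothing about the team's actual model.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
n = 2000

# Synthetic features; names are illustrative only.
X = pd.DataFrame({
    "video_duration_sec": rng.integers(5, 61, size=n).astype(float),
    "video_share_count": rng.integers(0, 500, size=n).astype(float),
})

# Synthetic labels: log-odds rise by 0.05 per second; share count has no effect.
log_odds = -2.0 + 0.05 * X["video_duration_sec"]
y = rng.random(n) < 1 / (1 + np.exp(-log_odds))

model = LogisticRegression(max_iter=1000).fit(X, y)

# One coefficient per feature: change in log-odds per one-unit increase.
coefs = pd.Series(model.coef_[0], index=X.columns)
print(coefs)
```

Because each coefficient is expressed per unit of its feature, a small coefficient on a feature with a large numeric range (such as a count in the thousands) is not by itself evidence of a weak association; standardizing the features first makes coefficient sizes more comparable.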
pacE: Execute Stage
Conclusion:
The dataset has a few strongly correlated variables, which might lead to multicollinearity issues when fitting a logistic regression model. We therefore dropped video_like_count before building the model.
Based on the logistic regression model, each additional second of the video is associated with a 0.01 increase in the log-odds of the user having a verified status.
The logistic regression model had decent predictive power: a precision of 69% and a recall of 66% (weighted averages), and it achieved an accuracy of 66%.
We developed a logistic regression model for verified status based on video features. The model had decent predictive power (69% precision and 66% recall). Based on the estimated model coefficients from the logistic regression, longer videos tend to be associated with higher odds of the user being verified. Other video features have small estimated coefficients in the model, so their association with verified status seems to be small.
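The 0.01 log-odds coefficient quoted in the conclusion can be converted to an odds ratio by exponentiating it. The short calculation below uses only that reported value; the 30-second comparison is an arbitrary example, and it assumes all other features are held fixed.

```python
import math

coef_per_second = 0.01  # log-odds change per extra second, as reported above

# Odds ratio for one extra second of video.
odds_ratio_1s = math.exp(coef_per_second)

# Odds ratio for 30 extra seconds (log-odds add, so odds ratios multiply).
odds_ratio_30s = math.exp(coef_per_second * 30)

print(round(odds_ratio_1s, 4))   # 1.0101, about 1% higher odds per second
print(round(odds_ratio_30s, 4))  # 1.3499, about 35% higher odds for 30 seconds
```

Reading the coefficient this way shows that the per-second effect is small, but it compounds over the difference between a short and a long video.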
Next steps:
Now that the TikTok data team has built the logistic regression model to predict the verified status of user accounts, the next step is to construct a classification model designed to predict whether a user submission is a claim or an opinion. This classification model will serve as the final project and fulfill the original expectation of the TikTok team.
The team's analysis of user behavior, including the link between verified status and a higher tendency to post opinions, provides useful context for interpreting the results of the claim classification model.
By examining the outcomes of this model in light of user behavior patterns, the TikTok team can gain deeper insights into how claims are made on the platform and identify any correlations with certain user types or behaviors. This comprehensive understanding will enable TikTok to enhance content moderation, prioritize reviews, and create a safer and more engaging environment for its users.
Overall, the completion of this classification model and the insightful analysis of its results will be crucial steps towards TikTok's goal of effectively identifying and managing claims and opinions on its platform.