Thanks for the answer. I've developed this as a solution to the problem I have to solve:
//PART 1: Aggregate on temporal dimension and obtain percentage of posts classified as NSFW
//Posts are: (id, subreddit.id, subreddit.name, subreddit.nsfw, created_utc, permalink, domain, url, selftext, title, score)
//(x._1, x._3, x._4)
val percentageNSFWPosts = rddPosts.map(x => (x._5, x._4)) // (created_utc, nsfw flag)
  .groupByKey()
  .mapValues { nsfwFlags =>
    val totalPostsAtTime = nsfwFlags.size
    val totNSFWPost = nsfwFlags.count(el => el == true)
    // Multiply by 100.0 so the division is done in floating point;
    // dividing two Ints first would truncate the percentage.
    totNSFWPost * 100.0 / totalPostsAtTime
  }
By doing this, I can group the posts by creation time and get the percentage of posts in each group that are classified as NSFW. Since you mentioned using reduce (which was also suggested to me as a way to make this more efficient), could you help me optimize this by using reduce or reduceByKey?
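
For reference, a reduceByKey version of the same computation could look roughly like the sketch below. It is only a sketch under assumptions: it assumes created_utc is a Long and the nsfw flag is a Boolean, and the names NsfwPercentage, percentagePerKey and the sample data are made up for illustration. The idea is to turn each flag into an (nsfwCount, totalCount) pair and sum the pairs per key, so Spark can combine partial sums on each partition before the shuffle instead of collecting every flag for a key as groupByKey does.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object NsfwPercentage {
  // Assumption: keys are created_utc as Long, values are the nsfw flag as Boolean.
  def percentagePerKey(flagsByTime: RDD[(Long, Boolean)]): RDD[(Long, Double)] =
    flagsByTime
      // Each post becomes (1 if NSFW else 0, 1), i.e. (nsfwCount, totalCount).
      .mapValues(isNsfw => (if (isNsfw) 1L else 0L, 1L))
      // Sum both counters per key; partial sums are combined before the shuffle.
      .reduceByKey { case ((n1, t1), (n2, t2)) => (n1 + n2, t1 + t2) }
      // Floating-point division so the percentage is not truncated.
      .mapValues { case (nsfw, total) => nsfw * 100.0 / total }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("nsfw-percentage")
      .getOrCreate()

    // Hypothetical sample standing in for rddPosts.map(x => (x._5, x._4)).
    val sample = spark.sparkContext.parallelize(Seq(
      (1000L, true), (1000L, false),
      (2000L, true), (2000L, true)
    ))

    percentagePerKey(sample).collect().foreach(println)
    spark.stop()
  }
}
```

With the original RDD, the call would be percentagePerKey(rddPosts.map(x => (x._5, x._4))), provided the tuple fields have the types assumed above.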