Spark URL extractor

4/20/2023

Since I got to know Twitter's top stories, it has surprised me with its simplicity and yet its power to recommend the best stories to read. I always thought it would be fun to try to build something similar. So I decided to create a PoC of Twitter's top stories using Apache Spark.

DISCLAIMER: this is a PoC, mainly focused on learning Spark. This architecture doesn't represent a production-level product, nor do I consider recommending stories for only one user a big data problem.

The code is available at Github; just create valid Twitter API credentials and you can run it.

Solution

First, collect tweets from the stream, analyze them and store those tweets that contain a link, expanding the link to its final destination (removing shortening and click counters).

Then, run a batch job to process the data from the previous stage and create a top 10 list.

These two operations are sufficient to process the data, while also providing enough flexibility to extract meaningful information.

I use Eclipse Scala and the Typesafe sbteclipse plugin to create an Eclipse project.

Now edit src/main/resources/ adding your credentials and rename it to src/main/resources/twitter4j.properties.

Edit following.txt, adding accounts that you find interesting!

    target/scala-2.10/ \

If everything is working properly, every 5 minutes you are going to see a new folder at urls/999999999/. The numbers represent the Unix timestamp, rounded down to the minute.

You can check the activity using the Spark UI.

Internals

Set up a StreamingContext with a 5-minute window, load the accounts and create the Twitter stream:

    // Setup the Streaming Context
    val ssc = new StreamingContext(new SparkConf(), Seconds(300))
    val tweets = TwitterUtils.createStream(ssc, None, followingList)

For each tweet that contains a URL, extract it; if there is more than one URL, extract only the first:

    // Consider only 1st URL on the Tweet

Use SparkSQL to implicitly convert an RDD into a SchemaRDD:

    val sqlContext = new org.apache.spark.sql.SQLContext(ssc.sparkContext)

So we can go through each SchemaRDD in urlsDStream and save it to disk as Parquet.
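Two of the details mentioned in this post, keeping only the first URL of a tweet and naming the output folder after the Unix timestamp rounded down to the minute, can be sketched in plain Scala without any Spark dependency. This is only an illustration under assumptions, not the code from the repository: the object name TweetUrlSketch, the regex, the helper names, and the choice of a timestamp in seconds are all hypothetical.

```scala
object TweetUrlSketch {
  // Hypothetical pattern (an assumption): an http/https link runs until the next whitespace.
  private val UrlPattern = """https?://\S+""".r

  /** Returns the first URL found in the tweet text, or None if there is no link. */
  def firstUrl(text: String): Option[String] =
    UrlPattern.findFirstIn(text)

  /** Rounds a Unix timestamp, assumed to be in seconds, down to the start of its minute. */
  def minuteBucket(epochSeconds: Long): Long =
    epochSeconds - (epochSeconds % 60)

  /** Builds a folder name in the urls/<timestamp>/ style mentioned in the post (layout assumed). */
  def outputFolder(epochSeconds: Long): String =
    s"urls/${minuteBucket(epochSeconds)}/"
}
```

For example, TweetUrlSketch.firstUrl("read http://a.co/1 and http://b.co/2") keeps only http://a.co/1. In a real streaming job the URLs would more likely come from the tweet's own metadata than from a regex over the text; the regex is used here only to keep the sketch self-contained.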
