{"id":2161,"date":"2017-03-09T11:47:22","date_gmt":"2017-03-09T10:47:22","guid":{"rendered":"https:\/\/blog.mi.hdm-stuttgart.de\/?p=2161"},"modified":"2023-06-07T15:28:02","modified_gmt":"2023-06-07T13:28:02","slug":"of-apache-spark-hadoop-vagrant-virtualbox-and-ibm-bluemix-services-part-5-spark-applications-in-pia-project","status":"publish","type":"post","link":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/2017\/03\/09\/of-apache-spark-hadoop-vagrant-virtualbox-and-ibm-bluemix-services-part-5-spark-applications-in-pia-project\/","title":{"rendered":"Of Apache Spark, Hadoop, Vagrant, VirtualBox and IBM Bluemix Services &#8211; Part 5 &#8211; Spark applications in PIA project"},"content":{"rendered":"<p>The main reason for choosing Spark was a second project which we developed for the course \u201cProgramming Intelligent Applications\u201d. For this project we wanted to implement a framework which is able to monitor important events (e.g. terror, natural disasters) on the world through Twitter.<\/p>\n<p>To separate important tweets from others we use Latent Dirichlet Allocation (LDA) which is an algorithm for topic modelling. LDA is able to extract distinct topics through relations between words. An example for such a relation would be that the word \u201cterror\u201d is likely to be used with the word \u201cattack\u201d. This information is enough to generate separate topics for a bunch of documents, in our case tweets. Therefore, the need to annotate training data to learn word distributions for topics is not necessary and LDA is an algorithm which can learn completely unsupervised. For most machine learning applications the performance is getting better with more training data, and this also applies for our application.<\/p>\n<p><!--more--><\/p>\n<p>The most important part was to set up a pipeline which is able to fetch tweets and regularly start a new LDA training with all the collected data. 
After every finished training, the best available model should be evaluated on handpicked evaluation data and used for future filtering of important tweets. Completely autonomously, without any need for human supervision.<\/p>\n<p>However, a vast amount of data needs to be processed. Tweets need to be preprocessed for the training, and the training itself should be as fast as possible to generate new and better models in feasible time. Both the preprocessing and the LDA training are algorithms which can be distributed, so they are very well suited for clusters.<\/p>\n<h3>Spark Applications<\/h3>\n<h4>How many Tweets have location information?<\/h4>\n<p>At the beginning of the project we had to choose a strategy for extracting the location of a tweet. One possibility was to use the location information which is sent along with the tweet. To evaluate whether this is an expressive metric, we checked how many tweets actually have information about their location. In this case, location means the location of the event the tweet refers to, not the user\u2019s location.<\/p>\n<p>For the training of our model we collected about 40GB of tweets. We uploaded this data into our HDFS on IBM Bluemix, and with the help of SparkSQL we could analyze the tweets quickly and easily. The result was that only ~17% of the tweets in our training data carried location information, which was not meaningful enough. So we had to look for an alternative way to extract the location of tweets. The source code can be found in <a href=\"https:\/\/github.com\/bh051\/SysEng_Spark_WS1617\" target=\"_blank\" rel=\"noopener\">our GitHub repository<\/a>.<\/p>\n<h4>Topic Modelling<\/h4>\n<p>In our PIA project we implemented topic modelling (LDA) with the Python library gensim. For test cases we reimplemented the project with Apache Spark to calculate our model on a computer cluster. For this we used the Apache Spark machine learning library (MLlib). 
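As an aside, the location-coverage analysis from the previous section boils down to a simple share computation. Here is a minimal plain-Java sketch of what the SparkSQL query computed, with hypothetical in-memory data standing in for the tweet JSON in HDFS (no Spark required):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

/** Sketch of the location-coverage analysis: what fraction of tweets
 *  carries location information? A null entry stands for a tweet
 *  without a location field. */
public class LocationShare {
    static double share(List<String> locations) {
        // count tweets whose location field is present
        long withLocation = locations.stream().filter(Objects::nonNull).count();
        return (double) withLocation / locations.size();
    }

    public static void main(String[] args) {
        // hypothetical mini-sample: 1 of 5 tweets carries a location
        List<String> locations = Arrays.asList("Paris", null, null, null, null);
        System.out.println(share(locations)); // prints 0.2
    }
}
```

In the real analysis the same ratio was computed over the full 40GB of tweets, yielding the ~17% figure.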
Since we don\u2019t want to go into the depths of the machine learning algorithm, we explain some of the code in the following and outline why each step is necessary.<\/p>\n<p>The usage of the distributed implementation of LDA is quite simple.<\/p>\n<pre class=\"prettyprint lang-java\" data-start-line=\"1\" data-visibility=\"visible\" data-highlight=\"\" data-caption=\"\">int numTopics = 3;\nLDA lda = new LDA().setK(numTopics).setMaxIterations(100);\nLDAModel model = lda.run(documents);<\/pre>\n<p>The Spark machine learning library (MLlib) provides a class called LDA which has to be instantiated. On it we can set parameters like the number of topics to find and the maximum number of iterations. We then call its run method, which expects the training data as a parameter and returns the machine learning model; this model can afterwards be used to assign a tweet to a certain topic. The challenge was preparing the data (the variable documents) which is passed to the run method. Thanks to the functional-style programming API of Apache Spark this could be realized very elegantly. The data we had to preprocess were tweets stored as JSON files. A tweet JSON object contains a lot of information, like the actual text, user, location, date etc.<\/p>\n<p>First, we had to create a corpus from the tweets of the training data. The corpus consists of tuples, where each tuple consists of a word that occurs in the training data and its number of occurrences. 
To achieve this, each tweet has to be tokenized and preprocessed.<\/p>\n<pre class=\"prettyprint lang-java\" data-start-line=\"1\" data-visibility=\"visible\" data-highlight=\"\" data-caption=\"\">DataFrame dfTweets = sqlContext.read().json(\"hdfs:\/\/...\/user\/tweets\/*.json\");\ndfTweets.registerTempTable(\"tweets\");\nDataFrame dfTweetsText = sqlContext.sql(\"SELECT text FROM tweets\");\n\n\/\/ preprocess tweets\nJavaRDD&lt;List&lt;String&gt;&gt; tokenized = dfTweetsText.javaRDD()\n    .map(row -&gt; {\n        \/\/ get text and transform to lowercase\n        String text = row.getString(0).toLowerCase();\n        \/\/ remove html markup\n        text = Jsoup.parse(text).text();\n        \/\/ remove emojis (unicodeOutliers is a precompiled regex, definition omitted)\n        text = unicodeOutliers.matcher(text).replaceAll(\" \");\n        return text;\n    })\n    .map(text -&gt; Arrays.asList(text.trim().split(\" \"))) \/\/ split tweet into words\n    .filter(words -&gt; words.size() &gt; 3); \/\/ remove tweets with 3 words or fewer<\/pre>\n<p>The code above is a reduced version of the original, because the full version would go beyond the scope of this article. First, we read the JSON files from HDFS via the Spark SQL context and register a table \u201ctweets\u201d. As mentioned earlier, a tweet JSON object has many attributes, hence the tweets data frame has a lot of columns. Therefore we create a new data frame with only the text column and save it in the variable dfTweetsText (line 3). On this object we call the method javaRDD, which transforms the rows of the data frame into an RDD that can be processed with higher-order functions. In the first map transformation the text is converted to lowercase, and HTML markup and emojis are removed. In the next step each tweet is split into words. Finally, we filter out the tweets that are left with three words or fewer after preprocessing, because these tweets have no semantic value anymore. 
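Stripped of the Spark and Jsoup plumbing, the per-tweet transformation in the map steps above can be sketched in plain Java. The markup-stripping regex and the emoji pattern here are simplifying assumptions standing in for Jsoup and the unicodeOutliers pattern, whose definition is not shown:

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

/** Sketch of the per-tweet preprocessing: lowercase, strip markup,
 *  drop exotic unicode, split into words. */
public class TweetPreprocessing {
    // assumption: treat every code point outside the basic multilingual
    // plane (e.g. emojis) as an outlier, like unicodeOutliers would
    static final Pattern UNICODE_OUTLIERS = Pattern.compile("[^\\u0000-\\uFFFF]");

    static List<String> tokenize(String rawTweet) {
        String text = rawTweet.toLowerCase();                    // lowercase
        text = text.replaceAll("<[^>]*>", " ");                  // crude markup removal (Jsoup in the original)
        text = UNICODE_OUTLIERS.matcher(text).replaceAll(" ");   // remove emojis / exotic code points
        return Arrays.stream(text.trim().split("\\s+"))          // split into words
                     .filter(w -> !w.isEmpty())
                     .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Breaking <b>News</b> from the coast"));
    }
}
```

In the real pipeline this logic runs inside the map transformations on the RDD, and tweets with three words or fewer are filtered out afterwards.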
The result of this preprocessing is an RDD of lists of strings: each list represents a tweet, and its entries are the words of that tweet.<br \/>\nIn the next step we create the corpus. As mentioned above, the corpus consists of key-value pairs where the key is a word and the value is the number of its occurrences across all tweets. For this we transform the tokenized RDD created in the previous code snippet.<\/p>\n<pre class=\"prettyprint lang-java\" data-start-line=\"1\" data-visibility=\"visible\" data-highlight=\"\" data-caption=\"\">\/\/ create corpus\nJavaPairRDD&lt;String, Integer&gt; corpus = tokenized.flatMap(words -&gt; words)\n                                  .mapToPair(word -&gt; new Tuple2&lt;String, Integer&gt;(word, 1)) \/\/ build pair of word and count 1\n                                  .reduceByKey((a,b) -&gt; a+b) \/\/ sum up the word occurrences\n                                  .cache();<\/pre>\n<p>With the method flatMap the RDD of word lists is <i>flattened<\/i> to a stream of words. In the second step, mapToPair forms a tuple of each word and the count 1. In the reduceByKey step the values of all tuples with the same key are summed up. The result is the desired corpus, which is cached. In the next step the vector representation of the documents (tweets) which the LDA algorithm expects can be created from the corpus and the tokenized object. How we did this can be seen in <a href=\"https:\/\/github.com\/bh051\/SysEng_Spark_WS1617\" target=\"_blank\" rel=\"noopener\">our GitHub project<\/a>; at this point we just want to give a brief impression of implementing machine learning applications with Spark. 
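The vectorization step that is only mentioned above can be sketched as follows. This is a minimal plain-Java version assuming a vocabulary (a word-to-index mapping) has been derived from the corpus; the vocabulary and sample tweet are hypothetical, and Spark's LDA additionally expects each count vector to be paired with a document id:

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch of turning a tokenized tweet into the term-count vector
 *  representation that LDA expects. */
public class DocumentVectors {
    static double[] toCountVector(List<String> tweetWords, Map<String, Integer> vocabulary) {
        double[] counts = new double[vocabulary.size()];
        for (String word : tweetWords) {
            Integer index = vocabulary.get(word);   // position of the word in the vocabulary
            if (index != null) counts[index]++;     // ignore out-of-vocabulary words
        }
        return counts;
    }

    public static void main(String[] args) {
        // hypothetical vocabulary as it would be derived from the corpus
        Map<String, Integer> vocabulary = new LinkedHashMap<>();
        vocabulary.put("terror", 0);
        vocabulary.put("attack", 1);
        vocabulary.put("earthquake", 2);
        double[] v = toCountVector(Arrays.asList("terror", "attack", "terror"), vocabulary);
        System.out.println(Arrays.toString(v)); // prints [2.0, 1.0, 0.0]
    }
}
```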
We found Spark very pleasant for implementing distributed machine learning applications.<\/p>\n<p>In the final part of this article series we discuss Apache Spark and Apache Hadoop and how they relate to each other.<\/p>\n<p><a href=\"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/2017\/03\/09\/of-apache-spark-hadoop-vagrant-virtualbox-and-ibm-bluemix-services-part-6-apache-spark-andvs-apache-hadoop\/\">Part 6 &#8211; Apache Spark and\/vs Hadoop?<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>The main reason for choosing Spark was a second project which we developed for the course \u201cProgramming Intelligent Applications\u201d. For this project we wanted to implement a framework which is able to monitor important events (e.g. terror, natural disasters) on the world through Twitter. To separate important tweets from others we use Latent Dirichlet Allocation [&hellip;]<\/p>\n","protected":false},"author":49,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[22,651,2],"tags":[],"ppma_author":[721],"class_list":["post-2161","post","type-post","status-publish","format-standard","hentry","category-student-projects","category-system-designs","category-system-engineering"],"aioseo_notices":[],"jetpack_featured_media_url":"","jetpack-related-posts":[{"id":2143,"url":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/2017\/03\/08\/of-apache-spark-hadoop-vagrant-virtualbox-and-ibm-bluemix-services\/","url_meta":{"origin":2161,"position":0},"title":"Of Apache Spark, Hadoop, Vagrant, VirtualBox and IBM Bluemix Services &#8211; Part 1 &#8211; Introduction","author":"bh051, cz022, ds168","date":"8. March 2017","format":false,"excerpt":"As part of the lecture \u201cSystem Engineering and Management\u201d in the winter semester 2016\/17, we run a project with Apache Spark and the Apache Hadoop Ecosystem. 
In this article series firstly we want to introduce Apache Spark and the Apache Hadoop Ecosystem. Furthermore we want to give an overview of\u2026","rel":"","context":"In &quot;Student Projects&quot;","block_context":{"text":"Student Projects","link":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/category\/student-projects\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":2153,"url":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/2017\/03\/08\/of-apache-spark-hadoop-vagrant-virtualbox-and-ibm-bluemix-services-part-3-what-is-apache-spark\/","url_meta":{"origin":2161,"position":1},"title":"Of Apache Spark, Hadoop, Vagrant, VirtualBox and IBM Bluemix Services &#8211; Part 3 &#8211; What is Apache Spark?","author":"bh051, cz022, ds168","date":"8. March 2017","format":false,"excerpt":"Apache Spark is a framework for fast processing of large data on computer clusters. Spark applications can be written in Scala, Java, Python or R and can be executed in the cloud or on Hadoop (YARN) or Mesos cluster managers. 
It is also possible to run Spark applications standalone, that\u2026","rel":"","context":"In &quot;Student Projects&quot;","block_context":{"text":"Student Projects","link":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/category\/student-projects\/"},"img":{"alt_text":"","src":"https:\/\/i0.wp.com\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2017\/03\/spark-overview-768x195.png?resize=350%2C200&ssl=1","width":350,"height":200,"srcset":"https:\/\/i0.wp.com\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2017\/03\/spark-overview-768x195.png?resize=350%2C200&ssl=1 1x, https:\/\/i0.wp.com\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2017\/03\/spark-overview-768x195.png?resize=525%2C300&ssl=1 1.5x"},"classes":[]},{"id":2157,"url":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/2017\/03\/09\/of-apache-spark-hadoop-vagrant-virtualbox-and-ibm-bluemix-services-part-4-big-data-engineering\/","url_meta":{"origin":2161,"position":2},"title":"Of Apache Spark, Hadoop, Vagrant, VirtualBox and IBM Bluemix Services &#8211; Part 4 &#8211; Big Data Engineering","author":"bh051, cz022, ds168","date":"9. March 2017","format":false,"excerpt":"Our objective in this project was to build an environment that could be practical. So we set up a virtual Hadoop test cluster with virtual machines. Our production environment was a Hadoop Cluster in the IBM Bluemix cloud which we could use for free with our student accounts. 
We developed\u2026","rel":"","context":"In &quot;Student Projects&quot;","block_context":{"text":"Student Projects","link":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/category\/student-projects\/"},"img":{"alt_text":"","src":"https:\/\/i0.wp.com\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2017\/03\/dev-env-spark-768x512.png?resize=350%2C200&ssl=1","width":350,"height":200,"srcset":"https:\/\/i0.wp.com\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2017\/03\/dev-env-spark-768x512.png?resize=350%2C200&ssl=1 1x, https:\/\/i0.wp.com\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2017\/03\/dev-env-spark-768x512.png?resize=525%2C300&ssl=1 1.5x"},"classes":[]},{"id":2165,"url":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/2017\/03\/09\/of-apache-spark-hadoop-vagrant-virtualbox-and-ibm-bluemix-services-part-6-apache-spark-andvs-apache-hadoop\/","url_meta":{"origin":2161,"position":3},"title":"Of Apache Spark, Hadoop, Vagrant, VirtualBox and IBM Bluemix Services &#8211; Part 6 &#8211; Apache Spark and\/vs Apache Hadoop?","author":"bh051, cz022, ds168","date":"9. March 2017","format":false,"excerpt":"At the beginning of this article series we introduced the core concepts of Hadoop and Spark in a nutshell. Both, Apache Spark and Apache Hadoop are frameworks for efficient processing of large data on computer clusters. The question arises how they differ or relate to each other. 
Hereof it seems\u2026","rel":"","context":"In &quot;Student Projects&quot;","block_context":{"text":"Student Projects","link":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/category\/student-projects\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":2151,"url":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/2017\/03\/08\/of-apache-spark-hadoop-vagrant-virtualbox-and-ibm-bluemix-services-part-2-apache-hadoop-ecosystem\/","url_meta":{"origin":2161,"position":4},"title":"Of Apache Spark, Hadoop, Vagrant, VirtualBox and IBM Bluemix Services &#8211; Part 2 &#8211; Apache Hadoop Ecosystem","author":"bh051, cz022, ds168","date":"8. March 2017","format":false,"excerpt":"In our project we primarily implemented Spark applications, but we used components of Apache Hadoop like the Hadoop distributed file system or the cluster manager Hadoop YARN. For our discussion in the last part of this blog article it is moreover necessary to understand Hadoop MapReduce for comparison to Apache\u2026","rel":"","context":"In &quot;Student Projects&quot;","block_context":{"text":"Student Projects","link":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/category\/student-projects\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":10289,"url":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/2020\/03\/09\/distributed-stream-processing-frameworks-what-they-are-and-how-they-perform\/","url_meta":{"origin":2161,"position":5},"title":"Distributed stream processing frameworks &#8211; what they are and how they perform","author":"Alexander Merker","date":"9. 
March 2020","format":false,"excerpt":"An overview on stream processing, common frameworks as well as some insights on performance based on benchmarking data","rel":"","context":"In &quot;Allgemein&quot;","block_context":{"text":"Allgemein","link":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/category\/allgemein\/"},"img":{"alt_text":"","src":"https:\/\/i0.wp.com\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2023\/08\/storm_arch.png?resize=350%2C200&ssl=1","width":350,"height":200,"srcset":"https:\/\/i0.wp.com\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2023\/08\/storm_arch.png?resize=350%2C200&ssl=1 1x, https:\/\/i0.wp.com\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2023\/08\/storm_arch.png?resize=525%2C300&ssl=1 1.5x"},"classes":[]}],"jetpack_sharing_enabled":true,"authors":[{"term_id":721,"user_id":49,"is_guest":0,"slug":"bh051","display_name":"bh051, cz022, ds168","avatar_url":"https:\/\/secure.gravatar.com\/avatar\/6e0cfeb23e37b530d4d35d4e46d3e6f39969124f52f6474b4cf0f23b6ff524ac?s=96&d=mm&r=g","0":null,"1":"","2":"","3":"","4":"","5":"","6":"","7":"","8":""}],"_links":{"self":[{"href":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/wp-json\/wp\/v2\/posts\/2161","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/wp-json\/wp\/v2\/users\/49"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/wp-json\/wp\/v2\/comments?post=2161"}],"version-history":[{"count":7,"href":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/wp-json\/wp\/v2\/posts\/2161\/revisions"}],"predecessor-version":[{"id":24710,"href":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/wp-json\/wp\/v2\/posts\/2161\/revisions\/24710"}],"wp:attachment":[{"href":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php
\/wp-json\/wp\/v2\/media?parent=2161"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/wp-json\/wp\/v2\/categories?post=2161"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/wp-json\/wp\/v2\/tags?post=2161"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/wp-json\/wp\/v2\/ppma_author?post=2161"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}