Levels Taught:
Elementary, Middle School, High School, College, University, PhD
| Teaching Since: | Apr 2017 |
| Last Sign in: | 103 Weeks, 3 Days Ago |
| Questions Answered: | 4870 |
| Tutorials Posted: | 4863 |
MBA IT, Master in Science and Technology
DeVry University
Jul-1996 - Jul-2000
Professor
DeVry University
Mar-2010 - Oct-2016
Objectives:
- Refresh and sharpen your coding skills
- Become familiar with semi-structured data processing and data transformation

Description:
In this assignment, you are given a dataset of approximately 20,000 news documents collected from a set of newsgroups (mailing lists). The set of documents (email messages) is partitioned almost evenly across 20 different topics such as sport, electronics, politics, etc. The documents of each newsgroup are stored in one directory. Each news document is stored in a text file in a semi-structured format.
Each document starts with the email header, which contains a number of attributes such as "From:", "Subject:", "Organization:", etc., followed by the message body.

In this assignment, you are asked to do the following tasks:
1. Download the dataset and decompress the files
2. Parse the documents and extract key information
3. Compute some stats

The details follow:

1. Download the Dataset and Decompress the Files
Download the 20-newsgroups.zip file from the assignment page on Blackboard. Decompress the .zip file anywhere you want. That should give you a folder named "20-newsgroups" which contains 20 subfolders, one subfolder for each category.

2. Task 1: Parse the Documents, Clean Them from Noise, & Extract Key Information
Your first task is to write a Java program to process the documents in the dataset and parse the semi-structured content of each document to extract the following information:
- Category: The document category can be extracted from the name of the folder in which the document is stored (e.g. rec.sport.hockey, rec.autos, comp.sys.mac.hardware, etc.).
- Sender (From): This can be extracted from the "From:" field of the message header (highlighted in yellow in the example above).
- Sender Affiliation (Organization): This can be extracted from the "Organization:" field of the message header. Some documents have no Organization field; for those you can set the Organization value to "N/A".
- Subject: This can be extracted from the "Subject:" field of the message header.
- Document Body: This is the message body (highlighted in green in the example above). It is the text that comes after all the header fields in the document. None of the header fields (such as "Lines:", "Keywords:", "Article-I.D.:", etc.) should be included in the body.

Data Pre-processing/Cleaning:
As we mentioned in class, Big Data can be noisy and, therefore, some pre-processing might be needed to clean the data and improve its quality. Your code should clean the data as follows:
- All the fields that are not mentioned above (and highlighted in yellow in the example) should be discarded. For example, the fields "Article-I.D.", "Lines", etc. from the example above should be discarded.
- All quoted text should be removed from the message body. In the example above, all the text highlighted in red should be discarded. Quoted text usually starts with a line that says something like "X@Org.org (X Y Z) writes:" followed by one or more lines that start with '>'. All such lines should be discarded.
  Hint: For simplicity, assume that any line that contains ":" should be discarded except the lines that correspond to the required header fields (e.g. From, Organization, etc.). This should take care of the two previous preprocessing requirements.
- In the message body (highlighted in green in the example above) there might be some tab ('\t') characters or repeated spaces, which use extra space for no useful reason. Make sure to clean the message body by replacing tab characters and repeated spaces with a single space character. You can use a regex to achieve this, something like: body = body.replaceAll("\\s+", " ");
- Since the body of most documents spans multiple lines, we can't store the body content as-is in the .tsv file; otherwise, it would break the file format, because in our file format a new line means a new document. Therefore, you should replace each new line character in the "body" with some other indicator.

A sketch of one way to implement this parsing and cleaning is shown below.
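The following is a minimal, illustrative sketch (not the required implementation) of per-file parsing and cleaning. The class and method names (DocumentParser, parseDocument) are made up for this example, reading files as ISO-8859-1 is an assumption to avoid decoding errors on non-UTF-8 messages, and a single space is used as the "other indicator" that replaces newlines in the body.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Illustrative sketch: parse one news document into a single TSV line.
public class DocumentParser {

    // Returns: Category \t From \t Organization \t Subject \t Body
    public static String parseDocument(Path file) throws IOException {
        String category = file.getParent().getFileName().toString(); // folder name = category
        String from = "N/A", organization = "N/A", subject = "N/A";
        StringBuilder body = new StringBuilder();

        // Assumption: ISO-8859-1 avoids decoding errors on messages that are not valid UTF-8.
        List<String> lines = Files.readAllLines(file, StandardCharsets.ISO_8859_1);
        for (String line : lines) {
            if (line.startsWith("From:")) {
                from = line.substring("From:".length()).trim();
            } else if (line.startsWith("Organization:")) {
                organization = line.substring("Organization:".length()).trim();
            } else if (line.startsWith("Subject:")) {
                subject = line.substring("Subject:".length()).trim();
            } else if (line.contains(":") || line.startsWith(">")) {
                // Simplification from the hint: every other line containing ':' (remaining
                // header fields, "... writes:" lines) and every quoted '>' line is dropped.
                continue;
            } else {
                body.append(line).append(' '); // newline replaced by a space
            }
        }

        // Collapse tabs and repeated spaces into single spaces.
        String cleanBody = body.toString().replaceAll("\\s+", " ").trim();

        return String.join("\t", category, from, organization, subject, cleanBody);
    }
}
```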
Input/Output:
Your program should take the absolute path of the "20-newsgroups" folder (from step 1) and traverse all the subfolders, processing the files in each folder. Your code should parse each file to extract the information mentioned above. The output of this task should be one large file that contains all the data in Tab Separated Values (.tsv) format; this format is commonly used to store Big Data. Your output file should include a line for each document in the set. Each line should include the aforementioned information in the same order mentioned above (i.e. Category, From, Organization, Subject, Body), with the values separated by tabs (the '\t' character). In the assignment documents, you'll find a sample input/output for you to check and test your code. (A sketch of such a driver appears at the end of this description.)

3. Task 2: Compute Stats
Your code in this task should take the ".tsv" file generated in Task 1 as input and compute the following stats:
- The total number of documents (which should be the same as the number of lines in the file).
- The average word count of the document "body" in the dataset. To get this, compute the number of words in the "body" column for each document (line), then compute the average. For simplicity, assume that words are separated by single spaces.
- The average number of documents per category. To compute this, count the number of documents that belong to each category and then compute the average value.
- The category with the maximum average "body" word count.
- The category with the minimum average "body" word count.
The output of this task should be another Tab Separated Values (.tsv) file. Your file will contain one line only, with tab-separated values for the requested stats in the same order that they appear above. (A sketch of this task also appears at the end of this description.)

What to submit?
1. Put all Task 1 source code in a folder named "task1-src". Compress the folder and submit a file named "task1-src.zip".
2. Compile and export your Task 1 code to a single executable JAR file. Submit a file named task1.jar. It should be possible to execute your jar file using something like the following command:
   java -jar task1.jar DataSetFolderPath Output.tsv
3. Put all Task 2 source code in a folder named "task2-src". Compress the folder and submit a file named "task2-src.zip".
4. Compile and export your Task 2 code to a single executable JAR file. Submit a file named task2.jar. It should be possible to execute your jar file using something like the following command:
   java -jar task2.jar input.tsv Output.tsv
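For the Input/Output requirement of Task 1, here is one possible shape for the driver, assuming the hypothetical DocumentParser sketch shown earlier. The class name Task1 and the use of Files.walk are illustrative choices, not part of the assignment.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Illustrative Task 1 driver: java -jar task1.jar DataSetFolderPath Output.tsv
public class Task1 {
    public static void main(String[] args) throws IOException {
        Path datasetFolder = Paths.get(args[0]); // absolute path of the "20-newsgroups" folder
        Path outputFile = Paths.get(args[1]);    // e.g. Output.tsv

        try (Stream<Path> walk = Files.walk(datasetFolder);
             BufferedWriter out = Files.newBufferedWriter(outputFile, StandardCharsets.UTF_8)) {
            // Collect all regular files; each subfolder is a category, each file a document.
            List<Path> documents = walk.filter(Files::isRegularFile).collect(Collectors.toList());
            for (Path document : documents) {
                out.write(DocumentParser.parseDocument(document)); // one TSV line per document
                out.newLine();
            }
        }
    }
}
```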
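And a similar sketch for Task 2, assuming the five-column layout above (Category, From, Organization, Subject, Body) with the body in the fifth column. The assignment does not prescribe the exact formatting of the single output line, so the ordering below simply follows the list of stats as written.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative Task 2 driver: java -jar task2.jar input.tsv Output.tsv
public class Task2 {
    public static void main(String[] args) throws IOException {
        Path input = Paths.get(args[0]);  // TSV produced by Task 1
        Path output = Paths.get(args[1]); // one-line stats TSV

        Map<String, Integer> docsPerCategory = new HashMap<>();
        Map<String, Long> wordsPerCategory = new HashMap<>();
        long totalDocs = 0, totalWords = 0;

        List<String> lines = Files.readAllLines(input, StandardCharsets.UTF_8);
        for (String line : lines) {
            String[] cols = line.split("\t", -1); // Category, From, Organization, Subject, Body
            String category = cols[0];
            int words = cols[4].isEmpty() ? 0 : cols[4].split(" ").length; // words = space-separated tokens
            totalDocs++;
            totalWords += words;
            docsPerCategory.merge(category, 1, Integer::sum);
            wordsPerCategory.merge(category, (long) words, Long::sum);
        }

        double avgWordCount = totalDocs == 0 ? 0 : (double) totalWords / totalDocs;
        double avgDocsPerCategory = docsPerCategory.isEmpty() ? 0 : (double) totalDocs / docsPerCategory.size();

        // Per-category average body word count, then the categories with the max and min averages.
        String maxCategory = "", minCategory = "";
        double maxAvg = Double.NEGATIVE_INFINITY, minAvg = Double.POSITIVE_INFINITY;
        for (Map.Entry<String, Integer> e : docsPerCategory.entrySet()) {
            double avg = (double) wordsPerCategory.get(e.getKey()) / e.getValue();
            if (avg > maxAvg) { maxAvg = avg; maxCategory = e.getKey(); }
            if (avg < minAvg) { minAvg = avg; minCategory = e.getKey(); }
        }

        String statsLine = String.join("\t",
                Long.toString(totalDocs),
                Double.toString(avgWordCount),
                Double.toString(avgDocsPerCategory),
                maxCategory,
                minCategory);

        try (BufferedWriter out = Files.newBufferedWriter(output, StandardCharsets.UTF_8)) {
            out.write(statsLine);
            out.newLine();
        }
    }
}
```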
Attachments: