Lab 3: Analyzer

Released Wednesday, September 25
Due Wednesday, October 2 at 11:59am ET

Objectives

  • Write a command-line Node.js program.
  • Read, parse, and analyze text files.
  • Store and retrieve information with CSVs.
  • Use file hashes to optimize your program’s performance.
  • Implement a simple server that sends data that can be processed client-side.

Overview

In this lab, you will analyze texts (a sample set is provided from Project Gutenberg, a repository of over 50,000 books, poems, and plays, many of which are in the public domain) by first determining how many words, characters, and sentences appear in the file to be analyzed. Then, you will feed that data into two commonly used formulas for determining a text’s reading level (i.e., the grade level a student should have reached in order to read the text without difficulty): the Coleman-Liau index and the Automated Readability Index.
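The staff has already implemented both formulas for you, but for reference, the published versions are below (shown here only so you can see what your character, word, and sentence counts feed into):

```javascript
// The published Coleman-Liau and Automated Readability Index formulas.
// Coleman-Liau uses average letters per 100 words and average sentences
// per 100 words; ARI uses characters per word and words per sentence.
const colemanLiau = (chars, words, sentences) =>
  0.0588 * (100 * chars / words) - 0.296 * (100 * sentences / words) - 15.8;

const ari = (chars, words, sentences) =>
  4.71 * (chars / words) + 0.5 * (words / sentences) - 21.43;
```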

Your program should not only output information about the text to its user (e.g., via console.log) but also store that information in a CSV file (we’ll soon tackle more robust storage in a SQL database, beginning with Lab 4). In addition, because such textual analysis can take quite a long time if the text file is large enough, you will implement file hashing to check whether you’ve previously analyzed a file; if so, you will retrieve and output the previously stored data rather than re-analyzing that file.

Collaboration

Per the syllabus, it is permissible to collaborate with one other classmate on this lab. If you do, both partners must still submit their own repositories, and both must turn in identical work. You should use the README.md file to note that you worked with a partner and to state clearly who your partner was (name and GitHub username).

Getting Ready

  1. Click here to go to the GitHub Classroom page for starting the assignment.
  2. Click the green “Accept this assignment” button. This will create a GitHub repository for your project.
  3. Click on the link that follows “Your assignment has been created here”, which will direct you to the GitHub repository page for your project. It may take a few seconds for GitHub to finish creating your repository. You should then be able to clone the cs276/lab3-username repository on your machine. (CS276 is the course identifier for this course at Yale, in case you’re curious where that organization name comes from!) Always push your code to your cs276/lab3-username repository.
  4. If not already installed on your computer, download and install the latest LTS version of Node.js, which as of the time of this writing is version 10.16.3. When you install Node, it will come bundled with npm, the Node Package Manager, which will allow you to include packages in your server application.
  5. Follow the instructions atop README.md.

Requirements

  • Files: Your readability analysis (command-line) application should be contained within a single Node.js file, readability.js. Your web front-end application should be contained within app.js.
  • ES6: Your code should use only the ES6 flavor of JavaScript, and in particular:
    • You should use let or const instead of var to declare variables.
    • You are welcome to use require() in your Node.js code, as we’ve shown in class, instead of or alongside ES6’s import directive.
    • You should use arrow functions (=>) whenever possible.
  • Additional Packages: You may use additional Node.js packages, installed via npm as needed. Be sure to use --save when installing, so that your package.json updates the dependencies list!

Steps

Below is a recommended ordering; you need not follow it, but things may flow a bit more smoothly if you take this approach.

  1. Command-Line Arguments: readability.js should open and analyze the file provided as a command line argument, which will be stored in process.argv[2]. You’ll notice that the distribution code already checks to make sure such an argument is present. You can open the file using the readFile method from the fs package.
  2. Tokenizing: Complete the implementations of countChars, countWords, and countSentences, per their descriptions, so that their outputs can be fed into the provided Coleman-Liau and Automated Readability Index formulas that the staff wrote. You are encouraged to refer to the documentation, given above, for text-tokenizer and tokenize-english. The latter, in particular, will be helpful for tokenizing sentences.
    • Note: The documentation for tokenize-english is not completely correct. The behavior is actually closer to that of text-tokenizer for words and characters. Thus, the example it provides of

        var tokens = tokenizeEnglish.sentences("On Jan. 20, former Sen. Barack Obama became the 44th President of the U.S. Millions attended the Inauguration.");
      

      should actually be

        const tokens = tokenizeEnglish.sentences()("On Jan. 20, former Sen. Barack Obama became the 44th President of the U.S. Millions attended the Inauguration.");
      
      • Note also that tokenize-english, when tokenizing sentences, will also treat a newline (\n) as the end of a sentence, so you may want to get rid of those by replacing them with single spaces. Given a chunk of text text, the line below would accomplish this:

        const nonewlines = text.split(/\n/).join(' ');
        
      • Don’t worry if your character/word/sentence counts differ from the staff’s by just one or two percent or so; that likely reflects our deciding to count characters slightly differently!
      • After testing your tokenizing, you should call those functions near the bottom of the file (inside the if (require.main === module) block) and check that they behave as expected, as by logging their output with console.log.
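If you’d like to sanity-check your counts before wiring up the tokenizer packages, a rough library-free sketch of the three functions is below. These regex-based versions are for quick testing only; text-tokenizer and tokenize-english handle edge cases (abbreviations and the like) that these do not, so the counts may differ slightly:

```javascript
// Rough, library-free counterparts to countChars, countWords, and
// countSentences, for sanity-checking on small strings.
const countChars = (text) => (text.match(/[A-Za-z0-9]/g) || []).length;
const countWords = (text) => text.split(/\s+/).filter((w) => w.length > 0).length;
const countSentences = (text) => (text.match(/[.!?]+/g) || []).length;
```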
  3. Logging Output: Your application should log a summary of the analyzed text to the user before exiting. Complete the implementation of printResults to have that effect. Below is one potential format for logging this data, but yours may vary. You should log the scores to at least three decimal places, though you may decide to include more. You may also consider now factoring out your individual test calls to the countChars, countWords, and countSentences functions and instead start working on the implementation of calculateResults here.

     REPORT for ./texts/magnacarta.txt
     78901 characters
     17815 words
     504 sentences
     ------------------
     Coleman-Liau Score: 9.405
     Automated Readability Index: 17.104
    

We strongly recommend you complete all of the above before proceeding to the next step.
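One possible sketch of printResults from the logging step appears below. The field names on the results object are hypothetical (yours may differ); the one concrete takeaway is that toFixed(3) produces the required three decimal places:

```javascript
// Hypothetical results object shape: { filename, chars, words,
// sentences, cl, ari }. toFixed(3) formats the scores to three
// decimal places.
const printResults = (r) => {
  console.log(`REPORT for ${r.filename}`);
  console.log(`${r.chars} characters`);
  console.log(`${r.words} words`);
  console.log(`${r.sentences} sentences`);
  console.log('------------------');
  console.log(`Coleman-Liau Score: ${r.cl.toFixed(3)}`);
  console.log(`Automated Readability Index: ${r.ari.toFixed(3)}`);
};
```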

  4. CSV Storage: In addition to logging the data for the user, you should store the results of your analysis in a comma-separated values file (a results.csv file, a template of which we’ve provided for you). Complete the implementation of saveResults so that it does exactly this, and you can then call saveResults as part of calculateResults. Most spreadsheet programs can handle viewing these files, as can most text editors. We have provided you with all of the columns you should need to store in our template, but you may add others at your discretion. Be sure to consult documentation for fast-csv and our in-class examples for tips on handling CSV files. At this time, you might as well also hash the file itself, obtaining an MD5 hash (check the md5-file package documentation to remind yourself how!), and make that part of what you’re writing to the CSV in that row. An example of what your CSV might look like after a few entries have been recorded is below:

     filename,hash,characters,words,sentences,cl,ari
     texts/magnacarta.txt,8ed7a3ebcf8dcd595a840ae4356b33f3,78901,17815,504,9.405,17.104
     texts/constitution.txt,d85098a7af117a696cbdc8d227ad86c4,33766,7109,193,11.325,19.358
     texts/wuthering.txt,986fd38bc2f3b9fd5dabc57dd559f084,513419,122541,5197,7.581,10.093
    

We again strongly recommend you complete the above before proceeding to the next step.

  5. File Hashing: You’ve perhaps noticed that as the text files get larger, the time the program spends analyzing them also increases sharply, and not necessarily linearly. That isn’t a problem in itself (after all, bigger files will take more time), but you probably want to avoid analyzing the same file twice if you can. Leverage the MD5 hash that you calculated and stored: before analyzing a file, check whether you’ve already analyzed and logged it. This will require completing the implementation of parseCSV, as you’ll need to dig through the rows of the CSV file to find whether that hash already exists in some row. If you find a row with that hash already recorded, simply retrieve all of the data on that file rather than re-analyzing it. There should ultimately be no duplicated rows in the CSV.

And only once all of the above has been completed should you turn your attention to these final steps.

  6. Web Application: In app.js inside of the readability directory, you should write a web application that responds only to a user visiting the index page (/) of the app; when a user does so, they are presented with an HTML table displaying the readability results of all of the texts that have been analyzed and saved to results.csv.
    • Notice that this is done by having the client visit index.html, which in turn includes scripts.js, which makes a fetch call to /results. Therefore, inside the Express route handler for /results, you should parse the contents of results.csv and return them to the client. Because we exported it in readability.js, you have access to your parseCSV function! We have already written the code that generates the table for you; you just need to supply the proper information. A screenshot of what this might look like after you’ve gotten it properly configured appears below:

[Screenshot: readability results table]
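The /results route’s logic can be sketched as a plain function, which keeps the Express wiring (app.get('/results', …)) to a single line. This sketch assumes parseCSV returns an array of row objects synchronously; adapt it if yours is callback- or promise-based, and note that resultsHandler is a hypothetical name:

```javascript
// Hypothetical handler factory: takes your exported parseCSV and
// returns an Express-style (req, res) handler that sends the parsed
// rows as JSON for scripts.js to fetch.
const resultsHandler = (parseCSV) => (req, res) => {
  res.json(parseCSV('results.csv'));
};
```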

  7. Documentation: Your code should be well-commented so we can understand what it does and why. You should elaborate on design choices you make by adding a section to the README.md file provided.

Hints

Testing

Testing your implementations of countChars, countWords, and countSentences will probably be among the first things you do. You can simply pass in strings; you don’t even need to have opened any files yet. A good place to get started!

Regular Expressions

  • After reading the documentation for tokenize, if you decide you want to use regular expressions to analyze the text for alphabetic characters and numbers, you can use the expressions:
    • [A-Za-z] for letters, and
    • [0-9] for numbers.
  • Note that while the Coleman-Liau index technically should only count letters, it’s quite alright for this lab to treat that as including both letters and numbers, as the staff’s implementations do.

Debugging

  • The easiest way to debug your Node.js code is through Chrome. See the Debugging Guide for instructions.

How to Submit

Step 1 of 2

  1. Make sure all of your latest changes have been committed and pushed to your cs276/lab3-username repository.
  2. If you collaborated with a partner on this lab, be sure you’ve clearly identified who your partner is in the README.md text file, and be sure each of you submits individually; your submissions should (other than the README.md file) be identical.

Step 2 of 2

Fill out this form.

Congratulations! You’ve completed Lab 3.