Lab 3: Analyzer
Released: Wednesday, September 25
Due: Wednesday, October 2 at 11:59am ET
Objectives
- Write a command-line Node.js program.
- Read, parse, and analyze text files.
- Store and retrieve information with CSVs.
- Use file hashes to optimize your program’s performance.
- Implement a simple server that sends data that can be processed client-side.
Overview
In this lab, you will analyze texts (a sample set is provided from Project Gutenberg, a repository of over 50,000 books, poems, and plays, many of which are in the public domain) by first determining how many characters, words, and sentences appear in the file to be analyzed. Then, you will feed that data into two commonly used formulas for determining a text’s reading level (i.e., what grade level a student should have reached in order to read the text without difficulty): the Coleman-Liau index and the Automated Readability Index.
Your program should not only output information about the text to its user (as with `console.log`), but also store that information in a CSV file (we’ll soon tackle more robust storage in a SQL database, beginning with Lab 4). In addition, because such textual analysis can take quite a long time on large files, you will implement file hashing to check whether you’ve previously analyzed a file; if so, you will output the previously stored data instead of re-analyzing the file.
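For reference, both indices have standard published definitions. The staff's distribution code already implements them, so treat the sketch below purely as a reference for what your character, word, and sentence counts feed into:

```javascript
// Standard definitions of the two readability formulas used in this lab.
// (The distribution code already provides these; this is just a reference.)

// Coleman-Liau index: L is letters per 100 words, S is sentences per 100 words.
const colemanLiau = (chars, words, sentences) => {
  const L = (chars / words) * 100;
  const S = (sentences / words) * 100;
  return 0.0588 * L - 0.296 * S - 15.8;
};

// Automated Readability Index, from characters per word and words per sentence.
const automatedReadability = (chars, words, sentences) =>
  4.71 * (chars / words) + 0.5 * (words / sentences) - 21.43;

// Sanity check against the magnacarta.txt numbers shown later in this handout.
console.log(colemanLiau(78901, 17815, 504).toFixed(3));          // 9.405
console.log(automatedReadability(78901, 17815, 504).toFixed(3)); // 17.104
```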
Collaboration
Per the syllabus, it is permissible to collaborate with one other classmate on this lab. If you do, both partners must still submit their own repositories, and both partners must turn in identical work. You should use the `README.md` file to note that you worked with a partner and state clearly who your partner was (name and GitHub username).
Recommended Reading
- For general background on Node.js, see the course’s wiki page or the official online documentation.
- For examples of using the `tokenize-text` package, see its online documentation.
- For examples of using the `tokenize-english` package, see its online documentation.
- For examples of using the `md5-file` package, see its online documentation.
Getting Ready
- Click here to go to the GitHub Classroom page for starting the assignment.
- Click the green “Accept this assignment” button. This will create a GitHub repository for your project.
- Click on the link that follows “Your assignment has been created here”, which will direct you to the GitHub repository page for your project. It may take a few seconds for GitHub to finish creating your repository. You should then be able to clone the cs276/lab3-username repository on your machine (CS276 is the course identifier for this course at Yale, by the way, if you’re curious where that organization name comes from!). Always push your code to your cs276/lab3-username repository.
- If not already installed on your computer, download and install the latest LTS version of Node.js, which as of this writing is version 10.16.3. When you install Node, it comes bundled with `npm`, the Node Package Manager, which will allow you to include packages in your server application.
- Follow the instructions atop `README.md`.
Requirements
- Files: Your readability analysis (command-line) application should be contained within a single file, `readability.js`. Your web front-end application should be contained within `app.js`.
- ES6: Your code should use only the ES6 flavor of JavaScript, and in particular:
  - You should use `let` or `const` instead of `var` to declare variables.
  - You are welcome to use `require()` in your Node.js code, as we’ve shown in class, instead of or alongside ES6’s `import` directive.
  - You should use arrow functions (`=>`) whenever possible.
- Additional Packages: You may use additional Node.js packages, installed via `npm` as needed. Be sure to use `--save` when installing, so that your `package.json` updates its dependencies list!
Steps
Below is a recommended ordering; you need not follow it, but things may flow a bit more easily if you take this approach.
- Command-Line Arguments: `readability.js` should open and analyze the file provided as a command-line argument, which will be stored in `process.argv[2]`. You’ll notice that the distribution code already checks to make sure such an argument is present. You can open the file using the `readFile` method from the `fs` package.
- Tokenizing: Complete the implementations of `countChars`, `countWords`, and `countSentences`, per their descriptions, so that their outputs can be fed into the provided Coleman-Liau and Automated Readability Index formulas that the staff wrote. You are encouraged to refer to the documentation, given above, for `tokenize-text` and `tokenize-english`. The latter, in particular, will be helpful for tokenizing sentences.
  - Note: The documentation for `tokenize-english` is not completely correct. The behavior is actually closer to that of `tokenize-text` for words and characters. Thus, the example it provides of

    ```javascript
    var tokens = tokenizeEnglish.sentences("On Jan. 20, former Sen. Barack Obama became the 44th President of the U.S. Millions attended the Inauguration.");
    ```

    should actually be

    ```javascript
    const tokens = tokenizeEnglish.sentences()("On Jan. 20, former Sen. Barack Obama became the 44th President of the U.S. Millions attended the Inauguration.");
    ```

  - Note also that `tokenize-english`, when tokenizing sentences, will also treat a newline (`\n`) as the end of a sentence, so you may want to get rid of those by replacing them with single spaces. Given a chunk of text `text`, the below would accomplish this:

    ```javascript
    const nonewlines = text.split(/\n/).join(' ');
    ```

  - Don’t worry if your character/word/sentence counts differ from the staff’s by just one or two percent or so; it’s likely a reflection of our deciding to count characters slightly differently!
  - After testing your tokenizing, you should call those functions near the bottom of the file (inside of the `if (require.main === module)` part), and test that they behave as expected, as by `console.log`ging their output.
- Logging Output: Your application should log a summary of the analyzed text to the user before exiting. Complete the implementation of `printResults` to have that effect. Below is one potential format for logging this data, but yours may vary. You should log the scores to at least three decimal places, though you may decide to include more. You may also consider now factoring out your individual test calls to the `countChars`, `countWords`, and `countSentences` functions and instead start working on the implementation of `calculateResults` here.

  ```
  REPORT for ./texts/magnacarta.txt
  78901 characters
  17815 words
  504 sentences
  ------------------
  Coleman-Liau Score: 9.405
  Automated Readability Index: 17.104
  ```
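One possible shape for `printResults` is sketched below; the layout is only a suggestion, and building the report string first (via a hypothetical `formatResults` helper, not part of the distribution code) simply makes the function easy to test:

```javascript
// One way to implement printResults, producing the sample report format.
// (formatResults is a hypothetical helper; separating formatting from
// logging makes the output easy to test.)
const formatResults = (filename, chars, words, sentences, cl, ari) =>
  [
    `REPORT for ${filename}`,
    `${chars} characters`,
    `${words} words`,
    `${sentences} sentences`,
    '------------------',
    `Coleman-Liau Score: ${cl.toFixed(3)}`,
    `Automated Readability Index: ${ari.toFixed(3)}`,
  ].join('\n');

const printResults = (...args) => console.log(formatResults(...args));

printResults('./texts/magnacarta.txt', 78901, 17815, 504, 9.405, 17.104);
```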
We strongly recommend you complete all of the above before proceeding to the next step.
- CSV Storage: In addition to logging the data for the user, you should store the results of your analysis in a comma-separated values file (a `results.csv` file, a template of which we’ve provided for you). Complete the implementation of `saveResults` so that it does exactly this, and you can then call `saveResults` as part of `calculateResults`. Most spreadsheet programs can handle viewing these files, as can most text editors. We have provided you with all of the columns you should need to store in our template, but you may add others at your discretion. Be sure to consult documentation for `fast-csv` and our in-class examples for tips on handling CSV files. At this time, you might as well also hash the file itself, obtaining an MD5 hash (check the `md5-file` package documentation to remind yourself how!), and make that part of what you’re writing to the CSV in that row. An example of what your CSV might look like after a few entries have been recorded is below:

  ```
  filename,hash,characters,words,sentences,cl,ari
  texts/magnacarta.txt,8ed7a3ebcf8dcd595a840ae4356b33f3,78901,17815,504,9.405,17.104
  texts/constitution.txt,d85098a7af117a696cbdc8d227ad86c4,33766,7109,193,11.325,19.358
  texts/wuthering.txt,986fd38bc2f3b9fd5dabc57dd559f084,513419,122541,5197,7.581,10.093
  ```
We again strongly recommend you complete the above before proceeding to the next step.
- File Hashing: You’ve perhaps noticed that as the text files get larger, the time the program spends analyzing them also increases sharply, and not necessarily linearly. This isn’t a problem in itself (after all, bigger files will take more time), but you probably want to avoid analyzing the same file twice if you can. Leverage the MD5 hash that you calculated and stored: before analyzing a file, check whether you’ve already analyzed and logged it. This will require completing the implementation of `parseCSV`, as you’ll need to dig through the rows of the CSV file to find whether that hash already exists in some row. If you find a row with that hash in the CSV already, simply retrieve all of the data on that file rather than re-analyzing it. There should ultimately be no duplicated rows in the CSV.
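The lookup itself can then be a simple scan over the rows `parseCSV` produces. The sketch below uses hard-coded rows and a hypothetical `findByHash` helper purely for illustration; your real rows should come from `fast-csv`:

```javascript
// Illustrative hash lookup over already-parsed CSV rows; in the lab, the
// rows would come from parseCSV (built with fast-csv), not a literal array.
const findByHash = (rows, hash) => rows.find((row) => row.hash === hash) || null;

// Rows in the shape of results.csv (values taken from this handout's example).
const rows = [
  { filename: 'texts/magnacarta.txt', hash: '8ed7a3ebcf8dcd595a840ae4356b33f3', cl: 9.405 },
  { filename: 'texts/constitution.txt', hash: 'd85098a7af117a696cbdc8d227ad86c4', cl: 11.325 },
];

// A hit means the file was analyzed before: reuse the stored data, skip re-analysis.
console.log(findByHash(rows, '8ed7a3ebcf8dcd595a840ae4356b33f3').filename);
```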
And only once all of the above has been completed should you turn your attention to these final steps.
- Web Application: In `app.js`, inside of the `readability` directory, you should write a web application that responds only to a user visiting the index page (`/`) of the app; when a user does so, they are presented with an HTML table displaying the readability results of all of the texts that have been analyzed and saved to `results.csv`.
  - Notice that this is done by having the client visit `index.html`, which in turn includes `scripts.js`, which includes a `fetch` call to `/results`. Therefore, inside of where Express handles the routing for `/results`, you should parse the contents of `results.csv` and return them to the client. Because we exported it in `readability.js`, you have access to your `parseCSV` function! We have already written the code that generates the table for you; you just need to supply the proper information. A screenshot of what this might look like after you’ve gotten it properly configured appears below.
- Documentation: Your code should be well-commented so we can understand what it does and why. You should elaborate on design choices you make by adding a section to the `README.md` file provided.
Hints
Testing
Testing your implementations of `countChars`, `countWords`, and `countSentences` will probably be among the first things you do. You can simply pass them strings; you don’t even need to have opened the file(s) yet. A good place to get started!
Regular Expressions
- After reading the documentation for `tokenize`, if you decide you want to use regular expressions to analyze the text for alphabetic characters and numbers, you can use the expressions `[A-Za-z]` for letters and `[0-9]` for numbers.
- Note that while the Coleman-Liau index technically should only count letters, it’s quite alright for this lab to treat that as including both letters and numbers, as the staff’s implementations do.
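Concretely, combining those two character classes into one global match gives a quick letters-plus-digits count (the function name here is just illustrative):

```javascript
// Count letters and digits together, per the note that this lab's
// Coleman-Liau "letter" count may include both.
const letterDigitCount = (text) => (text.match(/[A-Za-z0-9]/g) || []).length;

console.log(letterDigitCount('Magna Carta, 1215')); // 14
```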
Debugging
- The easiest way to debug your Node.js code is through Chrome. See the Debugging Guide for instructions.
How to Submit
Step 1 of 2
- Make sure all of your latest changes have been committed and pushed to your cs276/lab3-username repository.
- If you collaborated with a partner on this lab, be sure you’ve clearly identified who your partner is in the `README.md` text file, and be sure each of you submits individually; your submissions should be identical (other than the `README.md` file).
Step 2 of 2
Fill out this form.
Congratulations! You’ve completed Lab 3.