28 skills found
decalage2 / Oletoolsoletools - python tools to analyze MS OLE2 files (Structured Storage, Compound File Binary Format) and MS Office documents, for malware analysis, forensics and debugging.
ManojKumarPatnaik / Major Project ListA list of practical projects that anyone can solve in any programming language (See solutions). These projects are divided into multiple categories, and each category has its own folder. To get started, simply fork this repo. CONTRIBUTING See ways of contributing to this repo. You can contribute solutions (will be published in this repo) to existing problems, add new projects, or remove existing ones. Make sure you follow all instructions properly. Solutions You can find implementations of these projects in many other languages by other users in this repo. Credits Problems are motivated by the ones shared at: Martyr2’s Mega Project List Rosetta Code Table of Contents Numbers Classic Algorithms Graph Data Structures Text Networking Classes Threading Web Files Databases Graphics and Multimedia Security Numbers Find PI to the Nth Digit - Enter a number and have the program generate PI up to that many decimal places. Keep a limit to how far the program will go. Find e to the Nth Digit - Just like the previous problem, but with e instead of PI. Enter a number and have the program generate e up to that many decimal places. Keep a limit to how far the program will go. Fibonacci Sequence - Enter a number and have the program generate the Fibonacci sequence to that number or to the Nth number. Prime Factorization - Have the user enter a number and find all Prime Factors (if there are any) and display them. Next Prime Number - Have the program find prime numbers until the user chooses to stop asking for the next one. Find Cost of Tile to Cover W x H Floor - Calculate the total cost of the tile it would take to cover a floor plan of width and height, using a cost entered by the user. Mortgage Calculator - Calculate the monthly payments of a fixed-term mortgage over given Nth terms at a given interest rate. Also, figure out how long it will take the user to pay back the loan. For added complexity, add an option for users to select the compounding interval (Monthly, Weekly, Daily, Continually). Change Return Program - The user enters a cost and then the amount of money given. The program will figure out the change and the number of quarters, dimes, nickels, pennies needed for the change. Binary to Decimal and Back Converter - Develop a converter to convert a decimal number to binary or a binary number to its decimal equivalent. Calculator - A simple calculator to do basic operators. Make it a scientific calculator for added complexity. Unit Converter (temp, currency, volume, mass, and more) - Converts various units between one another. The user enters the type of unit being entered, the type of unit they want to convert to, and then the value. The program will then make the conversion. Alarm Clock - A simple clock where it plays a sound after X number of minutes/seconds or at a particular time. Distance Between Two Cities - Calculates the distance between two cities and allows the user to specify a unit of distance. This program may require finding coordinates for the cities like latitude and longitude. Credit Card Validator - Takes in a credit card number from a common credit card vendor (Visa, MasterCard, American Express, Discoverer) and validates it to make sure that it is a valid number (look into how credit cards use a checksum). Tax Calculator - Asks the user to enter a cost and either a country or state tax. It then returns the tax plus the total cost with tax. Factorial Finder - The Factorial of a positive integer, n, is defined as the product of the sequence n, n-1, n-2, ...1, and the factorial of zero, 0, is defined as being 1. Solve this using both loops and recursion. Complex Number Algebra - Show addition, multiplication, negation, and inversion of complex numbers in separate functions. (Subtraction and division operations can be made with pairs of these operations.) Print the results for each operation tested. Happy Numbers - A happy number is defined by the following process. Starting with any positive integer, replace the number by the sum of the squares of its digits, and repeat the process until the number equals 1 (where it will stay), or it loops endlessly in a cycle which does not include 1. Those numbers for which this process ends in 1 are happy numbers, while those that do not end in 1 are unhappy numbers. Display an example of your output here. Find the first 8 happy numbers. Number Names - Show how to spell out a number in English. You can use a preexisting implementation or roll your own, but you should support inputs up to at least one million (or the maximum value of your language's default bounded integer type if that's less). Optional: Support for inputs other than positive integers (like zero, negative integers, and floating-point numbers). Coin Flip Simulation - Write some code that simulates flipping a single coin however many times the user decides. The code should record the outcomes and count the number of tails and heads. Limit Calculator - Ask the user to enter f(x) and the limit value, then return the value of the limit statement Optional: Make the calculator capable of supporting infinite limits. Fast Exponentiation - Ask the user to enter 2 integers a and b and output a^b (i.e. pow(a,b)) in O(LG n) time complexity. Classic Algorithms Collatz Conjecture - Start with a number n > 1. Find the number of steps it takes to reach one using the following process: If n is even, divide it by 2. If n is odd, multiply it by 3 and add 1. Sorting - Implement two types of sorting algorithms: Merge sort and bubble sort. Closest pair problem - The closest pair of points problem or closest pair problem is a problem of computational geometry: given n points in metric space, find a pair of points with the smallest distance between them. Sieve of Eratosthenes - The sieve of Eratosthenes is one of the most efficient ways to find all of the smaller primes (below 10 million or so). Graph Graph from links - Create a program that will create a graph or network from a series of links. Eulerian Path - Create a program that will take as an input a graph and output either an Eulerian path or an Eulerian cycle, or state that it is not possible. An Eulerian path starts at one node and traverses every edge of a graph through every node and finishes at another node. An Eulerian cycle is an eulerian Path that starts and finishes at the same node. Connected Graph - Create a program that takes a graph as an input and outputs whether every node is connected or not. Dijkstra’s Algorithm - Create a program that finds the shortest path through a graph using its edges. Minimum Spanning Tree - Create a program that takes a connected, undirected graph with weights and outputs the minimum spanning tree of the graph i.e., a subgraph that is a tree, contains all the vertices, and the sum of its weights is the least possible. Data Structures Inverted index - An Inverted Index is a data structure used to create full-text search. Given a set of text files, implement a program to create an inverted index. Also, create a user interface to do a search using that inverted index which returns a list of files that contain the query term/terms. The search index can be in memory. Text Fizz Buzz - Write a program that prints the numbers from 1 to 100. But for multiples of three print “Fizz” instead of the number and for the multiples of five print “Buzz”. For numbers which are multiples of both three and five print “FizzBuzz”. Reverse a String - Enter a string and the program will reverse it and print it out. Pig Latin - Pig Latin is a game of alterations played in the English language game. To create the Pig Latin form of an English word the initial consonant sound is transposed to the end of the word and an ay is affixed (Ex.: "banana" would yield anana-bay). Read Wikipedia for more information on rules. Count Vowels - Enter a string and the program counts the number of vowels in the text. For added complexity have it report a sum of each vowel found. Check if Palindrome - Checks if the string entered by the user is a palindrome. That is that it reads the same forwards as backward like “racecar” Count Words in a String - Counts the number of individual words in a string. For added complexity read these strings in from a text file and generate a summary. Text Editor - Notepad-style application that can open, edit, and save text documents. Optional: Add syntax highlighting and other features. RSS Feed Creator - Given a link to RSS/Atom Feed, get all posts and display them. Quote Tracker (market symbols etc) - A program that can go out and check the current value of stocks for a list of symbols entered by the user. The user can set how often the stocks are checked. For CLI, show whether the stock has moved up or down. Optional: If GUI, the program can show green up and red down arrows to show which direction the stock value has moved. Guestbook / Journal - A simple application that allows people to add comments or write journal entries. It can allow comments or not and timestamps for all entries. Could also be made into a shoutbox. Optional: Deploy it on Google App Engine or Heroku or any other PaaS (if possible, of course). Vigenere / Vernam / Ceasar Ciphers - Functions for encrypting and decrypting data messages. Then send them to a friend. Regex Query Tool - A tool that allows the user to enter a text string and then in a separate control enter a regex pattern. It will run the regular expression against the source text and return any matches or flag errors in the regular expression. Networking FTP Program - A file transfer program that can transfer files back and forth from a remote web sever. Bandwidth Monitor - A small utility program that tracks how much data you have uploaded and downloaded from the net during the course of your current online session. See if you can find out what periods of the day you use more and less and generate a report or graph that shows it. Port Scanner - Enter an IP address and a port range where the program will then attempt to find open ports on the given computer by connecting to each of them. On any successful connections mark the port as open. Mail Checker (POP3 / IMAP) - The user enters various account information include web server and IP, protocol type (POP3 or IMAP), and the application will check for email at a given interval. Country from IP Lookup - Enter an IP address and find the country that IP is registered in. Optional: Find the Ip automatically. Whois Search Tool - Enter an IP or host address and have it look it up through whois and return the results to you. Site Checker with Time Scheduling - An application that attempts to connect to a website or server every so many minute or a given time and check if it is up. If it is down, it will notify you by email or by posting a notice on the screen. Classes Product Inventory Project - Create an application that manages an inventory of products. Create a product class that has a price, id, and quantity on hand. Then create an inventory class that keeps track of various products and can sum up the inventory value. Airline / Hotel Reservation System - Create a reservation system that books airline seats or hotel rooms. It charges various rates for particular sections of the plane or hotel. For example, first class is going to cost more than a coach. Hotel rooms have penthouse suites which cost more. Keep track of when rooms will be available and can be scheduled. Company Manager - Create a hierarchy of classes - abstract class Employee and subclasses HourlyEmployee, SalariedEmployee, Manager, and Executive. Everyone's pay is calculated differently, research a bit about it. After you've established an employee hierarchy, create a Company class that allows you to manage the employees. You should be able to hire, fire, and raise employees. Bank Account Manager - Create a class called Account which will be an abstract class for three other classes called CheckingAccount, SavingsAccount, and BusinessAccount. Manage credits and debits from these accounts through an ATM-style program. Patient / Doctor Scheduler - Create a patient class and a doctor class. Have a doctor that can handle multiple patients and set up a scheduling program where a doctor can only handle 16 patients during an 8 hr workday. Recipe Creator and Manager - Create a recipe class with ingredients and put them in a recipe manager program that organizes them into categories like desserts, main courses, or by ingredients like chicken, beef, soups, pies, etc. Image Gallery - Create an image abstract class and then a class that inherits from it for each image type. Put them in a program that displays them in a gallery-style format for viewing. Shape Area and Perimeter Classes - Create an abstract class called Shape and then inherit from it other shapes like diamond, rectangle, circle, triangle, etc. Then have each class override the area and perimeter functionality to handle each shape type. Flower Shop Ordering To Go - Create a flower shop application that deals in flower objects and use those flower objects in a bouquet object which can then be sold. Keep track of the number of objects and when you may need to order more. Family Tree Creator - Create a class called Person which will have a name, when they were born, and when (and if) they died. Allow the user to create these Person classes and put them into a family tree structure. Print out the tree to the screen. Threading Create A Progress Bar for Downloads - Create a progress bar for applications that can keep track of a download in progress. The progress bar will be on a separate thread and will communicate with the main thread using delegates. Bulk Thumbnail Creator - Picture processing can take a bit of time for some transformations. Especially if the image is large. Create an image program that can take hundreds of images and converts them to a specified size in the background thread while you do other things. For added complexity, have one thread handling re-sizing, have another bulk renaming of thumbnails, etc. Web Page Scraper - Create an application that connects to a site and pulls out all links, or images, and saves them to a list. Optional: Organize the indexed content and don’t allow duplicates. Have it put the results into an easily searchable index file. Online White Board - Create an application that allows you to draw pictures, write notes and use various colors to flesh out ideas for projects. Optional: Add a feature to invite friends to collaborate on a whiteboard online. Get Atomic Time from Internet Clock - This program will get the true atomic time from an atomic time clock on the Internet. Use any one of the atomic clocks returned by a simple Google search. Fetch Current Weather - Get the current weather for a given zip/postal code. Optional: Try locating the user automatically. Scheduled Auto Login and Action - Make an application that logs into a given site on a schedule and invokes a certain action and then logs out. This can be useful for checking webmail, posting regular content, or getting info for other applications and saving it to your computer. E-Card Generator - Make a site that allows people to generate their own little e-cards and send them to other people. Do not use Flash. Use a picture library and perhaps insightful mottos or quotes. Content Management System - Create a content management system (CMS) like Joomla, Drupal, PHP Nuke, etc. Start small. Optional: Allow for the addition of modules/addons. Web Board (Forum) - Create a forum for you and your buddies to post, administer and share thoughts and ideas. CAPTCHA Maker - Ever see those images with letters numbers when you signup for a service and then ask you to enter what you see? It keeps web bots from automatically signing up and spamming. Try creating one yourself for online forms. Files Quiz Maker - Make an application that takes various questions from a file, picked randomly, and puts together a quiz for students. Each quiz can be different and then reads a key to grade the quizzes. Sort Excel/CSV File Utility - Reads a file of records, sorts them, and then writes them back to the file. Allow the user to choose various sort style and sorting based on a particular field. Create Zip File Maker - The user enters various files from different directories and the program zips them up into a zip file. Optional: Apply actual compression to the files. Start with Huffman Algorithm. PDF Generator - An application that can read in a text file, HTML file, or some other file and generates a PDF file out of it. Great for a web-based service where the user uploads the file and the program returns a PDF of the file. Optional: Deploy on GAE or Heroku if possible. Mp3 Tagger - Modify and add ID3v1 tags to MP3 files. See if you can also add in the album art into the MP3 file’s header as well as other ID3v2 tags. Code Snippet Manager - Another utility program that allows coders to put in functions, classes, or other tidbits to save for use later. Organized by the type of snippet or language the coder can quickly lookup code. Optional: For extra practice try adding syntax highlighting based on the language. Databases SQL Query Analyzer - A utility application in which a user can enter a query and have it run against a local database and look for ways to make it more efficient. Remote SQL Tool - A utility that can execute queries on remote servers from your local computer across the Internet. It should take in a remote host, user name, and password, run the query and return the results. Report Generator - Create a utility that generates a report based on some tables in a database. Generates sales reports based on the order/order details tables or sums up the day's current database activity. Event Scheduler and Calendar - Make an application that allows the user to enter a date and time of an event, event notes, and then schedule those events on a calendar. The user can then browse the calendar or search the calendar for specific events. Optional: Allow the application to create re-occurrence events that reoccur every day, week, month, year, etc. Budget Tracker - Write an application that keeps track of a household’s budget. The user can add expenses, income, and recurring costs to find out how much they are saving or losing over a period of time. Optional: Allow the user to specify a date range and see the net flow of money in and out of the house budget for that time period. TV Show Tracker - Got a favorite show you don’t want to miss? Don’t have a PVR or want to be able to find the show to then PVR it later? Make an application that can search various online TV Guide sites, locate the shows/times/channels and add them to a database application. The database/website then can send you email reminders that a show is about to start and which channel it will be on. Travel Planner System - Make a system that allows users to put together their own little travel itinerary and keep track of the airline/hotel arrangements, points of interest, budget, and schedule. Graphics and Multimedia Slide Show - Make an application that shows various pictures in a slide show format. Optional: Try adding various effects like fade in/out, star wipe, and window blinds transitions. Stream Video from Online - Try to create your own online streaming video player. Mp3 Player - A simple program for playing your favorite music files. Add features you think are missing from your favorite music player. Watermarking Application - Have some pictures you want copyright protected? Add your own logo or text lightly across the background so that no one can simply steal your graphics off your site. Make a program that will add this watermark to the picture. Optional: Use threading to process multiple images simultaneously. Turtle Graphics - This is a common project where you create a floor of 20 x 20 squares. Using various commands you tell a turtle to draw a line on the floor. You have moved forward, left or right, lift or drop the pen, etc. Do a search online for "Turtle Graphics" for more information. Optional: Allow the program to read in the list of commands from a file. GIF Creator A program that puts together multiple images (PNGs, JPGs, TIFFs) to make a smooth GIF that can be exported. Optional: Make the program convert small video files to GIFs as well. Security Caesar cipher - Implement a Caesar cipher, both encoding, and decoding. The key is an integer from 1 to 25. This cipher rotates the letters of the alphabet (A to Z). The encoding replaces each letter with the 1st to 25th next letter in the alphabet (wrapping Z to A). So key 2 encrypts "HI" to "JK", but key 20 encrypts "HI" to "BC". This simple "monoalphabetic substitution cipher" provides almost no security, because an attacker who has the encoded message can either use frequency analysis to guess the key, or just try all 25 keys.
stanfordnlp / Pdf StructLogical structure analysis for visually structured documents
abhishek305 / PyBot A ChatBot For Answering Python Queries Using NLPPybot can change the way learners try to learn python programming language in a more interactive way. This chatbot will try to solve or provide answer to almost every python related issues or queries that the user is asking for. We are implementing NLP for improving the efficiency of the chatbot. We will include voice feature for more interactivity to the user. By utilizing NLP, developers can organize and structure knowledge to perform tasks such as automatic summarization, translation, named entity recognition, relationship extraction, sentiment analysis, speech recognition, and topic segmentation. NLTK has been called “a wonderful tool for teaching and working in, computational linguistics using Python,” and “an amazing library to play with natural language.The main issue with text data is that it is all in text format (strings). However, the Machine learning algorithms need some sort of numerical feature vector in order to perform the task. So before we start with any NLP project we need to pre-process it to make it ideal for working. Converting the entire text into uppercase or lowercase, so that the algorithm does not treat the same words in different cases as different Tokenization is just the term used to describe the process of converting the normal text strings into a list of tokens i.e words that we actually want. Sentence tokenizer can be used to find the list of sentences and Word tokenizer can be used to find the list of words in strings.Removing Noise i.e everything that isn’t in a standard number or letter.Removing Stop words. Sometimes, some extremely common words which would appear to be of little value in helping select documents matching a user need are excluded from the vocabulary entirely. These words are called stop words.Stemming is the process of reducing inflected (or sometimes derived) words to their stem, base or root form — generally a written word form. Example if we were to stem the following words: “Stems”, “Stemming”, “Stemmed”, “and Stemtization”, the result would be a single word “stem”. A slight variant of stemming is lemmatization. The major difference between these is, that, stemming can often create non-existent words, whereas lemmas are actual words. So, your root stem, meaning the word you end up with, is not something you can just look up in a dictionary, but you can look up a lemma. Examples of Lemmatization are that “run” is a base form for words like “running” or “ran” or that the word “better” and “good” are in the same lemma so they are considered the same.
jiangnanboy / JiaJiaOCRBuilding on the existing general text recognition capabilities, new features such as handwritten OCR, layout detection, and table detection and recognition have been added, covering all scenarios involving printed text, handwritten text, and document structure analysis.在原通用文本识别基础上,新增手写 OCR、版面检测、表格检测与识别功能,覆盖印刷体、手写体、文档结构解析全场景。
microsoft / CompHRDocDatasets and Evaluation Scripts for CompHRDoc
baulbo / DiardFrom document (PDF) or document images to analysis ready semi-structured data.
huridocs / Pdf Table Of Contents ExtractorThis project aims to extract Table of Contents (TOC) information from PDF files using the outputs generated by the pdf-document-layout-analysis service. By leveraging the segmentation and classification capabilities of the underlying analysis tool, this project automates the process of identifying and structuring the document's TOC.
reddyprasade / Machine Learning Interview PreparationPrepare to Technical Skills Here are the essential skills that a Machine Learning Engineer needs, as mentioned Read me files. Within each group are topics that you should be familiar with. Study Tip: Copy and paste this list into a document and save to your computer for easy referral. Computer Science Fundamentals and Programming Topics Data structures: Lists, stacks, queues, strings, hash maps, vectors, matrices, classes & objects, trees, graphs, etc. Algorithms: Recursion, searching, sorting, optimization, dynamic programming, etc. Computability and complexity: P vs. NP, NP-complete problems, big-O notation, approximate algorithms, etc. Computer architecture: Memory, cache, bandwidth, threads & processes, deadlocks, etc. Probability and Statistics Topics Basic probability: Conditional probability, Bayes rule, likelihood, independence, etc. Probabilistic models: Bayes Nets, Markov Decision Processes, Hidden Markov Models, etc. Statistical measures: Mean, median, mode, variance, population parameters vs. sample statistics etc. Proximity and error metrics: Cosine similarity, mean-squared error, Manhattan and Euclidean distance, log-loss, etc. Distributions and random sampling: Uniform, normal, binomial, Poisson, etc. Analysis methods: ANOVA, hypothesis testing, factor analysis, etc. Data Modeling and Evaluation Topics Data preprocessing: Munging/wrangling, transforming, aggregating, etc. Pattern recognition: Correlations, clusters, trends, outliers & anomalies, etc. Dimensionality reduction: Eigenvectors, Principal Component Analysis, etc. Prediction: Classification, regression, sequence prediction, etc.; suitable error/accuracy metrics. Evaluation: Training-testing split, sequential vs. randomized cross-validation, etc. Applying Machine Learning Algorithms and Libraries Topics Models: Parametric vs. nonparametric, decision tree, nearest neighbor, neural net, support vector machine, ensemble of multiple models, etc. Learning procedure: Linear regression, gradient descent, genetic algorithms, bagging, boosting, and other model-specific methods; regularization, hyperparameter tuning, etc. Tradeoffs and gotchas: Relative advantages and disadvantages, bias and variance, overfitting and underfitting, vanishing/exploding gradients, missing data, data leakage, etc. Software Engineering and System Design Topics Software interface: Library calls, REST APIs, data collection endpoints, database queries, etc. User interface: Capturing user inputs & application events, displaying results & visualization, etc. Scalability: Map-reduce, distributed processing, etc. Deployment: Cloud hosting, containers & instances, microservices, etc. Move on to the final lesson of this course to find lots of sample practice questions for each topic!
Aniket-a14 / SRAAI-powered Software Requirements Analysis (SRA) system that evaluates requirement quality, detects ambiguities, and suggests structured improvements for more reliable and complete SRS documents.
yufanchen96 / GraphDocGraph-based Document Structure Analysis
qyhou / Curated Document Layout AnalysisA curated list of resources on Document Layout Analysis
jashwanth / Remote Code PublisherRemote-Code-Publisher Purpose: A Code Repository is a Program responsible for managing source code resources, e.g., files and documents. A fully developed Repository will support file persistance, managment of versions, and the acquisition and publication of source and document files. A Remote Repository adds the capability to access the Repository's functionality over a communication channel, e.g., interprocess communication, inter-network communication, and communication across the internet. In this project we will focus on the publication functionality of a Remote Repository. We will develop a remote code publisher, local client, and communication channel that supports client access to the publisher from any internet enabled processor. The communication channel will use sockets and support an HTTP like message structure. The channel will support: HTTP style request/response transactions One-way communication, allowing asynchronous messaging between any two endpoints that are capable of listening for connection requests and connecting to a remote listener. Transmission of byte streams that are set up with one or more negotiation messages followed by transmission of a stream of bytes of specified stream size2. The Remote Code Publisher will: Support publishing web pages that are small wrappers around C++ source code files, just as we did in Project #3. Accept source code text files, sent from a local client. Support building dependency relationships between code files saved in specific repository folders, based on the functionality you provided in Project #2 and used in Project #3. Support HTML file creation for all the files in a specified repository folder1, including linking information that displays dependency relationships, and supports and navigation based on dependency relationships. Delete stored files, as requested by a local client. Clients of the Remote Code Publisher will provide a Graphical User Interface (GUI) with means to: Upload one or more source code text files to the Remote Publisher, specifying a category with which those files are associated1. Display file categories, based on the directory structure supported by the Repository. Display all the files in any category. Display all of the files in any category that have no parents. Display the web page for any file in that file list by clicking within a GUI control. This implies that the client will download the appropriate webpages, scripts, and style sheets and display, by starting a browser with a file cited on the command line2. On starting, will download style sheet and JavaScript files from the Repository. Note that your client does not need to supply the functionality to display web pages. It simply starts a browser to do that. Browsers will accept a file name, which probably includes a relative path to display a web page from the local directory. You could also start IIS web server and provide an appropriate URL to the browser on startup. Either approach is acceptable. If you use IIS, you won't have to download files, but you are obligated to show that you can do that. Requirements: Your Remote Repository: (2) Shall use Visual Studio 2015 and its C++ Windows console projects, as provided in the ECS computer labs. You must also use Windows Presentation Foundation (WPF) to provide a required client Graphical User Interface (GUI). (1) Shall use the C++ standard library's streams for all console I/O and new and delete for all heap-based memory management. (3) Shall provide a Repository program that provides functionality to publish, as linked web pages, the contents of a set of C++ source code files. (4) Shall, for the publishing process, satisfy the requirements of CodePublisher developed in Project #3. (4) Shall provide a Client program that can upload files3, and view Repository contents, as described in the Purpose section, above. (3) Shall provide a message-passing communication system, based on Sockets, used to access the Repository's functionality from another process or machine. (2) The communication system shall provide support for passing HTTP style messages using either synchronous request/response or asynchronous one-way messaging. (1) The communication system shall also support sending and receiving streams of bytes6. Streams will be established with an initial exchange of messages. (5) Shall include an automated unit test suite that demonstrates you meet all the requirements of this project4 including the transmission of files. (5 point bonus) Shall optionally use a lazy download strategy, that, when presented with a name of a source code web page, will download that file and all the files it links to. This allows you to demonstrate your project using local webpages instead of downloading the entire contents of the Code Publisher for demonstration. (5 point bonus) Shall optionally have the publisher accept a path, on the commandline, to a virtual directory on the server. Then support browsing directly from the server by supplying a url to that path when you start a browser. This works only if you setup IIS on your machine and make the path a virtual directory. The TAs will do that on the grading machines. Categories are the names of folders in which the Repository stores its source code and web files. You may define Categories in any way that seems sensible. For example, they could simply be the namespace(s) for the uploaded files, or a Client supplied name. You will find a demonstration of how to programmatically start an application here. The stream capablity is intended to send files, which could be either text or binary format. Stream size will be the file size. Transmitting and receiving byte streams will be used to send and receive files in either text or binary format. This is in addition to the construction tests you include as part of every package you submit. Project 3 statement: Purpose: A Code Repository is a Program responsible for managing source code resources, e.g., files and documents. A fully developed Repository will support file persistance, managment of versions, and the acquisition and publication of source and document files. This project focuses on just the publishing functionality of a repository. In this project we will develop means to display source code files as web pages with embedded child links. Each link refers to a code file that the displayed code file depends on. There are several things you need to know in order to complete this project: Each file to be published is a C++ source file. Our publisher will generate, for each of these, an HTML file, with most of the contents drawn from the code file. The pages we will generate have only static content, with the exception of some embedded JavaScript and styling, so we won't need a web server. We will need to preserve the white space structure of the displayed source code. That can be done embedding all the code between the tags <pre> and </pre> or by using the CSS white-space property with value "pre" to style a div with all the code in its contents. Any markup characters in the code text will have to be escaped, e.g., replace < with < and > with >. File dependencies are displayed in the web page with embedded links, which are implemented in HTML5 with anchor elements: <a href="[url of referenced html page]">source code file name</a> For each class, we will, optionally, implement outlining, similar to the visual studio outlining feature. To do that we will use the CSS display property, with values: normal or none, to control whether the contents of a div are visible or not. The Code Publisher will be embedded in a mock Repository with almost no functionality except to support publishing of source code as web pages. Specifically you are not expected to provide support for: package checkin or checkout versioning You are expected to support: Dependency analysis of the C++ source code files you will publish, using the analyzer you developed in Project #2. The ability to specify, on the command line, files to be published, by providing command line arguments for path and file patterns. The ability to display any file cited on the command line, by starting a process that runs a browser of your choice, naming the specification of the file you want to display. Note that the CodePublisher project creates a code generator. Its inputs are C++ code and its outputs are HTML code. Requirements: Your CodePublisher Project: (1) Shall use Visual Studio 2015 and its C++ Windows console projects, as provided in the ECS computer labs. (2) Shall use the C++ standard library's streams for all console I/O and new and delete for all heap-based memory management1. (4) Shall provide a Publisher program that provides for creation of web pages each of which captures the content of a single C++ source code file, e.g., *.h or *.cpp. (10) Shall, optionally2 provide the facility to expand or collapse class bodies, methods, and global functions using JavaScript and CSS properties. (2) Shall provide a CSS style sheet that the Publisher uses to style its generated pages and (if you are implementing the previous optional requirement) a JavaScript file that provides functionality to hide and unhide sections of code for outlining, using mouse clicks. (2) Shall embed in each web page's <head> section links to the style sheet and JavaScript file. (4) Shall embedd HTML5 links to dependent files with a label, at the top of the web page. Publisher shall use functionality from your Project #2 to discover package dependencies within the published set of source files. (2) Shall develop command line processing to define the files to publish by specifying path and file patterns. (3) Shall demonstrate the CodePublisher functionality by publishing all the important packages in your Project #3. (5) Shall include an automated unit test suite that demonstrates you meet all the requirements of this project2. That means that you are not allowed to use any of the C language I/0, e.g., printf, scanf, etc, nor the C memory management, e.g., calloc, malloc, or free. This optional requirement will take a significant amount of work to complete successfully. You should get everything else working before attempting this additional effort. This is in addition to the construction tests you include as part of every package you submit.
jlira5418 / Web Scraping Challenge## Step 1 - Scraping Complete your initial scraping using Jupyter Notebook, BeautifulSoup, Pandas, and Requests/Splinter. * Create a Jupyter Notebook file called `mission_to_mars.ipynb` and use this to complete all of your scraping and analysis tasks. The following outlines what you need to scrape. ### NASA Mars News * Scrape the [Mars News Site](https://redplanetscience.com/) and collect the latest News Title and Paragraph Text. Assign the text to variables that you can reference later. ```python # Example: news_title = "NASA's Next Mars Mission to Investigate Interior of Red Planet" news_p = "Preparation of NASA's next spacecraft to Mars, InSight, has ramped up this summer, on course for launch next May from Vandenberg Air Force Base in central California -- the first interplanetary launch in history from America's West Coast." ``` ### JPL Mars Space Images - Featured Image * Visit the url for the Featured Space Image site [here](https://spaceimages-mars.com). * Use splinter to navigate the site and find the image url for the current Featured Mars Image and assign the url string to a variable called `featured_image_url`. * Make sure to find the image url to the full size `.jpg` image. * Make sure to save a complete url string for this image. ```python # Example: featured_image_url = 'https://spaceimages-mars.com/image/featured/mars2.jpg' ``` ### Mars Facts * Visit the Mars Facts webpage [here](https://galaxyfacts-mars.com) and use Pandas to scrape the table containing facts about the planet including Diameter, Mass, etc. * Use Pandas to convert the data to a HTML table string. ### Mars Hemispheres * Visit the astrogeology site [here](https://marshemispheres.com/) to obtain high resolution images for each of Mar's hemispheres. * You will need to click each of the links to the hemispheres in order to find the image url to the full resolution image. * Save both the image url string for the full resolution hemisphere image, and the Hemisphere title containing the hemisphere name. Use a Python dictionary to store the data using the keys `img_url` and `title`. * Append the dictionary with the image url string and the hemisphere title to a list. This list will contain one dictionary for each hemisphere. ```python # Example: hemisphere_image_urls = [ {"title": "Valles Marineris Hemisphere", "img_url": "..."}, {"title": "Cerberus Hemisphere", "img_url": "..."}, {"title": "Schiaparelli Hemisphere", "img_url": "..."}, {"title": "Syrtis Major Hemisphere", "img_url": "..."}, ] ``` - - - ## Step 2 - MongoDB and Flask Application Use MongoDB with Flask templating to create a new HTML page that displays all of the information that was scraped from the URLs above. * Start by converting your Jupyter notebook into a Python script called `scrape_mars.py` with a function called `scrape` that will execute all of your scraping code from above and return one Python dictionary containing all of the scraped data. * Next, create a route called `/scrape` that will import your `scrape_mars.py` script and call your `scrape` function. * Store the return value in Mongo as a Python dictionary. * Create a root route `/` that will query your Mongo database and pass the mars data into an HTML template to display the data. * Create a template HTML file called `index.html` that will take the mars data dictionary and display all of the data in the appropriate HTML elements. Use the following as a guide for what the final product should look like, but feel free to create your own design.  - - - ## Step 3 - Submission To submit your work to BootCampSpot, create a new GitHub repository and upload the following: 1. The Jupyter Notebook containing the scraping code used. 2. Screenshots of your final application. 3. Submit the link to your new repository to BootCampSpot. 4. Ensure your repository has regular commits and a thorough README.md file ## Hints * Use Splinter to navigate the sites when needed and BeautifulSoup to help find and parse out the necessary data. * Use Pymongo for CRUD applications for your database. For this homework, you can simply overwrite the existing document each time the `/scrape` url is visited and new data is obtained. * Use Bootstrap to structure your HTML template.
Tanguy9862 / Medical OCR Data ExtractionAn object-oriented Python script for extracting structured data from medical documents. Successfully processed 2,000+ files, combining OCR technology to output clean datasets for analytics. Includes collaboration with medical professionals and statistical analysis via RMarkdown.
pvaviloff / Php GuidelinesThis document outlines strategies for scaling development teams and structuring projects with a focus on writing clear code and documentation. It emphasizes the importance of thorough expert analysis and maintaining communication to ensure smooth project development.
staniszewskimarek / Sot VisualizerClaude skill for document analysis using Structure of Thought method
shaadclt / PDF Data Extraction PyMuPDF4LLMThis repository demonstrates how to extract text, images, and structured content from PDF documents using pymupdf4llm in Google Colab. It also includes data preparation for LlamaIndex for further document analysis and information extraction.
dleerdefi / Epstein Network DataMaking Epstein network documents searchable through data science. Extracts and structures data from the 50th Birthday Book, Black Book, and Flight Logs for Neo4j graph analysis and natural language querying
GantaVenkataKousik / Dsa PatternsThis repository is a curated collection of key Data Structures and Algorithms (DSA) problems, emphasizing the identification and analysis of recurring patterns. As we solve various DSA challenges, we will document significant patterns and their sub-patterns to aid in efficient problem-solving and revision.