33 skills found · Page 1 of 2
chonkie-inc / Chonkie · 🦛 CHONK docs with Chonkie ✨ — The lightweight ingestion library for fast, efficient and robust RAG pipelines
chonkie-inc / Chonkiejs · 🦛 CHONK your texts with Chonkie ✨ Type-friendly, light-weight, fast and super-simple chunking library
nlfiedler / Fastcdc Rs · FastCDC implementation in Rust
jotfs / Fastcdc Go · Go implementation of the FastCDC content-defined chunking algorithm
iscc / Fastcdc Py · FastCDC implementation in Python. https://pypi.org/project/fastcdc/
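For the Python package, usage follows the pattern shown in its README; the sketch below is written from memory of that README, so treat the exact argument values and chunk attribute names as assumptions.

    # Content-defined chunking with the fastcdc package (pip install fastcdc).
    # Boundaries are derived from the content itself, so inserting bytes near
    # the start of a file only disturbs the chunks it actually touches.
    from fastcdc import fastcdc

    # min / avg / max chunk sizes in bytes (values are illustrative)
    for chunk in fastcdc("data.bin", 16384, 32768, 65536):
        print(chunk.offset, chunk.length)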
GiovanniPasq / Chunky · Convert and validate your Markdown, then choose the best chunking strategy for your RAG pipeline.
Michael-F-Bryan / Noise Gate · A simple Noise Gate algorithm for splitting an audio stream into chunks based on volume/silence
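The volume/silence idea is simple enough to sketch. Below is a minimal Python illustration of frame-energy gating, shown only to convey the principle; it is not a transcription of this Rust crate.

    # Split audio into chunks wherever the RMS energy stays below a threshold.
    # Frame size and threshold are illustrative values.
    import numpy as np

    def split_on_silence(samples, frame=512, threshold=0.01):
        samples = np.asarray(samples, dtype=float)
        chunks, current = [], []
        for start in range(0, len(samples), frame):
            window = samples[start:start + frame]
            if np.sqrt(np.mean(window ** 2)) > threshold:
                current.extend(window)            # gate open: keep these samples
            elif current:
                chunks.append(np.array(current))  # gate closed: flush the chunk
                current = []
        if current:
            chunks.append(np.array(current))
        return chunks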
tigerwill90 / Fastcdc · This package implements the FastCDC content-defined chunking algorithm
leetal / AWFileHash · An Objective-C implementation of the CRC32b checksum and the MD5, SHA1 and SHA512 hash algorithms. It hashes in chunks and consumes almost no memory while running, making it suitable for both OS X and iOS.
etherceo1x1 / Codes · BUILD YOUR OWN BLOCKCHAIN: A PYTHON TUTORIAL

Download the full Jupyter/iPython notebook from Github here.

Build Your Own Blockchain – The Basics

This tutorial will walk you through the basics of how to build a blockchain from scratch. Focusing on the details of a concrete example will provide a deeper understanding of the strengths and limitations of blockchains. For a higher-level overview, I'd recommend this excellent article from BitsOnBlocks.

Transactions, Validation, and Updating System State

At its core, a blockchain is a distributed database with a set of rules for verifying new additions to the database. We'll start off by tracking the accounts of two imaginary people: Alice and Bob, who will trade virtual money with each other.

We'll need to create a transaction pool of incoming transactions, validate those transactions, and make them into a block. We'll be using a hash function to create a 'fingerprint' for each of our transactions - this hash function links each of our blocks to each other. To make this easier to use, we'll define a helper function to wrap the Python hash function that we're using.

In [1]:

    import hashlib, json, sys

    def hashMe(msg=""):
        # For convenience, this is a helper function that wraps our hashing algorithm
        if type(msg) != str:
            msg = json.dumps(msg, sort_keys=True)  # If we don't sort keys, we can't guarantee repeatability!

        if sys.version_info.major == 2:
            return unicode(hashlib.sha256(msg).hexdigest(), 'utf-8')
        else:
            return hashlib.sha256(str(msg).encode('utf-8')).hexdigest()

Next, we want to create a function to generate exchanges between Alice and Bob. We'll indicate withdrawals with negative numbers, and deposits with positive numbers. We'll construct our transactions to always be between the two users of our system, and make sure that the deposit is the same magnitude as the withdrawal - i.e. that we're neither creating nor destroying money.

In [2]:

    import random
    random.seed(0)

    def makeTransaction(maxValue=3):
        # This will create valid transactions in the range of (1, maxValue)
        sign = int(random.getrandbits(1)) * 2 - 1  # This will randomly choose -1 or 1
        amount = random.randint(1, maxValue)
        alicePays = sign * amount
        bobPays = -1 * alicePays
        # By construction, this will always return transactions that respect
        # the conservation of tokens. However, note that we have not done
        # anything to check whether these overdraft an account.
        return {u'Alice': alicePays, u'Bob': bobPays}

Now let's create a large set of transactions, then chunk them into blocks.

In [3]:

    txnBuffer = [makeTransaction() for i in range(30)]

Next step: making our very own blocks! We'll take the first k transactions from the transaction buffer, and turn them into a block. Before we do that, we need to define a method for checking the validity of the transactions we've pulled into the block.

For bitcoin, the validation function checks that the input values are valid unspent transaction outputs (UTXOs), that the outputs of the transaction are no greater than the input, and that the keys used for the signatures are valid. In Ethereum, the validation function checks that the smart contracts were faithfully executed and respect gas limits. No worries, though - we don't have to build a system that complicated. We'll define our own, very simple set of rules which make sense for a basic token system:

- The sum of deposits and withdrawals must be 0 (tokens are neither created nor destroyed)
- A user's account must have sufficient funds to cover any withdrawals

If either of these conditions is violated, we'll reject the transaction.

In [4]:

    def updateState(txn, state):
        # Inputs: txn, state: dictionaries keyed with account names, holding
        #     numeric values for transfer amount (txn) or account balance (state)
        # Returns: Updated state, with additional users added to state if necessary
        # NOTE: This does not validate the transaction - it just updates the state!

        # If the transaction is valid, then update the state
        state = state.copy()  # As dictionaries are mutable, let's avoid any
                              # confusion by creating a working copy of the data.
        for key in txn:
            if key in state.keys():
                state[key] += txn[key]
            else:
                state[key] = txn[key]
        return state

In [5]:

    def isValidTxn(txn, state):
        # Assume that the transaction is a dictionary keyed by account names

        # Check that the sum of the deposits and withdrawals is 0
        if sum(txn.values()) != 0:
            return False

        # Check that the transaction does not cause an overdraft
        for key in txn.keys():
            if key in state.keys():
                acctBalance = state[key]
            else:
                acctBalance = 0
            if (acctBalance + txn[key]) < 0:
                return False

        return True

Here is a set of sample transactions, some of which are fraudulent - but we can now check their validity!

In [6]:

    state = {u'Alice': 5, u'Bob': 5}

    print(isValidTxn({u'Alice': -3, u'Bob': 3}, state))             # Basic transaction - this works great!
    print(isValidTxn({u'Alice': -4, u'Bob': 3}, state))             # But we can't create or destroy tokens!
    print(isValidTxn({u'Alice': -6, u'Bob': 6}, state))             # We also can't overdraft our account.
    print(isValidTxn({u'Alice': -4, u'Bob': 2, 'Lisa': 2}, state))  # Creating new users is valid
    print(isValidTxn({u'Alice': -4, u'Bob': 3, 'Lisa': 2}, state))  # But the same rules still apply!

    True
    False
    False
    True
    False

Each block contains a batch of transactions, a reference to the hash of the previous block (if the block number is greater than 1), and a hash of its contents and the header.

Building the Blockchain: From Transactions to Blocks

We're ready to start making our blockchain! Right now, there's nothing on the blockchain, but we can get things started by defining the 'genesis block' (the first block in the system). Because the genesis block isn't linked to any prior block, it gets treated a bit differently, and we can arbitrarily set the system state. In our case, we'll create accounts for our two users (Alice and Bob) and give them 50 coins each.

In [7]:

    state = {u'Alice': 50, u'Bob': 50}  # Define the initial state
    genesisBlockTxns = [state]
    genesisBlockContents = {u'blockNumber': 0, u'parentHash': None,
                            u'txnCount': 1, u'txns': genesisBlockTxns}
    genesisHash = hashMe(genesisBlockContents)
    genesisBlock = {u'hash': genesisHash, u'contents': genesisBlockContents}
    genesisBlockStr = json.dumps(genesisBlock, sort_keys=True)

Great! This becomes the first element from which everything else will be linked.

In [8]:

    chain = [genesisBlock]

For each block, we want to collect a set of transactions, create a header, hash it, and add it to the chain.

In [9]:

    def makeBlock(txns, chain):
        parentBlock = chain[-1]
        parentHash = parentBlock[u'hash']
        blockNumber = parentBlock[u'contents'][u'blockNumber'] + 1
        txnCount = len(txns)
        blockContents = {u'blockNumber': blockNumber, u'parentHash': parentHash,
                         u'txnCount': txnCount, u'txns': txns}
        blockHash = hashMe(blockContents)
        block = {u'hash': blockHash, u'contents': blockContents}
        return block

Let's use this to process our transaction buffer into a set of blocks:

In [10]:

    blockSizeLimit = 5  # Arbitrary number of transactions per block -
                        # this is chosen by the block miner, and can vary between blocks!

    while len(txnBuffer) > 0:
        bufferStartSize = len(txnBuffer)

        ## Gather a set of valid transactions for inclusion
        txnList = []
        while (len(txnBuffer) > 0) and (len(txnList) < blockSizeLimit):
            newTxn = txnBuffer.pop()
            validTxn = isValidTxn(newTxn, state)  # This will return False if the txn is invalid

            if validTxn:  # If we got a valid transaction, include it
                txnList.append(newTxn)
                state = updateState(newTxn, state)
            else:
                print("ignored transaction")
                sys.stdout.flush()
                continue  # This was an invalid transaction; ignore it and move on

        ## Make a block
        myBlock = makeBlock(txnList, chain)
        chain.append(myBlock)

In [11]:

    chain[0]

Out[11]:

    {'contents': {'blockNumber': 0,
      'parentHash': None,
      'txnCount': 1,
      'txns': [{'Alice': 50, 'Bob': 50}]},
     'hash': '7c88a4312054f89a2b73b04989cd9b9e1ae437e1048f89fbb4e18a08479de507'}

In [12]:

    chain[1]

Out[12]:

    {'contents': {'blockNumber': 1,
      'parentHash': '7c88a4312054f89a2b73b04989cd9b9e1ae437e1048f89fbb4e18a08479de507',
      'txnCount': 5,
      'txns': [{'Alice': 3, 'Bob': -3},
       {'Alice': -1, 'Bob': 1},
       {'Alice': 3, 'Bob': -3},
       {'Alice': -2, 'Bob': 2},
       {'Alice': 3, 'Bob': -3}]},
     'hash': '7a91fc8206c5351293fd11200b33b7192e87fad6545504068a51aba868bc6f72'}

As expected, the genesis block includes an invalid transaction which initiates account balances (creating tokens out of thin air). The hash of the parent block is referenced in the child block, which contains a set of new transactions which affect system state. We can now see the state of the system, updated to include the transactions:

In [13]:

    state

Out[13]:

    {'Alice': 72, 'Bob': 28}

Checking Chain Validity

Now that we know how to create new blocks and link them together into a chain, let's define functions to check that new blocks are valid - and that the whole chain is valid. On a blockchain network, this becomes important in two ways:

- When we initially set up our node, we will download the full blockchain history. After downloading the chain, we would need to run through the blockchain to compute the state of the system. To protect against somebody inserting invalid transactions in the initial chain, we need to check the validity of the entire chain in this initial download.
- Once our node is synced with the network (has an up-to-date copy of the blockchain and a representation of system state), it will need to check the validity of new blocks that are broadcast to the network.

We will need three functions to facilitate this:

- checkBlockHash: A simple helper function that makes sure that the block contents match the hash
- checkBlockValidity: Checks the validity of a block, given its parent and the current system state. We want this to return the updated state if the block is valid, and raise an error otherwise.
- checkChain: Checks the validity of the entire chain, and computes the system state beginning at the genesis block. This will return the system state if the chain is valid, and raise an error otherwise.

In [14]:

    def checkBlockHash(block):
        # Raise an exception if the hash does not match the block contents
        expectedHash = hashMe(block['contents'])
        if block['hash'] != expectedHash:
            raise Exception('Hash does not match contents of block %s' %
                            block['contents']['blockNumber'])
        return

In [15]:

    def checkBlockValidity(block, parent, state):
        # We want to check the following conditions:
        # - Each of the transactions are valid updates to the system state
        # - Block hash is valid for the block contents
        # - Block number increments the parent block number by 1
        # - Accurately references the parent block's hash
        parentNumber = parent['contents']['blockNumber']
        parentHash = parent['hash']
        blockNumber = block['contents']['blockNumber']

        # Check transaction validity; throw an error if an invalid transaction was found.
        for txn in block['contents']['txns']:
            if isValidTxn(txn, state):
                state = updateState(txn, state)
            else:
                raise Exception('Invalid transaction in block %s: %s' % (blockNumber, txn))

        checkBlockHash(block)  # Check hash integrity; raises error if inaccurate

        if blockNumber != (parentNumber + 1):
            raise Exception('Block number does not increment parent at block %s' % blockNumber)

        if block['contents']['parentHash'] != parentHash:
            raise Exception('Parent hash not accurate at block %s' % blockNumber)

        return state

In [16]:

    def checkChain(chain):
        # Work through the chain from the genesis block (which gets special treatment),
        # checking that all transactions are internally valid, that the transactions
        # do not cause an overdraft, and that the blocks are linked by their hashes.
        # This returns the state as a dictionary of accounts and balances,
        # or returns False if an error was detected.

        ## Data input processing: Make sure that our chain is a list of dicts
        if type(chain) == str:
            try:
                chain = json.loads(chain)
                assert(type(chain) == list)
            except:  # This is a catch-all, admittedly crude
                return False
        elif type(chain) != list:
            return False

        state = {}

        ## Prime the pump by checking the genesis block
        # We want to check the following conditions:
        # - Each of the transactions are valid updates to the system state
        # - Block hash is valid for the block contents
        for txn in chain[0]['contents']['txns']:
            state = updateState(txn, state)
        checkBlockHash(chain[0])
        parent = chain[0]

        ## Checking subsequent blocks: These additionally need to check
        # - the reference to the parent block's hash
        # - the validity of the block number
        for block in chain[1:]:
            state = checkBlockValidity(block, parent, state)
            parent = block

        return state

We can now check the validity of the state:

In [17]:

    checkChain(chain)

Out[17]:

    {'Alice': 72, 'Bob': 28}

And even if we are loading the chain from a text file, e.g. from a backup or loading it for the first time, we can check the integrity of the chain and create the current state:

In [18]:

    chainAsText = json.dumps(chain, sort_keys=True)
    checkChain(chainAsText)

Out[18]:

    {'Alice': 72, 'Bob': 28}

Putting It Together: The Final Blockchain Architecture

In an actual blockchain network, new nodes would download a copy of the blockchain and verify it (as we just did above), then announce their presence on the peer-to-peer network and start listening for transactions. Bundling transactions into a block, they then pass their proposed block on to other nodes.

We've seen how to verify a copy of the blockchain, and how to bundle transactions into a block. If we receive a block from somewhere else, verifying it and adding it to our blockchain is easy.

Let's say that the following code runs on Node B, which mines the block:

In [19]:

    import copy
    nodeBchain = copy.copy(chain)
    nodeBtxns = [makeTransaction() for i in range(5)]
    newBlock = makeBlock(nodeBtxns, nodeBchain)

Now assume that the newBlock is transmitted to our node, and we want to check it and update our state if it is a valid block:

In [20]:

    print("Blockchain on Node A is currently %s blocks long" % len(chain))

    try:
        print("New Block Received; checking validity...")
        state = checkBlockValidity(newBlock, chain[-1], state)
        # Update the state - this will throw an error if the block is invalid!
        chain.append(newBlock)
    except:
        print("Invalid block; ignoring and waiting for the next block...")

    print("Blockchain on Node A is now %s blocks long" % len(chain))

    Blockchain on Node A is currently 7 blocks long
    New Block Received; checking validity...
    Blockchain on Node A is now 8 blocks long
codyps / Hash Roll · A variety of content chunking algorithms with a common API in Rust
patyork / AutomaticSpeechChunker · From a large speech audio file and its corresponding body of text, automatically chunk the audio and text into (phrase, audio_snippet) pairs. For use with the Connectionist Temporal Classification (CTC) cost algorithm.
netleibi / Fastchunking · Fast text chunking algorithms for Python
ayush585 / SmartChunk · SmartChunk is a lightweight, structure-aware semantic chunking toolkit designed to supercharge RAG (Retrieval-Augmented Generation) and LLM pipelines. Unlike naive splitters that break text arbitrarily, SmartChunk respects document structure (headings, lists, tables, code blocks) and semantic flow, ensuring cleaner, more coherent chunks.
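To illustrate what "structure-aware" means in practice, here is a generic sketch of heading-aware Markdown chunking; it conveys the idea only and is not SmartChunk's actual API, whose interface may differ.

    # Generic sketch: split Markdown at headings so chunks never straddle two
    # sections, falling back to paragraph boundaries for oversized sections.
    import re

    def chunk_markdown(text, max_chars=1000):
        sections = re.split(r"(?m)^(?=#{1,6} )", text)  # split before each heading
        chunks = []
        for section in sections:
            if not section.strip():
                continue
            if len(section) <= max_chars:
                chunks.append(section.strip())
            else:
                chunks.extend(p.strip() for p in section.split("\n\n") if p.strip())
        return chunks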
Jai-Agarwal-04 / Sentiment Analysis With Insights · Sentiment Analysis with Insights using NLP and Dash

This project shows sentiment analysis of text data using NLP and Dash. I used the Amazon reviews dataset to train the model and then scraped reviews from Etsy.com to test it.

Prerequisites: Python 3, the Amazon dataset (3.6 GB), Anaconda.

How this project was made

This project was built using Python 3 to predict sentiments with the help of machine learning, plus an interactive dashboard for testing reviews. To start, I downloaded the dataset and extracted the JSON file. Next, I took out a portion of 792,000 reviews, equally distributed into chunks of 24,000 reviews, using pandas. The chunks were then combined into a single CSV file called balanced_reviews.csv. This balanced_reviews.csv served as the base for training the model, after filtering to keep only reviews rated above or below 3 stars. The filtered data was vectorized using a TF-IDF vectorizer (a minimal sketch of this step appears at the end of this entry). After training the model to 90% accuracy, reviews were scraped from Etsy.com to test it. Finally, I built a dashboard in which we can check the sentiment of input typed by the user or of the reviews scraped from the website.

What is CountVectorizer?

CountVectorizer is a tool provided by the scikit-learn library in Python. It transforms a given text into a vector on the basis of the frequency (count) of each word that occurs in the text. This is helpful when we have multiple such texts and wish to convert each word in each text into a vector for further text analysis. CountVectorizer creates a matrix in which each unique word is represented by a column and each text sample from the document is a row; the value of each cell is the count of that word in that particular text sample.

What is TF-IDF Vectorizer?

TF-IDF stands for Term Frequency - Inverse Document Frequency, a statistic that aims to better define how important a word is for a document, while also taking into account its relation to other documents in the same corpus. It looks at how many times a word appears in a document while also paying attention to how many times the same word appears in other documents in the corpus. The rationale is the following:

- a word that frequently appears in a document has more relevancy for that document, meaning there is a higher probability that the document is about, or in relation to, that specific word;
- a word that frequently appears in many documents may prevent us from finding the right document in a collection; the word is relevant either for all documents or for none, and either way will not help us filter out a single document or a small subset from the whole set.

So TF-IDF is a score applied to every word in every document in our dataset. For every word, the TF-IDF value increases with every appearance of the word in a document, but gradually decreases with every appearance in other documents.

What is Plotly Dash?

Dash is a productive Python framework for building web analytic applications. Written on top of Flask, Plotly.js, and React.js, Dash is ideal for building data-visualization apps with highly custom user interfaces in pure Python. It is particularly suited for anyone who works with data in Python. Dash apps are rendered in the web browser. You can deploy your apps to servers and then share them through URLs; since Dash apps are viewed in the web browser, Dash is inherently cross-platform and mobile-ready. Dash is an open-source library released under the permissive MIT license. Plotly develops Dash and offers a platform for managing Dash apps in an enterprise environment.

What is Web Scraping?

Web scraping is a term used to describe the use of a program or algorithm to extract and process large amounts of data from the web.

Running the project

Step 1: Download the dataset and extract the JSON data into your project folder. Make a folder filtered_chunks and run the data_extraction.py file. This extracts data from the JSON file into equal-sized chunks and then combines them into a single CSV file called balanced_reviews.csv.

Step 2: Run the data_cleaning_preprocessing_and_vectorizing.py file. This cleans and filters the data; the filtered data is then fed to the TF-IDF vectorizer, the model is pickled to a trained_model.pkl file, and the vocabulary of the trained model is stored as vocab.pkl. Keep these two files in a folder named model_files.

Step 3: Run the etsy_review_scrapper.py file. Adjust the range of pages and products to be scraped, as it can take a very long time to process; a small dataset is sufficient to check the accuracy of our model. The scraped data is stored in CSV as well as db files.

Step 4: Finally, run the app.py file, which starts up the Dash server; we can then check our model either by typing a review or by selecting the preloaded scraped reviews.
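The CountVectorizer and TF-IDF vectorizer described above are both available in scikit-learn. Below is a minimal sketch of the vectorize-and-train step; the CSV column names and the choice of classifier are illustrative assumptions, not taken from this project's code.

    # Minimal sketch of the TF-IDF training step (assumed column names).
    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("balanced_reviews.csv").dropna()
    df = df[df["overall"] != 3]                # drop neutral 3-star reviews
    labels = (df["overall"] > 3).astype(int)   # 1 = positive, 0 = negative

    X_train, X_test, y_train, y_test = train_test_split(
        df["reviewText"], labels, random_state=0)

    vectorizer = TfidfVectorizer(min_df=5)            # weight words by how
    X_train_vec = vectorizer.fit_transform(X_train)   # distinctive they are

    model = LogisticRegression(max_iter=1000).fit(X_train_vec, y_train)
    print(model.score(vectorizer.transform(X_test), y_test))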
HenryWJL / Act Pytorch · Reproduction code for the imitation learning algorithm Action Chunking with Transformers (ACT)
SibaMishra / Clustering Glossary Terms Extracted From Large Sized Software Requirements Using FastText · This repository contains the results of automatic glossary term extraction and clustering over two important qualitative attributes, feature and benefit, of the original CrowdRE requirements specification dataset.

In the original CrowdRE dataset, each entry has six attributes: role, feature, benefit, domain, tags, and date-time of creation. Since we are interested in extracting domain-specific terms from this dataset, we focus only on the feature and benefit attributes. The dataset used in our experiments, containing only those two attributes, can be viewed in the file named "CrowdRE Requirements Dataset.csv". The original CrowdRE dataset was developed by P. K. Murukannaiah et al. and can be accessed from "The smarthome crowd requirements dataset", https://crowdre.github.io/murukannaiah-smarthome-requirements-dataset/, April 2017.

We have computed and reported a ground truth set for a random subset of 100 requirement specifications of the CrowdRE dataset. In total, we manually identified 120 ground truth glossary terms forming 30 overlapping clusters. The ground truth glossary terms were produced from the best intuition of the people involved in this project, in an unbiased manner, as no benchmark or gold standard exists for ground truth extraction and clustering on the CrowdRE dataset. The file named "Ground Truth Clusters.docx" shows the ground truth glossary terms along with the manually formulated semantically similar clusters. Note: the clusters are separated with the (######) symbol in the file. The 120 manually identified glossary terms are shown in the third column of the file named "Extracted Glossary Terms (With and Without WordNet Removal) and Ground Truth Glossary Terms.csv".

Using a mature text chunking approach, we extracted a total of 143 and 292 glossary terms from the CrowdRE dataset with and without removing words specified in the WordNet lexical database (https://wordnet.princeton.edu/). The results are shown in the first and second columns of the same CSV file.

The extracted glossary terms were embedded with FastText word vectors (https://fasttext.cc/docs/en/english-vectors.html), trained on a domain-specific corpus most related to the CrowdRE dataset (the Wikipedia Home Automation category to a maximum depth of two, https://en.wikipedia.org/wiki/Category:Home_automation) and with pre-trained word vectors from the UMBC webbase corpus and statmt.org news dataset trained with subword information on Wikipedia 2017 (T. Mikolov, E. Grave, P. Bojanowski, C. Puhrsch, A. Joulin, "Advances in Pre-Training Distributed Word Representations"). The main purpose of the training is to deduce the clusters by forming a similarity matrix for the extracted glossary terms: the matrix is built from semantic similarity scores (cosine similarity) computed between the FastText word vectors. For clustering we used two algorithms, viz. K-Means and EM.

The automatically formulated clusters for the random subset of 100 requirement specifications, for which the ground truth glossary terms were computed, are shown in the files named "Automated Ideal (Ground Truth) Clusters.docx" and "Automated Extraction and Clustering.docx" respectively. Note: there exist at most n/2 clusters for n glossary terms. To evaluate the efficacy of the clustering algorithms, we used commonly used performance metrics (precision, recall, F-score). Evaluation graphs plotting area under the curve (AUC) and normalized AUC scores for both clustering algorithms, trained on the two different datasets, are shown in two separate files, "Cluster Plots.docx" and "Extraction +Clustering Plots.docx" respectively.
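As an illustration of the embedding-and-clustering pipeline described above, here is a minimal sketch using the fasttext Python package and scikit-learn. The model file, the example terms, and the cluster count are illustrative assumptions, not the repository's actual settings (which also include EM clustering).

    # Sketch: embed glossary terms with FastText, build a cosine-similarity
    # matrix, and cluster with K-Means. All concrete values are illustrative.
    import numpy as np
    import fasttext
    from sklearn.cluster import KMeans
    from sklearn.metrics.pairwise import cosine_similarity

    terms = ["smart thermostat", "motion sensor", "door lock", "light dimmer"]
    model = fasttext.load_model("cc.en.300.bin")   # pre-trained English vectors

    vectors = np.array([model.get_sentence_vector(t) for t in terms])
    similarity = cosine_similarity(vectors)        # pairwise similarity matrix

    # At most n/2 clusters for n glossary terms, per the repository's note
    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(vectors)
    for term, label in zip(terms, kmeans.labels_):
        print(label, term)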
Piletskii-Oleg / Rust Chunking · Content-based chunking algorithms implemented in Rust.
jedisct1 / Zig Ultracdc · UltraCDC, a fast content-defined chunking algorithm for data deduplication.
withLinda / MaxMin Contextual Chunking · A Rust implementation of the MaxMin semantic text chunking algorithm using Voyage AI's voyage-context-3 API for contextual embeddings.