* init deepseek model file
* temp removal of flash attention implementation
* shapes are proper, can make a pass
* query, key, value have good cosine similarity, but the max diff is a bit high
* Attention block is working! (with eager for now; have not added the mask line)
* working MoE at around 0.95 cosine sim
* added cosine similarity function
* Starting end to end structure
* Trying (and failing) to get rope to work, going to test full thing on tater
* running on tater36... just not the right outputs
* we have the right values for rope... but it's still not working?
* change Extrapolation Factor to 1
* removed adding residuals twice, removed normalization from shared expert, refactored Norms (Attention, MLP) to be outside the (Attention, MLP) blocks and in the Transformer block instead, add cache setLayer
* Temporary modelfiles for cpu
* change kpass intermediate step to kv, two layer outputs [0,1] look fine
* this calls for 16 chicken nuggets
* whoops
* cleaning up code
* delete stuff we don't need
* getting rid of debug statements for llama.cpp
* working with long contexts
* fix long context view error
* reverting some changes I made to files that are not a part of the PR
* Added proper tokenizer for deepseek3
* clean up model and go test
* remove Modelfile
* not passing the tests
* whoops
* how to pass the ci tests
* resolving some of the comments
* rename
* linted and renamed deepseek3 -> deepseek2
* remove name go
* addressed changes - main change was adopting qwen3 naming scheme
* I cannot with linters
* clean up logs
---------
Co-authored-by: Grace Guo <graceguo@Graces-MBP.localdomain>
Co-authored-by: Grace Guo <graceguo@Graces-MacBook-Pro.local>
Co-authored-by: graceguo <graceguo@tater36.localdomain>