diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index dcc4d5488..2505a6665 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -6,7 +6,7 @@ As an open source project in a rapidly changing space, we welcome all contributi
 
 ## 💃 Guidelines
 ### Contribution Opportunities
-The [GitHub issues](https://github.com/danswer-ai/danswer/issues) page is a great place to start for contribution ideas.
+The [GitHub Issues](https://github.com/danswer-ai/danswer/issues) page is a great place to start for contribution ideas.
 
 Issues that have been explicitly approved by the maintainers (aligned with the direction of the project)
 will be marked with the `approved by maintainers` label.
@@ -19,7 +19,9 @@ If you have a new/different contribution in mind, we'd love to hear about it!
 Your input is vital to making sure that Danswer moves in the right direction.
 Before starting on implementation, please raise a GitHub issue.
 
-And always feel free to message us (Chris Weaver / Yuhong Sun) on Slack / Discord directly about anything at all.
+And always feel free to message us (Chris Weaver / Yuhong Sun) on
+[Slack](https://join.slack.com/t/danswer/shared_invite/zt-1u3h3ke3b-VGh1idW19R8oiNRiKBYv2w) /
+[Discord](https://discord.gg/TDJ59cGV2X) directly about anything at all.
 
 ### Contributing Code
@@ -44,8 +46,8 @@ We would love to see you there!
 
 ## Get Started 🚀
 
-Danswer being a fully functional app, relies on several external pieces of software, specifically:
-- Postgres (Relational DB)
+Danswer, being a fully functional app, relies on some external pieces of software, specifically:
+- [Postgres](https://www.postgresql.org/) (Relational DB)
 - [Vespa](https://vespa.ai/) (Vector DB/Search Engine)
 
 This guide provides instructions to set up the Danswer specific services outside of Docker because it's easier for
@@ -54,11 +56,9 @@ development purposes but also feel free to just use the containers and update wi
 
 ### Local Set Up
 
-We've tested primarily with Python versions >= 3.11 but the code should work with Python >= 3.9.
+It is recommended to use Python versions >= 3.11.
 
-This guide skips a few optional features for simplicity, reach out if you need any of these:
-- User Authentication feature
-- File Connector background job
+This guide skips setting up User Authentication for simplicity.
 
 #### Installing Requirements
@@ -93,18 +93,11 @@ playwright install
 
 #### Dependent Docker Containers
 
-First navigate to `danswer/deployment/docker_compose`, then start up the containers with:
-
-Postgres:
+First navigate to `danswer/deployment/docker_compose`, then start up Vespa and Postgres with:
 ```bash
-docker compose -f docker-compose.dev.yml -p danswer-stack up -d relational_db
+docker compose -f docker-compose.dev.yml -p danswer-stack up -d document_index relational_db
 ```
-
-Vespa:
-```bash
-docker compose -f docker-compose.dev.yml -p danswer-stack up -d index
-```
-
+(`document_index` refers to Vespa and `relational_db` refers to Postgres)
 
 #### Running Danswer
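To verify that both containers are up under the new service name, a minimal check (a sketch using the standard `docker compose ps` and `docker compose logs` subcommands, with the compose file and project name taken from the guide above):

```bash
# From danswer/deployment/docker_compose: list the status of both services
docker compose -f docker-compose.dev.yml -p danswer-stack ps

# Tail the logs of the renamed Vespa service if it fails to come up
docker compose -f docker-compose.dev.yml -p danswer-stack logs -f document_index
```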
@@ -115,27 +108,33 @@ mkdir dynamic_config_storage
 
 To start the frontend, navigate to `danswer/web` and run:
 ```bash
-AUTH_TYPE=disabled npm run dev
-```
-_for Windows, run:_
-```bash
-(SET "AUTH_TYPE=disabled" && npm run dev)
+npm run dev
 ```
 
+Package the Vespa schema. This only needs to be done when the Vespa schema is updated locally.
 
-The first time running Danswer, you will need to run the DB migrations for Postgres.
-Navigate to `danswer/backend` and with the venv active, run:
-```bash
-alembic upgrade head
-```
-
-Additionally, we have to package the Vespa schema deployment:
 Nagivate to `danswer/backend/danswer/datastores/vespa/app_config` and run:
 ```bash
 zip -r ../vespa-app.zip .
 ```
 - Note: If you don't have the `zip` utility, you will need to install it prior to running the above
 
+The first time running Danswer, you will also need to run the DB migrations for Postgres.
+After the first time, this is no longer required unless the DB models change.
+
+Navigate to `danswer/backend` and with the venv active, run:
+```bash
+alembic upgrade head
+```
+
+Next, start the task queue, which orchestrates the background jobs.
+Jobs that take more time are run asynchronously from the API server.
+
+Still in `danswer/backend`, run:
+```bash
+python ./scripts/dev_run_background_jobs.py
+```
+
 To run the backend API server, navigate back to `danswer/backend` and run:
 ```bash
 AUTH_TYPE=disabled \
@@ -153,33 +152,6 @@ powershell -Command "
 "
 ```
 
-To run the background job to check for connector updates and index documents, navigate to `danswer/backend` and run:
-```bash
-PYTHONPATH=. DYNAMIC_CONFIG_DIR_PATH=./dynamic_config_storage python danswer/background/update.py
-```
-_For Windows:_
-```bash
-powershell -Command " $env:PYTHONPATH='.'; $env:DYNAMIC_CONFIG_DIR_PATH='./dynamic_config_storage'; python danswer/background/update.py "
-```
-
-To run the background job to check for periodically check for document set updates, navigate to `danswer/backend` and run:
-```bash
-PYTHONPATH=. DYNAMIC_CONFIG_DIR_PATH=./dynamic_config_storage python danswer/background/document_set_sync_script.py
-```
-_For Windows:_
-```bash
-powershell -Command " $env:PYTHONPATH='.'; $env:DYNAMIC_CONFIG_DIR_PATH='./dynamic_config_storage'; python danswer/background/document_set_sync_script.py "
-```
-
-To run Celery, which handles deletion of connectors + syncing of document sets, navigate to `danswer/backend` and run:
-```bash
-PYTHONPATH=. DYNAMIC_CONFIG_DIR_PATH=./dynamic_config_storage celery -A danswer.background.celery worker --loglevel=info --concurrency=1
-```
-_For Windows:_
-```bash
-powershell -Command " $env:PYTHONPATH='.'; $env:DYNAMIC_CONFIG_DIR_PATH='./dynamic_config_storage'; celery -A danswer.background.celery worker --loglevel=info --concurrency=1 "
-```
-
 Note: if you need finer logging, add the additional environment variable `LOG_LEVEL=DEBUG` to the relevant services.
 
 ### Formatting and Linting
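The three manually launched background processes removed above are folded into the helper script renamed in the next file. A usage sketch, run from `danswer/backend` with the venv active (the `--no-indexing` flag is the one defined in that script):

```bash
# Default: celery worker + beat plus the document indexing job
python ./scripts/dev_run_background_jobs.py

# Run only celery worker + beat, skipping the indexing process
python ./scripts/dev_run_background_jobs.py --no-indexing
```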
diff --git a/backend/scripts/dev_run_celery.py b/backend/scripts/dev_run_background_jobs.py
similarity index 51%
rename from backend/scripts/dev_run_celery.py
rename to backend/scripts/dev_run_background_jobs.py
index da3cb7142..30fb4bf6f 100644
--- a/backend/scripts/dev_run_celery.py
+++ b/backend/scripts/dev_run_background_jobs.py
@@ -1,4 +1,5 @@
-# This file is purely for development use, not included in any builds
+import argparse
+import os
 import subprocess
 import threading
 
@@ -16,18 +17,20 @@ def monitor_process(process_name: str, process: subprocess.Popen) -> None:
             break
 
 
-def run_celery() -> None:
+def run_jobs(exclude_indexing: bool) -> None:
     cmd_worker = [
         "celery",
         "-A",
         "danswer.background.celery",
         "worker",
+        "--pool=threads",
+        "--autoscale=3,10",
         "--loglevel=INFO",
         "--concurrency=1",
     ]
+
     cmd_beat = ["celery", "-A", "danswer.background.celery", "beat", "--loglevel=INFO"]
 
-    # Redirect stderr to stdout for both processes
     worker_process = subprocess.Popen(
         cmd_worker, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
     )
@@ -35,7 +38,6 @@ def run_celery() -> None:
         cmd_beat, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
     )
 
-    # Monitor outputs using threads
     worker_thread = threading.Thread(
         target=monitor_process, args=("WORKER", worker_process)
     )
@@ -44,10 +46,37 @@ def run_celery() -> None:
     worker_thread.start()
     beat_thread.start()
 
-    # Wait for threads to finish
+    if not exclude_indexing:
+        update_env = os.environ.copy()
+        update_env["PYTHONPATH"] = "."
+        update_env["DYNAMIC_CONFIG_DIR_PATH"] = "./dynamic_config_storage"
+        update_env["FILE_CONNECTOR_TMP_STORAGE_PATH"] = "./dynamic_config_storage"
+        cmd_indexing = ["python", "danswer/background/update.py"]
+
+        indexing_process = subprocess.Popen(
+            cmd_indexing,
+            env=update_env,
+            stdout=subprocess.PIPE,
+            stderr=subprocess.STDOUT,
+            text=True,
+        )
+
+        indexing_thread = threading.Thread(
+            target=monitor_process, args=("INDEXING", indexing_process)
+        )
+
+        indexing_thread.start()
+        indexing_thread.join()
+
     worker_thread.join()
     beat_thread.join()
 
 
 if __name__ == "__main__":
-    run_celery()
+    parser = argparse.ArgumentParser(description="Run background jobs.")
+    parser.add_argument(
+        "--no-indexing", action="store_true", help="Do not run indexing process"
+    )
+    args = parser.parse_args()
+
+    run_jobs(args.no_indexing)
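For reference, the script above supervises processes roughly equivalent to running the following by hand from `danswer/backend` (a sketch assembled from the commands and environment variables in the script; the script additionally prefixes and interleaves each process's output):

```bash
# Celery worker: threaded pool, autoscaling between 3 and 10 threads
celery -A danswer.background.celery worker \
  --pool=threads --autoscale=3,10 --loglevel=INFO --concurrency=1

# Celery beat: the periodic task scheduler
celery -A danswer.background.celery beat --loglevel=INFO

# Indexing job (skipped when --no-indexing is passed)
PYTHONPATH=. \
DYNAMIC_CONFIG_DIR_PATH=./dynamic_config_storage \
FILE_CONNECTOR_TMP_STORAGE_PATH=./dynamic_config_storage \
python danswer/background/update.py
```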
diff --git a/deployment/docker_compose/docker-compose.dev.yml b/deployment/docker_compose/docker-compose.dev.yml
index 2f2e12422..2cc80f01f 100644
--- a/deployment/docker_compose/docker-compose.dev.yml
+++ b/deployment/docker_compose/docker-compose.dev.yml
@@ -11,7 +11,7 @@ services:
       uvicorn danswer.main:app --host 0.0.0.0 --port 8080"
     depends_on:
       - relational_db
-      - index
+      - document_index
     restart: always
     ports:
       - "8080:8080"
@@ -23,7 +23,7 @@ services:
       - GEN_AI_HOST_TYPE=${GEN_AI_HOST_TYPE:-}
       - NUM_DOCUMENT_TOKENS_FED_TO_GENERATIVE_MODEL=${NUM_DOCUMENT_TOKENS_FED_TO_GENERATIVE_MODEL:-}
       - POSTGRES_HOST=relational_db
-      - VESPA_HOST=index
+      - VESPA_HOST=document_index
       - AUTH_TYPE=${AUTH_TYPE:-disabled}
       - QA_TIMEOUT=${QA_TIMEOUT:-}
       - VALID_EMAIL_DOMAINS=${VALID_EMAIL_DOMAINS:-}
@@ -60,7 +60,7 @@ services:
     command: /usr/bin/supervisord
     depends_on:
       - relational_db
-      - index
+      - document_index
     restart: always
     environment:
       - INTERNAL_MODEL_VERSION=${INTERNAL_MODEL_VERSION:-openai-chat-completion}
@@ -69,7 +69,7 @@ services:
       - GEN_AI_ENDPOINT=${GEN_AI_ENDPOINT:-}
       - GEN_AI_HOST_TYPE=${GEN_AI_HOST_TYPE:-}
      - POSTGRES_HOST=relational_db
-      - VESPA_HOST=index
+      - VESPA_HOST=document_index
       - API_BASE_OPENAI=${API_BASE_OPENAI:-}
       - API_TYPE_OPENAI=${API_TYPE_OPENAI:-}
       - API_VERSION_OPENAI=${API_VERSION_OPENAI:-}
@@ -129,7 +129,7 @@
       - "5432:5432"
     volumes:
       - db_volume:/var/lib/postgresql/data
-  index:
+  document_index:
     image: vespaengine/vespa:8
     restart: always
     ports:
diff --git a/deployment/docker_compose/docker-compose.prod.yml b/deployment/docker_compose/docker-compose.prod.yml
index 9736a46e6..5a5ee8122 100644
--- a/deployment/docker_compose/docker-compose.prod.yml
+++ b/deployment/docker_compose/docker-compose.prod.yml
@@ -11,14 +11,14 @@ services:
       uvicorn danswer.main:app --host 0.0.0.0 --port 8080"
     depends_on:
       - relational_db
-      - index
+      - document_index
     restart: always
     env_file:
       - .env
     environment:
       - AUTH_TYPE=${AUTH_TYPE:-google_oauth}
       - POSTGRES_HOST=relational_db
-      - VESPA_HOST=index
+      - VESPA_HOST=document_index
     volumes:
       - local_dynamic_storage:/home/storage
       - file_connector_tmp_storage:/home/file_connector_storage
@@ -33,14 +33,14 @@ services:
     command: /usr/bin/supervisord
     depends_on:
       - relational_db
-      - index
+      - document_index
     restart: always
     env_file:
       - .env
     environment:
       - AUTH_TYPE=${AUTH_TYPE:-google_oauth}
       - POSTGRES_HOST=relational_db
-      - VESPA_HOST=index
+      - VESPA_HOST=document_index
     volumes:
       - local_dynamic_storage:/home/storage
       - file_connector_tmp_storage:/home/file_connector_storage
@@ -69,7 +69,7 @@ services:
       - .env
     volumes:
       - db_volume:/var/lib/postgresql/data
-  index:
+  document_index:
     image: vespaengine/vespa:8
     restart: always
     ports:
diff --git a/deployment/docker_compose/env.prod.template b/deployment/docker_compose/env.prod.template
index 0631aa5f4..a652d7e37 100644
--- a/deployment/docker_compose/env.prod.template
+++ b/deployment/docker_compose/env.prod.template
@@ -38,7 +38,6 @@ SESSION_EXPIRE_TIME_SECONDS=86400
 
 # The following are for configuring User Authentication, supported flows are:
 # disabled
-# simple (email/password + user account creation in Danswer)
 # google_oauth (login with google/gmail account)
 # oidc (only in Danswer enterprise edition)
 # saml (only in Danswer enterprise edition)
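With `simple` removed from the template, a hypothetical `.env` fragment selecting one of the remaining flows would look like this (placeholder value, not a complete working configuration):

```bash
# One of: disabled, google_oauth, oidc (enterprise), saml (enterprise)
AUTH_TYPE=google_oauth
```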