Updated Contributing for Celery (#629)

Yuhong Sun
2023-10-25 18:26:02 -07:00
committed by GitHub
parent fbb05e630d
commit 9a51745fc9
5 changed files with 74 additions and 74 deletions


@@ -6,7 +6,7 @@ As an open source project in a rapidly changing space, we welcome all contributi
 ## 💃 Guidelines
 ### Contribution Opportunities
-The [GitHub issues](https://github.com/danswer-ai/danswer/issues) page is a great place to start for contribution ideas.
+The [GitHub Issues](https://github.com/danswer-ai/danswer/issues) page is a great place to start for contribution ideas.
 Issues that have been explicitly approved by the maintainers (aligned with the direction of the project)
 will be marked with the `approved by maintainers` label.
@@ -19,7 +19,9 @@ If you have a new/different contribution in mind, we'd love to hear about it!
 Your input is vital to making sure that Danswer moves in the right direction.
 Before starting on implementation, please raise a GitHub issue.
-And always feel free to message us (Chris Weaver / Yuhong Sun) on Slack / Discord directly about anything at all.
+And always feel free to message us (Chris Weaver / Yuhong Sun) on
+[Slack](https://join.slack.com/t/danswer/shared_invite/zt-1u3h3ke3b-VGh1idW19R8oiNRiKBYv2w) /
+[Discord](https://discord.gg/TDJ59cGV2X) directly about anything at all.
 ### Contributing Code
@@ -44,8 +46,8 @@ We would love to see you there!
 ## Get Started 🚀
-Danswer being a fully functional app, relies on several external pieces of software, specifically:
-- Postgres (Relational DB)
+Danswer being a fully functional app, relies on some external pieces of software, specifically:
+- [Postgres](https://www.postgresql.org/) (Relational DB)
 - [Vespa](https://vespa.ai/) (Vector DB/Search Engine)
 This guide provides instructions to set up the Danswer specific services outside of Docker because it's easier for
@@ -54,11 +56,9 @@ development purposes but also feel free to just use the containers and update wi
 ### Local Set Up
-We've tested primarily with Python versions >= 3.11 but the code should work with Python >= 3.9.
-This guide skips a few optional features for simplicity, reach out if you need any of these:
-- User Authentication feature
-- File Connector background job
+It is recommended to use Python versions >= 3.11.
+This guide skips setting up User Authentication for the purpose of simplicity
 #### Installing Requirements
@@ -93,18 +93,11 @@ playwright install
 #### Dependent Docker Containers
-First navigate to `danswer/deployment/docker_compose`, then start up the containers with:
+First navigate to `danswer/deployment/docker_compose`, then start up Vespa and Postgres with:
-Postgres:
 ```bash
-docker compose -f docker-compose.dev.yml -p danswer-stack up -d relational_db
+docker compose -f docker-compose.dev.yml -p danswer-stack up -d document_index relational_db
 ```
+(document_index refers to Vespa and relational_db refers to Postgres)
-Vespa:
-```bash
-docker compose -f docker-compose.dev.yml -p danswer-stack up -d index
-```
 #### Running Danswer
@@ -115,27 +108,33 @@ mkdir dynamic_config_storage
 To start the frontend, navigate to `danswer/web` and run:
 ```bash
-AUTH_TYPE=disabled npm run dev
+npm run dev
 ```
+_for Windows, run:_
+```bash
+(SET "AUTH_TYPE=disabled" && npm run dev)
+```
+Package the Vespa schema. This will only need to be done when the Vespa schema is updated locally.
-The first time running Danswer, you will need to run the DB migrations for Postgres.
-Navigate to `danswer/backend` and with the venv active, run:
-```bash
-alembic upgrade head
-```
-Additionally, we have to package the Vespa schema deployment:
 Navigate to `danswer/backend/danswer/datastores/vespa/app_config` and run:
 ```bash
 zip -r ../vespa-app.zip .
 ```
 - Note: If you don't have the `zip` utility, you will need to install it prior to running the above
+The first time running Danswer, you will also need to run the DB migrations for Postgres.
+After the first time, this is no longer required unless the DB models change.
+Navigate to `danswer/backend` and with the venv active, run:
+```bash
+alembic upgrade head
+```
+Next, start the task queue which orchestrates the background jobs.
+Jobs that take more time are run async from the API server.
+Still in `danswer/backend`, run:
+```bash
+python ./scripts/dev_run_background_jobs.py
+```
 To run the backend API server, navigate back to `danswer/backend` and run:
 ```bash
 AUTH_TYPE=disabled \
@@ -153,33 +152,6 @@ powershell -Command "
 "
 ```
-To run the background job to check for connector updates and index documents, navigate to `danswer/backend` and run:
-```bash
-PYTHONPATH=. DYNAMIC_CONFIG_DIR_PATH=./dynamic_config_storage python danswer/background/update.py
-```
-_For Windows:_
-```bash
-powershell -Command " $env:PYTHONPATH='.'; $env:DYNAMIC_CONFIG_DIR_PATH='./dynamic_config_storage'; python danswer/background/update.py "
-```
-To run the background job to check for periodically check for document set updates, navigate to `danswer/backend` and run:
-```bash
-PYTHONPATH=. DYNAMIC_CONFIG_DIR_PATH=./dynamic_config_storage python danswer/background/document_set_sync_script.py
-```
-_For Windows:_
-```bash
-powershell -Command " $env:PYTHONPATH='.'; $env:DYNAMIC_CONFIG_DIR_PATH='./dynamic_config_storage'; python danswer/background/document_set_sync_script.py "
-```
-To run Celery, which handles deletion of connectors + syncing of document sets, navigate to `danswer/backend` and run:
-```bash
-PYTHONPATH=. DYNAMIC_CONFIG_DIR_PATH=./dynamic_config_storage celery -A danswer.background.celery worker --loglevel=info --concurrency=1
-```
-_For Windows:_
-```bash
-powershell -Command " $env:PYTHONPATH='.'; $env:DYNAMIC_CONFIG_DIR_PATH='./dynamic_config_storage'; celery -A danswer.background.celery worker --loglevel=info --concurrency=1 "
-```
 Note: if you need finer logging, add the additional environment variable `LOG_LEVEL=DEBUG` to the relevant services.
 ### Formatting and Linting
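The per-command instructions removed above are consolidated into a single orchestrator script. As an illustration of the underlying pattern (hypothetical names and a stand-in `echo` command, not the project's real worker/beat invocations), the core idea is spawning a subprocess with stderr merged into stdout and echoing its output from a monitor thread:

```python
import subprocess
import threading


def monitor(name: str, proc: subprocess.Popen) -> None:
    # Tag and echo each line of the child's merged stdout/stderr.
    assert proc.stdout is not None
    for line in proc.stdout:
        print(f"{name}: {line.rstrip()}")


def launch(name: str, cmd: list[str]) -> tuple[subprocess.Popen, threading.Thread]:
    # stderr=STDOUT merges both streams so one reader thread suffices.
    proc = subprocess.Popen(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    thread = threading.Thread(target=monitor, args=(name, proc))
    thread.start()
    return proc, thread


if __name__ == "__main__":
    # "echo" stands in for the real celery worker/beat commands.
    proc, thread = launch("WORKER", ["echo", "ready"])
    thread.join()
    proc.wait()
```

One thread per child process is enough here because merging stderr into stdout leaves a single stream to drain, which avoids pipe-buffer deadlocks.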


@@ -1,4 +1,5 @@
-# This file is purely for development use, not included in any builds
+import argparse
+import os
 import subprocess
 import threading
@@ -16,18 +17,20 @@ def monitor_process(process_name: str, process: subprocess.Popen) -> None:
             break
-def run_celery() -> None:
+def run_jobs(exclude_indexing: bool) -> None:
     cmd_worker = [
         "celery",
         "-A",
         "danswer.background.celery",
         "worker",
+        "--pool=threads",
+        "--autoscale=3,10",
         "--loglevel=INFO",
         "--concurrency=1",
     ]
     cmd_beat = ["celery", "-A", "danswer.background.celery", "beat", "--loglevel=INFO"]
-    # Redirect stderr to stdout for both processes
     worker_process = subprocess.Popen(
         cmd_worker, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
     )
@@ -35,7 +38,6 @@ def run_celery() -> None:
         cmd_beat, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
     )
-    # Monitor outputs using threads
     worker_thread = threading.Thread(
         target=monitor_process, args=("WORKER", worker_process)
     )
@@ -44,10 +46,37 @@ def run_celery() -> None:
     worker_thread.start()
     beat_thread.start()
-    # Wait for threads to finish
+    if not exclude_indexing:
+        update_env = os.environ.copy()
+        update_env["PYTHONPATH"] = "."
+        update_env["DYNAMIC_CONFIG_DIR_PATH"] = "./dynamic_config_storage"
+        update_env["FILE_CONNECTOR_TMP_STORAGE_PATH"] = "./dynamic_config_storage"
+        cmd_indexing = ["python", "danswer/background/update.py"]
+        indexing_process = subprocess.Popen(
+            cmd_indexing,
+            env=update_env,
+            stdout=subprocess.PIPE,
+            stderr=subprocess.STDOUT,
+            text=True,
+        )
+        indexing_thread = threading.Thread(
+            target=monitor_process, args=("INDEXING", indexing_process)
+        )
+        indexing_thread.start()
+        indexing_thread.join()
     worker_thread.join()
     beat_thread.join()
 if __name__ == "__main__":
-    run_celery()
+    parser = argparse.ArgumentParser(description="Run background jobs.")
+    parser.add_argument(
+        "--no-indexing", action="store_true", help="Do not run indexing process"
+    )
+    args = parser.parse_args()
+    run_jobs(args.no_indexing)
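The `--no-indexing` flag added above uses argparse's standard `store_true` pattern, where the dashed flag name maps to an underscored attribute. A minimal self-contained sketch:

```python
import argparse


def parse_args(argv: list[str]) -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Run background jobs.")
    # store_true defaults to False and becomes True when the flag is passed.
    parser.add_argument(
        "--no-indexing", action="store_true", help="Do not run indexing process"
    )
    return parser.parse_args(argv)


print(parse_args([]).no_indexing)                 # False
print(parse_args(["--no-indexing"]).no_indexing)  # True
```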


@@ -11,7 +11,7 @@ services:
       uvicorn danswer.main:app --host 0.0.0.0 --port 8080"
     depends_on:
       - relational_db
-      - index
+      - document_index
     restart: always
     ports:
       - "8080:8080"
@@ -23,7 +23,7 @@ services:
       - GEN_AI_HOST_TYPE=${GEN_AI_HOST_TYPE:-}
       - NUM_DOCUMENT_TOKENS_FED_TO_GENERATIVE_MODEL=${NUM_DOCUMENT_TOKENS_FED_TO_GENERATIVE_MODEL:-}
       - POSTGRES_HOST=relational_db
-      - VESPA_HOST=index
+      - VESPA_HOST=document_index
       - AUTH_TYPE=${AUTH_TYPE:-disabled}
       - QA_TIMEOUT=${QA_TIMEOUT:-}
       - VALID_EMAIL_DOMAINS=${VALID_EMAIL_DOMAINS:-}
@@ -60,7 +60,7 @@ services:
     command: /usr/bin/supervisord
     depends_on:
       - relational_db
-      - index
+      - document_index
     restart: always
     environment:
       - INTERNAL_MODEL_VERSION=${INTERNAL_MODEL_VERSION:-openai-chat-completion}
@@ -69,7 +69,7 @@ services:
       - GEN_AI_ENDPOINT=${GEN_AI_ENDPOINT:-}
       - GEN_AI_HOST_TYPE=${GEN_AI_HOST_TYPE:-}
       - POSTGRES_HOST=relational_db
-      - VESPA_HOST=index
+      - VESPA_HOST=document_index
       - API_BASE_OPENAI=${API_BASE_OPENAI:-}
       - API_TYPE_OPENAI=${API_TYPE_OPENAI:-}
       - API_VERSION_OPENAI=${API_VERSION_OPENAI:-}
@@ -129,7 +129,7 @@ services:
       - "5432:5432"
     volumes:
       - db_volume:/var/lib/postgresql/data
-  index:
+  document_index:
     image: vespaengine/vespa:8
     restart: always
     ports:


@@ -11,14 +11,14 @@ services:
       uvicorn danswer.main:app --host 0.0.0.0 --port 8080"
     depends_on:
       - relational_db
-      - index
+      - document_index
     restart: always
     env_file:
       - .env
     environment:
       - AUTH_TYPE=${AUTH_TYPE:-google_oauth}
       - POSTGRES_HOST=relational_db
-      - VESPA_HOST=index
+      - VESPA_HOST=document_index
     volumes:
       - local_dynamic_storage:/home/storage
       - file_connector_tmp_storage:/home/file_connector_storage
@@ -33,14 +33,14 @@ services:
     command: /usr/bin/supervisord
     depends_on:
       - relational_db
-      - index
+      - document_index
     restart: always
     env_file:
       - .env
     environment:
       - AUTH_TYPE=${AUTH_TYPE:-google_oauth}
       - POSTGRES_HOST=relational_db
-      - VESPA_HOST=index
+      - VESPA_HOST=document_index
     volumes:
       - local_dynamic_storage:/home/storage
       - file_connector_tmp_storage:/home/file_connector_storage
@@ -69,7 +69,7 @@ services:
       - .env
     volumes:
       - db_volume:/var/lib/postgresql/data
-  index:
+  document_index:
     image: vespaengine/vespa:8
     restart: always
     ports:
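Renaming the compose service from `index` to `document_index` only works if the backend's default for `VESPA_HOST` agrees, since compose service names double as network hostnames. A sketch of the usual env-with-default lookup (illustrative; the project's actual config module and the `VESPA_PORT` default are assumptions):

```python
import os

# The default mirrors the compose service name; deployments can override it
# via the VESPA_HOST environment variable, as the compose files above do.
VESPA_HOST = os.environ.get("VESPA_HOST", "document_index")
VESPA_PORT = int(os.environ.get("VESPA_PORT", "8081"))  # hypothetical default

print(f"{VESPA_HOST}:{VESPA_PORT}")
```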


@@ -38,7 +38,6 @@ SESSION_EXPIRE_TIME_SECONDS=86400
 # The following are for configuring User Authentication, supported flows are:
 # disabled
-# simple (email/password + user account creation in Danswer)
 # google_oauth (login with google/gmail account)
 # oidc (only in Danswer enterprise edition)
 # saml (only in Danswer enterprise edition)
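With the `simple` flow removed from the template above, a hypothetical validation helper (not from the codebase) illustrates how the remaining `AUTH_TYPE` values might be checked at startup:

```python
# Values remaining in the env template after "simple" was removed.
SUPPORTED_AUTH_TYPES = {"disabled", "google_oauth", "oidc", "saml"}


def check_auth_type(value: str) -> str:
    # Fail fast on typos or removed flows such as AUTH_TYPE=simple.
    if value not in SUPPORTED_AUTH_TYPES:
        raise ValueError(f"Unsupported AUTH_TYPE: {value!r}")
    return value


print(check_auth_type("disabled"))  # disabled
```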