Updated Contributing for Celery (#629)

Yuhong Sun
2023-10-25 18:26:02 -07:00
committed by GitHub
parent fbb05e630d
commit 9a51745fc9
5 changed files with 74 additions and 74 deletions


@@ -6,7 +6,7 @@ As an open source project in a rapidly changing space, we welcome all contributi
 ## 💃 Guidelines
 ### Contribution Opportunities
-The [GitHub issues](https://github.com/danswer-ai/danswer/issues) page is a great place to start for contribution ideas.
+The [GitHub Issues](https://github.com/danswer-ai/danswer/issues) page is a great place to start for contribution ideas.
 Issues that have been explicitly approved by the maintainers (aligned with the direction of the project)
 will be marked with the `approved by maintainers` label.
@@ -19,7 +19,9 @@ If you have a new/different contribution in mind, we'd love to hear about it!
 Your input is vital to making sure that Danswer moves in the right direction.
 Before starting on implementation, please raise a GitHub issue.
-And always feel free to message us (Chris Weaver / Yuhong Sun) on Slack / Discord directly about anything at all.
+And always feel free to message us (Chris Weaver / Yuhong Sun) on
+[Slack](https://join.slack.com/t/danswer/shared_invite/zt-1u3h3ke3b-VGh1idW19R8oiNRiKBYv2w) /
+[Discord](https://discord.gg/TDJ59cGV2X) directly about anything at all.
 ### Contributing Code
@@ -44,8 +46,8 @@ We would love to see you there!
 ## Get Started 🚀
-Danswer being a fully functional app, relies on several external pieces of software, specifically:
-- Postgres (Relational DB)
+Danswer being a fully functional app, relies on some external pieces of software, specifically:
+- [Postgres](https://www.postgresql.org/) (Relational DB)
 - [Vespa](https://vespa.ai/) (Vector DB/Search Engine)
 This guide provides instructions to set up the Danswer specific services outside of Docker because it's easier for
@@ -54,11 +56,9 @@ development purposes but also feel free to just use the containers and update wi
 ### Local Set Up
-We've tested primarily with Python versions >= 3.11 but the code should work with Python >= 3.9.
-This guide skips a few optional features for simplicity, reach out if you need any of these:
-- User Authentication feature
-- File Connector background job
+It is recommended to use Python versions >= 3.11.
+This guide skips setting up User Authentication for the purpose of simplicity
 #### Installing Requirements
@@ -93,18 +93,11 @@ playwright install
 #### Dependent Docker Containers
-First navigate to `danswer/deployment/docker_compose`, then start up the containers with:
+First navigate to `danswer/deployment/docker_compose`, then start up Vespa and Postgres with:
-Postgres:
 ```bash
-docker compose -f docker-compose.dev.yml -p danswer-stack up -d relational_db
+docker compose -f docker-compose.dev.yml -p danswer-stack up -d document_index relational_db
 ```
+(document_index refers to Vespa and relational_db refers to Postgres)
-Vespa:
-```bash
-docker compose -f docker-compose.dev.yml -p danswer-stack up -d index
-```
 #### Running Danswer
@@ -115,27 +108,33 @@ mkdir dynamic_config_storage
 To start the frontend, navigate to `danswer/web` and run:
 ```bash
-AUTH_TYPE=disabled npm run dev
+npm run dev
 ```
+_for Windows, run:_
+```bash
+(SET "AUTH_TYPE=disabled" && npm run dev)
+```
+Package the Vespa schema. This will only need to be done when the Vespa schema is updated locally.
-The first time running Danswer, you will need to run the DB migrations for Postgres.
-Navigate to `danswer/backend` and with the venv active, run:
-```bash
-alembic upgrade head
-```
-Additionally, we have to package the Vespa schema deployment:
 Navigate to `danswer/backend/danswer/datastores/vespa/app_config` and run:
 ```bash
 zip -r ../vespa-app.zip .
 ```
 - Note: If you don't have the `zip` utility, you will need to install it prior to running the above
+The first time running Danswer, you will also need to run the DB migrations for Postgres.
+After the first time, this is no longer required unless the DB models change.
+Navigate to `danswer/backend` and with the venv active, run:
+```bash
+alembic upgrade head
+```
+Next, start the task queue which orchestrates the background jobs.
+Jobs that take more time are run async from the API server.
+Still in `danswer/backend`, run:
+```bash
+python ./scripts/dev_run_background_jobs.py
+```
 To run the backend API server, navigate back to `danswer/backend` and run:
 ```bash
 AUTH_TYPE=disabled \
@@ -153,33 +152,6 @@ powershell -Command "
 "
 ```
-To run the background job to check for connector updates and index documents, navigate to `danswer/backend` and run:
-```bash
-PYTHONPATH=. DYNAMIC_CONFIG_DIR_PATH=./dynamic_config_storage python danswer/background/update.py
-```
-_For Windows:_
-```bash
-powershell -Command " $env:PYTHONPATH='.'; $env:DYNAMIC_CONFIG_DIR_PATH='./dynamic_config_storage'; python danswer/background/update.py "
-```
-To run the background job to check for periodically check for document set updates, navigate to `danswer/backend` and run:
-```bash
-PYTHONPATH=. DYNAMIC_CONFIG_DIR_PATH=./dynamic_config_storage python danswer/background/document_set_sync_script.py
-```
-_For Windows:_
-```bash
-powershell -Command " $env:PYTHONPATH='.'; $env:DYNAMIC_CONFIG_DIR_PATH='./dynamic_config_storage'; python danswer/background/document_set_sync_script.py "
-```
-To run Celery, which handles deletion of connectors + syncing of document sets, navigate to `danswer/backend` and run:
-```bash
-PYTHONPATH=. DYNAMIC_CONFIG_DIR_PATH=./dynamic_config_storage celery -A danswer.background.celery worker --loglevel=info --concurrency=1
-```
-_For Windows:_
-```bash
-powershell -Command " $env:PYTHONPATH='.'; $env:DYNAMIC_CONFIG_DIR_PATH='./dynamic_config_storage'; celery -A danswer.background.celery worker --loglevel=info --concurrency=1 "
-```
 Note: if you need finer logging, add the additional environment variable `LOG_LEVEL=DEBUG` to the relevant services.
 ### Formatting and Linting
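The per-command instructions removed above are consolidated into a single orchestrator script. As an illustration of the underlying pattern (hypothetical names and a stand-in `echo` command, not the project's real worker/beat invocations), the core idea is spawning a subprocess with stderr merged into stdout and echoing its output from a monitor thread:

```python
import subprocess
import threading


def monitor(name: str, proc: subprocess.Popen) -> None:
    # Tag and echo each line of the child's merged stdout/stderr.
    assert proc.stdout is not None
    for line in proc.stdout:
        print(f"{name}: {line.rstrip()}")


def launch(name: str, cmd: list[str]) -> tuple[subprocess.Popen, threading.Thread]:
    # stderr=STDOUT merges both streams so one reader thread suffices.
    proc = subprocess.Popen(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    thread = threading.Thread(target=monitor, args=(name, proc))
    thread.start()
    return proc, thread


if __name__ == "__main__":
    # "echo" stands in for the real celery worker/beat commands.
    proc, thread = launch("WORKER", ["echo", "ready"])
    thread.join()
    proc.wait()
```

One thread per child process is enough here because merging stderr into stdout leaves a single stream to drain, which avoids pipe-buffer deadlocks.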


@@ -1,4 +1,5 @@
-# This file is purely for development use, not included in any builds
+import argparse
+import os
 import subprocess
 import threading
@@ -16,18 +17,20 @@ def monitor_process(process_name: str, process: subprocess.Popen) -> None:
             break
-def run_celery() -> None:
+def run_jobs(exclude_indexing: bool) -> None:
     cmd_worker = [
         "celery",
         "-A",
         "danswer.background.celery",
         "worker",
+        "--pool=threads",
+        "--autoscale=3,10",
         "--loglevel=INFO",
         "--concurrency=1",
     ]
     cmd_beat = ["celery", "-A", "danswer.background.celery", "beat", "--loglevel=INFO"]
-    # Redirect stderr to stdout for both processes
     worker_process = subprocess.Popen(
         cmd_worker, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
     )
@@ -35,7 +38,6 @@ def run_celery() -> None:
         cmd_beat, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
     )
-    # Monitor outputs using threads
     worker_thread = threading.Thread(
         target=monitor_process, args=("WORKER", worker_process)
     )
@@ -44,10 +46,37 @@ def run_celery() -> None:
     worker_thread.start()
     beat_thread.start()
-    # Wait for threads to finish
+    if not exclude_indexing:
+        update_env = os.environ.copy()
+        update_env["PYTHONPATH"] = "."
+        update_env["DYNAMIC_CONFIG_DIR_PATH"] = "./dynamic_config_storage"
+        update_env["FILE_CONNECTOR_TMP_STORAGE_PATH"] = "./dynamic_config_storage"
+        cmd_indexing = ["python", "danswer/background/update.py"]
+        indexing_process = subprocess.Popen(
+            cmd_indexing,
+            env=update_env,
+            stdout=subprocess.PIPE,
+            stderr=subprocess.STDOUT,
+            text=True,
+        )
+        indexing_thread = threading.Thread(
+            target=monitor_process, args=("INDEXING", indexing_process)
+        )
+        indexing_thread.start()
+        indexing_thread.join()
     worker_thread.join()
     beat_thread.join()
 if __name__ == "__main__":
-    run_celery()
+    parser = argparse.ArgumentParser(description="Run background jobs.")
+    parser.add_argument(
+        "--no-indexing", action="store_true", help="Do not run indexing process"
+    )
+    args = parser.parse_args()
+    run_jobs(args.no_indexing)
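The `--no-indexing` flag added above uses argparse's standard `store_true` pattern, where the dashed flag name maps to an underscored attribute. A minimal self-contained sketch:

```python
import argparse


def parse_args(argv: list[str]) -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Run background jobs.")
    # store_true defaults to False and becomes True when the flag is passed.
    parser.add_argument(
        "--no-indexing", action="store_true", help="Do not run indexing process"
    )
    return parser.parse_args(argv)


print(parse_args([]).no_indexing)                 # False
print(parse_args(["--no-indexing"]).no_indexing)  # True
```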


@@ -11,7 +11,7 @@ services:
       uvicorn danswer.main:app --host 0.0.0.0 --port 8080"
     depends_on:
       - relational_db
-      - index
+      - document_index
     restart: always
     ports:
       - "8080:8080"
@@ -23,7 +23,7 @@ services:
       - GEN_AI_HOST_TYPE=${GEN_AI_HOST_TYPE:-}
       - NUM_DOCUMENT_TOKENS_FED_TO_GENERATIVE_MODEL=${NUM_DOCUMENT_TOKENS_FED_TO_GENERATIVE_MODEL:-}
       - POSTGRES_HOST=relational_db
-      - VESPA_HOST=index
+      - VESPA_HOST=document_index
       - AUTH_TYPE=${AUTH_TYPE:-disabled}
       - QA_TIMEOUT=${QA_TIMEOUT:-}
       - VALID_EMAIL_DOMAINS=${VALID_EMAIL_DOMAINS:-}
@@ -60,7 +60,7 @@ services:
     command: /usr/bin/supervisord
     depends_on:
       - relational_db
-      - index
+      - document_index
     restart: always
     environment:
       - INTERNAL_MODEL_VERSION=${INTERNAL_MODEL_VERSION:-openai-chat-completion}
@@ -69,7 +69,7 @@ services:
       - GEN_AI_ENDPOINT=${GEN_AI_ENDPOINT:-}
       - GEN_AI_HOST_TYPE=${GEN_AI_HOST_TYPE:-}
       - POSTGRES_HOST=relational_db
-      - VESPA_HOST=index
+      - VESPA_HOST=document_index
       - API_BASE_OPENAI=${API_BASE_OPENAI:-}
       - API_TYPE_OPENAI=${API_TYPE_OPENAI:-}
       - API_VERSION_OPENAI=${API_VERSION_OPENAI:-}
@@ -129,7 +129,7 @@ services:
       - "5432:5432"
     volumes:
       - db_volume:/var/lib/postgresql/data
-  index:
+  document_index:
     image: vespaengine/vespa:8
     restart: always
     ports:


@@ -11,14 +11,14 @@ services:
       uvicorn danswer.main:app --host 0.0.0.0 --port 8080"
     depends_on:
       - relational_db
-      - index
+      - document_index
     restart: always
     env_file:
       - .env
     environment:
       - AUTH_TYPE=${AUTH_TYPE:-google_oauth}
       - POSTGRES_HOST=relational_db
-      - VESPA_HOST=index
+      - VESPA_HOST=document_index
     volumes:
       - local_dynamic_storage:/home/storage
       - file_connector_tmp_storage:/home/file_connector_storage
@@ -33,14 +33,14 @@ services:
     command: /usr/bin/supervisord
     depends_on:
       - relational_db
-      - index
+      - document_index
     restart: always
     env_file:
       - .env
     environment:
       - AUTH_TYPE=${AUTH_TYPE:-google_oauth}
       - POSTGRES_HOST=relational_db
-      - VESPA_HOST=index
+      - VESPA_HOST=document_index
     volumes:
       - local_dynamic_storage:/home/storage
       - file_connector_tmp_storage:/home/file_connector_storage
@@ -69,7 +69,7 @@ services:
       - .env
     volumes:
       - db_volume:/var/lib/postgresql/data
-  index:
+  document_index:
     image: vespaengine/vespa:8
     restart: always
     ports:
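Renaming the compose service from `index` to `document_index` only works if the backend's default for `VESPA_HOST` agrees, since compose service names double as network hostnames. A sketch of the usual env-with-default lookup (illustrative; the project's actual config module and the `VESPA_PORT` default are assumptions):

```python
import os

# The default mirrors the compose service name; deployments can override it
# via the VESPA_HOST environment variable, as the compose files above do.
VESPA_HOST = os.environ.get("VESPA_HOST", "document_index")
VESPA_PORT = int(os.environ.get("VESPA_PORT", "8081"))  # hypothetical default

print(f"{VESPA_HOST}:{VESPA_PORT}")
```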


@@ -38,7 +38,6 @@ SESSION_EXPIRE_TIME_SECONDS=86400
 # The following are for configuring User Authentication, supported flows are:
 # disabled
-# simple (email/password + user account creation in Danswer)
 # google_oauth (login with google/gmail account)
 # oidc (only in Danswer enterprise edition)
 # saml (only in Danswer enterprise edition)
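With the `simple` flow removed from the template above, a hypothetical validation helper (not from the codebase) illustrates how the remaining `AUTH_TYPE` values might be checked at startup:

```python
# Values remaining in the env template after "simple" was removed.
SUPPORTED_AUTH_TYPES = {"disabled", "google_oauth", "oidc", "saml"}


def check_auth_type(value: str) -> str:
    # Fail fast on typos or removed flows such as AUTH_TYPE=simple.
    if value not in SUPPORTED_AUTH_TYPES:
        raise ValueError(f"Unsupported AUTH_TYPE: {value!r}")
    return value


print(check_auth_type("disabled"))  # disabled
```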