import React, {useContext} from 'react'
import {
    Space,
    Alert,
    Typography,
} from 'antd';
import SyntaxHighlighter from "react-syntax-highlighter";
import GlobalContext from "contexts/GlobalContext";


const {Paragraph, Link, Title, Text} = Typography;


const HFAssign = () => {
    const {codeStyle, VilmedicTag} = useContext(GlobalContext);

    return (
        <>
            <Space direction={"vertical"}>
                <div style={{width: 900}}>
                    <center>
                        <Paragraph>
                            <Title level={3}><img style={{"width": 14}} src={"/images/hf.png"} alt={"hf"}/> Founding
                                Decentralized ML
                                Research Engineer Take-Home Exercise</Title>
                        </Paragraph>
                    </center>
                    <Title level={4}>Project #1: Seamless restricted dataset access through HuggingFace</Title>
                    <Paragraph>
                        Machine learning research is held back by datasets that are hard to access, often because
                        they sit behind Data Use Agreements (DUAs) such as those on PhysioNet or specific university
                        platforms. Researchers can manually download and preprocess these datasets following the
                        specifications given in academic papers, but this is error-prone and undermines transparent,
                        replicable machine learning research. Open-source platforms like HuggingFace currently
                        cannot integrate such datasets seamlessly. A promising project would therefore build a
                        system within HuggingFace where datasets come with built-in credential checks, so that only
                        authorized users can access the data while the platform keeps its commitment to open source
                        and easy accessibility.
                    </Paragraph>
                    <Title level={4}>Project #2: Extending lm-evaluation-harness for multimodal models on
                        HuggingFace Hub</Title>
                    <Paragraph>
                        The lm-evaluation-harness project is a comprehensive framework designed to assess generative
                        language models across numerous tasks. Currently compatible with over 200 tasks and several
                        model platforms, it notably supports models from the HuggingFace Hub, a leading repository for
                        state-of-the-art machine learning models. With the emergence of multimodal models like
                        Med-PaLM M, BiomedCLIP, and MedVInT, there is a pressing need to extend this harness to
                        evaluate such models. Given HuggingFace's pivotal role in the machine learning community,
                        enabling the harness to test HuggingFace Hub models in a multimodal context would advance
                        standardized evaluation in this rapidly evolving field.
                    </Paragraph>
                    <Title level={4}>MVP Proposal</Title>

                    <Paragraph>
                        For this assignment proposal, we will pick one small contribution for each proposed project.
                    </Paragraph>
                    <Paragraph>
                        <ul>
                            <li>
                                Develop a prototype that allows users to download the mimic-cxr dataset from PhysioNet
                                directly via HuggingFace, provided they input valid PhysioNet credentials.
                            </li>
                        </ul>
                        <blockquote>The mimic-cxr dataset was chosen specifically to address the Radiology Report
                            Generation task from Med-PaLM M, where models are tasked with generating the
                            "impression" section of a radiology report from an associated chest X-ray.
                        </blockquote>
                        <ul>
                            <li>
                                Modify the lm-evaluation-harness accordingly.
                            </li>
                        </ul>
                        <blockquote>Facilitate access to datasets such as mimic-cxr that are under restricted access
                            but hosted on HuggingFace, and add the ability to evaluate a
                            VisionEncoderDecoderModel from the HuggingFace library in the evaluation harness.
                        </blockquote>
                    </Paragraph>

                    <Title level={4}>MVP 1: HuggingFace dataset with PhysioNet credentials</Title>
                    <Alert
                        message="Informational Notes"
                        description={<>
                            For this section, I have created a dummy PhysioNet account with MIMIC credentials. Feel
                            free to use it to test the code.<br/>
                            Username: doom3jbd<br/>
                            Password: awec123456po
                        </>}
                        type="info"
                        showIcon
                    />
                    <br/>
                    <Paragraph>
                        When a DUA has been approved on PhysioNet, the user can download the linked resource with
                        wget.
                        Here is an example with mimic-cxr:
                    </Paragraph>


                    <SyntaxHighlighter customStyle={{textAlign: "left"}} language="bash" style={codeStyle}>
                        wget -r -N -c -np --user doom3jbd --ask-password
                        https://physionet.org/files/mimic-cxr-jpg/2.0.0/
                    </SyntaxHighlighter>
                    <Paragraph>
                        We can reproduce this command in Python so that, given credentials, we can check whether a
                        user has access to the given resource:
                    </Paragraph>
                    <SyntaxHighlighter customStyle={{textAlign: "left"}} language="python" style={codeStyle}>
                        {
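                            `import subprocess


def check_physionet(user, password, url):
    """Probe url with wget. Returns 0 (OK), 1 (unauthorized), 2 (forbidden) or 3 (unknown)."""
    # Pass the arguments as a list (no shell) so credentials cannot inject shell commands.
    process = subprocess.run(
        ["wget", "--user", user, "--password", password, "--spider", url],
        capture_output=True,
        text=True,
    )
    out = process.stdout.strip() + " " + process.stderr
    # Test for success first: wget may log an initial 401 challenge before authenticating.
    if "200 OK" in out:
        return 0
    elif "403 Forbidden" in out:
        return 2
    elif "401 Unauthorized" in out:
        return 1
    else:
        # Unrecognized output: never grant access by default.
        return 3`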
                        }
                    </SyntaxHighlighter>

                    <Paragraph>From this, let's build a quick API endpoint that returns a cryptographic key. This
                        key will be used to access the HF dataset.</Paragraph>

                    <SyntaxHighlighter customStyle={{textAlign: "left"}} language="python" style={codeStyle}>
                        {
                            `from flask import request

MAPPING = {"report_generation_mimic_cxr": "https://physionet.org/files/mimic-cxr-jpg/2.0.0"}


# download_res is assumed to be a Flask Blueprint defined elsewhere in the backend.
@download_res.route('/download/physionet_key', methods=['POST'])
def download_physionet_key():
    req = request.get_json(force=True)

    # Probe PhysioNet with the caller's credentials
    auth_response = check_physionet(req['username'], req['password'], url=MAPPING[req['resource_name']])

    if auth_response == 0:
        return {"key": "0UzUIWFfk7MhvdE6GxpaMbXTfUc-WrzbmzRvwbxLKew="}

    if auth_response == 2:
        return {"error": "403 Forbidden for resource"}

    if auth_response == 1:
        return {"error": "401 Unauthorized, Username/Password Authentication Failed."}

    # Any other code means the probe output was not recognized
    return {"error": "Something is wrong in the system"}`
                        }
                    </SyntaxHighlighter>
                    <Paragraph>
                        We can test it:
                    </Paragraph>
                    <SyntaxHighlighter customStyle={{textAlign: "left"}} language="bash" style={codeStyle}>
                        {
                            `curl -X POST https://vilmedic-back.dev/download/physionet_key \\
     -H "Content-Type: application/json" \\
     -d '{"username": "doom3jbd", "password": "awec123456po", "resource_name": "report_generation_mimic_cxr"}'
>> {"key":"0UzUIWFfk7MhvdE6GxpaMbXTfUc-WrzbmzRvwbxLKew="}`
                        }
                    </SyntaxHighlighter>


                    <Paragraph>
                        Thanks to the key, we can decrypt the real dataset name. Ideally, this mechanism should be
                        integrated directly into HuggingFace's dataset download process rather than used to hide
                        the dataset name, as done in this MVP.
                    </Paragraph>
                    <SyntaxHighlighter customStyle={{textAlign: "left"}} language="python" style={codeStyle}>
                        {`import requests
import datasets
from cryptography.fernet import Fernet

huggingface_dataset = "gAAAAABlDoSbc76gYwQ4zgDHiY7uxoMLjMwsfN4FBZK0BH2f8-Ulr6Rh5ecaVrsUBiKy1NgsIs-0A-JzhBcLbpNPDjl1QG8xahtudSxE0efAjIB1EKJtpJI="

url = "https://vilmedic-back.dev/download/physionet_key"
data = {
    "username": "doom3jbd",
    "password": "awec123456po",
    "resource_name": "report_generation_mimic_cxr",
}
response = requests.post(url, json=data).json()

assert "key" in response, response["error"]

cipher_suite = Fernet(response["key"].encode())
dataset_path = cipher_suite.decrypt(huggingface_dataset.encode()).decode()
dataset = datasets.load_dataset(path=dataset_path)
print(dataset_path)
print(dataset)
>> JB/mimic-cxr-rrg
>> DatasetDict({
    test: Dataset({
        features: ['id', 'image', 'impression'],
        num_rows: 100
    })
})
`}
                    </SyntaxHighlighter>

                    <Title level={4}>MVP 2: Using the dataset with a VisionEncoderDecoder model on the Evaluation
                        Harness</Title>
                    <br/>
                    <Alert
                        message="Informational Notes"
                        description={<>
                            The full pipeline can be run here: <Link
                            href={"https://github.com/jbdel/HF-multimodal-harness"}>https://github.com/jbdel/HF-multimodal-harness</Link>.
                        </>}
                        type="info"
                        showIcon
                    />
                    <br/>

                    <Paragraph>
                        These modifications are provided on a fork of the big-refactor branch of
                        EleutherAI/lm-evaluation-harness, the most up-to-date code, which supports multi-GPU
                        evaluation with Hugging Face accelerate. We first need to create a new yaml file for the
                        task. We add the new <b>dataset_security</b> parameters and change
                        the <b>doc_to_text</b> to be just the image. Previously it would have been the prompt of the
                        task.

                    </Paragraph>
                    <Alert
                        message="Informational Notes"
                        description={<>
                            This example works only for Image2Seq models. Multimodal LLMs would most likely require
                            both an input image and a prompt, as they are instruction-tuned. The doc_to_text would
                            then look something like this:<br/>
                            <SyntaxHighlighter language="bash" style={codeStyle}>
                                {"{{image}} Generate the impression of this image:"}
                            </SyntaxHighlighter>
                            and would need to be processed accordingly.
                        </>}
                        type="warning"
                        showIcon
                    />
                    <br/>
                    <SyntaxHighlighter customStyle={{textAlign: "left"}} language="yaml" style={codeStyle}>
                        {`task: report_generation_mimic_cxr
dataset_path: gAAAAABlDoSbc76gYwQ4zgDHiY7uxoMLjMwsfN4FBZK0BH2f8-Ulr6Rh5ecaVrsUBiKy1NgsIs-0A-JzhBcLbpNPDjl1QG8xahtudSxE0efAjIB1EKJtpJI=
dataset_kwargs:
  dataset_security:
    security: physionet
    username: doom3jbd
    password: awec123456po
output_type: greedy_until
test_split: test
doc_to_text: "{{image}}"
doc_to_target: "{{impression}}"
metric_list:
  - metric: !function metrics.rougeL`}
                    </SyntaxHighlighter>
                    <Paragraph>
                        The code to download the dataset in <Text code>lm_eval/api/task.py</Text> can be easily changed
                        as shown in the previous MVP.

                    </Paragraph>

                    <SyntaxHighlighter customStyle={{textAlign: "left"}} language="python" style={codeStyle}>
                        {`def download(self, dataset_kwargs=None) -> None:
    # Copy the kwargs so a missing (None) argument works and the caller's dict is not mutated.
    dataset_kwargs = dict(dataset_kwargs or {})
    dataset_security = dataset_kwargs.pop("dataset_security", None)

    if dataset_security is not None:
        if dataset_security["security"] == "physionet":
            url = "https://vilmedic-back.dev/download/physionet_key"
            data = {
                "username": dataset_security["username"],
                "password": dataset_security["password"],
                "resource_name": self.config.task,
            }
            # Exchange the PhysioNet credentials for the decryption key.
            response = requests.post(url, json=data).json()
            assert "key" in response, response["error"]

            # response = {"key": "0UzUIWFfk7MhvdE6GxpaMbXTfUc-WrzbmzRvwbxLKew="}
            cipher_suite = Fernet(response["key"].encode())
            self.DATASET_PATH = cipher_suite.decrypt(self.DATASET_PATH.encode()).decode()
        else:
            raise NotImplementedError()

    self.dataset = datasets.load_dataset(
        path=self.DATASET_PATH,
        name=self.DATASET_NAME,
        **dataset_kwargs,
    )`}
                    </SyntaxHighlighter>
                    <Paragraph>
                        Several minor, mechanical edits were needed to make the harness work with these
                        specifications.<br/>
                        Most importantly, we added a new model-class case,{" "}
                        <Text code>self.AUTO_MODEL_CLASS == transformers.AutoModelForVision2Seq</Text>, before the
                        call to <Text code>generate()</Text>. For this assignment, only the
                        VisionEncoderDecoderModel is handled, for which <Text code>generate()</Text> requires just
                        the image as input. This code could easily be extended to LLM-type multimodal models, which
                        first need to run the vision encoder and then encode the prompt as the starting decoder
                        input ids.
                    </Paragraph>

                    <SyntaxHighlighter customStyle={{textAlign: "left"}} language="python" style={codeStyle}>
                        {`if self.AUTO_MODEL_CLASS == transformers.AutoModelForCausalLM:
    max_ctx_len = self.max_length - max_gen_toks
elif self.AUTO_MODEL_CLASS == transformers.AutoModelForSeq2SeqLM:
    max_ctx_len = self.max_length
elif self.AUTO_MODEL_CLASS == transformers.AutoModelForVision2Seq:
    # Only VisionEncoderDecoderModel is supported for now.
    if isinstance(self.model, VisionEncoderDecoderModel):
        image_processor = ViTImageProcessor.from_pretrained(self.pretrained)
    else:
        raise NotImplementedError()
    # Here the context is a key into image_mapping rather than a text prompt.
    image = image_mapping[contexts[0]]
    context_enc = image_processor(image, return_tensors="pt").pixel_values.to(self.device)
    # There are no prompt tokens, so the length budget is the generation budget alone.
    # Set max_length directly: also adding it to generation_kwargs would pass it to generate() twice.
    max_length = max_gen_toks

# perform batched generation
cont = self.model.generate(
    context_enc,
    max_length=max_length,
    stopping_criteria=stopping_criteria,
    pad_token_id=self.eot_token_id,
    use_cache=True,
    **generation_kwargs,
)`}
                    </SyntaxHighlighter>

                    <Paragraph>
                        Finally, we can run the evaluation as follows.<br/>
                        <Text code>JB/HF_RRG_harness</Text> is a dummy VisionEncoderDecoderModel, created for this
                        assignment and trained for one epoch on mimic-cxr.
                    </Paragraph>

                    <SyntaxHighlighter customStyle={{textAlign: "left"}} language="bash" style={codeStyle}>
                        {`python main.py \\
    --model hf \\
    --model_args pretrained=JB/HF_RRG_harness \\
    --tasks report_generation_mimic_cxr \\
    --device cuda:0 \\
    --limit 5

>>
|           Tasks           |Version|Filter|Metric|Value |   |Stderr|
|---------------------------|-------|------|------|-----:|---|------|
|report_generation_mimic_cxr|Yaml   |none  |rougeL|0.1330|   |      |
|                           |       |none  |rouge1|0.1462|   |      |
|                           |       |none  |rouge2|0.0428|   |      |

`}
                    </SyntaxHighlighter>

                    <Paragraph>
                        This concludes the assignment. I believe these projects align perfectly with the role of a
                        Decentralized Machine Learning Research
                        Engineer. Project #1 addresses a major accessibility pain point in machine learning research,
                        offering a collaborative solution that promotes transparency and replicability. Project #2
                        focuses on the expansion of evaluation tools for emerging multimodal models, fostering
                        standardization and aiding collaboration with both industry and academic partners. Both
                        initiatives highlight the potential for impactful research and direct engagement with the
                        open-source community.
                    </Paragraph>

                </div>

            </Space>
        </>
    );
};

export default HFAssign